Moving Big Data in and Around Azure Using Azure Data Factory

Authored by Siddharth Bhola

There are numerous data storage options available on Azure, each designed for a different modern data storage scenario. These options include databases, data warehouses, data caches, and data lakes. Which one to use depends on the application and the scale it serves. Among databases, some applications need a relational database, while others need a NoSQL store, a key-value store, an in-memory database (for caching), or blob storage (for media and large files). Another criterion to keep in mind when selecting a database for your application is the required read-write throughput and latency. Azure offers a wide array of fully managed database services, which frees development teams from spending valuable time managing, scaling, and configuring these databases.

Whichever database you choose, you should also keep in mind how easy or difficult it is to move data in and out of it. You might later need to move to a new database solution because of a change in application architecture, scale, performance, or even cost. Microsoft Azure has a very powerful ETL tool called Azure Data Factory for moving data in and around Azure at scale. It has over 80 native connectors, many of which can serve as both source and sink. In this post, I would like to highlight a few features and concepts of Azure Data Factory, which should serve as a quick start guide for anyone looking to do data movement and transformation on Azure.

 

Overview of Azure Data Factory

Azure Data Factory is a data integration service that lets you create data-driven workflows in the cloud for automating data movement and data transformation. Using the intuitive UI on Azure, we can create and schedule data pipelines that ingest and process data from various sources without writing a single line of code. Hybrid ETL pipelines can be built in the Data Factory visual environment with no code and no maintenance. It is a fully managed data integration tool that scales automatically on demand. Another great feature of Azure Data Factory is that we can also write our own code, in the programming language of our choice, to build and run pipelines. We can even insert custom code as a processing step in any pipeline to do advanced transformations. With Azure Monitor, we can continuously monitor and manage pipeline performance and set alerts, alongside our applications, from a single console.
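
To give a rough feel for the code-based route, here is a minimal sketch in Python that assembles a pipeline definition as a JSON document. It follows the general shape in which Data Factory describes a pipeline with one copy activity, but it is deliberately incomplete: the source and sink settings a real copy activity needs are left out, and the names CopyPipelineSketch, CopySourceToSink, SourceDataset, and SinkDataset are hypothetical placeholders. Treat it as an illustration to check against the official documentation, not as a deployable definition.

```python
import json

# Hypothetical pipeline definition with a single copy activity.
# All names below are placeholders invented for this sketch.
pipeline_definition = {
    "name": "CopyPipelineSketch",
    "properties": {
        "activities": [
            {
                "name": "CopySourceToSink",
                "type": "Copy",
                # Datasets are referenced by name; they are defined separately.
                "inputs": [
                    {"referenceName": "SourceDataset", "type": "DatasetReference"}
                ],
                "outputs": [
                    {"referenceName": "SinkDataset", "type": "DatasetReference"}
                ],
                # Source and sink settings are intentionally omitted here.
            }
        ]
    },
}

# Serialize the definition so it could be saved or submitted by other tooling.
document = json.dumps(pipeline_definition, indent=2)
print(document)
```

Whether you click a pipeline together in the UI or generate it from code like this, the result is the same kind of definition, which is what makes it practical to keep pipelines in source control.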

 

Key Components of Azure Data Factory

  1. Pipeline
    A pipeline is a logical grouping of activities that together perform a task. For example, a pipeline can contain a group of activities that ingest data from Azure Cosmos DB and then run a Hive query on an HDInsight cluster to partition the data. The activities in a pipeline can be chained together to run sequentially, or they can operate independently in parallel.
  2. Activity
    An activity is a processing step in a pipeline. For example, you might use a copy activity to copy data from one data store to another. Data Factory supports three types of activities: data movement, data transformation, and control.
  3. Datasets
    Datasets represent data structures within data stores. They point to the data you want to use in your activities as inputs or outputs.
  4. Linked Services
    Linked services are similar to connection strings: they hold the connection information needed to connect to external resources. For example, an Azure Storage linked service specifies a connection string for connecting to an Azure Storage account. A linked service can represent either a data store (such as SQL Server or Azure Blob storage) or a compute resource (such as HDInsight, to run Hive queries).
  5. Triggers
    Triggers determine when a pipeline execution needs to be kicked off, for example on a schedule or in response to an event. There are different types of triggers for different types of events.
  6. Parameters
    Parameters are key-value pairs of configuration that are defined in the pipeline. The arguments for the defined parameters are passed during execution from the run context that was created by a trigger or a pipeline that was executed manually. Activities within the pipeline consume the parameter values.
  7. Control Flow
    Control flow is the orchestration of pipeline activities. It includes chaining activities in a sequence, branching, defining parameters at the pipeline level, and passing arguments while invoking the pipeline on demand or from a trigger. It also includes custom-state passing and looping containers, that is, ForEach iterators.
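
To make the relationships between these components a little more concrete, here is a minimal sketch in plain Python. It is a toy model, not the Data Factory API: the classes LinkedService, Dataset, Activity, and Pipeline, the methods execution_stages and resolve_arguments, and all of the example names are hypothetical and exist only for this illustration. It shows how datasets point at linked services, how dependencies between activities decide what runs sequentially and what can run in parallel, and how arguments supplied at run time fill in declared parameters.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class LinkedService:
    """Connection information for a data store or compute resource."""
    name: str
    kind: str                # a label only, e.g. "CosmosDb"
    connection_string: str


@dataclass
class Dataset:
    """Points at data inside the store described by a linked service."""
    name: str
    linked_service: LinkedService
    location: str            # e.g. a container or folder name


@dataclass
class Activity:
    """One processing step; depends_on chains it after other activities."""
    name: str
    kind: str                # "movement", "transformation" or "control"
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)
    depends_on: list = field(default_factory=list)


@dataclass
class Pipeline:
    """Logical grouping of activities plus its declared parameters."""
    name: str
    activities: list
    parameters: dict = field(default_factory=dict)  # name -> default value

    def execution_stages(self):
        """Group activities into stages; members of a stage can run in parallel."""
        graph = {a.name: set(a.depends_on) for a in self.activities}
        sorter = TopologicalSorter(graph)
        sorter.prepare()
        stages = []
        while sorter.is_active():
            ready = sorted(sorter.get_ready())
            stages.append(ready)
            sorter.done(*ready)
        return stages

    def resolve_arguments(self, arguments):
        """Overlay arguments from a trigger or manual run on the defaults."""
        unknown = set(arguments) - set(self.parameters)
        if unknown:
            raise KeyError(f"undeclared parameters: {sorted(unknown)}")
        return {**self.parameters, **arguments}


# Example wiring; every name and value here is made up.
cosmos = LinkedService("CosmosLS", "CosmosDb", "<connection string>")
storage = LinkedService("StorageLS", "AzureBlobStorage", "<connection string>")
source = Dataset("Orders", cosmos, "orders")
staged = Dataset("StagedOrders", storage, "staging/orders")

pipeline = Pipeline(
    name="IngestAndPartition",
    parameters={"partitionKey": "region"},
    activities=[
        Activity("CopyOrders", "movement", inputs=[source], outputs=[staged]),
        Activity("CopyAuditLog", "movement"),
        Activity("PartitionWithHive", "transformation",
                 inputs=[staged], depends_on=["CopyOrders"]),
    ],
)

stages = pipeline.execution_stages()
run_args = pipeline.resolve_arguments({"partitionKey": "country"})
print(stages)    # [['CopyAuditLog', 'CopyOrders'], ['PartitionWithHive']]
print(run_args)  # {'partitionKey': 'country'}
```

In this toy run, the two copy activities have no dependencies, so they land in the same stage and could run in parallel, while the Hive step waits for CopyOrders. The real service handles scheduling, retries, and failure conditions for you; the sketch only mirrors the concepts.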

I hope that you now have a good idea of the capabilities and characteristics of Azure Data Factory. I recently started using Azure Data Factory to migrate data from Azure Cosmos DB to Azure Table storage and found it very useful, especially because it saves a lot of development time and effort. In my next post, I will publish a step-by-step guide to creating your first copy pipeline in Azure Data Factory and show you how to configure it to migrate data from Cosmos DB to Azure Table storage.