Code360 powered by Coding Ninjas X Naukri.com
Last Updated: Jun 14, 2024

Azure Data Factory


Introduction

Welcome readers! We hope you are doing well.

Have you ever tried to learn Azure Data Factory but, due to some circumstances, could not make it? Don’t worry, Coding Ninjas is here to help you out.

In this article, we will discuss Azure Data Factory with a proper explanation. It will give you a solid understanding of Azure Data Factory, so follow along till the end.

Azure Data Factory Classification

 

So, without wasting more time, let’s start our discussion.

What is Azure Data Factory

Azure Data Factory is a cloud-based, serverless and fully managed data integration service offered by Microsoft Azure to enable data integration from many different sources. It allows us to create and schedule data-driven workflows (also known as pipelines) in the cloud. Azure Data Factory is perfect for building hybrid ETL (extract-transform-load), ELT (extract-load-transform) and data integration pipelines in the cloud.


How does Azure Data Factory Work?

Azure Data Factory (ADF) works in four steps:

  • Connect & Collect: Organisations have various types of data stored in different locations, such as on-premises and in the cloud, all arriving at different intervals and speeds. The first step is connecting to all the data and processing sources, such as databases, Software as a Service (SaaS) services and file shares, and then moving the data to a centralised location for further processing.
     
  • Transform: After moving the data to a centralised data store in the cloud, transform the collected data using ADF mapping data flows or compute services such as HDInsight Hadoop, Spark, Machine Learning and Data Lake Analytics.
     
  • Publish: After refining the raw data into a business-ready, consumable form, send it to a destination data store such as Azure Synapse Analytics (formerly Azure SQL Data Warehouse), Azure SQL Database, Azure Cosmos DB or an analytics engine.
     
  • Monitor: Azure Data Factory (ADF) has built-in support for monitoring data integration pipelines on the Azure portal via Azure Monitor, Azure Monitor logs, health panels, API and PowerShell.
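The four steps above can be pictured as a minimal, purely illustrative Python flow. Note that these function and field names are our own invention for the sketch, not an ADF API:

```python
# Illustrative sketch of ADF's four steps; not a real ADF API.
def connect_and_collect(sources):
    """Step 1: gather raw records from every source into one central store."""
    central_store = []
    for name, records in sources.items():
        central_store.extend({"source": name, **r} for r in records)
    return central_store

def transform(central_store):
    """Step 2: refine raw data into a business-ready form (here: drop bad rows)."""
    return [r for r in central_store if r.get("amount", 0) > 0]

def publish(refined, destination):
    """Step 3: send the refined data to a destination data store."""
    destination.extend(refined)
    return len(refined)

def monitor(published_count):
    """Step 4: report pipeline health (ADF does this via Azure Monitor)."""
    return f"pipeline OK: {published_count} rows published"

# Toy sources arriving from different locations, as described above.
sources = {
    "on_prem_db": [{"amount": 10}, {"amount": -1}],  # one bad record
    "saas_app": [{"amount": 5}],
}
warehouse = []  # stands in for the destination data store
raw = connect_and_collect(sources)
refined = transform(raw)
status = monitor(publish(refined, warehouse))
```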
Pipeline of Azure Data Factory

Key Components 

Following are the key components of the Azure Data Factory:

  • Pipelines
  • Activities
  • Datasets
  • Linked Services
  • Triggers
  • Data Flows
  • Integration Runtimes
     

These components work together to perform data-driven workflows that move and transform data through the steps mentioned earlier.

 

Key Components of Azure Data Factory

Let’s now discuss each of these components separately.

Pipelines

Pipelines are logical groupings of activities that together perform a unit of work. An Azure Data Factory instance can have one or more pipelines, and the activities in a pipeline together perform a task.

For example, a pipeline can contain a group of activities that ingests data from Azure Blob storage and then runs a Hive query on an HDInsight cluster to partition the data.

The main benefit of pipelines is that they let you manage a group of activities together instead of managing each one individually. You can chain the activities in a pipeline to run sequentially, or they can run in parallel.
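As a rough sketch, the blob-to-Hive example above could be defined like this. ADF pipelines are defined in JSON, shown here as a Python dict; the activity names are hypothetical, and `dependsOn` is what chains the second activity after the first so they run sequentially:

```python
# Sketch of an ADF pipeline definition (JSON expressed as a Python dict).
# Activity names like "IngestFromBlob" are hypothetical; the overall
# shape follows ADF's pipeline JSON format.
pipeline = {
    "name": "IngestAndPartitionPipeline",
    "properties": {
        "activities": [
            {
                "name": "IngestFromBlob",
                "type": "Copy",  # a data movement activity
            },
            {
                "name": "PartitionWithHive",
                "type": "HDInsightHive",  # a data transformation activity
                # dependsOn chains this activity after the copy succeeds,
                # so the two run sequentially rather than in parallel.
                "dependsOn": [
                    {"activity": "IngestFromBlob",
                     "dependencyConditions": ["Succeeded"]}
                ],
            },
        ]
    },
}

activity_names = [a["name"] for a in pipeline["properties"]["activities"]]
```

Dropping the `dependsOn` entry would let both activities start in parallel instead.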

Activities

An activity denotes a processing step in a pipeline. For example, a copy activity copies data from one data store to another, and a Hive activity runs a Hive query on an Azure HDInsight cluster to transform the data.

Azure Data Factory (ADF) supports three types of activities:

  • Data movement activities
  • Data transformation activities
  • Control activities
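For illustration, a copy activity (a data movement activity) in ADF's JSON format references an input and an output dataset. The dataset names below are hypothetical:

```python
# Sketch of a Copy activity definition (data movement). The dataset
# names are placeholders; the structure follows ADF's activity JSON.
copy_activity = {
    "name": "CopyBlobToSql",
    "type": "Copy",
    # Input and output are references to datasets defined elsewhere.
    "inputs": [{"referenceName": "BlobInputDataset",
                "type": "DatasetReference"}],
    "outputs": [{"referenceName": "SqlOutputDataset",
                 "type": "DatasetReference"}],
    "typeProperties": {
        "source": {"type": "DelimitedTextSource"},  # read CSV from blob
        "sink": {"type": "AzureSqlSink"},           # write to Azure SQL
    },
}
```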

Datasets

A dataset represents the data structure within a data store. It simply references the data that you want to use in your activities as inputs or outputs.
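A minimal sketch of a dataset definition, expressed as a Python dict in ADF's JSON shape: note that it only points at the data (a CSV file in a blob container here) rather than holding it. The names are hypothetical:

```python
# Sketch of a dataset definition: it *references* data in a store
# (via a linked service) instead of containing it. Names are placeholders.
dataset = {
    "name": "BlobInputDataset",
    "properties": {
        "type": "DelimitedText",  # a CSV-style dataset
        "linkedServiceName": {
            "referenceName": "BlobStorageLinkedService",
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "input",
                "fileName": "sales.csv",
            }
        },
    },
}
```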

Linked Services

Linked services define the connection information that Data Factory needs to connect to external resources. They are much like connection strings.

The linked services in the Data Factory are mainly used for two purposes:

  • To represent a data store.
  • To represent a compute resource for hosting the execution of an activity.
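A sketch of a linked service for the first purpose, representing a data store. The structure follows ADF's linked service JSON; the name and connection string are placeholders (a real definition would typically reference Azure Key Vault rather than embed credentials):

```python
# Sketch of a linked service: like a connection string, it tells Data
# Factory how to reach an external resource. All values are placeholders.
linked_service = {
    "name": "BlobStorageLinkedService",
    "properties": {
        "type": "AzureBlobStorage",  # this one represents a data store
        "typeProperties": {
            # Placeholder only; real pipelines should use Key Vault.
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
        },
    },
}
```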

Triggers

In Azure Data Factory, triggers represent the unit of processing that determines when a pipeline execution needs to be kicked off.

There are three different types of triggers in Azure Data Factory:

  • Schedule Trigger: This type of trigger invokes a pipeline at a specific time and frequency.
  • Tumbling Window Trigger: This type of trigger fires on a periodic interval while retaining state about past windows.
  • Event-based Trigger: This type of trigger invokes a pipeline in response to a blob-related event, such as a file arriving in or being deleted from Azure Blob Storage.
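As a rough sketch, a schedule trigger that kicks off a pipeline once a day could be defined like this (the names are hypothetical; the shape follows ADF's trigger JSON):

```python
# Sketch of an ADF schedule trigger definition (JSON as a Python dict).
trigger = {
    "name": "DailyTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",  # run daily...
                "interval": 1,       # ...every 1 day
                "startTime": "2024-07-01T01:30:00Z",
            }
        },
        # The pipelines this trigger kicks off.
        "pipelines": [
            {"pipelineReference": {
                "referenceName": "IngestAndPartitionPipeline",
                "type": "PipelineReference"}}
        ],
    },
}
```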

Data Flows

Data Flows are special activities that let you develop data transformation logic visually, without writing code. Using the visual editor, you can transform data in multiple steps. Data flows are executed inside ADF pipelines on scaled-out Apache Spark clusters that ADF manages; ADF controls all the data flow execution and code translation.
 

Data Flows

 

  • Mapping data flows: In Azure Data Factory (ADF), mapping data flows are visually designed data transformations. They are used to create and manage graphs of data transformation logic that can transform data of any size without writing code, and you can build and execute a reusable library of data transformation routines from your ADF pipelines.
     
  • Control flow: The orchestration of pipeline activities. It includes chaining activities in a sequence, branching, defining parameters at the pipeline level and passing arguments while invoking the pipeline on demand. It also includes custom state passing and looping containers (i.e. ForEach iterators).
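Control flow's parameters and looping containers can be sketched together: a pipeline that declares an array parameter and a ForEach container that runs an inner copy once per item. The names are hypothetical; `@pipeline().parameters.regions` is ADF's expression syntax for reading a pipeline parameter:

```python
# Sketch of control flow: a pipeline-level parameter plus a ForEach
# looping container. Names are placeholders; the shape follows ADF JSON.
pipeline = {
    "name": "CopyPerRegionPipeline",
    "properties": {
        # Parameters are defined at the pipeline level and can be
        # overridden with arguments when the pipeline is invoked.
        "parameters": {
            "regions": {"type": "Array", "defaultValue": ["eu", "us"]}
        },
        "activities": [
            {
                "name": "LoopOverRegions",
                "type": "ForEach",  # looping container
                "typeProperties": {
                    # ADF expression resolving to the parameter's value.
                    "items": {"value": "@pipeline().parameters.regions",
                              "type": "Expression"},
                    # Inner activities run once per item.
                    "activities": [
                        {"name": "CopyOneRegion", "type": "Copy"}
                    ],
                },
            }
        ],
    },
}
```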

Integration Runtimes

Earlier, we discussed activities, which are nothing but actions to be performed, and linked services, which represent a target data store or a compute service.

An integration runtime provides the bridge between an activity and its linked services. It provides the computing environment where the activity either runs or from which it gets dispatched.

Benefits

Some of the benefits of the Azure Data Factory are mentioned below:

  • Azure Data Factory is a cloud-based solution that works with both on-premises and cloud-based data stores, giving a cost-effective and scalable solution.
     
  • You can easily migrate ETL workloads from on-premises data stores to the Azure cloud.
     
  • Azure Data Factory comes with built-in connectors for close to all common on-premises data sources, including MySQL, SQL Server and Oracle databases, which makes many activities effortless.
     
  • Azure Data Factory has mapping data flows to create and manage graphs of data transformation logic that can transform data of any size without writing code.
     
  • It can run SSIS (SQL Server Integration Services) packages in an Azure-SSIS integration runtime.
     
  • Azure Data Factory offers managed virtual networks to simplify your networking and protect against data exfiltration.

Frequently Asked Questions

What do you mean by Azure Data Factory?

Azure Data Factory is a cloud-based, serverless and fully managed data integration service offered by Microsoft Azure to enable data integration from many different sources.

Is Azure Data Factory an ETL tool?

Yes, Azure Data Factory is a cloud-based ETL and data integration service offered by Microsoft Azure to enable data integration from many different sources.

Is Azure Data Factory PaaS or SaaS?

Azure Data Factory is a Microsoft PaaS solution that enables data transformation and loading. It also supports data movement between on-premises and cloud data sources.

What is the main difference between the Azure Data Factory and SSIS?

The main difference between Azure Data Factory and SSIS (SQL Server Integration Services) is that SSIS is an on-premises tool, mostly suited to on-premises use cases, while Azure Data Factory (ADF) is a cloud-based tool typically suited to cloud-based use cases.

Conclusion

In this article, we have extensively discussed the Azure Data Factory.

We started with a brief introduction. Then we discussed the following:

  • What is Azure Data Factory
  • How Does Azure Data Factory Work
  • Key Components
  • Benefits
     

We hope this blog gave you a good idea of Azure Data Factory. If you want to learn more, follow our articles on ETL in Big Data, AWS vs Azure and Google Cloud, Azure Data Factory Interview Questions and Azure Data Engineers Interview Questions. Explore our practice platform Coding Ninjas Studio to practice top problems, attempt mock tests, read interview experiences, follow guided paths for placement preparation and much more!

Do upvote this blog to help other ninjas grow.

Happy Reading!
