2 Data Literacy Essentials of Azure Data Factory
ETL: transform before loading. Designed for reliability: data is fully processed before it is loaded, so it can be easily queried.
ELT: transform after loading (inside the data warehouse). Designed for agility: engines such as Azure Synapse offer fast processing and cheap storage.
There are two versions of Azure Data Factory. Version 1 mostly works via configuration JSON files.
ADF v2 is the current and improved version and is far easier to work with; there is no longer any need to work with v1.
If there is a question about whether Azure Data Factory is integrated with another system, the answer is most likely to be YES.
Activity: represents each one of the steps performed on the data, which could be a copy, move, transform, enrich, etc.
Pipeline: the logical grouping of all the activities needed to perform a unit of work ("assembling the toy," in the conveyor-belt analogy). Pipelines are the actual work you do in Data Factory. They can be created graphically on the Data Factory site or programmatically through code.
Integration runtime: the computing infrastructure of Azure Data Factory, like the engine of the conveyor belt.
Trigger: the power switch; it represents when you want the pipeline to execute, for example at a specific time.
Linked service: like the tray; it tells you where your data is, e.g. Blob Storage, Data Lake, or a SQL database.
Dataset: the representation of the data that we're working with.
Relationships between elements
Activities and Pipelines
A pipeline allows a group of activities to run as a set.
Data movement (the Copy/Delete activities in ADF): ADF can work with 87 different data stores at the time of writing. Those range from obvious ones, such as SQL Server, all the way to more exotic options, such as QuickBooks or Concur. You can also use generic connectors such as ODBC or HTTP in case there is no connector specific to the system you are working with.
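As a rough sketch of how a data-movement step looks when authored as JSON (all the names here, such as CopyFromBlobToSql, InputFiles, and SqlTable, are placeholder examples, not from the course):

```json
{
  "name": "CopyPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyFromBlobToSql",
        "type": "Copy",
        "inputs":  [ { "referenceName": "InputFiles", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "SqlTable", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "sink":   { "type": "SqlSink" }
        }
      }
    ]
  }
}
```

Note that the Copy activity only references datasets by name; the datasets in turn reference linked services, so the activity itself never carries connection details.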
Data transformation: the Mapping Data Flow activity is one option for transforming your data. If you would like to call an external service instead, there are 13 different options, including HDInsight, which can run Hive, Hadoop Streaming, Spark, Pig, and MapReduce jobs.
Data control: defines the logic that your pipeline is going to follow, with options such as ForEach, Set Variable, and Wait. It also allows you to invoke other pipelines or execute SSIS packages.
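A minimal sketch of a control-flow pipeline in ADF's JSON form, assuming a hypothetical array parameter named fileNames (all names are illustrative placeholders): a ForEach loops over the items, and a Wait pauses inside each iteration.

```json
{
  "name": "ControlPipeline",
  "properties": {
    "parameters": {
      "fileNames": { "type": "Array" }
    },
    "activities": [
      {
        "name": "LoopOverFiles",
        "type": "ForEach",
        "typeProperties": {
          "items": { "value": "@pipeline().parameters.fileNames", "type": "Expression" },
          "activities": [
            {
              "name": "PauseBetweenFiles",
              "type": "Wait",
              "typeProperties": { "waitTimeInSeconds": 30 }
            }
          ]
        }
      }
    ]
  }
}
```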
Integration Runtimes
In fact, this is what you are actually paying for in Azure Data Factory.
An integration runtime provides 4 main capabilities: Data Flow execution, data movement execution, dispatch of activities, and SSIS package execution.
1. Azure IR (AutoResolve Integration Runtime): for data movement between public endpoints.
a. For instance, Amazon (AWS), GCP, or Azure itself
b. Software-as-a-service resources such as Salesforce and SAP
c. Mapping Data Flows on Azure
2. Self-hosted IR: for connecting to private and on-premises resources, via a regular outbound HTTPS connection.
3. Azure-SSIS IR: exclusively for executing SSIS packages. It is useful if you have done a lot of work in SSIS in the past and want to migrate all of that to Azure Data Factory. (Alternatively, if you create a SQL Server virtual machine, you can keep running your SSIS packages there.)
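For reference, an integration runtime is itself declared as a small JSON resource. A minimal sketch of a self-hosted IR (the name is a placeholder):

```json
{
  "name": "MySelfHostedIR",
  "properties": {
    "type": "SelfHosted",
    "description": "Runs on a machine inside the private network"
  }
}
```

After creating the definition, you install the self-hosted IR software on a machine inside the private network and register that node with an authentication key.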
For example, if the linked service points to a SQL database, the datasets are tables; if the linked service is Blob Storage, the datasets are files.
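That relationship can be sketched in JSON, assuming a hypothetical Blob Storage linked service named MyBlobStorage (all names and paths are placeholders): the dataset describes a specific file and references the linked service for the connection.

```json
{
  "name": "InputFiles",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": {
      "referenceName": "MyBlobStorage",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "input",
        "fileName": "sales.csv"
      }
    }
  }
}
```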
Triggers
There are 3 types of trigger: schedule, tumbling window, and event-based.
SSIS doesn't have an event-based trigger. Azure Data Factory also has an option to connect with Git.
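A sketch of the first kind, a schedule trigger that runs a pipeline once a day (the trigger name, pipeline name, and start time are all placeholders):

```json
{
  "name": "DailyTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2024-01-01T06:00:00Z"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "MyPipeline",
          "type": "PipelineReference"
        }
      }
    ]
  }
}
```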