
2 Data Literacy: Essentials of Azure Data Factory

This course focuses on general concepts; implementation details are covered in the implementation-level courses.

1 What is Azure Data Factory


Data flows from sources such as LOB, CRM, ERP, social, and IoT datasets into a data warehouse using ETL.

Concept: ETL vs ELT

ETL: transform before loading. Designed for reliability: data is fully processed before loading and is easily queried.

ELT: transform after loading (inside the data warehouse). Designed for agility: Synapse provides fast processing and easy storage.

Azure Data Factory can do both ETL and ELT.
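The ETL-versus-ELT distinction is purely about ordering, which a toy sketch can make concrete. This is an illustration only: the "warehouse" below is just a Python list standing in for Synapse or a SQL database, and the transform is a trivial name cleanup.

```python
# Toy sketch contrasting ETL and ELT ordering (illustration only).
raw_rows = [" Alice ", " bob ", "CAROL"]

def transform(rows):
    """Normalise names: strip whitespace, title-case."""
    return [r.strip().title() for r in rows]

# ETL: transform first, then load -> the warehouse only ever sees clean data.
etl_warehouse = []
etl_warehouse.extend(transform(raw_rows))

# ELT: load the raw data first, transform later "inside" the warehouse.
elt_warehouse = []
elt_warehouse.extend(raw_rows)            # load as-is
elt_warehouse = transform(elt_warehouse)  # transform after loading

assert etl_warehouse == elt_warehouse == ["Alice", "Bob", "Carol"]
```

Both paths end with the same clean data; the difference is whether the raw rows ever touch the warehouse.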

2 Data Factory within Microsoft

There are two versions of Azure Data Factory. Version 1 mostly works via configuration JSON files.

ADF v2 is the current and much-improved version and is far easier to work with; there is no need to work with v1 anymore.

Build options: PowerShell, .NET, Python, REST, ARM
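As one illustration of the ARM route, a minimal ARM template resource for a factory might look like the sketch below (the resource name and location are placeholders, not values from the course):

```json
{
  "type": "Microsoft.DataFactory/factories",
  "apiVersion": "2018-06-01",
  "name": "my-data-factory",
  "location": "westeurope",
  "identity": { "type": "SystemAssigned" },
  "properties": {}
}
```

The same factory could equally be created with PowerShell, the .NET or Python SDKs, or a direct REST call.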

Highly integrated: DevOps, Key Vault, Monitor, Automation

If there is a question about whether Azure Data Factory is integrated with another system, the answer is most likely YES ***

No data storage of its own: you need to persist the data somewhere by the end of the pipeline.

Security standards: HTTPS/TLS connections whenever possible ***


3 Main Data Factory Elements

There are six main elements in Data Factory:

Activity: represents each of the steps performed on the data, such as a copy, move, transform, or enrich.

Pipeline: the logical grouping of all activities needed to perform a unit of work (to assemble the toy). Pipelines are the actual work you do in Data Factory. They can be created graphically on the Data Factory site or programmatically through code.

Integration runtime: They are the computing infrastructure of Azure Data Factory, like the engine of
the conveyor belt.

Trigger: the power switch; it represents when you want the pipeline to execute, for instance at a specific time.

Linked service: like the tray; it tells Data Factory where your data is, for example Blob Storage, Data Lake, or a SQL database.

Dataset: They are the representations of the data that we’re working with.
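A minimal pipeline definition in JSON (the format the Data Factory site generates behind the scenes) can show how these elements fit together. All names here are hypothetical, and the copy activity's source/sink types are one plausible combination, not a definitive recipe:

```json
{
  "name": "NightlyLoadPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopySalesData",
        "type": "Copy",
        "inputs": [
          { "referenceName": "SourceBlobDataset", "type": "DatasetReference" }
        ],
        "outputs": [
          { "referenceName": "SinkSqlDataset", "type": "DatasetReference" }
        ],
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "sink": { "type": "AzureSqlSink" }
        }
      }
    ]
  }
}
```

The pipeline groups the activities; each dataset reference points, through its linked service, at where the data actually lives.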

Element relationships

Activities and Pipelines
A pipeline allows a group of activities to run as a set.

There are three kinds of activities in Azure Data Factory.

Data movement (Copy/Delete in ADF): it can work with 87 different data stores at the time of the course. Those range from obvious ones such as SQL Server all the way to more exotic options such as QuickBooks or Concur. You can also use generic connectors such as ODBC or HTTP in case there is no connector specific to the system you are working with.

Data transformation: the Mapping Data Flow activity is one option to transform your data. If you would like to call an external service instead, you have 13 different options, including HDInsight, which can call Hive, Hadoop Streaming, Spark, Pig, and MapReduce.

Data control: defines the logic that your pipeline is going to follow, with options such as ForEach, Set Variable, and Wait. It also allows you to invoke other pipelines or execute SSIS packages.
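As a sketch of what a control activity looks like in the pipeline JSON, here is a hypothetical ForEach that invokes a child pipeline once per item in a parameter (all names and the parameter are illustrative assumptions):

```json
{
  "name": "ForEachRegion",
  "type": "ForEach",
  "typeProperties": {
    "isSequential": false,
    "items": {
      "value": "@pipeline().parameters.regionList",
      "type": "Expression"
    },
    "activities": [
      {
        "name": "RunRegionLoad",
        "type": "ExecutePipeline",
        "typeProperties": {
          "pipeline": {
            "referenceName": "LoadSingleRegion",
            "type": "PipelineReference"
          }
        }
      }
    ]
  }
}
```

This combines two of the control-flow features mentioned above: iteration (ForEach) and invoking another pipeline.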

Integration Runtimes
This is actually what you are paying for in Azure Data Factory.

The integration runtime provides four main capabilities: Data Flow execution, data movement execution, dispatch of activities, and SSIS package execution.

Integration Runtime Types:

1. Azure IR (AutoResolve Integration Runtime): for data movement between public endpoints.
   a. For instance, Amazon Web Services, GCP, or Azure itself
   b. Software-as-a-service resources such as Salesforce and SAP
   c. Mapping Data Flows on Azure
2. Self-Hosted IR: for connecting to private and on-premises resources, via a regular outbound HTTPS
   connection
3. Azure-SSIS IR: exclusively for executing SSIS packages. It is useful if you have done a lot of
   work in SSIS in the past and want to migrate all of it to Azure Data Factory. (The alternative
   would be running your SSIS packages on a SQL Server virtual machine.)

Choosing the right integration runtime

Linked Services and Datasets


Linked services: similar to connection strings. There are two types of linked services: data stores and external compute services.

Datasets are about the data structure.

For example, if the linked service points to a SQL database, the datasets are its tables; if the linked service is Blob Storage, the datasets are files.
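A sketch of the pairing in JSON may help: a linked service holds the connection, and a dataset references it while describing the data's shape and location. Every name, container, and file below is a hypothetical placeholder, and the connection string is deliberately left with `<account>`/`<key>` markers:

```json
{
  "name": "MyBlobStorageLinkedService",
  "properties": {
    "type": "AzureBlobStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
    }
  }
}
```

```json
{
  "name": "SalesCsvDataset",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": {
      "referenceName": "MyBlobStorageLinkedService",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "sales",
        "fileName": "sales.csv"
      }
    }
  }
}
```

The dataset never stores credentials itself; it only points at the linked service that knows where and how to connect.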

Triggers
There are 3 types of trigger:

Schedule: Focus on ON/AT (for instance on Sunday, at midnight)

Tumbling window: Focus on EVERY (every two hours)

Event-based: Fired based on an event (file arrival)
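As an example of the first kind, a schedule trigger that fires on Sunday at midnight might be sketched in JSON as follows (trigger and pipeline names, start time, and time zone are illustrative assumptions):

```json
{
  "name": "SundayMidnightTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Week",
        "interval": 1,
        "startTime": "2024-01-07T00:00:00Z",
        "timeZone": "UTC",
        "schedule": {
          "weekDays": [ "Sunday" ],
          "hours": [ 0 ],
          "minutes": [ 0 ]
        }
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "NightlyLoadPipeline",
          "type": "PipelineReference"
        }
      }
    ]
  }
}
```

The `schedule` block captures the ON/AT semantics (on Sunday, at midnight); a tumbling-window trigger would instead express EVERY via its own frequency and interval.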

SSIS doesn't have an event-based trigger. Azure Data Factory also has an option to connect with Git.
