2 Data Literacy Essentials of Azure Data Factory
ETL: transform before loading. Designed for reliability: data is fully processed before it is loaded, so it can be easily queried.
ELT: transform after loading (inside the data warehouse). Designed for agility: engines such as Azure Synapse offer fast processing and cheap storage.
There are two versions of Azure Data Factory. Version 1 mostly works via configuration JSON files.
ADF v2 is the current and improved version and is far easier to work with; there is no longer any need to work with v1.
If there is a question about whether Azure Data Factory is integrated with another system, the answer is most likely to be YES.
Activity: represents each one of the steps performed on the data, which could be a copy, move, transform, enrich, etc.
Pipeline: the logical grouping of all the activities needed to perform a unit of work ("assembling the toy," in the conveyor-belt analogy). Pipelines are the actual work you do in Data Factory. They can be created graphically on the Data Factory site or programmatically through code.
Integration runtime: the computing infrastructure of Azure Data Factory, like the engine of the conveyor belt.
Trigger: the power switch; it represents when you want the pipeline to execute, for example at a specific time.
Linked service: like the tray; it tells you where your data is, e.g. Blob Storage, Data Lake, or a SQL database.
Dataset: the representation of the data that we're working with.
Relationships between elements
Activities and Pipelines
A pipeline allows a group of activities to run as a set.
Data movement (the Copy/Delete activities in ADF): ADF can work with 87 different data stores at the time of writing. Those range from obvious ones, such as SQL Server, all the way to more exotic options, such as QuickBooks or Concur. You can also use generic connectors such as ODBC or HTTP in case there is no connector specific to the system you are working with.
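As a rough sketch of how a data-movement step looks when authored as JSON (all the names here, such as CopyFromBlobToSql, InputFiles, and SqlTable, are placeholder examples, not from the course):

```json
{
  "name": "CopyPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyFromBlobToSql",
        "type": "Copy",
        "inputs":  [ { "referenceName": "InputFiles", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "SqlTable", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "sink":   { "type": "SqlSink" }
        }
      }
    ]
  }
}
```

Note that the Copy activity only references datasets by name; the datasets in turn reference linked services, so the activity itself never carries connection details.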
Data transformation: the Mapping Data Flow activity is one option for transforming your data. If you would like to call an external service instead, there are 13 different options, including HDInsight, which can run Hive, Hadoop Streaming, Spark, Pig, and MapReduce jobs.
Data control: defines the logic that your pipeline is going to follow, with options such as ForEach, Set Variable, and Wait. It also allows you to invoke other pipelines or execute SSIS packages.
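A minimal sketch of a control-flow pipeline in ADF's JSON form, assuming a hypothetical array parameter named fileNames (all names are illustrative placeholders): a ForEach loops over the items, and a Wait pauses inside each iteration.

```json
{
  "name": "ControlPipeline",
  "properties": {
    "parameters": {
      "fileNames": { "type": "Array" }
    },
    "activities": [
      {
        "name": "LoopOverFiles",
        "type": "ForEach",
        "typeProperties": {
          "items": { "value": "@pipeline().parameters.fileNames", "type": "Expression" },
          "activities": [
            {
              "name": "PauseBetweenFiles",
              "type": "Wait",
              "typeProperties": { "waitTimeInSeconds": 30 }
            }
          ]
        }
      }
    ]
  }
}
```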
Integration Runtimes
In fact, this is what you are actually paying for in Azure Data Factory.
An integration runtime provides 4 main capabilities: Data Flow execution, data movement execution, dispatch of activities, and SSIS package execution.
1. Azure IR (AutoResolve Integration Runtime): for data movement between public endpoints.
a. For instance, Amazon (AWS), GCP, or Azure itself
b. Software-as-a-service resources such as Salesforce and SAP
c. Mapping Data Flows on Azure
2. Self-hosted IR: for connecting to private and on-premises resources, via a regular outbound HTTPS connection.
3. Azure-SSIS IR: exclusively for executing SSIS packages. It is useful if you have done a lot of work in SSIS in the past and want to migrate all of that to Azure Data Factory. (Alternatively, if you create a SQL Server virtual machine, you can keep running your SSIS packages there.)
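For reference, an integration runtime is itself declared as a small JSON resource. A minimal sketch of a self-hosted IR (the name is a placeholder):

```json
{
  "name": "MySelfHostedIR",
  "properties": {
    "type": "SelfHosted",
    "description": "Runs on a machine inside the private network"
  }
}
```

After creating the definition, you install the self-hosted IR software on a machine inside the private network and register that node with an authentication key.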
For example, if the linked service points to a SQL database, the datasets are tables; if the linked service is Blob Storage, the datasets are files.
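That relationship can be sketched in JSON, assuming a hypothetical Blob Storage linked service named MyBlobStorage (all names and paths are placeholders): the dataset describes a specific file and references the linked service for the connection.

```json
{
  "name": "InputFiles",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": {
      "referenceName": "MyBlobStorage",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "input",
        "fileName": "sales.csv"
      }
    }
  }
}
```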
Triggers
There are 3 types of trigger: schedule, tumbling window, and event-based.
SSIS doesn't have an event-based trigger. Azure Data Factory also has an option to connect with Git.
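A sketch of the first kind, a schedule trigger that runs a pipeline once a day (the trigger name, pipeline name, and start time are all placeholders):

```json
{
  "name": "DailyTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2024-01-01T06:00:00Z"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "MyPipeline",
          "type": "PipelineReference"
        }
      }
    ]
  }
}
```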