ADF Copy Data: Copy Data From Azure Blob Storage To A SQL Database Using Azure Data Factory

In this blog, we cover the case study of copying data from Blob storage to a SQL database with Azure Data Factory (an ETL service), which we discuss in detail in our Microsoft Azure Data Engineer Certification [DP-203] FREE CLASS.

The following diagram shows the logical components that take part in a copy activity: the storage account (data source), the SQL database (sink), and the Azure data factory itself.

Topics we'll cover:

 Overview of Azure Data Factory
 Overview of Azure Blob Storage
 Overview of Azure SQL Database
 How to perform Copy Activity with Azure Data Factory
Before performing the copy activity in Azure Data Factory, we should understand the basic concepts of Azure Data Factory, Azure Blob storage, and Azure SQL Database.
Overview Of Azure Data Factory
 Azure Data Factory is defined as a cloud-based ETL and data integration service.
 The aim of Azure Data Factory is to fetch data from one or more data sources and load it into a form that we can process.
 The data sources might contain noise that we need to filter out. Azure Data Factory enables us to pull the interesting data and remove the rest.
 Azure Data Factory can ingest data from a variety of sources and load it into a variety of destinations, e.g., Azure Data Lake.
 It lets us create data-driven pipelines for orchestrating data movement and transforming data at scale.
To download the complete DP-203 Azure Data Engineer Associate Exam Questions
guide click here.
Overview Of Azure Blob Storage
 Azure Blob storage is Microsoft Azure's object storage solution for the cloud. It is optimized for storing massive amounts of unstructured data.
 It is used for streaming video and audio, writing to log files, and storing data for backup and restore, disaster recovery, and archiving.
 Azure Blob storage offers three types of resources:
 The storage account
 A container in the storage account
 A blob in a container
 Objects in Azure Blob storage are accessible via Azure PowerShell, the Azure Storage REST API, the Azure CLI, or an Azure Storage client library.

Overview Of Azure SQL Database


 It is a fully managed platform as a service. The platform manages aspects such as database software upgrades, patching, backups, and monitoring.
 Using Azure SQL Database, we can provide a highly available and performant
storage layer for our applications.
 Types of Deployment Options for the SQL Database:
 Single Database
 Elastic Pool
 Managed Instance
 Azure SQL Database offers three service tiers:
 General Purpose or Standard
 Business Critical or Premium
 Hyperscale

Note: If you want to learn more about it, then check our blog on Azure SQL Database
ADF Copy Data From Blob Storage To SQL Database
1. Create a blob and a SQL table
2. Create an Azure data factory
3. Use the Copy Data tool to create a pipeline and monitor the pipeline
STEP 1: Create a blob and a SQL table
1) To create a source blob, launch Notepad on your desktop. Copy the following text and save it in a file named emp.txt on your disk.
FirstName|LastName
John|Doe
Jane|Doe
2) Create a container named adftutorial in your Blob storage.

Read: Reading and Writing Data In DataBricks


3) Upload the emp.txt file to the input folder of the adftutorial container.
4) To create a sink SQL table, use the following SQL script to create a table named dbo.emp in your SQL Database.
CREATE TABLE dbo.emp
(
ID int IDENTITY(1,1) NOT NULL,
FirstName varchar(50),
LastName varchar(50)
)
GO
CREATE CLUSTERED INDEX IX_emp_ID ON dbo.emp (ID);
Note: Ensure that Allow access to Azure services is turned ON for your SQL Server so that Data Factory can write data to your SQL Server. To verify and turn on this setting, go to your logical SQL server > Overview > Set server firewall > set the Allow access to Azure services option to ON.
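If you prefer to script this setting, the sketch below shows one way to do it from T-SQL. It assumes you are connected to the master database of your logical SQL server with sufficient permissions; the built-in sp_set_firewall_rule procedure with the special AllowAllWindowsAzureIps rule (0.0.0.0 to 0.0.0.0) is the documented equivalent of the portal toggle.

-- Run against the master database of the logical SQL server.
-- The 0.0.0.0 rule represents "Allow access to Azure services".
EXECUTE sp_set_firewall_rule
    @name = N'AllowAllWindowsAzureIps',
    @start_ip_address = '0.0.0.0',
    @end_ip_address = '0.0.0.0';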
Also read: Azure Stream Analytics is the perfect solution when you require a fully
managed service with no infrastructure setup hassle.
STEP 2: Create a data factory
1) Sign in to the Azure portal. Select Analytics > Data Factory.

2) On the New data factory page, select Create.

3) On the Basics page, enter the following details, then select Git configuration.
4) On the Git configuration page, select the check box, and then go to Networking. Then select Review + create.
5) After the creation is finished, the Data Factory home page is displayed. Select the Author & Monitor tile.

Read: Azure Data Engineer Interview Questions September 2022


STEP 3: Use the ADF Copy Data tool to create a pipeline
1) Select the + (plus) button, and then select Pipeline.

2) In the General panel under Properties, specify CopyPipeline for Name. Then collapse
the panel by clicking the Properties icon in the top-right corner.
 

3) In the Activities toolbox, expand Move & Transform. Drag the Copy Data activity from
the Activities toolbox to the pipeline designer surface. You can also search for activities in
the Activities toolbox. Specify CopyFromBlobToSql for Name.

4)  Go to the Source tab. Select + New to create a source dataset.


5) In the New Dataset dialog box, select Azure Blob Storage to copy data from Azure Blob storage, and then select Continue.
6) In the Select Format dialog box, choose the format type of your data, and then select
Continue.

Read: DP 203 Exam: Azure Data Engineer Study Guide


7) In the Set Properties dialog box, enter SourceBlobDataset for Name. Select the
checkbox for the first row as a header. Under the Linked service text box, select + New.
8) In the New Linked Service (Azure Blob Storage) dialog box, enter AzureStorageLinkedService as the name and select your storage account from the Storage account name list. Test the connection, then select Create to deploy the linked service.
9) After the linked service is created, you are taken back to the Set properties page. Next to File path, select Browse. Navigate to the adftutorial/input folder, select the emp.txt file, and then select OK.

10) Select OK. It automatically navigates to the pipeline page. In the Source tab, confirm that SourceBlobDataset is selected. To preview data on this page, select Preview data.
11) Go to the Sink tab, and select + New to create a sink dataset. In the New Dataset dialog box, enter "SQL" in the search box to filter the connectors, select Azure SQL Database, and then select Continue.
12) In the Set Properties dialog box, enter OutputSqlDataset for Name. From the Linked
service dropdown list, select + New.
Read: Microsoft Azure Data Engineer Associate [DP-203] Exam Questions
13) In the New Linked Service (Azure SQL Database) dialog box, fill in the following details.

14) The test connection may fail. If so, go to your Azure SQL database, select your database, and open the Set server firewall settings page. On the Firewall settings page, select Yes for Allow Azure services and resources to access this server, and then save the settings.

15) On the New Linked Service (Azure SQL Database) Page, Select Test connection to
test the connection. Then Select Create to deploy the linked service.
16) It automatically navigates to the Set Properties dialog box. In Table, select [dbo].[emp]. Then select OK.
17) To validate the pipeline, select Validate from the toolbar.

18) Once the pipeline validates successfully, select Publish all in the top toolbar. This publishes the entities (datasets and pipelines) you created to Data Factory. Select Publish.
19) Select Trigger on the toolbar, and then select Trigger Now.  On the Pipeline Run
page, select OK.

20) Go to the Monitor tab on the left. You see a pipeline run that is triggered by a manual
trigger. You can use links under the PIPELINE NAME column to view activity details and to
rerun the pipeline.

21) To see activity runs associated with the pipeline run, select the CopyPipeline link
under the PIPELINE NAME column.
22) Select All pipeline runs at the top to go back to the Pipeline Runs view. To refresh the
view, select Refresh.
23) Verify that the pipeline run that copies data from Azure Blob storage to a database in Azure SQL Database using Azure Data Factory shows the status Succeeded.
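As an extra check on the database side, a quick query against the sink table should show the two rows from emp.txt (a minimal sketch; the table and columns are the ones created in Step 1):

-- Verify the rows copied into the sink table.
SELECT ID, FirstName, LastName
FROM dbo.emp;
-- Expected result: two rows, John|Doe and Jane|Doe.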

Congratulations! You just used the Copy Data tool to create a pipeline and monitored the pipeline and activity run successfully.

Overview

We've defined linked services and datasets, so now we can work on pipelines. A pipeline is a
container that will execute one or more activities.

The Purpose of Activities


An activity inside a pipeline corresponds with a single task. If we want to compare this with
Integration Services (SSIS), an activity in ADF is like a task on the control flow. If you create a
pipeline, you can find the list of activities in the left menu bar.
If you want to add an activity, you can drag it from the menu onto the canvas. If you have multiple activities in a pipeline, they will all be executed in parallel unless they are connected with a dependency. For example, in the following pipeline, all three activities will be executed at the same time:

When you click on the little green box at the right of an activity, you can draw an arrow from that activity to another activity. This is how we define dependencies. In the following pipeline, Script1 will be executed first. If it is successful, Script2 will be executed. If that one is successful as well, Script3 will be executed.
Dependencies are very similar to precedence constraints in SSIS, but not as flexible. In SSIS you
can define expressions on a constraint so it can be evaluated conditionally. This is not possible in
ADF. In the following example, Script3 depends on Script1 and Script2 with an AND constraint.

This means Script1 and Script2 will execute in parallel. Script3 will only be executed if both
succeed. There's currently no possibility to define an OR constraint, meaning Script3 will
execute if Script1 or Script2 succeeds. This is possible in SSIS.

When you right-click on a dependency, you can change its type.

 Success is the default dependency. Another task will only start if the previous task completed
successfully.
 Failure is the opposite. The task will only start if the previous task has failed. This is useful for error
handling and monitoring.
 Completion means a task will start when the previous task has completed. It doesn't matter if that
task was successful or not.
 Skipped is probably not used that much. A task will execute if the previous task has been skipped.
For example, in the following pipeline, Script3 will only execute if Script2 is skipped (meaning never
executed). This can happen when Script1 fails.

Which Types of Activities are there?


There are three types of activities:

 Data movement activities. This is basically only the Copy Activity. This activity was used by the
Copy Data tool in the earlier parts of the tutorial.
 Data transformation activities. Most of these are external activities, meaning the actual computation doesn't happen at ADF itself but rather at another data store. Data flows are a bit of an exception; we will cover them later in the tutorial. Examples of transformation activities are
the stored procedure (executed on the database), Script (also executed on the database, which is a
fairly new addition to ADF), Azure Function, Hive/Pig/MapReduce/Spark (all on HDInsight)
and Databricks Notebook/JAR/Python (these are executed on an Azure Databricks Cluster).
 Control Flow activities. These activities are almost all native to ADF, meaning the computation is
done in ADF itself. You have activities for different purposes:
o to deal with variables (Append Variable, Set Variable and Filter)
o looping (ForEach and Until)
o branching (the If Condition activity)
o executing other pipelines with Execute Pipeline activity, or SSIS packages with Execute SSIS
Package activity.
o Handling metadata or reference data with the Get Metadata activity or the Lookup activity.
o You can do web calls with the Web and Webhook activities.
o You can validate other pipelines with the Validation activity. The Wait activity is used when you need
to wait for a specific amount of time.
With all these different activities, you can build intricate data pipelines that will support many
different use cases.

Activity Best Practices


 Give an activity a proper name, so if someone else opens your pipeline, they can easily understand
what the purpose of each activity is.
 Have a modular design. A pipeline should have only one purpose. For example, load data from a
data lake to SQL Server. If you for example need to do other steps, such as loading dimension and
fact tables, these should be done in another pipeline.
 You can have a maximum of 40 activities in a pipeline. If you need more, you'll need to create child pipelines and call them using the Execute Pipeline activity. If you have too many activities, maybe you violated the previous best practice.
 Try to minimize the number of activities needed. For every started activity, ADF bills at least one minute, and durations are rounded up. For example, if an activity runs for 65 seconds, you will pay for two full minutes. If a pipeline has 5 activities, you will pay for at least 5 full minutes, even if the total execution time of the pipeline is well below 5 minutes.
 Ideally, pipelines are idempotent, meaning that if you execute one repeatedly, the result should be the same. If a pipeline fails, you should be able to restart it without doing manual cleanup. If there's any cleanup needed, it should be taken care of by the pipeline itself (a minimal sketch follows below).
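As a minimal illustration of that last point, a pre-copy script that clears the target before each load makes a simple truncate-and-load pipeline safe to rerun. This reuses the tutorial's own staging table; the keyed DELETE is a hypothetical alternative for when you only reload one slice:

-- Idempotent full reload: rerunning the pipeline yields the same end state.
TRUNCATE TABLE dbo.Tutorial_Excel_Customer;

-- Hypothetical alternative for partial loads: clear only the slice being reloaded.
-- DELETE FROM dbo.Tutorial_Excel_Customer WHERE CompanyName = 'Contoso';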
Additional Information
 You can find an overview of the control flow activities in the tip Azure Data Factory Control Flow
Activities Overview.
 The tip Build Azure Data Factory Pipeline Dependencies gives another example of how to use
dependencies in a pipeline.
 You can find a good overview of all the activities in the official documentation.
Building an Azure Data Factory Pipeline Manually
Overview

In the previous parts of the tutorial, we've covered all the building blocks for a pipeline: linked
services, datasets and activities. Now let's create a pipeline from scratch.

Prerequisites
We'll be using objects that were created in the previous steps of the tutorial. If you haven't created
these yet, it is best you do so if you want to follow along.

We will read the Excel file from the Azure blob container and store the data in a table in the Azure SQL database. We're also going to log some messages into a log table.

Once logged in to the database, execute the following script to create the destination table:

DROP TABLE IF EXISTS dbo.Tutorial_Excel_Customer;

CREATE TABLE dbo.Tutorial_Excel_Customer(
[Title] [NVARCHAR](10) NULL,
[FirstName] [NVARCHAR](50) NULL,
[MiddleName] [NVARCHAR](50) NULL,
[LastName] [NVARCHAR](50) NULL,
[Suffix] [NVARCHAR](10) NULL,
[CompanyName] [NVARCHAR](100) NULL,
[EmailAddress] [NVARCHAR](250) NULL,
[Phone] [NVARCHAR](25) NULL
);

We're explicitly creating the table ourselves, because when ADF reads data from a semi-structured file like Excel or CSV, it cannot determine the correct data types and will set all columns to NVARCHAR(MAX). For example, this is the table that was created with the Copy Data tool:
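The screenshot is not reproduced here, but the auto-created table looks roughly like the sketch below: same column names, every column typed NVARCHAR(MAX). The table name shown is illustrative; the Copy Data tool generates its own.

-- Illustrative sketch of a table auto-created by the Copy Data tool (assumed name).
CREATE TABLE dbo.Tutorial_Excel_Customer_AutoCreated(
[Title] NVARCHAR(MAX) NULL,
[FirstName] NVARCHAR(MAX) NULL,
[MiddleName] NVARCHAR(MAX) NULL,
[LastName] NVARCHAR(MAX) NULL,
[Suffix] NVARCHAR(MAX) NULL,
[CompanyName] NVARCHAR(MAX) NULL,
[EmailAddress] NVARCHAR(MAX) NULL,
[Phone] NVARCHAR(MAX) NULL
);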

We're also going to create a logging table in a schema called "etl". First execute this script:

CREATE SCHEMA etl;

Then execute the following script for the log table:


CREATE TABLE etl.logging(
ID INT IDENTITY(1,1) NOT NULL
,LogMessage VARCHAR(500) NOT NULL
,InsertDate DATE NOT NULL DEFAULT SYSDATETIME()
);

Since we have a new destination table, we also need a new dataset. In the Author section, go to
the SQL dataset that was created as part of the Copy Data tool (this should be
"DestinationDataset_eqx"). Click on the ellipsis and choose Clone.

This will make an exact copy of the dataset, but with a different name. Change the name to
"SQL_ExcelCustomers" and select the newly created table from the dropdown:

In the Schema tab, we can import the mapping of the table.


Publish the new dataset.

Building the Pipeline


Go to the Author section of ADF Studio and click on the blue "+"-icon. Go to pipeline > pipeline to
create a new pipeline.
Start by giving the new pipeline a decent name.

Next, add a Script activity to the canvas and name it "Log Start".

In the General tab, set the timeout to 10 minutes (the default is 7 days!). You can also set the
number of retries to 1. This means if the Script activity fails, it will wait for 30 seconds and then try
again. If it fails again, then the activity will actually fail. If it succeeds on the second attempt, the
activity will be marked as succeeded.

In the Settings tab, choose the linked service for the Azure SQL DB and set the script type
to NonQuery. The Query option means the executed SQL script will return one or more result
sets. The NonQuery option means no result set is returned and is typically used to execute DDL
statements (such as CREATE TABLE, ALTER INDEX, TRUNCATE TABLE …) or DML
statements that modify data (INSERT, UPDATE, DELETE). In the Script textbox, enter the
following SQL statement:

INSERT INTO etl.logging(LogMessage)
VALUES('Start reading Excel');

The settings should now look like this:

Next, drag a Copy Data activity to the canvas. Connect the Script activity with the new activity.
Name it "Copy Excel to SQL".
In the General tab, change the timeout and the number of retries:

In the Source tab, choose the Excel dataset we created earlier. Disable the Recursively checkbox.
In this example we're reading from one single Excel file. However, if you have multiple Excel files
of the same format, you can read them all at the same time by changing the file path type to a
wildcard, for example "*.xlsx".

In the Sink tab, choose the SQL dataset we created in the prerequisites section. Leave the
defaults for the properties and add the following SQL statement to the pre-copy script:

TRUNCATE TABLE dbo.Tutorial_Excel_Customer;

The Sink tab should now look like this:

In the Mapping tab, we can explicitly map the source columns with the sink columns. Hit
the Import Schemas button to let ADF do the mapping automatically.
In this example, doing the mapping isn't necessary since the columns from the source map 1-to-1
to the sink columns. They have the same names and data types. If we would leave the mapping
blank, ADF will do the mapping automatically when the pipeline is running. Specifying an explicit
mapping is more important when the column names don't match, or when the source data is more
complex, for example a hierarchical JSON file.

In the Settings tab we can specify some additional properties.


An important property is the number of data integration units (DIU), which is a measure of the power of the compute executing the copy. As you can see in the informational message, this directly influences the cost of the Copy Data activity. The price is calculated as $0.25 (this might vary with your subscription and currency) * the copy duration (remember this is always at least one minute and rounded up to the next full minute!) * the number of used DIUs. The default value for DIU is set to Auto, meaning ADF will scale the number of DIUs for you automatically. Possible values are between 2 and 256. For small data loads, ADF will start with a minimum of 4 DIUs. But for a small Excel file like ours this is already overkill. If you know your dataset is going to be small, change the property from Auto to 2. This will cut the price of your copy data activities in half!
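As a rough worked example of that formula (using the $0.25 per DIU-hour list price quoted above, which may differ for your subscription and currency):

A copy that runs for 65 seconds is billed as 2 minutes.
With 4 DIUs: 2 minutes * 4 DIU = 8 DIU-minutes = 0.1333 DIU-hours * $0.25 ≈ $0.033
With 2 DIUs: 2 minutes * 2 DIU = 4 DIU-minutes = 0.0667 DIU-hours * $0.25 ≈ $0.017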

As a final step, copy/paste the Script activity. Change the name to "Log End" and connect the
Copy Data activity with this new activity.

In the Settings tab, change the SQL script to the following statement:

INSERT INTO etl.logging(LogMessage)
VALUES('Finish copying Excel');

The pipeline is now finished. Hit the debug button to start executing the pipeline in debug mode.
After a while the pipeline will finish. You can see in the Output pane how long each activity has
been running:

If you hover with your mouse over a line in the output, you will get icons for the input & output, and
in the case of the Copy Data activity you will get an extra "glasses" icon for more details.

When we click on the output for the "Log End" activity, we get the following:
We can see 1 row was inserted. When we go to the details of the Copy Data, we get the following
information:
A lot of information has been captured, such as the number of rows read, how many connections were used, how many KB were written to the database, and so on. Back in the Output pane, there's a link to the debug run consumption.

This will tell us exactly how many resources the debug run of the pipeline consumed:

0.0333 corresponds with two minutes (1 minute of execution rounded up * 2 DIU). Since our
debug run was successful, we can publish everything.
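If you want to double-check the result from the database side before publishing, a couple of quick queries against the objects created earlier in this part will confirm both the copied rows and the two log entries (a minimal sketch):

-- Row count of the destination table loaded by the Copy Data activity.
SELECT COUNT(*) AS CustomerRows FROM dbo.Tutorial_Excel_Customer;

-- The messages written by the "Log Start" and "Log End" Script activities.
SELECT ID, LogMessage, InsertDate FROM etl.logging ORDER BY ID;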

Why do we need to Publish?


When you create new objects such as linked services, datasets, and pipelines, or when you modify existing ones, those changes are not automatically persisted on the server. You can first debug your pipelines to make sure your changes are working. Once everything works fine and validation succeeds, you can publish your changes to the server. If you do not publish your changes and you close your browser session, your changes will be lost.

Building Flexible and Dynamic Azure Data Factory Pipelines
Overview

In the previous part we built a pipeline manually, along with the needed datasets and linked services. But
what if you need to load 20 Excel files? Or 100 tables from a source database? Are you going to create 100
datasets? And 100 different pipelines? That would be too much (repetitive) work! Luckily, we can have
flexible and dynamic pipelines where we just need two datasets (one for the source, one for the sink) and
one pipeline. Everything else is done through metadata and some parameters.

Prerequisites
Previously we uploaded an Excel file from Azure Blob Storage to a table in Azure SQL Database. A new
requirement came in and now we must upload another Excel file to a different table. Instead of creating a
new dataset and a new pipeline (or add another Copy Data activity to the existing pipeline), we're going to
reuse our existing resources.

The new Excel file contains product data, and it has the following structure:

As you can see from the screenshot, the worksheet name is the default "Sheet1". You can download the
sample workbook here. Upload the Excel workbook to the blob container we used earlier in the tutorial.

Since we want to store the data in our database, we need to create a new staging table:

CREATE TABLE dbo.Tutorial_StagingProduct
(
[Name] NVARCHAR(50)
,[ProductNumber] NVARCHAR(25)
,[Color] NVARCHAR(15)
,[StandardCost] NUMERIC(10,2)
,[ListPrice] NUMERIC(10,2)
,[Size] NVARCHAR(5)
,[Weight] NUMERIC(8,2)
);
Implement Parameters
Instead of creating two new datasets and another Copy Data activity, we're going to use parameters in the
existing ones. This will allow us to use one single dataset for both our Excel files. Open
the Excel_Customers dataset, go to properties and rename it to Excel_Generic.

Then go to the Parameters tab, and create the following two parameters:

Back in the Connection tab, click on Customers.xlsx and then on "Add dynamic content".


This will take us to the expression builder of ADF. Choose the parameter WorkbookName from the list
below.

The file path should now look like this:
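In other words, the file-name part of the path is no longer the hard-coded Customers.xlsx but a dataset parameter reference, in ADF's expression syntax:

@dataset().WorkbookName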

Repeat the same process for the sheet name:


Both Excel files have the first row as a header, so the checkbox can remain checked, but this is something
that can be parameterized as well. Finally, go to the Schema tab and click the Clear button to remove all
metadata information from the dataset:

The schema is different for each Excel file, so we cannot have any column information here. It will be
fetched on the fly when the Copy Data activity runs.

We're going to do the exact same process for our SQL dataset. First, we rename it to SQL_Generic and then
we add two parameters: SchemaName and TableName. We're going to map these in the connection tab. If
you enable the "Edit" checkbox, two text fields appear (one for the schema and one for the table) which you
can parameterize:
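After parameterizing, the two text fields should contain the corresponding dataset parameter references, roughly:

Schema: @dataset().SchemaName
Table:  @dataset().TableName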
Don't forget to clear the schema! Go to the StageExcelCustomers pipeline and rename it to "StageExcel". If
we open the Copy Data activity, we can see ADF asks us now to provide values for the parameters we just
added.

You can enter them manually, but that would defeat the purpose of our metadata-driven pipeline.

Creating and Mapping Metadata


We're going to store the metadata we need for our parameters in a table. We're going to read this metadata
and use it to drive a ForEach loop. For each iteration of the loop, we're going to copy the data from one
Excel file to a table in Azure SQL DB. Create the metadata table with the following script:

CREATE TABLE etl.ExcelMetadata(
ID INT IDENTITY(1,1) NOT NULL
,ExcelFileName VARCHAR(100) NOT NULL
,ExcelSheetName VARCHAR(100) NOT NULL
,SchemaName VARCHAR(100) NOT NULL
,TableName VARCHAR(100) NOT NULL
);

Insert the following two rows of data:

INSERT INTO etl.ExcelMetadata
(
ExcelFileName,
ExcelSheetName,
SchemaName,
TableName
)
VALUES ('Customers.xlsx','Customers','dbo','Tutorial_Excel_Customer')
,('Products.xlsx' ,'Sheet1' ,'dbo','Tutorial_StagingProduct');

In the pipeline, add a Lookup activity to the canvas after the first Script activity. Give the activity a decent
name, set the timeout to 10 minutes and set the retry to 1.

In the Settings, choose the generic SQL dataset. Disable the checkbox for "First row only" and choose the
Query type. Enter the following query:

SELECT
ExcelFileName
,ExcelSheetName
,SchemaName
,TableName
FROM etl.ExcelMetadata;

Since we're specifying a query, we don't actually need to provide (real) values for the dataset parameters;
we're just using the dataset for its connection to the Azure SQL database.
Preview the data to make sure everything has been configured correctly.

Next, we're going to add a ForEach to the canvas. Add it after the Lookup and before the second Script
activity.
Select the Copy Data activity, cut it (using ctrl-x), click the pencil icon inside the ForEach activity. This will
open a pipeline canvas inside the ForEach loop. Paste the Copy Data activity there. At the top left corner of
the canvas, you can see that we're inside the loop, which is in the StageExcel pipeline. It seems like there's a
"mini pipeline" inside the ForEach. However, functionality is limited. You can't for example put another
ForEach loop inside the existing ForEach. If you need to nest loops, you'll need to put the second ForEach in
a separate pipeline and call this pipeline from the first ForEach using the Execute Pipeline activity. Go back
to the pipeline by clicking on its name.

Go to the Settings pane of the ForEach. Here we need to configure over which items we're going to iterate.
This can be an array variable, or a result set such as the one from our Lookup activity.

Click on "Add dynamic content" for the Items. In the "Activity outputs" node, click on the Lookup activity.
This will add the following expression:

@activity('Get Metadata').output

However, to make this actually work, we need to add value at the end:

@activity('Get Metadata').output.value

In the settings, we can also choose if the ForEach executes in parallel, or if it will read the Excel files
sequentially. If you don't want parallelism, you need to select the Sequential checkbox.
Now go back into the ForEach loop canvas and into the Copy Data activity. Now we can map the metadata
we retrieve from the Lookup to the dataset parameters. In the Source pane, click on the text box for
the WorkbookName parameter and go to the dynamic content.

We can access the values of the current item of the ForEach loop by using the item() function. We just need to specify exactly which column we want:
We can repeat the same process for the sheet name:

And of course, we do the same for the SQL dataset in the Sink tab:
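For the sink dataset, the two parameters map to the remaining metadata columns in the same way:

SchemaName: @item().SchemaName
TableName:  @item().TableName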

We also need to change the Pre-copy script, to make sure we're truncating the correct table. Like most
properties, we can do this through an expression as well. We're going to use the @concat() function to create
a SQL statement along with the values for the schema and table name.

@concat('TRUNCATE TABLE ',item().SchemaName,'.',item().TableName,';')

Finally, we need to remove the schema mapping in the Mapping pane. Since both the source and the sink are dynamic, we can't specify any mapping here unless it is the same for all Excel files (which isn't the case). If the mapping is empty, the Copy Data activity will do it for us on the fly. For this to work, the column names in the Excel file and the corresponding table need to match!
The pipeline is now ready to run.

Debugging the Pipeline


Start debugging of the pipeline. In the output pane, you'll see the Copy Data activity has been run twice, in
parallel.

We've now successfully loaded two Excel files to an Azure SQL database by using one single pipeline
driven by metadata. This is an important pattern for ADF, as it greatly reduces the amount of work you need
to do for repetitive tasks. Keep in mind though, that each iteration of the ForEach loop results in at least one
minute of billing. Even though our debugging pipeline was running for a mere 24 seconds, we're being
billed for 5 minutes (2 Script activities + 1 Lookup + 2 iterations of the loop).
Additional Information
 The tips How to Load Multiple Files in Parallel in Azure Data Factory - Part 1 and Part 2 give an example of
the same pattern, this time with CSV files with different delimiters.
 You can find another looping example in the tip Azure Data Factory ForEach Activity Example or in the blog
post Dynamic Datasets in Azure Data Factory.
 ADF can define mappings automatically, but for some sources, like a JSON file, it might be a bit too complicated for an automated mapping. It's possible to make the mapping dynamic as well by specifying it as dynamic content. The blog post Dynamically Map JSON to SQL in Azure Data Factory explains how you can do this.

Azure Data Factory Integration Runtimes


Overview

In this tutorial we have been executing pipelines to get data from a certain source and write it to another destination. The Copy Data activity, for example, provides us with an auto-scalable source of compute that will execute this data transfer for us. But what is this compute exactly? Where does it reside? The answer is: integration runtimes. These runtimes provide us with the necessary computing power to execute all the different kinds of activities in a pipeline. There are 3 types of integration runtimes (IR), which we'll discuss in the following sections.

The Azure-IR
The most important integration runtime is the one we've been using all this time: the Azure-IR.
Every installation of ADF has a default IR: the AutoResolveIntegrationRuntime. You can find it
when you go to the Manage section of ADF and then click on Integration Runtimes.
It's called auto resolve because it will try to automatically resolve the geographic region the compute needs to run in. This is determined, for example, by the data store of the sink in a Copy Data activity. If the sink is located in West Europe, it will try to run the compute in the West Europe region as well.

The Azure-IR is a fully managed, serverless compute service. You don't have to manage anything; you only pay for the duration it has been running compute. You can always use the default Azure-IR, but you can also create a new one. Click on New to create one.

In the new window, choose the option with "Azure, Self-Hosted".

In the next step, choose Azure again.


In the following screen, enter a name for the new IR. Also choose your closest region.
You can also configure the IR to use a virtual network, but this is an advanced setting that is not covered in the tutorial. Keep in mind that billing for pipeline durations is several orders of magnitude higher when you're using a virtual network. In the third pane, we can configure the compute power for data flows. Data flows are discussed in the next section of the tutorial.

There are two main reasons to create your own Azure-IR:

 You want to specify a specific region for your compute. For example, if regulations specify your data can never leave a certain region, you need to create your own Azure-IR located in that region.
 You want to specify a data flow runtime with different settings than the default one. Especially the Time To Live setting is something that is worth changing (shorter if you want to save on costs, longer if you don't want to restart your cluster too often during development/debugging).

Click on Create to finish the setup of the new Azure-IR. But how do we use this IR? If we go for
example to the linked service connecting to our Azure SQL database, we can specify a different
IR:
The Self-hosted IR
Suppose you have data on-premises that you need to access from ADF. How can ADF reach this
data store when it is in the Azure cloud? The self-hosted IR provides us with a solution. You install
the self-hosted IR on one of your local machines. This IR will then act as a gateway through which
ADF can reach the on-premises data.

Another use case for the self-hosted IR is when you want to run compute on your own machines instead of in the Azure cloud. This might be an option if you want to save costs (the billing for pipeline durations is lower on the self-hosted IR than on the Azure-IR) or if you want to control everything yourself. ADF will then act as an orchestrator, while all of the compute runs on your own local servers.

It's possible to install multiple self-hosted IRs on your local network to scale out resources. You
can also share a self-hosted IR between multiple ADF environments. This can be useful if you
want only one self-hosted IR for both development and production.

The following tips give more detail about this type of IR:

 Connect to On-premises Data in Azure Data Factory with the Self-hosted Integration Runtime - Part
1 and Part 2.
 Transfer Data to the Cloud Using Azure Data Factory
 Build Azure Data Factory Pipelines with On-Premises Data Sources

The Azure-SSIS IR
ADF provides us with the opportunity to run Integration Services packages inside the ADF environment. This can be useful if you want to quickly migrate SSIS projects to the Azure cloud without a complete rewrite. The Azure-SSIS IR provides us with a scale-out cluster of virtual machines that can run SSIS packages. You create an SSIS catalog in either Azure SQL Database or Azure SQL Managed Instance.
As usual, Azure deals with the infrastructure. You only need to specify how powerful the Azure-
SSIS IR is by configuring the size of a compute node and how many nodes there need to be. You
are billed for the duration the IR is running. You can pause the IR to save on costs.

The following tips give you more information on the Azure-SSIS IR:

 Configure an Azure SQL Server Integration Services Integration Runtime


 Customized Setup for the Azure-SSIS Integration Runtime
 Execute SSIS Package in Azure-SSIS Integration Runtime
 Parallel package execution in Azure-SSIS Runtime
 SSIS Catalog Maintenance in the Azure Cloud
Additional Information
 The tip Migrate a Package Deployment Integration Services Project to Azure details how you can
migrate an SSIS project using the package deployment model to ADF.
 You can find a comparison between ADF and SSIS in the tip Choosing Between SQL Server
Integration Services and Azure Data Factory.
 There's an on-demand webinar you can watch about lifting and shifting your SSIS projects to
ADF: Migrating SQL Server Integration Services to the Cloud.
Azure Data Factory Data Flows
Overview

During the tutorial we've mentioned data flows a couple of times. The activities in a pipeline don't
really support data transformation scenarios. The Copy Data activity can transform data from one
format to another (for example, from a hierarchical JSON file to a table in a database), but that's
about it. Typically, you load data from one or more sources into a destination and you do the
transformations over there. E.g., you can use SQL in a database, or notebooks in Azure
Databricks when the data is stored in a data lake. This makes ADF a great ELT tool (Extract ->
Load -> Transform), but not so great for ETL. Data flows were introduced to remedy this. They are
an abstraction layer on top of Azure Databricks. They intuitively provide you with an option to
create ETL flows in ADF, without having to write any code (like you would need to do if you
worked directly in Azure Databricks). There are two types of data flows:

 The data flow (which was previously called the "mapping data flow")
 Power Query (which was previously called the "wrangling data flow")

Data Flow
A data flow in ADF uses the Azure-IR integration runtime to spin up a cluster of compute behind
the scenes (see the previous part about runtimes on how to configure your own). This cluster
needs to be running if you want to debug or run your data flow.

Data flows in ADF use a visual representation of the different sources, transformations, and sinks;
all connected with precedence constraints. They resemble data flows in Integration Services.
Here's an example from the tip What are Data Flows in Azure Data Factory?. This tip gives a step-
by-step example of how to create a data flow and how to integrate it into a pipeline.
Because you need a cluster to run a data flow, data flows are not well suited for processing small data sets, since there's the overhead of the cluster start-up time.

Power Query
The Power Query data flow is an implementation of the Power Query engine in ADF. When you
run a Power Query in ADF, the Power Query mash-up will be translated into a data flow script,
which will then be run on the Azure Databricks cluster. The advantage of Power Query is that you
can see the data and the results of your transformations as you're applying them. Users who have
been working with Excel, Power BI Desktop or Power BI Data Flows are also already familiar with
the editor.

You can find an example of a Power Query mash-up in the tip What are Data Flows in Azure Data
Factory? as well.
The disadvantage of Power Query is that not all functionality of the regular Power Query (as you
would have in Power BI Desktop for example) is available in ADF. You can find a list of the
limitations in the documentation.

Azure Data Factory Scheduling and Monitoring
Overview

When you've created your pipelines, you're not going to run them in debug mode every time you need to transfer some data. Rather, you want to schedule your pipelines so that they run at pre-defined points in time or when a certain event happens. When using Integration Services projects, you would, for example, use SQL Server Agent to schedule the execution of your packages.

Scheduling
In ADF, a "schedule" is called a trigger, and there are a couple of different types:

 Run-once trigger. In this case, you are manually triggering your pipeline so that it runs once. The
difference between the manual trigger and debugging the pipeline, is that with a trigger you're using
the pipeline configuration that is saved to the server. With debugging, you're running the pipeline as
it is in the visual editor.
 Scheduled trigger. The pipeline is being run on schedule, much like SQL Server Agent has
schedules. You can for example schedule a pipeline to run daily, weekly, every hour and so on.
 Tumbling window trigger. This type of trigger fires at a periodic interval. A tumbling window is a
series of fixed-sized, non-overlapping time intervals. For example, you can have a tumbling window
for each day. You can set it to start at the first of this month, and then it will execute for each day of
the month. Tumbling triggers are great for loading historical data (e.g. initial loads) in a "sliced"
manner instead of loading all data at once.
 Event-based trigger. You can trigger a pipeline to execute every time a specific event happens.
You can start a pipeline if a new file arrives in a Blob container (storage event), or you can define
your own custom events in Azure Event Grid.

Let's create a trigger for the pipeline we created earlier. In the pipeline, click on Add Trigger.

If you choose "Trigger Now", you will create a run-once trigger. The pipeline will run and that's it. If
you choose "New/Edit", you can either create a trigger or modify an existing one. In the Add
triggers pane, open the dropdown and choose New.
The default trigger type is Schedule. In the example below, we've scheduled our pipeline to run
every day, for the hours 6, 10, 14 and 18.
Once the trigger is created, it will start running and execute the pipeline according to schedule.
Make sure to publish the trigger after you've created it. You can view existing triggers in
the Manage section of ADF.

You can pause an existing trigger, or you can delete it or edit it. For more information about
triggers, check out the following tips:

 Create Event Based Trigger in Azure Data Factory


 Create Schedule Trigger in Azure Data Factory ADF
 Create Tumbling Window Trigger in Azure Data Factory ADF

ADF has a REST API which you can also use to start pipelines. You can for example start a
pipeline from an Azure Function or an Azure Logic App.
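For reference, starting a pipeline through the REST API is a single call to the createRun endpoint, shown below as it is documented at the time of writing. The subscription, resource group, factory, and pipeline names are placeholders, and the request must carry an Azure AD bearer token:

POST https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.DataFactory/factories/{factoryName}/pipelines/{pipelineName}/createRun?api-version=2018-06-01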

Monitoring
ADF has a monitoring section where you can view all executed pipelines, both triggered runs and debug runs.
You can also view the state of the integration runtimes or view more info about the data flows
debugging sessions. For each pipeline run, you can view the exact output and the resource
consumption of each activity and child pipeline.

It's also possible to configure Log analytics for ADF in the Azure Portal. It's out of scope for this
tutorial, but you can find more info in the tip Setting up Azure Log Analytics to Monitor
Performance of an Azure Resource. You can check out the Monitoring section for the ADF
resource in the Azure Portal:

 
You can choose the type of events that are being logged:
Additional Information
 You can find more info on logging and error handling in the following tips:
o Azure Data Factory Pipeline Logging Error Details
o Logging Azure Data Factory Pipeline Audit Data
o Azure Data Factory Pipeline Scheduling, Error Handling and Monitoring - Part 2
 More info about log analytics:
o Query Audit data in Azure SQL Database using Kusto Query Language (KQL)
o Create an Alert in Microsoft Azure Log Analytics
AZURE DATA FACTORY ACTIVITIES AND ITS TYPES

Priyanshi Sharma, Senior Data Engineer at Omnicom Media Group
Published May 22, 2021

What is Activity in Azure Data Factory?

An activity is a task we perform on our data. We use activities inside Azure Data Factory pipelines. ADF pipelines are a group of one or more activities. For example, when you create an ADF pipeline to perform ETL, you can use multiple activities to extract data, transform data, and load data into your data warehouse. An activity uses input and output datasets. A dataset represents your data, whether it is tables, files, folders, etc. The diagram below shows the relationship between activity, dataset, and pipeline:
An input dataset simply describes the input data and its schema, and an output dataset describes the output data and its schema. You can attach zero or more input datasets and one or more output datasets to an activity. Activities in Azure Data Factory can be broadly categorized as:

1- Data Movement Activities

2- Data Transformation Activities

3- Control Activities

DATA MOVEMENT ACTIVITIES :

1- Copy Activity: It simply copies data from a source location to a destination location. Azure supports multiple data store locations such as Azure Storage, Azure databases, NoSQL, files, etc.

To know more about data movement activities, please use the link below:

Pipelines and activities in Azure Data Factory - Azure Data Factory | Microsoft Docs

DATA TRANSFORMATION ACTIVITIES:

1- Data Flow: First, you design a data transformation workflow to transform or move data. Then you can call the Data Flow activity inside an ADF pipeline. It runs on scaled-out Apache Spark clusters. There are two types of data flows: mapping and wrangling data flows.

MAPPING DATA FLOW: It provides a platform to graphically design data transformation logic. You don't need to write code. Once your data flow is complete, you can use it as an activity in ADF pipelines.

WRANGLING DATA FLOW: It provides a platform to use Power Query, as available in Microsoft Excel, within Azure Data Factory. You can also use Power Query M functions in the cloud.
2- Hive Activity: This is an HDInsight activity that executes Hive queries on a Windows/Linux-based HDInsight cluster. It is used to process and analyze structured data.

3- Pig Activity: This is an HDInsight activity that executes Pig queries on a Windows/Linux-based HDInsight cluster. It is used to analyze large datasets.

4- MapReduce: This is an HDInsight activity that executes MapReduce programs on a Windows/Linux-based HDInsight cluster. It is used for processing and generating large datasets with a parallel, distributed algorithm on a cluster.

5- Hadoop Streaming: This is an HDInsight activity that executes a Hadoop Streaming program on a Windows/Linux-based HDInsight cluster. It is used to write mappers and reducers with any executable script in any language, like Python, C++, etc.

6- Spark: This is an HDInsight activity that executes a Spark program on a Windows/Linux-based HDInsight cluster. It is used for large-scale data processing.

7- Stored Procedure: In a Data Factory pipeline, you can use the Stored Procedure activity to invoke a SQL Server stored procedure (see the example sketch after this list). You can use the following data stores: Azure SQL Database, Azure Synapse Analytics, SQL Server database, etc.

8- U-SQL: It executes a U-SQL script on Azure Data Lake Analytics. U-SQL is a big data query language that provides the benefits of SQL.

9- Custom Activity: In a custom activity, you can create your own data processing logic that is not provided by Azure. You can configure a .NET activity or an R activity that will run on the Azure Batch service or an Azure HDInsight cluster.

10- Databricks Notebook: It runs your Databricks notebook in an Azure Databricks workspace, on Apache Spark.

11- Databricks Python Activity: This activity runs your Python files on an Azure Databricks cluster.

12- Azure Functions: It is an Azure compute service that allows us to write code logic and run it based on events, without provisioning any infrastructure. It stores your code in storage and keeps the logs in Application Insights. Key points of Azure Functions:

1- It is a serverless service.

2- Multiple languages are available: C#, Java, JavaScript, Python, and PowerShell.

3- It is a pay-as-you-go model.
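As a small illustration of the Stored Procedure activity mentioned in point 7, the sketch below defines a hypothetical procedure that reuses the etl.logging table created earlier in this document; the activity would simply reference the procedure name and pass LogMessage as a parameter.

-- Hypothetical procedure a Stored Procedure activity could invoke.
CREATE PROCEDURE etl.usp_LogMessage
    @LogMessage VARCHAR(500)
AS
BEGIN
    INSERT INTO etl.logging(LogMessage)
    VALUES (@LogMessage);
END;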

To know more about data transformation activities, use the link below:

Pipelines and activities in Azure Data Factory - Azure Data Factory | Microsoft Docs

3- Control Flow Activities:

1- Append Variable Activity: It appends a value to an existing array variable.

2- Execute Pipeline Activity: It allows you to call Azure Data Factory pipelines.

3- Filter Activity: It allows you to apply different filters on your input dataset.

4- For Each Activity: It provides the functionality of a for each loop that
executes for multiple iterations.

5- Get Metadata Activity: It is used to get the metadata of files/folders. You need to provide the type of metadata you require: childItems, columnCount, contentMD5, exists, itemName, itemType, lastModified, size, structure, created, etc.

6- If Condition Activity: It provides the same functionality as an if statement: it executes a set of activities based on whether the condition evaluates to true or false.

7- Lookup Activity: It reads and returns the content of data sources such as files, tables, or databases. It can also return the result set of a query or stored procedure.

8- Set Variable Activity: It is used to set the value to a variable of type String,
Array, etc.

9- Switch Activity: It is a switch statement that executes a set of activities based on matching cases.

10- Until Activity: It is the same as a do-until loop. It executes a set of activities until the condition evaluates to true.

11- Validation Activity: It is used to validate the input dataset.

12- Wait Activity: It just waits for the given interval of time before moving
ahead to the next activity. You can specify the number of seconds.

13- Web Activity: It is used to make a call to REST APIs. You can use it for
different use cases such as ADF pipeline execution.

14- Webhook Activity: It is used to call endpoint URLs to start/stop the execution of pipelines. You can also call external URLs.

Azure Data Factory Interview Questions and Answers


Azure Data Factory is a cloud-based Microsoft tool that collects raw business data and further
transforms it into usable information. It is a data integration ETL (extract, transform, and
load) service that automates the transformation of the given raw data. This Azure Data Factory
Interview Questions blog includes the most-probable questions asked during Azure job interviews.
Following are the questions that you must prepare for:

Q1. Why do we need Azure Data Factory?


Q2. What is Azure Data Factory?
Q3. What is the integration runtime?
Q4. What is the limit on the number of integration runtimes?
Q5. What is the difference between Azure Data Lake and Azure Data Warehouse?
Q6. What is blob storage in Azure?
Q7. What is the difference between Azure Data Lake Store and Blob storage?
Q8. What are the steps for creating ETL process in Azure Data Factory?
Q9. What is the difference between HDInsight and Azure Data Lake Analytics?
Q10. What are the top-level concepts of Azure Data Factory?

These Azure Data Factory interview questions are classified into the following parts:
1. Basic

2. Intermediate

3. Advanced

Check out this video on Azure Data Factory Tutorial by Intellipaat:


Basic Interview Questions

1. Why do we need Azure Data Factory?

 The amount of data generated these days is huge, and this data comes from different
sources. When we move this particular data to the cloud, a few things need to be taken care
of.
 Data can be in any form, as it comes from different sources. These sources will transfer or
channel the data in different ways. They will be in different formats. When we bring this data to
the cloud or particular storage, we need to make sure it is well managed, i.e., you need to
transform the data and delete unnecessary parts. As far as moving the data is concerned, we
need to make sure that data is picked from different sources, brought to one common place,
and stored. If required, we should transform it into something more meaningful.
 This can be done by a traditional data warehouse, but there are certain disadvantages.
Sometimes we are forced to go ahead and have custom applications that deal with all these
processes individually, which is time-consuming, and integrating all these sources is a huge
pain. We need to figure out a way to automate this process or create proper workflows.
 Data Factory helps to orchestrate this complete process in a more manageable or organizable
manner.

Aspiring to become a data analytics professional? Enroll in Data Analytics Courses in Bangalore and learn from the best.

2. What is Azure Data Factory?

It is a cloud-based integration service that allows the creation of data-driven workflows in the cloud
for orchestrating and automating data movement and transformation.

 Using Azure Data Factory, you can create and schedule data-driven workflows (called
pipelines) that ingest data from disparate data stores.
 It can process and transform data using compute services such as HDInsight, Hadoop, Spark, Azure Data Lake Analytics, and Azure Machine Learning.

Want to learn big data? Enroll in this Big Data Hadoop Course in Bangalore taught by
industry experts.

3. What is the integration runtime?


 The integration runtime is the compute infrastructure that Azure Data Factory uses to provide data integration capabilities across various network environments.
 Three Types of Integration Runtimes:
o Azure Integration Runtime: Azure integration runtime (IR) can copy data between cloud
data stores and dispatch the activity to a variety of computing services, such as Azure
HDInsight or SQL Server, where the transformation takes place.
o Self-Hosted Integration Runtime: A self-hosted integration runtime is software with essentially the same code as the Azure integration runtime, but you install it on an on-premises machine or a virtual machine in a virtual network. A self-hosted IR can run copy
activities between a public cloud data store and a data store on a private network. It can
also dispatch transformation activities against compute resources on a private network.
We use self-hosted IR because the Data Factory will not be able to directly access
primitive data sources because they sit behind a firewall. It is sometimes possible to
establish a direct connection between Azure and on-premises data sources by configuring
the Azure Firewall in a specific way. If we do that, we don’t need to use a self-hosted IR.
o Azure-SSIS Integration Runtime: With SSIS integration runtime, you can natively execute
SSIS packages in a managed environment. So when we lift and shift the SSIS packages
to the Data Factory, we use Azure SSIS IR.

Learn more about the concept by reading the blog post regarding SSIS by Intellipaat.

4. What is the limit on the number of integration runtimes?

There is no hard limit on the number of integration runtime instances you can have in a data
factory. There is, however, a limit on the number of VM cores that the integration runtime can use
per subscription for SSIS package execution.


5. What is the difference between Azure Data Lake and Azure Data
Warehouse?
The data warehouse is a traditional way of storing data that is still widely used. A data lake is complementary to a data warehouse: data in a data lake can be moved into the data warehouse, but you have to follow specific rules.
DATA LAKE:
 Complementary to the data warehouse.
 Data is either detailed or raw; it can be in any particular form. You need to take the data and put it in your data lake.
 Schema on read (not structured; you can define your schema in any number of ways).
 One language to process data of any format (U-SQL).

DATA WAREHOUSE:
 May be sourced to the data lake.
 Data is filtered, summarized, and refined.
 Schema on write (data is written in structured form or a particular schema).
 It uses SQL.
Preparing for the Azure Certification exam? Join our Azure Training in Bangalore!

Intermediate Interview Questions

6. What is blob storage in Azure?

Azure Blob Storage is a service for storing large amounts of unstructured object data, such as text
or binary data. You can use Blob Storage to expose data publicly to the world or to store
application data privately. Common uses of Blob Storage are as follows:

 Serving images or documents directly to a browser


 Storing files for distributed access
 Streaming video and audio
 Storing data for backup and restore disaster recovery, and archiving
 Storing data for analysis by an on-premises or Azure-hosted service

7. What is the difference between Azure Data Lake store and Blob storage?

Azure Data Lake Storage Gen1:
 Purpose: Optimized storage for big data analytics workloads.
 Structure: Hierarchical file system.
 Key Concepts: A Data Lake Storage Gen1 account contains folders, which in turn contain data stored as files.
 Use Cases: Batch, interactive, and streaming analytics, and machine learning data such as log files, IoT data, clickstreams, and large datasets.
 Server-Side API: WebHDFS-compatible REST API.
 Data Operations – Authentication: Based on Azure Active Directory identities.

Azure Blob Storage:
 Purpose: General-purpose object store for a wide variety of storage scenarios, including big data analytics.
 Structure: Object store with a flat namespace.
 Key Concepts: A storage account has containers, which in turn hold data in the form of blobs.
 Use Cases: Any type of text or binary data, such as application back ends, backup data, media storage for streaming, and general-purpose data. Additionally, full support for analytics workloads: batch, interactive, and streaming analytics, and machine learning data such as log files, IoT data, clickstreams, and large datasets.
 Server-Side API: Azure Blob Storage REST API.
 Data Operations – Authentication: Based on shared secrets – account access keys and shared access signature keys.
To learn more about big data, check out this Big Data Course offered by Intellipaat.

8. What are the steps for creating ETL process in Azure Data
Factory?

Suppose we want to extract data from an Azure SQL Server database; anything that has to be
processed is processed and then stored in Data Lake Storage. The steps are listed below, followed
by a minimal pipeline sketch.

Steps for Creating ETL

 Create a linked service for the source data store, which is the SQL Server database
 Assume that we have a cars dataset
 Create a linked service for the destination data store, which is Azure Data Lake Storage (ADLS)
 Create datasets for the source data and for the data to be saved
 Create the pipeline and add a Copy activity
 Schedule the pipeline by adding a trigger
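To make the steps concrete, here is a minimal, hedged sketch of the pipeline from step 5 with a single Copy activity. The dataset names CarsSqlDataset and CarsAdlsDataset are hypothetical placeholders for the datasets created in the earlier steps, and the source/sink types shown assume an Azure SQL source and a Parquet sink in ADLS:

  {
    "name": "CopyCarsToDataLake",
    "properties": {
      "activities": [
        {
          "name": "CopyCarsData",
          "type": "Copy",
          "inputs": [ { "referenceName": "CarsSqlDataset", "type": "DatasetReference" } ],
          "outputs": [ { "referenceName": "CarsAdlsDataset", "type": "DatasetReference" } ],
          "typeProperties": {
            "source": { "type": "AzureSqlSource" },
            "sink": { "type": "ParquetSink" }
          }
        }
      ]
    }
  }

The Copy activity reads through the source linked service and writes Parquet files into ADLS; the trigger added in the last step then runs this pipeline on a schedule.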
9. What is the difference between HDInsight and Azure Data Lake
Analytics?
HDInsight vs. Azure Data Lake Analytics
 HDInsight: To process a dataset, we first have to configure a cluster with predefined nodes, and then we use a language such as Pig or Hive to process the data. Azure Data Lake Analytics: It is all about submitting queries written to process data; the service creates the necessary compute nodes on demand, per our instructions, and processes the dataset.
 HDInsight: Because we configure the cluster ourselves, we can create and control it as we want, and all Hadoop subprojects, such as Spark and Kafka, can be used without limitation. Azure Data Lake Analytics: It does not give much flexibility in provisioning the cluster; Microsoft Azure takes care of that. We don't need to worry about cluster creation, and nodes are assigned based on the instructions we pass. In addition, we can use U-SQL, taking advantage of .NET, to process data.

10. What are the top-level concepts of Azure Data Factory?

 Pipeline: It acts as a carrier in which various processes take place. An individual process is
an activity.
 Activities: Activities represent the processing steps in a pipeline. A pipeline can have one or
multiple activities. It can be anything, i.e., a process like querying a data set or moving the
dataset from one source to another.
 Datasets: In simple words, a dataset is a data structure that holds, or points to, our data.
 Linked Services: These store the connection information needed to connect to an external source.

For example, consider a SQL Server database: you need a connection string to connect to this
external source, and you need to specify both the source and the destination of your data.
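As a small, hedged illustration (the linked service name and the connection string placeholders are assumptions, not values from this article), a linked service for an Azure SQL Database stores exactly that connection information:

  {
    "name": "AzureSqlLinkedService",
    "properties": {
      "type": "AzureSqlDatabase",
      "typeProperties": {
        "connectionString": "Server=tcp:<server>.database.windows.net,1433;Database=<database>;"
      }
    }
  }

Datasets then reference the linked service by name, so connection details are defined once and reused across pipelines.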
Advanced Interview Questions

11. How can I schedule a pipeline?

 You can use the schedule trigger or the tumbling window trigger to schedule a pipeline.
 The trigger uses a wall-clock calendar schedule, which can schedule pipelines periodically or
in calendar-based recurrent patterns (for example, on Mondays at 6:00 PM and Thursdays at
9:00 PM).
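For example, a schedule trigger that runs a pipeline every Monday at 6:00 PM could be sketched as below; the trigger and pipeline names are hypothetical, and the start time is an arbitrary assumption:

  {
    "name": "MondayEveningTrigger",
    "properties": {
      "type": "ScheduleTrigger",
      "typeProperties": {
        "recurrence": {
          "frequency": "Week",
          "interval": 1,
          "startTime": "2024-01-01T18:00:00Z",
          "timeZone": "UTC",
          "schedule": { "weekDays": [ "Monday" ], "hours": [ 18 ], "minutes": [ 0 ] }
        }
      },
      "pipelines": [
        { "pipelineReference": { "referenceName": "CopyCarsToDataLake", "type": "PipelineReference" } }
      ]
    }
  }

Remember that triggers only take effect once they are published to the factory; they do not fire from the Debug environment.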

12. Can I pass parameters to a pipeline run?


 Yes, parameters are a first-class, top-level concept in Data Factory.
 You can define parameters at the pipeline level and pass arguments as you execute the
pipeline run on demand or by using a trigger.

Are you looking to learn about Azure? Check out our blog on Azure Tutorial!

13. Can I define default values for the pipeline parameters?

You can define default values for the parameters in the pipelines.
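A minimal sketch, assuming a hypothetical SourceFolder parameter, shows both ideas: the parameter is declared with a default value at the pipeline level, and any activity can read it with an expression.

  {
    "name": "LoadSalesPipeline",
    "properties": {
      "parameters": {
        "SourceFolder": { "type": "String", "defaultValue": "incoming/sales" }
      },
      "activities": [ ]
    }
  }

Inside the pipeline, an activity references the value as @pipeline().parameters.SourceFolder; if the caller or trigger supplies no argument, the default value is used.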

14. Can an activity in a pipeline consume arguments that are passed to a pipeline run?

In a pipeline, an activity can indeed consume arguments that are passed to a pipeline run.
Arguments serve as input values that can be provided when triggering or scheduling a pipeline
run. These arguments can be used by activities within the pipeline to customize their behavior or
perform specific tasks based on the provided values. This flexibility allows for dynamic and
parameterized execution of pipeline activities, enhancing the versatility and adaptability of the
pipeline workflow.

Each activity within the pipeline can consume the parameter value that’s passed to the pipeline
and run with the @parameter construct.

15. Can an activity’s output property be consumed in another activity?

An activity output can be consumed in a subsequent activity with the @activity construct.

16. How do I handle null values in an activity output?


You can use the @coalesce construct in the expressions to handle the null values.
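For instance, assuming a Lookup activity named LookupConfig and a column named LastLoadDate (both hypothetical), the following expression reads the activity output and falls back to a fixed date when the value is null:

  @coalesce(activity('LookupConfig').output.firstRow.LastLoadDate, '1900-01-01')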

Check out Intellipaat’s Azure Training and get a head start in your career now!

17. Which Data Factory version do I use to create data flows?

Use Data Factory version 2 to create data flows.

18. What has changed from private preview to limited public preview in regard to data flows?

 You will no longer have to bring your own Azure Databricks clusters.
 Data Factory will manage cluster creation and teardown.
 Blob datasets and Azure Data Lake Storage Gen2 datasets are separated into delimited text
and Apache Parquet datasets.
 You can still use Data Lake Storage Gen2 and Blob Storage to store those files. Use the
appropriate linked service for those storage engines.

19. How do I access data using the other 80 dataset types in Data
Factory?

 The mapping data flow feature currently allows Azure SQL Database, Azure SQL Data
Warehouse, delimited text files from Azure Blob Storage or Azure Data Lake Storage Gen2,
and Parquet files from Blob Storage or Data Lake Storage Gen2 natively for source and sink.
 Use the copy activity to stage data from any of the other connectors, and then execute a Data
Flow activity to transform the data after it’s been staged. For example, your pipeline will first
copy into Blob Storage, and then a Data Flow activity will use a dataset in the source to
transform that data.
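A rough sketch of that staging pattern, assuming hypothetical resource names and a REST connector as the "other" source, chains a Copy activity and a Data flow activity:

  {
    "name": "StageAndTransform",
    "properties": {
      "activities": [
        {
          "name": "StageToBlob",
          "type": "Copy",
          "inputs": [ { "referenceName": "RestSourceDataset", "type": "DatasetReference" } ],
          "outputs": [ { "referenceName": "StagedBlobDataset", "type": "DatasetReference" } ],
          "typeProperties": {
            "source": { "type": "RestSource" },
            "sink": { "type": "DelimitedTextSink" }
          }
        },
        {
          "name": "TransformStaged",
          "type": "ExecuteDataFlow",
          "dependsOn": [ { "activity": "StageToBlob", "dependencyConditions": [ "Succeeded" ] } ],
          "typeProperties": {
            "dataFlow": { "referenceName": "TransformStagedData", "type": "DataFlowReference" }
          }
        }
      ]
    }
  }

The Data flow activity runs only if the staging copy succeeds, and its source transformation reads the delimited text dataset the Copy activity wrote to Blob Storage.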

Learn about various certifications in Azure in our in-depth blog on Microsoft Azure
Certification.

20. Explain the two levels of security in ADLS Gen2.

The two levels of security applicable to ADLS Gen2 were also in effect for ADLS Gen1. Even
though this is not new, it is worth calling out the two levels of security because they are a
fundamental piece of getting started with the data lake, and they confuse many people at the start.

 Role-Based Access Control (RBAC). RBAC includes built-in Azure roles such as reader,
contributor, owner, or custom roles. Typically, RBAC is assigned for two reasons. One is to
specify who can manage the service itself (i.e., update settings and properties for the storage
account). Another reason is to permit the use of built-in data explorer tools, which require
reader permissions.
 Access Control Lists (ACLs). Access control lists specify exactly which data objects a user
may read, write, or execute (execute is required to browse the directory structure). ACLs are
POSIX-compliant, thus familiar to those with a Unix or Linux background.
POSIX does not operate on a security inheritance model, which means that access ACLs are
specified for every object. The concept of default ACLs is critical for new files within a directory to
obtain the correct security settings, but it should not be thought of as inheritance. Because of
the overhead of assigning ACLs to every object, and because there is a limit of 32 ACLs per
object, it is extremely important to manage data-level security in ADLS Gen1 or Gen2 via Azure
Active Directory groups.

Azure Data Factory Cheat Sheet


December 4, 2021 by Deepak Goyal
Whether you are preparing for an Azure data engineer interview or are new to the Azure data
engineer role, at times you might find it difficult to remember all the jargon and acronyms used
in ADF. I have compiled the ADF concepts and every component used within Azure Data
Factory. You can download this useful cheat sheet to use as a reference for your interview or
your day-to-day work. Let's look at these concepts without spending too much time.

Download Azure Data Factory Cheat Sheet
You can also bookmark and read through this page.
Pipeline: A data integration workload unit in Azure Data Factory. A logical grouping of
activities assembled to execute a particular data integration process.
• Activity: Performs a task inside a pipeline, for example, copying data from one place to
another.
 • Dataset: Contains metadata describing a specific set of data held in an external storage
system. Pipeline activities use datasets to interact with external data.
 • Linked service: Represents a connection to an external storage system or external compute resource.
• Integration runtime: Provides access to internal compute resource inside Azure Data Factory. ADF has no internal storage resources.
• Debug: You can run a pipeline interactively from the ADF UX using “Debug” mode. This
means that the pipeline definition from the ADF UX session is executed – it does not need
to be published to the connected factory instance. During a debugging run, a pipeline
treats external resources in exactly the same way as in published pipeline runs.
• Copy Data tool: A wizard-style experience in the ADF UX that creates a pipeline to copy
data from one place to another, but in practice you are unlikely to use the tool very often.
• Azure Storage: Microsoft’s cloud-based managed storage platform.
• Storage account: A storage account is created in order to use Azure Storage services.
• Storage key: Storage keys are tokens used to authorize access to a storage account.
You can manage an account’s keys in the Azure portal.
• Blob storage: General-purpose file (blob) storage, one of the types of storage offered by
Azure Storage. Other supported storage types (not described here) include file shares,
queues, and tables.
• Container: Files in blob storage are stored in containers, subdivisions of a storage
account’s blob storage. Blob storage is divided into containers at the root level only – they
cannot be nested.
• Azure Storage Explorer: An app used to manage Azure Storage accounts, available
online and as a desktop application.
• Bandwidth: A term used by Microsoft to describe the movement of data into and out of
Azure data centers. Outbound data movements incur a fee, sometimes referred to as an
egress charge.
• Unstructured file: A file treated as having no internal data structure – a blob. The Copy
data activity treats files as unstructured when a binary copy is specified.
• Structured file: A file with a tabular data structure such as CSV or Parquet.
 • Parquet file: A column-oriented, compressed structured file format supporting efficient
storage and querying of large volumes of data.
 • Semi-structured file: A file with a nontabular, frequently nested data structure, such as XML or JSON.
• Collection reference: Nested data structures can represent multiple collections of data simultaneously. In a Copy data activity schema mapping, the collection reference indicates which of the collections is being transformed.
• Sink: Azure Data Factory refers to data pipeline destinations as sinks.
• Interim data type: The Copy data activity converts incoming data values from their
source types to interim ADF data types, then converts them to the corresponding sink system
types. This makes it easier and faster to extend ADF to support new datasets.
• Data integration unit (DIU): A DIU is a measure of computing power incorporating CPU,
memory, and network usage. Power is allocated to Copy data activity executions as a
number of DIUs; the cost of an execution is determined by the duration for which it was
allocated those DIUs.
 • Degree of parallelism (DoP): A Copy data activity can be performed in parallel using
multiple threads to read different files simultaneously. The maximum number of threads
used during an activity’s execution is its degree of parallelism; the number can be set
manually for the activity, but this is not advised.
 • Azure SQL DB: Azure-based, PaaS SQL Server service.
• Logical SQL Server: Logical grouping of Azure SQL Databases for collective
management.
• Online query editor: Web-based query editor available for use with Azure SQL DB (and
other Azure database platforms)
• Expression: An expression is evaluated at pipeline execution time to determine a
property value. The data type of an expression is string, integer, float, boolean, array, or
dictionary.
• Array: A collection of multiple values referred to as elements. Elements are addressed by
an integer index between zero and one less than the array’s length.
 • Dictionary: A collection whose elements are referred to by name.
• Expression builder: An expression editor built into the ADF UX.
• System variable: System variables provide access to the runtime values of various
system properties.
 • User variable: Created to store String, Boolean, or Array values during a pipeline’s
execution.
• Expression function: One of a library of functions available for use in expressions.
Function types include String, Math, Logical, Date, Collection, and Type conversions.
 • Interpolated string: A string literal containing placeholder expressions.
• Placeholder expression: An expression embedded in an interpolated string, evaluated
at runtime to return a string.
• Escape: String literals beginning with the @ character must be escaped to prevent their
interpretation as expressions. @ is escaped by following it with a second @ character.
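A few hedged examples to tie these definitions together (the values shown are illustrative, not from the cheat sheet itself):

  @concat('Run ', pipeline().RunId)   (an expression built from functions and a system variable)
  Factory @{pipeline().DataFactory} started at @{utcnow()}   (an interpolated string with two placeholder expressions)
  @@utcnow() is plain text   (the doubled @@ escapes the @, so the whole value is a literal string)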
• Stored procedure activity: ADF activity that enables the execution of a database stored
procedure, specifying values for the stored procedure’s parameters as required.
• Lookup activity: ADF activity that returns one or more rows from a dataset for use
during a pipeline’s execution. In the case of SQL Server datasets, rows can be returned
from tables, views, inline queries, or stored procedures.
 • Set variable activity: ADF activity used to update the value of a user variable.
• Append variable activity: ADF activity used to add a new element to the end of an
existing Array variable’s value.
• Activity dependency: Constraint used to control the order in which a pipeline’s activities
are executed.
• Activity output object: JSON object produced by the execution of an ADF activity. An
output object and its properties are available to any activity dependent on the source
activity, either directly or indirectly.
• Breakpoint: An ADF UX breakpoint allows you to run a pipeline, in debug mode, up to
and including the activity on which the breakpoint is set (breakpoints do not exist in
published pipelines). Unlike in other IDEs, it is not possible to resume execution after
hitting a breakpoint.
 • $$FILEPATH: A reserved system variable that enables the Copy data activity to label
incoming file data with its source file. $$FILEPATH is available solely to populate additional
columns in the Copy data activity and cannot be used in expressions.
• $$COLUMN: A reserved system variable that enables the Copy data activity to duplicate
a specified column in incoming data. $$COLUMN is available solely to populate additional
columns in the Copy data activity and cannot be used in expressions.
• Additional columns: A Copy data activity source can be augmented with additional
columns, the values of which are specified by an expression, $$FILEPATH, $$COLUMN,
or a hard-coded static value.
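As an illustration, a Copy data activity source could be configured with additional columns like the following sketch; the column names and the static value are assumptions:

  "source": {
    "type": "DelimitedTextSource",
    "additionalColumns": [
      { "name": "SourceFile", "value": "$$FILEPATH" },
      { "name": "LoadSource", "value": "nightly-batch" }
    ]
  }

Every row copied then carries its source file path and a static label, which supports the lineage tracking described next.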
• Lineage tracking: The practice of labeling data as it is processed, to enable later
identification of information related to its source and/or processing.
• Runtime parameter: A placeholder in a factory resource definition for a value substituted
at runtime. Pipeline, dataset, and linked service parameters are runtime parameters; global
parameters are not.
• Optional parameter: A runtime parameter can be made optional by defining a default
value for it. If a value is not supplied at runtime, the default value is substituted for the
parameter instead.
• Reusability: Runtime parameters enable factory resources to be defined in a reusable
way, using different parameter values to vary resource behavior.
• Global parameter: A constant value, shared by all pipelines, not expected to change
frequently. Global parameters are referred to in ADF expressions, within the scope of
pipeline activities, using the syntax pipeline().globalParameters.ParameterName.
• Pipeline parameter: Runtime parameter for an ADF pipeline. Pipeline parameters are
referred to in ADF expressions, within the scope of pipeline activities, using the syntax
pipeline().parameters.ParameterName.
• Dataset parameter: Runtime parameter for an ADF dataset. Dataset parameters are
referred to in ADF expressions, within the scope of the dataset, using the syntax
dataset().ParameterName.
• Linked service parameter: Runtime parameter for an ADF linked service. Linked service
parameters are referred to in ADF expressions, within the scope of the linked service,
using the syntax linkedService().ParameterName. The ADF UX does not always support
parameter management for linked services, but parameters can be defined and used by
editing a linked service’s JSON definition.
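For example, a delimited text dataset could expose a hypothetical FolderName parameter and use it in its file location; the names and container below are assumptions:

  {
    "name": "ParameterizedBlobDataset",
    "properties": {
      "type": "DelimitedText",
      "linkedServiceName": { "referenceName": "BlobStorageLinkedService", "type": "LinkedServiceReference" },
      "parameters": { "FolderName": { "type": "String" } },
      "typeProperties": {
        "location": {
          "type": "AzureBlobStorageLocation",
          "container": "adftutorial",
          "folderPath": { "value": "@dataset().FolderName", "type": "Expression" }
        }
      }
    }
  }

An activity that uses this dataset supplies FolderName at runtime, so a single dataset definition can serve many folders.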
• Execute Pipeline activity: ADF pipeline activity used to execute another pipeline within
the same data factory instance.
• Azure Key Vault: A secure repository for secrets and cryptographic keys.
 • Secret: A name/value pair stored in an Azure Key Vault. The value usually contains
sensitive information such as service authentication credentials. A secure way to handle
this information is to refer to the secret by name – a service that requires the secret’s value
may retrieve it from the vault by name if permitted to do so.
 • Service principal: An identity created for use with an application or service – such as
Azure Data Factory – enabling the service to be identified when external resources require
authentication and authorization.
• Managed identity: A managed identity associates a service principal with an instance of
Azure Data Factory (or other Azure resources). A system-assigned managed identity is
created automatically for new ADF instances created in the Azure portal and is
automatically removed if and when its factory is deleted.
• Access policy: A key vault’s access policy defines which service or user principals are
permitted to access data stored in the vault.
• Dependency condition: Characterizes an activity dependency. An activity dependent on
another is only executed if the associated dependency condition is met – that is,
depending on whether the prior activity succeeds, fails, or is skipped.
• Multiple dependencies: An activity dependent on multiple activities is only executed if
each prior activity satisfies a dependency condition. If an activity specifies multiple
dependency conditions on the same prior activity, only one needs to be met.
• Leaf activity: A leaf activity is a pipeline activity with no successors.
• Conditional activities: ADF has two conditional activities – the If Condition activity and
the Switch activity.
 • If Condition activity: Specifies an expression that evaluates to true or false and two
corresponding contained sets of activities. When the activity runs, its expression is
evaluated, and the corresponding activity set is executed.
• Switch activity: Specifies an expression that evaluates to a string value and up to 25
corresponding contained sets of activities. When the activity runs, the expression is
evaluated, and the corresponding activity set, if any, is executed. If no matching case is
found, the default activity set is executed instead.
 • Iteration activities: ADF has two iteration activities – the ForEach activity and the Until
activity.
 • ForEach activity: Specifies a JSON array and a set of activities to be executed once for
each element of the array. The current element of the array is addressed in each iteration
using the item() expression.
 • Parallelism: By default, ForEach activity executions take place in parallel, requiring care
to ensure that activities from simultaneous iterations do not interfere with one another.
Variable modification must be avoided, but the Execute Pipeline activity provides an easy
way to isolate iterations. The ADF UX executes ForEach iterations sequentially in Debug
mode, which can make parallelism faults hard to detect at development time.
• Until activity: Specifies a terminating condition and a set of activities to be executed
repeatedly until the terminating condition is met. Activities within an Until activity are
executed at least once and never in parallel.
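A hedged sketch of a ForEach activity (a fragment of a pipeline's activities array) that isolates each iteration in an Execute Pipeline activity, as the Parallelism entry above suggests; the activity, pipeline, and column names are hypothetical:

  {
    "name": "ForEachTable",
    "type": "ForEach",
    "typeProperties": {
      "isSequential": false,
      "items": { "value": "@activity('LookupTableList').output.value", "type": "Expression" },
      "activities": [
        {
          "name": "ProcessOneTable",
          "type": "ExecutePipeline",
          "typeProperties": {
            "pipeline": { "referenceName": "ProcessSingleTable", "type": "PipelineReference" },
            "parameters": { "TableName": { "value": "@item().TableName", "type": "Expression" } },
            "waitOnCompletion": true
          }
        }
      ]
    }
  }

Each parallel iteration passes the current element's TableName into its own pipeline run, so iterations cannot interfere with shared variables.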
• Nesting: Iteration activities may not be nested in other iteration activities. Conditional
activities may not be nested in other conditional activities, although nesting in iteration
activities is permitted. A common workaround is to implement inner and outer activities in
separate pipelines, calling the inner activity from the outer via the Execute Pipeline activity.
• Breakpoints: The ADF UX does not support breakpoints inside iteration or conditional
activities.
• Get Metadata activity: Returns metadata that describes attributes of an ADF dataset.
Not all metadata attributes are supported by all datasets – nonexistent dataset targets will
cause errors unless the “Exists” argument is specified; specifying the “Child Items”
argument on a nonfolder target will cause the activity to fail.
• Fault tolerance: The Copy data activity supports enhanced error handling through its
Fault tolerance settings, enabling individual error rows to be diverted into an external log
file without completely abandoning a data load.
• Raising errors: At the time of writing, ADF has no “raise error” activity. Approaches
available to manufacture errors include defining illegal type casts in Set variable activities
or exploiting error raising functionality in external services, for example, by using SQL
Server’s RAISERROR or THROW statements.
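For example, one common workaround (not an official feature) is a Set variable activity whose value is an expression that deliberately fails to cast, such as:

  @int('force failure: upstream validation did not pass')

The conversion fails at runtime, the activity fails, and dependency conditions on Failed can then route the pipeline into its error-handling branch.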
•Apache Spark: Open source data processing engine that automatically distributes
processing workloads across a cluster of servers (referred to as nodes) to enable highly
parallelized execution.
• Databricks: Data processing and analytics platform built on Apache Spark and adding a
variety of enterprise features.
• Data flows: ADF’s visual data transformation tool, built on Azure Databricks.
• Data flow debug: Data flow debug mode provisions a Databricks cluster on which you
can execute data flows from the ADF UX.
• Time to live (TTL): The data flow debug cluster has a default TTL of one hour, after
which – if it is not being used – it automatically shuts down.
 • Data flow activity: ADF pipeline activity used to execute a data flow.
 • Parameters: Data flow parameters are specified in Debug Settings during development
and substituted for values supplied by the calling Data flow activity at runtime.
• Data flow canvas: Visual development environment for data flows.
• Transformation: A data flow is made up of a sequence of connected transformations,
each of which modifies a data stream in some way.
• Output stream name: Name that uniquely identifies each transformation in a data flow.
• Inspect tab: Use a transformation’s Inspect tab to view input and output schema
information.

• Data preview tab: Use a transformation’s Data preview tab to preview data emitted by
the transformation. Using data preview requires data flow debug to be enabled and can be
used to delay an approaching cluster timeout.
• Optimize tab: Use a transformation’s Optimize tab to influence data partitioning in Spark
when the transformation is executed.
• Source transformation: Reads input data from an external source. Every data flow
starts with one or more Source transformations.
• Sink transformation: Writes transformed data to an external destination. Every data flow
ends with one or more Sink transformations.
• Data flow expression language: Data flow expressions have their own language and
expression builder, different from those of ADF pipeline expressions.
• Data Flow Script: Language in which data flow transformations are stored, embedded in
a data flow’s JSON file.
• Column patterns: Where supported, use column patterns to specify multiple columns
which are to be handled in the same way. Columns are specified using data flow
expressions to match column metadata.
• Filter transformation: Selects rows from its input data stream to be included in its output
stream, on the basis of criteria specified as a data flow expression. Other rows are
discarded.
• Lookup transformation: Conceptually similar to a SQL join between two data streams.
Supports a variety of join styles and criteria.
• Derived Column transformation: Uses data flow expressions to derive new columns for
inclusion in a data flow.
• Locals: Named intermediate derivations in a Derived Column transformation. Used to
simplify expressions and eliminate redundancy.
• Select transformation: Used to rename columns in a data flow or to remove them.
• Aggregate transformation: Aggregates one or more columns in a data flow, optionally
grouping by other specified columns.
• Exists transformation: Selects rows from its input data stream to be included in its
output stream, on the basis of the existence (or not) of matching rows in a second data
stream. Other rows are discarded.
• Templates: Reusable implementations of common pipeline and data flow patterns.
• Template gallery: Source of provided templates, accessed using the Create pipeline
from template bubble on the Data Factory overview page.
•External pipeline activity: An ADF pipeline activity executed using compute resource
provided by a service outside Azure Data Factory, for example, Stored procedure,
Databricks, or HDInsight activities.
• Internal pipeline activity: An ADF pipeline activity executed using compute resource
provided internally by Azure Data Factory.
• Integration runtime: Internal compute resource managed by Azure Data Factory.
 • Dispatching: Management of ADF activity execution, particularly external pipeline
activities.
 • Azure IR: A fully managed, serverless integration runtime that executes data
movements and transformations defined by the Copy data and Data flow activities. Azure
IRs also manage dispatching of external activities to storage and compute environments
like Azure blob storage, Azure SQL Database, and others.
• AutoResolveIntegrationRuntime: An Azure IR present in every data factory instance.
The location of IR compute is determined automatically at runtime, and Databricks clusters
created for Data flow activities using this IR have a TTL of zero – you can modify these
characteristics by creating and using your own Azure IR.
• Self-hosted integration runtime: An IR installed on one or more servers provided by
you in a private network. A self-hosted IR permits you to expose private resources to ADF,
for example, source systems which are hosted on-premises or for which no native ADF
connector is available.
• Linked self-hosted IR: A self-hosted IR is connected to exactly one Azure Data Factory
instance. A common pattern used to enable access in other data factories is to share it,
enabling other data factories to create linked self-hosted IRs that refer to the shared
self-hosted IR.
• Azure-SSIS IR: A fully managed integration runtime that supports SSIS package
execution in ADF. An Azure-SSIS IR consists of a VM cluster of a specified size and
power – although the cluster is managed for you, its infrastructure is more visible than in
serverless Azure IRs.
 • Web activity: ADF pipeline activity supporting calls to REST API endpoints
•Power Query: Graphical data preparation tool available in a number of Microsoft
products, including ADF Power Query activities, Excel, Power Platform dataflows, or
Power BI.
• Data wrangling: Interactive exploration and preparation of datasets.
• Mashup: Data wrangling transformation implemented in Power Query.
 • M formula language: Language used to express transformations built in Power Query.
M expressions built using the graphical Power Query Editor are translated into Data Flow
Script at runtime by Azure Data Factory, for execution in the same way as an ADF data
flow.
• Power Query activity: ADF pipeline activity used to execute Power Query mashups
implemented in the ADF UX.
•Azure Resource Manager (ARM) template: An ARM template is a JSON file that defines
components of an Azure solution, for example, the contents of an Azure Data Factory
instance.
 • Publish: To run a pipeline independently of the ADF UX, it must be deployed into a data
factory’s published environment. Published pipelines are executed using triggers, and
published pipeline runs can be observed in the ADF UX monitoring experience.
• Publish branch: A nominated branch in a factory’s Git repository, by default adf_publish.
The publish branch contains ARM templates produced when publishing factory resources
in the ADF UX.
 • Azure custom role: A custom security role, built by assembling a required set of
permissions, to provide security profiles not supported in the standard Azure role set.
 • Deployment parameters: Specified in an ARM template, deployment parameters
enable different values to be substituted at deployment time. In the case of ADF, this
permits a single template to be used for deployments to multiple different data factories.
 • Parameterization template: A development data factory’s parameterization template
specifies which factory resource properties should be made parameterizable using
deployment parameters.
• CI/CD: Continuous integration and continuous delivery (CI/CD) is a development practice
in which software changes are integrated into the main code base and deployed into
production continuously.
• Azure Pipelines: Microsoft’s cloud-based CI/CD pipeline service.
• Data serialization language: Human-readable language used to represent data
structures in text for storage or transmission. XML, JSON, and YAML are examples of data
serialization languages.
• YAML: Data serialization language with an indentation-based layout, used in Azure
DevOps to define Azure DevOps pipelines. YAML pipeline files are stored under version
control like any other code file.
 • Task: Configurable process used in an Azure DevOps pipeline, for example, script,
AzureResourceManagerTemplateDeployment@3, or AzurePowerShell@4.
• Pipeline variable: Variable defined for use in an Azure DevOps pipeline. Secret
variables allow secret values to be specified in pipeline YAML without storing them in
version control.
• Service connection: Represents a nominated AAD principal with the permissions
required by an Azure DevOps pipeline.
 • Feature branch workflow: A common Git workflow in which development work takes
place in isolated feature branches.
• Pull request: A request to merge a feature branch back into the collaboration branch
when feature development is complete.
• Az.DataFactory: PowerShell module providing cmdlets for interacting with Azure Data
Factory.
•Trigger: A unit of processing that runs one or more ADF pipelines when certain execution
conditions are met. A pipeline can be associated with – and run by – more than one
trigger.
• Trigger run: A single execution of a trigger. If the trigger is associated with multiple
pipelines, one trigger run starts multiple pipeline runs. The ADF UX monitoring experience
reports trigger and pipeline runs separately.
• Trigger start date: The date and time from which a trigger is active.
• Trigger end date: The date and time after which a trigger is no longer active.
 • Recurrence pattern: A simple time-based scheduling model, defined by a repeated
interval after a given start date and time.
• Schedule trigger: A trigger whose execution condition is defined by a recurrence pattern
based on the trigger’s start date or using a wall clock schedule.
• Event-based trigger: A trigger whose execution condition is the creation or deletion of a
file from Azure blob storage.
 • Resource provider: Azure uses resource providers to create and manage resources. In
order to use a resource in an Azure subscription, the corresponding resource provider
must be registered to that subscription.
• Azure Event Grid: Cloud service providing infrastructure for event-driven architectures.
Azure blob storage uses Event Grid to publish file creation and other events; ADF
subscribes to Event Grid to consume events and run event-based triggers.
 • Tumbling window trigger: A trigger that uses a recurrence pattern based on the
trigger’s start date to define a sequence of processing windows between successive
executions. Tumbling window triggers also support more advanced scheduling behaviors
like dependencies, concurrency limits, and retries.
• Pipeline run overlap: Pipeline runs may overlap if a trigger starts a new pipeline run
before a previous one has finished. Use a tumbling window self-dependency with a
concurrency limit of one to prevent this.
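A hedged sketch of a tumbling window trigger with a concurrency limit of one follows; the trigger name, pipeline name, and parameters are assumptions, and a self-dependency can additionally be configured to serialize windows as described above:

  {
    "name": "HourlyWindowTrigger",
    "properties": {
      "type": "TumblingWindowTrigger",
      "typeProperties": {
        "frequency": "Hour",
        "interval": 1,
        "startTime": "2024-01-01T00:00:00Z",
        "maxConcurrency": 1
      },
      "pipeline": {
        "pipelineReference": { "referenceName": "LoadHourlyData", "type": "PipelineReference" },
        "parameters": {
          "WindowStart": "@trigger().outputs.windowStartTime",
          "WindowEnd": "@trigger().outputs.windowEndTime"
        }
      }
    }
  }

The trigger-scoped system variables @trigger().outputs.windowStartTime and @trigger().outputs.windowEndTime pass each window's boundaries into the pipeline run.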
 • Reusable triggers: A single schedule or event-based trigger can be used by multiple
pipelines. A tumbling window trigger can be used by a single pipeline.
• Trigger-scoped system variables: ADF system variables available for use in trigger
definitions. Some trigger-scoped variables are specific to the type of ADF trigger in use.
• Azure Logic Apps: Cloud service for general-purpose task scheduling, orchestration,
and automation. Internally, ADF triggers are implemented using Azure Logic Apps.
• Trigger publishing: Triggers do not operate in the ADF UX debugging environment and
must be published to a factory instance to have any effect.
• Pipeline annotation: A label, added to a pipeline, that appears in the log of subsequent
pipeline runs and can be used to filter or group log data. Multiple annotations can be
added to a pipeline.
• Trigger annotation: A label, added to a trigger, providing functionality analogous to a
pipeline annotation.
• Activity user property: A name-value pair, added to a pipeline activity, that appears in
the log of subsequent pipeline runs. Multiple user properties can be added to an activity.
The Copy data activity supports two auto-generated properties that identify its runtime
source and sink.
 • Azure Monitor: Monitoring service used to collect, analyze, and respond to data from
Azure resources.
 • Metric: Automatically maintained count of a given system property over a period of time,
emitted to and logged by Azure Monitor.
• Log Analytics: Azure Monitor component that enables sophisticated analysis of system
logs and metrics.
• Log Analytics workspace: A provisioned, identified Log Analytics instance to which Azure
resource logs and metrics can be sent for analysis and longer-term storage.
 • Diagnostic setting: Per-resource configuration information identifying log data and
metrics to be sent to other storage services, for example, a Log Analytics workspace.
• Kusto: Query language used to interrogate data stored in Log Analytics and Azure Data
Explorer.
• Tabular expression statement: Kusto query expression that returns a result set. Every
Kusto query must contain a tabular expression statement.
• Log Analytics workbook: A notebook-like interface for querying Log Analytics data,
allowing code and text cells to be interleaved to create narrative reports.
• Azure Data Explorer: Analytics service for near real-time, large-scale analysis of raw
data. Like Log Analytics, the service accepts read-only queries written in Kusto.
• Alerts: Azure Monitor supports the raising of alerts in response to configured metric or
custom query output thresholds.
• Alert rule: Information that defines an alert – its scope (what to monitor), its conditions
(when an alert should be raised), and its actions (who to notify and/or what to do when an
alert is raised).
• Signal: Measure used to evaluate an alert condition.
• Action group: Defines a collection of notifications and actions, used to specify the action
component of an alert rule.

ADF (AZURE DATA FACTORY) DAILY ROUTINE EXPRESSIONS / ADF EXPRESSIONS CHEATSHEET
February 18, 2021, by Kloudspro

All keywords, function names, etc. are case-sensitive.

Single quotes are used for strings.

All expressions begin with @.

There is no direct use of operators; all manipulation is done through functions only.

The datetime format is SQL Server's DATETIMEOFFSET format.

Accessing Parameter –

@pipeline().parameters.<parametername>

Accessing Variable e.g. –

@variables('TempCounter')

Lookup Activity First Row –

@activity('Lookup Activity Name').output.firstRow.ColumnName

Lookup Activity Multi Row Output but reading from one particular row
@string(activity('Lookup Activity').output.value[0].ColumnName)

Lookup Activity’s result in an Array –

@activity('Lookup Activity').output.value

Until Activity’s Terminating Expression –

@greater(variables('Counter'),variables('TargetCounter'))

Filtering an Array under Filter Activity –

@equals(int(item().ColumnName), int(variables('Counter')))

If Activity Expression –

@greater(length(activity('Previous Activity Name Returning Array').output.value),0)

For Each Activity, accessing current item –

@item().ColumnName

System variables accessed as e.g. –

@pipeline().RunId

@pipeline().Pipeline

@pipeline().DataFactory

Current Date Time –

@utcnow()

Variable value as HTML –

1 ["<b>Table - Rows Read - Rows Copied - Duration (in seconds) </b>"]

Common Functions –

@formatDateTime(utcnow(), 'yyyy-MM-dd HH:mm:ss')

@concat('a', 'b', 'c')

@coalesce('First Value', 'Second Value')

@empty('Variable or Parameter to be checked for null')

Type Casting –

@int('string to integer conversion')

@json('string to JSON conversion')

@string('Number to string conversion')

Display Array Variable – converting an array to string produces each item separated by
linefeed

@string(variables('Array Type Variable'))
