Azure Data Factory

ADF: Copying Data from Azure SQL to Azure Data Lake Storage (ADLS)

This guide covers how to copy data from Azure SQL to Azure Data Lake, how to visualize the data by loading it into Power BI, and how to build an ETL process by creating a pipeline with Azure Data Factory. You will also be introduced to Azure Data Lake Analytics and to using U-SQL for data processing.

Introduction to Azure Data Factory

The data generated by various applications and products is increasing exponentially day by day. Because this data comes from many different sources, it is very difficult to manage.

To analyze and store all this data, we can use Data Factory, which:

- Stores data with the help of Azure Data Lake Storage
- Analyzes the data
- Transforms the data with the help of pipelines (a logical grouping of activities that together perform a task)
- Publishes the organized data
- Visualizes the data with third-party applications like Apache Spark or Hadoop

Azure Data Factory is a cloud-based data integration service that allows you to create data-driven workflows in the cloud for orchestrating and automating data movement and data transformation.

We can make use of Azure Data Factory to create and schedule data-driven workflows that can ingest data from various data stores. It can process and transform the data by using compute services such as Azure Data Lake Analytics, Azure Machine Learning, and Azure HDInsight Hadoop.

We can publish the output data to data stores such as Azure Data Lake, where Business Intelligence (BI) applications can perform visualization or analytics. For better business decisions, we can organize the raw data into meaningful data stores.

Flow Process of Data Factory

The Data Factory service allows us to create pipelines that move and transform data and then run those pipelines on a specified schedule, which can be daily, hourly, or weekly. The data consumed and produced by workflows is time-sliced, and we can specify the pipeline mode as scheduled or one-time.

Input dataset: It is the data we have within our data store, which needs to be processed and
then passed through a pipeline.

Pipeline: A pipeline operates on data to transform it. The transformation can be as simple as moving data from one place to another.

Data transformation is possible with the help of U-SQL, stored procedures, or Hive.
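
A minimal sketch of what such a transformation step could look like in U-SQL is shown below; the file paths, the student.csv input, and the output folder are assumptions for illustration, not part of an actual pipeline definition.

// Hypothetical U-SQL transformation: read a raw CSV from the Data Lake Store,
// reshape and filter it, and write the result back to a curated folder.
@raw =
    EXTRACT namee string,
            groupp string,
            reg int
    FROM "/raw/student.csv"
    USING Extractors.Csv();

@transformed =
    SELECT namee.ToUpper() AS StudentName,
           groupp AS Department,
           reg AS RegistrationNo
    FROM @raw
    WHERE groupp == "CSE";

OUTPUT @transformed
TO "/curated/cse_students.csv"
USING Outputters.Csv(outputHeader: true);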

Output dataset: It contains data in a structured form, because the data has already been transformed and structured by the pipeline. It is then written to a linked data store such as Azure Data Lake, Blob storage, or SQL.

Linked services: These store the connection information that is needed to reach an external data source or destination.

For example, to connect to a SQL Server database we need a connection string, typically something like Server=tcp:<server>.database.windows.net,1433;Database=<database>;User ID=<user>;Password=<password>. We need to define linked services for both the source and the destination of our data.

Gateway: The Gateway connects our on-premises data to the cloud. We need a client installed
on our on-premises system so that we can connect to the Azure cloud.

Cloud: Our data can be analyzed and visualized with many different analytical tools such as Apache Spark, R, Hadoop, and so on.

In short, we use Data Factory to automate the movement and transformation of data by creating linked services, datasets, and pipelines, and by scheduling those pipelines.

Azure Data Lake

Azure Data Lake is a highly scalable, distributed data store and file system. It lives in the cloud and works with multiple external analytics frameworks such as Hadoop, Apache Spark, and so on.

Data can come in from web, mobile, or social media sources. It is landed in Azure Data Lake Store and then made available to external frameworks like Apache Spark, Hive, etc.

There are two main concepts when it comes to Azure Data Lake Storage: Storage & Analytics.

Storage is effectively unlimited in size; it can hold gigabytes, terabytes, and much more. Azure Data Lake Store accepts a wide variety of data, both structured and unstructured, and it can store very large files.

How does analytics work?

We can monitor and diagnose the data in real time.

Copying Data from Azure SQL to Azure Data Lake


To create and connect to a database in Azure SQL Data Warehouse, we need software known as SSMS (SQL Server Management Studio).

After installing SSMS, open the Microsoft Azure dashboard; the first task is to create a data warehouse.

1. Click on Create a resource
2. Select Databases
3. Select SQL Data Warehouse:
   - Create a new server
   - Configure the performance level
   - Deploy
4. After the successful deployment, click on Go to resource
5. Open SSMS
6. Connect to the server created for the Azure SQL Data Warehouse using SQL authentication (Note: the server name can be found in the overview of the SQL Data Warehouse)
7. Create a table under the created database: right-click the database, select New Query, and run the following query

CREATE TABLE student (namee VARCHAR(20), groupp VARCHAR(20), reg INT);

-- Rows are inserted one per statement; Azure SQL Data Warehouse expects single-row INSERT ... VALUES.
INSERT INTO student (namee, groupp, reg) VALUES ('palani', 'CSE', 123456);
INSERT INTO student (namee, groupp, reg) VALUES ('shyam', 'ECE', 456987);
INSERT INTO student (namee, groupp, reg) VALUES ('hema', 'CIVIL', 789456);
INSERT INTO student (namee, groupp, reg) VALUES ('vicky', 'MCH', 564987);
INSERT INTO student (namee, groupp, reg) VALUES ('jai', 'BBA', 789452);
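
Optionally, a quick check such as the query below confirms that the five rows are in place:

SELECT namee, groupp, reg FROM student;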

8. Go to the Microsoft Azure dashboard and create a new storage account (Azure Data Lake)
9. Go to the Microsoft Azure dashboard and create a Data Factory
10. After the deployment of the Data Factory, click on Go to resource, click on Author and monitor, and then select Overview from the left-side panel
11. Select Copy data
12. Set the source data store to SQL Server and create a new linked service (SQL Server)
13. Select the dataset
14. Set the destination data store to Azure Data Lake and create a new linked service. To get the service principal ID and service principal key, do the following:
   - Go to the Azure dashboard
   - Click on Azure Active Directory
   - Click on App registrations
   - Create a new web application; the application ID that gets generated is the service principal ID
   - Go to Certificates and secrets and create a new client secret; the password that gets generated is the service principal key
   - Go to the Azure Data Lake Store and grant full access to the application created for generating the service principal key

Note: Copying or moving data works in either direction, i.e., the source and destination can be interchanged.

Loading Data from Data Lake to Power BI

We load data from Data Lake into Power BI to perform visualization and analytics.

Microsoft’s Power BI is a cloud-based business analytics service for analyzing and visualizing data. The main processes in Power BI are as follows:

- Connecting to our data
- Shaping the data
- Modeling the data
- Visualizing the data
- Publishing data reports

Let’s now see how to load data from Data Lake into Power BI.

1. Download Power BI Desktop from the following link: https://powerbi.microsoft.com/en-us/desktop/
2. Go to the Microsoft Azure dashboard and create a new Azure Data Lake Store
3. Upload the dataset to the Azure Data Lake Store: click on Data explorer and then upload the dataset file
4. Open Power BI Desktop and connect to the Azure Data Lake Store using the URL provided by Data explorer:
   - Go to the Azure dashboard and open the Data Lake Store we created
   - Click on Data explorer
   - Click on Properties to find the path, and copy the link so it can be used in Power BI
5. In Power BI, click on Get Data
6. Select Azure Data Lake Store as the source
7. Provide the copied path
8. Edit the query to import the data from the binary column:
   - Click on Edit query
   - Select the binary value
   - Click on Transform
   - Use the first row as the header if needed
   - Select the required columns from the dataset
9. Create the visualization

Finally, delete the resource group when you are done, to stop paying for the Data Lake.

Creating a Pipeline Using Data Factory – ETL (Extract, Transform, Load) Solution

We have a database on our Azure SQL Server, and we want to extract some data from it. If anything needs to be processed during extraction, it is processed and then stored in the Data Lake Store.
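
For instance, the extraction step can be a query rather than a whole table, so a light transformation happens while the data is being extracted. The sketch below reuses the student table created earlier; the aliases and the filter are made up purely for illustration.

-- Hypothetical source query: the data is reshaped during extraction,
-- before being written to the Data Lake Store.
SELECT UPPER(namee) AS student_name,
       groupp AS department,
       reg AS registration_no
FROM student
WHERE reg > 100000;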

Power BI has self-service ETL built in. Azure Data Factory Data Flow is a preview feature in Azure Data Factory for visually creating ETL flows.
Steps for Creating ETL

1. Create a linked service for the SQL Server database
2. Create a linked service for the Azure Data Lake Store
3. Create a dataset for data extraction
4. Create a dataset for data saving
5. Create the pipeline and add a copy activity
6. Schedule the pipeline (by adding a trigger)

Introduction to Azure Data Lake Analytics

If we have large data, either structured or unstructured, and we want to process it in a distributed manner without spending time on provisioning data clusters, the best service available is Azure Data Lake Analytics.

The key components of Azure Data Lake are:

- Analytics
- Storage

For analytics, we have HDInsight Analytics and Azure Data Lake Analytics. Azure Data Lake Analytics is a distributed analytics service built on Apache YARN, which makes it similar to Hadoop, since Hadoop also uses YARN for distributed processing.

Distributed processing: Imagine we have a large amount of data and want to process it in a distributed manner to speed things up. We split the data, process each part with the instruction set we have, and finally combine the results. This is distributed processing, and that is what we get from Azure Data Lake Analytics.

With Hadoop, we have to spend considerable time on provisioning, yet we have still been using open-source Hadoop, or HDInsight in Azure, mainly for processing data.

Azure Data Lake Analytics simplifies this. Here, we don’t need to worry about provisioning
clusters. We can simply make a job for processing our dataset and submit it.

We don’t need to worry about the installation, configuration, and management of a big data cluster. Moreover, we are charged per job, i.e., for how many nodes we assign to the job and for how long the job runs; roughly speaking, a job assigned 10 units that runs for two hours is billed for 20 unit-hours.

Although we compare this with Hadoop to some extent, that does not mean it completely replaces the Hadoop ecosystem. In a distributed environment, scalability is a must; there should be no limit on scaling out.

Azure Data Lake Analytics is an on-demand analytics job service that simplifies big data analytics. It gives us the ability to increase or decrease the processing power on a per-job basis. Say we are handling terabytes of data and need answers fast; we can assign more nodes to that particular job and get our insights quickly.

We make use of U-SQL for processing data. This language is based on languages that Microsoft has been using for years: T-SQL and C#. We can express the processing instructions in U-SQL, and there are other ways too; let’s first discuss U-SQL.

U-SQL can be written in Visual Studio, a familiar integrated development environment, and once written, the script can be submitted to Data Lake Analytics as a job. U-SQL can be used against many data sources.
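
As a rough sketch of what such a job looks like (the file paths and the reuse of the student data are assumptions for illustration), a U-SQL script typically extracts data, transforms it with SQL-like expressions and C# functions, and writes the result back to the store:

// Sketch of a U-SQL job submitted to Data Lake Analytics: count students per
// department from a CSV in the Data Lake Store (paths and columns are assumed).
@students =
    EXTRACT namee string,
            groupp string,
            reg int
    FROM "/input/student.csv"
    USING Extractors.Csv();

@summary =
    SELECT groupp AS Department,
           COUNT(*) AS StudentCount
    FROM @students
    GROUP BY groupp;

OUTPUT @summary
TO "/output/students_per_department.csv"
USING Outputters.Csv(outputHeader: true);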

Data Lake Analytics supports many enterprise features like integration, security, compliance,
role-based access control, and auditing. So, we can simply say that it is enterprise grade. If data
is in Azure Data Lake Store, we can expect good performance because Azure Data Lake Analytics
is optimized for working with Azure Data Lake Store.

It is also important to understand the difference between HDInsight and Azure Data Lake Analytics (ADLA): with HDInsight we provision and manage a cluster ourselves, whereas ADLA is an on-demand service where we simply pay per job.

U-SQL works with both structured and unstructured data. We can process images stored in Azure Data Lake Store with the help of image-processing libraries; that is how unstructured data is processed with U-SQL.

When a dataset has no predefined structure, we can define the structure we want and read the data through that definition. In other words, U-SQL supports schema-on-read: the schema is applied at query time instead of being imposed on the stored dataset.
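
For example, in the sketch below (the log file and its columns are entirely made up), the EXTRACT clause is where the read-time schema is declared over a file that has no schema of its own:

// Schema-on-read sketch: the tab-separated file carries no schema; the column
// names and C# types below are imposed only when the data is read.
@events =
    EXTRACT eventTime DateTime,
            userId string,
            action string
    FROM "/logs/clickstream.tsv"
    USING Extractors.Tsv();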
