Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1of 17

https://docs.microsoft.

com/en-us/azure/data-lake-analytics/

https://www.infoq.com/articles/azure-data-lake-analytics-usql/#:~:text=U
%2DSQL%20is%20the%20big,C%23%20rich%20types%20and
%20expressions.

What is Azure Data Lake Analytics?

Azure Data Lake Analytics is an on-demand analytics job service that simplifies big
data. Instead of deploying, configuring, and tuning hardware, you write queries to
transform your data and extract valuable insights. The analytics service can handle
jobs of any scale instantly by setting the dial for how much power you need. You
only pay for your job when it is running, making it cost-effective.

Dynamic scaling

Data Lake Analytics dynamically provisions resource and lets you do analytics on
terabytes to petabytes of data. You pay only for the processing power used. As you
increase or decrease the size of data stored or the amount of compute resources
used, you don’t have to rewrite code.

Develop faster, debug, and optimize smarter using familiar tools

Data Lake Analytics deep integrates with Visual Studio. You can use familiar tools to
run, debug, and tune your code. Visualizations of your U-SQL jobs let you see how
your code runs at scale, so you can easily identify performance bottlenecks and
optimize costs.

U-SQL: simple and familiar, powerful, and extensible

Data Lake Analytics includes U-SQL, a query language that extends the familiar,
simple, declarative nature of SQL with the expressive power of C#. The U-SQL
language uses the same distributed runtime that powers Microsoft's internal
exabyte-scale data lake. SQL and .NET developers can now process and analyze their
data with the skills they already have.

Integrates seamlessly with your IT investments

Data Lake Analytics uses your existing IT investments for identity, management, and
security. This approach simplifies data governance and makes it easy to extend your
current data applications. Data Lake Analytics is integrated with Active Directory for
user management and permissions and comes with built-in monitoring and auditing.
Affordable and cost effective

Data Lake Analytics is a cost-effective solution for running big data workloads. You
pay on a per-job basis when data is processed. No hardware, licenses, or service-
specific support agreements are required. The system automatically scales up or
down as the job starts and completes, so you never pay for more than what you
need. Learn more about controlling costs and saving money.

Works with all your Azure data

Data Lake Analytics works with Azure Data Lake Storage for the highest performance,
throughput, and parallelization and works with Azure Storage blobs, Azure SQL
Database, Azure Synapse Analytics.

In-region data residency

Data Lake Analytics does not move or store customer data out of the region in which
it is deployed.

Key Takeaways

 Azure Data Lake Analytics, along with Azure Data Lake Storage, is a key component
of Microsoft’s Azure Data Lake solution. 

 Currently, Azure Data Lake Analytics can be used for batch workloads only. For
streaming and event processing workloads, alternate big data analytics solutions on
Azure like HDInsight or Azure Databricks should be used.

 Azure Data Lake Analytics introduces a new big data query and processing language
called U-SQL.

 U-SQL combines the concepts and constructs both of SQL and C#; the power of U-
SQL comes from the simplicity and declarative nature of SQL with the programmatic
power of C# including rich types and expressions.

 U-SQL operates on unstructured data stored in files and provides a schematized view
on top of it. It also provides a general metadata catalog system very similar to
relational databases for structured data.

Even though big data and Hadoop technologies are more than a decade old now,
big data and big data analytics are more relevant than ever. While the initial version
of Hadoop was only able to handle batch workloads, now Hadoop ecosystem has
tools for other use cases like structured data, streaming data, event processing,
machine learning workloads and graph processing.

While Hadoop ecosystem has a bunch of tools like Hive, Impala, Pig, Storm,


and Mahout to provide the complete set of features, newer data analytics framework
like Spark have an integrated approach to handle different types of workloads.

Azure Data Lake Analytics, or ADLA, is one of the newer big data analytics engines.
ADLA is Microsoft’s fully managed, on-demand analytics service on Azure cloud.
Together with Azure Data Lake Storage and HDInsight, Azure Data Lake Analytics
forms the complete cloud hosted data lake and analytics offering from Microsoft.
Azure Data Lake Analytics introduces a new big data query and processing language
called U-SQL. This article provides an overview of U-SQL language and how to use it
in applications.

Analytics & Processing Services

 Azure Data Lake Analytics or ADLA– Fully managed, on-demand analytics service on
Azure cloud. Supports new U-SQL big data processing language apart from .NET, R
and Python.

 HDInsight– HDInsight provides managed Hadoop clusters running on Azure and is


based on Hortonworks Data Platform (HDP) Hadoop distro. Supports Hadoop
ecosystem tools including Spark, Hive, Map Reduce, HBase, Storm, and Kafka.

 Azure Databricks– Managed serverless analytics service based on Azure Spark.


Supports a Jupyter/ iPython/Zeppelin like Notebook experience, along with other
communication features, and supports Scala, Python, R and SQL.

Complementary Services

 Cosmos DB – The managed, serverless, multi modal NoSQL database service on
Azure.
 Azure SQL Database – A managed, relational database as a service/DBaaS on Azure.

 Azure SQL Datawarehouse – Cloud-based Enterprise Data Warehouse (EDW) solution


running on Azure. It uses familiar distributed systems and data warehousing
concepts like Massively Parallel Processing (MPP), columnar storage, compression
etc. to ensure fast performance for complex queries.

 Azure Analysis Service – A fully managed analytics engine on Azure; helps to build
semantic models on the cloud. It’s built on familiar SQL Server Analysis Server which
is an on-premise analytics engine based on SQL Server. As of now, Azure Analysis
Service only supports Tabular Models and does not support Multidimensional
Models (remember cubes?).

 Azure Data Factory – A cloud-based ETL and data integration service. It’s serverless
and provides out-of-the-box connectors to 50+ cloud or on-premise
systems/services like Azure Blob Storage, Cosmos DB, Azure SQL Database, on prem
SQL Server/MySQL/PostgreSQL and even 3rd party services like SFDC, Dropbox etc. It
can move data between cloud services, from on-premise systems to cloud or vice
versa.  

Figure 1 below shows these various cloud offerings from Microsoft on Azure cloud.

The big data and data lake-based application architecture on Azure cloud platform is
shown below in Figure 2.
U-SQL Introduction

U-SQL is the big data query and processing language for Azure Data Lake Analytics.
It’s a new language created by Microsoft especially for Azure Data Lake Analytics. U-
SQL combines SQL-like declarative language with the programmatic power of C#,
including C# rich types and expressions. U-SQL provides the familiar big data
processing concepts such as "schema on read", "lazy evaluation", custom processors
and reducers. Data engineers who have previously used languages like Pig, Hive and
Spark would find similarity with those. Developers with C# and SQL knowledge
would find U-SQL easy to learn and start with.

Though U-SQL uses many concepts and keywords from SQL language, it’s not ANSI
SQL compliant. It adds unstructured file handling capabilities using keywords
like EXTRACT and OUTPUT.
Currently, ADLA and U-SQL can be used for batch processing only. It doesn’t provide
stream analytics or event processing capability.
 U-SQL scripts follow the familiar Extract/Retrieve, Transform and Load/Output
pattern (ETL) used by other big data languages like Pig or Spark. It can extract data
from text files (both unstructured text files and semi structured files like JSON or
XML) and tables.

 U-SQL imposes a schema while retrieving unstructured data from files – this helps in
performing SQL-like operations of retrieved data.

 Rowset is the primary data structure of U-SQL. It’s used across for extracting data
from input file/table, and performing transformation, as well as for writing to output
destination. Rowsets are unordered which helps Azure Data Analytics Engine to
parallelize the processing using multiple processing nodes.

 U-SQL scripts can use types, operators and expressions from C#.

 U-SQL scripts use SQL constructs like SELECT, WHERE, JOIN and other data definition


(DDL) and data manipulation language (DML). All keywords must be written in upper
case only.
 Supports control flow constructs like IF ELSE, but do not support While or For loop.
First U-SQL script

For the first U-SQL script, we would use a dataset that constitutes of the rating of
restaurants in Bangalore, India. The raw data are in CSV files and have the following
columns:
 rest_id - Unique id of the restaurant
 name - name of the restaurant
 address - address of the restaurant
 online_order - whether online ordering is available in the restaurant or not
 book_table - whether table booking options are available or not
 rate - overall rating of the restaurant out of 5
 votes - total number of ratings for the restaurant
 phone - phone number of the restaurant
 location - neighborhood in which the restaurant is located
 rest_type - type of restaurant (e.g. Casual Dining, Quick Bites, Delivery, Bakery,
Dessert Parlor etc.)
 favorite_dish_id - id of the most favorite dish of the restaurant

The below table shows the sample data.

The below script reads restaurant ratings data from a CSV file and writes the same
data to a TSV file. It doesn’t use a transformation step yet.
// Script - RestaurantScript.usql

// Data is extracted from input file (CSV) and stored in employees rowset variable
@restaurant_ratings =
EXTRACT rest_id int,
name string,
address string,
online_order bool,
book_order bool,
rate double,
votes int,
phone string,
location string,
rest_type string,
favorite_dish_id int
FROM "/Samples/Data/restaurants_ratings.csv"
USING Extractors.Csv();

// No Transformation – extracted data is loaded as is to output file (TSV)


OUTPUT @restaurant_ratings
TO "/output/restaurants_out.tsv"
USING Outputters.Tsv();

The scripts would write the entire restaurant rowsets to output file in a tab separated
format.

Note that C# datatypes are used here (e.g. string and not char/varchar as typically
used in SQL). Not only can we use the datatypes of C#, but expressions and all the
goodness of an expressive programming language.

U-SQL script with Transform Step


// Script - RestaurantScript.usql
// Variables for input and output file name and paths
DECLARE @inputFile = "/Samples/Data/restaurants_ratings.csv";
DECLARE @outputFile = "/output/restaurants_out.tsv";

// Data is extracted from input file (CSV) and stored in employees rowset variable
@restaurant_ratings =
EXTRACT rest_id int,
name string,
address string,
online_order bool,
book_order bool,
rate double,
votes int,
phone string,
location string,
rest_type string,
favorite_dish_id int
FROM @inputFile
USING Extractors.Csv(skipFirstNRows:1); // Skip first row which contain headers

// Transformation Step: Columns are renamed and no of rows are filtered


@bestOnlineRestaurants =
SELECT name.ToUpper() AS Name, // Converting the names to uppercase
rate AS Rating,
online_order AS OnlineOrder,
phone AS Phone,
location AS Location,
rest_type AS Category,
favorite_dish_id AS FavoriteDishId
FROM @restaurants_rating
WHERE rate > 4 && online_order == true;

// Load transformed data to output file


OUTPUT @bestOnlineRestaurants
TO @outputFile
USING Outputters.Tsv(outputHeader:true); // Write column names/headers to output
file

Extend U-SQL expression using custom code

U-SQL supports custom expressions written in C# code. The C# code resides in code
behind files. Note in the below diagram each .usql file has an associated .usql.cs file
where custom C# code resides.

Figure 8: U-SQL project with multiple script and code behind files

// Code behind C# file - RestaurantScript.usql.cs


namespace UsqlApp1
{
public static class Helpers
{
public static string FormatRestaurantName(string name, string location,
string restaurantType)
{
return name + " (" + restaurantType + ") - " + location;
// Note that U-SQL does not yet support new C# 7.0 string interpolation
// return $"{name} ( {restaurantType} ) - {location}";
}
}
}

// Script - RestaurantScript.usql
// Variables for input and output file name and paths
DECLARE @inputFile = "/Samples/Data/restaurants_ratings.csv";
DECLARE @outputFile = "/output/restaurants_out.tsv";

// Data is extracted from input file (CSV) and stored in employees rowset variable
@restaurant_ratings =
EXTRACT rest_id int,
name string,
address string,
online_order bool,
book_order bool,
rate double,
votes int,
phone string,
location string,
rest_type string,
favorite_dish_id int
FROM @inputFile
USING Extractors.Csv(skipFirstNRows:1); // Skip first row which contain
headers

// Transformation Step: Columns are renamed and no of rows are filtered


@bestOnlineRestaurants =
SELECT USQLApp1.Helpers.FormatRestaurantName(name, location, rest_type) AS
Name,
rate AS Rating,
online_order AS OnlineOrder,
phone AS Phone,
favorite_dish AS FavoriteDish
FROM @restaurant_ratings
WHERE rate > 4 && online_order == true;

// Load transformed data to output file


OUTPUT @bestOnlineRestaurants
TO @outputFile
USING Outputters.Tsv(outputHeader:true); // Write column names/headers to output
file

U-SQL script performing joins

U-SQL supports joins between two different datasets. It provides Inner Join, Outer
Join, Cross Join, etc.
In the following code snippet, we perform inner join between a restaurants dataset
and dish ingredients dataset.
// Script - RestaurantScript.usql
DECLARE @inputFile = "/Samples/Data/restaurants_ratings.csv";
DECLARE @outputFile = "/output/restaurants_out.tsv";

// Data is extracted from input file (CSV) and stored in employees rowset variable
@restaurant_ratings = // Code not shown for brevity. Exact same code as above
example

// Transformation Step: Columns are renamed and no of rows are filtered


@bestOnlineRestaurants = // Code not shown for brevity. Exact same code as above
example

Now, we would need data about dishes and their ingredients. Though this data
typically would be present in an external source, we would use an in-memory rowset
here.
// Declare an in-memory rowset for dish ingredients containing dish id, name of
dish and // ingredients.

@dish_ingredients =
SELECT
* FROM
(VALUES
(1, "Biryani", "Rice, Indian spices, Vegetables, Meat, Egg,
Yoghurt, Dried Fruits"),
(2, "Masala Dosa", "rice, husked black gram, mustard seeds,
fenugreek seeds, salt, vegetable oil, potatoes, onion, green chillies, curry
leaves, turmeric"),
(3, "Cake", " sugar, butter, egg, cocoa, creme, salt")
) AS D(DishId, Dish, Ingredients);
// Perform an Inner Join between @bestOnlineRestaurants and @dish_ingredients
rowset
@rs_innerJn = SELECT
r.Name,
r.Rating,
i.Dish,
i.Ingredients
FROM @bestOnlineRestaurants AS r
INNER JOIN @dish_ingredients AS i
ON r.FavoriteDishId == i.DishId;

// Write to output file


OUTPUT @rs_innerJn
TO @outputFile
USING Outputters.Tsv(outputHeader:true);

This returns the restaurants with higher ratings, along with the ingredient details of
its favorite dish, which is retrieved by joining the restaurant details rowset with dish
ingredient rowset through inner join.

[Click on the image to enlarge it]


Figure 9: U-SQL project with multiple script and code behind files

U-SQL script using built-in functions

U-SQL provides a host of built-in functions including aggregate functions, analytical


functions, ranking functions etc. Below are few samples.
Type of Function     Example

Aggregate Functions AVG, SUM, COUNT, STDEV (Standard Deviation), MIN, MAX etc.

Analytical Functions FIRST_VALUE, LAST_VALUE, LAG, LEAD, PERCENTILE_CONT etc.

Ranking Functions RANK, DENSE_RANK, NTILE, ROW_NUMBER etc.

In the below script, we are using built-in aggregate functions like MIN, MAX, AVG
and STDEV on for restaurant.
// Declare variables for input and output files
DECLARE @inputFile = "/Samples/Data/restaurants_raw_data.csv";
DECLARE @outputFile = "/output/restaurants_aggr.csv";

@restaurant_ratings =
EXTRACT rest_id int,
name string,
address string,
online_order bool,
book_order bool,
rate double,
votes int,
phone string,
location string,
rest_type string,
favorite_dish_id int
FROM @inputFile
USING Extractors.Csv(skipFirstNRows:1);

@output =
SELECT
rest_type AS RestaurantType,
MIN(rate) AS MinRating,
MAX(rate) AS MaxRating,
AVG(rate) AS AvgRating,
STDEV(rate) AS StdDevRating
FROM @restaurants_ratings
GROUP BY rest_type;

// Write to output file


OUTPUT @output
TO @outputFile
USING Outputters.Csv(outputHeader:true);
U-SQL catalog: So far, we have focused on unstructured and semi-structured
data being read from files and written to files. While one of U-SQL’s strengths
is to operate on unstructured data stored in files and provide a schematized
view on top of unstructured data, it can manage structured data. It provides a
general metadata catalog system like Hive. Below is a list of primary objects
supported by U-SQL:

 Database: U-SQL supports databases similar to other big data systems like Hive.
 Database Schema: Database schemas group related objects present under a
Database, exactly like relational databases.
 Tables and Indexes: Tables are containers to hold structured data. Tables can
contain columns of different data types. Table data is stored in files. Tables provide
additional advantages above just schematized views over unstructured files like
indexing, partitioning table data into multiple buckets with each bucket backed up
by a file.
 Views: U-SQL views are of two types – (i) views that are based on a U-SQL table and
(ii) views that point to a file and use EXTRACT to get the data.
 Functions: Supports both scalar and table valued functions.
 Procedures:  Procedures are similar to functions but they don’t return any value.
 Assemblies: U-SQL supports storing .NET assemblies which extends U-SQL scripts
with custom expression.

Now, let’s say in our restaurant rating example, we would like to further analyze
restaurants with low ratings. To do so, we would like to move all the restaurants with
less than a four rating to a U-SQL table for further analysis.

U-SQL database, tables and indexes

In the below example. we will create a U-SQL database which will be created within
the database with a schema and index key. We are not creating a schema specifically
here, so the table will be created under the default schema ‘dbo’ (Remember SQL
Server?) inside the database.

The below code example shows how to create this table.


// Script - RestaurantScript.usql
DECLARE @inputFile = "/Samples/Data/restaurants_ratings.csv";
DECLARE @outputFile = "/output/restaurants_out.tsv";

// Data is extracted from input file (CSV) and stored in employees rowset variable
@restaurant_ratings = // Code not shown for brevity. Exact same code as above
example

// Transformation Step: Filter only those restaurants with rating less than 4
@lowRatedRestaurants =
SELECT rest_id AS RestaurantId,
name AS Name,
rate AS Rating,
online_order AS OnlineOrder,
phone AS Phone,
location AS Location,
rest_type AS Category,
favorite_dish_id AS FavoriteDishId
FROM @restaurants_ratings
WHERE rate < 4;

// Insert Low rated restaurant details to U-SQL Catalog


// Create the database if it does not exist already
CREATE DATABASE IF NOT EXISTS RestaurantsDW;
USE RestaurantsDW;

// Drop the table if it exists


DROP TABLE IF EXISTS dbo.LowRatedRestaurants;

// Create the table by specifying the column schema and index


CREATE TABLE dbo.LowRatedRestaurants(
RestaurantId int,
Name string,
INDEX idx
CLUSTERED (Name DESC)
DISTRIBUTED BY HASH(Name),
Rating double,
OnlineOrder bool,
Phone string,
Location string,
Category string,
FavoriteDishId int
);

// Insert the rowset data to the U-SQL table created just before
INSERT INTO dbo.LowRatedRestaurants
SELECT * FROM @lowRatedRestaurants;

U-SQL views

U-SQL views are similar to database views– they do not physically store the data and
provide a view over data stored in table or files. Views could be based on table or
based on an extraction over files.

The example script below shows how to create a view that’s based on an extraction.
USE DATABASE RestaurantsDW;
// Delete the View if it already exists
DROP VIEW IF EXISTS RestaurantsView;

// Create the View based on an extraction


CREATE VIEW RestaurantsView AS
EXTRACT rest_id int,
name string,
address string,
online_order bool,
book_order bool,
rate double,
votes int,
phone string,
location string,
rest_type string,
favorite_dish_id int
FROM "/Samples/Data/restaurants_raw_data.csv"
USING Extractors.Csv(skipFirstNRows:1); // Skip first row which contain
headers

To run the view the following code is used:


@result = SELECT * FROM RestaurantsDW.dbo.RestaurantsView;
OUTPUT @result
TO "/output/Restaurants_View.csv"
USING Outputters.Csv();

U-SQL table valued functions (TVF)

U-SQL supports both scalar function and table valued function (TVF). Functions take
zero to many arguments and return either a single scalar value or a table, which is a
dataset comprised of columns and rows.

The below code snippet shows first how to create a TVF and then how to invoke it. It
takes a single parameter and returns a table.

CREATE DATABASE IF NOT EXISTS RestaurantsDW;


USE DATABASE RestaurantsDW;

DROP FUNCTION IF EXISTS tvf_SearchRestaurants;

// Create the Table Valued Function that accepts Restaurant type as string
// and returns a table that contains matched restaurant details.
CREATE FUNCTION tvf_SearchRestaurants(@RestaurantType string)
RETURNS @searchRestaurants TABLE(rest_id int, name string, address string,
online_order bool, book_order bool, rate
double,
votes int, phone string, location string,
rest_type string, favorite_dish_id int)
AS
BEGIN
@allRestaurants =
EXTRACT rest_id int,
name string,
address string,
online_order bool,
book_order bool,
rate double,
votes int,
phone string,
location string,
rest_type string,
favorite_dish_id int
FROM "/Samples/Data/restaurants_raw_data.csv"
USING Extractors.Csv(skipFirstNRows:1); // Skip first row which contain
headers

@searchRestaurants =
SELECT *
FROM @allRestaurants
WHERE rest_type == @RestaurantType;

RETURN;
END;

Now let’s invoke the table valued function we just created and pass ‘Bakery’ as
parameter– it would return all the restaurants which are of type Bakery.
OUTPUT RestaurantsDW.dbo.tvf_SearchRestaurants("Bakery")
TO "/output/BakeryRestaurants.csv"
USING Outputters.Csv();

Case Study

The following case study highlights using Azure Data Lake Analytics and U-SQL
language in a multiyear, large, strategic digital transformation program. The
customer, a large insurance major, over the year acquired multiple insurance
companies and brokers, and as a result used multiple customer engagement systems
for interacting with customers over email, text/SMS, web/mobile chat and calls (both
inbound and outbound). Because of the fractured approach, it became very difficult
for the customer to analyze customer interaction data.

While the customer embarked on a journey to build an omni channel platform and
an integrated contact center for customer service over various channels (email, text,
chat bot, contact center voice calls), their immediate tactical choice was to analyze
data from various sources for email, text/SMS, chat and call logs.

An Azure Data Lake-based solution was developed to answer the immediate need of
analyzing data from different systems, in different formats. Data from various source
systems were moved to Azure Data Lake Store and were then analyzed using Azure
Data Lake Analysis and U-SQL.
1. Ingest – In the ingest phase, unstructured and structured data from two different sources
(Email/Text/Chat data as well as Call Log) are moved to Azure using Azure Data Factory ETL
service.
2. Store – Raw data is stored on Azure Data Lake Storage/ADLS as flat files.
3. Analyze – Various types of analysis including filtering, joins, aggregation, windowing etc. are
performed in U-SQL.
4. Model and Serve – Analyzed data is stored in structured tables for later consumption from
Power BI/custom reports by user.
Figure 10: Azure Data Analytics Pipeline

Conclusions

Azure Data Lake Storage and Analytics have emerged as a strong option for
performing big data and analytics workloads in parallel with Azure HDInsight and
Azure Databricks. Though it’s still in its early days and lacks streaming and event
processing capabilities, its power lies in the new U-SQL language which combines the
simplicity and ubiquity of SQL with Mirosoft’s flagship, the powerful C# language.
Also, Microsoft’s development tools like Visual Studio and local dev/test capability
make it a powerful competitor in big data & analytics space.

You might also like