Download as pptx, pdf, or txt
Download as pptx, pdf, or txt
You are on page 1of 108

Azure Data Platform End-to-End

Implement a Modern Data Platform Architecture

<your name>
<your role>
<your email>
Begin with the end in mind

© Microsoft Corporation
Course Objectives
• We will understand Cloud and Big Data concepts and technologies used to solve the most common
advanced analytics problems
• We will understand the role of Microsoft Azure data services in a modern data platform architecture
• We will look at individual Azure Data Services and use them to implement a modern data platform
reference architecture
• We will have a ARM template of a data platform that will enable us to solve most of our data challenges

© Microsoft Corporation
Important Reminder
• The modern data platform architecture proposed in this course aims to help with your technology
decisions when architecting data solutions in Azure.
• The Azure services covered in this course are only a subset of a much larger family of data services.
Some real-world data scenarios may require the use of services not included in this course.
• This course does not replace in-depth training on each Azure service covered today.
• Some concepts presented in this course can be quite complex and you may need to seek for more
information from different sources.

© Microsoft Corporation
Modern Data Platform Reference Architecture
Load and Ingest Process
Stream Datasets
λ Lambda Architecture and Real-time
Stream Dashboards
Hot Path
V=Velocity Business User
Real Time
IoT devices, sensors, gadgets Analytics
(loosely-typed) Event Hubs Stream Analytics

Cold Path
History and
Trend Analysis

Non-structured Cognitive Services Analytics


V=Variety Power BI Premium
Azure ML
images, video, audio, free text
(no structure)

Build and Score


ML models

Semi-Structured Data Factory Azure Data Lake Gen2 Databricks CosmosDB Application
V=Volume
csv, logs, json, xml Scheduled / event-
triggered data ingestion
(loosely-typed) Integrate big data
Fast load
data with scenarios with
Polybase/ traditional data
ParquetDirect warehouse

Enterprise-grade
semantic model
Relational Databases
(strongly-typed, structured) Azure Synapse Analytics Power BI Premium Analytics

Store Serve
© Microsoft Corporation
Lab Guide Azure Data Platform End2End Lab 1: Load Data into Azure Synapse Analytics using Azure Data Factory Pipelines
Lab 2: Transform Big Data using Azure Data Factory Mapping Data Flows

Lab Architecture Lab 3: Explore Big Data with Azure Databricks


Lab 4: Add AI to your Big Data pipeline with Cognitive Services
Lab 5: Ingest and Analyse Real-Time Data with Event Hubs and Stream Analytics

Azure Data Platform


Resource Group

5
ADPLogicApp 1 ADPEventHubs-suffix 3 SynapseStreamAnalytics-suffix
Logic App Event Hubs Event Hubs

4
2 5 6
ADPCosmosDB-suffix
Azure CosmosDB

1 2
3
PowerBI
MDWResources SynapseDataFactory-suffix SynapseDataLakesuffix ADPDatabricks
Power BI Desktop/
Storage Account Azure Data Factory Azure Data Lake Storage Gen2 Azure Databricks
Workspace
1 2 1

2
ADPComputerVision
4 Computer Vision API
3
operationalsql-suffix\NYCDataSets SynapseDataFactory-suffix
Azure SQL Database Azure Data Factory
1

RDP Connection 3 2
or
Azure Bastion
ADPDesktop
Virtual Machine
4
synapsesql-suffix\SynapseDW
Student͛s Azure Synapse Analytics
Computer 4
ADPVirtualNetwork
Virtual Network

© Microsoft Corporation
The modern data world out there

© Microsoft Corporation
I tried to understand it, but…
No-SQL Databricks
Storm Data Catalog
IoT PaaS vs IaaS
Hadoop Power BI Streaming
Deep Learning Machine Learning
Predictive Data Mart SMP vs MPP
ETL vs ELT
Data Visualisation Prescriptive
Data Warehouse
Data Lake Master Data
Big Data Data Factory Cloud vs On-prem
Data Quality Velocity, Variety and Volume
© Microsoft Corporation
Semantic Layer Spark AI
The modern data estate

LOB CRM Graph Image Social IoT

Hybrid

Operational databases Operational databases

Data warehouses Data warehouses

Data lakes Data lakes

Reason over any data, anywhere Flexibility of choice Security and performance

© Microsoft Corporation
The Microsoft offering

LOB CRM Graph Image Social IoT

Hybrid
Easiest lift and shift
with no code changes

SQL Server Azure Data Services

Industry leader 4 years in a row Operational databases Operational databases 70% faster   

#1 TPC-H performance Data warehouses Data warehouses 2x the global reach

T-SQL query over any data Data lakes Data lakes 99.9% SLA

AI built-in | Most secure | Lowest TCO

Reason over any data, anywhere Flexibility of choice Security and performance

© Microsoft Corporation
The Azure data landscape

SQL

Azure Data Azure Import/Export


Factory service Azure SQL DB Azure Cosmos DB Azure Synapse Analytics Azure Analysis Services Power BI

>_
10
Azure CLI Azure SDK 01

Azure Blob Storage Azure Data Lake Azure Azure Azure ML ML Server Azure
Store HDInsight Databricks Databricks

Azure IoT Hub Azure event hubs

Azure Stream Azure Azure Bot service Cognitive services


Azure Search Azure Data Catalog Analytics HDInsight Databricks
Kafka on Azure HDInsight

NSG

Azure ExpressRoute Azure Active Directory Azure network Azure key Operations Azure Functions Visual Studio
security groups management service Management Suite

© Microsoft Corporation
The Azure Big Data landscape

SQL

Azure Data Azure Import/Export


Factory service Azure SQL DB Azure Cosmos DB Azure Synapse Analytics Azure Analysis Services Power BI

>_
10
Azure CLI Azure SDK 01

Azure Blob Storage Azure Data Lake Azure Azure Azure ML ML Server Azure
Store HDInsight Databricks Databricks

Azure IoT Hub Azure event hubs

Azure Stream Azure Azure Bot service Cognitive services


Azure Search Azure Data Catalog Analytics HDInsight Databricks
Kafka on Azure HDInsight

NSG

Azure ExpressRoute Azure Active Directory Azure network Azure key Operations Azure Functions Visual Studio
security groups management service Management Suite

© Microsoft Corporation
Our job, pretty much

© Microsoft Corporation
Azure Data Architecture Guide
Valuable collection of architecture principles to help you with your technology choices
https://aka.ms/adag

© Microsoft Corporation
Azure Architecture Solutions
Collection of reference architectures for most common challenges
https://azure.microsoft.com/en-us/solutions/architecture/

© Microsoft Corporation
Modern Data Platform Solution Scenarios
Big Data and advanced analytics

SQL

Modern data warehousing Advanced analytics Real-time analytics

“We want to integrate all our data “We’re trying to predict when “We’re trying to get insights
—including Big Data—with our our customers churn” from our devices in real-time”
data warehouse”

© Microsoft Corporation
Modern Data Platform Concepts
Part I

© Microsoft Corporation
IaaS vs PaaS vs SaaS
On Premises Infrastructure Platform Software
Physical / Virtual as a Service (IaaS) as a Service (PaaS) as a Service (SaaS)

You manage
Applications Applications Applications Applications

resilient & manage


You scale, make resilient and manage

You scale, make


Data Data Data Data

management by Microsoft
Runtime Runtime Runtime Runtime

Scale, Resilience and


Middleware Middleware Middleware Middleware

management by Microsoft
Scale, Resilience and
O/S O/S O/S O/S

Virtualization Virtualization Virtualization Virtualization

Managed by
Microsoft
Servers Servers Servers Servers

Storage Storage Storage Storage

Networking Networking Networking Networking

Azure Azure
Virtual Machines Cloud Services
© Microsoft Corporation
What is a Data Warehouse?
A data warehouse is a large collection of business data used to help an organization make decisions. Data in the
Data Warehouse has been identified as valuable to specifically defined business cases and is stored in a structured
way readily available for reporting and data analysis.
It is not an Operational Database
Different workload types: transactional (DB) versus analytics (DW)
It is not a Data Lake
These are different concepts, they can co-exist and they compliment each other
It is not a Data Mart
A data mart is a subject-oriented database populated from a subset of the Data Warehouse

© Microsoft Corporation
Modern Data Warehousing

© Microsoft Corporation
Modern data warehousing
The modern data warehouse extends the scope of the data warehouse to
serve Big Data that’s prepared with techniques beyond relational ETL

SQL

Modern data warehousing Advanced analytics Real-time analytics

“We want to integrate all our data “We’re trying to predict when “We’re trying to get insights
—including Big Data—with our our customers churn” from our devices in real-time”
data warehouse”

© Microsoft Corporation
Modern data warehousing pattern

LOB

CRM
BI + Reporting
INGEST STORE PREP MODEL & SERVE
(& store)
Graph

Advanced Analytics
Image

Social Data orchestration Big data store Transform & Clean Data warehouse
and monitoring
AI
IoT

© Microsoft Corporation
SQL Server and Azure SQL Database

© Microsoft Corporation
Data platform continuum
Shared lower cost

PaaS & SaaS


Azure SQL Database
Virtualized Database

IaaS
SQL Server in Azure VM
Virtualized Machines

Virtual
SQL Server Private Cloud
Virtualized Machine + Appliance
Dedicated higher cost

Physical
SQL Server
Physical Machine (raw iron)

Higher administration Lower administration


SQL Server 2019 big data, analytics, and AI
Data virtualization Managed SQL Server, Spark, Complete AI platform
and data lake

Admin portal and management services


Analytics Apps
T-SQL Integrated AD-based security
REST API containers
for models

SQL Server External Tables SQL


Server Spark
SQL Server Spark &
ML Services Spark ML
Compute pools and data pools

Scalable, shared storage (HDFS)


Open NoSQL Relational HDFS External data HDFS
database databases sources
connectivity

Combine data from many sources without Store high volume data in a data lake and access Easily feed integrated data from many sources to
moving or replicating it it easily using either SQL or Spark your model training
Scale out compute and caching to boost Management services, admin portal, and Ingest and prep data and then train, store, and
performance integrated security make it all easy to manage operationalize your models all in one system
Azure SQL Database deployment option

Azure SQL Database

Single Elastic Pool Managed Instance


Database-scoped deployment option with Shared resource model optimized for greater Instance-scoped deployment option with high
predictable workload performance efficiency of multi-tenant applications compatibility with SQL Server and full PaaS benefits

Best for apps that require resource Best for SaaS apps with multiple databases that can share Best for modernization at scale with
guarantee at database level resources at database level, achieving better cost efficiency low friction and effort

General Purpose
Service Tiers

Business Critical
Hyperscale
Serverless
Azure SQL Managed Instance
Enable full isolation from other tenants without VNET support in SQL Database Managed Instance
resource sharing
Promote secure communication over private IP
addresses with native VNET integration
Enable your on-premise identities on cloud
instances, through integration with Azure Active
Directory and AD Connect
Combine the best of SQL Server with the
benefits of a fully-managed service
Use familiar SQL Server features in SQL
Database Managed Instance
Azure Data Factory

© Microsoft Corporation
Azure Data Factory
Hybrid data integration service for enabling code-free ETL

Industry leading Visual Hybrid Pay only for what Managed SSIS
data ingestion No Code you use

Productive & trusted hybrid data integration service


that simplifies ETL with any data, from any source, at scale.
Command and Control
Data

UX & SDK
Data Factory
Authoring | Monitoring/Mgmt A data integration account.
Location of orchestration, service metadata
Azure Data Factory v2 Service
Scheduling | Orchestration | Monitoring

Pipeline SSIS
Package
Integration Runtime (IR)
ADF’s execution engine
- Azure Integration Runtime
- Self-Hosted Integration Runtime
- SSIS Integration Runtime

Three core capabilities:


Azure
Self Hosted Integration • data movement
Integration Runtime
Runtime • pipeline activity execution
Azure Svcs
• SSIS package execution
On Prem Apps &
Data
Cloud & SaaS
Orchestration @ Scale
Trigger Pipeline foreach (…)

On demand Activity Activity


Schedule
Data Window Activity Activity
Event
Activity

Self-hosted Azure
Integration Runtime Integration Runtime

LEGEND
Linked Command and Control
On-prem Azure Services
Service Apps & Data Data
Azure Data Factory Data Flows
No-code data transformation and preparation @ scale

Mapping Dataflow Wrangling Dataflow

Code free data transformation @scale Code free data preparation @scale

© Microsoft Corporation
Azure Synapse Analytics

© Microsoft Corporation
Azure Synapse Analytics
Integrated data platform for BI, AI and continuous intelligence
Designed for analytics workloads
Artificial Intelligence / Machine Learning / Internet of Things
at any scale
Intelligent Apps / Business Intelligence
Azure Synapse Analytics
SaaS developer experiences for
code free and code first
Experience Azure Synapse Analytics Studio
Multiple languages suited to
Platform Languages different analytics workloads
MANAGE MENT
SQL Python .NET Java Scala R
Integrated analytics runtimes
Form Factors
available provisioned and
SECURITY
P ROVISIONE D ON- DEM AND serverless on-demand
SQL Analytics offering T-SQL for
Analytics Runtimes
MONITORING
batch, streaming and interactive
processing
SQL Spark for big data processing with
ME TASTORE Python, Scala, R and .NET
DATA INTE GRATION
Integrated platform services
for, management, security,
monitoring, and metastore
Azure Common Data Model
Data Lake Storage Enterprise Security Data lake integrated and
Optimized for Analytics Common Data Model aware
Azure Synapse Analytics MPP Architecture

Cores Memory Cores Memory Cores Memory

NVMe SSD NVMe SSD NVMe SSD


Compute

Cache TempDB Cache TempDB Cache TempDB


Remote storage

Data
Snapshot backups
Log

© Microsoft Corporation
Table Distributions
CREATE TABLE [dbo].[FactInternetSales]
Round-robin distributed (
Distributes table rows evenly across all [ProductKey] int NOT NULL,
distributions at random.
[OrderDateKey] int NOT NULL,
Hash distributed [CustomerKey] int NOT NULL,
[PromotionKey] int NOT NULL,
Distributes table rows across the Compute [SalesOrderNumber] nvarchar(20) NOT NULL,
nodes by using a deterministic hash [OrderQuantity] smallint NOT NULL,
function to assign each row to one [UnitPrice] money NOT NULL,
distribution. [SalesAmount] money NOT NULL
)
Replicated WITH
(
Full copy of table accessible on each CLUSTERED COLUMNSTORE INDEX,
Compute node. DISTRIBUTION = HASH([ProductKey]) |
ROUND ROBIN |
REPLICATED
);
© Microsoft Corporation
Polybase
Data ingestion using external data sources -- Create Azure DataLake Gen2 Storage reference
CREATE EXTERNAL DATA SOURCE AzureStorage with
(
TYPE = HADOOP,
LOCATION='abfss://<container>@<storageaccnt>.blob.core.windows.net' ,

CREDENTIAL = AzureStorageCredential –- not required if using


Overview managed identity
);
Polybase supports querying files stored in -- Type of format in Hadoop (CSV, RCFILE , ORC, PARQUET).
a Hadoop File System (HDFS), Azure Blob CREATE EXTERNAL FILE FORMAT TextFileFormat WITH
storage, or Azure Data Lake Store. (
FORMAT_TYPE = DELIMITEDTEXT,
To query files, users create three objects: FORMAT_OPTIONS (FIELD_TERMINATOR ='|', USE_TYPE_DEFAULT =
TRUE)
External data source, external file format, )
external table. -- LOCATION: path to file or directory that contains data
CREATE EXTERNAL TABLE [dbo].[CarSensor_Data]
(
[SensorKey] int NOT NULL,
[Speed] float NOT NULL,
[YearMeasured] int NOT NULL
)
WITH (LOCATION='/Demo/’, DATA_SOURCE = AzureStorage,
FILE_FORMAT = TextFileFormat
© Microsoft Corporation );
Lab 1: Load data into Azure Synapse
Analytics using Azure Data Factory
Pipelines

© Microsoft Corporation
Lab 1
Load data into Azure Synapse Analytics using Azure Data Factory Pipelines
Load and Ingest Process

Business User

Data Factory Azure Data Lake Gen2


Scheduled / event-
triggered data ingestion
Fast load
data with
Polybase/
ParquetDirect

Enterprise-grade
semantic model
Relational Databases
(strongly-typed, structured) Azure Synapse Analytics Power BI Premium Analytics

© Microsoft Corporation
Store Serve
Lab 1
Azure Data Platform
Lab Architecture Resource Group

PowerBI
MDWResources SynapseDataLakesuffix
Power BI Desktop/
Storage Account Azure Data Lake Storage Gen2
Workspace

operationalsql-suffix\NYCDataSets SynapseDataFactory-suffix
Azure SQL Database Azure Data Factory
1

RDP Connection 3
or
Azure Bastion
ADPDesktop
Virtual Machine

synapsesql-suffix\SynapseDW
Student͛s Azure Synapse Analytics
Computer 4
ADPVirtualNetwork
Virtual Network

© Microsoft Corporation
Modern Data Platform Concepts
Part II

© Microsoft Corporation
TY
CI
The Modern Data Problem

LO
VE
Re
al
-ti
m
e

VOLUM
E Ba
tc
h

ZB
How to derive value from data:

GB
What happened historically?
ed
What is happening now? Str u ctur
d at a
What is going to happen?

Each dimension of data is red


st r uctu
Un data
constantly expanding

VAR
IET
Y
What is a Data Lake?
It is a central storage repository that holds data coming from many sources in a raw, granular format. It can store
structured, semi-structured, or unstructured data, which means data ingested quickly and can be kept in a
more flexible format for future use cases.

• Schema-on-read • Quickly ingest high • Data Governance


Characteristics

Best Practices
Benefits
(ELT) volumes of diverse needed to avoid
• Collection of data, data structures Data Swamp
not a platform • Enable advanced • Security
• Perfect place for analytics and data considerations
evolving data exploration • Design your Data
• Scalability and Lake
storage cost • Metadata
reduction management

© Microsoft Corporation
Data Warehouse or Data Lake?
Answer: both.

Data Warehouse Data Lake


Requirements Relational requirements Diverse data, scalability, low cost
Data Value Data of recognised high value Candidate data of potential value
Data Processing Mostly refined calculated data Mostly detailed source data
Business Entities Known entities, tracked over time Raw material for discovering entities and facts
Data Standards Data conforms to enterprise standards Fidelity to original format and condition
Data Integration Data integration upfront Data prep on demand
Transformation Data transformed, in principle Data repurposed later, as needs arise
Schema Definition Schema-on-write Schema-on-read
Metadata Management Metadata improvement Metadata developed on read

© Microsoft Corporation
Data Lake Design Considerations
Data Lake Zones Data Governance Considerations
Transient Landing Zone Security and Compliance
Temporary storage of data to meet regulatory and quality control Access Control at Folder/File level
requirements. Limited access. May not be required depending on Encryption at rest
requirements.
Metadata Management
Raw Zone
Data Quality
Original source of data ready for consumption. Metadata publicly
available but access to data still limited. Metadata Management
Trusted Zone Lifecycle Management
Standardized and enriched datasets ready for consumption to those
with appropriate role-based access. Metadata available to all.
Curated/Refined Zone
Data transformed from Trusted Zone to meet specific business
requirements.
Sandbox Zone
Playground for Data Scientists for ad hoc exploratory use cases.

© Microsoft Corporation
Data Lake Design Example
Data Lake Zone
Business Area
Source System
Entity Name
Year
Month
Data Files (standard naming convention)

Source: https://www.sqlchick.com/entries/2019/1/20/faqs-about-organizing-a-data-lake
© Microsoft Corporation
Azure Data Lake Storage Gen2

© Microsoft Corporation
Azure Data Lake Storage Gen2
A “no-compromises” Data Lake: Secure, performant and massively-scalable
A Data Lake that brings together the cost and scale of object storage with the performance and analytics feature set of data lake storage

Fast Manageable Secure Scalable Cost effective Integration ready


Atomic file operations Automated Lifecycle Policy Support for fine-grained No limits on data Object store pricing levels Optimized for Spark and Hadoop
mean jobs complete Management ACLs, protecting data at the store size Analytic Engines
faster file and folder level File system operations
Object Level tiering Global footprint minimize transactions Tightly integrated with Azure end
Multi-layered protection via (50 regions) required for job completion to end analytics solutions
at-rest Storage Service
encryption & Azure Active Multiprotocol access
Directory integration

Single service
Azure Data Lake Storage Gen2
High performance HDFS Endpoint to Azure Blob Storage

ADLS Gen2 API Blob API

Hierarchical file system


File Data Unstructured Object Data

Hadoop Filesystem, File and Folder Server Backups, Archive Storage, Semi-
Hierarchy, Granular ACLs structured Data

Security Performance Scale and cost effectiveness

Common Blob storage foundation

Common SDK, Tools, Control Plane Object Tiering and Lifecycle AAD integration, RBAC, Storage HA/DR support through ZRS and RA-
Policy Management account security GRS

© Microsoft Corporation
Lab 2: Transform Big Data using
Azure Data Factory Mapping Data
Flows

© Microsoft Corporation
Lab 2
Transform Big Data using Azure Data Factory Mapping Data Flows
Load and Ingest Process

Business User

Semi-Structured Data Factory Azure Data Lake Gen2


V=Volume
csv, logs, json, xml Scheduled / event-
triggered data ingestion
(loosely-typed)
Fast load
data with
Polybase/
ParquetDirect

Enterprise-grade
semantic model
Relational Databases
(strongly-typed, structured) Azure Synapse Analytics Power BI Premium Analytics

© Microsoft Corporation
Store Serve
Lab 2
Lab Architecture Azure Data Platform
Resource Group

PowerBI
MDWResources SynapseDataFactory-suffix SynapseDataLakesuffix
Power BI Desktop/
Storage Account Azure Data Factory Azure Data Lake Storage Gen2
Workspace
1 2

3
operationalsql-suffix\NYCDataSets SynapseDataFactory-suffix
Azure SQL Database Azure Data Factory
1

RDP Connection 3
or
Azure Bastion
ADPDesktop
Virtual Machine
4
synapsesql-suffix\SynapseDW
Student͛s Azure Synapse Analytics
Computer 4
ADPVirtualNetwork
Virtual Network

© Microsoft Corporation
Advanced Analytics

© Microsoft Corporation
Advanced analytics
Advanced analytics goes beyond the traditional business intelligence (BI) and uses mathematical, probabilistic,
and statistical modeling techniques to enable predictive processing and automated decision making.

SQL

Modern data warehousing Advanced analytics Real-time analytics

“We want to integrate all our data “We’re trying to predict when “We’re trying to get insights
—including Big Data—with our our customers churn” from our devices in real-time”
data warehouse”

© Microsoft Corporation
Modern Data Platform Concepts
Part III

© Microsoft Corporation
Hadoop and Spark in Azure
Open Source Apache Projects for Big Data Compute

It was the original open-source framework for distributed Effective, fast, general-purpose unified cluster computing framework with
processing and analysis of big data sets on clusters. high-level APIs in Java, Scala, Python and R.

Read/write from disk. In-memory processing.

Economical batch mode. Fast, interactive data processing.


Linear processing of huge datasets. Streaming and Machine Learning Support

Azure HDInsight is a managed, full-spectrum, open-source analytics service for enterprises 

© Microsoft Corporation
Apache Spark Value Proposition

• Run programs up to • Write applications • Includes higher-level • Apache Project

Open
Unified and integrated analytics model
Performance

Productivity
100x faster than Hadoop quickly in Scala or libraries in integrated • Contributors: 50
MapReduce in memory Python, Java framework companies and over
• 10x faster on disk • Over 80 high-level • Combine SQL, 400 developers,
• Useful for real-time and operators that make it streaming, graph growing
batch processing easy to build parallel processing
apps • Runs on Hadoop (Yarn,
• Use interactively from Mesos), standalone, or
the Scala and Python in the cloud
shells • Diverse data sources
including HDFS (CSV,
JSON, etc.), Azure
Storage, Cosmos DB,
Azure SQL Data
Warehouse, Parquet,
JDBC

© Microsoft Corporation
Azure Databricks

© Microsoft Corporation
Azure Databricks
A fast, easy and collaborative Apache® Spark™ based analytics platform optimized for Azure

Best of Databricks Best of Microsoft

Designed in collaboration with the founders of Apache Spark

One-click set up; streamlined workflows

Interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.

Native integration with Azure services (Power BI, SQL DW, Cosmos DB, ADLS, Azure Storage, Azure Data Factory,
Azure AD, Event Hub, IoT Hub, HDInsight Kafka, SQL DB)

Enterprise-grade Azure security (Active Directory integration, compliance, enterprise-grade SLAs)

© Microsoft Corporation
Azure Databricks
Azure Databricks
Collaborative Workspace

IoT / streaming data Machine learning models

DATA ENGINEER DATA SCIENTIST BUSINESS ANALYST

Deploy Production Jobs & Workflows


BI tools
Cloud storage

MULTI-STAGE PIPELINES JOB SCHEDULER NOTIFICATION & LOGS

Data warehouses
Optimized Databricks Runtime Engine Data exports

Hadoop storage
DATABRICKS I/O APACHE SPARK SERVERLESS Rest APIs
Data warehouses

Enhance Productivity Build on secure & trusted cloud Scale without limits
© Microsoft Corporation
Azure Databricks Deployment

Managed Attached Azure Workspace Workspace


Cluster Node(s) BLOB (DBFS) VNET NSG rules
Resource Group
Azure Portal

Integrate with other Azure Services

Azure BLOBs Data Lake


Notebooks
Azure
Databricks
Workspace Run on Event Hub IOT Hub Kafka
Clusters

Cosmos DB SQL DW
Jobs

Azure Resource Interact using UI or Azure Databricks REST API Data Factory
Manager APIs

© Microsoft Corporation
Azure Databricks Clusters

Driver Program
SparkContext

 Azure Databricks clusters are the set of Azure Linux VMs that
host the Spark Worker and Driver Nodes
Cluster Manager
 Your Spark application code (i.e. Jobs) runs on the provisioned
clusters.
Worker Node Worker Node Worker Node
 Azure Databricks clusters are launched in your subscription—
but are managed through the Azure Databricks portal. Cache Cache
Cache
 Azure Databricks provides a comprehensive set of graphical
Task Task Task
wizards to manage the complete lifecycle of clusters—from
creation to termination.

Data Sources (HDFS, SQL, NoSQL, …)


© Microsoft Corporation
Azure Databricks Notebooks
Notebooks are a popular way to develop, and run, Spark Applications
Notebooks are not only for authoring Spark applications but can be run/executed
directly on clusters
• Shift+Enter
• click the at the top right of the cell in a notebook
• Submit via Job
Fine grained permissions support so they can be securely shared with colleagues for
collaboration
Notebooks are well-suited for prototyping, rapid development, exploration, discovery
and iterative development

With Azure Databricks notebooks you have a default language but you can mix multiple languages in the same notebook:
%python Allows you to execute python code in a notebook (even if that notebook is not python)
%sql Allows you to execute sql code in a notebook (even if that notebook is not sql).
%r Allows you to execute r code in a notebook (even if that notebook is not r).
%scala Allows you to execute scala code in a notebook (even if that notebook is not scala).
%sh Allows you to execute shell code in your notebook.
%fs Allows you to use Databricks Utilities - dbutils filesystem commands.
%md To include rendered markdown

© Microsoft Corporation
Lab 3: Explore Big Data with Azure
Databricks

© Microsoft Corporation
Lab 3
Explore Big Data with Azure Databricks
Load and Ingest Process

Business User

Semi-Structured Data Factory Azure Data Lake Gen2 Databricks


V=Volume
Scheduled / event-
csv, logs, json, xml
triggered data ingestion
(loosely-typed) Integrate big data
Fast load
data with scenarios with
Polybase/ traditional data
ParquetDirect warehouse

Enterprise-grade
semantic model
Relational Databases
(strongly-typed, structured) Azure Synapse Analytics Power BI Premium Analytics

Store Serve
© Microsoft Corporation
Lab 3
Lab Architecture
Azure Data Platform
Resource Group

PowerBI
MDWResources SynapseDataFactory-suffix SynapseDataLakesuffix ADPDatabricks
Power BI Desktop/
Storage Account Azure Data Factory Azure Data Lake Storage Gen2 Azure Databricks
Workspace
1 2 1

2
ADPComputerVision
Computer Vision API
3
operationalsql-suffix\NYCDataSets SynapseDataFactory-suffix
Azure SQL Database Azure Data Factory
1

RDP Connection 3 2
or
Azure Bastion
ADPDesktop
Virtual Machine
4
synapsesql-suffix\SynapseDW
Student͛s Azure Synapse Analytics
Computer 4
ADPVirtualNetwork
Virtual Network

© Microsoft Corporation
Modern Data Platform Concepts
Part IV

© Microsoft Corporation
Artificial Intelligence
“The ability of a digital computer or computer-controlled robot to perform tasks commonly associated with
intelligent beings.” – Encyclopedia Britannica
Machine Learning
Supervised Learning

Regression

Classification

Unsupervised Learning

Cluster Analysis

Application Examples

Weather Forecast

Fraud Detection

Customer Churn

Insurance Premium

© Microsoft Corporation
What’s No-SQL?
Term coined in 2009 for a developer meetup – ”Not Only SQL” -> “NoSQL”.
Databases that allow you to store and retrieve data in various structures, formats, and models other than
tabular relational model.

There’s a time and a place for everything Data Structures


Sometimes a relational store is the right choice Key-Value Databases

Cosmos DB, Redis Cache, Azure Table


Sometimes a NoSQL store is the right choice
Column Family Stores
Sometimes you need more than one store for an app ->
polyglot persistence Cosmos DB, Cassandra, HBase

Graph Databases

Cosmos DB, Neo4j, Gremlin

Document Databases

Cosmos DB, MongoDB

© Microsoft Corporation
Azure AI

© Microsoft Corporation
Azure AI
Solution Areas

AI apps and agents Knowledge mining Machine learning

Azure Cognitive Services Azure Search Azure Databricks


Azure Bot Service Azure Machine Learning
Azure AI Infrastructure

Productive Built for enterprises Trusted


Machine Learning on Azure

Domain specific pretrained models …


To simplify solution development Vision Speech Language Search

Familiar Data Science tools


To simplify model development Azure Notebooks Jupyter Command line
Visual Studio Code

Popular frameworks
To build advanced deep learning solutions
PyTorch TensorFlow Scikit-Learn ONNX

Productive services
To empower data science and development teams Azure Azure Machine Machine
Databricks Learning Learning VMs

Powerful infrastructure
To accelerate deep learning
CPU GPU FPGA

From the Intelligent Cloud to the Intelligent Edge


Custom Speech
Speech API
Bing Spell
Check Translator
Text Analytics
Speech
Speech

Unified Speech Text to


speech

Bing Speech Cognitive


Translator Text Language Speaker Recognition Services Lab
Language
Understanding (LUIS)

Cognitive Services APIs


Computer
Knowledge Vision

Custom
Vision
QnA Maker
Custom Decision Bing Visual Search
Video Indexer

Bing Entity Search Vision


Bing Statistics
add-in
Bing Custom Search Emotion

Bing Search
Bing Autosuggest
Search Content
Moderator
Face
Cognitive Services capabilities
Infuse your apps, websites, and bots with human-like intelligence

Vision Speech Language Knowledge Search


Object, scene, and Speech transcription Language detection Q&A extraction from Ad-free web, news, image,
activity detection (speech-to-text) unstructured text and video search results
Named entity recognition
Face recognition Custom speech models for Knowledge base creation Trends for video, news
and identification unique vocabularies or Key phrase extraction from collections of Q&As
complex environment Image identification,
Celebrity and landmark Text sentiment analysis Semantic matching for classification and
recognition Text-to-speech Multilingual and contextual knowledge bases knowledge extraction
Emotion recognition Custom Voice spell checking Customizable content Identification of similar
Explicit or offensive text personalization learning images and products
Text and handwriting Real-time speech translation
recognition (OCR) content moderation Named entity recognition
Customizable speech and classification
Customizable image transcription and translation PII detection for text
recognition moderation Knowledge acquisition
Speaker identification for named entities
Video metadata, audio, and verification Text translation
and keyframe extraction Customizable text translation Search query autosuggest
and analysis
Contextual language Ad-free custom search
Explicit or offensive understanding engine creation
content moderation
Knowledge mining with Azure Search
Documents Cognitive skills Fully text-searchable
rich index

Key Phrase extraction Location entity extraction Sentiment analysis

Organization entity extraction Persons entity extraction Language detection

Face detection Celebrity recognition Tag extraction

Custom skills Landmark detection Printed text recognition


Cosmos DB

© Microsoft Corporation
Azure Cosmos DB

Core (SQL) API

Table API MongoDB

Column-family Document
Key-value Graph

Guaranteed low latency


at the 99th percentile
Elastic scale out Five well-defined
of storage & throughput consistency models

Turnkey global
Comprehensive
distribution
SLAs
RESOURCE MODEL

Account

Database
Database
Database

Database
Database
Container = Collection Graph Table

Database
Database
Item
C O N TA I N E R - L E V E L R E S O U R C E S

Account

Database
Database
Database

Database
Database
Container

Database
Database
Item Sproc Trigger UDF Conflict
Lab 4: Add AI to your Big Data
Pipeline with Cognitive Services

© Microsoft Corporation
Lab 4
Add AI to your Big Data Pipeline with Cognitive Services
Load and Ingest Process

Business User

Non-structured Cognitive Services


V=Variety Azure ML
images, video, audio, free text
(no structure)

Build and Score


ML models

Semi-Structured Data Factory Azure Data Lake Gen2 Databricks CosmosDB Application
V=Volume
Scheduled / event-
csv, logs, json, xml
triggered data ingestion
(loosely-typed) Integrate big data
Fast load
data with scenarios with
Polybase/ traditional data
ParquetDirect warehouse

Enterprise-grade
semantic model
Relational Databases
(strongly-typed, structured) Azure Synapse Analytics Power BI Premium Analytics

© Microsoft Corporation
Store Serve
Lab 4
Lab Architecture Azure Data Platform
Resource Group

5 6
ADPCosmosDB-suffix
Azure CosmosDB

1 2
3
PowerBI
MDWResources SynapseDataFactory-suffix SynapseDataLakesuffix ADPDatabricks
Power BI Desktop/
Storage Account Azure Data Factory Azure Data Lake Storage Gen2 Azure Databricks
Workspace
1 2 1

2
ADPComputerVision
4 Computer Vision API
3
operationalsql-suffix\NYCDataSets SynapseDataFactory-suffix
Azure SQL Database Azure Data Factory
1

RDP Connection 3 2
or
Azure Bastion
ADPDesktop
Virtual Machine
4
synapsesql-suffix\SynapseDW
Student͛s Azure Synapse Analytics
Computer 4
ADPVirtualNetwork
Virtual Network

© Microsoft Corporation
Real-time Analytics

© Microsoft Corporation
Real-time analytics
Deals with streams of data that are captured in real-time and processed with minimal latency to generate real-
time (or near-real-time) reports or automated responses.

SQL

Modern data warehousing Advanced analytics Real-time analytics

“We want to integrate all our data “We’re trying to predict when “We’re trying to get insights
—including Big Data—with our our customers churn” from our devices in real-time”
data warehouse”

© Microsoft Corporation
Modern Data Platform Concepts
Part V

© Microsoft Corporation
Streaming Use Cases
Retail Financial Oil/Gas & Energy Security
CONSUMER ENGAGEMENT RISK AND REVENUE MANAGEMENT GRID OPS, ASSET OPTIMIZATION ACTIONABLE THREAT INTELLIGENCE

Real-time Pricing Risk and Fraud, Threat


Optimization Industrial IoT Security Intelligence
Detection
• Real-time firewall, network, and auth
• Demand-Elasticity • Real-time anomaly detection • Preventive Maintenance
• Smart Grids and Microgrids log correlation
• Personal Pricing Schemes • Card Monitoring and Fraud
• • Anomaly detection
• Promotion events Detection Asset performance as a Service
• • Security context, enrichment
• Multi-channel engagement • Risk Aggregation UAV image analysis
• Security Orchestration

Healthcare Advertising Media Entertainment


SENSOR DATA RECOMMENDATION ENGINE CONSUMER ENGAGEMENT ANALYSIS

And Much More!

IoT DEVICE Next Best and


Sentiment Analysis
ANALYTICS Personalized Offers
• Aggregation of streaming events • Right product, promotion, at right • Demand-Elasticity
• Predictive Maintenance time • Social Network Analysis
• Anomaly Detection • Real time Ad bidding platform • Promotion events
• Personalized Ad Targeting • Multi-channel Attribution
© Microsoft Corporation
Unlocking Real-time Insights

Time to Insight is Critical


Reducing decision latency can unlock business value

Insights are Perishable


Window of opportunity for insights to be actionable

Ask Questions to Data in Motion


Can’t wait for data to get to rest before running computation

© Microsoft Corporation
Real-time Stream Processing
Simple Event Processing
Filter
Transform
Enrich
Split
Route

Event Stream Processing


[Simple event processing] +
Aggregate
Rules

Complex Event Processing


[Event Stream Processing] +
Pattern detection
Time windows
Joins & correlations

© Microsoft Corporation
Scenario Types
Automation
Dashboarding
Actions by Human Actors
“See and seize” insights
Live visualization
Alerts and alarms
Dynamic aggregation

Enriched Data Movement

Machine to Machine Interactions


Data movement with enrichment
Kick-off workflows for automation

© Microsoft Corporation
Lambda (λ) Architecture
Designed to handle Big Data use cases by taking advantage of both batch and stream-processing methods

1. All data entering the system is dispatched to both


the batch layer and the speed layer for processing.

2. The batch layer has two functions:


I. manage the master dataset (an immutable,
append-only set of raw data)
II. pre-compute the batch views.

3. The serving layer indexes the batch views so that


they can be queried in low-latency, ad-hoc way.

4. The speed layer compensates for the high latency


of updates to the serving layer and deals with recent
data only.

5. Any incoming query can be answered by merging


results from batch views and real-time

© Microsoft Corporation
Event Hubs

© Microsoft Corporation
Event Hubs
Big data streaming platform and event ingestion service capable of receiving and processing millions of events
per second. 

© Microsoft Corporation
Event Hubs Capture
Batch on stream

Policy based push to your own storage


Uses Avro format
Raises Event Grid events – connect to Functions, ACI, or whatever you like
Does not impact throughput
Offloads batch processing from your real-time stream

© Microsoft Corporation
Stream Analytics

© Microsoft Corporation
Stream Analytics
Event-processing engine that allows you to examine high volumes of data streaming from devices

© Microsoft Corporation
Differentiators
1,915 lines of code with Apache Storm
@ApplicationAnnotation(name="WordCountDemo")
Programmer Productivity public class Application implements StreamingApplication
{  
Declarative SQL like language protected String fileName =
"com/datatorrent/demos/wordcount/samplefile.txt";  
Built-in temporal semantics private Locality locality = null;   

@Override  public void populateDAG(DAG dag, Configuration


conf)  
{   

Ease of Getting Started locality = Locality.CONTAINER_LOCAL;     


WordCountInputOperator input =
dag.addOperator("wordinput", new
Integrations with sources, sinks, & ML WordCountInputOperator());    
input.setFileName(fileName);    
Build real-time dashboards in minutes UniqueCounter<String> wordCount =
dag.addOperator("count", new
UniqueCounter<String>());     
dag.addStream("wordinput-count", input.outputPort,

Lowest Total Cost of Ownership(TCO)


wordCount.data).setLocality(locality);     
ConsoleOutputOperator consoleOperator =
dag.addOperator("console", new
Fully managed service ConsoleOutputOperator());    
dag.addStream("count-console",wordCount.count,
No cluster topology management required consoleOperator.input);  
 }  
Seamless scalability }
Usage based pricing 3 lines of SQL in Azure Stream Analytics

SELECT Avg(Purchase), ScoreTollId, Count(*)


FROM GameDataStream
GROUP BY TumblingWindows(5, Minute), Score
© Microsoft Corporation
Stream Analytics Job
Users construct and deploy jobs to Azure Stream Analytics

Job definition includes inputs, a query, and output


Inputs are from where the job reads the data stream
Query runs for perpetuity unless explicitly stopped and transforms the input stream
Output is where the job sends the job results to

© Microsoft Corporation
Example Scenario: Tolling Stations
Tolling stations have multiple booths (identified by TollId)
Each booth has two sensors: Entry and Exit that send out EntryStream and ExitStream respectively

EntryStream – Data stream from the Entry sensor data on vehicles entering toll booths

ExitStream - Data stream from the Exit sensor on vehicles exiting toll booths

ReferenceData - Commercial vehicle registration data

© Microsoft Corporation
Filters and Projections
SELECT VehicleCategory =
Case Type
From the incoming stream find only vehicles that: WHEN 1 THEN ‘Passenger’
• Are from either WA and CA state WHEN 2 THEN ‘Commercial’
ELSE THEN ‘Other’
• Have a weight less than 3000 lbs END,
• Have License plate number end in 999 TollId,
• Have a make that starts with a “M” LicensePlate,
State,
Make,
Display: Model,
• “Passenger” if type = 1 DATEPART(mi,EntryTime) AS ‘Mins’,
DATEPART(ss,EntryTime) AS ‘Seconds’
• “Commercial” if Type = 2
DATEPART(ms,EntryTime) AS ‘Milleseconds’
• “Other” for all other types FROM EntryStream TIMESTAMP BY EntryTime
WHERE (State = ‘CA’ OR State = ‘WA’)
AND Weight < 3000
• Display time as ‘Mins’, ‘Seconds’, ‘Milliseconds’
AND CHARINDEX (‘M’, model) = 0
AND PATINDEX(‘%999’, LicensePlate) = 5

© Microsoft Corporation
Temporal Joins
Report the time in seconds required for vehicles to pass the toll booth

SELECT ES.TollId,
ES.EntryTime,
EX.ExitTime,
ES.EntryTime,
ES.LicensePlate DATEDIFF(minute, ES.EntryTime, EX.ExitTime )
FROM EntryStream ES TIMESTAMP BY EntryTime
JOIN ExitStream EX TIMESTAMP BY ExitTime
ON (EX.TollId= ES.TollId and ES.LicensePlate = EX.LicensePlate)
AND DATEDIFF(minute, ES, EX) BETWEEN 0 AND 15

© Microsoft Corporation
Windowing Concepts

• Operations on the data contained in temporal windows is a common pattern


• Four types of Temporal Windows:
• Sliding
• Tumbling
• Hopping
• Session
• Output at the end of each window
• Windows are fixed length
• Used in a GROUP BY clause

© Microsoft Corporation
Windowing Functions
Sliding Windows and Tumbling Windows

Sliding Windows

Tumbling Windows

© Microsoft Corporation
Windowing Functions
Hopping Windows and Session Windows
Hopping Windows

Session Windows

© Microsoft Corporation
Lab 5: Ingest and Analyse real-time
data with Event Hubs and Stream
Analytics

© Microsoft Corporation
Lab 5
Ingest and Analyse real-time data with Event Hubs and Stream Analytics
Load and Ingest Process
Stream Datasets
λ Lambda Architecture and Real-time
Stream Dashboards
Hot Path
V=Velocity Real Time Business User
IoT devices, sensors, gadgets Analytics
(loosely-typed) Event Hubs Stream Analytics

Cold Path
History and
Trend Analysis

Non-structured Cognitive Services Analytics


V=Variety Power BI Premium
Azure ML
images, video, audio, free text
(no structure)

Build and Score


ML models

Semi-Structured Data Factory Azure Data Lake Gen2 Databricks CosmosDB Application
V=Volume
Scheduled / event-
csv, logs, json, xml
triggered data ingestion
(loosely-typed) Integrate big data
Fast load
data with scenarios with
Polybase/ traditional data
ParquetDirect warehouse

Enterprise-grade
semantic model
Relational Databases
(strongly-typed, structured) Azure Synapse Analytics Power BI Premium Analytics

Store Serve
© Microsoft Corporation
Lab 5
Lab Architecture Azure Data Platform
Resource Group

5
ADPLogicApp 1 ADPEventHubs-suffix 3 SynapseStreamAnalytics-suffix
Logic App Event Hubs Event Hubs

4
2 5 6
ADPCosmosDB-suffix
Azure CosmosDB

1 2
3
PowerBI
MDWResources SynapseDataFactory-suffix SynapseDataLakesuffix ADPDatabricks
Power BI Desktop/
Storage Account Azure Data Factory Azure Data Lake Storage Gen2 Azure Databricks
Workspace
1 2 1

2
ADPComputerVision
4 Computer Vision API
3
operationalsql-suffix\NYCDataSets SynapseDataFactory-suffix
Azure SQL Database Azure Data Factory
1

RDP Connection 3 2
or
Azure Bastion
ADPDesktop
Virtual Machine
4
synapsesql-suffix\SynapseDW
Student͛s Azure Synapse Analytics
Computer 4
ADPVirtualNetwork
Virtual Network

© Microsoft Corporation
It’s all on

© Microsoft Corporation
© Copyright Microsoft Corporation. All rights reserved.

You might also like