
Getting started with real-time analytics with Kafka and Spark in Microsoft Azure
Joe Plumb
Cloud Solution Architect – Microsoft UK
@joe_plumb
Alternative title: Everything I know about real-time analytics in Microsoft Azure
Agenda
• Fundamentals of streaming data
• What streaming data can be useful for
• What options are there to use data streams in Microsoft Azure?
• Demo
• Q&A
Streaming 101
What is streaming data?
• “Streaming data is data that is continuously generated by different sources.” https://en.wikipedia.org/wiki/Streaming_data
• Streaming system: a type of data processing engine that is designed with infinite datasets in mind. https://learning.oreilly.com/library/view/streaming-systems/9781491983867/ch01.html
Why bother?
• Batch processing can give great insights into things that happened in
the past, but it lacks the ability to answer the question of "what is
happening right now?”
• “Data is valuable only when there is an easy way to process and get
timely insights from data sources.”
Where is streaming data?
• Clickstream data (web clicks, advertising, application usage tracking)
• Sensors (environment monitoring)
• Smart machinery (e.g. production lines)
• GPS
What is it good for?
• Website monitoring
• Network monitoring
• Fraud detection
• Recommendations
Streaming System architecture

Source: https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/real-time-processing
Event vs Message
• It could be argued that this is a matter of semantics, as they ‘look’ the same (e.g. a JSON object, CSV, etc.)
• Message is a catch-all term, as messages are just bundles of data
• An event message is a type of message

“When a subject has an event to announce, it will create an event object, wrap it in a message, and send it on a channel.”
https://www.enterpriseintegrationpatterns.com/patterns/messaging/EventMessage.html
It’s all about time
Cardinality is important because the unbounded nature of infinite datasets imposes additional burdens on the data processing frameworks that consume them.
We need ways to reason about time.

It’s all about time: Event time vs Processing time
• Event time: the time the event occurs
• Processing time: the time the system becomes aware of the event
• In an ideal world, the processing system receives the event when it happens.
• In reality, the skew between an event happening and the system processing that event can vary wildly.
• Processing-time lag is the difference between observed time and processing time.
• Event-time skew is how far behind the processing pipeline is at that moment.
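The two timestamps can be compared directly. A minimal Python sketch of the lag between event time and processing time (the timestamps below are made up for illustration):

```python
from datetime import datetime, timedelta

def processing_lag(event_time: datetime, processing_time: datetime) -> timedelta:
    """Time between an event occurring and the system observing it."""
    return processing_time - event_time

# A click that happened at 12:00:00 but only reached the pipeline at 12:00:07
event = datetime(2019, 3, 1, 12, 0, 0)
observed = datetime(2019, 3, 1, 12, 0, 7)
lag = processing_lag(event, observed)  # 7 seconds of skew
```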
It’s all about time: Watermarking
• An event time marker that indicates all events up to “a point” have
been fed to the streaming processor. By the nature of streams, the
incoming event data never stops, so watermarks indicate the progress
to a certain point in the stream.
• Watermarks can either be a strict guarantee (perfect watermark) or
an educated guess (heuristic watermark)
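A heuristic watermark can be sketched in plain Python as “the largest event time seen so far, minus an allowed-lateness bound”. This toy class (not any particular engine’s API) is enough to classify events as on time or late:

```python
from datetime import datetime, timedelta

class HeuristicWatermark:
    """Toy heuristic watermark: max event time observed so far,
    minus an allowed-lateness bound. Events behind the watermark
    when they arrive are considered late."""

    def __init__(self, allowed_lateness: timedelta):
        self.allowed_lateness = allowed_lateness
        self.max_event_time = datetime.min

    def current(self) -> datetime:
        # Before any event is seen, the watermark is at the beginning of time.
        if self.max_event_time == datetime.min:
            return datetime.min
        return self.max_event_time - self.allowed_lateness

    def observe(self, event_time: datetime) -> bool:
        """Record an event; return True if it was on time, False if late."""
        on_time = event_time >= self.current()
        self.max_event_time = max(self.max_event_time, event_time)
        return on_time

base = datetime(2019, 3, 1, 12, 0, 0)
wm = HeuristicWatermark(allowed_lateness=timedelta(seconds=10))
wm.observe(base)                          # on time
wm.observe(base + timedelta(seconds=30))  # on time; watermark advances
wm.observe(base + timedelta(seconds=5))   # now behind the watermark: late
```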
It’s all about time: Windowing
Tumbling windows
Hopping windows
Sliding windows
Session Windows
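The window types above differ only in how an event’s timestamp maps to window boundaries. As a sketch, tumbling-window assignment (fixed-size, non-overlapping; hopping windows are the generalisation where the hop is smaller than the window size) looks like:

```python
from datetime import datetime, timedelta

def tumbling_window(event_time: datetime, size: timedelta):
    """Return the (start, end) of the fixed, non-overlapping window
    that contains event_time, with windows aligned to the epoch."""
    epoch = datetime(1970, 1, 1)
    offset = (event_time - epoch) % size   # how far into its window the event is
    start = event_time - offset
    return (start, start + size)

# An event at 12:00:42 falls into the [12:00:00, 12:01:00) one-minute window
window = tumbling_window(datetime(2019, 3, 1, 12, 0, 42), timedelta(minutes=1))
```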
It’s not just about time: Triggers
• Triggers determine when processing of the accumulated data starts.
• Repeated update triggers periodically generate updated panes for a window as its contents evolve.
• Completeness triggers materialize a pane for a window only after the input for that window is believed to be complete to some threshold.
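A repeated update trigger can be illustrated with a toy aggregation that snapshots an updated pane every N input events (the element-count policy here is purely for illustration; real engines also trigger on processing time):

```python
def run_with_trigger(events, emit_every):
    """Repeated-update-trigger sketch: re-emit the running aggregate
    (a count per key) every `emit_every` events as the window's
    contents evolve."""
    counts, panes = {}, []
    for i, key in enumerate(events, start=1):
        counts[key] = counts.get(key, 0) + 1
        if i % emit_every == 0:
            panes.append(dict(counts))  # snapshot an updated pane
    return panes

# Two panes are emitted: one after 2 events, one after 4
panes = run_with_trigger(["a", "b", "a", "a"], emit_every=2)
```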
Delivery Guarantees
• At-most-once
• means that for each message handed to the mechanism, that message is
delivered zero or one times; in more casual terms it means that messages
may be lost.
• At-least-once
• means that for each message handed to the mechanism potentially multiple
attempts are made at delivering it, such that at least one succeeds; again, in
more casual terms this means that messages may be duplicated but not lost.
• Exactly-once
• means that for each message handed to the mechanism exactly one delivery
is made to the recipient; the message can neither be lost nor duplicated.
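In practice, which guarantee you get is largely a function of producer configuration. As a sketch using kafka-python’s parameter names (the broker address is a placeholder), the first two guarantees map roughly to:

```python
# Producer settings that would be passed to kafka-python's
# KafkaProducer(**config). Broker address is a placeholder.
at_most_once = {
    "bootstrap_servers": "broker:9092",
    "acks": 0,      # fire and forget: messages may be lost, never duplicated
    "retries": 0,
}

at_least_once = {
    "bootstrap_servers": "broker:9092",
    "acks": "all",  # wait for all in-sync replicas to acknowledge
    "retries": 5,   # retrying may duplicate a message, but none are lost
}

# Exactly-once additionally requires an idempotent/transactional producer,
# which kafka-python itself does not provide.
```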
Streaming + Batch?
• Lambda architecture
• Increasingly viewed as a workaround, due to advances in the capabilities and reliability of streaming data systems

Image: By Textractor (own work), CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=34963985
https://en.wikipedia.org/wiki/Lambda_architecture#/media/File:Diagram_of_Lambda_Architecture_(generic).png
Service Options
in Azure
Event Hubs
• Fully-managed PaaS service
• Big data streaming platform and event ingestion service.
• It can receive millions of events per second. Data sent to an event hub can be
transformed and stored by using any real-time analytics provider or
batching/storage adapters.
• Wide range of use cases
• Scalable
• Event Hubs for Apache Kafka: a Kafka-compatible endpoint
• Data can be captured automatically in either Azure Blob Storage or Azure
Data Lake Store
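Because the Kafka endpoint speaks the ordinary Kafka protocol, a kafka-python client only needs connection settings pointed at the Event Hubs namespace. A sketch of those settings (the namespace and connection string below are placeholders; the dict would be passed as `KafkaProducer(**config)`):

```python
# Connection settings for Event Hubs' Kafka-compatible endpoint.
# Event Hubs authenticates Kafka clients with SASL PLAIN over TLS,
# using "$ConnectionString" as the username and the namespace
# connection string as the password.
config = {
    "bootstrap_servers": "mynamespace.servicebus.windows.net:9093",
    "security_protocol": "SASL_SSL",
    "sasl_mechanism": "PLAIN",
    "sasl_plain_username": "$ConnectionString",
    "sasl_plain_password": "Endpoint=sb://mynamespace.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=...",
}
```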
Stream Analytics
• Event-processing engine that allows you to examine high volumes of data streaming from devices.
• Supports extracting information from data streams and identifying patterns and relationships.
• These patterns can then trigger other actions downstream, such as creating alerts, feeding information to a reporting tool, or storing it for later use.
Integration with Azure Event Hubs and IoT Hub
• Azure Stream Analytics has built-in, first-class integration with Azure Event Hubs and IoT Hub
• Data from Azure Event Hubs and Azure IoT Hub can be sources of streaming data to Azure Stream Analytics
• The connections can be established through the Azure Portal without any coding
• Azure Blob Storage is supported as a source of reference data
• Azure Stream Analytics supports compression across all data stream input sources: Event Hubs, IoT Hub, and Blob Storage

[Diagram: streaming data from Azure Event Hubs and Azure IoT Hub, plus reference data from Azure Blob Storage, flows into Azure Stream Analytics]
Azure HDInsight
Cloud Spark and Hadoop service for the Enterprise
• Fully-managed Hadoop and Spark for the cloud, with a 99.9% SLA
• 100% open-source Hortonworks Data Platform
• Clusters up and running in minutes
• Familiar BI tools, interactive open-source notebooks
• 63% lower TCO than deploying your own Hadoop on-premises*
• Scale clusters on demand
• Secure Hadoop workloads via Active Directory and Ranger
• Compliance for open-source bits
• Best-in-class monitoring and predictive operations via OMS
• Native integration with leading ISVs

*IDC study, “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight”
Azure HDInsight
Apache Storm on HDInsight
Apache Storm offered as a managed service on Azure HDInsight
• Scalable: can analyse millions of events per second
• One of seven HDInsight cluster types
• Dynamically scale up and scale down
• Integrates with Event Hubs
• SLA of 99.9 percent uptime
• Develop with Visual Studio using Java or C#
Azure Databricks
• Apache Spark-based analytics platform optimized for Microsoft Azure. Designed with the founders of Apache Spark, Databricks is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive workspace for analytics.
Spark structured streaming overview
A unified system for end-to-end fault-tolerant, exactly-once, stateful stream processing.
The simplest way to perform streaming analytics is not having to think about streaming at all!

Develop:
• Unifies streaming, interactive, and batch queries: a single API for both static bounded data and streaming unbounded data
• Supports streaming aggregations, event-time windows, windowed grouped aggregation, and stream-to-batch joins
• Features streaming deduplication, multiple output modes, and APIs for managing and monitoring streaming queries
• Also supports interactive and batch queries: aggregate data in a stream, then serve using JDBC; change queries at runtime; build and apply machine learning models
• Built-in sources: Kafka, file source (JSON, CSV, text, and Parquet)
• App development in Scala, Java, Python, and R

Pure streaming system vs continuous application:
• A pure streaming system runs a streaming computation from an input stream to an output sink, leaving interactions with other systems (and often the sink transactions) to the user
• A continuous application keeps the continuous stream computation consistent with the batch jobs, ad-hoc queries, and static data it needs to interact with: batch data, interactive analysis, machine learning, and more
Demo – Event Hubs and Stream Analytics
What we’re looking at: a Python Flask app (using kafka-python) sends events to a Kafka-enabled Event Hub; Stream Analytics applies a simple tumbling window and outputs to Power BI, all running in Azure.
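As an illustration of the first hop in that pipeline, here is a payload builder of the kind such a Flask app might use (the field names are hypothetical, not taken from the actual demo):

```python
import json
import time
import uuid

def make_click_event(page: str, user_id: str) -> bytes:
    """Build a JSON clickstream event of the shape a Flask app might
    publish to Event Hubs via kafka-python. Field names are illustrative."""
    event = {
        "id": str(uuid.uuid4()),
        "user": user_id,
        "page": page,
        "event_time": time.time(),  # event time, stamped at the source
    }
    return json.dumps(event).encode("utf-8")

# The bytes returned here would be passed to KafkaProducer.send(topic, value)
payload = make_click_event("/home", "user-42")
```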
So… what do I use?
INGESTION SERVICES: A COMPARISON
A side-by-side comparison of the capabilities and features:

| | HDInsight (Apache Kafka) | Azure Event Hubs | Azure IoT Hub |
| Open-source | Yes | No | No |
| Serverless service | No | Yes | Yes |
| Hybrid (cloud and on-prem) | Yes | No | No |
| Protocols supported | HTTP REST | AMQP, HTTP REST | MQTT, AMQP, HTTPS, and the Azure IoT protocol gateway for custom protocols |
| Replication and reliability | Manually configured with tools like MirrorMaker | Relies on underlying Azure Blob Storage | Relies on underlying Azure Blob Storage |
| SLA | 99.9% | 99.9% | 99.9% |
| Scaling | Limited by number and type of nodes in the HDInsight cluster provisioned | Limited by number of Throughput Units | Limited by number of IoT Hub Units |
| Throttling | No explicit throttling | Yes, when TU limits are reached | Yes, when IoT Hub Unit limits are reached |
| Message size | No limits | 1 MB | 256 KB |
| Message ordering | Yes, ordered within a partition | Yes, ordered within a partition | Yes, within a partition |
| Long-term storage | Can automatically store in Azure Managed Disks; the number of disks has to be specified at cluster creation | Can automatically store in Azure Blob Storage or Azure Data Lake Store | Can automatically store in Azure Blob Storage (using ABS as an endpoint) |
COMPARING STREAMING ANALYTICS SERVICES (1/2)
A side-by-side comparison of the capabilities and features:

| | HDInsight (Apache Storm) | Azure Stream Analytics | Spark Streaming (Azure Databricks) |
| Open-source | Yes | No | Yes |
| Serverless service | No | Yes | No |
| Hybrid (cloud and on-prem) | Yes | No | Yes |
| Exactly-once processing | No (cannot distinguish between new events and replays) | Yes | Yes |
| SQL as query language | No | Yes | Yes |
| Unified programming model | No | No | Yes, combines batch, interactive, machine learning, and streaming |
| Extensibility | Yes, custom code in Java, C#, etc. | No (partial support with JavaScript UDFs) | Yes, custom code in Java, Scala, Python |
| Windowing support | No, needs Trident for tumbling windows | Yes: sliding, hopping, and tumbling windows | Yes: sliding and tumbling windows |
| Azure ML integration | No built-in support; a trained model can be invoked through custom Storm bolts | Yes; published Azure Machine Learning models can be configured as functions during job creation | No |
| Kafka integration | Yes, Kafka spout available | Yes | Kafka connector available |
COMPARING STREAMING ANALYTICS SERVICES (2/2)
A side-by-side comparison of the capabilities and features:

| | Apache Storm on HDInsight | Azure Stream Analytics | Spark Structured Streaming (Azure Databricks) |
| Pricing model | Pay for number and type of nodes in the HDInsight cluster and duration of use | Pay per Streaming Unit (SU) | Pay for number and type of nodes in the cluster and duration of use |
| Scaling model | Limited by number and type of nodes in the cluster provisioned | Limited by number of Streaming Units (SU); each SU = 1 MB/sec, with a maximum of 50 SUs and an upper limit of 1 GB/sec | Limited by number and type of nodes in the cluster provisioned |
| Input data format | Can be anything; custom code needed to parse | Avro, CSV, JSON | Text, CSV, JSON, Parquet |
| Input data sources | Can be anything, but needs custom code | Azure Event Hubs, Azure Blob Storage, Azure IoT Hub | File source, Kafka, Socket (for testing) |
| Output data sinks | Azure Event Hubs, Azure Blob Storage, Azure Tables, Azure Cosmos DB, Azure SQL DB, Power BI, Azure Data Lake Store, HBase, custom | Azure Event Hubs, Azure Blob Storage, Azure Tables, Azure Cosmos DB, Azure SQL DB, Power BI, Azure Data Lake Store | Console, Kafka, Memory, ForEachSink |
| Reference data | No limits on data size; connectors available for HBase, Azure Cosmos DB, Azure SQL DB, and custom sources | Azure Blob Storage only, with a maximum size of 100 MB in-memory lookup cache | No limits on data size; can be stored in any source supported by Apache Spark |
| Dev experience | Users using .NET can develop, debug, and monitor through Visual Studio | Users can create, debug, and monitor jobs through the Azure portal, using sample data derived from a live stream | Use Azure Databricks notebooks |
Further reading
Hands-on with Event Hubs and Python: https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-python
Hands-on with streaming ETL with Azure Databricks: https://medium.com/microsoftazure/an-introduction-to-streaming-etl-on-azure-databricks-using-structured-streaming-databricks-16b369d77e34
Choosing the right service(s) for your use case: https://docs.microsoft.com/en-us/azure/architecture/data-guide/technology-choices/stream-processing
Further reading
http://shop.oreilly.com/product/0636920073994.do
Questions?
We'd love your feedback!
aka.ms/SQLBits19
Thanks!
Joe Plumb
Cloud Solution Architect – Microsoft UK
@joe_plumb
