Getting Started With Real-Time Analytics With Kafka and Spark in Microsoft Azure - Joe Plumb
Source: https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/real-time-processing
Event vs Message
• It could be argued this is an issue of semantics, as they ‘look’ the same (e.g. a JSON object, CSV, etc.)
• Message is the catch-all term, as messages are just bundles of data
• An event message is a type of message
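To make the distinction concrete, here is a minimal Python sketch: an event is just a message whose payload describes something that happened, stamped with the time it happened. The device id and fields are hypothetical.

```python
import json
from datetime import datetime, timezone

# A hypothetical sensor event. On the wire it is just a message (a
# bundle of data); calling it an "event" reflects its meaning, not
# its shape.
event = {
    "device_id": "sensor-42",      # hypothetical identifier
    "temperature_c": 21.7,
    "event_time": datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc).isoformat(),
}

# The serialised form, e.g. what a queue or event hub actually carries
message = json.dumps(event)
print(message)
```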
Time
It’s all about time: event time vs processing time
• In an ideal world, the processing system receives the event when it happens.
• In reality, the skew between an event happening and the system processing that event can vary wildly.
• The Lambda architecture is increasingly viewed as a workaround, due to advances in the capabilities and reliability of streaming data systems.
Diagram of the (generic) Lambda architecture: https://en.wikipedia.org/wiki/Lambda_architecture#/media/File:Diagram_of_Lambda_Architecture_(generic).png
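The event-time vs processing-time skew can be illustrated with a small sketch: a pure-Python tumbling-window aggregation keyed by event time, so a late-arriving event still lands in the window it belongs to. The events and window width are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical events, each carrying its own event time. They arrive
# in this (processing) order, which differs from event-time order.
events = [
    {"value": 1, "event_time": datetime(2024, 1, 1, 12, 0, 5, tzinfo=timezone.utc)},
    {"value": 2, "event_time": datetime(2024, 1, 1, 12, 1, 10, tzinfo=timezone.utc)},
    {"value": 3, "event_time": datetime(2024, 1, 1, 12, 0, 20, tzinfo=timezone.utc)},  # late arrival
]

def tumbling_window_sums(events, width=timedelta(minutes=1)):
    """Sum values into fixed-width windows keyed by event time."""
    sums = {}
    for e in events:
        ts = e["event_time"].timestamp()
        # Round the timestamp down to the start of its window
        start = datetime.fromtimestamp(ts - ts % width.total_seconds(), tz=timezone.utc)
        sums[start] = sums.get(start, 0) + e["value"]
    return sums
```

Because windows are keyed by event time, the late third event is counted in the 12:00 window even though it was processed after the 12:01 event.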
Service Options in Azure
Event Hubs
• Fully managed PaaS service
• Big data streaming platform and event ingestion service
• Can receive millions of events per second; data sent to an event hub can be transformed and stored using any real-time analytics provider or batching/storage adapters
• Wide range of use cases
• Scalable
• Event Hubs for Apache Kafka (use existing Kafka clients against an Event Hubs endpoint)
• Data can be captured automatically into either Azure Blob Storage or Azure Data Lake Store
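A minimal sketch of sending events to an event hub with the `azure-eventhub` Python SDK. The connection string, hub name, and readings are placeholders you would supply from your own namespace's shared access policy; the SDK import is kept inside the send function so the serialisation helper works without the package installed.

```python
import json

def build_payloads(readings):
    """Serialise raw readings into JSON messages ready for an event hub."""
    return [json.dumps(r) for r in readings]

def send_to_event_hub(conn_str, hub_name, payloads):
    """Send one batch of payloads to Azure Event Hubs.

    Requires `pip install azure-eventhub`; conn_str and hub_name are
    placeholders for your namespace's connection string and hub name.
    """
    from azure.eventhub import EventHubProducerClient, EventData

    producer = EventHubProducerClient.from_connection_string(
        conn_str, eventhub_name=hub_name
    )
    with producer:
        # create_batch() respects the hub's maximum message size
        batch = producer.create_batch()
        for p in payloads:
            batch.add(EventData(p))
        producer.send_batch(batch)
```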
Stream Analytics
• Event-processing engine that lets you examine high volumes of data streaming from devices
• Supports extracting information from data streams and identifying patterns and relationships
• These patterns can then trigger actions downstream, such as creating alerts, feeding information to a reporting tool, or storing it for later use
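As an illustration of the "pattern triggers a downstream action" idea, here is a small Python sketch. The rule (three consecutive readings over a threshold) is hypothetical, chosen only to show raw data being reduced to alerts.

```python
def alerts(readings, threshold=30.0, run_length=3):
    """Emit an alert when `run_length` consecutive readings exceed `threshold`."""
    streak, out = 0, []
    for i, r in enumerate(readings):
        streak = streak + 1 if r > threshold else 0
        if streak == run_length:
            # Downstream action: here we just record a message, but this
            # could feed a reporting tool or notification service instead
            out.append(f"alert: {run_length} readings over {threshold} ending at index {i}")
    return out
```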
Stream Analytics
• Integrates with Azure Event Hubs and IoT Hub
• Connections can be established through the Azure Portal without any coding
• Azure Stream Analytics supports compression
[Diagram: streaming data flowing into Azure Stream Analytics]
*IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight”
Azure HDInsight
Apache Storm on HDInsight
• Apache Storm offered as a managed service on Azure HDInsight
The simplest way to perform streaming analytics is not having to think about streaming at all!
Develop
• Unifies streaming, interactive, and batch queries. Uses a single API for both static bounded data and streaming unbounded data
• Supports streaming aggregations, event-time windows, windowed grouped aggregation, and stream-to-batch joins
• Features streaming deduplication, multiple output modes, and APIs for managing and monitoring streaming queries
• Also supports interactive and batch queries
• Aggregate data in a stream, then serve using JDBC
• Change queries at runtime
• Build and apply machine learning models
• Built-in sources: Kafka, file source (JSON, CSV, text, and Parquet)
• App development in Scala, Java, Python, and R
[Diagram: a pure streaming system (input stream → streaming computation → output sink, with interactions with other systems, e.g. transactions, left to the user) versus a continuous application (input stream, static data, batch jobs, and ad-hoc queries feeding one continuous application whose output sink stays consistent with the batch data). Continuous applications need to interact with batch data, interactive analysis, machine learning…]
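The streaming deduplication mentioned above can be sketched in plain Python. This is an illustration of the idea only (remember the keys you have seen, bounded by a watermark so state cannot grow forever), not Spark's implementation, which would be expressed with `withWatermark` plus `dropDuplicates`.

```python
def deduplicate(events, watermark=60):
    """Drop duplicate event ids within a watermark of the latest event time.

    `events` is a list of (event_id, event_time_seconds) pairs in arrival
    order; values here are hypothetical.
    """
    seen = {}      # event id -> event time, bounded by the watermark
    max_time = 0
    out = []
    for eid, t in events:
        max_time = max(max_time, t)
        # Evict state older than the watermark so memory stays bounded
        seen = {k: v for k, v in seen.items() if v >= max_time - watermark}
        if t >= max_time - watermark and eid not in seen:
            seen[eid] = t
            out.append((eid, t))
    return out
```

Note the trade-off the watermark buys: an id seen again long after eviction is treated as new, which is exactly the bounded-state behaviour a streaming engine accepts in exchange for not keeping unlimited history.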
Demo – Event Hubs and Stream Analytics
What we’re looking at
• A side-by-side comparison of the capabilities and features

| | Apache Kafka (HDInsight) | Azure Event Hubs | Azure IoT Hub |
|---|---|---|---|
| Replication and reliability | Manually configured with tools like MirrorMaker | Relies on underlying Azure Blob Storage | Relies on underlying Azure Blob Storage |
| SLA | 99.9% | 99.9% | 99.9% |
| Throttling | No explicit throttling | Yes, when TU limits are reached | Yes, when IoT Hub Unit limits are reached |
| Message ordering | Yes, ordered within a partition | Yes, ordered within a partition | Yes, within a partition |
| Long-term storage | Can automatically store in Azure Managed Disks; the number of disks has to be explicitly specified during cluster creation | Can automatically store in Azure Blob Storage or Azure Data Lake Store | Can automatically store in Azure Blob Storage (using ABS as an endpoint) |
COMPARING STREAMING ANALYTICS SERVICES (1/2)

| | HDInsight (Apache Storm) | Azure Stream Analytics | Spark Streaming (Azure Databricks) |
|---|---|---|---|
| Open-source | Yes | No | Yes |
| Windowing support | No; needs Trident for tumbling windows | Yes – sliding, hopping, and tumbling | Yes – sliding and tumbling windows |
| Azure ML integration | No built-in support; a trained model can be invoked through custom Storm Bolts | Yes; published Azure Machine Learning models can be configured as functions during job creation | No |
| Kafka integration | Yes, Kafka Spout available | Yes | Kafka connector available |
COMPARING STREAMING ANALYTICS SERVICES (2/2)

| | Apache Storm on HDInsight | Azure Stream Analytics | Spark Structured Streaming (Azure Databricks) |
|---|---|---|---|
| Pricing model | Pay for number and type of nodes in the HDInsight cluster and duration of use | Pay per Streaming Unit (SU) | Pay for number and type of nodes in the cluster and duration of use |
| Scaling model | Upper limit of 1 GB/sec | Limited by number of Streaming Units (SU); each SU = 1 MB/sec, with a max of 50 SUs | Limited by number and type of nodes in the cluster provisioned |
| Input data format | Can be anything – custom code needed to parse | Avro, CSV, JSON | Text, CSV, JSON, Parquet |
| Reference data | No limits on data size; connectors available for HBase, Azure Cosmos DB, Azure SQL DB, and custom sources | Azure Blob Storage only, with a max size of 100 MB in-memory lookup cache | No limits on data size; can be stored in any source supported by Apache Spark |
| Dev experience | Users using .NET can develop, debug, and monitor through Visual Studio | Users can create, debug, and monitor jobs through the Azure portal, using sample data derived from a live stream | Use Azure Databricks Notebooks |
Further reading
Hands-on with Event Hubs and Python:
https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-python