Spark Kafka Integration
About me
- Degree in Computer Science
- Advanced Programming Techniques & System Interfaces and Integration
Apache Kafka
What is Apache Kafka?
- Publish-Subscribe Message System
[Diagram: producers such as "My Java App" and a Logger publish to Kafka; Apache Storm and Apache Spark consume from it]
Kafka Terminology
- Topic: a feed of messages
- Producer: processes that publish messages to a topic
[Diagram: a topic with three partitions; each partition is an ordered sequence of messages, and the producer appends writes at the new end, from old to new]
Kafka Topic Partitions
[Diagram: a producer appends writes at the new end of a partition while Consumer A (offset=6) and Consumer B (offset=12) read independently at their own positions, from old to new]
Kafka Topic Partitions
[Diagram: nine partitions (P0–P8) spread across three brokers, with consumers & producers talking to all of them]
Kafka Topic Partitions
[Diagram: the same layout; more brokers give more storage, more partitions give more parallelism for consumers & producers]
Kafka Semantics
In short: consumer delivery semantics are up to you, not Kafka.
- Kafka doesn't store the state of the consumers*
- It just sends you what you ask for (topic, partition, offset, length)
- You have to take care of your own state
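As a minimal sketch of "it just sends you what you ask for": with the plain Kafka consumer you assign a partition, seek to an offset you tracked yourself, and poll. The broker address, topic, and offset below are placeholders, not values from the talk.

import java.util.{Collections, Properties}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

val props = new Properties()
props.put("bootstrap.servers", "broker01:9092") // placeholder
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("enable.auto.commit", "false") // we manage our own state

val consumer = new KafkaConsumer[String, String](props)
val tp = new TopicPartition("topic1", 0)
consumer.assign(Collections.singletonList(tp))
consumer.seek(tp, 2000L) // an offset *we* stored somewhere ourselves
val records = consumer.poll(500L) // poll(Duration) in newer clients
for (r <- records.asScala) {
  // process r, then persist r.offset + 1 as our new position
}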
Apache Kafka Timeline
Apache Spark Streaming
What is Apache Spark Streaming?
- Process streams of data
- Micro-batching approach

What makes it great?
- Same API as Spark
- Same integrations as Spark
- Same guarantees & semantics as Spark
What is Apache Spark Streaming?
Relying on the same Spark Engine: "same syntax" as batch jobs
https://spark.apache.org/docs/latest/streaming-programming-guide.html
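To give that "same syntax" a concrete shape, here is a minimal, self-contained streaming word count; the socket source, host/port, and batch interval are arbitrary choices for the sketch.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
    val ssc = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches
    val lines = ssc.socketTextStream("localhost", 9999) // arbitrary source
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _) // the same transformations you would use on an RDD
    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}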
How does it work?
- Discretized Streams (DStreams): the input stream is chopped into micro-batches, each processed by the Spark engine as an RDD
https://spark.apache.org/docs/latest/streaming-programming-guide.html
How does it work?
[Execution-model diagrams from the Databricks blog post]
https://databricks.com/blog/2015/07/30/diving-into-apache-spark-streamings-execution-model.html
Spark Streaming Semantics: Side Effects
As in Spark:
- No guarantee of exactly-once semantics for output actions
- Any side-effecting output operation may be repeated
- Because of node failure, process failure, etc.
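One common defense (a sketch, not code from the talk): make the output idempotent, so a replayed task rewrites the same keyed rows instead of appending duplicates. This assumes a DStream of (id, payload) pairs named events and a hypothetical Postgres sink; the JDBC URL and table are placeholders.

events.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // A keyed upsert makes replays harmless: a retried task simply
    // overwrites the same rows.
    val conn = java.sql.DriverManager.getConnection("jdbc:postgresql://db01/events")
    val stmt = conn.prepareStatement(
      """INSERT INTO events (id, payload) VALUES (?, ?)
        |ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload""".stripMargin)
    records.foreach { case (id, payload) =>
      stmt.setString(1, id); stmt.setString(2, payload); stmt.executeUpdate()
    }
    stmt.close(); conn.close()
  }
}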
Spark Streaming Kafka Integration Timeline
[Timeline: sep-2014, dec-2014, mar-2015, jun-2015, sep-2015, jan-2016, jul-2016, dec-2016]
Kafka Receiver (≤ Spark 1.1)
[Diagram: a Receiver on the Executor continuously receives data using the Kafka High Level API and updates offsets in ZooKeeper; the Driver launches jobs on the received data]
Kafka Receiver with WAL (Spark 1.2)
[Diagram: as before, but the Executor also writes received data to a Write Ahead Log (WAL) on HDFS]
Kafka Receiver with WAL (Spark 1.2)
[Diagram: in the Driver, the StreamingContext and SparkContext checkpoint the computation, and block metadata and jobs are written to the log; in the Executor, the Receiver takes the input stream and writes block data to both Spark memory and the log]
Direct Kafka Integration w/o Receivers or WALs (Spark 1.3)
[Diagram, built up across several slides:
1. The Driver queries the latest offsets and decides offset ranges for the batch,
   e.g. topic1 p1: (2000, 2100); topic1 p2: (2010, 2110); topic1 p3: (2002, 2102)
2. The Driver launches jobs using those offset ranges
3. The Executors read data using the offset ranges in the jobs, using the Simple API]
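For reference, a minimal sketch of this 1.3-era direct stream (the spark-streaming-kafka-0-8 artifact), assuming an existing StreamingContext ssc; the broker list and topic are placeholders.

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map[String, String](
  "metadata.broker.list" -> "broker01:9092,broker02:9092") // placeholder

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("topic1"))
// Each batch's RDD partitions map 1:1 to Kafka (topic, partition, offset range)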
Direct Kafka API benefits
- No WALs or Receivers
- Allows end-to-end exactly-once semantics pipelines*
  * updates to downstream systems should be idempotent or transactional
- More fault-tolerant
- More efficient
- Easier to use
Spark Streaming UI improvements (Spark 1.4)
Kafka Metadata (offsets) in UI (Spark 1.5)
What about Spark 2.0+ and the new Kafka Integration?
This is why we are here, right?
Spark 2.0+ new Kafka Integration
[Comparison: spark-streaming-kafka-0-8 vs spark-streaming-kafka-0-10]
What's really new with this new Kafka Integration?
- New Consumer API (instead of the Simple API)
- Location Strategies
- Consumer Strategies
- SSL / TLS
- No Python API :(
Location Strategies
- The new consumer API will pre-fetch messages into buffers
- So, keep cached consumers on the executors
- It's better to schedule partitions on hosts that already have the appropriate consumers
Location Strategies
- PreferConsistent: distribute partitions evenly across available executors
- PreferBrokers: use if your executors are on the same hosts as your Kafka brokers
- PreferFixed: specify an explicit mapping of partitions to hosts
(see the sketch below)
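A sketch of how the three strategies are expressed in the 0-10 API; the topic and host names in PreferFixed are placeholders.

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.LocationStrategies

// The common default:
val consistent = LocationStrategies.PreferConsistent
// Only when executors are co-located with brokers:
val brokers = LocationStrategies.PreferBrokers
// Pin specific partitions to specific hosts (placeholders):
val fixed = LocationStrategies.PreferFixed(Map(
  new TopicPartition("topic1", 0) -> "host01",
  new TopicPartition("topic1", 1) -> "host02"))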
Consumer Strategies
- The new consumer API has a number of different ways to specify topics, some of which require considerable post-object-instantiation setup.
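The 0-10 API wraps those under ConsumerStrategies; a sketch of the three main ones, with topic names and offsets as placeholders and kafkaParams as defined a couple of slides later.

import java.util.regex.Pattern
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.ConsumerStrategies

// Subscribe to a fixed list of topics:
val subscribe = ConsumerStrategies.Subscribe[String, String](Seq("topic1"), kafkaParams)
// Subscribe to every topic matching a regex:
val byPattern = ConsumerStrategies.SubscribePattern[String, String](
  Pattern.compile("topic.*"), kafkaParams)
// Assign fixed partitions, optionally with explicit starting offsets:
val assign = ConsumerStrategies.Assign[String, String](
  Seq(new TopicPartition("topic1", 0)), kafkaParams,
  Map(new TopicPartition("topic1", 0) -> 2000L))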
SSL/TLS encryption
- The new consumer API supports SSL
- It only applies to communication between Spark and the Kafka brokers
- You are still responsible for separately securing Spark inter-node communication
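SSL is switched on through the regular Kafka consumer configs passed in kafkaParams; the paths and passwords below are placeholders.

val sslParams = Map[String, Object](
  "security.protocol" -> "SSL",
  "ssl.truststore.location" -> "/path/to/truststore.jks", // placeholder
  "ssl.truststore.password" -> "********",
  "ssl.keystore.location" -> "/path/to/keystore.jks",     // placeholder
  "ssl.keystore.password" -> "********",
  "ssl.key.password" -> "********"
)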
How to use New Kafka Integration on Spark 2.0+

import org.apache.kafka.common.serialization.StringDeserializer

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker01:9092,broker02:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "stream_group_id",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)
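With those params, creating the stream looks roughly like this (a sketch assuming an existing StreamingContext ssc; the topic name is a placeholder):

import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("topic1"), kafkaParams)
)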
How to use New Kafka Integration on Spark 2.0+

import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { iter =>
    // RDD partition i maps 1:1 to Kafka partition i (next slides), so the
    // task's partition id indexes this partition's offset range:
    val osr: OffsetRange = offsetRanges(TaskContext.get.partitionId)
  }
}
Kafka or Spark RDD Partitions?
[Diagram: Kafka topic partitions 1–4 map 1:1 to Spark RDD partitions 1–4]
How to use New Kafka Integration on Spark 2.0+
Store offsets in Kafka itself: the Commit API

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // DO YOUR STUFF with DATA

  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}

(the docs include equivalent Java example code)
Kafka + Spark Semantics
- At most once
- At least once
- Exactly once

Kafka + Spark Semantics: At most once
1. Set spark.task.maxFailures to 1
2. Make sure spark.speculation is false (the default)
3. Set the Kafka param auto.offset.reset to "largest"
4. Set the Kafka param enable.auto.commit to true
(see the sketch below)
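As a config sketch, those four settings might look like this; note that "largest" is the old 0.8 consumer's name for what the new consumer calls "latest".

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.task.maxFailures", "1") // don't retry failed tasks
  .set("spark.speculation", "false")  // the default anyway

val kafkaParams = Map[String, Object](
  "auto.offset.reset" -> "latest",                  // "largest" in the old consumer
  "enable.auto.commit" -> (true: java.lang.Boolean) // offsets advance even if processing fails
)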
Kafka + Spark Semantics: At most once
- This will mean you lose messages on restart
- At least they shouldn't get replayed
- Test this carefully if it's actually important to you that a message never gets repeated, because it's not a common use case
Kafka + Spark Semantics: At least once
- We don't want to lose any record
- We don't care about duplicates
- Example: sending internal alerts on relatively rare occurrences on the stream
Kafka + Spark Semantics: At least once
- Don't be silly! Do NOT replay your whole log on every restart…
- Manually commit the offsets when you are 100% sure records are processed
Kafka + Spark Semantics: Exactly once
- We don't want to lose any record
- We don't want duplicates either
- Example: storing the stream in a data warehouse
Kafka + Spark Semantics: Exactly once
- Probably the hardest to achieve right
- Still some small chance of failure if your app fails just between writing data and committing offsets… (but REALLY small)
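The usual way to close that gap (sketched here, not taken from the talk) is to write results and offsets in the same transaction, assuming a stream of (id, payload) pairs and a hypothetical Postgres sink; the JDBC URL and tables are placeholders.

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { records =>
    val conn = java.sql.DriverManager.getConnection("jdbc:postgresql://db01/dwh")
    conn.setAutoCommit(false)
    try {
      val insert = conn.prepareStatement("INSERT INTO events (id, payload) VALUES (?, ?)")
      records.foreach { case (id, payload) =>
        insert.setString(1, id); insert.setString(2, payload); insert.executeUpdate()
      }
      // Store this partition's offset in the same transaction as the data:
      val osr = offsetRanges(org.apache.spark.TaskContext.get.partitionId)
      val upsert = conn.prepareStatement(
        """INSERT INTO offsets (topic, part, until) VALUES (?, ?, ?)
          |ON CONFLICT (topic, part) DO UPDATE SET until = EXCLUDED.until""".stripMargin)
      upsert.setString(1, osr.topic); upsert.setInt(2, osr.partition)
      upsert.setLong(3, osr.untilOffset)
      upsert.executeUpdate()
      conn.commit() // data + offsets land atomically, or not at all
    } finally conn.close()
  }
}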
Apache Kafka + Apache Spark at Billy Mobile
- 15B records monthly
- 35TB weekly retention log
- 6K events/second
- x4 growth/year
Our use cases: ETL to Data Warehouse
- Input events from Kafka
- Enrich events with some external data sources
- Finally store it to Hive
Our use cases: ETL to Data Warehouse
- Hive is not transactional
- No idempotent writes either
- Writing files to HDFS is "atomic" (all or nothing)
- A 1:1 relation from each partition-batch to a file in HDFS
- Store the current state of the batch to ZK
- Store the offsets of the last finished batch to ZK
Our use cases: Anomalies detector
- Input events from Kafka
- Periodically load a batch-computed model
- Detect when an offer stops converting (or converts too much)
Our use cases: Anomalies detector
- It's useless to detect anomalies on a lagged stream!
- Actually, it could be very bad
Our use cases: Store to Entity Cache
- Input events from Kafka
- Almost no processing
- Store it to HBase (which has idempotent writes)
Our use cases: Store to Entity Cache
- Since HBase has idempotent writes, we can write events multiple times without hassle
- But we do NOT start with the earliest offsets…
- That would be 7 days of redundant writes!
Lessons Learned
- Do NOT use checkpointing
  - Not recoverable across code upgrades
- Do your own checkpointing (a sketch follows)
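A minimal sketch of "do your own checkpointing": persist the offset ranges to ZooKeeper after each finished batch and read them back on startup. The Curator client, znode path, and serialization format are assumptions, not the talk's actual code.

import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.OffsetRange

object OffsetStore {
  private val zk = CuratorFrameworkFactory.newClient(
    "zk01:2181", new ExponentialBackoffRetry(1000, 3)) // placeholder ensemble
  zk.start()

  // Save "topic,partition,untilOffset;..." under one znode per job.
  def save(path: String, ranges: Array[OffsetRange]): Unit = {
    val data = ranges
      .map(r => s"${r.topic},${r.partition},${r.untilOffset}")
      .mkString(";").getBytes("UTF-8")
    if (zk.checkExists().forPath(path) == null)
      zk.create().creatingParentsIfNeeded().forPath(path, data)
    else
      zk.setData().forPath(path, data)
  }

  // Load the offsets of the last finished batch (empty on first run).
  def load(path: String): Map[TopicPartition, Long] =
    if (zk.checkExists().forPath(path) == null) Map.empty
    else new String(zk.getData().forPath(path), "UTF-8").split(";").map { s =>
      val Array(t, p, o) = s.split(",")
      new TopicPartition(t, p.toInt) -> o.toLong
    }.toMap
}

On startup, the loaded map can be passed as the starting offsets of ConsumerStrategies.Subscribe(topics, kafkaParams, offsets), so the stream resumes exactly after the last finished batch.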
Further Improvements
- Dynamic Allocation: spark.dynamicAllocation.enabled vs
  spark.streaming.dynamicAllocation.enabled
  https://issues.apache.org/jira/browse/SPARK-12133
  (but no reference in the docs...)
- Graceful shutdown
- Structured Streaming
Thank you very much!
Questions?
@joanvr
joanviladrosa
joan.viladrosa@billymob.com