
Apache Spark Streaming

+ Kafka 0.10: An Integration Story


Joan Viladrosa, Billy Mobile

#EUstr5
About me

Joan Viladrosa Riera
@joanvr
joanviladrosa
joan.viladrosa@billymob.com

Degree in Computer Science
Advanced Programming Techniques & System Interfaces and Integration

Co-Founder, Educabits
Educational big data solutions using AWS cloud

Big Data Developer, Trovit
Hadoop and MapReduce framework
SEM keywords optimization

Big Data Architect & Tech Lead, Billy Mobile
Full architecture with Hadoop: Kafka, Storm, Hive, HBase, Spark, Druid, …

#EUstr5 2
Apache Kafka

#EUstr5
What is Apache Kafka?

- Publish-subscribe message system

#EUstr5 4
What is Apache Kafka?
What makes it great?

- Publish-subscribe message system
- Fast
- Scalable
- Durable
- Fault-tolerant

#EUstr5 5
What is Apache Kafka?
As a central point

[Diagram: many Producers publish into Kafka; many Consumers read from it]

#EUstr5 6
What is Apache Kafka?
A lot of different connectors

[Diagram: My Java App, Logger, Apache Storm and Apache Spark publish into Kafka; My Java App, Apache Storm, Apache Spark and a Monitoring Tool consume from it]

#EUstr5 7
Kafka Terminology

Topic: A feed of messages

Producer: Processes that publish messages to a topic

Consumer: Processes that subscribe to topics and process the feed of published messages

Broker: Each server of a Kafka cluster that holds, receives and sends the actual data

#EUstr5 8
Kafka Topic Partitions

[Diagram: a topic split into Partition 0, 1 and 2; each partition is an ordered, append-only sequence of messages, with writes appended at the new end (old → new)]

#EUstr5 9
Kafka Topic Partitions

[Diagram: a Producer appends writes to the end of Partition 0, while Consumer A (offset=6) and Consumer B (offset=12) read independently at their own offsets (old → new)]

#EUstr5 10
Kafka Topic Partitions

[Diagram: partitions P0-P8 distributed across Broker 1, Broker 2 and Broker 3, with consumers and producers connecting to all of them]

#EUstr5 11
Kafka Topic Partitions

[Diagram: the same partition layout across the three brokers; more partitions give more storage and more parallelism for consumers and producers]

#EUstr5 12
Kafka Semantics

In short: consumer delivery semantics are up to you, not Kafka.

- Kafka doesn’t store the state of the consumers*
- It just sends you what you ask for (topic, partition, offset, length)
- You have to take care of your state

#EUstr5 13
Apache Kafka Timeline

nov-2012   0.7    Apache Incubator Project
nov-2013   0.8    New Producer
nov-2015   0.9    New Consumer, Security
may-2016   0.10   Kafka Streams

#EUstr5 14
Apache
Spark Streaming

#EUstr5
What is Apache Spark Streaming?

- Process streams of data
- Micro-batching approach

#EUstr5 16
What is Apache Spark Streaming?
What makes it great?

- Process streams of data
- Micro-batching approach
- Same API as Spark
- Same integrations as Spark
- Same guarantees & semantics as Spark

#EUstr5 17
What is Apache Spark Streaming?
Relying on the same Spark Engine: “same syntax” as batch jobs

https://spark.apache.org/docs/latest/streaming-programming-guide.html 18
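Because the API is the same as batch Spark, a minimal streaming job only differs in its entry point. A minimal sketch, with an arbitrary app name and batch interval (not from the talk):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// a StreamingContext that chops the input stream into 5-second micro-batches
val conf = new SparkConf().setAppName("streaming-example")
val ssc = new StreamingContext(conf, Seconds(5))

// any DStream created from ssc (e.g. from Kafka) is then processed with the
// familiar RDD-like API, and the job runs until stopped:
// ssc.start(); ssc.awaitTermination()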
How does it work?
- Discretized Streams

https://spark.apache.org/docs/latest/streaming-programming-guide.html 19
How does it work?
- Discretized Streams

https://spark.apache.org/docs/latest/streaming-programming-guide.html 20
How does it work?

https://databricks.com/blog/2015/07/30/diving-into-apache-spark-streamings-execution-model.html 21
How does it work?

https://databricks.com/blog/2015/07/30/diving-into-apache-spark-streamings-execution-model.html 22
Spark Streaming Semantics
Side effects

As in Spark:
- No guarantee of exactly-once semantics for output actions
- Any side-effecting output operations may be repeated
  - Because of node failure, process failure, etc.

So, be careful when outputting to external sources

#EUstr5 23
Spark Streaming Kafka
Integration

#EUstr5
Spark Streaming Kafka Integration Timeline

sep-2014   1.1   Receivers
dec-2014   1.2   Fault Tolerant WAL + Python API
mar-2015   1.3   Direct Streams + Python API
jun-2015   1.4   Improved Streaming UI
sep-2015   1.5   Metadata in UI (offsets) + Graduated Direct
jan-2016   1.6   —
jul-2016   2.0   Native Kafka 0.10 (experimental)
dec-2016   2.1   —

#EUstr5 25
Kafka Receiver (≤ Spark 1.1)

[Diagram: a Receiver running on the Executor continuously receives data using the Kafka High Level API and updates offsets in ZooKeeper; the Driver launches jobs on the received data]

#EUstr5 26
Kafka Receiver with WAL (Spark 1.2)

[Diagram: as before, the Receiver on the Executor continuously receives data using the High Level API and updates offsets in ZooKeeper, and the Driver launches jobs on the data; in addition, received data is written to a Write Ahead Log on HDFS]

#EUstr5 27
Kafka Receiver with WAL (Spark 1.2)

[Diagram: on the Executor, the Receiver takes the input stream and writes block data both to Spark memory and to the log; on the application Driver, the Streaming Context checkpoints the computation, writes block metadata to the log, and hands jobs to the Spark Context]

#EUstr5 28
Kafka Receiver with WAL (Spark 1.2)

[Diagram, restart: the Restarted Driver recovers the computation from checkpoints and block metadata from the log, relaunching jobs through the Restarted Spark Context; the Restarted Receiver on the Restarted Executor recovers block data from the log while unacked data is resent]

#EUstr5 29
Kafka Receiver with WAL (Spark 1.2)

[Diagram repeated: Receiver with WAL, as in slide 27]

#EUstr5 30
Direct Kafka Integration w/o Receivers or WALs (Spark 1.3)

[Diagram: just a Driver and an Executor, no Receiver involved]

#EUstr5 31
Direct Kafka Integration w/o Receivers or WALs (Spark 1.3)

[Diagram, step 1: the Driver queries the latest offsets and decides the offset ranges for the batch]

#EUstr5 32
Direct Kafka Integration w/o Receivers or WALs (Spark 1.3)

[Diagram, steps 1-2: the Driver queries the latest offsets, decides the offset ranges for the batch (e.g. topic1/p1: 2000-2100, topic1/p2: 2010-2110, topic1/p3: 2002-2102) and launches jobs using those offset ranges]

#EUstr5 33
Direct Kafka Integration w/o Receivers or WALs (Spark 1.3)

[Diagram, steps 1-3 (built up over three animation slides): the Driver queries the latest offsets, decides the offset ranges for the batch and launches jobs using those ranges; the Executors then read the data for their offset ranges directly from Kafka using the Simple API]

#EUstr5 34-36
Direct Kafka Integration w/o Receivers or WALs (Spark 1.3)

[Diagram, recap]
1. Query latest offsets and decide offset ranges for the batch
2. Launch jobs using those offset ranges
3. Read data in the jobs using the offset ranges, via the Simple API

#EUstr5 37
Direct Kafka API benefits

- No WALs or Receivers
- Allows end-to-end exactly-once semantics pipelines*
  * updates to downstream systems should be idempotent or transactional
- More fault-tolerant
- More efficient
- Easier to use

#EUstr5 38
Spark Streaming UI improvements (Spark 1.4) 39
Kafka Metadata (offsets) in UI (Spark 1.5) 40
What about Spark 2.0+ and
new Kafka Integration?
This is why we are here, right?

#EUstr5 41
Spark 2.0+ new Kafka Integration

                             spark-streaming-kafka-0-8   spark-streaming-kafka-0-10
Broker Version               0.8.2.1 or higher           0.10.0 or higher
API Stability                Stable                      Experimental
Language Support             Scala, Java, Python         Scala, Java
Receiver DStream             Yes                         No
Direct DStream               Yes                         Yes
SSL / TLS Support            No                          Yes
Offset Commit API            No                          Yes
Dynamic Topic Subscription   No                          Yes

#EUstr5 42
What’s really new with this new Kafka Integration?

- New Consumer API
  * instead of the Simple API
- Location Strategies
- Consumer Strategies
- SSL / TLS
- No Python API :(

#EUstr5 43
Location Strategies
- The new consumer API will pre-fetch messages into buffers
- So, keep cached consumers on the executors
- It’s better to schedule partitions on the hosts that already have the appropriate consumers

#EUstr5 44
Location Strategies
- PreferConsistent
Distribute partitions evenly across available executors

- PreferBrokers
If your executors are on the same hosts as your Kafka brokers

- PreferFixed
Specify an explicit mapping of partitions to hosts

#EUstr5 45
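As a reference, this is roughly how the three strategies above are picked in code. A minimal sketch using the LocationStrategies object from spark-streaming-kafka-0-10; the topic name and host name are hypothetical:

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.LocationStrategies

// PreferConsistent: the usual choice, spreads partitions evenly over executors
val consistent = LocationStrategies.PreferConsistent

// PreferBrokers: only when executors run on the same hosts as the Kafka brokers
val brokers = LocationStrategies.PreferBrokers

// PreferFixed: pin specific partitions to specific hosts (names are hypothetical)
val fixed = LocationStrategies.PreferFixed(
  Map(new TopicPartition("topicA", 0) -> "executor-host-01"))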
Consumer Strategies
- The new consumer API has a number of different ways to specify topics, some of which require considerable post-object-instantiation setup.
- ConsumerStrategies provides an abstraction that allows Spark to obtain properly configured consumers even after restart from checkpoint.

#EUstr5 46
Consumer Strategies
- Subscribe: subscribe to a fixed collection of topics
- SubscribePattern: use a regex to specify topics of interest
- Assign: specify a fixed collection of partitions

● Overloaded constructors to specify the starting offset for a particular partition
● ConsumerStrategy is a public class that you can extend

#EUstr5 47
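A minimal sketch of the three strategies from the list above, assuming the kafkaParams map defined in the basic-usage example later on; the topic names, regex and starting offset are hypothetical:

import java.util.regex.Pattern
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.ConsumerStrategies

// Subscribe: a fixed collection of topics
val subscribe = ConsumerStrategies.Subscribe[String, String](
  Array("topicA", "topicB"), kafkaParams)

// SubscribePattern: every topic matching a regex
val byPattern = ConsumerStrategies.SubscribePattern[String, String](
  Pattern.compile("topic.*"), kafkaParams)

// Assign: explicit partitions, here with an explicit starting offset for one of them
val partitions = Array(new TopicPartition("topicA", 0), new TopicPartition("topicA", 1))
val offsets = Map(new TopicPartition("topicA", 0) -> 1234L)
val assign = ConsumerStrategies.Assign[String, String](partitions, kafkaParams, offsets)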
SSL/TLS encryption
- The new consumer API supports SSL
- Only applies to communication between Spark and Kafka brokers
- You are still responsible for separately securing Spark inter-node communication

#EUstr5 48
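A sketch of the extra Kafka parameters this typically involves; these are standard Kafka consumer SSL settings, but the paths and passwords are placeholders:

// merged into the regular kafkaParams before calling createDirectStream
val sslParams = Map[String, Object](
  "security.protocol" -> "SSL",
  "ssl.truststore.location" -> "/path/to/kafka.client.truststore.jks",
  "ssl.truststore.password" -> "truststore-password",
  // the keystore entries are only needed if brokers require client authentication
  "ssl.keystore.location" -> "/path/to/kafka.client.keystore.jks",
  "ssl.keystore.password" -> "keystore-password",
  "ssl.key.password" -> "key-password"
)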
How to use the new Kafka Integration on Spark 2.0+
Scala example code: basic usage

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker01:9092,broker02:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "stream_group_id",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val topics = Array("topicA", "topicB")

val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)

stream.map(record => (record.key, record.value))

#EUstr5 49
How to use the new Kafka Integration on Spark 2.0+
Scala example code: getting metadata

import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { iter =>
    val osr: OffsetRange = offsetRanges(TaskContext.get.partitionId)

    // get any needed data from the offset range
    val topic = osr.topic
    val kafkaPartitionId = osr.partition
    val begin = osr.fromOffset
    val end = osr.untilOffset
  }
}

#EUstr5 50
Kafka or Spark RDD Partitions?

[Diagram: with the direct stream, Kafka topic partitions 1-4 map one-to-one to Spark RDD partitions 1-4]

51
Kafka or Spark RDD Partitions?

[Diagram repeated: the one-to-one mapping between Kafka topic partitions and Spark RDD partitions]

52
How to use the new Kafka Integration on Spark 2.0+
Scala example code: getting metadata (same snippet as in slide 50)

#EUstr5 53
How to use the new Kafka Integration on Spark 2.0+
Scala example code: offset commit API (store offsets in Kafka itself)

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // DO YOUR STUFF with DATA

  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}

#EUstr5 54
Kafka + Spark Semantics

- At most once
- At least once
- Exactly once

#EUstr5 55
Kafka + Spark Semantics
At most once

- We don’t want duplicates
- Not worth the hassle of ensuring that messages don’t get lost
- Example: sending statistics over UDP

1. Set spark.task.maxFailures to 1
2. Make sure spark.speculation is false (the default)
3. Set Kafka param auto.offset.reset to “largest”
4. Set Kafka param enable.auto.commit to true

#EUstr5 56
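A minimal config sketch that puts those four settings together; the broker list and group id are placeholders, and note that “largest” is the old consumer’s name for what the 0.10 consumer calls “latest”:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf

val sparkConf = new SparkConf()
  .setAppName("at-most-once-example")
  .set("spark.task.maxFailures", "1")
  .set("spark.speculation", "false") // the default

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker01:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "at_most_once_group",
  "auto.offset.reset" -> "latest",          // "largest" in old-consumer terms
  "enable.auto.commit" -> (true: java.lang.Boolean)
)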
Kafka + Spark Semantics
At most once

- This will mean you lose messages on restart
- At least they shouldn’t get replayed
- Test this carefully if it’s actually important to you that a message never gets repeated, because it’s not a common use case

#EUstr5 57
Kafka + Spark Semantics
At least once

- We don’t want to lose any record
- We don’t care about duplicates
- Example: sending internal alerts on relatively rare occurrences on the stream

1. Set spark.task.maxFailures > 1000
2. Set Kafka param auto.offset.reset to “smallest”
3. Set Kafka param enable.auto.commit to false

#EUstr5 58
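The same kind of sketch for these settings, again with placeholder names; “smallest” is the old consumer’s name for what the 0.10 consumer calls “earliest”:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf

val sparkConf = new SparkConf()
  .setAppName("at-least-once-example")
  .set("spark.task.maxFailures", "1001") // effectively "keep retrying"

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker01:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "at_least_once_group",
  "auto.offset.reset" -> "earliest",        // "smallest" in old-consumer terms
  "enable.auto.commit" -> (false: java.lang.Boolean)
)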
Kafka + Spark Semantics
At least once

- Don’t be silly! Do NOT replay your whole log on every restart…
- Manually commit the offsets when you are 100% sure records are processed

- If this is “too hard”, you’d better have a relatively short retention log
- Or be REALLY ok with duplicates. For example, you are outputting to an external system that handles duplicates for you (HBase)

#EUstr5 59
Kafka + Spark Semantics
Exactly once

- We don’t want to lose any record
- We don’t want duplicates either
- Example: storing the stream in a data warehouse

1. We need some kind of idempotent writes, or whole-or-nothing writes (transactions)
2. Only store offsets EXACTLY after writing the data
3. Same parameters as at least once

#EUstr5 60
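A sketch of that pattern over the direct stream; saveBatchAndOffsets is a hypothetical helper standing in for whatever transactional (or idempotent) write your warehouse supports, and collecting to the driver is only to keep the example short:

import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

def saveBatchAndOffsets(records: Array[(String, String)],
                        offsets: Array[OffsetRange]): Unit = {
  // hypothetical: open a transaction, insert the records,
  // upsert the offsets, and commit both together
}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  val output = rdd.map(record => (record.key, record.value)).collect()

  // data and offsets are stored in one whole-or-nothing write, so a crash
  // can never leave data committed without its offsets (or vice versa)
  saveBatchAndOffsets(output, offsetRanges)
}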
Kafka + Spark Semantics
Exactly once

- Probably the hardest to achieve right
- Still some small chance of failure if your app fails just between writing data and committing offsets… (but REALLY small)

#EUstr5 61
Apache Kafka & Apache Spark at Billy Mobile

15B records monthly
35TB weekly retention log
6K events/second
x4 growth/year

62
Our use cases: ETL to Data Warehouse

- Input events from Kafka
- Enrich events with some external data sources
- Finally store it to Hive

We do NOT want duplicates
We do NOT want to lose events

63
Our use cases: ETL to Data Warehouse

- Hive is not transactional
- Nor does it have idempotent writes
- Writing files to HDFS is “atomic” (whole or nothing)
- A 1:1 relation from each partition-batch to a file in HDFS
- Store to ZK the current state of the batch
- Store to ZK the offsets of the last finished batch

64
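A sketch of the offset-tracking part of this pattern (not the speakers’ actual code); saveToZk is a hypothetical helper standing in for whatever ZooKeeper client you use:

import org.apache.spark.streaming.kafka010.HasOffsetRanges

def saveToZk(path: String, data: String): Unit = {
  // hypothetical: write `data` to the znode at `path` with your ZK client
}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... write the batch to HDFS first (whole-or-nothing) ...

  // then persist the offsets of the finished batch
  val serialized = offsetRanges
    .map(o => s"${o.topic}:${o.partition}:${o.fromOffset}:${o.untilOffset}")
    .mkString(",")
  saveToZk("/myapp/offsets/last_finished_batch", serialized)
}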
Our use cases: Anomalies detector

- Input events from Kafka
- Periodically load a batch-computed model
- Detect when an offer stops converting (or converts too much)

- We do not care about losing some events (on restart)
- We always need to process the “real-time” stream

65
Our use cases: Anomalies detector

- It’s useless to detect anomalies on a lagged stream!
- Actually, it could be very bad

- Always restart the stream at the latest offsets
- Restart with “fresh” state

66
Our use cases: Store to Entity Cache

- Input events from Kafka
- Almost no processing
- Store it to HBase (which has idempotent writes)

- We do not care about duplicates
- We can NOT lose a single event

67
Our use cases: Store to Entity Cache

- Since HBase has idempotent writes, we can write events multiple times without hassle
- But we do NOT start with the earliest offsets…
  - That would be 7 days of redundant writes…!!!
- We store the offsets of the last finished batch
- But obviously we might re-write some events on restart or failure

68
Lessons Learned

- Do NOT use checkpointing
  - Not recoverable across code upgrades
  - Do your own checkpointing

- Track offsets yourself
  - In general, more reliable: HDFS, ZK, RDBMS…

- Memory usually is an issue
  - You don’t want to waste it
  - Adjust batchDuration
  - Adjust maxRatePerPartition

69
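A sketch of where those last two knobs live; the values are arbitrary examples, not recommendations from the talk:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("tuned-streaming-job")
  // cap how many records each Kafka partition contributes per second
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")
  // optionally let backpressure adapt the rate automatically
  .set("spark.streaming.backpressure.enabled", "true")

// batchDuration: the smallest interval your processing can reliably keep up with
val ssc = new StreamingContext(conf, Seconds(10))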
Further Improvements

- Dynamic Allocation
  - spark.dynamicAllocation.enabled vs spark.streaming.dynamicAllocation.enabled
  - https://issues.apache.org/jira/browse/SPARK-12133
  - But no reference in docs...

- Graceful shutdown

- Structured Streaming

70
Thank you very much!
Questions?

@joanvr
joanviladrosa
joan.viladrosa@billymob.com
