
Apache Spark Streaming

+ Kafka 0.10: An Integration Story


Joan Viladrosa, Billy Mobile

#EUstr5
About me

Joan Viladrosa Riera
@joanvr
joanviladrosa
joan.viladrosa@billymob.com

Degree in Computer Science
Advanced Programming Techniques & System Interfaces and Integration

Co-Founder, Educabits
Educational big data solutions using AWS cloud

Big Data Developer, Trovit
Hadoop and MapReduce framework
SEM keywords optimization

Big Data Architect & Tech Lead, Billy Mobile
Full architecture with Hadoop: Kafka, Storm, Hive, HBase, Spark, Druid, …

#EUstr5 2
Apache Kafka

#EUstr5
What is Apache Kafka?

- Publish-subscribe message system

#EUstr5 4
What is Apache Kafka?
What makes it great?

- Publish-subscribe message system
- Fast
- Scalable
- Durable
- Fault-tolerant

#EUstr5 5
What is Apache Kafka?
As a central point

[Diagram: many Producers publish into Kafka; many Consumers read from it]

#EUstr5 6
What is Apache Kafka?
A lot of different connectors

[Diagram: My Java App, Logger, Apache Storm and Apache Spark publish into Kafka; My Java App, Apache Storm, Apache Spark and a Monitoring Tool consume from it]

#EUstr5 7
Kafka Terminology

Topic: A feed of messages

Producer: Processes that publish messages to a topic

Consumer: Processes that subscribe to topics and process the feed of published messages

Broker: Each server of a Kafka cluster that holds, receives and sends the actual data

#EUstr5 8
Kafka Topic Partitions

[Diagram: a topic split into Partition 0, 1 and 2; each partition is an ordered, append-only sequence of messages, with writes appended at the new end (old → new)]

#EUstr5 9
Kafka Topic Partitions

[Diagram: a Producer appends writes to the end of Partition 0, while Consumer A (offset=6) and Consumer B (offset=12) read independently at their own offsets (old → new)]

#EUstr5 10
Kafka Topic Partitions

[Diagram: partitions P0-P8 distributed across Broker 1, Broker 2 and Broker 3, with consumers and producers connecting to all of them]

#EUstr5 11
Kafka Topic Partitions

[Diagram: the same partition layout across the three brokers; more partitions give more storage and more parallelism for consumers and producers]

#EUstr5 12
Kafka Semantics

In short: consumer delivery semantics are up to you, not Kafka.

- Kafka doesn’t store the state of the consumers*
- It just sends you what you ask for (topic, partition, offset, length)
- You have to take care of your state

#EUstr5 13
Apache Kafka Timeline

nov-2012   0.7    Apache Incubator Project
nov-2013   0.8    New Producer
nov-2015   0.9    New Consumer, Security
may-2016   0.10   Kafka Streams

#EUstr5 14
Apache
Spark Streaming

#EUstr5
What is Apache Spark Streaming?

- Process streams of data
- Micro-batching approach

#EUstr5 16
What is Apache Spark Streaming?
What makes it great?

- Process streams of data
- Micro-batching approach
- Same API as Spark
- Same integrations as Spark
- Same guarantees & semantics as Spark

#EUstr5 17
What is Apache Spark Streaming?
Relying on the same Spark Engine: “same syntax” as batch jobs

https://spark.apache.org/docs/latest/streaming-programming-guide.html 18
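Because the API is the same as batch Spark, a minimal streaming job only differs in its entry point. A minimal sketch, with an arbitrary app name and batch interval (not from the talk):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// a StreamingContext that chops the input stream into 5-second micro-batches
val conf = new SparkConf().setAppName("streaming-example")
val ssc = new StreamingContext(conf, Seconds(5))

// any DStream created from ssc (e.g. from Kafka) is then processed with the
// familiar RDD-like API, and the job runs until stopped:
// ssc.start(); ssc.awaitTermination()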
How does it work?
- Discretized Streams

https://spark.apache.org/docs/latest/streaming-programming-guide.html 19
How does it work?
- Discretized Streams

https://spark.apache.org/docs/latest/streaming-programming-guide.html 20
How does it work?

https://databricks.com/blog/2015/07/30/diving-into-apache-spark-streamings-execution-model.html 21
How does it work?

https://databricks.com/blog/2015/07/30/diving-into-apache-spark-streamings-execution-model.html 22
Spark Streaming Semantics
Side effects

As in Spark:
- No guarantee of exactly-once semantics for output actions
- Any side-effecting output operations may be repeated
  - Because of node failure, process failure, etc.

So, be careful when outputting to external sources

#EUstr5 23
Spark Streaming Kafka
Integration

#EUstr5
Spark Streaming Kafka Integration Timeline

sep-2014   1.1   Receivers
dec-2014   1.2   Fault Tolerant WAL + Python API
mar-2015   1.3   Direct Streams + Python API
jun-2015   1.4   Improved Streaming UI
sep-2015   1.5   Metadata in UI (offsets) + Graduated Direct
jan-2016   1.6   —
jul-2016   2.0   Native Kafka 0.10 (experimental)
dec-2016   2.1   —

#EUstr5 25
Kafka Receiver (≤ Spark 1.1)

[Diagram: a Receiver running on the Executor continuously receives data using the Kafka High Level API and updates offsets in ZooKeeper; the Driver launches jobs on the received data]

#EUstr5 26
Kafka Receiver with WAL (Spark 1.2)

[Diagram: as before, the Receiver on the Executor continuously receives data using the High Level API and updates offsets in ZooKeeper, and the Driver launches jobs on the data; in addition, received data is written to a Write Ahead Log on HDFS]

#EUstr5 27
Kafka Receiver with WAL (Spark 1.2)

[Diagram: on the Executor, the Receiver takes the input stream and writes block data both to Spark memory and to the log; on the application Driver, the Streaming Context checkpoints the computation, writes block metadata to the log, and hands jobs to the Spark Context]

#EUstr5 28
Kafka Receiver with WAL (Spark 1.2)

[Diagram, restart: the Restarted Driver recovers the computation from checkpoints and block metadata from the log, relaunching jobs through the Restarted Spark Context; the Restarted Receiver on the Restarted Executor recovers block data from the log while unacked data is resent]

#EUstr5 29
Kafka Receiver with WAL (Spark 1.2)

[Diagram repeated: Receiver with WAL, as in slide 27]

#EUstr5 30
Direct Kafka Integration w/o Receivers or WALs (Spark 1.3)

[Diagram: just a Driver and an Executor, no Receiver involved]

#EUstr5 31
Direct Kafka Integration w/o Receivers or WALs (Spark 1.3)

[Diagram, step 1: the Driver queries the latest offsets and decides the offset ranges for the batch]

#EUstr5 32
Direct Kafka Integration w/o Receivers or WALs (Spark 1.3)

[Diagram, steps 1-2: the Driver queries the latest offsets, decides the offset ranges for the batch (e.g. topic1/p1: 2000-2100, topic1/p2: 2010-2110, topic1/p3: 2002-2102) and launches jobs using those offset ranges]

#EUstr5 33
Direct Kafka Integration w/o Receivers or WALs (Spark 1.3)

[Diagram, steps 1-3 (built up over three animation slides): the Driver queries the latest offsets, decides the offset ranges for the batch and launches jobs using those ranges; the Executors then read the data for their offset ranges directly from Kafka using the Simple API]

#EUstr5 34-36
Direct Kafka Integration w/o Receivers or WALs (Spark 1.3)

[Diagram, recap]
1. Query latest offsets and decide offset ranges for the batch
2. Launch jobs using those offset ranges
3. Read data in the jobs using the offset ranges, via the Simple API

#EUstr5 37
Direct Kafka API benefits

- No WALs or Receivers
- Allows end-to-end exactly-once semantics pipelines*
  * updates to downstream systems should be idempotent or transactional
- More fault-tolerant
- More efficient
- Easier to use

#EUstr5 38
Spark Streaming UI improvements (Spark 1.4) 39
Kafka Metadata (offsets) in UI (Spark 1.5) 40
What about Spark 2.0+ and
new Kafka Integration?
This is why we are here, right?

#EUstr5 41
Spark 2.0+ new Kafka Integration

                             spark-streaming-kafka-0-8   spark-streaming-kafka-0-10
Broker Version               0.8.2.1 or higher           0.10.0 or higher
API Stability                Stable                      Experimental
Language Support             Scala, Java, Python         Scala, Java
Receiver DStream             Yes                         No
Direct DStream               Yes                         Yes
SSL / TLS Support            No                          Yes
Offset Commit API            No                          Yes
Dynamic Topic Subscription   No                          Yes

#EUstr5 42
What’s really new with this new Kafka Integration?

- New Consumer API
  * instead of the Simple API
- Location Strategies
- Consumer Strategies
- SSL / TLS
- No Python API :(

#EUstr5 43
Location Strategies
- The new consumer API will pre-fetch messages into buffers
- So, keep cached consumers on the executors
- It’s better to schedule partitions on the hosts that already have the appropriate consumers

#EUstr5 44
Location Strategies
- PreferConsistent
Distribute partitions evenly across available executors

- PreferBrokers
If your executors are on the same hosts as your Kafka brokers

- PreferFixed
Specify an explicit mapping of partitions to hosts

#EUstr5 45
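As a reference, this is roughly how the three strategies above are picked in code. A minimal sketch using the LocationStrategies object from spark-streaming-kafka-0-10; the topic name and host name are hypothetical:

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.LocationStrategies

// PreferConsistent: the usual choice, spreads partitions evenly over executors
val consistent = LocationStrategies.PreferConsistent

// PreferBrokers: only when executors run on the same hosts as the Kafka brokers
val brokers = LocationStrategies.PreferBrokers

// PreferFixed: pin specific partitions to specific hosts (names are hypothetical)
val fixed = LocationStrategies.PreferFixed(
  Map(new TopicPartition("topicA", 0) -> "executor-host-01"))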
Consumer Strategies
- The new consumer API has a number of different ways to specify topics, some of which require considerable post-object-instantiation setup.
- ConsumerStrategies provides an abstraction that allows Spark to obtain properly configured consumers even after restart from checkpoint.

#EUstr5 46
Consumer Strategies
- Subscribe: subscribe to a fixed collection of topics
- SubscribePattern: use a regex to specify topics of interest
- Assign: specify a fixed collection of partitions

● Overloaded constructors to specify the starting offset for a particular partition
● ConsumerStrategy is a public class that you can extend

#EUstr5 47
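A minimal sketch of the three strategies from the list above, assuming the kafkaParams map defined in the basic-usage example later on; the topic names, regex and starting offset are hypothetical:

import java.util.regex.Pattern
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.ConsumerStrategies

// Subscribe: a fixed collection of topics
val subscribe = ConsumerStrategies.Subscribe[String, String](
  Array("topicA", "topicB"), kafkaParams)

// SubscribePattern: every topic matching a regex
val byPattern = ConsumerStrategies.SubscribePattern[String, String](
  Pattern.compile("topic.*"), kafkaParams)

// Assign: explicit partitions, here with an explicit starting offset for one of them
val partitions = Array(new TopicPartition("topicA", 0), new TopicPartition("topicA", 1))
val offsets = Map(new TopicPartition("topicA", 0) -> 1234L)
val assign = ConsumerStrategies.Assign[String, String](partitions, kafkaParams, offsets)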
SSL/TLS encryption
- The new consumer API supports SSL
- Only applies to communication between Spark and Kafka brokers
- You are still responsible for separately securing Spark inter-node communication

#EUstr5 48
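A sketch of the extra Kafka parameters this typically involves; these are standard Kafka consumer SSL settings, but the paths and passwords are placeholders:

// merged into the regular kafkaParams before calling createDirectStream
val sslParams = Map[String, Object](
  "security.protocol" -> "SSL",
  "ssl.truststore.location" -> "/path/to/kafka.client.truststore.jks",
  "ssl.truststore.password" -> "truststore-password",
  // the keystore entries are only needed if brokers require client authentication
  "ssl.keystore.location" -> "/path/to/kafka.client.keystore.jks",
  "ssl.keystore.password" -> "keystore-password",
  "ssl.key.password" -> "key-password"
)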
How to use the new Kafka Integration on Spark 2.0+
Scala example code: basic usage

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker01:9092,broker02:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "stream_group_id",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val topics = Array("topicA", "topicB")

val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)

stream.map(record => (record.key, record.value))

#EUstr5 49
How to use the new Kafka Integration on Spark 2.0+
Scala example code: getting metadata

import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { iter =>
    val osr: OffsetRange = offsetRanges(TaskContext.get.partitionId)

    // get any needed data from the offset range
    val topic = osr.topic
    val kafkaPartitionId = osr.partition
    val begin = osr.fromOffset
    val end = osr.untilOffset
  }
}

#EUstr5 50
Kafka or Spark RDD Partitions?

[Diagram: with the direct stream, Kafka topic partitions 1-4 map one-to-one to Spark RDD partitions 1-4]

51
Kafka or Spark RDD Partitions?

[Diagram repeated: the one-to-one mapping between Kafka topic partitions and Spark RDD partitions]

52
How to use the new Kafka Integration on Spark 2.0+
Scala example code: getting metadata (same snippet as in slide 50)

#EUstr5 53
How to use the new Kafka Integration on Spark 2.0+
Scala example code: offset commit API (store offsets in Kafka itself)

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // DO YOUR STUFF with DATA

  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}

#EUstr5 54
Kafka + Spark Semantics

- At most once
- At least once
- Exactly once

#EUstr5 55
Kafka + Spark Semantics
At most once

- We don’t want duplicates
- Not worth the hassle of ensuring that messages don’t get lost
- Example: sending statistics over UDP

1. Set spark.task.maxFailures to 1
2. Make sure spark.speculation is false (the default)
3. Set Kafka param auto.offset.reset to “largest”
4. Set Kafka param enable.auto.commit to true

#EUstr5 56
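A minimal config sketch that puts those four settings together; the broker list and group id are placeholders, and note that “largest” is the old consumer’s name for what the 0.10 consumer calls “latest”:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf

val sparkConf = new SparkConf()
  .setAppName("at-most-once-example")
  .set("spark.task.maxFailures", "1")
  .set("spark.speculation", "false") // the default

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker01:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "at_most_once_group",
  "auto.offset.reset" -> "latest",          // "largest" in old-consumer terms
  "enable.auto.commit" -> (true: java.lang.Boolean)
)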
Kafka + Spark Semantics
At most once

- This will mean you lose messages on restart
- At least they shouldn’t get replayed
- Test this carefully if it’s actually important to you that a message never gets repeated, because it’s not a common use case

#EUstr5 57
Kafka + Spark Semantics
At least once

- We don’t want to lose any record
- We don’t care about duplicates
- Example: sending internal alerts on relatively rare occurrences on the stream

1. Set spark.task.maxFailures > 1000
2. Set Kafka param auto.offset.reset to “smallest”
3. Set Kafka param enable.auto.commit to false

#EUstr5 58
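The same kind of sketch for these settings, again with placeholder names; “smallest” is the old consumer’s name for what the 0.10 consumer calls “earliest”:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf

val sparkConf = new SparkConf()
  .setAppName("at-least-once-example")
  .set("spark.task.maxFailures", "1001") // effectively "keep retrying"

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker01:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "at_least_once_group",
  "auto.offset.reset" -> "earliest",        // "smallest" in old-consumer terms
  "enable.auto.commit" -> (false: java.lang.Boolean)
)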
Kafka + Spark Semantics
At least once

- Don’t be silly! Do NOT replay your whole log on every restart…
- Manually commit the offsets when you are 100% sure records are processed

- If this is “too hard”, you’d better have a relatively short retention log
- Or be REALLY ok with duplicates. For example, you are outputting to an external system that handles duplicates for you (HBase)

#EUstr5 59
Kafka + Spark Semantics
Exactly once

- We don’t want to lose any record
- We don’t want duplicates either
- Example: storing the stream in a data warehouse

1. We need some kind of idempotent writes, or whole-or-nothing writes (transactions)
2. Only store offsets EXACTLY after writing the data
3. Same parameters as at least once

#EUstr5 60
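A sketch of that pattern over the direct stream; saveBatchAndOffsets is a hypothetical helper standing in for whatever transactional (or idempotent) write your warehouse supports, and collecting to the driver is only to keep the example short:

import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

def saveBatchAndOffsets(records: Array[(String, String)],
                        offsets: Array[OffsetRange]): Unit = {
  // hypothetical: open a transaction, insert the records,
  // upsert the offsets, and commit both together
}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  val output = rdd.map(record => (record.key, record.value)).collect()

  // data and offsets are stored in one whole-or-nothing write, so a crash
  // can never leave data committed without its offsets (or vice versa)
  saveBatchAndOffsets(output, offsetRanges)
}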
Kafka + Spark Semantics
Exactly once

- Probably the hardest to achieve right
- Still some small chance of failure if your app fails just between writing data and committing offsets… (but REALLY small)

#EUstr5 61
Apache Kafka & Apache Spark at Billy Mobile

15B records monthly
35TB weekly retention log
6K events/second
x4 growth/year

62
Our use cases: ETL to Data Warehouse

- Input events from Kafka
- Enrich events with some external data sources
- Finally store it to Hive

We do NOT want duplicates
We do NOT want to lose events

63
Our use cases: ETL to Data Warehouse

- Hive is not transactional
- Nor does it have idempotent writes
- Writing files to HDFS is “atomic” (whole or nothing)
- A 1:1 relation from each partition-batch to a file in HDFS
- Store to ZK the current state of the batch
- Store to ZK the offsets of the last finished batch

64
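A sketch of the offset-tracking part of this pattern (not the speakers’ actual code); saveToZk is a hypothetical helper standing in for whatever ZooKeeper client you use:

import org.apache.spark.streaming.kafka010.HasOffsetRanges

def saveToZk(path: String, data: String): Unit = {
  // hypothetical: write `data` to the znode at `path` with your ZK client
}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... write the batch to HDFS first (whole-or-nothing) ...

  // then persist the offsets of the finished batch
  val serialized = offsetRanges
    .map(o => s"${o.topic}:${o.partition}:${o.fromOffset}:${o.untilOffset}")
    .mkString(",")
  saveToZk("/myapp/offsets/last_finished_batch", serialized)
}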
Our use cases: Anomalies detector

- Input events from Kafka
- Periodically load a batch-computed model
- Detect when an offer stops converting (or converts too much)

- We do not care about losing some events (on restart)
- We always need to process the “real-time” stream

65
Our use cases: Anomalies detector

- It’s useless to detect anomalies on a lagged stream!
- Actually, it could be very bad

- Always restart the stream at the latest offsets
- Restart with “fresh” state

66
Our use cases: Store to Entity Cache

- Input events from Kafka
- Almost no processing
- Store it to HBase (which has idempotent writes)

- We do not care about duplicates
- We can NOT lose a single event

67
Our use cases: Store to Entity Cache

- Since HBase has idempotent writes, we can write events multiple times without hassle
- But we do NOT start with the earliest offsets…
  - That would be 7 days of redundant writes…!!!
- We store the offsets of the last finished batch
- But obviously we might re-write some events on restart or failure

68
Lessons Learned

- Do NOT use checkpointing
  - Not recoverable across code upgrades
  - Do your own checkpointing

- Track offsets yourself
  - In general, more reliable: HDFS, ZK, RDBMS…

- Memory usually is an issue
  - You don’t want to waste it
  - Adjust batchDuration
  - Adjust maxRatePerPartition

69
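A sketch of where those last two knobs live; the values are arbitrary examples, not recommendations from the talk:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("tuned-streaming-job")
  // cap how many records each Kafka partition contributes per second
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")
  // optionally let backpressure adapt the rate automatically
  .set("spark.streaming.backpressure.enabled", "true")

// batchDuration: the smallest interval your processing can reliably keep up with
val ssc = new StreamingContext(conf, Seconds(10))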
Further Improvements

- Dynamic Allocation
  - spark.dynamicAllocation.enabled vs spark.streaming.dynamicAllocation.enabled
  - https://issues.apache.org/jira/browse/SPARK-12133
  - But no reference in docs...

- Graceful shutdown

- Structured Streaming

70
Thank you very much!
Questions?

@joanvr
joanviladrosa
joan.viladrosa@billymob.com
