

Advanced Streaming Operations

Ahmad Alkilani
DATA ARCHITECT

@akizl
Advanced Streaming Operations

§ Checkpointing
§ Window Operations
§ Stateful Transformations
§ Streaming Cardinality Estimation

Checkpointing

§ Can be used in normal Spark applications
§ Generally used/required in Streaming applications
§ Metadata checkpointing – Driver Recovery
  o Configuration
  o DStream operations
  o Incomplete batches
§ Data checkpointing – Stateful Transformations
  o Stateful transformations using data across batches (see the sketch below)
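A minimal sketch of how checkpointing is typically wired up in a streaming job, assuming a hypothetical checkpoint directory and a placeholder createStreamingContext factory; the 2-second batch interval matches the window examples that follow.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical checkpoint location; in practice this should be fault-tolerant
// storage such as HDFS or S3 so a restarted driver can recover from it.
val checkpointDir = "hdfs:///checkpoints/advanced-streaming"

def createStreamingContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("AdvancedStreamingOperations")
  val ssc = new StreamingContext(conf, Seconds(2))
  // Enables metadata checkpointing (configuration, DStream operations,
  // incomplete batches) and data checkpointing for stateful transformations.
  ssc.checkpoint(checkpointDir)
  // ... define DStream operations here ...
  ssc
}

// On restart, getOrCreate rebuilds the context from checkpointed metadata
// if it exists; otherwise it calls the factory to build a fresh one.
val ssc = StreamingContext.getOrCreate(checkpointDir, createStreamingContext _)
ssc.start()
ssc.awaitTermination()

Metadata checkpointing is what makes the driver recoverable; data checkpointing is what the stateful transformations later in this module rely on.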
Window Operations

[Diagram: a DStream is a sequence of RDDs, with one RDD produced per batch interval]
Window Operations
[Diagram: with a 2-second batch interval, a 6-second window covers three consecutive batches of the DStream]

To create a window over a stream of data in Spark, define:

• Window Size
• Slide Interval

Both must be multiples of the batch interval.


Window Operations
clicksDStream.window(Seconds(6), Seconds(2))

clicksDStream.reduceByWindow((a, b) => a + b, (x, y) => x - y, Seconds(6), Seconds(2))

[Diagram: original DStream with 2-second batches; a 6-second window sliding every 2 seconds produces a windowed DStream. The reduce function (a, b) => a + b folds in batches entering the window, and the inverse function (x, y) => x - y removes batches that slide out of it.]
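The keyed variant works the same way. The sketch below assumes clicksDStream is a hypothetical DStream of (productId, clickCount) pairs; supplying the inverse function makes the window incremental, and Spark requires checkpointing to be enabled for that variant.

import org.apache.spark.streaming.Seconds

// Assumed shape: DStream[(String, Int)] of (productId, clickCount) pairs.
val clicksPerProductWindow = clicksDStream.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,   // fold in counts from batches entering the window
  (x: Int, y: Int) => x - y,   // subtract counts from batches leaving the window
  Seconds(6),                  // window size
  Seconds(2)                   // slide interval
)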
Stateful Transformations
[Diagram: a DStream of vehicle sightings (Sedan, Sports) feeds a state function that keeps a running count per category, e.g. 32 -> 33 sedans and 4 -> 5 sports cars. Just as a DStream carries elements of a type [T] ("I only carry a certain kind of goods"), the maintained state has its own type, here [Int].]
Stateful Transformations
• updateStateByKey
• mapWithState
Stateful Transformations
• updateStateByKey
• mapWithState

Key   State
A     (3, 8, 6)      State is a tuple of three integers: (Int, Int, Int)
B     (12, 4, 7)     The current state is handed to the update function as Option[(Int, Int, Int)]

.updateStateByKey[(Int, Int, Int)](
  (newItemsPerKey: Seq[ActivityByProduct], currentState: Option[(Int, Int, Int)]) => { }
)
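Fleshing that stub out into a runnable update function; the ActivityByProduct fields, the stream name activityByProduct, and the meaning of the three counters are assumptions for illustration. updateStateByKey requires checkpointing to be enabled.

// Assumed event shape; field names are illustrative.
case class ActivityByProduct(product: String, purchaseCount: Int, addToCartCount: Int, pageViewCount: Int)

// activityByProduct is assumed to be a DStream[(String, ActivityByProduct)] keyed by product.
val statefulActivityByProduct = activityByProduct.updateStateByKey[(Int, Int, Int)](
  (newItemsPerKey: Seq[ActivityByProduct], currentState: Option[(Int, Int, Int)]) => {
    // Start from the previous state, or (0, 0, 0) the first time a key is seen.
    var (purchases, addToCarts, pageViews) = currentState.getOrElse((0, 0, 0))
    newItemsPerKey.foreach { a =>
      purchases += a.purchaseCount
      addToCarts += a.addToCartCount
      pageViews += a.pageViewCount
    }
    // Returning Some keeps the key's state; returning None would drop it.
    Some((purchases, addToCarts, pageViews))
  }
)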
Stateful Operations
• updateStateByKey
• mapWithState

mapWithState[StateType, MappedType](spec: StateSpec[K, V, StateType, MappedType])
  : MapWithStateDStream[K, V, StateType, MappedType]

val spec = StateSpec
  .function((K, Option[V], State[StateType]) => Option[MappedType])
  .timeout(t: Duration)

(In the running example, StateType is the tuple (Int, Int, Int).)


Stateful Operations
• updateStateByKey
• mapWithState
mapWithState[StateType, MappedType](spec: StateSpec[K, V, StateType, MappedType])
  : MapWithStateDStream[K, V, StateType, MappedType]

[Diagram: keys F, E, D, C, B, A arrive across the stream's batch RDDs while the current state table holds entries such as C -> (3, 8, 6) and D -> (12, 4, 7)]

myStream.mapWithState(…).print()
myStream.mapWithState(…).stateSnapshots().print()
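Putting the pieces together into a sketch, reusing the hypothetical activityByProduct stream and (Int, Int, Int) state from the updateStateByKey example; the 30-minute idle timeout and the mapped return value are illustrative choices.

import org.apache.spark.streaming.{Minutes, State, StateSpec}

// Mapping function: (key, new value for this batch, mutable state handle) => mapped output.
def updateActivity(product: String,
                   activity: Option[ActivityByProduct],
                   state: State[(Int, Int, Int)]): (String, (Int, Int, Int)) = {
  val (purchases, addToCarts, pageViews) = state.getOption().getOrElse((0, 0, 0))
  val updated = activity match {
    case Some(a) => (purchases + a.purchaseCount, addToCarts + a.addToCartCount, pageViews + a.pageViewCount)
    case None    => (purchases, addToCarts, pageViews)   // called with no value when a key times out
  }
  if (!state.isTimingOut()) state.update(updated)   // updating a timing-out key would throw
  (product, updated)
}

val spec = StateSpec
  .function(updateActivity _)
  .timeout(Minutes(30))   // drop state for keys idle longer than 30 minutes

val statefulActivity = activityByProduct.mapWithState(spec)
statefulActivity.print()                    // mapped values for keys seen in the current batch
statefulActivity.stateSnapshots().print()   // the full (key, state) table, like the diagram above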
Cardinality Estimation using HyperLogLog
HLL - Cardinality estimation: counting unique observations
• Example: product “X” unique visitors in a certain timeframe, and again as the timeframe is expanded
• Naïve solution: for 3 billion events, tracking exact uniques can take > 60 GB of memory
• HLL solves this in < 100s of KB
• Based on bit pattern observables
• Associative data structure: per-batch or per-partition HLLs can be merged (HLL + HLL + HLL); see the sketch below
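A sketch of what that merge property looks like with the Algebird library mentioned in the summary; exact method names can vary across Algebird versions, and the 12-bit precision and visitor ids here are illustrative.

import com.twitter.algebird.{HLL, HyperLogLogMonoid}

// 12 bits of precision => 2^12 registers, roughly 1.6% standard error.
val hllMonoid = new HyperLogLogMonoid(bits = 12)

// Each observation (e.g. a visitor id for product "X") becomes a small HLL...
val visitorIds = Seq("user-1", "user-2", "user-1", "user-3")
val hlls: Seq[HLL] = visitorIds.map(id => hllMonoid.create(id.getBytes("UTF-8")))

// ...and because HLL is associative, per-batch or per-partition HLLs can be
// merged without ever holding the raw ids in memory.
val merged: HLL = hlls.reduce(hllMonoid.plus(_, _))

println(merged.estimatedSize)   // approximate number of unique visitors (close to 3 here)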
Summary
§ Window Operations
§ Stateful Transformations
  o updateStateByKey
  o mapWithState
§ Cardinality Estimation using HyperLogLog
§ Algebird Library
  o CountMinSketch
  o Bloom Filters
  o Priority Queues

Next up: Streaming Ingest with Apache Kafka
