Unit 5
[Figure: Iterative operations on MapReduce vs. Spark RDD. MapReduce writes and reads HDFS between Iteration 1 and Iteration 2, while Spark keeps intermediate results in memory across iterations.]

[Figure: Interactive operations on MapReduce vs. Spark RDD. Each MapReduce query reads from HDFS again, while Spark serves repeated queries from in-memory data.]
Sort competition

                                 Hadoop MR Record (2013)         Spark Record (2014)
Data Size                        102.5 TB                        100 TB
Elapsed Time                     72 mins                         23 mins
# Nodes                          2100                            206
# Cores                          50400 physical                  6592 virtualized
Cluster disk throughput (est.)   3150 GB/s                       618 GB/s
Network                          dedicated data center, 10Gbps   virtualized (EC2), 10Gbps
Sort rate                        1.42 TB/min                     4.27 TB/min
Sort rate/node                   0.67 GB/min                     20.7 GB/min

Spark sorted the data 3x faster with 1/10 the nodes.
Apache Spark Architecture
Apache Spark supports data analysis, machine learning, graph data
processing, and streaming data analytics. It can read from and write to a
range of data sources and allows development in multiple languages.
[Figure: Spark architecture, with Spark Core at the center connecting to a range of data sources.]
• RDD Persistence
  – RDD.persist()
  – Storage levels:
    • MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, DISK_ONLY, …
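The value of persistence can be sketched in plain Scala (illustrative only; real Spark RDDs are distributed and lazily evaluated on a cluster, and the names here are invented for the sketch). Without persistence, every action re-runs the transformation lineage; with persistence, the result is computed once and reused:

```scala
var evaluations = 0 // how many times the "expensive" transformation ran

// Stands in for an RDD transformation: nothing is kept between calls.
def transformed(): Seq[Int] = { evaluations += 1; (1 to 5).map(_ * 2) }

// Two "actions" without persistence: the transformation runs twice.
val sumNoCache = transformed().sum // 30
val maxNoCache = transformed().max // 10; evaluations is now 2

// With "persistence": compute once, then reuse for any number of actions,
// analogous to calling rdd.persist() before the first action.
val persisted = transformed()
val sumCached = persisted.sum // 30
val maxCached = persisted.max // 10; evaluations is still 3
```

The storage level then controls where the persisted data lives (deserialized in memory, serialized, spilled to disk, and so on).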
Discretized Stream Processing
An Apache Spark Discretized Stream, or DStream for short, represents a
stream of data divided into small batches.
Run a streaming computation as a series of very small, deterministic batch jobs:
• Spark Streaming chops the live data stream into batches of X seconds; Spark processes each batch and returns the processed results.
• Batch sizes can be as low as ½ second, with latency of about 1 second.
• Potential for combining batch processing and streaming processing in the same system.

[Figure: live data stream → Spark Streaming → batches of X seconds → Spark → processed results.]
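The discretization step can be sketched in plain Scala (Event and discretize are names invented here, not Spark API): a stream of timestamped events is chopped into fixed-duration batches, and each batch becomes one small job.

```scala
case class Event(timeSec: Double, data: String)

// Assign each event to the batch covering its timestamp.
def discretize(events: Seq[Event], batchSec: Double): Map[Long, Seq[Event]] =
  events.groupBy(e => (e.timeSec / batchSec).toLong) // batch index per event

val live = Seq(Event(0.1, "a"), Event(0.4, "b"), Event(0.7, "c"), Event(1.2, "d"))
val batches = discretize(live, batchSec = 0.5) // ½-second batches
// batches(0) holds a and b, batches(1) holds c, batches(2) holds d
```

Each resulting batch is then processed as one deterministic job, which is what lets the same engine serve both batch and streaming workloads.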
Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
[Figure: the Twitter Streaming API feeds the tweets DStream, shown as a series of RDDs: batch @ t, batch @ t+1, batch @ t+2.]

DStream transformation: modify the data in one DStream to create a new DStream.

[Figure: each batch of the tweets DStream is transformed into a batch of hashtags, producing [#cat, #dog, …] for every batch.]
Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
[Figure: flatMap applied to each batch of the tweets DStream produces the corresponding batch of the hashTags DStream.]

Batches of input data are replicated in the memory of multiple worker nodes, and are therefore fault-tolerant: data lost due to a worker failure can be recomputed from the input data, and lost RDD partitions are recomputed on other workers.
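The slide assumes a getTags helper; one plausible plain-Scala stand-in, applied with flatMap exactly as in the snippet above, might look like this (getTags as defined here is an assumption, not part of the slide's code):

```scala
// Extract whitespace-separated tokens that look like hashtags.
def getTags(status: String): Seq[String] =
  status.split("\\s+").toSeq.filter(_.startsWith("#"))

val tweets = Seq("I love my #cat", "walking the #dog and #cat")
val hashTags = tweets.flatMap(status => getTags(status))
// hashTags == Seq("#cat", "#dog", "#cat")
```

In Spark Streaming the same flatMap is applied batch by batch, so each RDD of tweets yields an RDD of hashtags.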
Key concepts
• DStream – sequence of RDDs representing a stream of data
  – sources: Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets
…
[Figure: reduceByKey applied to every batch of hashTags produces the tagCounts DStream, e.g. [(#cat, 10), (#dog, 25), …].]
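What the per-batch counting does can be shown with a plain-Scala analogue (not Spark API; a stand-in for reduceByKey or countByValue applied to one batch of hashtags):

```scala
// One batch of extracted hashtags.
val batch = Seq("#cat", "#dog", "#cat", "#cat", "#dog")

// Group identical tags and count each group, yielding the batch's tagCounts.
val tagCounts = batch.groupBy(identity).map { case (t, xs) => (t, xs.size) }
// tagCounts == Map("#cat" -> 3, "#dog" -> 2)
```

Spark performs the same aggregation on every batch, so tagCounts is itself a DStream of (tag, count) pairs.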
Example 3 – Count the hashtags over the last 10 mins

[Figure: a sliding window over a DStream, annotated with the window length and the sliding interval of the window operation.]
Window Operations
• Spark Streaming also provides windowed computations,
  which allow you to apply transformations over a sliding
  window of data.
• Every window operation needs to specify two parameters:
  1. window length – the duration of the window
  2. sliding interval – the interval at which the window operation is
     performed
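The two parameters can be illustrated with a plain-Scala sketch (names invented here; measured in batches rather than seconds): the stream is a sequence of batches, and `sliding` emulates a window operation.

```scala
// Four consecutive batches of hashtags.
val batchStream = Seq(
  Seq("#cat"),
  Seq("#dog", "#cat"),
  Seq("#dog"),
  Seq("#cat")
)

val windowLength  = 3 // window length, in batches
val slideInterval = 1 // sliding interval, in batches

// Each window spans `windowLength` batches and advances by `slideInterval`.
val windows = batchStream.sliding(windowLength, slideInterval).toSeq
val windowCounts = windows.map { w =>
  w.flatten.groupBy(identity).map { case (t, xs) => (t, xs.size) }
}
// Two windows result: batches 0-2 and batches 1-3, each counted in full.
```

In Spark Streaming both parameters are durations (e.g. `Minutes(10)` and `Seconds(1)`) and must be multiples of the batch interval.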
Example 3 – Counting the hashtags over the last 10 mins

[Figure: countByValue applied over the window produces tagCounts, a count over all the data in the window.]
Window Operations
Smart window-based countByValue
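The "smart" variant can be read as incremental counting, sketched here in plain Scala (names invented for the sketch; this is my reading of the slide, not Spark's actual implementation): instead of recounting the full window at every slide, add the counts of the batch entering the window and subtract the counts of the batch leaving it.

```scala
// Count the tags in one batch.
def counts(batch: Seq[String]): Map[String, Int] =
  batch.groupBy(identity).map { case (t, xs) => (t, xs.size) }

// Slide the window: add the entering batch, subtract the leaving batch.
def slide(window: Map[String, Int],
          entering: Seq[String],
          leaving: Seq[String]): Map[String, Int] = {
  val added = counts(entering).foldLeft(window) { case (m, (t, n)) =>
    m.updated(t, m.getOrElse(t, 0) + n)
  }
  counts(leaving).foldLeft(added) { case (m, (t, n)) =>
    val remaining = m(t) - n
    if (remaining == 0) m - t else m.updated(t, remaining)
  }
}

val w0 = counts(Seq("#cat", "#dog", "#cat")) // Map(#cat -> 2, #dog -> 1)
val w1 = slide(w0, entering = Seq("#dog"), leaving = Seq("#cat"))
// w1 == Map(#cat -> 1, #dog -> 2)
```

This touches only the batches that change, so the cost per slide is proportional to the sliding interval rather than the whole window length.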