Apache Spark
Course: Data Engineering - I
Lecture On: Apache Spark
Instructor: Vishwa Mohan
Segment Learning Objectives
Understanding disk-based processing systems
[Diagram: Hadoop MapReduce data flow. In the Map phase, data is read from HDFS on the hard disk into RAM and the intermediate output is written back to the hard disk; in the Reduce phase, that output is read into RAM again and the final result is written back to the hard disk.]
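To make the two phases concrete, here is a minimal word count in the MapReduce style. This is a plain-Python sketch, not Hadoop: the input lines and the in-memory shuffle are illustrative assumptions, whereas real MapReduce writes the intermediate pairs to the hard disk between the two phases.

from collections import defaultdict

# Illustrative input; in Hadoop this would be read from HDFS.
lines = ["spark is fast", "hadoop is disk based", "spark is in memory"]

# Map phase: emit (word, 1) pairs.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group values by key. In Hadoop, this intermediate data
# is written to and read from the hard disk.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: sum the counts for each word.
reduced = {word: sum(counts) for word, counts in groups.items()}
print(reduced)  # {'spark': 2, 'is': 3, 'fast': 1, ...}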
Delays in Hard Disk
Seek Time: This is the time taken by the read/write head to move from the current position to a new position.
Rotational Delay: This is the time taken by the disk to rotate so that the read/write head points to the beginning of the data chunk.
Transfer Time: This is the time taken to read/write the data from/in the data chunk in the hard disk.
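A rough back-of-the-envelope sketch of how these delays add up (all numbers are illustrative assumptions, not measured values):

seek_time_ms = 9.0          # assumed average seek time
rotational_delay_ms = 4.0   # assumed average rotational delay
transfer_rate_mb_s = 150.0  # assumed sequential transfer rate

chunk_mb = 1.0
transfer_time_ms = chunk_mb / transfer_rate_mb_s * 1000  # about 6.7 ms

total_ms = seek_time_ms + rotational_delay_ms + transfer_time_ms
print(round(total_ms, 1))  # about 19.7 ms in total

Under these assumptions, reading even a small chunk costs roughly 20 ms, most of it spent positioning the head rather than transferring data, which is why disk-heavy jobs are slow.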
It is suitable for storing and managing large amounts of data. Since the data is on the hard drive, it is much safer and fault tolerant.
Since the historical data is stored in large amounts, it is suitable for batch processing of the data.
Even for batch processing of data, the disk I/O consumes a lot of a job's run-time.
It cannot be used for real-time data for immediate results.
Why In-Memory Processing?
MapReduce (Disk-Based) vs Spark (In-Memory):
MapReduce involves disk-based processing; hence, the processing is slow. Spark involves in-memory processing; hence, the processing is fast.
MapReduce involves only batch processing. Spark involves both batch and real-time processing.
MapReduce supports HDFS as data storage. Spark can connect to multiple data sources: local files and/or various distributed file systems, e.g. HDFS, S3, etc.
MapReduce was originally developed in Java but supports C++, Ruby, Groovy, and Perl using the Hadoop streaming library. Spark was originally developed in Scala but also supports Java; further, it supports Python, R, and SQL to the extent possible.
MapReduce offers a raw API, which does not have much support. Spark offers rich APIs.
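A minimal sketch of the "multiple data sources" point, assuming a running SparkContext named sc; the paths, the namenode address, and the S3 bucket are illustrative assumptions:

# Local file system
local_rdd = sc.textFile("file:///tmp/data.txt")

# HDFS (assumed namenode host and port)
hdfs_rdd = sc.textFile("hdfs://namenode:8020/data/data.txt")

# Amazon S3 (requires the hadoop-aws connector and credentials)
s3_rdd = sc.textFile("s3a://my-bucket/data.txt")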
Spark vs MapReduce: Processing Speed
[Chart: Spark and MapReduce each sorting 100 TB of data, comparing processing speed.]
Understanding RDD
[Diagram: an RDD of six partitions (1-6) distributed across Executors, each Executor holding a few partitions in memory alongside its reserved memory.]
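A small sketch of the same idea, assuming a running SparkContext named sc: parallelize() splits the data into partitions, and glom() reveals which elements landed in which partition.

# Create an RDD with an explicit number of partitions.
rdd = sc.parallelize([1, 2, 3, 4, 5, 6], numSlices=3)

print(rdd.getNumPartitions())  # 3
print(rdd.glom().collect())    # e.g. [[1, 2], [3, 4], [5, 6]]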
Spark Lineage
Consider the code in the following slides:
1. rdd1 = sc.parallelize([11,12,13,14,15])
   rdd1 = [11,12,13,14,15]
2. rdd2 = rdd1.map(lambda x: x+1)
   rdd2 = [12,13,14,15,16]
3. rdd3 = rdd2.filter(lambda x: x>12)
   rdd3 = [13,14,15,16]
4. rdd4 = rdd1.filter(lambda x: x>11)
   rdd4 = [12,13,14,15]
5. rdd5 = rdd4.map(lambda x: x*2)
   rdd5 = [24,26,28,30]
Spark Lineage
6. rdd6 = rdd3.union(rdd5)
   rdd6 = [13,14,15,16,24,26,28,30]
7. rdd6.collect()
   returns [13,14,15,16,24,26,28,30]
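Putting the whole example together as one runnable PySpark script (the local master and app name are assumptions for running it standalone). Transformations only record lineage; toDebugString() prints that recorded lineage, and nothing executes until the collect() action:

from pyspark import SparkContext

sc = SparkContext("local[*]", "lineage-demo")  # assumed local setup

rdd1 = sc.parallelize([11, 12, 13, 14, 15])
rdd2 = rdd1.map(lambda x: x + 1)       # transformations: nothing runs yet
rdd3 = rdd2.filter(lambda x: x > 12)
rdd4 = rdd1.filter(lambda x: x > 11)
rdd5 = rdd4.map(lambda x: x * 2)
rdd6 = rdd3.union(rdd5)

print(rdd6.toDebugString().decode())   # the lineage Spark has recorded
print(rdd6.collect())                  # action: triggers execution
# [13, 14, 15, 16, 24, 26, 28, 30]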
Directed Acyclic Graph (DAG)
Spark Lineage

            rdd1
           /    \
       map()   filter()
         |        |
       rdd2     rdd4
         |        |
     filter()   map()
         |        |
       rdd3     rdd5
           \    /
          union()
             |
           rdd6
Segment Summary
Runs Everywhere
Generality
Scenario 1. Assume: 1 worker node = 8 cores and 1 Executor = 1 core; then 1 worker node = 8 Executors.
Scenario 2. Assume: 1 worker node = 8 cores and 1 Executor = 2 cores; then 1 worker node = 4 Executors.
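A hedged sketch of how the second scenario could be configured in PySpark (the app name and memory size are assumptions, and whether spark.executor.instances applies depends on the cluster manager):

from pyspark import SparkConf, SparkContext

# 1 Executor = 2 cores; 4 Executors fill one 8-core worker node.
conf = (SparkConf()
        .setAppName("executor-sizing-demo")
        .set("spark.executor.cores", "2")
        .set("spark.executor.instances", "4")
        .set("spark.executor.memory", "4g"))  # assumed size

sc = SparkContext(conf=conf)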
Executor Memory
[Diagram: the memory layout of an Executor, including its reserved memory.]
Thank You