BDT UNIT 3
RDDs
Revisiting the YARN workflow summary
Your MR or Spark app contains your program, plus a resource negotiator and a task scheduler. In the case of Spark, the component which negotiates resources and schedules tasks is called the SparkContext (or SparkSession).
Spark is a fast and general engine for large-
scale data processing.
The Spark Shell
The SparkContext acts as the application master. It schedules the Spark operations on the cluster using the YARN resource manager.
The Spark Shell - Help
Scala REPL shell
Available Commands
We will be running our applications on a Standalone Spark Installation, i.e. a Spark
Installation on a single machine. However, for discussion purposes, we will assume that
our applications will run on a multi-node cluster.
At a high level, every Spark application consists of a driver program that launches various parallel operations on a cluster.
A typical driver program could be the Spark shell itself: you just type in the operations you want to run.
Driver programs access Spark through a SparkContext object, which represents a connection to a computing cluster.
In the shell, a SparkContext is automatically created for you as the variable called ‘sc’.
To run these operations, driver programs typically manage a number of worker processes called executors.
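For example, in the Spark shell (a minimal sketch; the file name is illustrative):

// `sc` already exists in the Spark shell as the SparkContext.
val lines = sc.textFile("data.txt")   // an RDD of the file's lines
val n = lines.count()                 // action: runs on the executors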
How Spark Works: A Simplified View
Problem: find the sum of a set of numbers, e.g. 1, 2, 3, 4, 5, 6, 7.
Phase 1: Partition the data and distribute it over the worker nodes, e.g. worker node 1 gets (1, 2), worker node 2 gets (5, 6, 7), and worker node 3 gets (3, 4).
Phase 2: Each worker node computes the partial sum of its partition in parallel.
Phase 3 (network communication): The partial sums are combined into the final answer, e.g. 7 + 21 = 28.
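A minimal sketch of the same computation in Spark (assuming the shell's sc):

val nums = sc.parallelize(1 to 7)  // phase 1: partition and distribute the data
val total = nums.reduce(_ + _)     // phases 2 and 3: parallel partial sums, then combine
// total == 28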
How Spark Works: A Simplified View
• In Spark, the distributed form of data is called a Resilient Distributed Dataset (RDD).
• An RDD is immutable.
• RDDs are computed from scratch every time they are used – unless persisted.
• RDDs are fault tolerant, i.e. they possess self-recovery in the case of failure.
Resilient, i.e. fault-tolerant with the help of the RDD lineage graph (a DAG), and so able to recompute missing or damaged partitions caused by node failures.
Distributed, since the data resides on multiple nodes.
Dataset represents the records of the data you work with. The user can load the data set externally; it can be a JSON file, a CSV file, a text file, or a database via JDBC, with no specific data structure.
Programmers can also call a persist method to indicate which RDDs they want to reuse in future
operations.
Spark keeps persistent RDDs in memory by default, but it can spill them to disk if there is not enough
RAM.
Ways to create an RDD in Spark
For example, read.parquet maps each Parquet file partition to an RDD partition on the same node: emp.parq-0001 (node 1) becomes parqfileRDD-0001 (node 1), emp.parq-0002 (node 2) becomes parqfileRDD-0002 (node 2), and likewise for nodes 3 and 4. Parquet files are already partitioned and stored in a distributed fashion over the cluster nodes.
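A few common ways to create an RDD, sketched for the Spark shell (file names are illustrative; `spark` is the shell's SparkSession):

val fromSeq  = sc.parallelize(Seq(1, 2, 3, 4, 5))   // from a local collection
val fromText = sc.textFile("log.txt")               // from a text file, one element per line
val fromParq = spark.read.parquet("emp.parq").rdd   // from a Parquet file, via the DataFrame reader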
(ii) flatMap: applies a function to each element of the RDD and returns an RDD of the contents of the iterators returned. Often used to extract words: flatMap provides a one-to-many transformation.
Note: RDDSP.flatMap(…) returns an RDD.
RDDSP.flatMap(…).collect() returns an Array.
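A minimal flatMap sketch (the sentences are illustrative):

val linesRDD = sc.parallelize(Seq("The action we take", "Our entire life is"))
val wordsRDD = linesRDD.flatMap(line => line.split(" "))  // one line -> many words
wordsRDD.collect()  // Array(The, action, we, take, Our, entire, life, is)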
(iii) filter: returns an RDD of only the elements that pass the condition of the filter.
An action returns Scala objects, which are not distributed; the result is not an RDD.
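A minimal filter sketch:

val nums  = sc.parallelize(1 to 10)
val evens = nums.filter(_ % 2 == 0)  // keep only elements passing the condition
evens.collect()                      // Array(2, 4, 6, 8, 10)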
collect()
• Takes an RDD (e.g. iRDD, a Spark RDD of integers) and converts it to an Array.
• Used mainly for debugging purposes, since most RDDs are expected to be very large.
• take(num) is a similar action, but returns only “num” elements of the RDD.
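A sketch of collect() and take(num) (element values are illustrative):

val iRDD = sc.parallelize(Seq(2, 3, 4, 3, 5))
iRDD.collect()  // Array(2, 3, 4, 3, 5): the whole RDD, gathered on the driver
iRDD.take(3)    // Array(2, 3, 4): only the first 3 elements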
count: counts the number of elements in an RDD.
A reduce gives us the sum of the integers, e.g. 3849. To compute the average, we also need the count of the integers. How do we get the count?
Generating RDDs with a different intermediate type: Computing the Average
Start from elements of the initial type Int: 123, 456, 678, 901, 234, 567, 890. Applying map( x => (x, 1) ) yields elements of the intermediate type (Int, Int): (123, 1), (456, 1), (678, 1), (901, 1), (234, 1), (567, 1), (890, 1). Reducing these pairs by component-wise addition produces intermediate results such as (1691, 3) and (2158, 4), and finally (3849, 7) – the sum and the count in a single pass.
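A sketch of the whole average computation, using the slide's numbers:

val iRDD    = sc.parallelize(Seq(123, 456, 678, 901, 234, 567, 890))
val pairRDD = iRDD.map(x => (x, 1))  // intermediate type: (Int, Int)
val (sum, count) = pairRDD.reduce((a, b) => (a._1 + b._1, a._2 + b._2))
val avg = sum.toDouble / count       // 3849.0 / 7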
Fold(zero)
fold() is like reduce(), but in addition it takes a “zero value” as input, which is used for the initial call on each partition. The condition on the zero value is that it should be the identity element of the operation. The return type of fold() is the same as that of the elements of the RDD we are operating on.
fold aggregates the elements of each partition, and then the results for all the partitions, using a given associative function and a neutral “zero value”.
Syntax
def fold(acc: T)((acc, value) => acc)
The above is a high-level view of the fold API. It has the following three parts:
T is the data type of the RDD's elements;
acc is an accumulator of type T which will be the return value of the fold operation;
a function which is called for each element in the RDD together with the previous accumulator.
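A minimal fold sketch; 0 is the identity element of addition, so it is a valid zero value:

val iRDD = sc.parallelize(Seq(1, 2, 3, 4, 5))
val sum  = iRDD.fold(0)((acc, value) => acc + value)  // 15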
Aggregate()
It gives us the flexibility to get a data type different from the input type. aggregate() takes two functions to get the final result: through one function we combine an element from our RDD with the accumulator, and through the second we combine two accumulators. Hence, in aggregate, we supply the initial zero value of the type which we want to return.
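A sketch of aggregate() computing sum and count together (return type (Int, Int), element type Int):

val iRDD = sc.parallelize(Seq(1, 2, 3, 4, 5))
val (sum, count) = iRDD.aggregate((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),     // combine an element with the accumulator
  (a, b)   => (a._1 + b._1, a._2 + b._2))   // combine accumulators from different partitions
val avg = sum.toDouble / count              // 3.0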
A Simple Spark Application
sc.textFile converts a text file (here the file “log.txt”) into an RDD. The output, inputRDD, is a Resilient Distributed Dataset (RDD).
For example, union combines theRDDS (“The action we take”) and lifeRDDS (“Our entire life is”) into thelifeRDDS. The union is computed in parallel, with partitions kept on the same machine.
RDD Computation Example
val first4 = thelifeRDDS.take(4)
sc.textFile() and rdd.filter() do not get executed immediately; they only get executed once you call an action on the RDD – here take(4). An action is used either to save a result to some location or to display it.
You can cache an RDD in memory by calling rdd.cache(). When you cache an RDD, its partitions are loaded into the memory of the nodes that hold them.
Caching can improve the performance of your application to a great extent. When an action is performed on an RDD, Spark executes the RDD's entire lineage. If an action is applied multiple times to the same RDD, and that RDD has a long lineage, this causes an increase in execution time. Caching stores the computed result of the RDD in memory, thereby eliminating the need to recompute it every time. You can think of caching as breaking the lineage, but Spark does remember the lineage so that the RDD can be recomputed in case of a node failure.
• Caching can be used to avoid recomputation of an RDD lineage by saving the RDD's contents in memory. If there is not enough memory in the cluster, you can tell Spark to also use the disk for saving the RDD, by using the method persist().
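A minimal caching sketch based on the log.txt example (the "error" filter is illustrative):

val inputRDD  = sc.textFile("log.txt")
val errorsRDD = inputRDD.filter(line => line.contains("error"))
errorsRDD.cache()    // mark for in-memory caching
errorsRDD.count()    // first action: executes the lineage and caches the partitions
errorsRDD.take(4)    // second action: served from the cache, no recomputation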
Compute RDD_1.
Compute RDD_2. (RDD_2 does not depend upon RDD_1. Can’t discard RDD_1 because it is
required later.)
Other operations. (Does not depend upon RDD_1, RDD_2. Can’t discard RDD_1 and RDD_2
because they are required later.)
Compute RDD_3 from RDD_1. Discard RDD_1.
Compute RDD_4 from RDD_2. Discard RDD_2.
So the use-and-discard model fails to discard an RDD immediately after computation, because an RDD may be used much later than where the instruction for computing it appears in the code.
Spark solves the problem using Lineage Graphs and Lazy evaluation.
Lineage Graph of RDDs
Lineage Graph:
An RDD lineage graph is a graph of the transformations that need to be executed after an action has been called.
Each action has its own lineage graph, maintained by the SparkContext.
val inputRDD = sc.textFile("log.txt")
val scalaRDD = inputRDD.filter(line => line.contains("scala"))
val sparkRDD = inputRDD.filter(line => line.contains("spark"))
val sparkscalaRDD = scalaRDD.union(sparkRDD)
val first4 = sparkscalaRDD.take(4)
first4.foreach(println)
The lineage graph: sc.textFile reads log.txt into inputRDD; two filter branches produce scalaRDD and sparkRDD; union combines them into sparkscalaRDD; take produces first4. Note that inputRDD feeds both branches and is re-computed from scratch for each.
Lineage Graphs Continued
(The same program and lineage graph as above.)
If the sequence of operations in the lineage graph is followed, it is guaranteed that an RDD will be used “almost” immediately after it is computed. (Note: if computed sequentially, scalaRDD will have to wait for sparkRDD before sparkscalaRDD can be computed, following which scalaRDD and sparkRDD can be discarded.)
Lazy Evaluation of RDDs using Lineage Graphs
Lazy Evaluation: use lineage graphs to execute transformations when a result is required, instead of when they appear in the application code. Actions compute results.
val inputRDD = sc.textFile("log.txt")                           // inputRDD not created yet
val scalaRDD = inputRDD.filter(line => line.contains("scala"))  // scalaRDD not created yet
val sparkRDD = inputRDD.filter(line => line.contains("spark"))  // sparkRDD not created yet
val sparkscalaRDD = scalaRDD.union(sparkRDD)                    // sparkscalaRDD not created yet
val first4 = sparkscalaRDD.take(4)  // "take" is an action: the lineage graph for "take" is executed
first4.foreach(println)
RDD Persistence
If RDD recomputation is not desired, then the RDD has to be explicitly persisted.
Recall the RDD characteristics mentioned earlier:
• An RDD is immutable.
• RDDs are computed from scratch every time they are used – unless persisted.
• RDDs are resilient, or fault tolerant.
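A minimal persistence sketch; MEMORY_AND_DISK lets Spark spill to disk when RAM is insufficient:

import org.apache.spark.storage.StorageLevel
val sparkRDD = sc.textFile("log.txt").filter(line => line.contains("spark"))
sparkRDD.persist(StorageLevel.MEMORY_AND_DISK)
sparkRDD.count()  // triggers computation; the result is now persisted for reuse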
Operation Fusion Using Lineage Graphs
Spark looks for opportunities to optimize for performance by fusing
successive operations using the lineage graph.
// Example to explain operation fusion
def fn1(n: Int) = n * n
def fn2(n: Int) = n - 10
val rdd1 = sc.parallelize(Range(0, 10))
val rdd2 = rdd1.map(x => fn1(x))
val rdd3 = rdd2.map(x => fn2(x))
rdd3.collect()
Spark will automatically fuse the two maps above, which avoids the overhead of intermediate RDD creation. There is no single user-defined function, as in the previous example, which can do a map and a filter simultaneously. In cases like the above, fusion can be done internally by Spark, but not at the Spark program level.
Working with Key, Value
Pairs: Pair RDD’s
Key, Value Pairs
• Records usually have a key field.
• Database tables have a primary key.
• Key uniquely identifies an entity.
• Entity can be a person, institution, team etc.
• Value consists of information pertinent to the key element.
• Address (of person), Address (of Institute), Players (of team).
• e.g. (author, book title) is a (key, value) pair.
• (Chetan Bhagat, Two States), (Chetan Bhagat, Revolution), (Aravind Adiga, White Tiger), (Aravind Adiga, Last Man in Tower), (Aravind Adiga, Between the Assassinations)
RDDs with (key, value) pair elements are called pair RDDs.
Note: a pair RDD does not have the semantics of the Scala Map (i.e. it can have repeated keys).
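For example, a pair RDD of (author, title) pairs can be created directly from a local collection (a minimal sketch):

val books = sc.parallelize(Seq(
  ("Chetan Bhagat", "Two States"),
  ("Aravind Adiga", "White Tiger")))  // RDD[(String, String)]: a pair RDD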
Transformations on pair RDD’s
Operations such as filter or map, which work on simple RDD’s, also work on
pair RDD’s.
CombineByKey()
combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner): similar to reduceByKey, but used to return a different data type than the initial data type.
Example, with the values for key “oracle” spread over two partitions:
Partition 1 holds (oracle, 2), (oracle, 14), (TCS, 14).
createCombiner turns the first oracle value into (oracle, (2, 1)).
mergeValue folds in the next value: (oracle, (2+14, 1+1)) = (oracle, (16, 2)).
Partition 2 holds (oracle, 31); createCombiner produces (oracle, (31, 1)).
mergeCombiners then combines the per-partition results: (oracle, (16+31, 2+1)) = (oracle, (47, 3)).
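A sketch of the same computation with combineByKey, producing a per-key (sum, count):

val salesRDD = sc.parallelize(
  Seq(("oracle", 2), ("oracle", 14), ("TCS", 14), ("oracle", 31)))
val sumCount = salesRDD.combineByKey(
  (v: Int) => (v, 1),                                            // createCombiner
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),         // mergeValue
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2))  // mergeCombiners
sumCount.collect()  // Array((oracle,(47,3)), (TCS,(14,1)))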
Transformations of two pair RDDs
subtractByKey: removes elements with a key present in the other RDD.
join(): a join combines records from two tables such that the records have the same key; same concept as an SQL join. For example:
(Movie, Genre): (Rambo, Action), (HarryPotter, Fantasy), (HarryPotter, Drama), (Starbucks, Comedy)
(Movie, Rating): (Rambo, 4.2), (HarryPotter, 4.8)
Join result (Movie, Genre, Rating): (Rambo, Action, 4.2), (HarryPotter, Fantasy, 4.8), (HarryPotter, Drama, 4.8)
Join
join(): perform an inner join between two RDDs.
leftOuterJoin(): perform a join between two RDDs where the key must be present in the first RDD.
rightOuterJoin(): perform a join between two RDDs where the key must be present in the second RDD.
cogroup(): group data from both RDDs sharing the same key.
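A sketch of these operations on the movie tables above:

val genres  = sc.parallelize(Seq(("Rambo", "Action"), ("HarryPotter", "Fantasy"),
                                 ("HarryPotter", "Drama"), ("Starbucks", "Comedy")))
val ratings = sc.parallelize(Seq(("Rambo", 4.2), ("HarryPotter", 4.8)))
genres.join(ratings).collect()           // inner join: only keys present in both
genres.leftOuterJoin(ratings).collect()  // every key of genres; missing ratings appear as None
genres.rightOuterJoin(ratings).collect() // every key of ratings
genres.cogroup(ratings).collect()        // per key, the groups of values from both RDDs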
Actions on Pair RDDs
countByKey: counts the number of elements for each key.
Consider a join by employeeID, resultrdd = rdd1.join(rdd2). Its implementation can be modelled as follows.
Join Implementation by Hash Partitioning
Initially the RDD elements are randomly distributed over two nodes. Spark hashes every element by the join key (lots of shuffling), after which, for example, node 1 holds the odd keys and node 2 holds the even keys. Next, Spark performs the join (no shuffle required).
RDD partitioning and shuffle performance
• Spark operations may cause shuffling.
Note: RDD elements are always stored in partitions. However, the elements may be assigned to partitions randomly. When we talk about partitions here, we mean partitions whose elements follow some pattern.
What is a partition in Spark?
Resilient Distributed Datasets are collections of data items that are so huge in size that they cannot fit into a single node and have to be partitioned across various nodes. Spark automatically partitions RDDs and distributes the partitions across different nodes.
Partitions are basic units of parallelism in Apache Spark.
The number of partitions in a Spark RDD can always be found by using the partitions method of the RDD. For the RDD that we created, the partitions method shows an output of 5 partitions:
scala> rdd.partitions.size
Output: 5
If an RDD has too many partitions, then task scheduling may take more time than the actual execution time.
On the contrary, having too few partitions is also not beneficial, as some of the worker nodes could just be sitting idle, resulting in less concurrency.
This can lead to improper resource utilization and data skew, i.e. data might be concentrated on a single partition, so that one worker node does more work than the other worker nodes.
Thus, there is always a trade-off when it comes to deciding on the number of partitions.
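A sketch of inspecting and changing the number of partitions:

val rdd = sc.parallelize(1 to 100, 5)  // explicitly request 5 partitions
rdd.partitions.size                    // 5
val fewer = rdd.coalesce(2)            // reduce the partition count without a full shuffle
val more  = rdd.repartition(10)        // change the partition count (causes a shuffle)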
Types of Partitioning in Apache Spark
Partitions
Data within an RDD is split into several partitions.
Properties of partitions:
1. Partitions never span multiple machines.
2. Each machine in the cluster contains one or more partitions.
3. The number of partitions to use is configurable. By default, it equals the total number of cores on all executor nodes.
Hash partitioning assigns a key K to the partition p = K.hashCode() % numPartitions.
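A minimal hash-partitioning sketch for a pair RDD:

import org.apache.spark.HashPartitioner
val pairRDD = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c"), (4, "d")))
val hashed  = pairRDD.partitionBy(new HashPartitioner(2))  // key.hashCode() % 2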
Range Partitioning
Some Spark RDDs have keys that follow a particular ordering; for such RDDs, range partitioning can be the more efficient choice.
Operations that maintain partitioning (the result has the same partitioning if the parent RDD has a partitioner):
- mapValues()
- flatMapValues()
- filter()
In all three, the “keys” remain unchanged.
Working at the Partition Level - Continued
mapPartitionsWithIndex gives the partition index and an iterator over the partition.
For example: make 2 partitions, and use a HashPartitioner, which hashes on the “key” of the pair RDD.
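A minimal mapPartitionsWithIndex sketch:

val rdd = sc.parallelize(1 to 10, 2)  // 2 partitions
rdd.mapPartitionsWithIndex((index, iter) =>
  iter.map(x => s"partition $index -> $x")).collect()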
Partitioning and Performance
• RDD1 (employees): big dataset, changes infrequently.
• RDD2 (empByDept): small dataset, changes frequently.
• RDD1 join RDD2 is required every time RDD2 changes.
• Hash partition RDD1, then persist it.
• Every time RDD2 changes and a join is done:
• RDD1 is already hash partitioned and persisted: no shuffle required.
• RDD2 gets hash partitioned: a shuffle is required, but RDD2 is small.
The employees RDD is hashed and persisted, e.g. odd keys on node 1 and even keys on node 2, whereas empByDept is randomly distributed when it is created. Spark needs to hash only empByDept (shuffling reduced); next, Spark performs the join (no shuffle required).
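A sketch of this pattern (the datasets and keys are illustrative):

import org.apache.spark.HashPartitioner
val employees = sc.parallelize(Seq((101, "Asha"), (102, "Ravi"), (103, "Meena")))
  .partitionBy(new HashPartitioner(2))  // hash partition the big RDD once...
  .persist()                            // ...and persist it
val empByDept = sc.parallelize(Seq((101, "Sales"), (103, "HR")))
val joined = employees.join(empByDept)  // only the small empByDept is shuffled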
Partitioning Tuple RDDs with more than 2 elements
Even if the RDD element is a tuple and not a (key, value) pair, e.g. (“Alok Nath”, 36554, “India”), it can be hash partitioned by any element of the tuple, i.e. by name, ID, or country. This is left as an exercise; one possible approach is sketched below.
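One possible approach (a sketch, not the only solution; the second tuple is illustrative): keyBy turns the tuple into a (key, value) pair, after which partitionBy applies:

import org.apache.spark.HashPartitioner
val tupleRDD = sc.parallelize(Seq(("Alok Nath", 36554, "India"),
                                  ("Mary Jones", 27811, "USA")))
val byId = tupleRDD.keyBy(t => t._2)                        // key on the ID (2nd element)
val partitioned = byId.partitionBy(new HashPartitioner(4))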
Pipelining and Shuffle Videos (Optional)
“A Deeper Understanding of Spark Internals” – Aaron Davidson (Databricks).mp4
(Link to video on PC – will work only if the video file is on the PC.)
Spark uses a BitTorrent-like protocol for sending broadcast variables across the cluster, i.e., for each variable that has to be broadcast, initially the driver acts as the only source. The data is split into blocks at the driver, and each receiver starts fetching blocks into its local directory. Once a block is completely received, that receiver also acts as a source of the block for the rest of the receivers (this reduces the load on the machine running the driver). This continues for the rest of the blocks. So initially only the driver is the source, and later on the number of sources increases; because of this, the rate at which blocks are fetched by a node increases over time.
Broadcast example-2
factor is broadcast twice to all the worker nodes.
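The slides do not show the code for this example; the sketch below (variable names assumed) contrasts an explicit broadcast, shipped to each executor once, with a plain closure variable, which is re-sent with every task:

val factor = 3
val bcFactor = sc.broadcast(factor)          // sent to each worker node once
val nums = sc.parallelize(1 to 5)
nums.map(x => x * bcFactor.value).collect()  // Array(3, 6, 9, 12, 15)
// Without sc.broadcast, `factor` would be captured in the task closure and
// serialized with every task, i.e. it could be sent more than once.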