BDT UNIT 3
RDDs
Revisiting the YARN workflow summary
Your MR or Spark app contains your program, plus a resource negotiator and a task scheduler. In the case of Spark, the component which negotiates resources and schedules tasks is called the SparkContext (or SparkSession).
Spark is a fast and general engine for large-
scale data processing.
The Spark Shell
The SparkContext acts as the application master. It schedules the Spark operations on the cluster using the YARN resource manager.
The Spark Shell - Help
Scala REPL shell
Available Commands
We will be running our applications on a Standalone Spark Installation, i.e. a Spark
Installation on a single machine. However, for discussion purposes, we will assume that
our applications will run on a multi-node cluster.
At a high level, every Spark application consists of a driver program that launches various parallel operations on a cluster.
A typical driver program could be the Spark shell itself: you just type in the operations you want to run.
Driver programs access Spark through a SparkContext object, which represents a connection to a computing cluster.
In the shell, a SparkContext is automatically created for you as the variable called ‘sc’.
To run these operations, driver programs typically manage a number of worker processes called executors.
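For example, in the Spark shell (a minimal sketch; the file name is illustrative):

// `sc` already exists in the Spark shell as the SparkContext.
val lines = sc.textFile("data.txt")   // an RDD of the file's lines
val n = lines.count()                 // action: runs on the executors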
How Spark Works: A Simplified View
Problem: find the sum of a set of numbers, e.g. 1, 2, 3, 4, 5, 6, 7.
Phase 1: Partition the data and distribute it over the worker nodes, e.g. worker node 1 gets (1, 2), worker node 2 gets (5, 6, 7), and worker node 3 gets (3, 4).
Phase 2: Each worker node computes the partial sum of its partition in parallel.
Phase 3 (network communication): The partial sums are combined into the final answer, e.g. 7 + 21 = 28.
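A minimal sketch of the same computation in Spark (assuming the shell's sc):

val nums = sc.parallelize(1 to 7)  // phase 1: partition and distribute the data
val total = nums.reduce(_ + _)     // phases 2 and 3: parallel partial sums, then combine
// total == 28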
How Spark Works: A Simplified View
• In Spark, the distributed form of data is called a Resilient Distributed Dataset (RDD).
• An RDD is immutable.
• RDDs are computed from scratch every time they are used – unless persisted.
• RDDs are fault tolerant, i.e. they possess self-recovery in the case of failure.
Resilient, i.e. fault-tolerant with the help of the RDD lineage graph (a DAG), and so able to recompute missing or damaged partitions caused by node failures.
Distributed, since the data resides on multiple nodes.
Dataset represents the records of the data you work with. The user can load the data set externally; it can be a JSON file, a CSV file, a text file, or a database via JDBC, with no specific data structure.
Programmers can also call a persist method to indicate which RDDs they want to reuse in future
operations.
Spark keeps persistent RDDs in memory by default, but it can spill them to disk if there is not enough
RAM.
Ways to create an RDD in Spark
For example, read.parquet maps each Parquet file partition to an RDD partition on the same node: emp.parq-0001 (node 1) becomes parqfileRDD-0001 (node 1), emp.parq-0002 (node 2) becomes parqfileRDD-0002 (node 2), and likewise for nodes 3 and 4. Parquet files are already partitioned and stored in a distributed fashion over the cluster nodes.
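A few common ways to create an RDD, sketched for the Spark shell (file names are illustrative; `spark` is the shell's SparkSession):

val fromSeq  = sc.parallelize(Seq(1, 2, 3, 4, 5))   // from a local collection
val fromText = sc.textFile("log.txt")               // from a text file, one element per line
val fromParq = spark.read.parquet("emp.parq").rdd   // from a Parquet file, via the DataFrame reader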
(ii) flatMap: applies a function to each element of the RDD and returns an RDD of the contents of the iterators returned. Often used to extract words: flatMap provides a one-to-many transformation.
Note: RDDSP.flatMap(…) returns an RDD.
RDDSP.flatMap(…).collect() returns an Array.
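A minimal flatMap sketch (the sentences are illustrative):

val linesRDD = sc.parallelize(Seq("The action we take", "Our entire life is"))
val wordsRDD = linesRDD.flatMap(line => line.split(" "))  // one line -> many words
wordsRDD.collect()  // Array(The, action, we, take, Our, entire, life, is)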
(iii) filter: returns an RDD of only the elements that pass the condition of the filter.
An action returns Scala objects, which are not distributed; the result is not an RDD.
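A minimal filter sketch:

val nums  = sc.parallelize(1 to 10)
val evens = nums.filter(_ % 2 == 0)  // keep only elements passing the condition
evens.collect()                      // Array(2, 4, 6, 8, 10)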
collect()
• Takes an RDD (e.g. iRDD, a Spark RDD of integers) and converts it to an Array.
• Used mainly for debugging purposes, since most RDDs are expected to be very large.
• take(num) is a similar action, but returns only “num” elements of the RDD.
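A sketch of collect() and take(num) (element values are illustrative):

val iRDD = sc.parallelize(Seq(2, 3, 4, 3, 5))
iRDD.collect()  // Array(2, 3, 4, 3, 5): the whole RDD, gathered on the driver
iRDD.take(3)    // Array(2, 3, 4): only the first 3 elements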
count: counts the number of elements in an RDD.
A reduce gives us the sum of the integers, e.g. 3849. To compute the average, we also need the count of the integers. How do we get the count?
Generating RDDs with a different intermediate type: Computing the Average
Start from elements of the initial type Int: 123, 456, 678, 901, 234, 567, 890. Applying map( x => (x, 1) ) yields elements of the intermediate type (Int, Int): (123, 1), (456, 1), (678, 1), (901, 1), (234, 1), (567, 1), (890, 1). Reducing these pairs by component-wise addition produces intermediate results such as (1691, 3) and (2158, 4), and finally (3849, 7) – the sum and the count in a single pass.
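A sketch of the whole average computation, using the slide's numbers:

val iRDD    = sc.parallelize(Seq(123, 456, 678, 901, 234, 567, 890))
val pairRDD = iRDD.map(x => (x, 1))  // intermediate type: (Int, Int)
val (sum, count) = pairRDD.reduce((a, b) => (a._1 + b._1, a._2 + b._2))
val avg = sum.toDouble / count       // 3849.0 / 7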
Fold(zero)
fold() is like reduce(), but in addition it takes a “zero value” as input, which is used for the initial call on each partition. The condition on the zero value is that it should be the identity element of the operation. The return type of fold() is the same as that of the elements of the RDD we are operating on.
fold aggregates the elements of each partition, and then the results for all the partitions, using a given associative function and a neutral “zero value”.
Syntax
def fold(acc: T)((acc, value) => acc)
The above is a high-level view of the fold API. It has the following three parts:
T is the data type of the RDD's elements;
acc is an accumulator of type T which will be the return value of the fold operation;
a function which is called for each element in the RDD together with the previous accumulator.
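A minimal fold sketch; 0 is the identity element of addition, so it is a valid zero value:

val iRDD = sc.parallelize(Seq(1, 2, 3, 4, 5))
val sum  = iRDD.fold(0)((acc, value) => acc + value)  // 15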
Aggregate()
It gives us the flexibility to get a data type different from the input type. aggregate() takes two functions to get the final result: through one function we combine an element from our RDD with the accumulator, and through the second we combine two accumulators. Hence, in aggregate, we supply the initial zero value of the type which we want to return.
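A sketch of aggregate() computing sum and count together (return type (Int, Int), element type Int):

val iRDD = sc.parallelize(Seq(1, 2, 3, 4, 5))
val (sum, count) = iRDD.aggregate((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),     // combine an element with the accumulator
  (a, b)   => (a._1 + b._1, a._2 + b._2))   // combine accumulators from different partitions
val avg = sum.toDouble / count              // 3.0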
A Simple Spark Application
sc.textFile converts a text file (here the file “log.txt”) into an RDD. The output, inputRDD, is a Resilient Distributed Dataset (RDD).
For example, union combines theRDDS (“The action we take”) and lifeRDDS (“Our entire life is”) into thelifeRDDS. The union is computed in parallel, with partitions kept on the same machine.
RDD Computation Example
val first4 = thelifeRDDS.take(4)
sc.textFile() and rdd.filter() do not get executed immediately; they only get executed once you call an action on the RDD – here take(4). An action is used either to save a result to some location or to display it.
You can cache an RDD in memory by calling rdd.cache(). When you cache an RDD, its partitions are loaded into the memory of the nodes that hold them.
Caching can improve the performance of your application to a great extent. When an action is performed on an RDD, Spark executes the RDD's entire lineage. If an action is applied multiple times to the same RDD, and that RDD has a long lineage, this causes an increase in execution time. Caching stores the computed result of the RDD in memory, thereby eliminating the need to recompute it every time. You can think of caching as breaking the lineage, but Spark does remember the lineage so that the RDD can be recomputed in case of a node failure.
• Caching can be used to avoid recomputation of an RDD lineage by saving the RDD's contents in memory. If there is not enough memory in the cluster, you can tell Spark to also use the disk for saving the RDD, by using the method persist().
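A minimal caching sketch based on the log.txt example (the "error" filter is illustrative):

val inputRDD  = sc.textFile("log.txt")
val errorsRDD = inputRDD.filter(line => line.contains("error"))
errorsRDD.cache()    // mark for in-memory caching
errorsRDD.count()    // first action: executes the lineage and caches the partitions
errorsRDD.take(4)    // second action: served from the cache, no recomputation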
Compute RDD_1.
Compute RDD_2. (RDD_2 does not depend upon RDD_1. Can’t discard RDD_1 because it is
required later.)
Other operations. (Does not depend upon RDD_1, RDD_2. Can’t discard RDD_1 and RDD_2
because they are required later.)
Compute RDD_3 from RDD_1. Discard RDD_1.
Compute RDD_4 from RDD_2. Discard RDD_2.
So the use-and-discard model fails to discard an RDD immediately after computation, because an RDD may be used much later than where the instruction for computing it appears in the code.
Spark solves the problem using Lineage Graphs and Lazy evaluation.
Lineage Graph of RDDs
Lineage Graph:
An RDD lineage graph is a graph of the transformations that need to be executed after an action has been called.
Each action has its own lineage graph, maintained by the SparkContext.
val inputRDD = sc.textFile("log.txt")
val scalaRDD = inputRDD.filter(line => line.contains("scala"))
val sparkRDD = inputRDD.filter(line => line.contains("spark"))
val sparkscalaRDD = scalaRDD.union(sparkRDD)
val first4 = sparkscalaRDD.take(4)
first4.foreach(println)
The lineage graph: sc.textFile reads log.txt into inputRDD; two filter branches produce scalaRDD and sparkRDD; union combines them into sparkscalaRDD; take produces first4. Note that inputRDD feeds both branches and is re-computed from scratch for each.
Lineage Graphs Continued
(The same program and lineage graph as above.)
If the sequence of operations in the lineage graph is followed, it is guaranteed that an RDD will be used “almost” immediately after it is computed. (Note: if computed sequentially, scalaRDD will have to wait for sparkRDD before sparkscalaRDD can be computed, following which scalaRDD and sparkRDD can be discarded.)
Lazy Evaluation of RDDs using Lineage Graphs
Lazy Evaluation: use lineage graphs to execute transformations when a result is required, instead of when they appear in the application code. Actions compute results.
val inputRDD = sc.textFile("log.txt")                           // inputRDD not created yet
val scalaRDD = inputRDD.filter(line => line.contains("scala"))  // scalaRDD not created yet
val sparkRDD = inputRDD.filter(line => line.contains("spark"))  // sparkRDD not created yet
val sparkscalaRDD = scalaRDD.union(sparkRDD)                    // sparkscalaRDD not created yet
val first4 = sparkscalaRDD.take(4)  // "take" is an action: the lineage graph for "take" is executed
first4.foreach(println)
RDD Persistence
If RDD recomputation is not desired, then the RDD has to be explicitly persisted.
Recall the RDD characteristics mentioned earlier:
• An RDD is immutable.
• RDDs are computed from scratch every time they are used – unless persisted.
• RDDs are resilient, or fault tolerant.
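A minimal persistence sketch; MEMORY_AND_DISK lets Spark spill to disk when RAM is insufficient:

import org.apache.spark.storage.StorageLevel
val sparkRDD = sc.textFile("log.txt").filter(line => line.contains("spark"))
sparkRDD.persist(StorageLevel.MEMORY_AND_DISK)
sparkRDD.count()  // triggers computation; the result is now persisted for reuse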
Operation Fusion Using Lineage Graphs
Spark looks for opportunities to optimize for performance by fusing
successive operations using the lineage graph.
// Example to explain operation fusion
def fn1(n: Int) = n * n
def fn2(n: Int) = n - 10
val rdd1 = sc.parallelize(Range(0, 10))
val rdd2 = rdd1.map(x => fn1(x))
val rdd3 = rdd2.map(x => fn2(x))
rdd3.collect()
Spark will automatically fuse the two maps above, which avoids the overhead of intermediate RDD creation. There is no single user-defined function, as in the previous example, which can do a map and a filter simultaneously. In cases like the above, fusion can be done internally by Spark, but not at the Spark program level.
Working with Key, Value
Pairs: Pair RDD’s
Key, Value Pairs
• Records usually have a key field.
• Database tables have a primary key.
• Key uniquely identifies an entity.
• Entity can be a person, institution, team etc.
• Value consists of information pertinent to the key element.
• Address (of person), Address (of Institute), Players (of team).
• e.g. (author, book title) is a (key, value) pair.
• (Chetan Bhagat, Two States), (Chetan Bhagat, Revolution), (Aravind Adiga, White Tiger), (Aravind Adiga, Last Man in Tower), (Aravind Adiga, Between the Assassinations)
RDDs with (key, value) pair elements are called pair RDDs.
Note: a pair RDD does not have the semantics of the Scala Map (i.e. it can have repeated keys).
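For example, a pair RDD of (author, title) pairs can be created directly from a local collection (a minimal sketch):

val books = sc.parallelize(Seq(
  ("Chetan Bhagat", "Two States"),
  ("Aravind Adiga", "White Tiger")))  // RDD[(String, String)]: a pair RDD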
Transformations on pair RDD’s
Operations such as filter or map, which work on simple RDD’s, also work on
pair RDD’s.
CombineByKey()
combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner): similar to reduceByKey, but used to return a different data type than the initial data type.
Example, with the values for key “oracle” spread over two partitions:
Partition 1 holds (oracle, 2), (oracle, 14), (TCS, 14).
createCombiner turns the first oracle value into (oracle, (2, 1)).
mergeValue folds in the next value: (oracle, (2+14, 1+1)) = (oracle, (16, 2)).
Partition 2 holds (oracle, 31); createCombiner produces (oracle, (31, 1)).
mergeCombiners then combines the per-partition results: (oracle, (16+31, 2+1)) = (oracle, (47, 3)).
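A sketch of the same computation with combineByKey, producing a per-key (sum, count):

val salesRDD = sc.parallelize(
  Seq(("oracle", 2), ("oracle", 14), ("TCS", 14), ("oracle", 31)))
val sumCount = salesRDD.combineByKey(
  (v: Int) => (v, 1),                                            // createCombiner
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),         // mergeValue
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2))  // mergeCombiners
sumCount.collect()  // Array((oracle,(47,3)), (TCS,(14,1)))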
Transformations of two pair RDDs
subtractByKey: removes elements with a key present in the other RDD.
join(): a join combines records from two tables such that the records have the same key; same concept as an SQL join. For example:
(Movie, Genre): (Rambo, Action), (HarryPotter, Fantasy), (HarryPotter, Drama), (Starbucks, Comedy)
(Movie, Rating): (Rambo, 4.2), (HarryPotter, 4.8)
Join result (Movie, Genre, Rating): (Rambo, Action, 4.2), (HarryPotter, Fantasy, 4.8), (HarryPotter, Drama, 4.8)
Join
join(): perform an inner join between two RDDs.
leftOuterJoin(): perform a join between two RDDs where the key must be present in the first RDD.
rightOuterJoin(): perform a join between two RDDs where the key must be present in the second RDD.
cogroup(): group data from both RDDs sharing the same key.
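A sketch of these operations on the movie tables above:

val genres  = sc.parallelize(Seq(("Rambo", "Action"), ("HarryPotter", "Fantasy"),
                                 ("HarryPotter", "Drama"), ("Starbucks", "Comedy")))
val ratings = sc.parallelize(Seq(("Rambo", 4.2), ("HarryPotter", 4.8)))
genres.join(ratings).collect()           // inner join: only keys present in both
genres.leftOuterJoin(ratings).collect()  // every key of genres; missing ratings appear as None
genres.rightOuterJoin(ratings).collect() // every key of ratings
genres.cogroup(ratings).collect()        // per key, the groups of values from both RDDs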
Actions on Pair RDDs
countByKey: counts the number of elements for each key.
Consider a join by employeeID, resultrdd = rdd1.join(rdd2). Its implementation can be modelled as follows.
Join Implementation by Hash Partitioning
Initially the RDD elements are randomly distributed over two nodes. Spark hashes every element by the join key (lots of shuffling), after which, for example, node 1 holds the odd keys and node 2 holds the even keys. Next, Spark performs the join (no shuffle required).
RDD partitioning and shuffle performance
• Spark operations may cause shuffling.
Note: RDD elements are always stored in partitions. However, the elements may be assigned to partitions randomly. When we talk about partitions here, we mean partitions whose elements follow some pattern.
What is a partition in Spark?
Resilient Distributed Datasets are collections of data items that are so huge in size that they cannot fit into a single node and have to be partitioned across various nodes. Spark automatically partitions RDDs and distributes the partitions across different nodes.
Partitions are basic units of parallelism in Apache Spark.
The number of partitions in a Spark RDD can always be found by using the partitions method of the RDD. For the RDD that we created, the partitions method shows an output of 5 partitions:
scala> rdd.partitions.size
Output: 5
If an RDD has too many partitions, then task scheduling may take more time than the actual execution time.
On the contrary, having too few partitions is also not beneficial, as some of the worker nodes could just be sitting idle, resulting in less concurrency.
This can lead to improper resource utilization and data skew, i.e. data might be concentrated on a single partition, so that one worker node does more work than the other worker nodes.
Thus, there is always a trade-off when it comes to deciding on the number of partitions.
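A sketch of inspecting and changing the number of partitions:

val rdd = sc.parallelize(1 to 100, 5)  // explicitly request 5 partitions
rdd.partitions.size                    // 5
val fewer = rdd.coalesce(2)            // reduce the partition count without a full shuffle
val more  = rdd.repartition(10)        // change the partition count (causes a shuffle)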
Types of Partitioning in Apache Spark
Partitions
Data within an RDD is split into several partitions.
Properties of partitions:
1. Partitions never span multiple machines.
2. Each machine in the cluster contains one or more partitions.
3. The number of partitions to use is configurable. By default, it equals the total number of cores on all executor nodes.
Hash partitioning assigns a key K to the partition p = K.hashCode() % numPartitions.
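A minimal hash-partitioning sketch for a pair RDD:

import org.apache.spark.HashPartitioner
val pairRDD = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c"), (4, "d")))
val hashed  = pairRDD.partitionBy(new HashPartitioner(2))  // key.hashCode() % 2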
Range Partitioning
Some Spark RDDs have keys that follow a particular ordering; for such RDDs, range partitioning can be the more efficient choice.
Operations that maintain partitioning (the result has the same partitioning if the parent RDD has a partitioner):
- mapValues()
- flatMapValues()
- filter()
In all three, the “keys” remain unchanged.
Working at the Partition Level - Continued
mapPartitionsWithIndex gives the partition index and an iterator over the partition.
For example: make 2 partitions, and use a HashPartitioner, which hashes on the “key” of the pair RDD.
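A minimal mapPartitionsWithIndex sketch:

val rdd = sc.parallelize(1 to 10, 2)  // 2 partitions
rdd.mapPartitionsWithIndex((index, iter) =>
  iter.map(x => s"partition $index -> $x")).collect()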
Partitioning and Performance
• RDD1 (employees): big dataset, changes infrequently.
• RDD2 (empByDept): small dataset, changes frequently.
• RDD1 join RDD2 is required every time RDD2 changes.
• Hash partition RDD1, then persist it.
• Every time RDD2 changes and a join is done:
• RDD1 is already hash partitioned and persisted: no shuffle required.
• RDD2 gets hash partitioned: a shuffle is required, but RDD2 is small.
The employees RDD is hashed and persisted, e.g. odd keys on node 1 and even keys on node 2, whereas empByDept is randomly distributed when it is created. Spark needs to hash only empByDept (shuffling reduced); next, Spark performs the join (no shuffle required).
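A sketch of this pattern (the datasets and keys are illustrative):

import org.apache.spark.HashPartitioner
val employees = sc.parallelize(Seq((101, "Asha"), (102, "Ravi"), (103, "Meena")))
  .partitionBy(new HashPartitioner(2))  // hash partition the big RDD once...
  .persist()                            // ...and persist it
val empByDept = sc.parallelize(Seq((101, "Sales"), (103, "HR")))
val joined = employees.join(empByDept)  // only the small empByDept is shuffled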
Partitioning Tuple RDDs with more than 2 elements
Even if the RDD element is a tuple and not a (key, value) pair, e.g. (“Alok Nath”, 36554, “India”), it can be hash partitioned by any element of the tuple, i.e. by name, ID, or country. This is left as an exercise; one possible approach is sketched below.
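One possible approach (a sketch, not the only solution; the second tuple is illustrative): keyBy turns the tuple into a (key, value) pair, after which partitionBy applies:

import org.apache.spark.HashPartitioner
val tupleRDD = sc.parallelize(Seq(("Alok Nath", 36554, "India"),
                                  ("Mary Jones", 27811, "USA")))
val byId = tupleRDD.keyBy(t => t._2)                        // key on the ID (2nd element)
val partitioned = byId.partitionBy(new HashPartitioner(4))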
Pipelining and Shuffle Videos (Optional)
“A Deeper Understanding of Spark Internals” – Aaron Davidson (Databricks).mp4
(Link to video on PC – will work only if the video file is on the PC.)
Spark uses a BitTorrent-like protocol for sending broadcast variables across the cluster, i.e., for each variable that has to be broadcast, initially the driver acts as the only source. The data is split into blocks at the driver, and each receiver starts fetching blocks into its local directory. Once a block is completely received, that receiver also acts as a source of the block for the rest of the receivers (this reduces the load on the machine running the driver). This continues for the rest of the blocks. So initially only the driver is the source, and later on the number of sources increases; because of this, the rate at which blocks are fetched by a node increases over time.
Broadcast example-2
factor is broadcast twice to all the worker nodes.
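The slides do not show the code for this example; the sketch below (variable names assumed) contrasts an explicit broadcast, shipped to each executor once, with a plain closure variable, which is re-sent with every task:

val factor = 3
val bcFactor = sc.broadcast(factor)          // sent to each worker node once
val nums = sc.parallelize(1 to 5)
nums.map(x => x * bcFactor.value).collect()  // Array(3, 6, 9, 12, 15)
// Without sc.broadcast, `factor` would be captured in the task closure and
// serialized with every task, i.e. it could be sent more than once.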