Lecture 3 - Introduction to Apache Spark
Dr. Marwan Neupane
2023
Outline
o Part 1 - Memory and Big Data
o Part 2 - On MapReduce
o Part 3 - Introduction to Spark
• MapReduce Shortcomings
• Spark's Key Features - Ideal Applications - Architecture
o Part 4 - Spark Core
o Part 5 - Advanced Concepts
o Part 6 - Key/Value RDDs
o Part 7 - Spark Unified Stack
Lecture 3 - Part 1
Memory and Big Data
Introduction
o In the previous lecture, we saw how, in Big Data, we avoid moving data
around as much as possible and how, when possible, data is read from the
replica located on the closest node.
o In fact, the principle in Big Data is to "bring computation to the data, not the
other way around".
o In this first part of the lecture we will dive deeper into these concepts to get
a better insight into how Big Data systems are built and why they are built
this way.
Memory Latency (1)
o Adding two quantities x and y requires, as a first step, that x be read from
storage into a register on the CPU. The time it takes to complete a step
is called latency.
o In the second step, y is read from storage into a register. In the third step,
the addition is performed and the result is stored in a register. In
the last step, the result is written back to the storage device.
o The time needed to perform all of these steps is called the total latency.
Memory Latency (2)
o It may come as a surprise to you that most of the time needed to
perform this addition is spent on the read and write operations
rather than on the addition itself.
o In Big Data, this problem is aggravated by the large volume of the
data involved.
o In fact, in Big Data, we might even have to move data over a network, which
aggravates the problem substantially. (As a matter of fact, the networking
bottleneck is a serious performance problem in Big Data.) However, that issue is
separate from this discussion of memory latency.
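o A back-of-the-envelope sketch in Python. The latency numbers below are assumptions chosen only for illustration (real values vary widely by hardware), but they show how the reads and writes dominate the addition itself:

    # Assumed latencies (illustrative only, not measurements)
    read_x = 100e-9   # read x from main memory: ~100 ns
    read_y = 100e-9   # read y from main memory: ~100 ns
    add = 1e-9        # the addition on the CPU:  ~1 ns
    write = 100e-9    # write the result back:   ~100 ns

    total = read_x + read_y + add + write
    print(f"addition is {add / total:.1%} of total latency")  # ~0.3%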
Memory Latency (3)
[Figure: latencies across the memory/storage hierarchy (from www.xtremegaminerd.com)]
Access Locality
o Having good access locality means accessing the same page or neighboring
pages repeatedly, which speeds up processing.
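o A minimal sketch of access locality using NumPy (NumPy is an assumption here, not part of the lecture). A row-major array is stored row by row, so traversing rows touches neighboring memory, while traversing columns jumps across pages:

    import time
    import numpy as np

    a = np.random.rand(5000, 5000)  # row-major (C order) by default

    t0 = time.time()
    row_sum = sum(a[i, :].sum() for i in range(5000))  # good locality
    t1 = time.time()
    col_sum = sum(a[:, j].sum() for j in range(5000))  # poor locality
    t2 = time.time()
    print(f"row-wise: {t1 - t0:.2f}s, column-wise: {t2 - t1:.2f}s")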
Lecture 3 - Part 2
On MapReduce
On MapReduce (1)
o Given a list myList = [0,1,2,3,4,5,6,7], say the task is to calculate the sum
of the squares of the elements of myList using MapReduce (parallel
computation).
o Each reduce operation takes as input two of the squared elements
and outputs their sum.
Input:         0   1   2   3   4   5   6   7
After map:     0   1   4   9  16  25  36  49
Reduce run 1:    1      13      41      85
Reduce run 2:        14             126
Reduce run 3:              140
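o A minimal plain-Python sketch of the same computation: a map followed by a reduce. (Python's reduce applies the function pairwise from the left rather than as a parallel tree, but the result is the same because addition is order independent.)

    from functools import reduce

    myList = [0, 1, 2, 3, 4, 5, 6, 7]
    squares = list(map(lambda x: x * x, myList))  # [0, 1, 4, 9, 16, 25, 36, 49]
    total = reduce(lambda a, b: a + b, squares)   # 140
    print(total)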
On MapReduce (4)
o Notice that this parallel system needed 3 rounds of reduce operations to
perform this computation, whereas a traditional system (sequential
computation, not parallel) would need 7 reduce operations (why?). Again, we
can clearly see the benefit of parallel computation.
o Notice how we started with a list of 8 elements (8 dimensions) and the
final output is a single value (1 dimension).
On MapReduce (5)
(There are other ways of performing the same computation, which we are not
going to use here.)
o Order independence means that the output of map and reduce should not
depend on the order of the input.
o Any thoughts about how the computation in the previous example was
executed?
Lecture 3 - Part 3
Introduction to Spark
Introduction
o Spark is a distributed computing technology whose popularity is rapidly
increasing.
o It was developed by Matei Zaharia at the UC Berkeley RAD Lab (later the
AMPLab), starting in 2009.
o Spark is a general distributed data processing engine built for speed, ease of use, and flexibility.
o Spark attempts to replace MapReduce as the preferred Big Data processing platform,
although it does not attempt to replace HDFS.
o One shortcoming of MapReduce is that job results need to be stored in HDFS
before they can be used by another job.
o Iterative Algorithms: these are data processing algorithms that iterate over
the same data multiple times. Spark's in-memory capabilities make it well
suited to iterative algorithms, and hence to machine learning.
1. The head node: which consists of the Spark Driver and the Cluster Manager.
The program that the programmer writes runs on the Spark Driver, which executes
the main function of the code. Spark programs access the cluster through a
SparkContext object; there is only one active SparkContext object per application. It
is easily initialized as sc = SparkContext().
2. The workers: this is where the code is actually executed. Usually we have one
worker for each core.
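o A minimal sketch of initializing a SparkContext and running a small job (the master URL "local[4]" and the app name are assumptions for a local test):

    from pyspark import SparkContext

    sc = SparkContext(master="local[4]", appName="Lecture3Demo")
    print(sc.parallelize(range(8)).map(lambda x: x * x).sum())  # 140
    sc.stop()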
Spark Architecture (2)
1. Spark Core.
2. Spark SQL
3. Spark MLlib
4. Spark Streaming
5. Spark GraphX
6. SparkR
• *Some references call it "Spark Components"; I prefer the term "Stack" because the term "Components" gives the impression that you need
them all to run Spark, which is not the case at all. In fact, when we mention "Spark", we are usually referring to "Spark Core".
Lecture 3 - Part 4
Spark Core
Spark Core
o All the work Spark does is expressed as creating new RDDs, transforming existing
RDDs, or calling operations on RDDs to compute a result.
RDD (2)
o Spark automatically distributes the data contained in RDDs across your
cluster and parallelizes the operations you perform on them.
o Once an RDD is distributed on the cluster, you no longer access it directly; you
manipulate it through RDD operations that are designed to run in parallel.
Creating RDDs
1. From a file:
lines = sc.textFile("sample.txt")
2. From a collection in the driver program:
rdd = sc.parallelize([1, 2, 3, 3])
RDD Operations - Transformations (1)
map()
Applies a function to each element in the RDD and returns an RDD of the results.
filter()
Returns an RDD consisting of only the elements that pass the condition passed to
filter().
RDD Operations - Transformations (2)
distinct()
Removes duplicates.
union()
Produces an RDD containing elements from both RDDs.
intersection()
Produces an RDD containing only elements found in both RDDs.
cartesian()
Produces the Cartesian product of this RDD with the other RDD.
Transformations – Examples
o Let:
rdd1 = sc.parallelize([1, 2, 3, 3])
rdd2 = sc.parallelize([1, 2, 3])
rdd3 = sc.parallelize([3, 4, 5])
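o A sketch of these transformations applied to the RDDs above, with outputs shown as comments (assuming sc is already initialized; element order in the outputs may vary across runs):

    rdd1.map(lambda x: x + 1).collect()      # [2, 3, 4, 4]
    rdd1.filter(lambda x: x != 1).collect()  # [2, 3, 3]
    rdd1.distinct().collect()                # [1, 2, 3]
    rdd2.union(rdd3).collect()               # [1, 2, 3, 3, 4, 5]
    rdd2.intersection(rdd3).collect()        # [3]
    rdd2.cartesian(rdd3).collect()           # [(1, 3), (1, 4), (1, 5), (2, 3), ...]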
RDD Operations – Actions
reduce(func)
Takes a function func (e.g., +) that operates on two elements of the type in your
RDD and returns a new element of the same type.
collect()
Returns the RDD's content to the driver program so that we can "see" it. collect()
is an important operation, but it is also a very slow one. In addition, it requires
that all the data fit on one machine (the head node).
count()
Returns the number of elements in the RDD.
countByValue()
Returns the number of times each element occurs in the RDD.
foreach(func)
Applies a function func to each element in the RDD.
Actions – Examples
o Let:
rdd1 = sc.parallelize([1, 2, 3, 3])
rdd1.count() = 4
rdd1.collect() = [1, 2, 3, 3]
rdd1.countByValue() = {1: 1, 2: 1, 3: 2}
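o A sketch of the remaining two actions on rdd1 (note that foreach runs on the workers, so on a real cluster its print output appears in the executor logs, not on the driver):

    rdd1.reduce(lambda a, b: a + b)   # 9
    rdd1.foreach(lambda x: print(x))  # applies print to each element on the workers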
Lecture 3 - Part 5
Advanced Concepts
Lazy Evaluation (1)
o Lazy evaluation means that Spark will not begin to execute until it sees an
action.
o This means that if we call an operation on an RDD, say map(), the operation
is not performed immediately. Instead, Spark internally records that this
operation has been requested.
o The purpose of lazy evaluation is to reduce the number of passes Spark has
to make over the data, by simply grouping the operations together and
then performing them, all together, when necessary.
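o A minimal sketch of lazy evaluation (assuming sc is initialized): nothing is computed until the action on the last line, at which point Spark runs the whole pipeline in a single pass over the data:

    squares = sc.parallelize(range(8)).map(lambda x: x * x)  # recorded, not run
    evens = squares.filter(lambda x: x % 2 == 0)             # recorded, not run
    evens.collect()  # the action triggers execution: [0, 4, 16, 36]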
Materialization and Persistence (1)
o Remember the example at the beginning of this lecture: summing the squares of
the elements of a list.
o Notice, and this is very important, that we do not need to store the squares; we
can process them immediately. In other words, RDD2 does not need to be
materialized (materialized = stored in memory).
[Figure: the chain of RDDs for this computation (from https://i.stack.imgur.com)]
Materialization and Persistence (3)
o This can be expensive in cases where we need to use the RDD multiple
times (in iterative algorithms, for instance).
o In such cases, we can force Spark to persist an RDD, so that the nodes that use it
cache their partitions of it in memory. This in-memory caching feature is what enables
the massive speedups in Spark.
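o A minimal sketch of persisting an RDD that is reused by two actions (assuming sc is initialized); without persist(), the squares would be recomputed from scratch for the second action:

    squares = sc.parallelize(range(1000)).map(lambda x: x * x)
    squares.persist()                   # equivalently: squares.cache()
    squares.count()                     # first action computes and caches the partitions
    squares.reduce(lambda a, b: a + b)  # second action reuses the cached data
    squares.unpersist()                 # release the memory when done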
Lecture 3 - Part 6
Key/Value RDDs
Key/Value RDDs (Pair RDDs) (1)
o The operations we have presented work with RDDs of any data type. There are,
however, RDDs with a special data type, key/value pairs, and these
RDDs have their own operations.
o Key/value RDDs are a useful building block in many programs, as they allow
you to act on each key in parallel or to regroup data across the network.
Creating a Key/Value RDD 45
o RDD=sc.parallelize
((‘UK’,1),(‘France’,3),(‘Germany’,2),(‘UK’,2))
Key/Value RDD – Transformations (1)
reduceByKey(func)
Combines values with the same key using func.
groupByKey()
Groups values with the same key.
mapValues(func)
Applies a function func to each value of a pair RDD without changing the key.
Key/Value RDD – Transformations (2)
keys()
Returns an RDD of just the keys.
values()
Returns an RDD of just the values.
sortByKey()
Returns an RDD sorted by key.
Key/Value RDD – Transformations – Examples
o Let:
rdd = sc.parallelize([(1, 2), (3, 4), (3, 6)])
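o A sketch of these transformations on the rdd above, with outputs shown as comments (assuming sc is initialized; ordering may vary across runs):

    rdd.reduceByKey(lambda a, b: a + b).collect()  # [(1, 2), (3, 10)]
    rdd.groupByKey().mapValues(list).collect()     # [(1, [2]), (3, [4, 6])]
    rdd.mapValues(lambda v: v * 10).collect()      # [(1, 20), (3, 40), (3, 60)]
    rdd.keys().collect()                           # [1, 3, 3]
    rdd.values().collect()                         # [2, 4, 6]
    rdd.sortByKey().collect()                      # [(1, 2), (3, 4), (3, 6)]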
Key/Value RDD – Actions
countByKey()
Counts the number of elements for each key.
lookup(key)
Returns all values associated with the provided key.
collectAsMap()
Collects the result as a map (= dictionary) to provide easy lookup.
Key/Value RDD – Actions – Examples
o Let:
rdd = sc.parallelize([(1, 2), (3, 4), (3, 6)])
rdd.countByKey() = {1: 1, 3: 2}
rdd.collectAsMap() = {1: 2, 3: 6} (with duplicate keys, only one value per key is kept)
rdd.lookup(3) = [4, 6]
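o The same examples as runnable PySpark, with outputs as comments (countByKey() actually returns a dictionary-like object; dict() is used here only to display it plainly):

    rdd = sc.parallelize([(1, 2), (3, 4), (3, 6)])
    dict(rdd.countByKey())  # {1: 1, 3: 2}
    rdd.collectAsMap()      # {1: 2, 3: 6} -- later values overwrite earlier ones
    rdd.lookup(3)           # [4, 6]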
A Look at the Different Operations
• Most transformations (like map()) do not require any communication between
nodes.
• Actions (like reduce() and count()) require more communication because they
return their results to the head node.
• Some shuffle transformations (like sortByKey(), which returns a sorted copy of an
RDD) require a lot of communication between the workers.
Lecture 3 - Part 7
Spark Unified Stack
Introduction (1)
o Spark provides a unified data processing engine known as the Spark Unified Stack.
o This stack is built on top of the Spark Core, which we studied in the previous part.
o Spark Core provides the functionalities needed to run this stack.
o Each component of this stack is designed for a specific data processing workload.
Introduction (2)
2. It is more efficient to combine different types of data processing tasks when the
different components are applied to the same data.
Spark SQL (1)
o Spark SQL was built to process structured data at the petabyte scale.
o Spark SQL provides the ability to read data from and write data to various
structured formats such as JSON and CSV.
o Spark SQL smoothly integrates with other Spark libraries such as Spark
Streaming, Spark ML, and GraphX.
Spark SQL (2)
o Spark SQL supports a number of relational databases, but also NoSQL ones.
o Spark SQL makes data processing applications run faster using several
techniques such as reduced disk I/O, in-memory columnar caching, query
optimization, and code generation.
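o A minimal sketch of reading structured data and querying it with Spark SQL (the file name people.json and the column names are assumptions for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-demo").getOrCreate()
    df = spark.read.json("people.json")  # assumed file of {"name": ..., "age": ...}
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()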
Spark Streaming (1)
o In batch processing, used in (early versions of) Hadoop and Spark, data is collected for a
period of time and processed in batches.
o Batch processing systems have high latency; it may take a few hours to process a batch.
Thus, there is a long wait before you can see the results of a batch processing application.
o Processing streaming data is an important topic in Big Data, and in data science in general.
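o A minimal sketch of a streaming word count with Spark Streaming's DStream API (the socket source on localhost:9999 is an assumption; data is processed here in 5-second micro-batches):

    from pyspark.streaming import StreamingContext

    ssc = StreamingContext(sc, 5)  # 5-second batch interval; assumes sc exists
    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()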
Spark MLlib
o The MLlib library is a set of tools to build and evaluate ML models on very
large datasets.
o MLlib also provides classes and functions for common statistical analyses.
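o A minimal sketch of fitting a model with MLlib's DataFrame-based API (the tiny two-row training set is an assumption, included only to make the example self-contained):

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mllib-demo").getOrCreate()
    training = spark.createDataFrame(
        [(0.0, Vectors.dense([0.0, 1.1])),
         (1.0, Vectors.dense([2.0, 1.0]))],
        ["label", "features"])
    model = LogisticRegression(maxIter=10).fit(training)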
Spark GraphX
o GraphX is Spark's library for graphs and graph-parallel computation.
SparkR
o SparkR is in fact an R package that allows you to use Spark from R; it is a tool for
running Spark from R.
Spark Unified Stack and Various Interactions
[Figure: the Spark Unified Stack and its various runtime interactions and storage options.]
Summary
o In Part 1 of this lecture we presented some memory aspects related to Big Data, such as memory latency
and access locality.
o In Part 2 we discussed MapReduce through a worked example (summing the squares of the elements of a list).
o In Part 3 we introduced Spark, its key features, and its ideal applications. We also presented a very brief
description of its architecture.
o In Part 4 we presented Spark Core, where we talked about RDDs, the main abstraction Spark was built on.
We defined some transformations and actions on RDDs.
o In Part 5 we presented advanced concepts such as lazy evaluation, materialization and persistence.
o In Part 6 we talked about key/value RDDs and the transformations and actions applied to them.
o Finally, in Part 7 we talked about the Spark Unified Stack and gave a very brief description of its
components: Spark SQL, Spark Streaming, MLlib, Spark GraphX, and SparkR.