Lecture 3 - Introduction to Apache Spark


7082 CEM

Lecture 3 - Introduction to Apache Spark

Siddhartha Neupane
Dr. Marwan Fuad

2020-2021
2023
Outline 2

o Part 1 - Memory and Big Data
• Memory Latency - Access Locality

o Part 2 - On MapReduce

o Part 3 - Introduction to Spark
• MapReduce Shortcomings
• Spark’s Key Features - Ideal Applications - Architecture

o Part 4 - Spark Core
• RDDs - RDD Operations
• Creating RDDs - Transformations - Actions

o Part 5 - Advanced Concepts
• Lazy Evaluation - Materialization and Persistence

o Part 6 - Key/Value RDDs
• Key/Value RDDs
• Creating a Key/Value RDD - Transformations - Actions

o Part 7 - Spark Unified Stack
• Spark SQL, Spark Streaming, MLlib, Spark GraphX, SparkR
Lecture 3 - Part 1
Memory and Big Data
Introduction 4

o In the previous lecture, we saw how in Big Data, we avoid moving data
around as much as possible, and how, when possible, data is read from the
replica located on the closest node.

o In fact the principle in Big Data is to “bring computation to data not the
other way around”

o In this first part of the lecture we will dive deeper into these concepts to get
a better insight into how Big Data systems are built and why they are built
this way.
Memory Latency (1) 5

o Performing a computation on a computer requires an interaction between the CPU and the storage.

o Adding two quantities, x and y, requires, as a first step, that x is read from the storage into a register on the CPU. The time it takes to complete a step is called latency.

o In the second step, y is read from the storage into a register. In the third step the addition operation is performed and the result is stored in a register. In the last step the result is written to the storage device.

o The time needed to perform all these steps is called the total latency.
Memory Latency (2) 6

o It may come as a surprise that most of the time needed to perform this addition operation is spent on the read and write operations rather than on the addition itself.

o In Big Data, this problem is aggravated because of the large volume of the data involved.

o In fact, in Big Data, we might even have to move data over a network, which substantially aggravates the problem. (As a matter of fact, the network bottleneck is a serious issue for Big Data performance.) However, this is not related to this discussion of memory latency.
Memory Latency (3) 7

o The problem of memory latency is related to the type of memory used.

(Figure: latency of different types of memory; from www.xtremegaminerd.com)
Access Locality 8

o Computer memory is divided into pages.

o Having good access locality means accessing the same page or neighboring
pages repeatedly. This speeds up the processing.

o Hardware is designed to speed up software that exhibits good access locality.

o We have two kinds of locality:


• Temporal Locality: Accessing the same memory location repeatedly
• Spatial Locality: Accessing neighbouring memory locations

o The programmer can store/process data to take advantage of access locality (sorting, for example, increases spatial locality).
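As a rough, illustrative sketch (not from the slides), the following plain-Python timing experiment compares sequential access (good spatial locality) with access in a shuffled order (poor locality); on most machines the sequential loop runs noticeably faster:

import random
import time

data = list(range(10_000_000))
indices = list(range(len(data)))
random.shuffle(indices)

start = time.perf_counter()
total = 0
for i in range(len(data)):            # sequential: neighbouring locations, good locality
    total += data[i]
print("sequential:", time.perf_counter() - start, "s")

start = time.perf_counter()
total = 0
for i in indices:                     # shuffled order: poor locality, more cache misses
    total += data[i]
print("shuffled:  ", time.perf_counter() - start, "s")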
Lecture 3 - Part 2
On MapReduce
On MapReduce (1) 10

o Given a list myList = [0,1,2,3,4,5,6,7], say the task is to calculate the sum of the squares of the elements of myList using MapReduce (parallel computation).

o Each map operation will calculate the square of an element of myList

o Each reduce operation will take as an input two of the squared elements,
and outputs their sum.

o This operation will be repeated until we get the final sum.


On MapReduce(2) 11

o On a parallel system the map operation will be executed as follows:

0 1 2 3 4 5 6 7

0 1 4 9 16 25 36 49

o Notice how this allows us to execute these computations in parallel.


On MapReduce (3) 12

o On a parallel system of 4 cores, or more (why?), the reduce operation is executed as follows:
0 1 4 9 16 25 36 49

1 13 41 85

14 126
140
On MapReduce (4) 13

o Notice that this parallel system needed 3 (rounds of) reduce operations to perform this computation, whereas a traditional system (sequential computation, not parallel) would need 7 reduce operations (why?). Again, we can clearly see the benefit of parallel computation.

o Notice how we started with a list of 8 elements (or 8 dimensions) and the final output is a single value (1 dimension).
On MapReduce (5) 14

o Using the lambda operator in Python, the previous computation could be expressed as:

reduce(lambda x, y: x + y, map(lambda i: i*i, myList))

(There are other ways of performing the same computation, which we are not
going to use here)
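For completeness, here is a runnable plain-Python version of this sketch (note that in Python 3, reduce lives in functools; this is ordinary sequential Python, and Spark parallelizes the equivalent pattern for us):

from functools import reduce

myList = [0, 1, 2, 3, 4, 5, 6, 7]

# map: square each element; reduce: sum the squares pairwise
result = reduce(lambda x, y: x + y, map(lambda i: i * i, myList))
print(result)   # 140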

o Notice that, technically, what we are doing is giving the system an execution plan of how to perform the required computation, but the system will decide how to run it on several machines and the order in which it is going to do that.
On MapReduce(6) 15

o Order independence means that the output of map and reduce should not depend on the order of the input.

o Order independence is important because the order of the computation should be chosen by the compiler, which optimizes it for efficient use of the hardware.

o Order independence is also important because without it we could not perform parallel computation. Parallel computation is very hard for the programmer to manage/control, so the programmer should be able to tell the parallel system what he/she wants to do and let the system decide how to do it.
On MapReduce(7) 16

o Any thoughts about how the computation in the previous example was executed?

Lecture 3 - Part 3
Introduction to Spark
Introduction 18
o Spark is a distributed computing technology whose popularity is rapidly increasing.

o It was initially developed by Matei Zaharia* at the UC Berkeley RAD Lab (which later became the AMPLab), starting in 2009.

o Spark is a general distributed data processing engine built for speed, ease of use, and flexibility.

o Spark supports Python, Java, Scala, SQL, and R.

o Spark attempts to replace MapReduce as the preferred Big Data processing platform,
although it does not attempt to replace HDFS.

* Two of the references of this lecture were written by Matei Zaharia


MapReduce Shortcomings 19

o Although Hadoop is widely used as a Big Data platform, MapReduce, one of its main components, has a few drawbacks.

o MapReduce job results need to be stored in HDFS before they can be used by another job.

o Decomposing every problem in terms of map and reduce is not always easy.

o Hadoop is rather a low-level framework, so it is not ideal to combine it with high-level languages, tools, or real-time processing frameworks.
Spark’s Key Features 20

o Easy to use: It has a rich application programming interface (API) for developing Big Data applications.

o Fast: As mentioned previously, because it allows in-memory computations, and also because it implements an advanced execution engine.

o General-purpose: It provides a unified, integrated platform for different types of data processing jobs.

o Scalable: The data processing capacity of a Spark cluster can be increased by just adding more nodes to the cluster.

o Fault tolerant: Spark automatically handles the failure of a node in a cluster.


Spark’s Ideal Applications 21

o Iterative Algorithms: These are data processing algorithms that iterate over the same data multiple times, which makes Spark suitable for machine learning. The reason Spark suits iterative algorithms is its in-memory capabilities.

o Interactive Analysis: This involves exploring a dataset interactively. Spark is ideal for this application because of, again, its in-memory computing capabilities. The first query reads data from disk, but subsequent queries read the cached data from memory, so they execute orders of magnitude faster than queries on data on disk.
Spark Architecture (1) 22

o At a high level, Spark’s architecture has two parts:

1. The head node: this consists of the Spark Driver and the Cluster Manager. The program that the programmer writes runs on the Spark driver, which executes the main function of the code. Spark programs access the cluster through a SparkContext object. There is only one SparkContext object per application. It can be initialized simply as sc = SparkContext().

2. The workers: this is where the code is actually executed. Usually we have one worker per core.
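As a minimal sketch (not from the slides; the application name and the local master URL below are illustrative), a SparkContext can be created and used in PySpark as follows:

from pyspark import SparkConf, SparkContext

# Illustrative configuration: run locally using all available cores
conf = SparkConf().setAppName("Lecture3Examples").setMaster("local[*]")
sc = SparkContext(conf=conf)

rdd = sc.parallelize([0, 1, 2, 3, 4, 5, 6, 7])
print(rdd.map(lambda x: x * x).reduce(lambda x, y: x + y))   # 140

sc.stop()   # release the resources held by the application when done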
Spark Architecture (2) 23

o A high-level representation of Spark


Spark Unified Stack* 24

o Its main elements are**:

1. Spark Core.
2. Spark SQL
3. Spark MLlib
4. Spark Streaming
5. Spark GraphX
6. SparkR

• *Some references call it “Spark Components”, I prefer the term “Stack” because the term “Components” gives the impression that you should have
them all to run Spark, which is not the case at all. In fact when we mention “Spark”, we’re usually referring to “Spark Core”

• ** Some references add “Spark Programming Tools”


Lecture 3 - Part 4
Spark Core
Spark Core 26

o This is the main component of Spark Unified Stack.

o It contains basic Spark functionalities required for running jobs.

o The most important of these functionalities is the Resilient Distributed Dataset (RDD).
RDD (1) 27

o This is the principal component of Spark.

o It is one of the main differentiators between Spark and other cluster computing frameworks.

o It is called resilient because it is capable of rebuilding datasets in case of node failures.

o It is an abstraction of a distributed collection of items, with operations and transformations applicable to the dataset.

o All the work Spark does is expressed as either creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result.
RDD (2) 28
o Spark automatically distributes the data contained in RDDs across your cluster and parallelizes the operations you perform on them.

o An RDD is immutable. This means it cannot be changed after it has been created.

o Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster.

o Once an RDD is distributed on the cluster you do not have direct access to it. You manipulate it through RDD operations that are designed to run in parallel.

o An RDD’s contents can be brought back to the main (driver) node through a collect operation.
Creating RDDs 29

o There are two ways to create an RDD:

1. From a file.
Example:
lines = sc.textFile("sample.txt")

2. From a list on the master node.
Example (revisited example):
RDD = sc.parallelize([0,1,2,3,4,5,6,7])

Now we can perform map and reduce:

RDD.map(lambda x: x*x).reduce(lambda x, y: x + y)
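As a small hedged sketch of the file-based route (it assumes a local text file named sample.txt actually exists; the filter condition is just for illustration):

lines = sc.textFile("sample.txt")                        # RDD of strings, one line per element
spark_lines = lines.filter(lambda line: "Spark" in line)
print(lines.count(), spark_lines.count())                # total lines vs. lines mentioning "Spark"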
RDD Operations 30

o The operations on an RDD can be grouped in two categories:

1. Transformations: These are operations on an RDD that return a new RDD.

2. Actions: These are operations that return a result.

o While working with RDDs it is important to understand whether an operation is a transformation or an action, because these operations are treated very differently by Spark.
RDD Operations - Transformations (1) 31

o Before we introduce transformation operations, we have to mention that transformed RDDs are computed lazily, only when you use them in an action. This is an important concept in Spark that we will elaborate on later.

o The most common transformations are:

map()
Applies a function to each element in the RDD and returns an RDD of the result.

filter ()
Returns an RDD consisting of only elements that pass the condition passed to
filter().
RDD Operations - Transformations (2) 32

distinct ()
Removes duplicates.

union()
Produces an RDD containing elements from both RDDs.

intersection()
Produces an RDD containing only elements found in both RDDs.

cartesian()
Returns the Cartesian product with another RDD.
Transformations – Examples 33

o Let:

rdd1 = sc.parallelize([1,2,3,3])
rdd2 = sc.parallelize([1,2,3])
rdd3 = sc.parallelize([3,4,5])

rdd1.map(x => x + 1) = [2,3,4,4]


rdd1.filter(x => x != 1) = [2,3,3]
rdd1.distinct() = [1,2,3]
rdd2.union(rdd3)=[1,2,3,3,4,5]
rdd2.intersection(rdd3)=[3]
rdd2.cartesian(rdd3)=[(1,3),(1,4),(1,5),(2,3),…,(3,5)]
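Note that the arrow notation above (e.g. x => x + 1) is Scala-style shorthand. A hedged PySpark rendering of the same examples, using rdd1, rdd2, and rdd3 as defined above (collect() is added so the results can be seen on the driver):

rdd1.map(lambda x: x + 1).collect()        # [2, 3, 4, 4]
rdd1.filter(lambda x: x != 1).collect()    # [2, 3, 3]
rdd1.distinct().collect()                  # [1, 2, 3]   (order may vary)
rdd2.union(rdd3).collect()                 # [1, 2, 3, 3, 4, 5]
rdd2.intersection(rdd3).collect()          # [3]
rdd2.cartesian(rdd3).collect()             # [(1, 3), (1, 4), ..., (3, 5)]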
RDD Operations - Actions (1) 34

o The most common actions are:

reduce (func)
Takes a function func (e.g. +) that operates on two elements of the type in your
RDD and returns a new element of the same type.

collect()
Returns the RDD’s contents to the driver program so that we can “see” them. collect() is an important operation, but it is also a very slow one. In addition, it requires that all the data fit on one machine (the master node).

Remember: once you collect data, parallelism is lost.


RDD Operations - Actions (2) 35

count ()
Returns the number of elements in the RDD.

countByValue ()
Returns the number of times each element occurs in the RDD.

foreach (func)
Applies a function func to each element in the RDD.
Actions – Examples 36

o Let:

rdd1 = sc.parallelize([1,2,3,3])

rdd1.count() = 4
rdd1.collect() = [1,2,3,3]
rdd1.countByValue() = {1: 1, 2: 1, 3: 2}
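Continuing the same rdd1, a short hedged sketch of the remaining actions in PySpark (the print calls are just for illustration):

print(rdd1.reduce(lambda x, y: x + y))     # 9, i.e. 1 + 2 + 3 + 3
rdd1.foreach(lambda x: print(x))           # runs on the executors; on a real cluster the
                                           # output appears in the executor logs, not on the driver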
Lecture 3 - Part 5
Advanced Concepts
Lazy Evaluation (1) 38

o We mentioned that transformations are lazily computed.

o Lazy evaluation means that Spark will not begin to execute until it sees an
action.

o This means that if we call an operation on an RDD, say map(), the operation is not immediately performed. Instead, Spark internally records that this operation has been requested.

o In other words, a transformation is not performed until it is necessary.


Lazy Evaluation (2) 39

o The purpose of lazy evaluation is to reduce the number of passes Spark has
to make over the data, by simply grouping the operations together, and
then performing them, all together, when necessary.
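A small illustrative sketch of lazy evaluation (continuing the RDD from Part 4; the comments describe typical behaviour):

squares = RDD.map(lambda x: x * x)               # returns immediately: nothing is computed yet
evens = squares.filter(lambda x: x % 2 == 0)     # still nothing is computed

# Only the action below makes Spark run the whole chain, in a single pass over the data
print(evens.collect())                           # [0, 4, 16, 36]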
Materialization and Persistence (1) 40

o Materialization is also an important concept in Spark.

o Remember the example at the beginning of this lecture, summing the squares of the elements of

RDD1 = sc.parallelize([0,1,2,3,4,5,6,7])

Using MapReduce we performed this task by

RDD1 -> map(lambda i: i*i) -> RDD2 -> reduce(lambda x, y: x + y) -> result (on the master node)

We say that RDD1 is part of the lineage of RDD2.


Materialization and Persistence (2) 41

o Notice, and this is very important, that we do not need to store the squares; we can process them immediately. In other words, RDD2 does not need to be materialized (materialized = stored in memory).

(Figure: RDD lineage; from https://i.stack.imgur.com)
Materialization and Persistence (3) 42

o By default an RDD is not materialized. In other words, Spark will re-compute it and all of its dependencies each time we call an action on this RDD.

o This can be expensive in the case where we need to use the RDD multiple times (in iterative algorithms, for instance).

o In this case, we can force Spark to persist an RDD, so the nodes that use it will cache its partitions. This in-memory caching feature is what allows the massive speedups in Spark.

o This feature can, however, consume a lot of memory.
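A hedged sketch of persistence, continuing the squares RDD idea from above (cache(), persist(), and unpersist() are standard RDD methods):

squares = RDD.map(lambda x: x * x)
squares.cache()                        # same as persist() with the default MEMORY_ONLY level

# The first action computes the squares and caches the partitions in memory;
# the second action reuses the cached data instead of recomputing the map.
print(squares.reduce(lambda x, y: x + y))   # 140
print(squares.count())                      # 8

squares.unpersist()                    # free the cached partitions when no longer needed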


Lecture 3 - Part 6
Key/Value RDDs
Key/Value RDDs (Pair RDDs)(1) 44

o The operations we presented so far work with RDDs of any data type. There are, however, RDDs with a special data type, key/value, and these RDDs have their own operations.
o These RDDs are simply a collection of key-value pairs.

o Key/value RDDs are a useful building block in many programs, as they allow
you to act on each key in parallel or regroup data across the network.
Creating a Key/Value RDD 45

o Say we have a dataset representing the number of times the name of a country is mentioned:

RDD = sc.parallelize([('UK',1), ('France',3), ('Germany',2), ('UK',2)])

o Notice that the key does not have to be unique.


Key/Value RDD – Transformations (1) 46

o The most common transformations on key/value RDDs are:

reduceByKey(func)
Combines values with the same key.

groupByKey()
Groups values with the same key.

mapValues(func)
Applies a function func to each value of a pair RDD without changing the key.
Key/Value RDD – Transformations (2) 47
keys ()
Returns an RDD of just the keys.

values()
Returns an RDD of just the values.

sortByKey ()
Returns an RDD sorted by key.
Key/Value RDD – Transformations – Examples 48
o Let:

rdd= [(1,2),(3,4),(3,6)]

rdd.reduceByKey((x,y) => x + y) = [(1,2),(3,10)]


rdd.groupByKey() = [(1,[2]),(3,[4,6])]
rdd.mapValues (x => x + 1) = [(1,3),(3,5),(3,7)]
rdd.keys()=[1,3,3]
rdd.values()=[2,4,6]
rdd.sortByKey()=[(1,2),(3,4),(3,6)]
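Again, the arrow notation is Scala-style. A hedged PySpark version of the same examples (mapValues(list) is added so the grouped values print as plain lists):

rdd = sc.parallelize([(1, 2), (3, 4), (3, 6)])

rdd.reduceByKey(lambda x, y: x + y).collect()   # [(1, 2), (3, 10)]
rdd.groupByKey().mapValues(list).collect()      # [(1, [2]), (3, [4, 6])]
rdd.mapValues(lambda x: x + 1).collect()        # [(1, 3), (3, 5), (3, 7)]
rdd.keys().collect()                            # [1, 3, 3]
rdd.values().collect()                          # [2, 4, 6]
rdd.sortByKey().collect()                       # [(1, 2), (3, 4), (3, 6)]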
Key/Value RDD – Actions 49
countByKey ()
Returns a dictionary with the number of pairs for each key.

lookup(key)
Returns all values associated with the provided key.

collectAsMap()
Collects the result as a map (= dictionary) to provide easy lookup.
Key/Value RDD – Actions – Examples 50

o Let:

rdd= [(1,2),(3,4),(3,6)]

rdd.countByKey() = {1: 1, 3: 2}
rdd.collectAsMap() = {1: 2, 3: 6}   (for duplicate keys, only one value is kept)
rdd.lookup(3) = [4, 6]
A Look On The Different Operations 51

o The different operations we studied require different levels of communication between the workers:

• Most transformations (like map()) do not require any communication between nodes.

• Actions (like reduce(), count()) require more communication because they produce their results on the head node.

• Some shuffle transformations (like sortByKey(), which returns the same RDD sorted by key) require a lot of communication between workers.

o The more communication an operation needs, the slower it is.


Lecture 3 - Part 7
Spark Unified Stack
Introduction (1) 53

o Spark provides a unified data processing engine known as the Spark Unified Stack.
o This stack is built on top of the Spark Core, which we studied in the previous part.
o Spark Core provides the functionalities needed to run this stack.
o Each component of this stack is designed for a specific data processing workload.
Introduction (2) 54

o Combining different components in a unified stack has several advantages for Big Data applications:

1. Applications are simpler to develop when they use a unified stack.

2. It is more efficient to combine different types of data processing tasks when the different components are applied to the same data.

3. Having different components combined together opens up horizons to new applications that cannot be handled by a single component.
Spark SQL (1) 55

o Despite the increasing popularity of NoSQL databases, many business applications still rely heavily on traditional RDBMSs, which use SQL. This is the reason why several new Big Data projects were developed to address this aspect.

o Spark SQL was built to process structured data at the petabyte scale.

o Spark SQL provides the ability to read data from and write data to various
structured formats such as JSON and CSV.

o Spark SQL smoothly integrates with other Spark libraries such as Spark
Streaming, Spark ML, and GraphX.
Spark SQL (2) 56

o Spark SQL supports a number of relational databases, but also NoSQL ones.

o Spark SQL makes data processing applications run faster using several
techniques such as reduced disk I/O, in-memory columnar caching, query
optimization, and code generation.
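A minimal hedged sketch of Spark SQL (the file name people.json and the column names are illustrative assumptions, not from the slides):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# Read a JSON file into a DataFrame and query it with SQL
df = spark.read.json("people.json")
df.createOrReplaceTempView("people")

adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()

# Write the result back out in another structured format
adults.write.csv("adults_csv", header=True)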
Spark Streaming (1) 57

o In batch processing, used in (early versions of) Hadoop and Spark, data is collected for a
period of time and processed in batches.

o Batch processing systems have high latency. It may take a few hours to process a batch.
Thus, there is a long wait before you can see the results of a batch processing application

o Processing streaming data is an important topic in Big Data, and data science in general.

o It is also vital in many business applications.

o Streaming data concerns the Velocity aspect of Big Data.

o Stream processing is about the continuous processing of unbounded streams of data. Doing this for large volumes of data, on a fault-tolerant system, is quite challenging.
Spark Streaming (2) 58

o Spark Streaming processes a data stream in micro-batches. It splits a data stream into batches of very small, fixed-sized time intervals. The data in each micro-batch is stored as an RDD, which is then processed using Spark Core.
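A hedged sketch of the classic DStream API described above (it assumes a text source sending lines to localhost:9999, e.g. started with nc -lk 9999; newer applications would typically use Structured Streaming instead):

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, batchDuration=5)       # micro-batches of 5 seconds

lines = ssc.socketTextStream("localhost", 9999)   # each micro-batch of lines becomes an RDD
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                   # print a few results for every batch

ssc.start()
ssc.awaitTermination()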
MLlib (1) 59

o Machine Learning (ML) is a branch of Artificial Intelligence (AI) that analyzes data to build a model through a learning process.

o One of the motivations behind developing Spark was to build a computational framework that can run iterative algorithms.

o The MLlib library is a set of tools to build and evaluate ML models for very large datasets.

o Because Spark allows an application to cache a dataset in memory, machine learning applications built with MLlib are fast.
MLlib (2) 60

o MLlib provides several types of machine learning algorithms, such as classification, regression, clustering, and many others.

o It is important to mention that MLlib contains only ML algorithms that can run in parallel, so some ML algorithms are not included because they were not designed for parallel computing.

o MLlib also provides classes and functions for common statistical analysis.
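A hedged sketch of the DataFrame-based MLlib API, reusing the spark session from the Spark SQL sketch (the file name training.csv, the input columns x1..x3, and the numeric label column are illustrative assumptions):

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

# Assemble raw numeric columns into the single vector column MLlib expects
df = spark.read.csv("training.csv", header=True, inferSchema=True)
assembler = VectorAssembler(inputCols=["x1", "x2", "x3"], outputCol="features")
train = assembler.transform(df).select("features", "label")

# Fit a simple classifier; training is distributed across the cluster
lr = LogisticRegression(maxIter=10)
model = lr.fit(train)
model.transform(train).select("label", "prediction").show(5)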
Spark GraphX (1) 61

o A graph is a data structure composed of vertices and edges.

o Once represented as a graph, some problems become easier to solve. This is why graph-processing algorithms are important.
Spark GraphX (2) 62

o Spark GraphX is a distributed graph analytics framework.

o GraphX is a collection of common graph processing algorithms that extends Spark for large-scale graph processing.

o GraphX, however, suffers from a number of significant performance problems.
SparkR 63

o R is a popular statistical programming language.

o However, R was not designed to handle large datasets.

o SparkR leverages Spark’s distributed computing engine to enable data analysis of very large datasets.

o SparkR is in fact an R package that allows you to use Spark from R. It is a tool to run Spark from R.
Spark Unified Stack and Various Interactions 64

o Spark Unified Stack and various runtime interactions and storage options.
Summary 65

o In Part 1 of this lecture we presented some memory aspects related to big data such as memory latency
and access locality.

o In Part 2 we revisited MapReduce and presented an example which we built on later in the lecture.

o In Part 3 we introduced Spark, its key features, and its ideal applications. We also presented a very brief
description of its architecture.

o In Part 4 we presented Spark Core, where we talked about RDDs, the main abstraction Spark was built on.
We defined some transformations and actions on RDDs.

o In Part 5 we presented advanced concepts such as lazy evaluation, materialization and persistence.

o In Part 6 we talked about key/value RDDs and the transformations and actions applied to them.

o Finally, in Part 7 we talked about the Spark Unified Stack and gave a very brief description of its
components: Spark SQL, Spark Streaming, MLlib, Spark GraphX, and SparkR
References 67
• Advanced Analytics with Spark, S. Ryza, U. Laserson, S.Owen, and J. Wills (2010)

• Beginning Apache Spark 2, Hien Luu (2018)

• Big Data Analytics with Spark, Mohammed Guller (2015)

• Data Algorithms, Mahmoud Parsian (2015)

• Data Analytics with Hadoop, Benjamin Bengfort and Jenny Kim (2016)

• Data Analytics with Spark Using Python, Jeffrey Aven (2018)

• High Performance Spark, Holden Karau and Rachel Warren (2017)

• Learning Spark, H. Karau, A. Konwinski, P. Wendell, and M. Zaharia (2015)

• Practical Apache Spark, S. Chellappan and D. Ganesan (2018)

• Spark in Action, Petar Zecevic, Marko Bonaci (2017)

• Spark: The Definitive Guide, Bill Chambers and Matei Zaharia (2018)
