
Apache Spark

Fundamentals

Abderrahmane EZ-ZAHOUT

Master ------ 2020-2021

1
Objectives
After completing this course, you should be able to:
 Learn the fundamentals of Spark, the technology that is
revolutionizing the analytics and big data world!
 Learn how it performs at speeds up to 100 times faster than MapReduce for iterative algorithms or interactive data mining.
 Learn how it provides in-memory cluster computing for lightning fast
speed and supports Java, Python, R, and Scala APIs for ease of
development.
 Learn how it can handle a wide range of data processing scenarios by
combining SQL, streaming and complex analytics together seamlessly
in the same application.
 Learn how it runs on top of Hadoop, Mesos, standalone, or in the
cloud. It can access diverse data sources such as HDFS, Cassandra,
HBase, or S3.

2
Introduction to Spark

I
 What is Spark and what is its purpose?
 Components of the Spark unified stack,
 Resilient Distributed Dataset (RDD),
 Downloading and installing Spark standalone,
 Scala and Python overview,
 Launching and using Spark’s Scala and
Python shell
3
RDD and DataFrames

II
 Understand how to create parallelized
collections and external datasets
 Work with Resilient Distributed Dataset
(RDD) operations
 Utilize shared variables and key-value pair

4
Spark application programming

III
 Understand the purpose and usage of the
SparkContext
 Initialize Spark with the various programming
languages
 Describe and run some Spark examples
 Pass functions to Spark
 Create and run a Spark standalone application
 Submit applications to the cluster
5
Spark libraries

IV
 Understand and use the various Spark
libraries

6
Spark configuration, monitoring
and tuning

V
 Understand components of the Spark cluster
 Configure Spark to modify the Spark properties,
environmental variables, or logging properties
 Monitor Spark using the web UIs, metrics, and
external instrumentation
 Understand performance tuning considerations

7
Spark Use Cases

VI
 Projects

8
Apache Spark
 Apache Spark is an open-source cluster-
computing framework.
 Originally developed at the University of California,
Berkeley's AMPLab,
 the Spark codebase was later donated to the Apache
Software Foundation,
 Spark provides an interface for programming entire
clusters with implicit data parallelism and fault
tolerance.
Apache Spark™ is a unified analytics engine
for large-scale data processing.
9
Big Data and Spark
 Speed
 Run workloads up to 100x faster: 100 times faster than Hadoop’s MapReduce for in-memory computations and 10 times faster on disk,
 Winner of Daytona GraySort contest, sorting a petabyte 3 times faster and
using 10 times less hardware than Hadoop’s MapReduce
 (https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-
in-largescale-sorting.html),
 Apache Spark achieves high performance for both batch and streaming data,
using a state-of-the-art DAG scheduler, a query optimizer, and a physical
execution engine,
 Well suited for iterative algorithms in machine learning,
 Fast, real-time response to user queries on large in-memory data set,
 Low latency data analysis applied to processing live data streams,

10
Big Data and Spark
 Ease of Use

 Write applications quickly in Java, Scala, Python, R, and SQL.

 Spark offers over 80 high-level operators that make it easy to build parallel
apps. And you can use it interactively from the Scala, Python, R, and SQL
shells.

11
Big Data and Spark
 Generality
 Combine SQL, streaming, and complex analytics.
 Existing libraries and APIs make it easy to write programs combining batch, streaming, iterative machine learning and complex queries in a single application,
 Interactive shell is available for Python and Scala,
 Built for performance and reliability, written in Scala and runs on top of JVM,
 Operational and debugging tools from the Java stack are available for
programmers,
 Spark powers a stack of libraries including SQL and DataFrames, MLlib for
machine learning, GraphX, and Spark Streaming. You can combine these
libraries seamlessly in the same application.

12
Big Data and Spark
 Runs Everywhere
 Spark runs on Hadoop, Apache Mesos, Kubernetes,
standalone, or in the cloud. It can access diverse data
sources.
 Spark provides APIs for data munging, ETL, machine
learning, graph processing, streaming, interactive and
batch processing. Can replace several SQL, streaming and
complex analytics systems with one unified environment,
 Strong integration with variety of tools in the Hadoop
ecosystem,
 Can read and write to different data formats and data
sources including HDFS, Cassandra, S3 and HBase,
 You can run Spark using its standalone cluster mode,
on EC2, on Hadoop YARN, on Mesos, or on Kubernetes.
Access data in HDFS, Apache Cassandra, Apache
HBase, Apache Hive, and hundreds of other data sources.

13
Spark is NOT a data storage
system
 Spark is not a data store but is versatile in reading
from and writing to a variety of data sources,
 Can read and write to different data formats and
data sources including HDFS, Cassandra, S3 and
HBase,
 Can access traditional BI tools using a server mode
that provides standard JDBC and ODBC connectivity,
 The DataFrame API provides a pluggable mechanism
to access structured data using Spark SQL,
 API provides tight optimization integration, thus
enhancing the speed of the Spark jobs that process
vast amounts of data
14
Spark History

15
Spark History
 Apache Spark started as a research project at the UC Berkeley AMPLab in
2009,
 Apache Spark was open sourced in early 2010, and transferred to Apache
in 2013,
 Top-level Apache project today; the project is developed collaboratively by a community of hundreds of developers from hundreds of organizations.
 2015 was the year of superstardom for Spark (source:
http://fortune.com/2015/09/25/apache-spark-survey/)
 Many of the ideas behind the system were presented in various research
papers over the years,
 Motivated by MapReduce and the need to apply machine learning in a
scalable fashion,

16
Use cases of Spark

17
Use cases of Spark
 Apache Spark use cases in e-commerce
Industry

 It helps with information about real-time transactions, which is passed to streaming clustering algorithms such as alternating least squares or the K-means clustering algorithm. It also helps to enhance recommendations to customers based on new trends. Real-world examples include Alibaba and eBay, which use Spark in e-commerce.

18
Use cases of Spark
 Apache Spark use cases in Healthcare
 Analysis of patient records along with past clinical data.
 It helps to identify which patients are likely to face health
issues later on.
 This step helps prevent hospital re-admittance, since it is now possible to deploy home services to the identified patients. It also saves costs for both hospitals and patients.
 Spark is used in genomic sequencing.

19
Use cases of Spark
 Apache Spark use cases in Media &
Entertainment Industry
 In gaming, Spark is used to identify patterns from real-time in-game events and to respond to them in order to harvest lucrative business opportunities, for example targeted advertising, auto-adjustment of gaming level complexity, player retention, etc.
 Some video sharing websites use Spark along with MongoDB to show relevant advertisements to their users based on the videos they view, share and browse. Some companies using Spark this way are Yahoo, Netflix, Pinterest, Conviva, etc.

20
Use cases of Spark
 Apache Spark use cases in Travel Industry
 It helps users to plan a perfect trip by speeding up personalized recommendations.
 Travel companies also use it to provide advice to travelers by comparing many websites to find the best hotel prices.
 Hotel reviews are also processed into a readable format using Spark.
 Some apps use Spark to provide a platform for real-time online reservations, managing large numbers of restaurants and dinner reservations at the same time.

21
Use cases of Spark
 Fraud detection: Spark streaming and Machine Learning
applied to prevent fraud
 Network Intrusion detection: Machine Learning applied to
detect cyber hacks
 Customer segmentation and personalization: Spark SQL and
Machine Learning applied to maximize customer lifetime value
 Social media sentiment analysis: Spark Streaming, Spark SQL and a wrapper for Stanford’s CoreNLP help achieve sentiment analysis
 Real-time ad targeting: Spark used to maximize online ad
revenues
 Predictive healthcare: Spark used to optimize healthcare costs

22
Use cases of Spark

23
Spark at Uber
 ”It’s fundamentally a data problem” - Head of Data at Uber,
Aaron Schildkrout
 Business Problem: A simple problem of getting people around a city with an army of more than 100,000 drivers, and of using data to intelligently perfect the business in an automated and real-time way.
 Accurately paying drivers as per the trips data set
• Maximize profits by positioning cars optimally
• Help drivers avoid accidents
• Calculate surge pricing
 Solution: Use Spark Streaming and Spark SQL as the ETL system, Spark MLlib and GraphX for advanced analytics
 Reference: Talk by Uber engineers at Apache Spark meetup
https://www.youtube.com/watch?v=zKbds9ZPjLE

24
Spark at Netflix
 Netflix uses Apache Spark for real-time stream processing to provide
online recommendations to its customers.
 Streaming devices at Netflix send events which capture all member
activities and play a vital role in personalization.
 "There are 33 million different versions of Netflix.” – Joris Evers,
Director of Global Communications
 It processes 450 billion events per day which flow to server side
applications and are directed to Apache Kafka.
 Business Problem: A video streaming service with emphasis on data
quality, agility and availability. Using analytics to help users discover
movies and shows that they like is key to Netflix’s success.
• Streaming applications are long running tasks that need to be resilient in cloud
deployments,
• Optimize content buying
• Renowned personalization algorithms
 Solution: Use Spark Streaming in the AWS cloud, Spark GraphX for the recommender system
 Reference: Talk by Netflix engineers at Apache Spark meetup
https://www.youtube.com/watch?v=gqgPtcDmLGs
25
Spark at Pinterest
 Pinterest is using Apache Spark to discover trends in high-value user engagement data so that it can react to developing trends in real time by getting an in-depth understanding of user behaviour on the website.
 ”Data driven decision making is in our company DNA” - Head of
Data Engineering at Pinterest, Krishna Gade
 Business Problem: To provide a recommendation and visual
bookmarking tool that lets users discover, save and share ideas and to
get an immediate view of Pinterest engagement activity with high
throughput and minimal latency.
• Real time analytics to process user’s activity
• Process petabytes of data to provide recommendations and personalization
• Apply sophisticated deep learning techniques to a Pin image to suggest related Pins
 Solution: Use Spark Streaming, Spark SQL, and MemSQL’s Spark connector for real-time analytics, Spark MLlib for machine learning use cases
 Reference: https://engineering.pinterest.com/blog/real-time-analytics-pinterest
26
Spark at ADAM, Big Data
Genomics project
 “And that’s the promise of precision medicine -- delivering the
right treatments, at the right time, every time to the right
person.“ (President Obama, 2015 State of the Union address)
 Business Problem: ADAM is a genomics analysis platform providing
large scale analyses to support population based genomics study that
is essential for precision medicine.
• Parallelize genomics analysis in the cloud
• Replace developing custom distributed computing code
• Support for file formats well suited to genomic data like Apache Parquet and Avro
• Support for languages like R which are popular in the genomics community
 Solution: Use Spark on Amazon EMR
 Reference: https://github.com/bigdatagenomics/adam,
http://bdgenomics.org/
https://blogs.aws.amazon.com/bigdata/post/Tx1GE3J0NATVJ39/Will-Spark-Power-the-Data-behind-Precision-Medicine

27
Spark at Yahoo
 ”CaffeOnSpark Open Sourced for Distributed Deep Learning on Big
Data Clusters” – Andy Feng, VP Architecture at Yahoo
 Business Problem: Deep learning is critical for Yahoo’s product teams
to acquire intelligence from huge amounts of online data. Examples
are image recognition and speech recognition for improved search on
photo sharing service Flickr.

• Run deep learning software on existing infrastructure
• Distribute deep learning processes across multiple big data clusters
• Handle potential system failures on long-running deep learning jobs

 Solution: Create a way to run the deep learning system Caffe on Spark
 Reference: HUG Meetup Apr 2016: CaffeOnSpark: Distributed Deep
Learning on Spark Clusters
https://www.youtube.com/watch?v=bqj7nML-aHk

28
Spark Components

29
Spark Components
Apache Spark Core

 All the functionality provided by Apache Spark is built on top of Spark Core.
 It delivers speed by providing in-memory computation capability.
 Spark Core is the foundation for parallel and distributed processing of huge datasets.

30
Spark Components
The key features of Apache Spark Core are:

 It is in charge of essential I/O functionalities.
 It plays a significant role in programming and monitoring the Spark cluster.
 Task dispatching.
 Fault recovery.
 It overcomes the snag of MapReduce by using in-memory
computation.

31
Spark Components
 Spark Core is embedded with a special collection
called RDD (resilient distributed dataset). RDD is among the
abstractions of Spark. Spark RDD handles partitioning data
across all the nodes in a cluster. It holds them in the memory
pool of the cluster as a single unit. There are two operations
performed on RDDs: Transformation and Action-
 Transformation: a function that produces a new RDD from the existing RDDs.
 Action: transformations create RDDs from each other, but when we want to work with the actual dataset, we use an Action, which triggers computation and returns a result.
 Refer to these guides to learn more about the Spark RDD Transformations & Actions API and the different ways to create RDDs in Spark.
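As a minimal illustration (assuming the Scala spark-shell, where sc is predefined), a transformation only defines a new RDD, while an action triggers the computation and returns a value to the driver:

val nums = sc.parallelize(1 to 10)    // create an RDD from a local collection
val squares = nums.map(n => n * n)    // transformation: lazily defines a new RDD
val total = squares.reduce(_ + _)     // action: computes and returns 385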
32
Spark Components
Apache Spark SQL
 The Spark SQL component is a distributed framework for structured data
processing.
 Spark SQL works to access structured and semi-structured information.
 It also enables powerful, interactive, analytical application across both
streaming and historical data.
 Spark SQL is Spark module for structured data processing. Thus, it acts as
a distributed SQL query engine.

33
Spark Components
Apache Spark SQL: Features of Spark SQL include

 Cost-based optimizer.
 Mid-query fault tolerance: the Spark engine scales to thousands of nodes and multi-hour queries.
 Full compatibility with existing Hive data.
 DataFrames and SQL provide a common way to access a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC.
 Provision to carry structured data inside Spark programs, using either SQL or the familiar DataFrame API.
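An illustrative sketch (assuming the spark-shell, where the SparkSession spark is predefined, and a hypothetical people.json file): the same data can be queried through the DataFrame API or plain SQL.

val people = spark.read.json("people.json")         // load structured data as a DataFrame
people.filter("age > 30").select("name").show()     // DataFrame API
people.createOrReplaceTempView("people")            // register as a temporary SQL view
spark.sql("SELECT name FROM people WHERE age > 30").show()   // same query via SQL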

34
Spark Components
Apache Spark Streaming
 It is an extension of the core Spark API which allows scalable, high-throughput, fault-tolerant stream processing of live data streams.

 Spark can access data from sources like Kafka, Flume, Kinesis or TCP
socket.

 It can operate using various algorithms. Finally, the received data is delivered to file systems, databases and live dashboards. Spark uses micro-batching for real-time streaming.

 Micro-batching is a technique that allows a process or task to treat a stream as a sequence of small batches of data. Hence Spark Streaming groups the live data into small batches and delivers them to the batch system for processing. It also provides fault-tolerance characteristics.
35
Spark Components
Apache Spark Streaming: Spark Streaming Works on 3 phases

 1. GATHERING
Spark Streaming provides two categories of built-in streaming sources:
 Basic sources: These are the sources which are available in the StreamingContext API. Examples: file systems and socket connections.
 Advanced sources: Sources like Kafka, Flume, Kinesis, etc. are available through extra utility classes. Hence Spark accesses data from different sources like Kafka, Flume, Kinesis, or TCP sockets.
 2. PROCESSING
The gathered data is processed using complex algorithms expressed with high-level functions such as map, reduce, join and window.
 3. DATA STORAGE

 The Processed data is pushed out to file systems, databases, and live dashboards.
 Spark Streaming also provides high-level abstraction. It is known as discretized stream or
DStream.
DStream in Spark signifies a continuous stream of data. We can form a DStream in two ways: either from sources such as Kafka, Flume, and Kinesis, or by high-level operations on other DStreams. Thus, a DStream is internally a sequence of RDDs.
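A minimal sketch of the three phases (assuming a text server on localhost port 9999, e.g. started with nc -lk 9999), which prints word counts for each 5-second micro-batch:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(5))               // micro-batch interval
val lines = ssc.socketTextStream("localhost", 9999)            // 1. gathering (basic source)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)  // 2. processing
counts.print()                                                 // 3. data storage / output
ssc.start()
ssc.awaitTermination()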
36
Spark Components
Apache Spark MLlib (Machine Learning
Library)
MLlib in Spark is a scalable Machine Learning library that delivers both high-quality algorithms and high speed.
 Goal is to make practical Machine Learning scalable and easy.
 Includes common machine Learning algorithms like classification,
regression, clustering, collaborative filtering.
 Provides utilities for feature extraction, transformation,
dimensionality reduction and selection.
 Provides tools for constructing Machine Learning pipelines, evaluating
and tuning them.
 Supports persistence of models and pipelines.
 Includes convenient utilities for linear algebra, statistics, data
handling etc.
37
Spark Components
Apache Spark MLlib (Machine Learning
Library): Spark MLlib in 2.0

 Spark 2.0 declared the DataFrame-based API in the spark.ml package as the main Machine Learning library going forward.

 The original RDD-based Machine Learning library in the spark.mllib package has entered maintenance mode and will be deprecated by Spark 2.2 and removed in Spark 3.0.

38
Spark Components
Apache Spark MLlib (Machine Learning
Library): Example use cases of Spark MLlib
 Fraud detection: Spark streaming and Machine Learning applied to
prevent fraud,
 Network Intrusion detection: Machine Learning applied to detect
cyber hacks,
 Customer segmentation and personalization: Spark SQL and Machine
Learning applied to maximize customer lifetime value,
 Real-time ad targeting: Spark used to maximize online ad revenues,
 Predictive healthcare: Spark used to optimize healthcare costs,
 Genomics analysis to provide precision medicine;

39
Spark Components
Spark ML Pipelines
 Provide high level API that help users create and tune practical
machine learning pipelines.
 Allows to combine multiple machine learning algorithms and utilities
into a single pipeline
Key concepts in the Pipeline API:
• DataFrame: Is the ML dataset and can hold a variety of data types
• Transformer: Is an algorithm which can transform one DataFrame into
another DataFrame.
• Estimator: Is an algorithm which can be fit on a DataFrame to produce
a Transformer.
• Pipeline: A Pipeline chains multiple Transformers and Estimators
together to specify an ML workflow.
• Parameter: This API allows specifying parameters on all Transformers
and Estimators
40
Spark Components
Spark ML Pipeline: DataFrames
• Machine Learning can be applied to a wide variety of data types such as images, text, audio clips, and numerical data. The DataFrame is the ML dataset and can hold a variety of data types
• The DataFrame API supports a variety of data types and hence is suitable as the ML dataset
• Conceptually equivalent to a table in a relational database or a data frame in R/Python
• Columns in a DataFrame are named
• Can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs
• It contains richer optimizations under the hood for better performance

41
Spark Components
Spark ML Pipeline: Transformer

• Is an algorithm that transforms one DataFrame into another, generally by appending one or more columns
• Can be a feature transformer that converts a column in a DataFrame to another type and appends the new column
• Can be a learning model that reads the features in the DataFrame, makes predictions and appends the predicted label to the DataFrame
• Implements the transform() method to perform the transformation

42
Spark Components
Spark ML Pipeline: Estimator

• Abstracts the concept of a learning algorithm that trains on data
• Implements the method fit() which accepts a DataFrame and produces a Model which is a Transformer
• For example, a learning algorithm like LogisticRegression is an Estimator and calling fit() trains a LogisticRegressionModel which is a Model and a Transformer

43
Spark Components
Spark ML Pipeline: Pipeline

• Combines multiple Transformers and Estimators into a pipeline
• A Pipeline is specified as a sequence of stages and each stage is a Transformer or an Estimator
• Stages are specified as an ordered array
• In a linear pipeline, each stage uses data produced by the previous stage
• A non-linear pipeline is valid as long as it forms a Directed Acyclic Graph (DAG). The graph is specified based on the input and output column names of each stage
• Pipelines help ensure that training and test data go through identical feature extraction steps

44
Spark Components
Spark ML Pipeline: Parameters

• Parameters can be specified on Estimators and Transformers for tuning the algorithms and models
• Example: if lr is an instance of LogisticRegression, call lr.setMaxIter(10) to make lr.fit() use at most 10 iterations
• A Param is a named parameter while a ParamMap is a set of (parameter, value) pairs
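A small sketch of both styles (assuming lr is a LogisticRegression Estimator and training is a labelled DataFrame, both hypothetical here):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap

val lr = new LogisticRegression()
lr.setMaxIter(10)                                               // set directly on the Estimator
val paramMap = ParamMap(lr.maxIter -> 20, lr.regParam -> 0.01)  // a set of (parameter, value) pairs
// val model = lr.fit(training, paramMap)                       // the ParamMap overrides the defaults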

45
Spark Components
Spark ML Pipeline Example
 A text document classification pipeline has the following workflow:
 Training flow: Input is a set of text documents where each document is labelled. Stages while training the Machine Learning model are:
• Split each text document into words
• Convert each document’s words into a numerical feature vector
• Learn a prediction model using the feature vectors and labels
 Test/Predict flow: Input is a set of text documents and the goal is to predict a label for each document. Stages while testing or making predictions with the Machine Learning model are:
• Split each text document into words
• Convert each document’s words into a numerical feature vector
• Use the trained model to make predictions on the feature vectors
46
Spark Components
Spark ML Pipeline Training Time
 Raw data is processed through the Tokenizer and HashingTF and then
fed into the LogisticRegression algorithm to create a
LogisticRegression Model
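A sketch of this training flow with the standard Tokenizer / HashingTF / LogisticRegression stages (the training DataFrame with "id", "text" and "label" columns is assumed):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")      // split documents into words
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")  // words -> numerical feature vector
val lr = new LogisticRegression().setMaxIter(10)                               // learn a prediction model
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

val model = pipeline.fit(training)          // training time: fits all stages, returns a PipelineModel
// val predictions = model.transform(test)  // testing time: same feature extraction, then prediction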

47
Spark Components
Spark ML Pipeline Testing Time
 The raw data on which the predictions need to be made is passed through the same Tokenizer and HashingTF for feature extraction and then passed through the LogisticRegression Model to make predictions.

48
Spark Components
Spark ML Tuning
 An important task in ML is model selection: using data to find the best model or parameters for making predictions.
 Tuning can be done on individual Estimators and on entire pipelines
 Tools such as CrossValidator and TrainValidationSplit are used for tuning
 At a high level, these model selection tools work as follows:
 They split the input data into separate training and test datasets.
 For each (training, test) pair, and for each set of ParamMaps, they evaluate the model’s performance using the Evaluator
 They select the Model produced by the best-performing set of parameters.
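A minimal CrossValidator sketch, reusing the pipeline and lr from the earlier pipeline sketch and the assumed training DataFrame:

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))        // the ParamMaps to evaluate
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)                        // a single Estimator or a whole Pipeline
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)                                // 3 (training, test) splits

val cvModel = cv.fit(training)                   // Model from the best-performing ParamMap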

49
Spark Components
Spark Streaming
 Extension of the core Spark API that enables scalable, high-
throughput, fault-tolerant stream processing of live data streams
 Data can be ingested from many sources like Kafka, Flume, Twitter,
ZeroMQ, Kinesis, or TCP sockets
 Data can further be processed using MLlib, Graph processing or high
level functions like map, reduce, join, window etc
 Processed data can be pushed out to file systems, databases and live
dashboards

50
Spark Components
Internal working of Spark Streaming:
 Spark streaming receives live input data streams and divides data into
batches
 Batches of data are processed by the Spark engine to produce batches
of results
 High level API called DStream(discretized stream) represents
continuous stream of data
 Internally, a DStream is represented as a sequence of RDDs
 DStreams can be created either from input data streams from sources
such as Kafka, Flume, and Kinesis, or by applying high-level
operations on other DStreams.

51
Spark Components
Spark GraphX
GraphX is Apache Spark's API for graphs and graph-parallel computation.
 Graph processing library with APIs to manipulate graphs and
performing graph-parallel computations
 Extends the Spark RDD API by introducing a new Graph abstraction, a directed multigraph with properties attached to each vertex and edge
 Like RDDs, property graphs are immutable, distributed, and fault-
tolerant.
 Provides various operators for manipulating graphs (e.g., subgraph
and mapVertices)
 Provides library of common graph algorithms (e.g., PageRank and
triangle counting)
 Growing collection of algorithms and builders to simplify graph
analytics tasks
52
Spark Components
Spark GraphX
Flexibility
 Seamlessly work with both graphs and collections.
 GraphX unifies ETL, exploratory analysis, and iterative graph
computation within a single system. You can view the same data as
both graphs and collections, transform and join graphs with RDDs
efficiently, and write custom iterative graph algorithms using
the Pregel API.

53
Spark Components
 Spark GraphX
Speed
 Comparable performance to the fastest specialized graph processing
systems.
 GraphX competes on performance with the fastest graph systems
while retaining Spark's flexibility, fault tolerance, and ease of use.

54
Spark Components
Spark GraphX
Algorithms
 Choose from a growing library of graph algorithms.
 In addition to a highly flexible API, GraphX comes with a variety of
graph algorithms, many of which were contributed by our users.

55
Spark Components
Spark GraphX
Community
 GraphX is developed as part of the Apache Spark project. It thus gets
tested and updated with each Spark release.

 GraphX is in the alpha stage and welcomes contributions.

56
Spark Components
Spark GraphX Example:
 Define vertex array and edge array
 Construct RDDs from vertex and edge arrays
 Build a Property graph using RDD of vertices and edges
 Perform filter operation on graph. Example: list users at least 30
years of age
 Use the graph.triplets view to display relationships in the graph
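A sketch of the steps listed above (assuming the spark-shell, with hypothetical user data):

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

// 1. Define a vertex array (id, (name, age)) and an edge array (follower relationships)
val vertexArray = Array((1L, ("Alice", 28)), (2L, ("Bob", 34)), (3L, ("Charlie", 65)))
val edgeArray = Array(Edge(2L, 1L, 7), Edge(3L, 2L, 4))

// 2. Construct RDDs from the vertex and edge arrays
val vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(vertexArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)

// 3. Build a property graph from the two RDDs
val graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD)

// 4. Filter operation: list users at least 30 years of age
graph.vertices.filter { case (_, (name, age)) => age >= 30 }
  .collect.foreach { case (_, (name, age)) => println(s"$name is $age") }

// 5. Use the triplets view to display relationships
graph.triplets.collect.foreach(t => println(s"${t.srcAttr._1} follows ${t.dstAttr._1}"))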

57
Spark Components
Spark R:
 SparkR is an R package that provides a light-weight frontend to use
Apache Spark from R.
 SparkR provides a distributed data frame implementation that
supports operations like selection, filtering, aggregation etc. on large
datasets.
 SparkR also supports distributed machine learning using MLlib.
 SparkR uses the SparkDataFrame API; SparkDataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing local R data frames.
 SparkR shell can be invoked using the ./bin/sparkR command
 Also can connect to SparkR from Rstudio or other R IDEs.

58
Spark Components
Spark R:

 Allows converting R dataframes to SparkDataFrames


 Can read from JSON, Parquet files, Hive and other data sources
 Similar to lapply in native R, spark.lapply runs a function over a list of
elements and distributes the computations with Spark.
 A SparkDataFrame can also be registered as a temporary view in
Spark SQL and that allows you to run SQL queries over its data.
 SparkR supports the following machine learning algorithms currently:
Generalized Linear Model, Accelerated Failure Time (AFT) Survival
Regression Model, Naive Bayes Model and KMeansModel. Under the
hood, SparkR uses MLlib to train the model.

59
Spark Architecture
The Spark Stack:

60
Spark Architecture
Spark Cluster Managers
 Cluster managers allocate resources across applications on a cluster.

Spark currently supports three cluster managers:

• Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster.

• Apache Mesos – a general cluster manager that can also run Hadoop
MapReduce and service applications.

• Hadoop YARN – the resource manager in Hadoop 2.

61
Spark Architecture
Spark Runtime Architecture
• In distributed mode, Spark uses a
master/slave architecture
• The master is called the “driver” and
the slaves are the “executors”
• Drivers and executors run in their own
Java processes
• A driver and its executors put together
form a Spark application
• Spark application is launched using the
cluster manager

62
Spark Architecture
SparkContext

 Main entry point to everything Spark
 Defined in the main/driver program
 Tells Spark how and where to access a cluster
 Connects to cluster managers
 Coordinates Spark processes running on different cluster nodes
 Used to create RDDs and shared variables on the cluster
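A minimal sketch of creating a SparkContext in a driver program (in the spark-shell this is already done for you as sc):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")          // name shown in the cluster UI
  .setMaster("local[*]")        // or a cluster manager URL, e.g. spark://host:7077
val sc = new SparkContext(conf)

val data = sc.parallelize(Seq(1, 2, 3))   // RDDs are created through the SparkContext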

63
Spark Architecture
The Driver
 The process where the main() method of
the Spark program runs
 Responsible for converting a user program
into tasks
 Driver schedules the tasks on executors
 Results from these tasks are delivered
back to the driver

64
Spark Architecture
Executors

 Launched once at the beginning of the application and typically run for the entire lifetime of an application
 Executors register themselves with the
driver, thus allowing the driver to schedule
tasks on the executors
 Worker processes run the individual tasks
and return results to the driver
 Provide in-memory storage for RDDs, as well
as disk storage

65
Spark Architecture
Spark running on clusters

 Application: User program built on Spark; consists of a driver program and executors on the cluster
 Cluster Manager: A pluggable service for acquiring resources on the cluster
 Worker node: Any node that can run application code in the cluster
 Driver program: The process running the main() function of the application and creating the SparkContext
 Executor: A process launched on a worker node that runs tasks
 Task: The smallest unit of work sent to one executor; tasks are bundled into “stages”

66
Spark Architecture
Execution steps in a Spark program
 User submits an application that launches the driver program
 Driver program invokes the main() method specified by the user
 Driver program contacts cluster manager to ask for resources to
launch executors
 Cluster manager launches executors
 Driver program divides the user program into tasks and sends them to
the executors
 Executors run the tasks, compute and save results, and return results to the driver
 When driver’s main() method exits or SparkContext.stop() is called,
executors are terminated and the cluster manager releases the
resources
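A sketch of a minimal standalone application that follows these steps, packaged as a jar and submitted with spark-submit (the class name, jar name and input path are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SimpleApp")   // master is supplied by spark-submit
    val sc = new SparkContext(conf)                      // driver asks the cluster manager for executors
    val counts = sc.textFile(args(0))                    // tasks are sent to the executors
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
    println(counts.count())                              // results come back to the driver
    sc.stop()                                            // executors terminate, resources are released
  }
}
// Submitted e.g. with: spark-submit --class SimpleApp --master spark://host:7077 simple-app.jar input.txt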

67
Spark Architecture
Summary
 Spark core is the base engine in the Spark stack. Spark API libraries
like Spark SQL, MLlib, Spark streaming, GraphX, etc. are built on top
of Spark core and inherit all of its features like fault tolerance and
scalability.
 Spark core also integrates with pluggable cluster managers. In
addition to the standalone cluster manager built within Spark, Spark
core can integrate with Apache MESOS and Hadoop YARN.
 Spark program is made up of a driver program and executor programs.
SparkContext is the entry point to everything Spark and is defined in
the driver program.
 Driver programs divide and parallelize tasks and send them to the executors.
 Executors perform the tasks, saving results to in-memory or disk storage, and return results to the driver when the task is complete.
68
RDD: Resilient Distributed
Datasets
 Resilient Distributed Dataset (aka RDD) is the primary data
abstraction in Apache Spark and the core of Spark (that I often refer
to as "Spark Core").
The origins of RDD
 The original paper that gave birth to the concept of RDD is Resilient
Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory
Cluster Computing by Matei Zaharia, et al.
 A RDD is a resilient and distributed collection of records spread
over one or many partitions.
Note
 One could compare RDDs to collections in Scala, i.e. a RDD is
computed on many JVMs while a Scala collection lives on a single
JVM.

69
RDD: Resilient Distributed
Datasets
 Using RDDs, Spark hides data partitioning and distribution, which in turn allowed its designers to build a parallel computational framework with a higher-level programming interface (API) for four mainstream programming languages.
The features of RDDs (decomposing the name):
 Resilient, i.e. fault-tolerant with the help of RDD lineage graph and
so able to recompute missing or damaged partitions due to node
failures.
 Distributed with data residing on multiple nodes in a cluster.
 Dataset is a collection of partitioned data with primitive values or
values of values, e.g. tuples or other objects (that represent records
of the data you work with).

 A Resilient Distributed Dataset (RDD), the basic abstraction in Spark, represents an immutable, partitioned collection of elements that can be operated on in parallel.
70
RDD: Resilient Distributed
Datasets
 The data abstraction (RDD) has the following additional traits:
 In-Memory, i.e. data inside RDD is stored in memory as much (size)
and long (time) as possible.
 Immutable or Read-Only, i.e. it does not change once created and
can only be transformed using transformations to new RDDs.
 Lazy evaluated, i.e. the data inside RDD is not available or
transformed until an action is executed that triggers the execution.
 Cacheable, i.e. you can hold all the data in a persistent "storage" like
memory (default and the most preferred) or disk (the least preferred
due to access speed).
 Parallel, i.e. process data in parallel.
 Typed — RDD records have types, e.g. Long in RDD[Long] or (Int,
String) in RDD[(Int, String)].
 Partitioned — records are partitioned (split into logical partitions) and
distributed across nodes in a cluster.
 Location-Stickiness — RDD can define placement preferences to compute partitions (as close to the records as possible).
71
RDD: Resilient Distributed
Datasets
RDDs support two kinds of operations:
 transformations - lazy operations that return another RDD.
 actions - operations that trigger computation and return values.
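A minimal sketch of the two kinds (spark-shell, with a hypothetical data.txt): transformations are only recorded, the action triggers the computation, and cache() keeps the RDD in memory for reuse.

val lines = sc.textFile("data.txt")        // transformation: nothing is read yet
val lengths = lines.map(_.length).cache()  // still lazy; marked for in-memory caching
val total = lengths.reduce(_ + _)          // action: triggers the actual computation
val longest = lengths.max()                // second action reuses the cached RDD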

72
RDD: Resilient Distributed
Datasets
Types of RDDs:
There are some of the most interesting types of RDDs:
 ParallelCollectionRDD
 CoGroupedRDD
 HadoopRDD is an RDD that provides core functionality for reading data
stored in HDFS using the older MapReduce API. The most notable use
case is the return RDD of SparkContext.textFile.
 MapPartitionsRDD - a result of calling operations
like map, flatMap, filter, mapPartitions, etc.
 CoalescedRDD - a result of repartition or coalesce transformations.
 ShuffledRDD - a result of shuffling, e.g. after repartition or
coalesce transformations

73
RDD: Resilient Distributed
Datasets
Types of RDDs:
PipedRDD - an RDD created by piping elements to a forked external
process.
 PairRDD (implicit conversion by PairRDDFunctions) that is an RDD of
key-value pairs that is a result of groupByKey and join operations.
 DoubleRDD (implicit conversion
as org.apache.spark.rdd.DoubleRDDFunctions) that is an RDD
of Double type.
 SequenceFileRDD (implicit conversion
as org.apache.spark.rdd.SequenceFileRDDFunctions) that is an RDD
that can be saved as a SequenceFile.
 Appropriate operations of a given RDD type are automatically
available on a RDD of the right type, e.g. RDD[(Int, Int)], through
implicit conversion in Scala.

74
RDD: Resilient Distributed
Datasets
Transformations
 Transformations are lazy operations on an RDD that create one or many new RDDs, e.g. map, filter, reduceByKey, join, cogroup, randomSplit.

transformation: RDD => RDD
transformation: RDD => Seq[RDD]

75
RDD: Resilient Distributed
Datasets
Transformations
 There are two kinds of transformations:
 Narrow transformations
 Wide transformations

76
RDD: Resilient Distributed
Datasets
Narrow transformations
 Narrow transformations are the result of operations such as map and filter, where the data comes from a single partition only, i.e. it is self-sustained.
 An output RDD has partitions with records that originate from a single partition in the parent RDD. Only a limited subset of partitions is used to calculate the result.
 Spark groups narrow transformations into a stage, which is called pipelining.

77
RDD: Resilient Distributed
Datasets
Wide Transformations
 Wide transformations are the result of groupByKey and reduceByKey.
The data required to compute the records in a single partition may
reside in many partitions of the parent RDD.
 Note
 Wide transformations are also called shuffle transformations as they
may or may not depend on a shuffle.
 All of the tuples with the same key must end up in the same partition,
processed by the same task. To satisfy these operations, Spark must
execute RDD shuffle, which transfers data across cluster and results in
a new stage with a new set of partitions.
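A minimal sketch of the difference (spark-shell):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val doubled = pairs.mapValues(_ * 2)    // narrow: each output partition depends on one parent partition
val sums = doubled.reduceByKey(_ + _)   // wide: shuffles so that equal keys end up in the same partition
sums.collect()                          // Array((a,8), (b,4)) — order may vary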

78
RDD: Resilient Distributed
Datasets
Actions
An action is an operation that triggers execution of RDD
transformations and returns a value (to a Spark driver - the user
program).
 Actions are RDD operations that produce non-RDD values. They
materialize a value in a Spark program. In other words, a RDD
operation that returns a value of any type but RDD[T] is an action.
action: RDD => a value

 Actions are one of two ways to send data from executors to the driver (the other being accumulators).

79
RDD: Resilient Distributed
Datasets
Actions
Actions in org.apache.spark.rdd.RDD:
 aggregate
 collect
 count
 countApprox*
 countByValue*
 first
 fold
 foreach
 foreachPartition
 max
 min
 reduce
 saveAs* actions, e.g. saveAsTextFile, saveAsHadoopFile
 take
 takeOrdered
 takeSample
 toLocalIterator
 top
 treeAggregate
 treeReduce
80
RDD: Resilient Distributed
Datasets
Actions
Actions run jobs using SparkContext.runJob or
directly DAGScheduler.runJob.
scala> words.count
res0: Long = 502

81
RDD: Resilient Distributed
Datasets
Actions
AsyncRDDActions
AsyncRDDActions class offers asynchronous actions that you can use on
RDDs (thanks to the implicit conversion rddToAsyncRDDActions in RDD
class). The methods return a FutureAction.
The following asynchronous methods are available:
 countAsync
 collectAsync
 takeAsync
 foreachAsync
 foreachPartitionAsync
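A small sketch (spark-shell): countAsync submits the job and returns a FutureAction immediately instead of blocking.

import scala.concurrent.ExecutionContext.Implicits.global

val rdd = sc.parallelize(1 to 1000)
val future = rdd.countAsync()                          // returns a FutureAction[Long] immediately
future.foreach(count => println(s"count = $count"))    // runs when the job finishes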

82