Unit 5

Apache Spark

Dr. K. Venkateswara Rao


Professor, CSE
CVR College of Engineering
Hadoop 2
Apache Spark
• Apache Spark is a cluster computing framework for large-scale
data processing.
• Originally developed in 2009 in UC Berkeley’s AMP Lab.
• Fully open sourced in 2010 – now a Top Level Project at the
Apache Software Foundation.
• Spark does not use MapReduce as an execution engine
• Spark is closely integrated with Hadoop: it can run on YARN and
works with Hadoop file formats and storage backends like HDFS.
• Spark is best known for its ability to keep large working datasets
(RDDs) in memory between jobs.
• Two styles of application that benefit greatly from Spark’s
processing model are
1. iterative algorithms
2. interactive analysis
Apache Spark
• Provide distributed memory abstractions for clusters to
support applications with working data sets
• Spark provides APIs in three languages: Scala, Java, and
Python.
• Retain the attractive properties of MapReduce:
➢ Fault tolerance (for crashes & stragglers)
➢ Data locality
➢ Scalability

Solution: augment the data flow model with “resilient distributed datasets” (RDDs)
Key Concept: RDDs
Write programs in terms of operations on distributed datasets.

Resilient Distributed Datasets
• Collections of objects spread across a cluster, stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure

Operations
• Transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save)
Spark Uses Memory instead of Disk

Hadoop: Use Disk for Data Sharing
• Each iteration reads its input from HDFS and writes its output back to HDFS
  (HDFS read → Iteration 1 → HDFS write → HDFS read → Iteration 2 → HDFS write).

Spark: In-Memory Data Sharing
• Data is read from HDFS once; intermediate results are kept in memory between
  iterations (HDFS read → Iteration 1 → memory → Iteration 2).
Iterative operations on MapReduce
Iterative operations on Spark RDD
Interactive operations on MapReduce
Interactive operations on Spark RDD
Sort competition
                          Hadoop MR (2013 record)         Spark (2014 record)
Data size                 102.5 TB                        100 TB
Elapsed time              72 mins                         23 mins
# Nodes                   2,100                           206
# Cores                   50,400 (physical)               6,592 (virtualized)
Cluster disk throughput   3,150 GB/s (est.)               618 GB/s
Network                   dedicated data center, 10 Gbps  virtualized (EC2), 10 Gbps
Sort rate                 1.42 TB/min                     4.27 TB/min
Sort rate/node            0.67 GB/min                     20.7 GB/min

Spark sorted the data 3x faster with 1/10 the number of nodes.
Apache Spark Architecture
Apache Spark supports data analysis, machine learning, graph data
processing, and streaming data analytics. It can read/write from a range
of data types and allows development in multiple languages.

Languages:    Scala, Java, Python, R, SQL
Libraries:    Spark SQL (DataFrames), Spark Streaming, MLlib (ML Pipelines), GraphX
Engine:       Spark Core
Data sources: Hadoop HDFS, HBase, Hive, Amazon S3, streaming sources, JSON,
              MySQL, and HPC-style file systems (GlusterFS, Lustre)
Apache Spark Components
• Apache Spark Core
– Spark Core is the underlying general execution engine for spark
platform that all other functionality is built upon.
– It provides in-memory computing and the ability to reference
datasets in external storage systems.
– Spark Core provides distributed task dispatching, scheduling,
and basic I/O functionalities.
– Spark uses a specialized fundamental data structure known as
RDD (Resilient Distributed Datasets) that is a logical collection
of data partitioned across machines.
– RDDs can be created in two ways;
• by referencing datasets in external storage systems
• by applying transformations (e.g. map, filter, reduce, join)
on existing RDDs.
Apache Spark Components
• Spark SQL
– Spark SQL is a component on top of Spark Core that
introduces a new data abstraction called SchemaRDD,
which provides support for structured and semi-
structured data.
• Spark Streaming
– Spark Streaming leverages Spark Core's fast scheduling
capability to perform streaming analytics.
– It ingests data in mini-batches and performs RDD
(Resilient Distributed Datasets) transformations on
those mini-batches of data.
Apache Spark Components
• MLlib (Machine Learning Library)
– MLlib is a distributed machine learning framework above Spark
core in the distributed memory-based Spark architecture
– Spark MLlib is nine times as fast as the Hadoop disk-based
version of Apache Mahout (before Mahout gained a Spark
interface).
• GraphX
– GraphX is a distributed graph-processing framework on top of
Spark. It provides an API for expressing graph computation that
can model the user-defined graphs by using Pregel abstraction
API. It also provides an optimized runtime for this abstraction.
– Pregel: a system for large-scale graph processing
DataFrames
• A DataFrame is the most common Structured API and simply represents
a table of data with rows and columns. The list that defines the
columns and the types within those columns is called the schema.
• Think of a DataFrame as a spreadsheet with named columns. The only
difference is that a spreadsheet sits on one computer in one specific
location, whereas a DataFrame can span thousands of computers.
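• As a rough illustration, a DataFrame with a named-column schema can be built
  from a local collection via the SparkSession (the column names and data below
  are invented for the example):

    // spark is the SparkSession (created automatically in spark-shell)
    import spark.implicits._

    val people = Seq(("alice", 34), ("bob", 28)).toDF("name", "age")
    people.printSchema()              // the schema: name (string), age (int)
    people.filter($"age" > 30).show() // rows and columns, like a spreadsheet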
Spark’s APIs
• Spark has two fundamental sets of APIs:
– Higher-level structured APIs, and
– Low-level “unstructured” APIs
• The higher level Structured APIs are a tool for manipulating all sorts of data,
from unstructured log files to semi-structured CSV files and highly structured
Parquet files. These APIs refer to three core types of distributed collection APIs:
1. Datasets
2. DataFrames
3. SQL tables and views
• There are times when higher-level manipulation will not meet the needs of the
business or engineering problem to be solved. For those cases, one might need to use
Spark’s lower-level APIs, specifically the Resilient Distributed Dataset (RDD), the
SparkContext, and distributed shared variables like accumulators and broadcast
variables.
• There are two sets of low-level APIs: one for manipulating distributed data
(RDDs), and another for distributing and manipulating distributed shared
variables
Spark’s Language APIs
• Spark’s language APIs make it possible to run Spark code using various programming
languages (Scala, Java, Python, SQL, R). For the most part, Spark presents some core
“concepts” in every language; these concepts are then translated into Spark code that
runs on the cluster of machines.
1. Scala
➢ Spark is primarily written in Scala, making it Spark’s “default” language.
2. Java
➢ Even though Spark is written in Scala, Spark’s authors have been careful to ensure
that developers can write Spark code in Java.
3. Python
➢ Python supports nearly all constructs that Scala supports
4. SQL
➢ Spark supports a subset of the ANSI SQL 2003 standard. This makes it easy for
analysts and nonprogrammers to take advantage of the big data powers of Spark.
5. R
➢ Spark has two commonly used R libraries: one as a part of Spark core (SparkR) and
another as an R community-driven package (sparklyr).
Installing Spark
• Download a stable release of the Spark binary
distribution from the downloads page and unpack
the tarball in a suitable location:
% tar xzf spark-x.y.z-bin-distro.tgz
• Put the Spark binaries on your path as follows:
% export SPARK_HOME=~/sw/spark-x.y.z-bin-distro
% export PATH=$PATH:$SPARK_HOME/bin
• Start up the shell with the following:
% spark-shell
• Spark context available as sc.
• scala>
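• A quick sanity check inside the shell, using the pre-created sc (the numbers
  are purely illustrative):

    scala> val rdd = sc.parallelize(1 to 100)
    scala> rdd.count()   // action: returns 100
    scala> rdd.sum()     // action: returns 5050.0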
Spark Context
• SparkContext is the entry point to the Spark environment.
• Every Spark application needs to create a SparkContext
object.
• org.apache.spark provides
– Class SparkContext
• SparkContext is the main entry point for Spark functionality.
• A SparkContext represents the connection to a Spark
cluster.
• It allows the Spark application to access the Spark cluster
with the help of a resource manager.
• The resource manager can be one of these three: Spark
Standalone, YARN, or Apache Mesos.
Drivers and Executors
• At a high level, every Spark application consists of a
driver program that launches various parallel operations
on a cluster.
• A typical driver program could be the Spark shell itself.
• The driver program accesses Spark through a SparkContext
object, which represents a connection to a computing
cluster.
• In the shell, a SparkContext is automatically created as
the variable called ‘sc’.
• To run the parallel operations, driver programs typically
manage a number of nodes called executors.
Spark Applications: Driver and Executors
• Spark Applications consist of a driver process and a set of executor
processes. The driver process runs main() function, sits on a node in
the cluster, and is responsible for three things:
1. maintaining information about the Spark Application;
2. responding to a user’s program or input; and
3. analyzing, distributing, and scheduling work across the executors
• The driver process is the heart of a Spark Application and maintains
all relevant information during the lifetime of the application.
• The executors are responsible for actually carrying out the work that
the driver assigns them. This means that each executor is responsible
for only two things:
1. executing code assigned to it by the driver, and
2. reporting the state of the computation on that executor back to
the driver node.
Spark Execution Model
• Suppose that you want to execute application A1 with Spark. Spark
will create one driver node and some executors to execute
application A1.
• If you want to execute application A2, the same process will be
followed for A2.
Creating Spark Context
• To create SparkContext, first SparkConf should be made.
• SparkConf is the class that enables you to provide configuration
parameters.
• public class SparkConf extends java.lang.Object
• SparkConf is in org.apache.spark.SparkConf
• SparkConf is used to set various Spark configuration parameters
as key-value pairs.
• Some of these parameters define properties of the Spark driver
application, while others are used by Spark to allocate resources
on the cluster, such as the number of executors and the memory size
and cores used by the executors running on the worker nodes.
Creating Spark Context
• Most of the time, you would create a SparkConf object with new
SparkConf().
• All setter methods in this class support chaining. For example, you
can write new SparkConf().setMaster("local").setAppName("test").
    val conf = new SparkConf().setMaster("local[*]").setAppName("test")
    val sc = new SparkContext(conf)
• The spark configuration is passed to spark context by Spark driver
application.
• Note that once a SparkConf object is passed to Spark, it can not be
modified by the user. Spark does not support modifying the
configuration at runtime.
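• Putting this together, a minimal self-contained driver skeleton might look as
  follows (the application name, master URL, and computation are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    object TestApp {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[*]").setAppName("test")
        val sc = new SparkContext(conf)

        val data = sc.parallelize(1 to 10)
        println(data.reduce(_ + _))   // prints 55

        sc.stop()                     // release resources when done
      }
    }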
Role of Spark Context
• After the creation of a SparkContext object, functions
such as textFile, sequenceFile, parallelize etc., can be
invoked.
• The different contexts in which it can run are local, yarn-
client, Mesos URL and Spark URL.
• Once the SparkContext is created, it can be used
to create RDDs, broadcast variables, and accumulators,
to access Spark services, and to run jobs.
• All these things can be carried out until SparkContext is
stopped.
sc.stop()
• Only one SparkContext may be active per JVM.
Spark Applications, Jobs, Stages, and Tasks
• Application: one spark submit.
• Spark has the concept of a job.
• Job: A piece of code which reads some input from HDFS or local,
performs some computation on the data and writes some output
data.
• A Spark job is more general than a MapReduce job, though, since it
is made up of an arbitrary directed acyclic graph (DAG) of stages.
• Stages: Jobs are divided into stages. Stages are divided based on
computational boundaries.
• Each stage is roughly equivalent to a map or reduce phase in
MapReduce.
• Stages are split into tasks by the Spark runtime and are run in
parallel on partitions of an RDD spread across the cluster—just like
tasks in MapReduce.
Spark Application
• Each Application divided into one or
more Jobs
• Each Job is divided into one or more
stages
• Each stage is split into one or more tasks.
• One task is executed on one executor.
• Driver is responsible for running the Job.
• Job runs in the application context.
Spark Applications, Jobs, Stages, and Tasks
• One task is executed on one partition of data by an executor.
• Executor: The process responsible for executing a task.
• Driver: The program/process responsible for running the Job
over the Spark Engine.
• A job always runs in the context of an application (represented
by a SparkContext instance)
• An application can run more than one job, in series or in
parallel, and provides the mechanism for a job to access an
RDD that was cached by a previous job in the same
application.
• An interactive Spark session, such as a spark-shell session, is
just an instance of an application.
Programming Model
• Resilient distributed datasets (RDDs)
–Immutable collections partitioned across cluster
that can be rebuilt if a partition is lost
–Created by transforming data in stable storage
using data flow operators (map, filter, group-
by, …)
–Can be cached across parallel operations
• Parallel operations on RDDs
–Reduce, collect, count, save, …
• Restricted shared variables
–Accumulators, broadcast variables
Example: Log Mining
•Load error messages from a log into memory, then
interactively search for various patterns
    lines = spark.textFile("hdfs://...")
    errors = lines.filter(_.startsWith("ERROR"))
    messages = errors.map(_.split('\t')(2))
    cachedMsgs = messages.cache()

    cachedMsgs.filter(_.contains("foo")).count
    cachedMsgs.filter(_.contains("bar")).count
    . . .

(Diagram: the driver ships tasks to workers; each worker reads a block of the
log, builds the base and transformed RDDs, and caches its partition of the
messages RDD in memory for reuse by later queries.)

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
RDDs: Creation
• There are three ways of creating RDDs:
1. from an in-memory collection of objects (known as
parallelizing a collection)
➢ Useful for doing CPU-intensive computations on
small amounts of input data in parallel
2. using a dataset from external storage (such as HDFS)
➢ Useful to create an RDD that has a reference to an
external dataset.
3. transforming an existing RDD
➢ Useful for deriving a new RDD from one that already
exists.
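• A minimal Scala sketch of the three creation routes, assuming an existing
  SparkContext sc (the paths and names here are illustrative, not from the slides):

    // 1. Parallelizing an in-memory collection
    val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))

    // 2. Referencing a dataset in external storage
    val lines = sc.textFile("hdfs://namenode:8020/data/input.txt")

    // 3. Transforming an existing RDD
    val longLines = lines.filter(_.length > 80)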
RDDs: Transformations and Actions
• Spark provides two categories of operations on RDDs
1. Transformations
➢ A transformation generates a new RDD from an existing
one
➢ are lazy operations
2. Actions
➢ an action triggers a computation on an RDD and does
something with the results—either returning them to the
user, or saving them to external storage
➢ Actions have an immediate effect
• One way of telling if an operation is a transformation or an
action is by looking at its return type: if the return type is
RDD, then it’s a transformation; otherwise, it’s an action.
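• A small sketch of that distinction, assuming an RDD of log lines named lines
  (the names and output path are illustrative):

    val errors = lines.filter(_.contains("ERROR")) // transformation: returns an RDD, nothing runs yet
    val n = errors.count()                         // action: returns a Long, triggers the computation
    errors.saveAsTextFile("hdfs://namenode:8020/out/errors") // action: saves to external storage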
RDD Operations
Transformations (define a new RDD):
    map, filter, sample, union, groupByKey, reduceByKey, join, cache, …

Parallel Operations / Actions (return a result to the driver):
    reduce, collect, count, save, lookupKey, …

RDDs: Transformations and Actions
• Spark’s library contains a rich set of operators,
including transformations for mapping, grouping,
aggregating, repartitioning, sampling, and joining
RDDs, and for treating RDDs as sets.
• There are also actions for materializing RDDs as
collections, computing statistics on RDDs,
sampling a fixed number of elements from an
RDD, and saving RDDs to external storage.
RDD Fault Tolerance
• RDDs maintain lineage information that can be used to
reconstruct lost partitions
• e.g.:
cachedMsgs = textFile(...).filter(_.contains(“error”))
.map(_.split(‘\t’)(2))
.cache()

Lineage: HdfsRDD (path: hdfs://…) → FilteredRDD (func: contains(…)) →
MappedRDD (func: split(…)) → CachedRDD
RDD Persistence in Spark
• Spark RDD persistence is an optimization technique. It
saves the result of RDD evaluation.
• We can make an RDD persistent through the cache() and
persist() methods.
• The cache() method stores the whole RDD in memory.
• The difference between cache() and persist() is that using
cache() the default storage level is MEMORY_ONLY while
using persist() we can use various storage levels.
• The benefits of RDD persistence are that it makes the whole
system
1. time efficient,
2. cost efficient, and
3. lessens the execution time.
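• For example (the dataset and the choice of storage level are illustrative):

    import org.apache.spark.storage.StorageLevel

    val logs = sc.textFile("hdfs://namenode:8020/logs")
    val errors = logs.filter(_.contains("ERROR"))

    errors.cache()        // default storage level: MEMORY_ONLY
    // or, with an explicit storage level (spill to disk when memory is short):
    // errors.persist(StorageLevel.MEMORY_AND_DISK)

    errors.count()        // the first action materializes and caches the RDD
    errors.unpersist()    // drop the cached data when it is no longer needed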
Storage levels of Persisted RDDs
• Spark offers different types of persistence behavior that may be
selected by calling persist() with an argument to specify the
StorageLevel.
• MEMORY_ONLY is a default level. It uses the regular in-memory
representation of objects.
• MEMORY_ONLY_SER is a more compact representation. It stores
by serializing the elements in a partition as a byte array. It incurs
CPU overhead compared to MEMORY_ONLY, but is worth it if the
resulting serialized RDD partition fits in memory when the regular
in-memory representation doesn’t.
• If recomputing a dataset is expensive, then either
MEMORY_AND_DISK (spill to disk if the dataset doesn’t fit in
memory) or MEMORY_AND_DISK_SER (spill to disk if the serialized
dataset doesn’t fit in memory) is appropriate.
Storage levels of Persisted RDDs
• In DISK_ONLY storage level, RDD is stored only on
disk.
• Cached RDDs can be retrieved only by jobs in the
same application. To share datasets between
applications, they must be written to external
storage using one of the saveAs*() methods
(saveAsText File(), saveAsHadoopFile(), etc.) in the
first application, then loaded using the
corresponding method in SparkContext (textFile(),
hadoopFile(), etc.) in the second application.
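• A sketch of that hand-off between two applications (cachedRdd and the paths
  are hypothetical):

    // Application 1: write the cached dataset to external storage
    cachedRdd.saveAsTextFile("hdfs://namenode:8020/shared/dataset")

    // Application 2 (its own SparkContext): load the dataset back
    val shared = sc.textFile("hdfs://namenode:8020/shared/dataset")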
RDD Persistence and Removal

• RDD Persistence
– RDD.persist()
– Storage level:
• MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, DISK_ONLY, …
• RDD Persistence Removal
– RDD.unpersist()
Serialization in Spark
• Spark will use Java serialization to send data over the
network from one executor to another, or when caching
(persisting) data in serialized form as described in
“Persistence levels”.
• A better choice for most Spark programs is Kryo
serialization. Kryo is a more efficient general-purpose
serialization library for Java.
• In order to use Kryo serialization, set the spark.serializer
as follows on the SparkConf in your driver program:
conf.set("spark.serializer",
"org.apache.spark.serializer.KryoSerializer")
• It is much more efficient to register classes with Kryo
before using them.
Serialization in Spark
• Registering classes with Kryo is straightforward.
• Create a subclass of KryoRegistrator, and override the
registerClasses() method:
class CustomKryoRegistrator extends KryoRegistrator {
override def registerClasses(kryo: Kryo) {
kryo.register(classOf[WeatherRecord])
}
}
• Finally, in the driver program, set the
spark.kryo.registrator property to the fully qualified
classname of above KryoRegistrator implementation:
conf.set("spark.kryo.registrator", "CustomKryoRegistrator")
Broadcast Variables and Accumulators
(Shared Variables )
• A broadcast variable is serialized and sent to each executor,
where it is cached so that later tasks can access it if needed.
• Broadcast variables allow the programmer to keep a read-
only variable, cached on each node, rather than sending a
copy of it with tasks
• A broadcast variable is created by passing the variable to be
broadcast to the broadcast() method on SparkContext.
• It returns a Broadcast[T] wrapper around the variable of type
T:
> broadcastV1 = sc.broadcast([1, 2, 3, 4, 5, 6])
> broadcastV1.value
[1, 2, 3, 4, 5, 6]
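• In Scala, the same idea looks roughly like this (the lookup table is invented
  for illustration):

    // A small read-only lookup table, shipped once to each executor
    val countryNames = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

    val codes = sc.parallelize(Seq("IN", "US", "IN"))
    val named = codes.map(code => countryNames.value.getOrElse(code, "Unknown"))
    named.collect()   // Array(India, United States, India)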
Accumulators (Shared Variables )
• Accumulators are variables that are only “added” to through an
associative operation and can be efficiently supported in parallel
accum = sc.accumulator(0)
accum.add(x)
accum.value
• After a job has completed, the accumulator’s final value can be
retrieved from the driver program.
• To understand the progress of running stages, tracking
accumulators in UI is useful.
• A numeric accumulator can be created by calling
SparkContext.longAccumulator().
• An accumulator of type double can be created by
SparkContext.doubleAccumulator().
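• A Scala sketch using the long accumulator API (the file path and accumulator
  name are illustrative):

    // Count blank lines as a side effect while scanning a file
    val blankLines = sc.longAccumulator("blankLines")

    val lines = sc.textFile("hdfs://namenode:8020/data/input.txt")
    lines.foreach { line =>
      if (line.trim.isEmpty) blankLines.add(1)   // executors add to the accumulator
    }
    println(blankLines.value)   // the final value is read back in the driver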
How Spark Works?
• Spark has a small code base and the system that is divided in
various layers. Each layer has some responsibilities. The layers
are independent of each other.
1. The first layer is the interpreter; Spark uses a Scala
interpreter with some modifications.
2. As you enter your code in the Spark console (creating RDDs and
applying operators), Spark creates an operator graph.
3. When the user runs an action (like collect), the Graph is
submitted to a DAG Scheduler. The DAG scheduler divides
operator graph into (map and reduce) stages.
4. A stage is comprised of tasks based on partitions of the
input data.
How Spark Works?
4. The DAG scheduler pipelines operators together to
optimize the graph. For example, many map operators can be
scheduled in a single stage. This optimization is key to
Spark's performance. The final result of a DAG scheduler
is a set of stages.
5. The stages are passed on to the Task Scheduler. The task
scheduler launches tasks via cluster manager. (Spark
Standalone/Yarn/Mesos). The task scheduler doesn’t
know about dependencies among stages.
6. The worker executes the tasks within its executor JVMs.
The worker knows only about the code that is
passed to it.
How Spark Works?
Anatomy of a Spark Job Run
• At the highest level, there are two independent entities
involved in Spark job run:
1. the driver
– which hosts the application (SparkContext)
– schedules tasks for a job
2. the executors
– exclusive to the application
– run for the duration of the application
– execute the application’s tasks
• A Spark job is submitted automatically when an action (such
as count()) is performed on an RDD. Internally, this causes
runJob() to be called on the SparkContext.
Anatomy of a Spark Job Run
DAG Construction
• A Spark job is broken up into stages.
• There are two types of tasks that can run in a
stage
1. shuffle map tasks
2. result tasks
• Shuffle map tasks
– Each shuffle map task runs a computation on one
RDD partition and, based on a partitioning function,
writes its output to a new set of partitions, which are
then fetched in a later stage. Shuffle map tasks run in
all stages except the final stage.
DAG Construction
• Result tasks
– Result tasks run in the final stage that returns the result
to the user’s program (such as the result of a count()).
– Each result task runs a computation on its RDD partition,
then sends the result back to the driver, and the driver
assembles the results from each partition into a final
result.
• The simplest Spark job is one that does not need a shuffle
and therefore has just a single stage composed of result
tasks. This is like a map-only job in MapReduce.
• More complex jobs involve grouping operations and require
one or more shuffle stages.
Task Scheduling
• When the task scheduler is sent a set of tasks, it uses
its list of executors that are running for the application
and constructs a mapping of tasks to executors that
takes placement preferences into account.
• For a given executor the scheduler will first assign
process-local tasks, then node-local tasks, then rack-
local tasks, before assigning an arbitrary (nonlocal)
task.
• The task scheduler assigns tasks to executors that have
free cores and it continues to assign more tasks as
executors finish running tasks, until the task set is
complete.
Task Scheduling
• Each task is allocated one core by default.
• Assigned tasks are launched through a scheduler backend
(step 4 in fig).
• The scheduler backend sends a remote launch task message
(step 5) to the executor backend.
• The executor backend tells the executor to launch the task
(step 6).
• Executors also send status update messages to the driver
when a task has finished or if a task fails.
• If a task fails, the task scheduler will resubmit the task on
another executor.
• The task scheduler will also launch speculative tasks for tasks
that are running slowly, if this is enabled.
Task Execution
• Spark relies on executors to run the tasks that
make up a Spark job.
• An executor runs a task as follows (step 7).
1. First, it makes sure that the JAR and file
dependencies for the task are up to date.
2. Second, it deserializes the task code from the
serialized bytes that were sent as a part of the launch
task message.
3. Third, the task code is executed.
• The tasks are run in the same JVM as the executor, so
there is no process overhead for task launch
Task Execution
• Tasks can return a result to the driver.
– The result is serialized and sent to the executor
backend, and then back to the driver as a status
update message.
• A shuffle map task returns information that allows
the next stage to retrieve the output partitions.
• A result task returns the value of the result for the
partition it ran on, which the driver assembles
into a final result to return to the user’s program.
Executors and Cluster Managers
• Managing the lifecycle of executors is the responsibility of
the cluster manager.
• Spark provides a variety of cluster managers with different
characteristics:
Local, Standalone, Mesos, YARN
• The Mesos and YARN cluster managers are superior to the
standalone manager since they take into account the
resource needs of other applications running on the cluster
and enforce a scheduling policy across all of them.
• The standalone cluster manager uses a static allocation of
resources from the cluster, and therefore is not able to
adapt to the varying needs of other applications over time.
• YARN is the only cluster manager that is integrated with
Hadoop’s Kerberos security mechanisms
Spark Cluster Managers
• In local mode there is a single executor running in the same JVM
as the driver. This mode is useful for testing or running small
jobs.
• The standalone cluster manager is a simple distributed
implementation that runs a single Spark master and one or
more workers.
• Apache Mesos is a general-purpose cluster resource manager
that allows fine-grained sharing of resources across different
applications according to an organizational policy. By default
(fine-grained mode), each Spark task is run as a Mesos task.
• YARN is the resource manager used in Hadoop. Each running
Spark application corresponds to an instance of a YARN
application, and each executor runs in its own YARN container.
Spark on YARN
• Spark offers two deploy modes for running on YARN:
• YARN client mode
– the driver runs in the client
– is required for programs that have any interactive
component, such as spark-shell or pyspark
– is also useful when building Spark programs, since any
debugging output is immediately visible.
• YARN cluster mode
– the driver runs on the cluster in the YARN application
master.
– is appropriate for production jobs
– enables logfiles to be retained for later inspection.
– retries the application if the application master fails.
Starting of Spark executors in YARN Client Mode
YARN Client Mode
• The interaction with YARN starts when a new SparkContext
instance is constructed by the driver program (step 1 in fig).
• The context submits a YARN application to the YARN
resource manager (step 2).
• The YARN resource manager starts a YARN container on a
node manager in the cluster and runs a Spark
ExecutorLauncher application master in it (step 3).
• The job of the ExecutorLauncher is to start executors in
YARN containers, which it does by requesting resources
from the resource manager (step 4), then launching
ExecutorBackend processes as the containers are allocated
to it (step 5).
YARN Client Mode
• As each executor starts, it connects back to the SparkContext and
registers itself. This gives the SparkContext information about the
number of executors available for running tasks and their locations,
which is used for making task placement decisions.
• The number of executors that are launched is set in spark-shell,
spark-submit, or pyspark (if not set, it defaults to two), along with the
number of cores that each executor uses (the default is one) and the
amount of memory (the default is 1,024 MB).
• An example showing how to run spark-shell on YARN with four
executors, each using one core and 2 GB of memory:
% spark-shell --master yarn-client \
--num-executors 4 \
--executor-cores 1 \
--executor-memory 2g
Starting of Spark executors in YARN cluster mode
YARN cluster mode
• In YARN cluster mode,
– the user’s driver program runs in a YARN application master
process.
– The spark-submit command is used with a master URL of
yarn-cluster:
– % spark-submit --master yarn-cluster ...
– All other parameters, like --num-executors and the
application JAR (or Python file), are the same as for YARN
client mode.
– The spark-submit client will launch the YARN application
(step 1 in fig), but it doesn’t run any user code. The rest of
the process is the same as client mode, except the
application master starts the driver program (step 3b) before
allocating resources for executors (step 4).
Spark RDD – Prominent Features
References
1. Tom White, "Hadoop: The Definitive Guide", 4th Edition, O'Reilly, Chapter 19.
2. https://spark.apache.org/docs/latest/
3. https://techvidvan.com/tutorials/spark-tutorial/
What is Spark Streaming?
• Framework for large scale stream
processing
– Scales to 100s of nodes
– Can achieve second scale latencies
– Integrates with Spark’s batch and interactive
processing
– Provides a simple batch-like API for
implementing complex algorithms
– Can absorb live data streams from Kafka,
Flume, ZeroMQ, etc.
Motivation
 Many important applications must process large
streams of live data and provide results in near-
real-time
- Social network trends
- Website statistics
- Intrusion detection systems
- etc.
 Require large clusters to handle workloads
 Require latencies of few seconds
Stateful Stream Processing
Requirements of Stream Processing
System
• Scalable to large clusters
• Second-scale latencies
• Simple programming model
• Integrated with batch & interactive processing
• Efficient fault-tolerance in stateful
computations
What is Spark streaming?
• Spark streaming is an extension of the core Spark API that
enables scalable, high-throughput stream processing of live data
streams.
• Data can be ingested from many sources like Kafka, Flume,
Twitter, ZeroMQ, Kinesis, or TCP sockets
How Spark streaming works?
• Spark Streaming receives live input data streams and divides the
data into batches.
• Spark Streaming provides a high-level abstraction called a
discretized stream or DStream.
• A Spark DStream represents a stream of data divided into small
batches.
• Each batch is processed by the Spark engine to generate the
final stream of results.
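• A minimal word-count sketch over a TCP socket source, assuming a 1-second
  batch interval (the host, port, and application name are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()             // start receiving and processing the stream
    ssc.awaitTermination()  // keep running until stopped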
Discretized Stream Processing
An Apache Spark Discretized Stream (DStream) represents a stream of data
divided into small batches. A streaming computation is run as a series of
very small, deterministic batch jobs:
 Chop up the live stream into batches of X seconds.
 Spark treats each batch of data as an RDD and processes it using RDD
  operations.
 Finally, the processed results of the RDD operations are returned in
  batches.
(Pipeline: live data stream → Spark Streaming → batches of X seconds →
Spark → processed results)
Discretized Stream Processing
 Batch sizes can be as low as ½ second, with end-to-end latency of about
  1 second.
 This opens the potential for combining batch processing and streaming
  processing in the same system.
Example 1 – Get hashtags from Twitter
    val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)

• tweets is a DStream: a sequence of RDDs representing a stream of data
  coming from the Twitter Streaming API.
• Each batch interval (t, t+1, t+2, …) produces one RDD, stored in memory
  as an immutable, distributed collection.
Example 1 – Get hashtags from Twitter

    val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
    val hashTags = tweets.flatMap(status => getTags(status))

• flatMap is a DStream transformation: it modifies the data in one DStream
  to create another DStream.
• For every batch (t, t+1, t+2, …) of the tweets DStream, flatMap is
  applied and a new RDD of the hashTags DStream (e.g. [#cat, #dog, …]) is
  created.
Example 1 – Get hashtags from Twitter
    val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
    val hashTags = tweets.flatMap(status => getTags(status))
    hashTags.saveAsHadoopFiles("hdfs://...")

• saveAsHadoopFiles is an output operation: it pushes data to external
  storage.
• For every batch, flatMap is applied to the tweets RDD and the resulting
  hashTags RDD is saved to HDFS.
Fault-tolerance

 RDDs remember the sequence of operations that created them from the
  original fault-tolerant input data.
 Batches of input data are replicated in memory across multiple worker
  nodes, and are therefore fault-tolerant.
 Data lost due to worker failure can be recomputed from the replicated
  input data (lost partitions of, e.g., the hashTags RDD are recomputed on
  other workers).
Key concepts
 DStream – sequence of RDDs representing a stream of data
- Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets
 Transformations – modify data from one DStream to another
- Standard RDD operations – map, countByValue, reduce, join, …
- Stateful operations – window, countByValueAndWindow, …
 Output Operations – send data to external entity
- saveAsHadoopFiles – saves to HDFS
- foreach – do anything with each batch of results
Transformations on DStreams
• Similar to those of RDDs, transformations allow the data
from the input DStream to be modified. DStreams
support many of the transformations available on normal
Spark RDDs. Some of the common ones are as follows.
Transformations on DStreams
Transformations on DStreams
Example 2 – Count the hashtags

    val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
    val hashTags = tweets.flatMap(status => getTags(status))
    val tagCounts = hashTags.countByValue()

• For every batch (t, t+1, t+2, …), flatMap produces the hashTags RDD, and
  countByValue (a map followed by a reduceByKey) produces the tagCounts
  RDD, e.g. [(#cat, 10), (#dog, 25), …].
Example 3 – Count the hashtags over last 10 mins

    val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
    val hashTags = tweets.flatMap(status => getTags(status))
    val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()

• window(Minutes(10), Seconds(1)) is a sliding window operation: the first
  argument is the window length and the second is the sliding interval.
Window Operations
• Spark Streaming also provides windowed computations,
which allows to apply transformations over a sliding
window of data.
• Any window operation needs to specify two parameters.
1. window length - The duration of the window.
2. sliding interval - The interval at which the window operation is
performed.
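• For example, a windowed count over the last 30 seconds, sliding every 10
  seconds (the durations and the wordPairs DStream are illustrative; both
  durations must be multiples of the batch interval):

    // wordPairs is assumed to be a DStream of (word, 1) pairs
    val windowedCounts = wordPairs.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b,   // reduce function
      Seconds(30),                 // window length
      Seconds(10))                 // sliding interval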
Example 3 – Counting the hashtags over last 10 mins

    val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()

• At each sliding interval, the window covers the hashTags RDDs of the
  last 10 minutes (t-1, t, t+1, …), and countByValue counts over all the
  data in the window to produce tagCounts.
Window Operations
Window Operations
Smart window-based countByValue

    val tagCounts = hashtags.countByValueAndWindow(Minutes(10), Seconds(1))

• Instead of recounting the whole window at every sliding interval, the
  incremental version adds the counts from the new batch entering the
  window and subtracts the counts from the batch leaving the window.
Output Operations on DStreams
• Output operations allow a DStream’s data to be pushed
out to external systems like a database or a file system.
• Since the output operations actually allow the
transformed data to be consumed by external systems,
they trigger the actual execution of all the DStream
transformations.
• Currently, the following output operations are defined:
Output Operations on DStreams
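• Two commonly used output operations, sketched with illustrative names and
  paths:

    // Save each batch of results as text files under a time-stamped prefix
    tagCounts.saveAsTextFiles("hdfs://namenode:8020/out/tagCounts")

    // Run arbitrary code on the RDD behind each batch
    tagCounts.foreachRDD { rdd =>
      rdd.take(10).foreach(println)   // e.g. print a small sample on the driver
    }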
References:
https://spark.apache.org/docs/latest/streaming-programming-guide.html
GraphX: A Resilient Distributed Graph System on Spark
• GraphX is a new component in Spark for graphs and graph-parallel
computation.
• GraphX extends Spark’s Resilient Distributed Dataset (RDD) abstraction
to introduce the Resilient Distributed Property Graph (RDG).
• The property graph is a directed multigraph, i.e. it can have multiple
edges in parallel. Every vertex and edge has user-defined properties
associated with it, and parallel edges allow multiple
relationships between the same vertices.
• To support graph computation, GraphX exposes a set of fundamental
operators (e.g., subgraph, joinVertices, and aggregateMessages) as well
as an optimized variant of the Pregel API. In addition, GraphX includes
a growing collection of graph algorithms and builders to simplify graph
analytics tasks.
• Also GraphX has new operations to view, filter, and transform graphs,
that substantially simplify the process of graph ETL and analysis.
Getting Started with GraphX
To get started, you first need to import Spark and GraphX into your
project as follows:

    import org.apache.spark._
    import org.apache.spark.graphx._
    // To make some of the examples work we will also need RDD
    import org.apache.spark.rdd.RDD

Note: you will also need a SparkContext if you are not using the Spark
shell.
The Property Graph
• The property graph is a directed multigraph with user defined
objects attached to each vertex and edge.
• A directed multigraph is a directed graph with potentially multiple
parallel edges sharing the same source and destination vertex.
• The ability to support parallel edges simplifies modeling scenarios
where there can be multiple relationships between the same
vertices.
• Each vertex is keyed by a unique 64-bit long identifier (VertexId).
• GraphX does not impose any ordering constraints on the vertex
identifiers.
• Similarly, edges have corresponding source and destination vertex
identifiers.
• The property graph is parameterized over the vertex (VD) and edge
(ED) types.
The Property Graph
• GraphX optimizes the representation of vertex and edge types
when they are primitive data types (e.g., int, double, etc…)
reducing the in memory footprint by storing them in
specialized arrays.
• Like RDDs, property graphs are immutable, distributed,
and fault-tolerant.
• we can produce a new graph with the desired changes by
changing the values or structure of the graph. We can reuse
substantial parts of the original graph in the new graph.
• The graph is partitioned across the executors using a range of
vertex partitioning heuristics. As with RDDs, each partition of
the graph can be recreated on a different machine in the
event of a failure.
Example Property Graph
• Suppose we want to construct a property graph
consisting of the various collaborators on the
GraphX project.
• The vertex property might contain the username
and occupation.
• We could annotate edges with a string describing
the relationships between collaborators.
• The resulting graph would have the type
signature:
val userGraph: Graph[(String, String), String]
Example Property Graph
Example Property Graph
// Assume the SparkContext has already been constructed
val sc: SparkContext
// Create an RDD for the vertices
val users: RDD[(VertexId, (String, String))] =
sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
(5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
// Create an RDD for edges
val relationships: RDD[Edge[String]] =
sc.parallelize(Array(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
// Define a default user in case there are relationship with missing user
val defaultUser = ("John Doe", "Missing")
// Build the initial Graph
val graph = Graph(users, relationships, defaultUser)
triplet view of property graph
• In addition to the vertex and edge views of the property
graph, GraphX also exposes a triplet view.
• The triplet view logically joins the vertex and edge
properties yielding an RDD[EdgeTriplet[VD, ED]]
containing instances of the EdgeTriplet class.
• The EdgeTriplet class extends the Edge class by adding
the srcAttr and dstAttr members which contain the
source and destination properties respectively.
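• For example, using the graph built above, the triplet view can render each
  relationship as a sentence:

    val facts: RDD[String] =
      graph.triplets.map(triplet =>
        triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1)
    facts.collect.foreach(println)
    // e.g. "rxin is the collab of jgonzal", "franklin is the advisor of rxin"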
Spark GraphX Operators
• Just as RDDs have basic operations like map, filter,
and reduceByKey, property graphs also have a
collection of basic operators that take user
defined functions and produce new graphs with
transformed properties and structure.
• The graph Class provides basic operations to
access and manipulate the data associated with
vertices and edges as well as the underlying
structure.
• The GraphOps Class contains convenient
operators that are expressed as a compositions of
the core operators and graph algorithms.
Spark GraphX Operators
• Property Operators
– mapVertices
– mapEdges
– mapTriplets
• Structural Operators
– The reverse operator
– The subgraph operator
– The mask operator
– The groupEdges operator
• Join Operators
• Aggregate Messages
– In GraphX, the core aggregation operation is
aggregateMessages.
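• A short sketch of a few of these operators applied to the property graph from
  the earlier example (the predicates are illustrative):

    // Property operator: transform vertex attributes (keep only the username)
    val names = graph.mapVertices((id, attr) => attr._1)

    // Structural operator: keep only the "collab" relationships
    val collabs = graph.subgraph(epred = triplet => triplet.attr == "collab")

    // GraphOps operator: compute the in-degree of every vertex
    graph.inDegrees.collect.foreach(println)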
REFERENCES:
https://spark.apache.org/docs/latest/graphx-programming-guide.html
https://data-flair.training/blogs/graphx-api-spark/
