INTRODUCTION TO APACHE SPARK

Fast, interactive, language-integrated cluster computing framework.
Spark is an open-source cluster computing framework.
Originally developed at the AMPLab of the University of California, Berkeley.
Later handed over to the Apache Software Foundation.


Why Apache Spark?


Hadoop has performance issues.
Hadoop supports only batch processing.

Problems with batch processing:

1. It does not interact directly with the user.
2. Each user prepares his job/program offline and submits it to the computer operator.
3. Programmers leave their programs with the operator, and the operator groups programs with similar requirements into batches.


Job
|-----> CPU time
|-----> I/O time

Only when a job completes fully is the next job scheduled.
While a job waits on I/O, the CPU sits idle, which increases the CPU time spent per job.
Grouping jobs into batches, however, increases the throughput of the system.
Spark supports both batch and real-time processing.


(Types of OS: batch processing, time-sharing, distributed, network, and real-time; real-time is further divided into hard real-time and soft real-time.)

In Hadoop, we are forced to write MapReduce.

We cannot control the internals of the framework.

The mapper stores its output on local disk.

The final output of the reducer goes to HDFS.

Complex processing may require 10-20 chained MR jobs:

        L/H         L/H
MR1 --------> MR2 --------> MR3

(L/H = intermediate results written to local disk / HDFS between jobs)
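
A minimal sketch in Scala (assuming a local Spark installation; the input path is hypothetical) of how Spark chains the equivalent of several MR stages in one program, keeping intermediate results in memory instead of writing them to HDFS between jobs:

    import org.apache.spark.{SparkConf, SparkContext}

    object ChainedStages {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("ChainedStages").setMaster("local[*]"))

        // Three chained stages, analogous to MR1 --> MR2 --> MR3, but
        // intermediate results stay in memory rather than going to HDFS.
        val counts = sc.textFile("input.txt")        // hypothetical input path
          .flatMap(_.split("\\s+"))                  // stage 1: tokenize
          .map(word => (word.toLowerCase, 1))        // stage 2: key by word
          .reduceByKey(_ + _)                        // stage 3: aggregate

        counts.take(10).foreach(println)
        sc.stop()
      }
    }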


Grouping is done on the reduce side, so we often need a larger number of reducers even when a single mapper would do.

In Hadoop, Map and Reduce are tightly coupled.

We need a framework that gives the user freedom over Map and Reduce.
In Spark, Map and Reduce are not tightly coupled (see the sketch below).
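
A sketch of this looser coupling, written for spark-shell (where sc is predefined); the data and partition count are illustrative:

    // reduceByKey is just another transformation, and the number of
    // reduce partitions ("reducers") can be chosen per operation.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))

    // Ask for 8 output partitions for this aggregation alone.
    val summed = pairs.reduceByKey(_ + _, numPartitions = 8)

    // No matching "map phase" is required; transformations mix freely.
    val big = summed.filter { case (_, total) => total > 2 }
    println(big.collect().mkString(", "))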


In Hadoop, Map is a task and Reduce is a task.
In Spark, a task can perform any operation (transformation, aggregation, calculation, shuffle).
In Hadoop 1.x there is one JobTracker and multiple TaskTrackers; in 2.x, i.e. YARN, this role is played by the ApplicationMaster.
In Spark, a "driver" program is used, which is responsible for executing tasks with the help of executors.


The driver decides the number of executors; programmers can also set the number of executors.
Tasks run inside executors (a task may be anything: map, reduce, ...).
One executor can handle multiple tasks.
Every task is an atomic unit of operation.
Each task operates on a subset of the data.
Inside the driver program there is an object called the SparkContext.
In Spark, the driver program communicates with the executors through the SparkContext.
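
A hedged sketch of setting the executor count from code; these are real Spark configuration keys, but spark.executor.instances only takes effect on a managed cluster such as YARN, and the values here are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("ExecutorSizingDemo")
      .set("spark.executor.instances", "4")  // number of executors requested
      .set("spark.executor.cores", "2")      // concurrent tasks per executor
      .set("spark.executor.memory", "2g")    // memory per executor

    val sc = new SparkContext(conf)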


With the help of the SparkContext, the driver program can send anything to the executors automatically.
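
One concrete way the driver ships data to every executor is a broadcast variable; a small spark-shell style sketch (sc predefined, data illustrative):

    // The driver sends the lookup table to each executor exactly once.
    val lookup = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

    val codes = sc.parallelize(Seq("IN", "US", "IN"))
    val named = codes.map(code => lookup.value.getOrElse(code, "unknown"))
    println(named.collect().mkString(", "))  // India, United States, India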


Driver Program
o The main executable program from which Spark operations are performed.
o Controls and coordinates all operations.
o The driver program is the main class; it executes parallel operations on a cluster and defines RDDs.


Spark Context
o The driver accesses Spark functionality through a SparkContext object, which represents a connection to the computing cluster.

o Used to build RDDs.

o Works with the cluster manager.

o Manages executors running on worker nodes.
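
A minimal, self-contained sketch (local mode assumed) of a driver program creating a SparkContext and using it to build an RDD:

    import org.apache.spark.{SparkConf, SparkContext}

    object SparkContextDemo {
      def main(args: Array[String]): Unit = {
        // "local[*]" connects to a local "cluster" using all available cores.
        val conf = new SparkConf().setAppName("SparkContextDemo").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // Build an RDD from an in-memory collection, split into 4 partitions.
        val numbers = sc.parallelize(1 to 100, numSlices = 4)
        println(s"sum = ${numbers.sum()}")

        sc.stop()
      }
    }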


o In Hadoop, the JobTracker executes the job and manages the resources (logic + cluster).

o In Spark, the cluster is handled by the cluster manager, while the logic is handled separately (by the driver and its executors).


Spark Framework


Spark Engine:

o Takes input data, distributes it to the cluster, and performs the operations.

o Spark Core contains a lot of libraries and functions.

Management:

o Schedules jobs and manages them over the cluster; in Spark we can use YARN, Mesos, or the Apache standalone cluster manager.

Storage:
o In Spark we can use HDFS, S3, an RDBMS, local storage, NoSQL, etc.
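
A small sketch of this storage flexibility: the same textFile call reads from local disk, HDFS, or S3 depending only on the URI scheme (all paths below are hypothetical, and S3 access needs the s3a connector on the classpath):

    val localRdd = sc.textFile("file:///tmp/data.txt")
    val hdfsRdd  = sc.textFile("hdfs://namenode:8020/data/input")
    val s3Rdd    = sc.textFile("s3a://my-bucket/data/input")
    println(localRdd.count())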


Library:
o Spark SQL gives an easy way to read and write data (see the sketch after this list).
o MLlib supports a large number of machine learning algorithms.
o Spark supports graph analysis with the help of GraphX.
o Spark supports real-time streaming with the help of the streaming library.
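
A hedged sketch of reading and querying data with Spark SQL, using the Spark 2.x SparkSession API; the input file and its name/age columns are hypothetical:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("SparkSqlDemo")
      .master("local[*]")
      .getOrCreate()

    // Read a JSON file into a DataFrame and query it with plain SQL.
    val people = spark.read.json("people.json")  // hypothetical input file
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name, age FROM people WHERE age > 21").show()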

Programming
o Spark supports various languages like Scala, Python, R, and Java.
o Scala, Python, and R have interactive shells.
o Through Spark SQL, Spark also supports many external tools via interfaces such as JDBC.


RDD (Resilient Distributed Dataset)

 Spark is built around RDDs; you can create, transform, analyze, and store RDDs in a Spark program.

Example: a home has multiple rooms, and a room is the smallest unit of the home; likewise, an RDD is the basic unit of data in Spark.

 A dataset contains a collection of elements of any type (strings, lines, rows, objects, collections).

 The dataset can be partitioned and distributed across multiple nodes.

 RDDs are immutable; they cannot be changed.

 They can be cached and persisted.



 Transformations act on RDDs to create a new RDD.

 Actions analyze RDDs to produce a result.
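
A spark-shell style sketch (sc predefined, data illustrative) showing lazy transformations, caching, and actions:

    // Transformations build new RDDs lazily; nothing runs yet.
    val lines  = sc.parallelize(Seq("spark is fast", "spark is simple"))
    val words  = lines.flatMap(_.split(" "))    // transformation
    val sparks = words.filter(_ == "spark")     // transformation, still lazy

    sparks.cache()  // keep the computed result in memory for reuse

    // Actions trigger the computation and return results to the driver.
    println(sparks.count())                     // 2
    println(words.distinct().collect().toSet)   // Set(spark, is, fast, simple)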
