INTRODUCTION TO APACHE SPARK

Fast, interactive, language-integrated cluster computing framework.
Spark is an open-source cluster computing framework.
Originally developed at the AMPLab of the University of California, Berkeley.
Later handed over to the Apache Software Foundation.


Why Apache Spark?


Hadoop has performance issues.
Hadoop supports only batch processing.

Problems with batch processing:

1. It does not interact directly with the user.
2. Each user prepares his job/program offline and submits it to the computer operator.
3. Programmers leave their programs with the operator, and the operator groups programs with similar requirements into batches.


Job
|-----> CPU time
|-----> I/O time

Only when a job completes fully is the next job scheduled.
While a job waits on I/O, the CPU sits idle, which increases the CPU time spent per job.
Grouping jobs into batches, however, increases the throughput of the system.
Spark supports both batch and real-time processing.


(Types of OS: batch processing, time-sharing, distributed, network, and real-time; real-time is further divided into hard real-time and soft real-time.)

In Hadoop, we are forced to write MapReduce.

We cannot control the internals of the framework.

The mapper stores its output on local disk.

The final output of the reducer goes to HDFS.

Complex processing may require 10-20 chained MR jobs:

        L/H         L/H
MR1 --------> MR2 --------> MR3

(L/H = intermediate results written to local disk / HDFS between jobs)
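
A minimal sketch in Scala (assuming a local Spark installation; the input path is hypothetical) of how Spark chains the equivalent of several MR stages in one program, keeping intermediate results in memory instead of writing them to HDFS between jobs:

    import org.apache.spark.{SparkConf, SparkContext}

    object ChainedStages {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("ChainedStages").setMaster("local[*]"))

        // Three chained stages, analogous to MR1 --> MR2 --> MR3, but
        // intermediate results stay in memory rather than going to HDFS.
        val counts = sc.textFile("input.txt")        // hypothetical input path
          .flatMap(_.split("\\s+"))                  // stage 1: tokenize
          .map(word => (word.toLowerCase, 1))        // stage 2: key by word
          .reduceByKey(_ + _)                        // stage 3: aggregate

        counts.take(10).foreach(println)
        sc.stop()
      }
    }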


Grouping is done on the reduce side, so we often need a larger number of reducers even when a single mapper would do.

In Hadoop, Map and Reduce are tightly coupled.

We need a framework that gives the user freedom over Map and Reduce.
In Spark, Map and Reduce are not tightly coupled (see the sketch below).
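
A sketch of this looser coupling, written for spark-shell (where sc is predefined); the data and partition count are illustrative:

    // reduceByKey is just another transformation, and the number of
    // reduce partitions ("reducers") can be chosen per operation.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))

    // Ask for 8 output partitions for this aggregation alone.
    val summed = pairs.reduceByKey(_ + _, numPartitions = 8)

    // No matching "map phase" is required; transformations mix freely.
    val big = summed.filter { case (_, total) => total > 2 }
    println(big.collect().mkString(", "))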


In Hadoop, Map is a task and Reduce is a task.
In Spark, a task can perform any operation (transformation, aggregation, calculation, shuffle).
In Hadoop 1.x there is one JobTracker and multiple TaskTrackers; in 2.x, i.e. YARN, this role is played by the ApplicationMaster.
In Spark, a "driver" program is used, which is responsible for executing tasks with the help of executors.


The driver decides the number of executors; programmers can also set the number of executors.
Tasks run inside executors (a task may be anything: map, reduce, ...).
One executor can handle multiple tasks.
Every task is an atomic unit of operation.
Each task operates on a subset of the data.
Inside the driver program there is an object called the SparkContext.
In Spark, the driver program communicates with the executors through the SparkContext.
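
A hedged sketch of setting the executor count from code; these are real Spark configuration keys, but spark.executor.instances only takes effect on a managed cluster such as YARN, and the values here are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("ExecutorSizingDemo")
      .set("spark.executor.instances", "4")  // number of executors requested
      .set("spark.executor.cores", "2")      // concurrent tasks per executor
      .set("spark.executor.memory", "2g")    // memory per executor

    val sc = new SparkContext(conf)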


With the help of the SparkContext, the driver program can send anything to the executors automatically.
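
One concrete way the driver ships data to every executor is a broadcast variable; a small spark-shell style sketch (sc predefined, data illustrative):

    // The driver sends the lookup table to each executor exactly once.
    val lookup = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

    val codes = sc.parallelize(Seq("IN", "US", "IN"))
    val named = codes.map(code => lookup.value.getOrElse(code, "unknown"))
    println(named.collect().mkString(", "))  // India, United States, India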


Driver Program
o The main executable program from which Spark operations are performed.
o Controls and coordinates all operations.
o The driver program is the main class; it executes parallel operations on a cluster and defines RDDs.


Spark Context
o The driver accesses Spark functionality through a SparkContext object, which represents a connection to the computing cluster.

o Used to build RDDs.

o Works with the cluster manager.

o Manages executors running on worker nodes.
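
A minimal, self-contained sketch (local mode assumed) of a driver program creating a SparkContext and using it to build an RDD:

    import org.apache.spark.{SparkConf, SparkContext}

    object SparkContextDemo {
      def main(args: Array[String]): Unit = {
        // "local[*]" connects to a local "cluster" using all available cores.
        val conf = new SparkConf().setAppName("SparkContextDemo").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // Build an RDD from an in-memory collection, split into 4 partitions.
        val numbers = sc.parallelize(1 to 100, numSlices = 4)
        println(s"sum = ${numbers.sum()}")

        sc.stop()
      }
    }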


o In Hadoop, the JobTracker executes the job and manages the resources (logic + cluster).

o In Spark, the cluster is handled by the cluster manager, while the logic is handled separately (by the driver and its executors).


Spark Framework


Spark Engine:

o Takes input data, distributes it to the cluster, and performs the operations.

o Spark Core contains a lot of libraries and functions.

Management:

o Schedules jobs and manages them over the cluster; in Spark we can use YARN, Mesos, or the Apache standalone cluster manager.

Storage:
o In Spark we can use HDFS, S3, an RDBMS, local storage, NoSQL, etc.
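
A small sketch of this storage flexibility: the same textFile call reads from local disk, HDFS, or S3 depending only on the URI scheme (all paths below are hypothetical, and S3 access needs the s3a connector on the classpath):

    val localRdd = sc.textFile("file:///tmp/data.txt")
    val hdfsRdd  = sc.textFile("hdfs://namenode:8020/data/input")
    val s3Rdd    = sc.textFile("s3a://my-bucket/data/input")
    println(localRdd.count())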


Library:
o Spark SQL gives an easy way to read and write data (see the sketch after this list).
o MLlib supports a large number of machine learning algorithms.
o Spark supports graph analysis with the help of GraphX.
o Spark supports real-time streaming with the help of the streaming library.
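
A hedged sketch of reading and querying data with Spark SQL, using the Spark 2.x SparkSession API; the input file and its name/age columns are hypothetical:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("SparkSqlDemo")
      .master("local[*]")
      .getOrCreate()

    // Read a JSON file into a DataFrame and query it with plain SQL.
    val people = spark.read.json("people.json")  // hypothetical input file
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name, age FROM people WHERE age > 21").show()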

Programming
o Spark supports various languages like Scala, Python, R, and Java.
o Scala, Python, and R have interactive shells.
o Through Spark SQL, Spark also supports many external tools via interfaces such as JDBC.


RDD (Resilient Distributed Dataset)

 Spark is built around RDDs; you can create, transform, analyze, and store RDDs in a Spark program.

Example: a home has multiple rooms, and a room is the smallest unit of the home; likewise, an RDD is the basic unit of data in Spark.

 A dataset contains a collection of elements of any type (strings, lines, rows, objects, collections).

 The dataset can be partitioned and distributed across multiple nodes.

 RDDs are immutable; they cannot be changed.

 They can be cached and persisted.



 Transformations act on RDDs to create a new RDD.

 Actions analyze RDDs to produce a result.
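
A spark-shell style sketch (sc predefined, data illustrative) showing lazy transformations, caching, and actions:

    // Transformations build new RDDs lazily; nothing runs yet.
    val lines  = sc.parallelize(Seq("spark is fast", "spark is simple"))
    val words  = lines.flatMap(_.split(" "))    // transformation
    val sparks = words.filter(_ == "spark")     // transformation, still lazy

    sparks.cache()  // keep the computed result in memory for reuse

    // Actions trigger the computation and return results to the driver.
    println(sparks.count())                     // 2
    println(words.distinct().collect().toSet)   // Set(spark, is, fast, simple)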
