Spark - ESGI 2020: IPPON 2018

IPPON 2019
Overview of the course

★ Each session is split into 45 min of theory and 45 min of hands-on exercises (TD)

● In the exercises you will apply what was covered in the theory part

Overview of the course

★ Apache Spark in detail

★ Being able to set up a Big Data architecture with Spark

○ Good understanding of the concepts

★ Theory & Hands-on

Agenda

Spark Presentation

Spark Core - Dataframes

Spark Core - RDD

Spark Core - Advanced Dataframes

Spark Core - Cluster and distribution

Spark Core - Optimizing Spark

Overview of the course

★ Evaluation

★ 50% hands-on exercises (TD)

★ 50% Project + presentation

What is Spark?

Apache Spark
★ Open Source project

★ Processing large volumes of data

★ In-memory

★ Fault-Tolerant & zero data loss

★ Built-in ML, Streaming and Graph processing libraries

★ Written in Scala

★ APIs available in Scala, Java, Python and R

★ SQL support

Creation of Spark at the AMPLab, UC Berkeley (Matei Zaharia)

Foundation of Databricks

June 2013: Spark joins the Apache Software Foundation (Apache Top-Level Project since February 2014)

May 2014: Version 1.0.0

July 2016: Version 2.0.0

Version 2.4.4 and 3.0.0 Preview

MapReduce
★ Processing Pattern invented by Google (2004)

★ How does it work?

○ A dataset is split into items
○ The Map operation applies a transformation to each item
○ The Reduce operation aggregates the results

★ Hadoop MapReduce is an implementation of this pattern

○ Items are distributed across the network (Distributed
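
The pattern described above can be sketched with plain Scala collections, without any cluster. This is only an illustration under simple assumptions: the dataset, the squaring transformation and the names `items`, `mapped` and `total` are hypothetical and do not come from Hadoop or Spark.

```scala
// A dataset split into items (here: one number per item).
val items: Seq[Int] = Seq(1, 2, 3, 4, 5)

// Map: apply the same transformation to each item, independently of the others.
val mapped: Seq[Int] = items.map(x => x * x)

// Reduce: aggregate the mapped results into a single value.
val total: Int = mapped.reduce(_ + _)

println(s"total = $total")
```

Because each item is transformed independently, the Map step could run on many machines at once; only the Reduce step needs to bring results together.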

Spark - Map reduce

Hadoop MapReduce
Word count example

Step 1: Split input files into fragments of 128MB
Step 2: Split fragments into lines
Step 3: Split each line into words (Map)
Step 4: Count words (Reduce)
  ➢ On each node
  ➢ Between nodes for final results
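
The word count steps can be simulated on a single machine with plain Scala collections. This is a rough sketch, not how Hadoop is implemented: each string in `fragments` stands in for one fragment held by one node, and the helper names `mapFragment` and `reduceCounts` are hypothetical.

```scala
// Step 1 (assumed already done): the input is split into fragments, one per node.
val fragments: Seq[String] = Seq(
  "to be or not\nto be",
  "be quick\nor not"
)

// Steps 2 and 3 (Map): split a fragment into lines, each line into words,
// and emit a (word, 1) pair for every word.
def mapFragment(fragment: String): Seq[(String, Int)] =
  fragment.split("\n").toSeq
    .flatMap(line => line.split(" ").toSeq)
    .filter(_.nonEmpty)
    .map(word => (word, 1))

// Step 4 (Reduce): sum the counts that share the same word.
def reduceCounts(pairs: Seq[(String, Int)]): Map[String, Int] =
  pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }

// Reduce on each node: partial counts per fragment.
val perNode: Seq[Map[String, Int]] =
  fragments.map(fragment => reduceCounts(mapFragment(fragment)))

// Reduce between nodes: merge the partial counts into the final result.
val wordCounts: Map[String, Int] = reduceCounts(perNode.flatMap(_.toSeq))

println(wordCounts)
```

Reducing on each node first means only small partial counts, rather than every single word, have to travel between nodes for the final aggregation.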

In real life
MR

Pros              Cons

Scalability       Batch processing only

Open-source       No cyclic dataflow

High-level API    Not easy to use

                  No in-memory processing


Spark and Hadoop

★ Hadoop MR and Spark address the same use cases

★ Spark’s low-level API is inspired by the MapReduce development pattern

★ Spark is integrated with Hadoop distributions (on-premise and cloud)

Why use Spark?

★ Spark’s API is much simpler than Hadoop MR

○ Simpler to write
○ Less verbose

★ The MapReduce model is relaxed

○ Map, Reduce and other operations can be chained without constraints
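
To illustrate the relaxed model, here is a minimal sketch that chains several map-like steps, a reduce, and then further steps after the reduce, using Spark's RDD API. It assumes Spark 2.4 is on the classpath and runs in local mode; the input sentences, the application name and the variable names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Local session for the sketch; in the Spark Shell an existing session is reused.
val spark = SparkSession.builder()
  .appName("chaining-sketch")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

val lines = sc.parallelize(Seq("to be or not to be", "be quick or not"))

val frequentWords = lines
  .flatMap(line => line.split(" "))     // one line -> many words
  .map(word => (word, 1))               // map
  .reduceByKey(_ + _)                   // reduce per word
  .filter { case (_, n) => n >= 2 }     // keep going after the reduce
  .map { case (word, n) => (n, word) }  // another map
  .sortByKey(ascending = false)         // most frequent first
  .collect()

frequentWords.foreach(println)

spark.stop() // skip this line inside the Spark Shell
```

In Hadoop MR, continuing after the reduce would typically require a second job; here it is one more method call in the chain.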

★ Spark comes with a REPL: the Spark Shell

○ Allows interactive processing of the data
○ Powerful for data exploration
○ Scala, Python and SQL

Spark vs MapReduce
Hadoop MR (Java)

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCount {
    // Map: emit (word, 1) for every token of the input line
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reduce: sum the counts emitted for each word
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}

Spark (Scala, in the Spark Shell where sc is predefined)

val textFile = sc.textFile("hdfs://...")
textFile.flatMap(line => line.split(" "))
        .map(word => (word, 1))
        .reduceByKey(_ + _)


Why use Spark?

★ Spark is faster than Hadoop MR

○ In-memory processing

★ Spark officially sets a new record in large scale sorting (Nov. 2014)
○ Sorting 100TB: about 3x faster, on about 10x fewer nodes
○ Hadoop MR - 72 Minutes (2100 nodes - 50400 cores)
○ Spark - 23 minutes (206 nodes - 6592 cores)
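
The ratios behind this comparison can be checked with a few lines of arithmetic. The sketch below only uses the figures quoted in the bullets above; the variable names are hypothetical, and "core-minutes" is simply run time multiplied by the number of cores, used here as a rough measure of resources consumed.

```scala
// Figures quoted above for the 100TB sort.
val hadoopMinutes = 72.0
val hadoopNodes = 2100
val hadoopCores = 50400

val sparkMinutes = 23.0
val sparkNodes = 206
val sparkCores = 6592

// Wall-clock speedup of Spark over Hadoop MR.
val speedup = hadoopMinutes / sparkMinutes

// How many times smaller the Spark cluster was.
val nodeRatio = hadoopNodes.toDouble / sparkNodes

// Resources consumed, measured in core-minutes.
val coreMinutesRatio = (hadoopMinutes * hadoopCores) / (sparkMinutes * sparkCores)

println(f"speedup: $speedup%.1f, node ratio: $nodeRatio%.1f, core-minutes ratio: $coreMinutesRatio%.1f")
```

The run is roughly 3 times faster on roughly 10 times fewer nodes, and uses roughly 24 times fewer core-minutes.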

Integration in the Big Data Ecosystem

Data Storage    Cluster Management    Message Broker    NoSQL        Notebook

HDFS            YARN                  Kafka             Cassandra    Jupyter

S3              Mesos                                   MongoDB      Zeppelin

GFS             Kubernetes                              DynamoDB     Spark Notebooks

Where is Spark in a data pipeline?

Spark modules

Version of the frameworks & tools

Apache Spark    Scala    Apache Maven

2.4             2.11     3.x

Important / Documentation

★ Spark documentation :


Questions & Comments

