Spark - ESGI 2020: IPPON 2018
IPPON 2019
2018
Overview of the course
Agenda
Spark Presentation
Overview of the course
★ Evaluation
★ 50% TD (tutorial exercises)
What is Spark?
Apache Spark
★ Open Source project
★ In-memory
★ Written in Scala
★ SQL support
History
2009: Creation of Spark at the AMPLab of UC Berkeley (Matei Zaharia)
2013: Foundation of Databricks
June 2013: Enters the Apache Incubator (Apache Top-Level Project since February 2014)
May 2014: Version 1.0.0
July 2016: Version 2.0.0
Today: Versions 2.4.4 and 3.0.0 Preview
MapReduce
★ Processing Pattern invented by Google (2004)
Spark - MapReduce
Hadoop MapReduce
Word count example
★ Split input files into fragments of 128MB
★ Split fragments into lines
★ Split each line into words (Map)
➢ On each node
➢ Between nodes for final results
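The word count steps above can be sketched in plain Java, without any Hadoop dependency. This is only an illustration of the pattern, not Hadoop's implementation: the class name `WordCountSketch`, its three methods, and the sample fragments are hypothetical, and reading the "on each node" and "between nodes" steps as a local sum followed by a merge of partial sums is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCountSketch {

    // Map step: split each line of a fragment into words and emit (word, 1) pairs.
    public static List<Map.Entry<String, Integer>> mapFragment(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(Map.entry(word, 1));
                }
            }
        }
        return pairs;
    }

    // On each node (assumed): sum the counts of the pairs produced locally.
    public static Map<String, Integer> reduceOnNode(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    // Between nodes (assumed): merge the per-node partial counts into the final result.
    public static Map<String, Integer> reduceBetweenNodes(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, count) -> total.merge(word, count, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        // Two hypothetical fragments standing in for the 128MB splits of the input files.
        List<String> fragment1 = List.of("spark is fast", "hadoop is reliable");
        List<String> fragment2 = List.of("spark is in memory");

        Map<String, Integer> node1 = reduceOnNode(mapFragment(fragment1));
        Map<String, Integer> node2 = reduceOnNode(mapFragment(fragment2));

        Map<String, Integer> result = reduceBetweenNodes(List.of(node1, node2));
        System.out.println(result.get("spark")); // prints 2
        System.out.println(result.get("is"));    // prints 3
    }
}
```

Summing locally before merging keeps the data exchanged between nodes small: each node sends one count per distinct word instead of one pair per occurrence.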
Hadoop MR in real life
Cons
★ No in-memory processing
★ Slow
Spark and Hadoop
Spark & Hadoop
★ Spark’s low-level API is inspired by the MapReduce development pattern
Why use Spark?
Spark vs MapReduce
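Hadoop MR Java

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;

    public class WordCount {
      // Mapper: emits (word, 1) for every token of every input line.
      public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
          }
        }
      }

      // Reducer: sums the counts received for each word.
      public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }

      // Job setup: the slide omits the start of main; the JobConf creation and
      // the mapper/reducer registration below are restored so the job is complete.
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
      }
    }

Spark Scala

    // sc is the SparkContext (predefined in spark-shell)
    val textFile = sc.textFile("hdfs://...")
    textFile.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)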
Why use Spark?
○ In-memory processing
★ Spark officially sets a new record in large scale sorting (Nov. 2014)
○ Sorting 100TB is about 3x faster (23 minutes vs 72 minutes), with about 10x fewer nodes
○ Hadoop MR - 72 minutes (2100 nodes - 50400 cores)
○ Spark - 23 minutes (206 nodes - 6592 cores)
Integration in the Big Data Ecosystem
★ Data Storage
★ NoSQL
★ Cluster Management
★ Message Broker
★ Notebook
Where is Spark in a data pipeline?
Spark modules
Versions of the frameworks & tools
Important / Documentation
★ https://mvnrepository.com/
Questions & Comments