Spark - ESGI 2020: IPPON 2018


Overview of the course .

★ Each course is divided into a 45-minute theory session and a 45-minute TD (lab)

● In these TDs you will apply what was covered in theory

Overview of the course .

★ Apache Spark in detail

★ Be able to set up a Big Data architecture with Spark

○ A good understanding of the concepts

★ Theory & Hands-on

Agenda .

Spark Presentation

Spark Core - Dataframes

Spark Core - RDD

Spark Core - Advanced Dataframes

Spark Core - Cluster and distribution

Spark Core - Optimizing Spark

Overview of the course .

★ Evaluation

○ 50% TD

○ 50% Project + presentation

What is Spark?

Apache Spark .
★ Open Source project

★ Processing large volumes of data

★ In-memory

★ Fault-Tolerant & zero data loss

★ Built-in ML, Streaming and Graph processing libraries

★ Written in Scala

★ APIs available in Scala, Java, Python and R

★ SQL support

History
2009
Creation of Spark at the AMPLab of UC Berkeley (Matei Zaharia)

2013
Foundation of Databricks

June 2013
Enters the Apache Incubator (becomes an Apache Top-Level Project in February 2014)

May 2014
Version 1.0.0

July 2016
Version 2.0.0

Today
Versions 2.4.4 and 3.0.0-preview

MapReduce .
★ Processing Pattern invented by Google (2004)

★ How does it work?


○ A dataset is split into items
○ The Map operation applies a transformation to each item
○ The Reduce operation aggregates the results

★ Hadoop MapReduce is an implementation of this pattern


○ Items are distributed across the network (distributed processing)
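The three operations above can be sketched with plain Scala collections - a minimal illustration of the pattern itself, not of the Hadoop API (the object and method names here are made up for the example):

```scala
// Minimal sketch of the MapReduce pattern on a word count,
// using only the Scala standard library (no Hadoop, no Spark).
object MapReduceSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] = {
    // The dataset is split into items (here, lines are split into words),
    // and the Map operation turns each item into a (word, 1) pair
    val mapped = lines.flatMap(_.split("\\s+")).map(word => (word, 1))
    // The Reduce operation aggregates the pairs by key, summing the counts
    mapped.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2).sum) }
  }

  def main(args: Array[String]): Unit =
    println(wordCount(Seq("to be or not", "to be")))
}
```

In Hadoop MapReduce the same logic is distributed: the mapped pairs are produced on many nodes and shuffled across the network before being reduced.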

Spark - Map reduce

Hadoop MapReduce .
Word count example

Step 1: Split input files into fragments of 128MB
Step 2: Split fragments into lines
Step 3: Split each line into words (Map)
Step 4: Count words (Reduce)
➢ On each node
➢ Between nodes for final results

Hadoop MR in real life .

Pros:
● Scalability
● Open-source
● High-level API

Cons:
● Batch processing only
● No cyclic dataflow
● Not easy to use
● No in-memory processing
● Slow

Spark and Hadoop .

Spark & Hadoop

★ Hadoop MR and Spark address the same use cases

★ Spark's low-level API is inspired by the MapReduce development pattern

★ Spark is integrated with Hadoop distributions (On-premise & Cloud)

Why use Spark?

★ Spark’s API is much simpler than Hadoop MR


○ Simpler to write
○ Less verbose

★ The MapReduce model is relaxed


○ Map, Reduce and other operations can be chained without constraints

★ Spark comes with a REPL - the Spark Shell


○ Allows interactive processing of the data
○ Powerful for data exploration
○ Scala, Python and SQL
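Because Scala collections expose the same operators as Spark RDDs (flatMap, map, filter, ...), the "relaxed" chaining can be sketched without a cluster - a hypothetical example, not the deck's own code; in spark-shell the same chain would start from sc.textFile(...) instead of a List:

```scala
// Transformations chained freely - no mandatory map-then-reduce order
// as Hadoop MR would impose. Runs on plain Scala collections.
object ChainingSketch {
  def chain(lines: List[String]): List[String] =
    lines
      .flatMap(_.split(" "))  // map-like step: explode lines into words
      .filter(_.length > 3)   // another transformation, no reduce needed yet
      .map(_.toUpperCase)     // a second map, chained without constraint
      .distinct
      .sorted

  def main(args: Array[String]): Unit =
    println(chain(List("spark makes chaining easy", "map filter map again")))
}
```

Typed interactively in the Spark Shell, each intermediate result can be inspected immediately, which is what makes the REPL powerful for data exploration.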

Spark vs MapReduce

Hadoop MR - Java

public class WordCount {
  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}

Spark - Scala

val textFile = sc.textFile("hdfs://...")
textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

Why use Spark?

★ Spark is faster than Hadoop MR

○ In-memory processing

★ Spark officially set a new record in large-scale sorting (Nov. 2014)
○ Sorting 100 TB: 3x faster with 10x fewer machines
○ Hadoop MR - 72 minutes (2,100 nodes - 50,400 cores)
○ Spark - 23 minutes (206 nodes - 6,592 cores)

Integration in the Big Data Ecosystem

Data Storage: HDFS, S3, GFS
Cluster Management: YARN, Mesos, Kubernetes
Message Broker: Kafka
NoSQL: Cassandra, MongoDB, DynamoDB
Notebook: Jupyter, Zeppelin, Spark Notebooks

Where is Spark in a data pipeline?

Spark modules

Version of the frameworks & tools .

Apache Spark: 2.4
Scala: 2.11
Apache Maven: 3.x
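A minimal Maven dependency matching these versions - a sketch assuming the standard spark-core coordinates (other modules such as spark-sql are declared the same way):

```xml
<!-- Spark core for Scala 2.11; the _2.11 suffix must match the Scala version -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.4.4</version>
</dependency>
```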

Important / Documentation

★ Spark documentation: http://spark.apache.org/docs/2.4.4/api/scala/index.html

★ https://mvnrepository.com/

Questions & Comments .
