Spark - ESGI 2020: IPPON 2018


Overview of the course .

★ Each course is divided into a 45-minute theory session and a 45-minute TD (lab)

● In these TDs you will apply what was covered in theory

Overview of the course .

★ Apache Spark in detail

★ Be able to set up a Big Data architecture with Spark

○ A good understanding of the concepts

★ Theory & Hands-on

Agenda .

Spark Presentation

Spark Core - Dataframes

Spark Core - RDD

Spark Core - Advanced Dataframes

Spark Core - Cluster and distribution

Spark Core - Optimizing Spark

Overview of the course .

★ Evaluation

○ 50% TD

○ 50% Project + presentation

What is Spark?

Apache Spark .
★ Open Source project

★ Processing large volumes of data

★ In-memory

★ Fault-Tolerant & zero data loss

★ Built-in ML, Streaming and Graph processing libraries

★ Written in Scala

★ APIs available in Scala, Java, Python and R

★ SQL support

History
2009
Creation of Spark at the AMPLab of UC Berkeley (Matei Zaharia)

2013
Foundation of Databricks

June 2013
Enters the Apache Incubator (becomes an Apache Top-Level Project in February 2014)

May 2014
Version 1.0.0

July 2016
Version 2.0.0

Today
Versions 2.4.4 and 3.0.0-preview

MapReduce .
★ Processing Pattern invented by Google (2004)

★ How does it work?


○ A dataset is split into items
○ The Map operation applies a transformation to each item
○ The Reduce operation aggregates the results

★ Hadoop MapReduce is an implementation of this pattern


○ Items are distributed across the network (distributed processing)
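The three operations above can be sketched with plain Scala collections - a minimal illustration of the pattern itself, not of the Hadoop API (the object and method names here are made up for the example):

```scala
// Minimal sketch of the MapReduce pattern on a word count,
// using only the Scala standard library (no Hadoop, no Spark).
object MapReduceSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] = {
    // The dataset is split into items (here, lines are split into words),
    // and the Map operation turns each item into a (word, 1) pair
    val mapped = lines.flatMap(_.split("\\s+")).map(word => (word, 1))
    // The Reduce operation aggregates the pairs by key, summing the counts
    mapped.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2).sum) }
  }

  def main(args: Array[String]): Unit =
    println(wordCount(Seq("to be or not", "to be")))
}
```

In Hadoop MapReduce the same logic is distributed: the mapped pairs are produced on many nodes and shuffled across the network before being reduced.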

Spark - Map reduce

Hadoop MapReduce .
Word count example

Step 1: Split input files into fragments of 128MB
Step 2: Split fragments into lines
Step 3: Split each line into words (Map)
Step 4: Count words (Reduce)
➢ On each node
➢ Between nodes for final results

Hadoop MR in real life .

Pros:
● Scalability
● Open-source
● High-level API

Cons:
● Batch processing only
● No cyclic dataflow
● Not easy to use
● No in-memory processing
● Slow

Spark and Hadoop .

Spark & Hadoop

★ Hadoop MR and Spark address the same use cases

★ Spark's low-level API is inspired by the MapReduce development pattern

★ Spark is integrated with Hadoop distributions (On-premise & Cloud)

Why use Spark?

★ Spark’s API is much simpler than Hadoop MR


○ Simpler to write
○ Less verbose

★ The MapReduce model is relaxed


○ Map, Reduce and other operations can be chained without constraints

★ Spark comes with a REPL - the Spark Shell


○ Allows interactive processing of the data
○ Powerful for data exploration
○ Scala, Python and SQL
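Because Scala collections expose the same operators as Spark RDDs (flatMap, map, filter, ...), the "relaxed" chaining can be sketched without a cluster - a hypothetical example, not the deck's own code; in spark-shell the same chain would start from sc.textFile(...) instead of a List:

```scala
// Transformations chained freely - no mandatory map-then-reduce order
// as Hadoop MR would impose. Runs on plain Scala collections.
object ChainingSketch {
  def chain(lines: List[String]): List[String] =
    lines
      .flatMap(_.split(" "))  // map-like step: explode lines into words
      .filter(_.length > 3)   // another transformation, no reduce needed yet
      .map(_.toUpperCase)     // a second map, chained without constraint
      .distinct
      .sorted

  def main(args: Array[String]): Unit =
    println(chain(List("spark makes chaining easy", "map filter map again")))
}
```

Typed interactively in the Spark Shell, each intermediate result can be inspected immediately, which is what makes the REPL powerful for data exploration.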

Spark vs MapReduce

Hadoop MR - Java

public class WordCount {
  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}

Spark - Scala

val textFile = sc.textFile("hdfs://...")
textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

Why use Spark?

★ Spark is faster than Hadoop MR

○ In-memory processing

★ Spark officially set a new record in large-scale sorting (Nov. 2014)
○ Sorting 100 TB: 3x faster with 10x fewer machines
○ Hadoop MR - 72 minutes (2,100 nodes - 50,400 cores)
○ Spark - 23 minutes (206 nodes - 6,592 cores)

Integration in the Big Data Ecosystem

Data Storage: HDFS, S3, GFS
Cluster Management: YARN, Mesos, Kubernetes
Message Broker: Kafka
NoSQL: Cassandra, MongoDB, DynamoDB
Notebook: Jupyter, Zeppelin, Spark Notebooks

Where is Spark in a data pipeline?

Spark modules

Version of the frameworks & tools .

Apache Spark: 2.4
Scala: 2.11
Apache Maven: 3.x
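A minimal Maven dependency matching these versions - a sketch assuming the standard spark-core coordinates (other modules such as spark-sql are declared the same way):

```xml
<!-- Spark core for Scala 2.11; the _2.11 suffix must match the Scala version -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.4.4</version>
</dependency>
```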

Important / Documentation

★ Spark documentation: http://spark.apache.org/docs/2.4.4/api/scala/index.html

★ https://mvnrepository.com/

Questions & Comments .
