
FIRST STEPS ON APACHE SPARK

BIG DATA II Jesús Maillo (jesusmh@decsai.ugr.es) 14/03/2019


Outline

 Apache Hadoop VS Apache Spark


 Weaknesses of Hadoop

 Apache Spark

 MLlib: Machine Learning on Spark

 ML: API on top of DataFrame

 SparkR & PySpark


Hadoop VS Spark

(Comparison figures: Hadoop MapReduce vs. Apache Spark, from the source below.)

https://www.dezyre.com/article/hadoop-mapreduce-vs-apache-spark-who-wins-the-battle/83

Weaknesses of Hadoop

 Relies on HDD disk I/O between stages

 Programs must be written in Java

 There is no interactive shell

 You cannot iterate efficiently over the data

 Nevertheless, it is widely used for its great advantages
Weaknesses of Hadoop

 Open Source Community

  1000+ meetup members

  70+ contributors from 20 companies

  In use at Intel, Yahoo!, Adobe, etc.


Weaknesses of Hadoop

 A wide variety of solutions:

Apache Spark

 Retains the attractive properties of MapReduce:

  Fault tolerance

  Data locality

  Scalability

 Solution: augment the data flow model with “resilient distributed datasets” (RDDs)
Apache Spark

 What is an RDD?

  An RDD is an immutable, partitioned, logical collection of records

  Built using transformations over other RDDs

  Can be cached for future reuse

  Partitioning can be based on a key in each record


Apache Spark

 Transformations (define a new RDD)

map            flatMap         mapPartitions

filter         repartition     sample

union          intersection    distinct

aggregateByKey                 reduceByKey

http://spark.apache.org/docs/latest/programming-guide.html#transformations
Apache Spark

 Actions (return a result to the driver program)

count          first           take       collect

reduce         takeSample      takeOrdered

saveAsTextFile                 saveAsSequenceFile

http://spark.apache.org/docs/latest/programming-guide.html#actions
Apache Spark

 Iterative Algorithms: intermediate results can stay in main memory across iterations, instead of being written back to HDFS after every MapReduce job
Apache Spark

 HDFS

  Hadoop Distributed File System

  Commodity hardware

  HDD disk
Apache Spark

 HDFS

  /home/antoniolopez/ (local file system) vs. /user/antoniolopez/ (HDFS)

  HDFS storage is different from the user's local storage
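A hedged sketch of moving data between the two storage spaces with the hdfs CLI; the file names (data.csv, results) are hypothetical, and the paths follow the example user above:

```shell
# List the user's HDFS home directory (not the local one).
hdfs dfs -ls /user/antoniolopez/

# Copy a file from local storage into HDFS.
hdfs dfs -put /home/antoniolopez/data.csv /user/antoniolopez/data.csv

# Copy results back from HDFS to local storage.
hdfs dfs -get /user/antoniolopez/results /home/antoniolopez/results
```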



MLlib: Machine Learning on Spark

 MLlib is a scalable machine learning library

  Easy to deploy

  Takes advantage of the Hadoop environment

  Contains many algorithms and utilities

https://spark.apache.org/docs/latest/mllib-guide.html
MLlib: Machine Learning on Spark

 Algorithms and utilities

  Classification: Naive Bayes, Decision Tree classifier, Random Forest

  Regression: Linear regression, Decision Tree regression

  Clustering: K-means

  Statistics: Summary statistics, Pearson’s test for independence
MLlib: Read & Write

 Reading from HDFS into main memory

 Writing intermediate or final results to HDFS


MLlib: Cache

 The “cache” operation forces Spark to distribute the data

  Allocates the data in main memory

  Important for efficient reuse of data

  Iterative algorithms benefit the most
MLlib: Web UI

 Web UI ( http://hadoop.ugr.es:8079/cluster )

  12 workers

  Running & completed applications


ML: API on top of DataFrame

 “Spark ML standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline, or workflow.”

https://spark.apache.org/docs/latest/ml-guide.html
ML: API on top of DataFrame

 “A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood”

http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes
ML: API on top of DataFrame

 SQL Queries

  SQL queries are written as plain strings

  Return the result as a new DataFrame

http://spark.apache.org/docs/latest/sql-programming-guide.html#running-sql-queries-programmatically
ML: API on top of DataFrame

 Interoperating with RDDs

  Transforming an RDD into a DataFrame

  Built like an RDD, but with column names

http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection

SparkR & PySpark

https://spark.apache.org/docs/latest/api/python/index.html

https://spark.apache.org/docs/latest/sparkr.html