Professional Documents
Culture Documents
Apache Spark
Apache Spark
Apache Spark
https://www.dezyre.com/article/hadoop-mapreduce-vs-apache-spark-who-wins-the-battle/83
Hadoop VS Spark
11
https://www.dezyre.com/article/hadoop-mapreduce-vs-apache-spark-who-wins-the-battle/83
Hadoop VS Spark
12
https://www.dezyre.com/article/hadoop-mapreduce-vs-apache-spark-who-wins-the-battle/83
Outline
13
Apache Spark
A wide variety
of Solutions:
Outline
17
Apache Spark
Data locality
Scalability
What is a RDD?
A RDD is an immutable, partitioned, logical
collection of records
Built using transformations
aggregateByKey reduceByKey
http://spark.apache.org/docs/latest/programming-guide.html#transformations
Apache Spark
21
saveAsTextFile saveAsSequenceFile
http://spark.apache.org/docs/latest/programming-guide.html#actions
Apache Spark
22
Iterative Algorithms
Apache Spark
23
HDFS
Hadoop Distributed
File System
Commodity Hardware
HDD disk
Apache Spark
24
HDFS
/home/antoniolopez/ /user/antoniolopez/
Apache Spark
https://spark.apache.org/docs/latest/mllib-guide.html
Mllib: Machine Learning on
27
Efficiently
Iterative algorithm
Mllib: Web UI
30
Web UI ( http://hadoop.ugr.es:8079/cluster )
Mllib: Web UI
31
12 Workers
Mllib: Web UI
32
Apache Spark
https://spark.apache.org/docs/latest/ml-guide.html
ML: API on top of DataFrame
35
http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes
ML: API on top of DataFrame
36
SQL Queries
SQL queries using string
Return the result as a new DataFrame
http://spark.apache.org/docs/latest/sql-programming-guide.html#running-sql-queries-programmatically
ML: API on top of DataFrame
37
http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
Outline
38
Apache Spark
https://spark.apache.org/docs/latest/api/python/index.html
https://spark.apache.org/docs/latest/sparkr.html
FIRST STEPS ON APACHE SPARK
BIG DATA II Jesús Maillo (jesusmh@decsai.ugr.es) 14/03/2019