Parallel Processing Platforms: Engineering Big Data
Machine Learning on Hadoop: Spark-ML, Mahout, Samsara, H2O, Flink, R-Hadoop
Streaming & Near Real Time Processing: Kafka, Samza, Storm, Trident, Spark-Streaming, Flink
Application Programming: Pig, Oozie, Hadoop Streaming, R-Hadoop, Spark-R
Data Organization (SQL / NoSQL): Hive, Impala, SQL on Spark, Apache Drill; HBase, Cassandra, MongoDB, Neo4J, Kudu
Parallel Computing: Map-Reduce, MR2, Spark, Hama, Flink
Resource Management (OS): YARN
Storage (Persistence): HDFS
Ingestion: Sqoop, Flume, Chukwa
Serialization & Coordination (across all layers): Avro, ZooKeeper
Security, QoS, Privacy, Audit, Governance (across all layers): Knox, Ranger, Sentry, Atlas, Kerberos
YARN REFRESHER
The best place for students to learn Applied Engineering 3 http://www.insofe.edu.in
YARN makes Hadoop multi-tenant
http://blog.cloudera.com/blog/2012/02/mapreduce-2-0-in-hadoop-0-23/
Role of a Resource Manager
YARN Supports Reservation
The ReservationSystem is a component of YARN that allows users to specify a profile of resources over time and temporal constraints (e.g., deadlines).
The ReservationSystem tracks resources over time, performs admission control for reservations, and dynamically instructs the underlying scheduler to ensure that the reservation is fulfilled.
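To make "admission control for reservations" concrete, here is a toy sketch in Python. It is not YARN's API: `Reservation`, `ToyReservationPlan`, and `try_admit` are hypothetical names, time is assumed to be split into equal slots, and resources are assumed to be a single count of containers. A reservation is admitted only if some start slot lets it finish before its deadline without exceeding capacity.

```python
from dataclasses import dataclass


@dataclass
class Reservation:
    name: str
    containers: int  # containers needed in every slot the job runs
    duration: int    # number of consecutive slots needed
    arrival: int     # earliest slot the job may start
    deadline: int    # slot by which the job must be finished (exclusive)


class ToyReservationPlan:
    """Tracks free capacity per time slot (hypothetical, not a YARN class)."""

    def __init__(self, capacity: int, horizon: int):
        self.free = [capacity] * horizon

    def try_admit(self, r: Reservation):
        """Return the chosen start slot, or None if the reservation is rejected."""
        latest_start = min(r.deadline, len(self.free)) - r.duration
        for start in range(r.arrival, latest_start + 1):
            window = range(start, start + r.duration)
            if all(self.free[t] >= r.containers for t in window):
                for t in window:  # book the resources over time
                    self.free[t] -= r.containers
                return start
        return None


plan = ToyReservationPlan(capacity=10, horizon=8)
a = plan.try_admit(Reservation("etl", containers=6, duration=3, arrival=0, deadline=4))
b = plan.try_admit(Reservation("report", containers=6, duration=2, arrival=0, deadline=6))
c = plan.try_admit(Reservation("adhoc", containers=8, duration=4, arrival=0, deadline=5))
print(a, b, c)  # 0 3 None
```

The third request is rejected because no window before its deadline has 8 free containers; the second is shifted to slot 3 rather than rejected, which is the kind of placement freedom a deadline gives the system.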
Parallelization Platform
Existing software tools or applications… run in parallel and more efficiently on Hadoop.
In-built Parallelism
Built on Hadoop
Tasks: T1, T2, T3
Partial Results: r1, r2, r3
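The tasks-to-partial-results picture can be sketched on a single machine. This is only an analogy, assuming a thread pool stands in for cluster nodes; the `task` function and the sum-of-squares workload are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def task(chunk):
    # Each task (T1, T2, T3) works only on its own slice of the data.
    return sum(x * x for x in chunk)


data = list(range(1, 10))
chunks = [data[0:3], data[3:6], data[6:9]]  # one chunk per task

with ThreadPoolExecutor(max_workers=3) as pool:
    # pool.map keeps results in chunk order: r1, r2, r3
    partial_results = list(pool.map(task, chunks))

total = sum(partial_results)  # combine the partial results
print(partial_results, total)  # [14, 77, 194] 285
```

The final combine step is cheap because each partial result is already a small summary of its chunk.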
Matei Zaharia
Kostas Tzoumas, Stephan Ewen
Table API
Many Hadoop eco-system components employ these frameworks. Even more employ the ideas.
Platforms, Frameworks, Abstractions

BSP
Abstraction: BSP, "Bulk Synchronous Parallel processing" (1990, Les Valiant)
Framework: "Pregel: A System for Large Scale Graph Processing" (2010)
Platform: Apache Hama (2012)

Map Reduce
Abstraction: Map Reduce, "Simplified Data Processing on Large Clusters" (2004: Jeffrey Dean, Sanjay Ghemawat)
Framework: general-purpose implementation on HDFS
Platform: 2005 for MR, 2010 for MR2

RDDs
Abstraction: RDDs (Resilient Distributed Datasets), "A Fault-Tolerant Abstraction for In-Memory Cluster Computing" (2012: Zaharia, Chowdhury, Das, etc.)
Framework: general-purpose implementation on HDFS
Platform: 2014

Streaming Data Flows
Platform: 2014
1:many
Mini Reducer
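A small word-count sketch of these two labels, under two assumptions: that "1:many" refers to a map call emitting many key/value pairs for one input record, and that "Mini Reducer" refers to a combiner that pre-aggregates a mapper's output before the shuffle. The function names are illustrative, not Hadoop's API.

```python
from collections import Counter, defaultdict


def map_fn(line):
    # 1:many: one input record yields one (word, 1) pair per word.
    return [(word, 1) for word in line.split()]


def combine(pairs):
    # "Mini reducer": aggregate one mapper's output locally.
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def reduce_fn(word, counts):
    return word, sum(counts)


splits = [["a b a", "b a"], ["c a", "c c"]]  # two input splits, one per mapper

shuffled = defaultdict(list)
pairs_mapped = 0
pairs_sent = 0
for split in splits:
    mapped = [pair for line in split for pair in map_fn(line)]
    combined = combine(mapped)
    pairs_mapped += len(mapped)
    pairs_sent += len(combined)
    for word, count in combined:  # simulated shuffle: group values by key
        shuffled[word].append(count)

result = dict(reduce_fn(w, c) for w, c in shuffled.items())
print(result, pairs_mapped, pairs_sent)  # {'a': 4, 'b': 2, 'c': 3} 9 4
```

Here the combiner cuts the pairs crossing the simulated shuffle from 9 to 4 without changing the final counts, which works because addition is associative and commutative.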
• Map Reduce
– MR1: Job Tracker, Task Tracker
– Distributed Cache
– MR Design Patterns
• BSP
– Vote to Halt
– Think like a vertex
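"Think like a vertex" and "vote to halt" can be illustrated with a single-process sketch that propagates the maximum value through a graph, in the spirit of Pregel-style BSP. It is a simplification, not the Hama or Pregel API: `run_supersteps` is a hypothetical name, and a vertex is treated as having voted to halt whenever it has no incoming messages.

```python
def run_supersteps(graph, values):
    """graph: vertex -> list of neighbours; values: vertex -> int."""
    values = dict(values)
    active = set(graph)                # every vertex starts active
    inbox = {v: [] for v in graph}
    superstep = 0
    while active:
        outbox = {v: [] for v in graph}
        for v in active:
            # Vertex program: sees only its own value and its messages.
            new_value = max([values[v]] + inbox[v])
            changed = new_value > values[v]
            values[v] = new_value
            if superstep == 0 or changed:
                for n in graph[v]:
                    outbox[n].append(new_value)
            # Otherwise it sends nothing, which here means voting to halt.
        # Barrier: messages become visible only in the next superstep.
        inbox = outbox
        # A halted vertex is reactivated only by an incoming message.
        active = {v for v in graph if inbox[v]}
        superstep += 1
    return values, superstep


graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final, steps = run_supersteps(graph, {"a": 3, "b": 6, "c": 2, "d": 1})
print(final, steps)  # {'a': 6, 'b': 6, 'c': 6, 'd': 6} 4
```

The computation ends when every vertex has halted and no messages are in flight, so no global coordinator needs to know the answer has converged.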