Hadoop Training in Hyderabad@KellyTechnologies
Large-Scale Data Management
Hadoop/MapReduce Computing Paradigm vs. Database
www.kellytechno.com
Hadoop/MapReduce:
- Scalability (petabytes of data, thousands of machines)
- Flexibility in accepting all data formats (no schema)
- Efficient and simple fault-tolerant mechanism
- Commodity inexpensive hardware
Database:
- Performance (tons of indexing, tuning, and data organization techniques)
- Features:
  - Provenance tracking
  - Annotation management
  - ...
What is Hadoop
Hadoop is a software framework for distributed processing of large datasets across clusters of machines
Hadoop is based on a simple programming model called MapReduce
Hadoop is based on a simple data model: any data will fit
Hadoop Master/Slave Architecture
Hadoop is designed as a master-slave shared-nothing architecture
thousands of nodes
Commodity hardware
Large number of low-end cheap machines
Yahoo: Developing Hadoop, an open-source implementation of MapReduce
IBM, Microsoft, Oracle
Facebook, Amazon, AOL, Netflix
Many others + universities and research labs
Hadoop Architecture
Distributed file system (HDFS)
Execution engine (MapReduce)
Centralized namenode
- Maintains metadata info about files
File F
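To make the namenode's role concrete, here is a minimal sketch of a metadata store that records which blocks make up a file and which nodes hold each block, without storing any file data itself. The class name `NameNode`, the tiny `BLOCK_SIZE`, the `REPLICATION` value, the round-robin placement, and the node names `dn1` to `dn4` are all assumptions for illustration, not HDFS's actual implementation.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 64   # bytes, kept tiny for illustration (assumption)
REPLICATION = 3   # copies kept of each block (assumption)


@dataclass
class NameNode:
    """Toy metadata store: tracks block locations, holds no file contents."""
    datanodes: list
    file_to_blocks: dict = field(default_factory=dict)
    block_to_nodes: dict = field(default_factory=dict)

    def add_file(self, name, size):
        """Split a file of `size` bytes into blocks and record their replicas."""
        n_blocks = max(1, -(-size // BLOCK_SIZE))  # ceiling division
        blocks = []
        for i in range(n_blocks):
            block_id = f"{name}#blk{i}"
            # Round-robin placement, purely illustrative.
            nodes = [self.datanodes[(i + r) % len(self.datanodes)]
                     for r in range(REPLICATION)]
            self.block_to_nodes[block_id] = nodes
            blocks.append(block_id)
        self.file_to_blocks[name] = blocks

    def locate(self, name):
        """Return (block id, nodes holding it) for every block of a file."""
        return [(b, self.block_to_nodes[b]) for b in self.file_to_blocks[name]]


namenode = NameNode(datanodes=["dn1", "dn2", "dn3", "dn4"])
namenode.add_file("F", size=200)
for block_id, nodes in namenode.locate("F"):
    print(block_id, nodes)
```

A client would ask such a store where the blocks of File F live and then read the data from those nodes directly.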
[Figure: MapReduce data flow. Each Map task parses and hashes its input and produces (k, v) pairs such as (k, 1). Shuffle and sorting group the pairs based on k. Each Reduce task consumes (k, [v]), for example (k, [1,1,1,1,1,1..]), and produces (k, v), for example (k, 100).]
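The shuffle and sorting step can be sketched in a few lines: sort the (k, v) pairs on k, then collect the values of each key into a list for the reducer. The function name `shuffle_and_sort` and the color keys are hypothetical, chosen only for this example.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_and_sort(pairs):
    """Group (k, v) pairs by key: sort on k, then gather each key's values."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(k, [v for _, v in run])
            for k, run in groupby(ordered, key=itemgetter(0))]


# Merged output of several map tasks: one (k, 1) pair per occurrence.
map_output = [("red", 1), ("blue", 1), ("red", 1),
              ("green", 1), ("blue", 1), ("red", 1)]

grouped = shuffle_and_sort(map_output)         # (k, [v]) pairs for the reducers
totals = [(k, sum(vs)) for k, vs in grouped]   # each reducer sums its list
print(grouped)  # [('blue', [1, 1]), ('green', [1]), ('red', [1, 1, 1])]
print(totals)   # [('blue', 2), ('green', 1), ('red', 3)]
```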
Key-Value Pairs
Mappers and Reducers are user code (user-provided functions)
They just need to obey the key-value pair interface
Mappers:
- Consume <key, value> pairs
- Produce <key, value> pairs
Reducers:
- Consume <key, <list of values>> pairs
- Produce <key, value> pairs
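As an illustration of the key-value interface, the sketch below writes a word-count mapper and reducer as plain Python generator functions. The `run_job` driver is a hypothetical single-process stand-in for the framework, and the input records are made up for the example; a real Hadoop job would implement the Mapper and Reducer interfaces in Java or go through Hadoop Streaming.

```python
from collections import defaultdict


def mapper(key, value):
    """Consume one <key, value> pair (line offset, line text); produce <word, 1>."""
    for word in value.split():
        yield word.lower(), 1


def reducer(key, values):
    """Consume <key, [values]>; produce one <key, total> pair."""
    yield key, sum(values)


def run_job(records, mapper, reducer):
    """Hypothetical single-process driver standing in for the framework."""
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)        # collect values per key
    output = []
    for k in sorted(groups):           # reducers see keys in sorted order
        output.extend(reducer(k, groups[k]))
    return output


records = [(0, "Hadoop is a framework"), (22, "MapReduce is a model")]
result = run_job(records, mapper, reducer)
print(result)
```

The user writes only `mapper` and `reducer`; everything in `run_job` is work the framework would do on the user's behalf.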
MapReduce Phases
[Figure: Map Tasks and Reduce Tasks]
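[Figure: MapReduce data flow with output files. Map tasks parse and hash the input and produce (k, v) pairs such as (k, 1). Shuffle and sorting group the pairs based on k. Each Reduce task consumes (k, [v]), for example (k, [1,1,1,1,1,1..]), and produces (k, v), for example (k, 100), writing its results to Part0001, Part0002, or Part0003.]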
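One way to picture how hashed keys end up in separate output files is the sketch below: every key is hashed to one of the reduce tasks, so all values for a key reach the same reducer and the same part file. The functions `partition` and `route_to_reducers` are hypothetical, `zlib.crc32` is only a stand-in for whatever hash the framework uses, and the color keys are invented for the example.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick the reduce task for a key; crc32 is a stand-in hash (assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def route_to_reducers(pairs, num_reducers):
    """Return {part file name: {key: [values]}} for the given map output."""
    parts = defaultdict(lambda: defaultdict(list))
    for k, v in pairs:
        name = f"Part{partition(k, num_reducers) + 1:04d}"
        parts[name][k].append(v)
    return parts


pairs = [("red", 1), ("blue", 1), ("red", 1), ("green", 1)]
parts = route_to_reducers(pairs, num_reducers=3)
for name in sorted(parts):
    print(name, dict(parts[name]))
```

Which key lands in which part file depends on the hash, so a given run may leave some part files empty; the property that matters is that a key never appears in more than one.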
[Figure: Map tasks write their output to HDFS as Part0001, Part0002, Part0003, and Part0004.]
Hadoop vs. Database
Computing Model
- Database: Notion of transactions. Transaction is the unit of work. ACID properties, concurrency control.
- Hadoop: Notion of jobs. Job is the unit of work. No concurrency control.
Data Model
Cost Model
- Database: Expensive servers
Fault Tolerance
Key Characteristics
- Database: Efficiency, optimizations, fine-tuning
Cloud Computing
A computing model where any computing
Thank You
Presented By