1.4 Map Reduce
http://www.google.org/flutrends/ca/ (2012)
Average searches per day: 5,134,000,000
Motivation
• Process lots of data
• Google processed about 24 petabytes of data per day in 2009.
Typical problem solved by MapReduce
• Read a lot of data
• Map: extract something of interest from each record
• Shuffle and sort the intermediate <key, value> pairs
• Reduce: aggregate, summarize, or filter the values for each key
• Write the results
MapReduce workflow
Example: Word Count
http://kickstarthadoop.blogspot.ca/2011/04/word-count-hadoop-map-reduce-example.html
Mapper
• Reads in an input pair <Key, Value>
• Outputs an intermediate pair <Key', Value'>
– For word count, the mapper emits <word, 1> for every word it reads, as in the sketch below
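The mapper above can be made concrete with a small word-count mapper. This is a minimal sketch against the Hadoop 2.x Java API (org.apache.hadoop.mapreduce); the class name WordCountMapper and the other identifiers are illustrative, not taken from the slides.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count mapper (illustrative sketch): the input key is the byte offset
// of a line, the input value is the line itself; the mapper emits <word, 1>
// for every token on the line.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);   // emit <word, 1>
    }
  }
}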
Reducer
• Accepts the Mapper output and aggregates the values for each key
– For our example, the reducer input would be:
<The, 1> <teacher, 1> <went, 1> <to, 1> <the, 1> <store, 1>
<the, 1> <store, 1> <was, 1> <closed, 1> <the, 1> <store, 1>
<opens, 1> <in, 1> <the, 1> <morning, 1> <the, 1> <store, 1>
<opens, 1> <at, 1> <9am, 1>
– The output would be:
<The, 6> <teacher, 1> <went, 1> <to, 1> <store, 4> <was, 1>
<closed, 1> <opens, 2> <in, 1> <morning, 1> <at, 1> <9am, 1>
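A matching reducer sketch, again against the Hadoop 2.x Java API; the class name WordCountReducer is illustrative. It receives each word together with the list of 1s emitted by the mappers and writes the summed count.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Word-count reducer (illustrative sketch): receives <word, [1, 1, ...]>
// after the shuffle and sort, and emits <word, total>.
public class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private final IntWritable total = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) {
      sum += c.get();            // add up the 1s for this word
    }
    total.set(sum);
    context.write(key, total);   // e.g. <store, 4>
  }
}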
MapReduce
[Diagram: MapReduce execution overview – the user's Hadoop program forks a master and worker processes. The master assigns map tasks and reduce tasks. Map workers read the input splits (Split 0, Split 1, Split 2) and write intermediate results to local disk; reduce workers fetch that data with remote reads, moving peta-scale data through the network, sort it, and write Output File 0 and Output File 1.]
Google File System (GFS)
Hadoop Distributed File System (HDFS)
• Split data into blocks and store three replicas of each block on commodity servers
HDFS MapReduce
[Diagram: the master asks the HDFS NameNode for the locations of the chunks of input data, then assigns map tasks to the workers that hold those splits (Split 0, Split 1, Split 2), so each mapper reads from local disk. Reduce workers perform remote reads, sort the intermediate data, and write Output File 0 and Output File 1.]
Locality Optimization
• Master scheduling policy:
– Asks GFS for the locations of the replicas of the input file blocks
– Map tasks are scheduled so that a replica of their GFS input block is on the same machine or the same rack
Failure in MapReduce
• Failures are the norm on commodity hardware
• Worker failure
– Detect failure via periodic heartbeats
– Re-execute in-progress map/reduce tasks
• Master failure
– Single point of failure; resume from the execution log
• Robust
– Google's experience: once lost 1,600 of 1,800 machines, yet the job still finished fine
Fault tolerance:
Handled via re-execution
• On worker failure:
– Detect failure via periodic heartbeats
– Re-execute completed and in-progress map tasks
– Task completion committed through master
Refinement:
Redundant Execution
• Slow workers significantly lengthen completion time
– Other jobs consuming resources on the machine
– Bad disks with soft errors transfer data very slowly
– Weird things: processor caches disabled (!!)
• Solution: near the end of the phase, schedule redundant (backup) executions of the remaining in-progress tasks; whichever copy finishes first wins
Refinement:
Skipping Bad Records
• Map/Reduce functions sometimes fail deterministically for particular inputs
– The best fix is to debug and repair the code, but that is not always possible
– If the master sees repeated failures on the same record, subsequent re-executions are told to skip that record
A MapReduce Job
[Diagram: a MapReduce job with one Mapper stage feeding two Reducers.]
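One way to wire a mapper and reducers into a single job is a small driver class. This is a hedged sketch assuming the WordCountMapper and WordCountReducer classes sketched earlier and the standard Hadoop 2.x Job API; two reduce tasks are requested to mirror the two reducers in the figure.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative word-count driver: configures the job, sets the mapper and
// reducer classes, and submits the job to the cluster.
public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);
    job.setCombinerClass(WordCountReducer.class); // optional local pre-aggregation
    job.setReducerClass(WordCountReducer.class);
    job.setNumReduceTasks(2);                     // two reducers, as in the figure
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The packaged jar could then be run the same way as the bundled examples, e.g. yarn jar wordcount.jar WordCountDriver <input> <output> (the jar and class names here are assumptions).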
Running Program in MR
• With the Apache Hadoop sources installed under /opt, the examples are in the following directory:
/opt/hadoop-2.6.0/share/hadoop/mapreduce/
Running Program in MR
• A list of the available examples can be found
by running the following command.
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar
Running Program in MR
• The pi example calculates the digits of π using a quasi-Monte Carlo method; here it is run with 16 map tasks and 1,000,000 samples per map:
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar pi 16 1000000
Running the Terasort Test
• The terasort benchmark sorts a specified amount of randomly
generated data.
Running the Terasort Test
• Run teragen to generate rows of random data to sort (500,000,000 rows of 100 bytes each, roughly 50 GB):
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar teragen 500000000 /user/hdfs/TeraGen-50GB
• Run terasort to sort the data:
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar terasort /user/hdfs/TeraGen-50GB /user/hdfs/TeraSort-50GB
• Run teravalidate to validate the sort:
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar teravalidate /user/hdfs/TeraSort-50GB /user/hdfs/TeraValid-50GB
Running the Terasort Test
• The following command instructs terasort to use four reducer tasks:
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar terasort -Dmapred.reduce.tasks=4 /user/hdfs/TeraGen-50GB /user/hdfs/TeraSort-50GB
Additional information and background on each of the examples and benchmarks
• Pi Benchmark
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/examples/pi/package-summary.html
• Terasort Benchmark
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/examples/terasort/package-summary.html
• Benchmarking and Stress Testing an Hadoop Cluster
http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stresstesting-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench (uses Hadoop V1, will work with V2)