Revise all the config once and make sure it's all okay.
Compute nodes and storage nodes are the same, that is, the
MapReduce framework and the Hadoop Distributed File System
are running on the same set of nodes.
MapReduce – Framework – Managers
Map step
The master node takes the large problem input and slices it into smaller
sub-problems, then distributes these to worker nodes.
A worker node may do this again, leading to a multi-level tree structure.
Each worker processes its smaller problem and hands the answer back to the master.
Reduce step
The master node takes the answers to the sub-problems and combines them
in a predefined way to get the output/answer to the original problem.
MapReduce – Inputs and Outputs
def mapper(line):
    for word in line.split(): output(word, 1)
def reducer(key, values):
    output(key, sum(values))
Output
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
MapReduce – JobHistoryServer
Config
Add the properties below to mapred-site.xml (not essential to add them to the config file):
Property – mapreduce.jobhistory.address
Value – <hostname>:10020
Property – mapreduce.jobhistory.webapp.address
Value – <hostname>:19888
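For reference, a sketch of how these two properties look inside mapred-site.xml (replace the <hostname> placeholder with the actual history-server host):
<configuration>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value><hostname>:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value><hostname>:19888</value>
  </property>
</configuration>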
Start the history server (execute the command below on the node where the history
server should run as a daemon):
$ mapred historyserver
Steps 1 & 2 can also be done in Eclipse itself to create the final jar.
Hadoop Streaming runs a regular MapReduce Job with a Mapper and a Reducer,
but instead of actually doing anything, it pipes data to scripts. By doing so, it
provides an API for other languages:
$ hadoop jar \
  $HADOOP_INSTALL/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
  -Dmapreduce.job.reduces=1 \
  -input /user/lab/wc/file02 -output /user/lab/wc/streamout \
  -mapper cat -reducer "wc -l"
$ echo "foo foo quux labs foo bar quux" | /home/lab/mapper.py | sort -k1,1 | /home/lab/reducer.py
Check Output
● Maps are the individual tasks that transform input records into intermediate records.
● A given input pair may map to zero or many output pairs.
● The framework spawns one map task for each InputSplit generated by the InputFormat for the job.
● Applications can use the Counter to report their statistics.
● All intermediate values associated with a given output key are subsequently grouped by the
framework. Users can control the grouping by specifying a Comparator.
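To make these points concrete, here is a minimal word-count Mapper sketch in the org.apache.hadoop.mapreduce API; the class name TokenMapper and the counter group/name are illustrative, not from the slides:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer it = new StringTokenizer(value.toString());
    while (it.hasMoreTokens()) {
      word.set(it.nextToken());
      context.write(word, ONE);  // an input record may produce zero or many output pairs
      context.getCounter("WordCount", "TOKENS").increment(1);  // application-defined Counter
    }
  }
}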
MapReduce – User Interfaces – Mapper – InputSplit
● InputSplit represents the data to be processed by an individual Mapper.
● Typically, InputSplit presents a byte-oriented view of the input, and it is the
responsibility of RecordReader to process it and present a record-oriented view.
● FileSplit is the default InputSplit.
● Typically, the RecordReader converts the byte-oriented view of the input, provided by
the InputSplit, and presents a record-oriented view to the Mapper implementations for processing.
● RecordReader assumes the responsibility of processing record boundaries and presents
the tasks with keys and values.
● Users can control which keys (and hence records) go to which Reducer by implementing
a custom Partitioner.
● Users can optionally specify a combiner to perform local aggregation of the intermediate
outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer.
● The intermediate, sorted outputs (data from mapper to reducer) are always stored in a
simple (key-len, key, value-len, value) format.
● The number of maps is usually driven by the total size of the inputs, that is, the total
number of blocks of the input files.
● If you expect 10 GB of input data and have a block size of 128 MB, you'll end up with 80 maps.
● Partitioner controls the partitioning of the keys of the intermediate map-outputs.
● The key (or a subset of the key) is used to derive the partition, typically by a hash
function. The total number of partitions is the same as the number of reduce tasks for
the job. Hence this controls which of the m reduce tasks the intermediate key (and hence
the record) is sent to for reduction.
● HashPartitioner is the default Partitioner.
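As an illustration of a custom Partitioner (the class name and the tab-separated key layout are assumptions, not from the slides):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes records whose key starts with the same first field (e.g. a year) to the same reducer.
public class FirstFieldPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String firstField = key.toString().split("\t")[0];
    // Same formula HashPartitioner uses, but applied to the first field only.
    return (firstField.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

It would be registered with job.setPartitionerClass(FirstFieldPartitioner.class); without it, HashPartitioner is used.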
● Shuffle – the relevant partition of the sorted output of the mappers is fetched to the
corresponding reducer via HTTP.
● Sort – the framework groups Reducer inputs by keys. The shuffle and sort phases occur
simultaneously; while map-outputs are being fetched they are merged.
● Secondary Sort – if the equivalence rules for grouping the intermediate keys are required
to be different from those for grouping keys before reduction, then one may specify a Comparator.
● Reduce – the reduce method is called for each <key, (list of values)> pair in the grouped inputs.
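A matching sum Reducer sketch (the class name SumReducer is illustrative); the framework calls reduce() once per grouped key with all of that key's values:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable result = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {   // all values grouped under this key
      sum += v.get();
    }
    result.set(sum);
    context.write(key, result);      // one output record per key
  }
}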
MapReduce – Reducer Stages
--> Shuffle – moving map outputs to the reducers; it begins once at least one map task is
complete, i.e. (k1, list(v)) goes to Reducer 1, (k2, list(v)) goes to Reducer 2, etc.
● The right number of reduces seems to be 0.95 or 1.75 multiplied by
(<no. of nodes> * <no. of maximum containers per node>) (a worked example follows this list).
● Increasing the number of reduces increases the framework overhead, but improves load
balancing and lowers the cost of failures.
● Reducer NONE – it is legal to set the number of reduce-tasks to zero if no reduction is
desired. In this case the outputs of the map-tasks go directly to the FileSystem, into the
output path set. The framework does not sort the map-outputs before writing them out to
the FileSystem.
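As a worked example with hypothetical numbers: on a cluster of 10 nodes with 8 maximum containers per node, 0.95 × (10 × 8) ≈ 76 reduces lets all reduces launch immediately, while 1.75 × (10 × 8) = 140 reduces lets faster nodes run a second wave, improving load balancing.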
● Job sets the overall MapReduce job configuration.
● Job is specified client-side.
● Primary interface for a user to describe a MapReduce job to the Hadoop framework for execution.
Used to specify:
● Mapper
● Combiner (if any)
● Partitioner (to partition key space)
● Reducer
● InputFormat
● OutputFormat
● Many user options; high customizability.
● Jobs can be monitored by users.
● Users can chain MapReduce jobs together to accomplish complex tasks which cannot be
done with a single MapReduce job (see the driver sketch below):
▫ make use of Job.waitForCompletion()
▫ and Job.submit()
● To check the list of currently running jobs:
$ mapred job -list
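Tying the pieces together, a driver sketch using the illustrative classes from the earlier examples (the paths, job name, and the optional partitioner line are assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(TokenMapper.class);                  // Mapper
    job.setCombinerClass(SumReducer.class);                 // Combiner (if any)
    job.setPartitionerClass(FirstFieldPartitioner.class);   // Partitioner (optional; HashPartitioner is the default)
    job.setReducerClass(SumReducer.class);                  // Reducer
    job.setInputFormatClass(TextInputFormat.class);         // InputFormat
    job.setOutputFormatClass(TextOutputFormat.class);       // OutputFormat
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // waitForCompletion() blocks until the job finishes; Job.submit() returns immediately,
    // which is what makes chaining several jobs together possible.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}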
● RecordWriter implementations write the job outputs to the FileSystem.
● Users submit jobs to Queues.
● Queues, as a collection of jobs, allow the system to provide specific functionality.
● Hadoop comes configured with a single mandatory queue, called 'default'.
● Some job schedulers, such as the Capacity Scheduler, support multiple queues.
● Command line:
$ mapred queue -list
● TaskTracker executes each Mapper/Reducer task as a child process in a separate JVM.
● The child task inherits the environment of the parent TaskTracker.
● Users can specify environment variables controlling memory, parallel computation
settings, segment size, and more.
● To check the full configured classpath:
$ mapred classpath
● The ability to sort data is at the heart of MapReduce. Even if your application isn't
concerned with sorting per se, it may be able to use the sorting stage that MapReduce
provides to organize its data.
● Partial Sort (default) – MR will sort input records by their keys.
● Partitioned MapFile lookups – change the output format to be a MapFileOutputFormat.
● Total Sort – it is possible to produce a set of sorted files that, if concatenated, would
form a globally sorted file. The secret to doing this is to use a partitioner that respects
the total order of the output.
● Secondary Sort – the MapReduce framework sorts the records by key before they reach
the reducers. For any particular key, however, the values are not sorted. The order in
which the values appear is not even stable from one run to the next, since they come from
different map tasks, which may finish at different times from run to run.
● However, it is possible to impose an order on the values by sorting and grouping the
keys in a particular way.
● We set the partitioner to partition by the first field of the key (the year), using a
custom partitioner, as sketched below.
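A hedged sketch of the job settings typically involved in a secondary sort; CompositeKeyComparator and NaturalKeyGroupingComparator are hypothetical helper classes, not given in the slides:

// Partition by the natural key (e.g. the year) so all records for a year reach the same reducer.
job.setPartitionerClass(FirstFieldPartitioner.class);
// Sort composite keys by natural key first, then by the value we want ordered (e.g. temperature).
job.setSortComparatorClass(CompositeKeyComparator.class);        // hypothetical helper class
// Group reducer input by the natural key only, so one reduce() call sees all values for that key.
job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);  // hypothetical helper class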
● Rather than writing MapReduce programs, you might consider using a higher-level
framework such as Pig, Hive, or Cascading, in which join operations are a core part of
the implementation.
● Side Data Distribution – applicable if one dataset is large (the weather records) but
the other one is small enough to be distributed to each node in the cluster.
● If both datasets are too large for either to be copied to each node in the cluster, then
we can join them using MapReduce, using either a map-side join or a reduce-side join.
● A map-side join works by performing the join before the data reaches the map function.
● A reduce-side join is more general than a map-side join, in that the input datasets
don't have to be structured in any particular way, but it is less efficient as both
datasets have to go through the MapReduce shuffle.
● Multiple inputs – the input sources for the datasets generally have different formats,
so it is very convenient to use the MultipleInputs class to separate the logic for parsing
and tagging each source (see the sketch after this list).
● Secondary sort – the reducer will see the records from both sources that have the same
key, but they are not guaranteed to be in any particular order. However, to perform the
join, it is important to have the data from one source before the other.
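A minimal sketch of wiring two differently formatted inputs with MultipleInputs; the paths and the mapper classes NcdcMapper and StationMapper are illustrative assumptions:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Each input path gets its own InputFormat and its own Mapper that parses and tags that source.
MultipleInputs.addInputPath(job, new Path("/user/lab/join/records"),
    TextInputFormat.class, NcdcMapper.class);
MultipleInputs.addInputPath(job, new Path("/user/lab/join/stations"),
    TextInputFormat.class, StationMapper.class);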
MapReduce – Joins – Side Data Distribution
● Side data can be defined as extra read-only data needed by a job to process the main
dataset. The challenge is to make side data available to all the map or reduce tasks
(which are spread across the cluster) in a convenient and efficient fashion.
● Using the Job Configuration – you can set arbitrary key-value pairs in the job
configuration using the various setter methods on JobConf (inherited from Configuration).
This is very useful if you need to pass a small piece of metadata to your tasks
(see the sketch after this list).
● Distributed Cache – it is preferable to distribute datasets using Hadoop's distributed
cache mechanism. This provides a service for copying files and archives to the task nodes
in time for the tasks to use them when they run. To save network bandwidth, files are
normally copied to any particular node once per job.
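A minimal sketch of the job-configuration approach; the property name myapp.threshold is an assumption:

// In the driver: stash a small piece of metadata in the job configuration.
Configuration conf = job.getConfiguration();
conf.set("myapp.threshold", "10");   // hypothetical property name and value

// In the Mapper or Reducer: read it back from the task's Context during setup().
@Override
protected void setup(Context context) {
  int threshold = context.getConfiguration().getInt("myapp.threshold", 0);
}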
MapReduce - DistributedCache
● DistributedCache distributes application-specific, large, read-only files efficiently.
● The framework will copy the necessary files to the slave node before any tasks for the
job are executed on that node.
● DistributedCache can be used to distribute simple, read-only data/text files and more
complex types such as archives and jars. Archives (zip, tar, tgz and tar.gz files) are
un-archived at the slave nodes.
● The DistributedCache can also be used as a rudimentary software distribution mechanism
for use in the map and/or reduce tasks.
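A minimal sketch of the corresponding Hadoop 2 API calls (the cached path reuses the patterns.txt file created in the example that follows; older APIs expose equivalent calls on the DistributedCache class itself):

// In the driver: register a read-only side file before submitting the job.
job.addCacheFile(new java.net.URI("/user/lab/wc/patterns.txt"));

// In a task's setup(): the registered files are available via the context.
java.net.URI[] cached = context.getCacheFiles();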
● Same input and output.
● Create a new file at hdfs://user/lab/wc/patterns.txt with the below content:
\.
\,
\!
to
Hello
● Make a jar with the WordCount2.java example and execute the job as:
$ hadoop jar wc.jar WordCount2 -Dwordcount.case.sensitive=true \
  /user/lab/wc/input /user/lab/wc/output -skip /user/lab/wc/patterns.txt
2. If a node crashes:
– Relaunch its current tasks on other nodes
– Relaunch any maps the node previously ran
• Necessary because their output files were lost along
with the crashed node
– Load balancing