
Hadoop – MapReduce

First things first...

Is your Hadoop cluster up and running?

Review all the configuration once and make sure it is all okay.

Format the NameNode (only needed on a fresh cluster).

Start all daemons and check that they are all running.

Put some files into the cluster (everyone can do that).



12 MapReduce Introduction



MapReduce – Framework & Tasks
Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.

A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks.

Compute nodes and storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System run on the same set of nodes.
MapReduce – Framework - Managers

The MapReduce framework consists of

a single master ResourceManager,

one slave NodeManager per cluster-node, and

one MRAppMaster per application.

The Hadoop job client submits the job (jar/executable etc.) and configuration to the ResourceManager, which then assumes responsibility for

distributing the software/configuration to the slaves,

scheduling tasks and monitoring them,

providing status and diagnostic information to the job-client.



MapReduce – Core Functionality
Code is usually written in Java, though it can be written in other languages with the Hadoop Streaming API.

Two fundamental pieces:

Map step
 The master node takes the large problem input and slices it into smaller sub-problems, then distributes these to worker nodes.
 A worker node may do this again, leading to a multi-level tree structure.
 A worker processes its smaller problem and hands the result back to the master.

Reduce step
 The master node takes the answers to the sub-problems and combines them in a predefined way to get the output/answer to the original problem.
MapReduce – Inputs and Outputs

The MapReduce framework operates exclusively on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.

Input and output types of a MapReduce job, as a simple functional view:

(input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (output)
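As a concrete illustration (the class names here are placeholders, not from the slides), the generic type parameters of the Java Mapper and Reducer classes encode exactly these <k1, v1>, <k2, v2> and <k3, v3> types:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// <k1, v1> = <LongWritable, Text>  : byte offset and line of text (TextInputFormat)
// <k2, v2> = <Text, IntWritable>   : intermediate pairs emitted by map()
// <k3, v3> = <Text, IntWritable>   : final pairs written by reduce()
public class TypeSketch {
  public static class MyMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    // map(k1, v1, context) emits zero or more (k2, v2) pairs via context.write(...)
  }
  public static class MyReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    // reduce(k2, Iterable<v2>, context) emits (k3, v3) pairs via context.write(...)
  }
}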



MapReduce – Single Reduce Task



MapReduce – Example Wordcount

def mapper(line):
    for word in line.split():
        output(word, 1)

def reducer(key, values):
    output(key, sum(values))



MapReduce – Example Wordcount -
Execution



MapReduce – Execution Architecture



MapReduce – Let's get our hands dirty now

Environment configuration
Edit ~/.bashrc to add:
export PATH=${JAVA_HOME}/bin:${PATH}
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar

Eclipse

Setup new project

Import Hadoop libraries

Build and create jar



MapReduce – Exer 1 - WordCount
Manual (command-line) method

$ bin/hadoop com.sun.tools.javac.Main WordCount.java
$ jar cf wc.jar WordCount*.class

Executing the MR program

$ touch file01; touch file02
$ nano file01 – add ‘Hello World Bye World’
$ nano file02 – add ‘Hello Hadoop Goodbye Hadoop’
$ hdfs dfs -mkdir -p /user/lab/wc/input
$ hdfs dfs -put file01 file02 /user/lab/wc/input/
$ hdfs dfs -cat /user/lab/wc/input/* (to verify the content)
$ bin/hadoop jar wc.jar WordCount /user/lab/wc/input /user/lab/wc/output

Output
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
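For reference, the WordCount.java compiled above is presumably along the lines of the stock Hadoop tutorial example; a minimal sketch of it could look like this:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Emits (word, 1) for every token in the input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Sums the counts emitted for each word; also usable as a combiner.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}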
MapReduce – JobHistoryServer

Config
Add the properties below in mapred-site.xml (not essential to add to the config file):

Property: mapreduce.jobhistory.address
Value: <hostname>:10020

Property: mapreduce.jobhistory.webapp.address
Value: <hostname>:19888

Start a history server (execute the command below on the node where the history server should run as a daemon):
$ mapred historyserver



MapReduce – Example 2 Calculate Number
of Flights



MapReduce – Example Calculate Number
of Flights



FlightsByCarrier – Driver



FlightsByCarrier – Mapper



FlightsByCarrier – Reducer



Running the FlightsByCarrier application
Check that $HADOOP_CLASSPATH is set in ~/.bashrc; the opencsv jar should also be included there.

1. Go to the directory with your Java code and compile it using the following command:
$ hadoop com.sun.tools.javac.Main FlightsByCarrier*.java

2. Build a JAR file for the application by using this command:
$ jar cvf FlightsByCarrier.jar *.class
(make sure you run this from the folder that contains the opencsv library)

Steps 1 & 2 can also be done in Eclipse itself to create the final jar.

3. Run the driver application by using this command:
$ hadoop jar FlightsByCarrier.jar FlightsByCarrier /user/lab/airline/2008.csv /user/lab/airline/flightsCount

To unzip a .bz2 file, use the command ‘bzip2 -d file.bz2’
Running the FlightsByCarrier application

4. Show the job’s output file from HDFS by running the command:
$ hadoop fs -cat /user/lab/airline/flightsCount/part-r-00000

You see the total counts of all flights completed for each of the carriers in 2008:
AA 165121
AS 21406
CO 123002
DL 185813
EA 108776
HP 45399
NW 108273
PA (1) 16785
PI 116482
PS 41706
TW 69650
UA 152624
US 94814
WN 61975



MapReduce – Hadoop Streaming Intro
Hadoop Streaming is actually just a Java library that implements Job, Mapper and Reducer, but instead of doing the work itself, it pipes the data to external scripts. By doing so, it provides an API for other languages:

read from STDIN

write to STDOUT

To run a basic streaming job:

hadoop jar \
  $HADOOP_INSTALL/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
  -D mapreduce.job.reduces=1 -input /user/lab/wc/input/file02 \
  -output /user/lab/wc/streamout -mapper cat -reducer "wc -l"



MapReduce – Exer 3 – Python way
Some ideas on how to test the functionality of the Map and Reduce scripts locally:

$ echo "foo foo quux labs foo bar quux" | /home/lab/mapper.py

$ echo "foo foo quux labs foo bar quux" | /home/lab/mapper.py | sort -k1,1 | /home/lab/reducer.py

Executing the MR program:

$ hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar -files /home/poc/wc -mapper wc/mapper.py -reducer wc/reducer.py -input /user/lab/airline/2008.csv -output /user/lab/wc/pythonout2

Check Output



13 Understanding MapReduce
Part 1



MapReduce - User Interfaces - Mapper

Maps input key/value pairs to a set of intermediate key/value pairs.

Maps are the individual tasks that transform input records into intermediate records.

A given input pair may map to zero or many output pairs.

The framework spawns one map task for each InputSplit generated by the InputFormat for the job.

Applications can use the Counter to report their statistics.

All intermediate values associated with a given output key are subsequently grouped by the framework. Users can control the grouping by specifying a Comparator.
MapReduce - User Interfaces – Mapper -
InputSplit

InputSplit represents the data to be processed by an individual Mapper.

Typically InputSplit presents a byte-oriented view of the input, and it is the responsibility of RecordReader to process it and present a record-oriented view.

FileSplit is the default InputSplit.



MapReduce - User Interfaces – Mapper -
InputSplit



MapReduce - User Interfaces – Mapper -
RecordReader

RecordReader reads <key, value> pairs from an InputSplit.

Typically the RecordReader converts the byte-oriented view of the input, provided by the InputSplit, and presents a record-oriented view to the Mapper implementations for processing.

RecordReader thus assumes the responsibility of processing record boundaries and presenting the tasks with keys and values.



MapReduce - User Interfaces - Mapper

The Mapper outputs are sorted and then partitioned per Reducer. The total number of partitions is the same as the number of reduce tasks for the job.

Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner.

Users can optionally specify a combiner to perform local aggregation of the intermediate outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer.

The intermediate (mapper-to-reducer), sorted outputs are always stored in a simple (key-len, key, value-len, value) format.



MapReduce – Example Wordcount – MR
Optimization with combiner



MapReduce – Mapper - Stages
Driver (Job)

--> InputFormat – reads from file blocks

--> InputSplit – splits based on records

--> RecordReader – reads records to form (k, v) pairs

--> Mapper – emits or filters (k, v) pairs

(optional) --> Comparator – groups values of a given key (k, list(v))

(optional) --> Combiner – combines values of the same key (k, ∑(list(v)))

--> Partitioner – determines which (k, ∑(list(v))) pair goes to which reducer
MapReduce - User Interfaces - Mapper

How many Maps?

The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.

If you expect 10 GB of input data and have a block size of 128 MB, you'll end up with 80 maps.



MapReduce - User Interfaces - Partitioner
Partitioner partitions the key space.

Partitioner controls the partitioning of the keys of the intermediate map outputs.

The key (or a subset of the key) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job. Hence this controls which of the m reduce tasks the intermediate key (and hence the record) is sent to for reduction.

HashPartitioner is the default Partitioner.
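A hedged sketch of a custom Partitioner (the class name is illustrative); it does what the default HashPartitioner does, hashing the key (or a chosen subset of it) onto one of the reduce tasks, and would be registered in the driver with job.setPartitionerClass(MyPartitioner.class):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class MyPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Derive the partition from the key (or a subset of it); the result must
    // fall in [0, numPartitions), one slot per reduce task.
    return (key.toString().hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}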



MapReduce - User Interfaces - Reducer
Reduces a set of intermediate values which share a key to a smaller set of values. The number of reduces for the job is set by the user.

Reducer has 3 primary phases:

Shuffle – the relevant partitioned, sorted output of the mappers is fetched to the corresponding reducer via HTTP.

Sort – the framework groups Reducer inputs by key. The shuffle and sort phases occur simultaneously; while map outputs are being fetched, they are merged.

Secondary Sort – if equivalence rules for grouping the intermediate keys are required to be different from those for grouping keys before reduction, then one may specify a Comparator.

Reduce – the reduce method is called for each <key, (list of values)> pair in the grouped inputs.
MapReduce – Reducer Stages
--> Shuffle – moves map outputs to the reducers once at least one map task is complete; i.e. (k1, ∑(list(v))) goes to Reducer 1, (k2, ∑(list(v))) goes to Reducer 2, etc.

--> Sort – sorts the intermediate key/value pairs; e.g. (k1, ∑(list(v))) may have come from map 10 and the same key may come from map 20; sort them to be ready for aggregation.

(optional) --> Comparator – groups values (k, list(∑(list(v))))

--> Reduce – transforms/aggregates key/value pairs; iterates over each value to perform the desired operation on the values.

--> OutputFormat – defines the location of the output data

--> RecordWriter – defines how individual output records are written
MapReduce - User Interfaces - Reducer
How Many Reduces?

The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>). For example, a 10-node cluster with 8 containers per node suggests roughly 76 (0.95 x 80) or 140 (1.75 x 80) reduces.

Increasing the number of reduces increases the framework overhead, but improves load balancing and lowers the cost of failures.

Reducer NONE – it is legal to set the number of reduce tasks to zero if no reduction is desired. In this case the outputs of the map tasks go directly to the FileSystem, into the output path set. The framework does not sort the map outputs before writing them out to the FileSystem.
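As a concrete one-line illustration for the driver (assuming a Job instance named job, as in the sketches above), a map-only job is requested by setting the reduce count to zero:

job.setNumReduceTasks(0);   // Reducer NONE: map output goes straight to the output path, unsorted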



MapReduce – data flow with multiple reduce tasks



MapReduce – data flow with no reduce tasks



MapReduce – MR Execution Architecture



14 Understanding MapReduce
Part 2



MapReduce - User Interfaces – Job


Job sets the overall MapReduce job configuration.

Job is specified client-side.

It is the primary interface for a user to describe a MapReduce job to the Hadoop framework for execution.

Used to specify:

Mapper

Combiner (if any)

Partitioner (to partition the key space)

Reducer

InputFormat

OutputFormat

Many user options; high customizability.
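A hedged driver sketch showing how these pieces are wired onto a Job; MyMapper, MyReducer and MyPartitioner stand for application classes such as the earlier sketches, not for anything defined in these slides:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MyDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "my job");
    job.setJarByClass(MyDriver.class);

    job.setMapperClass(MyMapper.class);                 // Mapper
    job.setCombinerClass(MyReducer.class);              // Combiner (if any)
    job.setPartitionerClass(MyPartitioner.class);       // Partitioner
    job.setReducerClass(MyReducer.class);               // Reducer
    job.setInputFormatClass(TextInputFormat.class);     // InputFormat
    job.setOutputFormatClass(TextOutputFormat.class);   // OutputFormat

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}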



MapReduce - User Interfaces – Job


Jobs can be monitored by users.

Users can chain MapReduce jobs together to accomplish complex tasks which cannot be done with a single MapReduce job (see the sketch after this list):
▫ make use of Job.waitForCompletion()
▫ and Job.submit()

To check the list of currently running jobs:
$ mapred job -list
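A minimal chaining sketch (the paths and job objects are illustrative): run the first job to completion, then feed its output directory into the second:

// job1 and job2 are Job instances already configured with their own
// mapper/reducer classes (see the driver sketch on the previous slide).
FileOutputFormat.setOutputPath(job1, new Path("/user/lab/stage1"));
if (!job1.waitForCompletion(true)) {       // blocks until job1 finishes
  System.exit(1);                          // stop the chain if job1 failed
}
FileInputFormat.addInputPath(job2, new Path("/user/lab/stage1"));
FileOutputFormat.setOutputPath(job2, new Path("/user/lab/final"));
job2.submit();                             // submits without blocking; poll job2.isComplete()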



MapReduce - User Interfaces – Job Output

OutputFormat describes the output specification for a MapReduce job.

The MapReduce framework relies on the OutputFormat of the job to:

1. Validate the output specification of the job; for example, check that the output directory doesn't already exist.

2. Provide the RecordWriter implementation used to write the output files of the job. Output files are stored in a FileSystem.

TextOutputFormat is the default OutputFormat.



MapReduce - User Interfaces –
RecordWriter

RecordWriter writes the output <key, value> pairs to an output file.

RecordWriter implementations write the job outputs to the FileSystem.



MapReduce - Submitting Jobs to Queues


Users submit jobs to Queues.

Queues, as collections of jobs, allow the system to provide specific functionality.

Hadoop comes configured with a single mandatory queue, called ‘default’.

Some job schedulers, such as the Capacity Scheduler, support multiple queues.

Command line: $ mapred queue -list



MapReduce - Task Execution and
Environment


TaskTracker executes each Mapper/Reducer task as a child process in a separate JVM.

The child task inherits the environment of the parent TaskTracker.

Users can specify environment variables controlling memory, parallel computation settings, segment size, and more.

To check all the classpaths configured:
$ mapred classpath



MapReduce - Sorting


The ability to sort data is at the heart of MapReduce. Even if your application isn't concerned with sorting per se, it may be able to use the sorting stage that MapReduce provides to organize its data.

Partial Sort (default) – MR will sort input records by their keys.

Partitioned MapFile lookups – change the output format to be a MapFileOutputFormat.

Total Sort – it is possible to produce a set of sorted files that, if concatenated, would form a globally sorted file. The secret to doing this is to use a partitioner that respects the total order of the output.



MapReduce - Sorting


Secondary Sort – The MapReduce framework sorts the records by key before they reach the reducers. For any particular key, however, the values are not sorted. The order in which the values appear is not even stable from one run to the next, since they come from different map tasks, which may finish at different times from run to run.

However, it is possible to impose an order on the values by sorting and grouping the keys in a particular way.

We set the partitioner to partition by the first field of the key (the year), using a custom partitioner.
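A hedged wiring sketch of that idea; the three classes named here are placeholders the application would provide, not classes from the slides:

// Given a configured Job 'job' whose map output key is a composite (year, value) key:
job.setPartitionerClass(FirstFieldPartitioner.class);         // partition by the year only
job.setSortComparatorClass(CompositeKeyComparator.class);     // sort by (year, value)
job.setGroupingComparatorClass(YearGroupingComparator.class); // group reduce input by year,
                                                               // so each reduce call sees its
                                                               // values already in value order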



MapReduce - Joins


Rather than writing MapReduce programs, you might consider using a higher-level framework such as Pig, Hive, or Cascading, in which join operations are a core part of the implementation.

Side Data Distribution – applicable if one dataset is large (e.g. the weather records) but the other one is small enough to be distributed to each node in the cluster.

If both datasets are too large for either to be copied to each node in the cluster, then we can join them using MapReduce, with either a map-side join or a reduce-side join.



MapReduce – Joins - Map-Side Joins


A map-side join works by performing the join before the data
reaches the map function.



MapReduce – Joins - Reduce-Side Joins


A reduce-side join is more general than a map-side join, in that the input datasets don't have to be structured in any particular way, but it is less efficient as both datasets have to go through the MapReduce shuffle.

Multiple inputs – The input sources for the datasets generally have different formats, so it is very convenient to use the MultipleInputs class to separate the logic for parsing and tagging each source (see the sketch after this list).

Secondary sort – The reducer will see the records from both sources that have the same key, but they are not guaranteed to be in any particular order. However, to perform the join, it is important to have the data from one source before the other.
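A sketch of the MultipleInputs wiring (the paths and mapper class names are placeholders): each source gets its own mapper, which can tag its records before they meet in the common reducer:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Given a configured Job 'job':
MultipleInputs.addInputPath(job, new Path("/user/lab/join/records"),
    TextInputFormat.class, RecordMapper.class);
MultipleInputs.addInputPath(job, new Path("/user/lab/join/lookup"),
    TextInputFormat.class, LookupMapper.class);
job.setReducerClass(JoinReducer.class);   // sees tagged records from both sources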
MapReduce – Joins - Side Data
Distribution

Side data can be defined as extra read-only data needed by a job to process the main dataset. The challenge is to make side data available to all the map or reduce tasks (which are spread across the cluster) in a convenient and efficient fashion.

Using the Job Configuration – You can set arbitrary key-value pairs in the job configuration using the various setter methods on JobConf (inherited from Configuration). This is very useful if you need to pass a small piece of metadata to your tasks (see the sketch after this list).

Distributed Cache – It is preferable to distribute datasets using Hadoop's distributed cache mechanism. This provides a service for copying files and archives to the task nodes in time for the tasks to use them when they run. To save network bandwidth, files are normally copied to any particular node once per job.
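A small sketch of passing a piece of metadata through the job configuration (the property name is made up for illustration):

// In the driver:
Configuration conf = new Configuration();
conf.set("myapp.filter.year", "2008");     // small read-only side data
Job job = Job.getInstance(conf, "side data via configuration");

// In the Mapper or Reducer, read it back, e.g. in setup():
@Override
protected void setup(Context context) {
  String year = context.getConfiguration().get("myapp.filter.year", "");
  // ... use 'year' while processing records ...
}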
MapReduce - DistributedCache


DistributedCache distributes application-specific, large, read-only files efficiently.

The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node.

DistributedCache can be used to distribute simple, read-only data/text files and more complex types such as archives and jars. Archives (zip, tar, tgz and tar.gz files) are un-archived at the slave nodes.

The DistributedCache can also be used as a rudimentary software distribution mechanism for use in the map and/or reduce tasks.
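A minimal usage sketch with the Job-based cache API (the cached path is illustrative; error handling omitted): the file is copied to every task node and read locally in setup():

// In the driver: register an HDFS file to be cached on every task node.
job.addCacheFile(new java.net.URI("/user/lab/wc/patterns.txt"));

// In the Mapper: the cached file is localized for the task, so it can be
// opened by its file name in setup().
@Override
protected void setup(Context context) throws IOException {
  java.net.URI[] cached = context.getCacheFiles();
  String fileName = new org.apache.hadoop.fs.Path(cached[0].getPath()).getName();
  try (java.io.BufferedReader reader =
           new java.io.BufferedReader(new java.io.FileReader(fileName))) {
    String line;
    while ((line = reader.readLine()) != null) {
      // e.g. collect skip patterns, as WordCount2 does in the next exercise
    }
  }
}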



MapReduce – DistributedCache Exer 4

Word Count with Distributed Cache

Same input and output as Exercise 1.

Create a new file at /user/lab/wc/patterns.txt in HDFS with the content below:
\.
\,
\!
to
Hello

Make a jar with the WordCount2.java example and execute the job as:
$ hadoop jar wc.jar WordCount2 -Dwordcount.case.sensitive=true /user/lab/wc/input /user/lab/wc/output -skip /user/lab/wc/patterns.txt



MapReduce - Fault Tolerance
1. If a task crashes:
– Retry on another node
• OK for a map because it had no dependencies
• OK for reduce because map outputs are on disk
– If the same task repeatedly fails, fail the job or ignore that
input block

2. If a node crashes:
– Relaunch its current tasks on other nodes
– Relaunch any maps the node previously ran
• Necessary because their output files were lost along
with the crashed node

3. If a task is going slowly (a straggler):
– Launch a second copy of the task on another node
– Take the output of whichever copy finishes first, and kill the other one
MapReduce – Takeaways

By providing a data-parallel programming model, MapReduce can control job execution under the hood in useful ways:

– Automatic division of job into tasks

– Placement of computation near data

– Load balancing

– Recovery from failures & stragglers



Thank You

