Revise all the config once and make sure it's all okay.
Compute nodes and storage nodes are the same, that is, the
MapReduce framework and the Hadoop Distributed File System
are running on the same set of nodes.
MapReduce – Framework – Managers
Map step
The master node takes the large problem input and slices it into smaller
sub-problems, then distributes these to worker nodes.
A worker node may do this again, leading to a multi-level tree structure.
Each worker processes its smaller problem and hands the answer back to the master.
Reduce step
The master node takes the answers to the sub-problems and combines them
in a predefined way to get the output/answer to the original problem.
MapReduce – Inputs and Outputs
def mapper(line):
    for word in line.split(): output(word, 1)
def reducer(key, values):
    output(key, sum(values))
Output
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
MapReduce – JobHistoryServer
Config
Add the properties below to mapred-site.xml (not essential to add them to the config file):
Property – mapreduce.jobhistory.address
Value – <hostname>:10020
Property – mapreduce.jobhistory.webapp.address
Value – <hostname>:19888
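For reference, a sketch of how these two properties look inside mapred-site.xml (replace the <hostname> placeholder with the actual history-server host):
<configuration>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value><hostname>:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value><hostname>:19888</value>
  </property>
</configuration>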
Start the history server (execute the command below on the node where the history
server should run as a daemon):
$ mapred historyserver
Steps 1 & 2 can also be done in Eclipse itself to create the final jar.
Hadoop Streaming runs a regular MapReduce Job with a Mapper and a Reducer,
but instead of actually doing anything, it pipes data to scripts. By doing so, it
provides an API for other languages:
$ hadoop jar \
  $HADOOP_INSTALL/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
  -Dmapreduce.job.reduces=1 \
  -input /user/lab/wc/file02 -output /user/lab/wc/streamout \
  -mapper cat -reducer "wc -l"
$ echo "foo foo quux labs foo bar quux" | /home/lab/mapper.py | sort -k1,1 | /home/lab/reducer.py
Check Output
● Maps are the individual tasks that transform input records into intermediate records.
● A given input pair may map to zero or many output pairs.
● The framework spawns one map task for each InputSplit generated by the InputFormat for the job.
● Applications can use the Counter to report their statistics.
● All intermediate values associated with a given output key are subsequently grouped by the
framework. Users can control the grouping by specifying a Comparator.
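To make these points concrete, here is a minimal word-count Mapper sketch in the org.apache.hadoop.mapreduce API; the class name TokenMapper and the counter group/name are illustrative, not from the slides:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer it = new StringTokenizer(value.toString());
    while (it.hasMoreTokens()) {
      word.set(it.nextToken());
      context.write(word, ONE);  // an input record may produce zero or many output pairs
      context.getCounter("WordCount", "TOKENS").increment(1);  // application-defined Counter
    }
  }
}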
MapReduce – User Interfaces – Mapper – InputSplit
● InputSplit represents the data to be processed by an individual Mapper.
● Typically, InputSplit presents a byte-oriented view of the input, and it is the
responsibility of RecordReader to process it and present a record-oriented view.
● FileSplit is the default InputSplit.
● Typically, the RecordReader converts the byte-oriented view of the input, provided by
the InputSplit, and presents a record-oriented view to the Mapper implementations for processing.
● RecordReader assumes the responsibility of processing record boundaries and presents
the tasks with keys and values.
● Users can control which keys (and hence records) go to which Reducer by implementing
a custom Partitioner.
● Users can optionally specify a combiner to perform local aggregation of the intermediate
outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer.
● The intermediate, sorted outputs (data from mapper to reducer) are always stored in a
simple (key-len, key, value-len, value) format.
● The number of maps is usually driven by the total size of the inputs, that is, the total
number of blocks of the input files.
● If you expect 10 GB of input data and have a block size of 128 MB, you'll end up with 80 maps.
● Partitioner controls the partitioning of the keys of the intermediate map-outputs.
● The key (or a subset of the key) is used to derive the partition, typically by a hash
function. The total number of partitions is the same as the number of reduce tasks for
the job. Hence this controls which of the m reduce tasks the intermediate key (and hence
the record) is sent to for reduction.
● HashPartitioner is the default Partitioner.
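As an illustration of a custom Partitioner (the class name and the tab-separated key layout are assumptions, not from the slides):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes records whose key starts with the same first field (e.g. a year) to the same reducer.
public class FirstFieldPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String firstField = key.toString().split("\t")[0];
    // Same formula HashPartitioner uses, but applied to the first field only.
    return (firstField.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

It would be registered with job.setPartitionerClass(FirstFieldPartitioner.class); without it, HashPartitioner is used.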
● Shuffle – the relevant partition of the sorted output of the mappers is fetched to the
corresponding reducer via HTTP.
● Sort – the framework groups Reducer inputs by keys. The shuffle and sort phases occur
simultaneously; while map-outputs are being fetched they are merged.
● Secondary Sort – if the equivalence rules for grouping the intermediate keys are required
to be different from those for grouping keys before reduction, then one may specify a Comparator.
● Reduce – the reduce method is called for each <key, (list of values)> pair in the grouped inputs.
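A matching sum Reducer sketch (the class name SumReducer is illustrative); the framework calls reduce() once per grouped key with all of that key's values:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable result = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {   // all values grouped under this key
      sum += v.get();
    }
    result.set(sum);
    context.write(key, result);      // one output record per key
  }
}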
MapReduce – Reducer Stages
--> Shuffle – moving map outputs to the reducers; it begins once at least one map task is
complete, i.e. (k1, list(v)) goes to Reducer 1, (k2, list(v)) goes to Reducer 2, etc.
● The right number of reduces seems to be 0.95 or 1.75 multiplied by
(<no. of nodes> * <no. of maximum containers per node>) (a worked example follows this list).
● Increasing the number of reduces increases the framework overhead, but improves load
balancing and lowers the cost of failures.
● Reducer NONE – it is legal to set the number of reduce-tasks to zero if no reduction is
desired. In this case the outputs of the map-tasks go directly to the FileSystem, into the
output path set. The framework does not sort the map-outputs before writing them out to
the FileSystem.
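As a worked example with hypothetical numbers: on a cluster of 10 nodes with 8 maximum containers per node, 0.95 × (10 × 8) ≈ 76 reduces lets all reduces launch immediately, while 1.75 × (10 × 8) = 140 reduces lets faster nodes run a second wave, improving load balancing.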
● Job sets the overall MapReduce job configuration.
● Job is specified client-side.
● Primary interface for a user to describe a MapReduce job to the Hadoop framework for execution.
Used to specify:
● Mapper
● Combiner (if any)
● Partitioner (to partition key space)
● Reducer
● InputFormat
● OutputFormat
● Many user options; high customizability.
● Jobs can be monitored by users.
● Users can chain MapReduce jobs together to accomplish complex tasks which cannot be
done with a single MapReduce job (see the driver sketch below):
▫ make use of Job.waitForCompletion()
▫ and Job.submit()
● To check the list of currently running jobs:
$ mapred job -list
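Tying the pieces together, a driver sketch using the illustrative classes from the earlier examples (the paths, job name, and the optional partitioner line are assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(TokenMapper.class);                  // Mapper
    job.setCombinerClass(SumReducer.class);                 // Combiner (if any)
    job.setPartitionerClass(FirstFieldPartitioner.class);   // Partitioner (optional; HashPartitioner is the default)
    job.setReducerClass(SumReducer.class);                  // Reducer
    job.setInputFormatClass(TextInputFormat.class);         // InputFormat
    job.setOutputFormatClass(TextOutputFormat.class);       // OutputFormat
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // waitForCompletion() blocks until the job finishes; Job.submit() returns immediately,
    // which is what makes chaining several jobs together possible.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}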
● RecordWriter implementations write the job outputs to the FileSystem.
● Users submit jobs to Queues.
● Queues, as a collection of jobs, allow the system to provide specific functionality.
● Hadoop comes configured with a single mandatory queue, called 'default'.
● Some job schedulers, such as the Capacity Scheduler, support multiple queues.
● Command line:
$ mapred queue -list
● TaskTracker executes each Mapper/Reducer task as a child process in a separate JVM.
● The child task inherits the environment of the parent TaskTracker.
● Users can specify environment variables controlling memory, parallel computation
settings, segment size, and more.
● To check the full configured classpath:
$ mapred classpath
● The ability to sort data is at the heart of MapReduce. Even if your application isn't
concerned with sorting per se, it may be able to use the sorting stage that MapReduce
provides to organize its data.
● Partial Sort (default) – MR will sort input records by their keys.
● Partitioned MapFile lookups – change the output format to be a MapFileOutputFormat.
● Total Sort – it is possible to produce a set of sorted files that, if concatenated, would
form a globally sorted file. The secret to doing this is to use a partitioner that respects
the total order of the output.
● Secondary Sort – the MapReduce framework sorts the records by key before they reach
the reducers. For any particular key, however, the values are not sorted. The order in
which the values appear is not even stable from one run to the next, since they come from
different map tasks, which may finish at different times from run to run.
● However, it is possible to impose an order on the values by sorting and grouping the
keys in a particular way.
● We set the partitioner to partition by the first field of the key (the year), using a
custom partitioner, as sketched below.
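A hedged sketch of the job settings typically involved in a secondary sort; CompositeKeyComparator and NaturalKeyGroupingComparator are hypothetical helper classes, not given in the slides:

// Partition by the natural key (e.g. the year) so all records for a year reach the same reducer.
job.setPartitionerClass(FirstFieldPartitioner.class);
// Sort composite keys by natural key first, then by the value we want ordered (e.g. temperature).
job.setSortComparatorClass(CompositeKeyComparator.class);        // hypothetical helper class
// Group reducer input by the natural key only, so one reduce() call sees all values for that key.
job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);  // hypothetical helper class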
● Rather than writing MapReduce programs, you might consider using a higher-level
framework such as Pig, Hive, or Cascading, in which join operations are a core part of
the implementation.
● Side Data Distribution – applicable if one dataset is large (the weather records) but
the other one is small enough to be distributed to each node in the cluster.
● If both datasets are too large for either to be copied to each node in the cluster, then
we can join them using MapReduce, using either a map-side join or a reduce-side join.
● A map-side join works by performing the join before the data reaches the map function.
● A reduce-side join is more general than a map-side join, in that the input datasets
don't have to be structured in any particular way, but it is less efficient as both
datasets have to go through the MapReduce shuffle.
● Multiple inputs – the input sources for the datasets generally have different formats,
so it is very convenient to use the MultipleInputs class to separate the logic for parsing
and tagging each source (see the sketch after this list).
● Secondary sort – the reducer will see the records from both sources that have the same
key, but they are not guaranteed to be in any particular order. However, to perform the
join, it is important to have the data from one source before the other.
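A minimal sketch of wiring two differently formatted inputs with MultipleInputs; the paths and the mapper classes NcdcMapper and StationMapper are illustrative assumptions:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Each input path gets its own InputFormat and its own Mapper that parses and tags that source.
MultipleInputs.addInputPath(job, new Path("/user/lab/join/records"),
    TextInputFormat.class, NcdcMapper.class);
MultipleInputs.addInputPath(job, new Path("/user/lab/join/stations"),
    TextInputFormat.class, StationMapper.class);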
MapReduce – Joins – Side Data Distribution
● Side data can be defined as extra read-only data needed by a job to process the main
dataset. The challenge is to make side data available to all the map or reduce tasks
(which are spread across the cluster) in a convenient and efficient fashion.
● Using the Job Configuration – you can set arbitrary key-value pairs in the job
configuration using the various setter methods on JobConf (inherited from Configuration).
This is very useful if you need to pass a small piece of metadata to your tasks
(see the sketch after this list).
● Distributed Cache – it is preferable to distribute datasets using Hadoop's distributed
cache mechanism. This provides a service for copying files and archives to the task nodes
in time for the tasks to use them when they run. To save network bandwidth, files are
normally copied to any particular node once per job.
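A minimal sketch of the job-configuration approach; the property name myapp.threshold is an assumption:

// In the driver: stash a small piece of metadata in the job configuration.
Configuration conf = job.getConfiguration();
conf.set("myapp.threshold", "10");   // hypothetical property name and value

// In the Mapper or Reducer: read it back from the task's Context during setup().
@Override
protected void setup(Context context) {
  int threshold = context.getConfiguration().getInt("myapp.threshold", 0);
}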
MapReduce - DistributedCache
● DistributedCache distributes application-specific, large, read-only files efficiently.
● The framework will copy the necessary files to the slave node before any tasks for the
job are executed on that node.
● DistributedCache can be used to distribute simple, read-only data/text files and more
complex types such as archives and jars. Archives (zip, tar, tgz and tar.gz files) are
un-archived at the slave nodes.
● The DistributedCache can also be used as a rudimentary software distribution mechanism
for use in the map and/or reduce tasks.
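A minimal sketch of the corresponding Hadoop 2 API calls (the cached path reuses the patterns.txt file created in the example that follows; older APIs expose equivalent calls on the DistributedCache class itself):

// In the driver: register a read-only side file before submitting the job.
job.addCacheFile(new java.net.URI("/user/lab/wc/patterns.txt"));

// In a task's setup(): the registered files are available via the context.
java.net.URI[] cached = context.getCacheFiles();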
● Same input and output.
● Create a new file at hdfs://user/lab/wc/patterns.txt with the below content:
\.
\,
\!
to
Hello
● Make a jar with the WordCount2.java example and execute the job as:
$ hadoop jar wc.jar WordCount2 -Dwordcount.case.sensitive=true \
  /user/lab/wc/input /user/lab/wc/output -skip /user/lab/wc/patterns.txt
2. If a node crashes:
– Relaunch its current tasks on other nodes
– Relaunch any maps the node previously ran
• Necessary because their output files were lost along
with the crashed node
– Load balancing