Map Reduce
What is MapReduce?
MapReduce is a programming model for distributed processing of large data sets. It works on (key, value) pairs: the map phase turns input records into intermediate (key, value) pairs, and the reduce phase aggregates the values grouped under each key. In a word count job, for example, the mappers emit (word, 1) pairs and the reducers sum the 1s for each word.
What is a Mapper?
In the mapper, user-provided code is executed on each key/value pair from the
record reader to produce zero or more new key/value pairs, called the
intermediate pairs. The choice of key and value here is not arbitrary; it is
central to what the MapReduce job accomplishes. The key is what the data will
be grouped on, and the value is the information pertinent to the analysis in
the reducer.
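As an illustration, here is a minimal word-count mapper written against the old org.apache.hadoop.mapred API (the class name WordCountMapper is illustrative): the word is the key the data will be grouped on, and the count of 1 is the value the reducer analyzes.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// For each input line handed over by the record reader, emit one
// (word, 1) intermediate pair per token.
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            output.collect(word, ONE); // zero or more pairs per input record
        }
    }
}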
What is a Partitioner?
The partitioner takes the intermediate key/value pairs from the mapper (or
combiner, if one is being used) and splits them up into shards, one shard per
reducer. By default, the partitioner retrieves the key's hash code and
performs a modulus operation by the number of reducers:
key.hashCode() % (number of reducers). This spreads the keyspace roughly
evenly over the reducers, while still ensuring that equal keys produced by
different mappers end up at the same reducer. The default behavior of the
partitioner can be customized, and will be in some more advanced patterns,
such as sorting. However, changing the partitioner is rarely necessary. The
partitioned data is written to the local file system of each map task and
waits to be pulled by its respective reducer.
The default implementation is HashPartitioner.
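As a sketch, this is essentially what HashPartitioner computes; a custom partitioner would replace the body of getPartition with its own routing logic (the class name here is illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Mirrors the default behavior: mask off the sign bit so the result is
// non-negative, then take the modulus by the number of reducers.
public class HashStylePartitioner implements Partitioner<Text, IntWritable> {

    public void configure(JobConf job) {
        // nothing to configure in this sketch
    }

    public int getPartition(Text key, IntWritable value, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}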
There are two kinds of reducers: the identity reducer and your own reducer.
If no reducer is defined, the identity reducer comes into the picture and
simply passes each intermediate key/value pair through unchanged.
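A minimal user-defined reducer for word count, again against the old API (WordCountReducer is an illustrative name), would sum the 1s emitted by the mapper:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Receives each word together with all of its 1s and emits the total.
public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}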
If reducers do not start before all mappers finish, why does the progress
of a MapReduce job show something like Map(50%) Reduce(10%)? Why is
reducer progress displayed when the mappers have not finished yet?
Reducers start copying intermediate key/value pairs from the mappers as soon
as they are available. The progress calculation also takes into account this
data transfer, which is performed by the reduce tasks, so reduce progress
starts showing up as soon as any intermediate key/value pair from a mapper is
available to be transferred to a reducer. Although the reducer progress is
updated during this copy phase, the programmer-defined reduce method is
called only after all the mappers have finished.
In the old API we have the JobConf class with a setNumMapTasks(int) method,
but in the new API the Job class has no such method; hence the number of map
tasks can be configured only through the old API. Even there it is just a
hint: the actual number of map tasks is determined by the number of input
splits.
For the number of reducers, set with setNumReduceTasks(int) in both APIs:
if set to 0, the job is map-only, the mappers produce the final output, and
the number of output files depends on the number of mappers/input splits;
if left unset, the default number of reducers is one. A configuration sketch
covering both settings follows.
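A minimal sketch of these settings through the old API's JobConf (the class name TaskCountConfig is illustrative):

import org.apache.hadoop.mapred.JobConf;

public class TaskCountConfig {
    public static JobConf configure() {
        JobConf conf = new JobConf(TaskCountConfig.class);
        conf.setNumMapTasks(10);   // a hint only; input splits decide the real count
        conf.setNumReduceTasks(0); // 0 = map-only job, one output file per map task
        // conf.setNumReduceTasks(1) would restore the default single reducer.
        return conf;
    }
}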
For storing data, two services are enough: the NameNode and the DataNodes.
For processing data, the JobTracker and the TaskTrackers are enough.
Task instances are the actual MapReduce tasks that run on each slave node.
The TaskTracker starts a separate JVM process to do the actual work (called a
task instance); this ensures that a task failure does not take down the
TaskTracker. Each task instance runs in its own JVM process, and there can be
multiple task-instance processes running on a slave node, based on the number
of slots configured on the TaskTracker. By default, a new JVM process is
spawned for each task.
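The one-JVM-per-task default can be relaxed. A sketch, assuming the classic (pre-YARN) JobConf API, where setNumTasksToExecutePerJvm backs the mapred.job.reuse.jvm.num.tasks setting:

import org.apache.hadoop.mapred.JobConf;

public class JvmReuseConfig {
    public static JobConf configure() {
        JobConf conf = new JobConf(JvmReuseConfig.class);
        // -1 lets a task-instance JVM be reused without limit; the
        // default of 1 spawns a fresh JVM process for every task.
        conf.setNumTasksToExecutePerJvm(-1);
        return conf;
    }
}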
What is the Hadoop MapReduce API contract for a key and a value class?
The key must implement the org.apache.hadoop.io.WritableComparable interface.
The value must implement the org.apache.hadoop.io.Writable interface.
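For illustration, a minimal custom key honoring that contract (YearKey is a hypothetical name); keys must be serializable between map and reduce and sortable during the shuffle:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class YearKey implements WritableComparable<YearKey> {

    private int year;

    public YearKey() { }                   // Hadoop needs a no-arg constructor
    public YearKey(int year) { this.year = year; }

    public void write(DataOutput out) throws IOException {
        out.writeInt(year);                // serialize
    }

    public void readFields(DataInput in) throws IOException {
        year = in.readInt();               // deserialize
    }

    public int compareTo(YearKey other) {  // defines the shuffle sort order
        return Integer.compare(year, other.year);
    }

    public int hashCode() { return year; } // used by the default partitioner

    public boolean equals(Object o) {
        return (o instanceof YearKey) && ((YearKey) o).year == year;
    }
}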
What commands are used to see all jobs running in the Hadoop cluster and to
kill a job in Linux?
hadoop job -list
hadoop job -kill <jobID>
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
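These imports appear to open the classic WordCount listing written against the old org.apache.hadoop.mapred API, but the rest of the listing is cut off. A minimal sketch of how such a job is typically wired together, reusing the WordCountMapper and WordCountReducer sketched earlier (all class and path names are illustrative; the wildcard imports above cover everything used below):

public class WordCount {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);          // key: WritableComparable
        conf.setOutputValueClass(IntWritable.class); // value: Writable

        conf.setMapperClass(WordCountMapper.class);
        conf.setCombinerClass(WordCountReducer.class); // local pre-aggregation
        conf.setReducerClass(WordCountReducer.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf); // submit the job and wait for completion
    }
}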