
UNIT - 3

MapReduce

 MapReduce is a programming model, or pattern, within the Hadoop framework that is
used to access big data stored in the Hadoop Distributed File System (HDFS). It is a
core component, integral to the functioning of the Hadoop framework.
 MapReduce is another foundational idea of the Hadoop ecosystem.
 A MapReduce job comes into existence when the client application submits it to the
Job Tracker. In response, the Job Tracker sends the work to the appropriate Task
Trackers. If a Task Tracker fails or times out, that part of the job is rescheduled.
 It has two parts: the mapper, which transforms raw source data into key-value pairs,
and the reducer, which aggregates those key-value pairs to give the desired output.
Keys can be repeated during mapping.
 Once the mapper creates the key-value pairs, MapReduce ‘shuffles and sorts’ them so
that all pairs with the same key end up on the same node; this shuffle-and-sort phase
is based on the merge sort algorithm. After this step, [k1: v1, v2], [k2: v2] and
[k3: v2, v3] are sent to three different nodes. The Reducer comes next and applies an
aggregate function to each key. For example, if the reducer applies a count function,
the final output will be [k1:2], [k2:1], [k3:2].
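
To make this flow concrete, below is a minimal, framework-free Java sketch (a simulation only; the class name ShuffleSortDemo and the sample pairs are illustrative, not part of Hadoop) showing how the mapped pairs are grouped by key and then counted:

// Simulating map output, shuffle-and-sort grouping, and a count reducer
import java.util.*;

public class ShuffleSortDemo {
    public static void main(String[] args) {
        // Mapper output: intermediate key-value pairs (keys may repeat)
        List<Map.Entry<String, String>> mapped = List.of(
                Map.entry("k1", "v1"), Map.entry("k1", "v2"),
                Map.entry("k2", "v2"),
                Map.entry("k3", "v2"), Map.entry("k3", "v3"));

        // Shuffle and sort: group all values that share a key onto the same "node"
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> pair : mapped) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }

        // Reduce: apply an aggregate function (here, a count) to each key's values
        grouped.forEach((key, values) ->
                System.out.println("[" + key + ":" + values.size() + "]"));
        // Prints [k1:2], [k2:1], [k3:2] (one per line)
    }
}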

MapReduce on a Cluster

 On a cluster, the mapper converts the input into key-value pairs, which are then
shuffled and sorted; the reducer aggregates the mapped, shuffled, and sorted data and
sends the desired output to the client node. This is the basic idea of MapReduce.

Terminology:

Mapper − Mapper maps the input key/value pairs to a set of intermediate key/value pairs.

NameNode − Node that manages the Hadoop Distributed File System (HDFS).

DataNode − Node where data is presented in advance before any processing takes place.

MasterNode − Node where JobTracker runs and which accepts job requests from clients.

SlaveNode − Node where the Map and Reduce programs run.

JobTracker − Schedules jobs and tracks the assigned jobs on the Task Tracker.

Task Tracker − Tracks the task and reports status to JobTracker.


Job − An execution of a Mapper and a Reducer across a dataset.

Task − An execution of a Mapper or a Reducer on a slice of data.

Task Attempt − A particular instance of an attempt to execute a task on a SlaveNode.

Job Tracker

 The role of the Job Tracker is to accept MapReduce jobs from clients and to locate
the data to be processed with the help of the NameNode.
 In response, the NameNode provides the required metadata to the Job Tracker.
 The Job Tracker manages all the resources and all the jobs across the cluster, and
schedules each map task on a Task Tracker running on the same DataNode as the data,
since there can be hundreds of DataNodes available in the cluster.

Task Tracker

 It works as a slave node for the Job Tracker.


 It receives the task and code from the Job Tracker and applies that code to the file.
This process can also be called a Mapper.
 The Task Trackers can be considered the actual slaves that work on the instructions
given by the Job Tracker. A Task Tracker is deployed on each node in the cluster and
executes the Map and Reduce tasks as instructed by the Job Tracker.
Advantages of MapReduce

The two biggest advantages of MapReduce are:

1. Parallel Processing:

In MapReduce, the job is divided among multiple nodes, and each node works on its part
of the job simultaneously. MapReduce is thus based on the divide-and-conquer paradigm,
which helps us process the data using different machines. As the data is processed by
multiple machines in parallel instead of by a single machine, the time taken to
process it is reduced tremendously.

Fig.: Traditional Way Vs. MapReduce Way


2. Data Locality:

In the MapReduce framework, instead of moving data to the processing unit, we move the
processing unit to the data. In the traditional system, we used to bring the data to
the processing unit and process it there. But as data grew very large, bringing this
huge amount of data to the processing unit posed the following issues:

 Moving huge amounts of data to the processing unit is costly and degrades network
performance.
 Processing takes time, as the data is processed by a single unit, which becomes the
bottleneck.
 The master node can get over-burdened and may fail.

Hadoop – Different Modes of Operation

Hadoop is an open-source framework mainly used for storing, maintaining, and analyzing
large amounts of data (datasets) on clusters of commodity hardware; in other words, it
is a data-management tool. Hadoop also possesses a scale-out storage property, which
means we can scale the number of nodes up or down as our requirements change, which is
a really useful feature.

Hadoop Mainly works on 3 different Modes:

1. Standalone Mode
2. Pseudo-distributed Mode
3. Fully-Distributed Mode

1. Standalone Mode:

In Standalone Mode none of the daemons run, i.e. NameNode, DataNode, Secondary
NameNode, Job Tracker, and Task Tracker.

Standalone Mode also means that we install Hadoop on only a single system. By default,
Hadoop is configured to run in this Standalone Mode, which we can also call Local
Mode. We mainly use Hadoop in this mode for learning, testing, and debugging.

Hadoop runs fastest in this mode among all three modes. HDFS (Hadoop Distributed File
System), one of the major components of Hadoop normally used for storage, is not used
in this mode; the local file system is used instead. You can think of HDFS as playing
a role similar to the file systems available on Windows, i.e. NTFS (New Technology
File System) and FAT32 (File Allocation Table with 32-bit entries). When Hadoop works
in this mode there is no need to configure the files hdfs-site.xml, mapred-site.xml,
and core-site.xml for the Hadoop environment. In this mode, all of your processes run
in a single JVM (Java Virtual Machine), and it is used only for small development
purposes.
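
As a small illustration, here is a hedged Java sketch (the class name LocalModeCheck is hypothetical, and it assumes the standard Hadoop client libraries are on the classpath) that makes the standalone defaults explicit: with no *-site.xml overrides, Hadoop falls back to the local file system and the single-JVM "local" MapReduce runner.

// Showing the defaults Hadoop uses when no *-site.xml files are configured
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class LocalModeCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // These values are already the defaults; setIfUnset just spells out
        // what standalone (local) mode means.
        conf.setIfUnset("fs.defaultFS", "file:///");          // local file system, no HDFS daemons
        conf.setIfUnset("mapreduce.framework.name", "local"); // map and reduce run in a single JVM

        FileSystem fs = FileSystem.get(conf);
        System.out.println("Working file system: " + fs.getUri()); // expected: file:///
    }
}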

2. Pseudo Distributed Mode (Single Node Cluster):

In Pseudo-distributed Mode we also use only a single node, but the key point is that
the cluster is simulated, which means that all the processes inside the cluster run
independently of each other.

All the daemons, that is, NameNode, DataNode, Secondary NameNode, Resource Manager,
Node Manager, etc., run as separate processes in separate JVMs (Java Virtual
Machines); in other words, they run as different Java processes, which is why this is
called pseudo-distributed.

One thing to remember is that, since we are using a single-node setup, all the master
and slave processes are handled by the same system. The NameNode and Resource Manager
act as masters, while the DataNode and Node Manager act as slaves.

A Secondary NameNode is also used as a master. Its purpose is simply to keep periodic
(by default, hourly) checkpoints of the NameNode’s metadata. In this mode,

 Hadoop is used both for development and for debugging purposes.
 HDFS (Hadoop Distributed File System) is utilized for managing the input and output
processes.

3. Fully Distributed Mode (Multi-Node Cluster):

This is the most important mode, in which multiple nodes are used: a few of them run
the master daemons, namely the NameNode and the Resource Manager, and the rest run the
slave daemons, namely the DataNode and the Node Manager. Here Hadoop runs on a cluster
of machines (nodes), and the data is distributed across the different nodes. This is
the actual production mode of Hadoop.
Classes of MapReduce:

Mapper Class

The first stage in data processing using MapReduce is the Mapper class. Here, the
Record Reader processes each input record and generates the corresponding key-value
pair, and Hadoop saves the Mapper’s intermediate output to the local disk. A minimal
mapper sketch is given after the list below.

 Input Split: It is the logical representation of data. It represents a block of work
that is processed by a single map task in the MapReduce program.
 Record Reader: It interacts with the Input Split and converts the data it reads into
key-value pairs.
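
As a sketch of what a Mapper class looks like in practice, here is the standard word-count mapper pattern (the class name TokenizerMapper is illustrative): each call to map() receives one record from the Record Reader, here the byte offset of a line as the key and the line of text as the value, and emits intermediate (word, 1) pairs.

// Word-count mapper: turns each input line into (word, 1) pairs
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // intermediate key-value pair
        }
    }
}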

Reducer Class

The intermediate output generated by the mapper is fed to the reducer, which processes
it and generates the final output, which is then saved in HDFS.
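
A matching reducer sketch, following the same standard word-count pattern (the class name IntSumReducer is illustrative), sums all the shuffled values received for each key and writes the aggregated result:

// Word-count reducer: sums the counts for each word
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();           // aggregate all values for this key
        }
        result.set(sum);
        context.write(key, result);     // final (word, total) pair, saved in HDFS
    }
}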

Driver Class

The major component in a MapReduce job is the Driver class. It is responsible for
setting up a MapReduce job to run in Hadoop. In it we specify the names of the Mapper
and Reducer classes, along with the data types and the job name.
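
A minimal driver sketch, assuming the hypothetical TokenizerMapper and IntSumReducer classes shown above, might look like this:

// Driver: wires the Mapper and Reducer together and submits the job
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");   // job name

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);       // Mapper class
        job.setReducerClass(IntSumReducer.class);        // Reducer class
        job.setOutputKeyClass(Text.class);               // output data types
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input path in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path in HDFS

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}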

MapReduce – Combiners

Map-Reduce is a programming model used for processing large datasets over distributed
systems in Hadoop. The Map phase and the Reduce phase are the two main parts of any
Map-Reduce job. Map-Reduce applications are limited by the bandwidth available on the
cluster, because data has to move from the Mappers to the Reducers.

For example, suppose we have a 1 Gbps (gigabit per second) network in our cluster and
we are processing data in the range of hundreds of PB (petabytes). Moving such a large
dataset over a 1 Gbps link takes far too long. The Combiner is used to ease this
problem by minimizing the data that gets shuffled between Map and Reduce.
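
As a rough back-of-the-envelope figure: 100 PB is about 8 × 10^17 bits, and at 1 Gbps (10^9 bits per second) that works out to roughly 8 × 10^8 seconds, i.e. around 25 years of pure transfer time, which is why reducing the volume of shuffled data matters so much.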

 What is a combiner?
 How combiner works
 Advantage of combiners
 Disadvantage of combiner

What is a combiner?

A Combiner always works between the Mapper and the Reducer. The output produced by the
Mapper is the intermediate output, in the form of key-value pairs, and it can be
massive in size.

If we feed this huge output directly to the Reducer, it will increase network
congestion. So, to minimize this network congestion, we place a combiner between the
Mapper and the Reducer. Combiners are also known as semi-reducers.

Adding a combiner to your Map-Reduce program is not mandatory; it is optional. A
combiner is also a class in our Java program, like the Map and Reduce classes, and it
sits between them. A combiner helps us produce abstract details, or a summary, of very
large datasets. When we process very large datasets with Hadoop, a combiner is very
useful and enhances overall performance.

How does a combiner work?

In this example, the main text file is divided between two Mappers, so the two Mappers
contain different data. Each Mapper is assigned to process a different line of our
data; since we have two lines of data, we have two Mappers, one to handle each line.
The Mappers produce the intermediate key-value pairs, where the particular word is the
key and its count is the value. For example, for the data "Geeks For Geeks For", the
key-value pairs are shown below.

// Key Value pairs generated for data Geeks For Geeks For

(Geeks,1)
(For,1)
(Geeks,1)
(For,1)
The key-value pairs generated by the Mapper are known as the intermediate key-value
pairs, or intermediate output, of the Mapper. We can minimize the number of these
key-value pairs by introducing a combiner for each Mapper in our program. In our case,
each Mapper generates 4 key-value pairs. These intermediate key-value pairs should not
be fed directly to the Reducer, because that can increase network congestion, so the
Combiner combines them before sending them to the Reducer. The combiner combines the
intermediate key-value pairs according to their key. For the data "Geeks For Geeks
For" above, the combiner partially reduces them by merging pairs with the same key,
generating the new key-value pairs shown below.

// Partially reduced key-value pairs with combiner


(Geeks,2)
(For,2)

With the help of the Combiner, the Mapper output is partially reduced in size (fewer
key-value pairs) before it is made available to the Reducer, which improves
performance. The Reducer then reduces the output obtained from the combiners again and
produces the final output, which is stored in HDFS (Hadoop Distributed File System).
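
In the word-count case, the reducer logic itself can serve as the combiner, because summing counts is associative and commutative. A small sketch, building on the hypothetical IntSumReducer and driver shown earlier:

// Wiring a combiner into the (hypothetical) word-count job from earlier
import org.apache.hadoop.mapreduce.Job;

public class CombinerWiring {
    static void addCombiner(Job job) {
        job.setCombinerClass(IntSumReducer.class); // map-side partial sums: (Geeks,1)+(Geeks,1) -> (Geeks,2)
        job.setReducerClass(IntSumReducer.class);  // final sums on the reduce side
    }
}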

Advantage of combiners

 Reduces the time taken for transferring the data from Mapper to Reducer.
 Reduces the size of the intermediate output generated by the Mapper.
 Improves performance by minimizing Network congestion.
 Reduces the workload on the Reducer: Combiners can help reduce the amount of data
that needs to be processed by the Reducer. By performing some aggregation or
reduction on the data in the Mapper phase itself, combiners can reduce the number of
records that are passed on to the Reducer, which can help improve overall
performance.
 Improves fault tolerance: Combiners can also help improve fault tolerance in
MapReduce. In case of a node failure, the MapReduce job can be re-executed from
the point of failure. Since combiners perform some of the aggregation or reduction
tasks in the Mapper phase itself, this reduces the amount of work that needs to be re-
executed, which can help improve fault tolerance.
 Improves scalability: By reducing the amount of data that needs to be transferred
between the Mapper and Reducer, combiners can help improve the scalability of
MapReduce jobs. This is because the amount of network bandwidth required for
transferring data is reduced, which can help prevent network congestion and improve
overall performance.
 Helps optimize MapReduce jobs: Combiners can be used to optimize MapReduce
jobs by performing some preliminary data processing before the data is sent to the
Reducer. This can help reduce the amount of processing required by the Reducer,
which can help improve performance and reduce overall processing time.

Disadvantage of combiners

 The intermediate key-value pairs generated by the Mappers are stored on the local
disk, and the combiner then runs over them to partially reduce the output, which
results in expensive disk input-output.
 A Map-Reduce job cannot depend on the combiner being applied, because Hadoop gives no
guarantee about whether, or how many times, the combiner is executed.
 Increased resource usage: Combiners can increase the resource usage of MapReduce
jobs since they require additional CPU and memory resources to perform their
operations. This can be especially problematic in large-scale MapReduce jobs that
process huge amounts of data.
 Combiners may not always be effective: While combiners can help reduce the amount
of data transferred between the Mapper and Reducer, they may not always be
effective in doing so. This is because the effectiveness of combiners depends on the
data being processed and the operations being performed. In some cases, using
combiners may actually increase the amount of data transferred, which can reduce
overall performance.
 Combiners can introduce data inconsistencies: Since combiners perform partial
reductions on the data, they can introduce inconsistencies in the output if they are
not implemented correctly. This is especially problematic if the combiner performs
operations that are not associative or commutative, which can lead to incorrect
results (see the worked example after this list).
 Increased complexity: Combiners can add complexity to MapReduce jobs since they
require additional logic to be implemented in the code. This can make the code more
difficult to maintain and debug, especially if the combiner logic is complex.
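
For instance, averaging is one such non-associative case: if one mapper sees the values {1, 2, 3} and another sees {4, 5}, the true average is 3.0, but a combiner that averages each mapper's values and a reducer that then averages those partial averages gives avg(2, 4.5) = 3.25. The usual fix is to have the combiner emit (sum, count) pairs and compute the final average only in the reducer.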
