Unit 3 Bba
MapReduce
MapReduce on a Cluster
The schematic above depicts how the mapper converts the input into key-value pairs, which are then shuffled and sorted. The reducer aggregates the mapped, shuffled, and sorted data and sends the desired output to the client node. This is the basic idea of MapReduce.
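The map, shuffle-and-sort, and reduce steps above can be sketched as a minimal single-machine simulation (plain Python, not the Hadoop API; the function names here are illustrative):

```python
from collections import defaultdict

def mapper(line):
    # Map: emit one (word, 1) pair per word in the input line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle and sort: group all values by key, keys in sorted order.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    # Reduce: aggregate the grouped values for each key.
    return (key, sum(values))

lines = ["Geeks For Geeks For"]
mapped = [pair for line in lines for pair in mapper(line)]
result = [reducer(key, values) for key, values in shuffle(mapped)]
print(result)  # [('For', 2), ('Geeks', 2)]
```

On a real cluster the mappers and reducers run on different nodes and the shuffle moves data over the network, but the data flow is the same.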
Terminology:
Mapper − Maps the input key/value pairs to a set of intermediate key/value pairs.
NameNode − Node that manages the Hadoop Distributed File System (HDFS).
DataNode − Node where the data resides before any processing takes place.
MasterNode − Node where the JobTracker runs and which accepts job requests from clients.
JobTracker − Schedules jobs, assigns them to Task Trackers, and tracks their progress.
Job Tracker
The role of the Job Tracker is to accept MapReduce jobs from clients and process the data by consulting the NameNode. In response, the NameNode provides the relevant metadata to the Job Tracker.
The Job Tracker manages all the resources and all the jobs across the cluster, and schedules each map task on a Task Tracker running on the same data node, since there can be hundreds of data nodes available in the cluster.
Task Tracker
The Task Tracker runs on each DataNode, executes the map and reduce tasks assigned to it by the Job Tracker, and reports their status back through periodic heartbeats.
1. Parallel Processing:
In MapReduce, we divide the job among multiple nodes, and each node works on a part of the job simultaneously. MapReduce is thus based on the Divide and Conquer paradigm, which helps us process the data using different machines. As the data is processed by multiple machines in parallel instead of a single machine, the time taken to process it is reduced tremendously.
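As a rough sketch of this divide-and-conquer idea (plain Python with a standard-library thread pool standing in for cluster nodes; a real cluster spreads the chunks across machines):

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk):
    # Each worker counts the words in its own part of the job.
    return sum(len(line.split()) for line in chunk)

data = [f"line number {i}" for i in range(1000)]  # 3 words per line

# Divide: split the input into four chunks, one per worker.
chunks = [data[i::4] for i in range(4)]

# Conquer: process the chunks in parallel, then combine the partial counts.
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(count_words, chunks))

print(total)  # 3000
```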
2. Data Locality:
Instead of moving the data to the processing unit, we move the processing unit to the data in the MapReduce framework. In the traditional system, we used to bring the data to the processing unit and process it there. But as the data grew very large, bringing this huge amount of data to the processing unit posed the following issues:
Moving huge data to the processing unit is costly and degrades network performance.
Processing takes time, as the data is processed by a single unit, which becomes the bottleneck.
The master node can get overburdened and may fail.
Hadoop is an open-source framework mainly used for storing, maintaining, and analyzing large amounts of data on clusters of commodity hardware, which makes it, in effect, a data management tool. Hadoop also possesses a scale-out storage property, which means that we can scale the number of nodes up or down as our requirements change, which is a really useful feature.
Hadoop can run in three modes:
1. Standalone Mode
2. Pseudo-distributed Mode
3. Fully-Distributed Mode
1. Standalone Mode:
In Standalone Mode none of the daemons run, i.e. NameNode, DataNode, Secondary NameNode, Job Tracker, and Task Tracker.
Standalone Mode also means that we install Hadoop on only a single system. By default, Hadoop is set up to run in this Standalone Mode, which we can also call Local Mode. We mainly use Hadoop in this mode for learning, testing, and debugging.
Hadoop runs fastest in this mode among all three modes. HDFS (Hadoop Distributed File System), one of the major components of Hadoop used for storage, is not utilized in this mode. You can think of HDFS as similar to the file systems available for Windows, i.e. NTFS (New Technology File System) and FAT32 (File Allocation Table, which uses 32-bit table entries). When Hadoop works in this mode there is no need to configure the files hdfs-site.xml, mapred-site.xml, and core-site.xml for the Hadoop environment. In this mode, all of your processes run in a single JVM (Java Virtual Machine), and this mode can only be used for small development purposes.
2. Pseudo-distributed Mode:
In Pseudo-distributed Mode we also use only a single node, but the main thing is that the cluster is simulated, which means that all the processes inside the cluster run independently of each other.
All the daemons, i.e. NameNode, DataNode, Secondary NameNode, Resource Manager, Node Manager, etc., run as separate processes in separate JVMs (Java Virtual Machines), or, in other words, as different Java processes; that is why it is called pseudo-distributed.
One thing we should remember is that since we are using only a single-node setup, all the master and slave processes are handled by the single system. The NameNode and Resource Manager act as masters, and the DataNode and Node Manager act as slaves. The Secondary NameNode also acts as a master; its purpose is simply to keep an hourly backup of the NameNode.
3. Fully-Distributed Mode:
This is the most important mode, in which multiple nodes are used: a few of them run the master daemons, NameNode and Resource Manager, and the rest run the slave daemons, DataNode and Node Manager. Here Hadoop runs on a cluster of machines or nodes, and the data used is distributed across the different nodes. This is the production mode of Hadoop.
Classes of MapReduce:
Mapper Class
The first stage in data processing using MapReduce is the Mapper class. Here, the Record Reader processes each input record and generates the respective key-value pair. Hadoop's Mapper stores this intermediate data on the local disk.
Input Split: The logical representation of the data. It represents a block of work that is processed by a single map task in the MapReduce program.
Record Reader: It interacts with the input split and converts the obtained data into key-value pairs.
Reducer Class
The intermediate output generated by the Mapper is fed to the Reducer, which processes it and generates the final output, which is then saved in HDFS.
Driver Class
The major component in a MapReduce job is the Driver class. It is responsible for setting up a MapReduce job to run in Hadoop. We specify the names of the Mapper and Reducer classes along with the data types and their respective job names.
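The three classes can be sketched roughly as follows in plain Python (not the Hadoop API; the class and method names here are illustrative, not Hadoop's):

```python
from collections import defaultdict

class WordCountMapper:
    def map(self, record):
        # Emit one (word, 1) pair per word in the record.
        return [(word, 1) for word in record.split()]

class WordCountReducer:
    def reduce(self, key, values):
        # Aggregate all values for one key into the final count.
        return (key, sum(values))

class Driver:
    """Wires the Mapper and Reducer together, as a Hadoop driver does."""
    def __init__(self, mapper, reducer):
        self.mapper, self.reducer = mapper, reducer

    def run(self, records):
        groups = defaultdict(list)
        for record in records:                       # map phase
            for key, value in self.mapper.map(record):
                groups[key].append(value)
        return [self.reducer.reduce(key, values)     # reduce phase
                for key, values in sorted(groups.items())]

job = Driver(WordCountMapper(), WordCountReducer())
print(job.run(["Geeks For Geeks For"]))  # [('For', 2), ('Geeks', 2)]
```

In real Hadoop code the driver instead configures a `Job` object (input/output paths, key and value types, mapper and reducer classes) and submits it to the cluster.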
MapReduce – Combiners
Map-Reduce is a programming model used for processing large data-sets over distributed systems in Hadoop. The Map phase and the Reduce phase are the two main parts of any Map-Reduce job. Map-Reduce applications are limited by the bandwidth available on the cluster because data must move from the Mappers to the Reducers.
For example, suppose we have a 1 Gbps (gigabit per second) network in our cluster and we are processing data in the range of hundreds of PB (petabytes). Moving such a large dataset over a 1 Gbps link takes far too long. The Combiner solves this problem by minimizing the data shuffled between Map and Reduce.
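To put a rough number on that (a back-of-the-envelope estimate that ignores protocol overhead and parallel links):

```python
# Time to move 100 PB over a single 1 Gbps link.
petabytes = 100
bits = petabytes * 10**15 * 8      # 100 PB expressed in bits
link_bps = 10**9                   # 1 Gbps
seconds = bits / link_bps
days = seconds / 86_400
print(round(days))  # roughly 9259 days, i.e. over 25 years
```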
What is a combiner?
How a combiner works
Advantages of combiners
Disadvantages of combiners
What is a combiner?
The Combiner always works between the Mapper and the Reducer. The output produced by the Mapper is the intermediate output, in the form of key-value pairs, and it is massive in size.
If we feed this huge output directly to the Reducer, the result is increased network congestion. To minimize this congestion, we place a Combiner between the Mapper and the Reducer. Combiners are also known as semi-reducers.
Consider an example where two Mappers contain different data: the main text file is divided between two Mappers, and each Mapper is assigned a different line of our data. Since we have two lines of data, we have two Mappers, one per line. The Mappers produce the intermediate key-value pairs, where a particular word is the key and its count is the value. For example, for the data Geeks For Geeks For, the key-value pairs are shown below.
// Key Value pairs generated for data Geeks For Geeks For
(Geeks,1)
(For,1)
(Geeks,1)
(For,1)
The key-value pairs generated by the Mapper are known as the intermediate key-value pairs, or the intermediate output of the Mapper. We can minimize the number of these key-value pairs by introducing a Combiner for each Mapper in our program. In our case, each Mapper generates 4 key-value pairs. Since these intermediate key-value pairs are not ready to be fed directly to the Reducer (that could increase network congestion), the Combiner combines them before sending them to the Reducer.
The Combiner combines these intermediate key-value pairs according to their key. For the data Geeks For Geeks For, the Combiner partially reduces them by merging the pairs with the same key, generating the new key-value pairs shown below.
// Key-value pairs after the Combiner runs on Geeks For Geeks For
(Geeks,2)
(For,2)
With the help of the Combiner, the Mapper output is partially reduced in size (fewer key-value pairs) before being made available to the Reducer, which improves performance. The Reducer then reduces the output obtained from the Combiners and produces the final output, which is stored in HDFS (Hadoop Distributed File System).
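The effect can be sketched in plain Python (the function names here are illustrative; in real Hadoop code a combiner is a Reducer class registered with `job.setCombinerClass`):

```python
from collections import Counter

def mapper(line):
    # Intermediate output: one (word, 1) pair per word.
    return [(word, 1) for word in line.split()]

def combine(pairs):
    # Combiner: merge pairs with the same key before the shuffle.
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return list(counts.items())

pairs = mapper("Geeks For Geeks For")
print(len(pairs))           # 4 pairs leave the Mapper
print(len(combine(pairs)))  # only 2 pairs cross the network
```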
Advantages of combiners
Reduces the time taken for transferring the data from Mapper to Reducer.
Reduces the size of the intermediate output generated by the Mapper.
Improves performance by minimizing Network congestion.
Reduces the workload on the Reducer: Combiners can help reduce the amount of data
that needs to be processed by the Reducer. By performing some aggregation or
reduction on the data in the Mapper phase itself, combiners can reduce the number of
records that are passed on to the Reducer, which can help improve overall
performance.
Improves fault tolerance: Combiners can also help improve fault tolerance in
MapReduce. In case of a node failure, the MapReduce job can be re-executed from
the point of failure. Since combiners perform some of the aggregation or reduction
tasks in the Mapper phase itself, this reduces the amount of work that needs to be re-
executed, which can help improve fault tolerance.
Improves scalability: By reducing the amount of data that needs to be transferred
between the Mapper and Reducer, combiners can help improve the scalability of
MapReduce jobs. This is because the amount of network bandwidth required for
transferring data is reduced, which can help prevent network congestion and improve
overall performance.
Helps optimize MapReduce jobs: Combiners can be used to optimize MapReduce
jobs by performing some preliminary data processing before the data is sent to the
Reducer. This can help reduce the amount of processing required by the Reducer,
which can help improve performance and reduce overall processing time.
Disadvantages of combiners
The intermediate key-value pairs generated by the Mappers are stored on local disk, and the combiners run on them later to partially reduce the output, which results in expensive disk input-output.
A Map-Reduce job cannot depend on the Combiner's function being applied, because there is no guarantee of its execution.
Increased resource usage: Combiners can increase the resource usage of MapReduce
jobs since they require additional CPU and memory resources to perform their
operations. This can be especially problematic in large-scale MapReduce jobs that
process huge amounts of data.
Combiners may not always be effective: While combiners can help reduce the amount
of data transferred between the Mapper and Reducer, they may not always be
effective in doing so. This is because the effectiveness of combiners depends on the
data being processed and the operations being performed. In some cases, using
combiners may actually increase the amount of data transferred, which can reduce
overall performance.
Combiners can introduce data inconsistencies: Since combiners perform partial
reductions on the data, they can introduce inconsistencies in the output if they are not
implemented correctly. This can be especially problematic if the combiner performs
operations that are not associative or commutative, which can lead to incorrect results.
Increased complexity: Combiners can add complexity to MapReduce jobs since they
require additional logic to be implemented in the code. This can make the code more
difficult to maintain and debug, especially if the combiner logic is complex.
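As an illustration of the correctness point above: averaging is the classic operation that breaks when used directly as a combiner, because a mean of means is not the overall mean (a plain Python sketch, not Hadoop code):

```python
def mean(values):
    return sum(values) / len(values)

# Two mappers hold different parts of the data.
part_a, part_b = [1, 2, 3], [10, 20]

correct = mean(part_a + part_b)             # 36 / 5 = 7.2
# Wrong: using mean itself as the combiner, then averaging the partials.
wrong = mean([mean(part_a), mean(part_b)])  # (2 + 15) / 2 = 8.5

print(correct, wrong)  # 7.2 8.5
# A correct combiner would instead emit (sum, count) pairs:
partials = [(sum(part_a), len(part_a)), (sum(part_b), len(part_b))]
total, n = map(sum, zip(*partials))
print(total / n)  # 7.2 again
```

Summing and counting are associative and commutative, so they can safely be pushed into the combiner; dividing must wait until the Reducer.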