Unit 2: HDFS and MapReduce


MCA31: Big Data Analytics and Visualization

BDA Syllabus:
Module | Detailed Contents                   | Hrs.
1      | Introduction to Big Data and Hadoop | 6
2      | HDFS and MapReduce                  | 6
3      | NoSQL                               | 5
4      | Hadoop Ecosystem: HIVE and PIG      | 6
5      | Apache Kafka & Spark                | 9
6      | Data Visualization                  | 8
Unit 2: HDFS and MapReduce (06 Hours)

Sr. No. | Topics
1 | HDFS: HDFS architecture, Features of HDFS, Rack Awareness, HDFS Federation
2 | MapReduce: The Map Task, The Reduce Task
3 | MapReduce: Grouping by Key, Partitioner and Combiners, Details of MapReduce Execution
4 | Algorithms Using MapReduce: Matrix and Vector Multiplication by MapReduce
5 | Computing Selection and Projection by MapReduce
6 | Computing Grouping and Aggregation by MapReduce
7 | Self-Learning Topics: Concept of Sorting and Natural Joins


HDFS Federation
• A key limitation of the original HDFS implementation is its single NameNode.
• Because all file metadata is stored in memory, the amount of memory in the NameNode limits the number of files that can be held on a Hadoop cluster.
• To overcome this limitation and to scale the name service horizontally, Hadoop 0.23 introduced HDFS Federation, which is based on multiple independent NameNodes/namespaces.
• Overview of the current HDFS: it has two main layers:

Namespace
– The Namespace manages directories, files and blocks. It supports file system operations such as creation, modification, deletion and listing of files and directories.

Block Storage
– Block Storage has two parts:
• Block Management maintains the membership of DataNodes in the cluster. It supports block-related operations such as creation, deletion, modification and getting the location of blocks. It also takes care of replica placement and replication.
• Physical Storage stores the blocks and provides read/write access to them.
Benefits of HDFS Federation:
Namespace Scalability:
• While HDFS cluster storage scales horizontally with the addition of DataNodes, the namespace does not.
• Currently the namespace can only be scaled vertically on a single NameNode.
• The NameNode stores the entire file system metadata in memory.
• This limits the number of blocks, files, and directories supported on the file system to what can be accommodated in the memory of a single NameNode.
• A large deployment (or a deployment with a lot of small files) benefits from scaling the namespace by adding more NameNodes to the cluster.
Performance:
• File system operation throughput is limited by the single NameNode.
• Adding more NameNodes to the cluster scales the file system read/write throughput.
Isolation:
• A single NameNode offers no isolation in a multi-user environment.
• An experimental application that overloads the NameNode can slow down other production applications.
• With multiple NameNodes, different categories of applications and users can be isolated to different namespaces.
• In order to scale the name service horizontally, federation uses multiple independent NameNodes/namespaces.
• The DataNodes are used as common storage for blocks by all the NameNodes.
• Each DataNode registers with all the NameNodes in the cluster.
• DataNodes send periodic heartbeats and block reports, and handle commands from the NameNodes.

• A Block Pool is a set of blocks that belong to a single namespace. DataNodes store blocks for all the block pools in the cluster.
• Each block pool is managed independently of the other block pools. This allows a namespace to generate block IDs for new blocks without the need for coordination with the other namespaces. The failure of one NameNode does not prevent the DataNodes from serving the other NameNodes in the cluster.



• A namespace and its block pool together are called a Namespace Volume. It is a self-contained unit of management.
• When a NameNode/namespace is deleted, the corresponding block pool at the DataNodes is deleted.
• Each namespace volume is upgraded as a unit during a cluster upgrade.


What is a rack?

• A rack is a collection of around 40-50 DataNodes connected to the same network switch.
• If the network switch goes down, the whole rack becomes unavailable.
• A large Hadoop cluster is deployed across multiple racks.
What is Rack Awareness in Hadoop HDFS?

• In a large Hadoop cluster there are multiple racks, each consisting of DataNodes. Communication between DataNodes on the same rack is more efficient than communication between DataNodes residing on different racks.
• To reduce network traffic during file reads/writes, the NameNode chooses the closest DataNode for serving the client's read/write request.
• The NameNode maintains the rack ID of each DataNode to obtain this rack information.
• This concept of choosing the closest DataNode based on rack information is known as Rack Awareness.
• A default Hadoop installation assumes that all DataNodes reside on the same rack.
Why Rack Awareness?
The reasons for Rack Awareness in Hadoop are:
• To reduce network traffic during file reads/writes, which improves cluster performance.
• To achieve fault tolerance, even when a rack goes down.
• To achieve high availability of data, so that data is available even in unfavorable conditions.
• To reduce latency, that is, to complete file read/write operations with lower delay.
The NameNode uses a rack awareness algorithm while placing replicas in HDFS.
On a multi-rack cluster, the NameNode maintains block replication using built-in rack awareness policies, which are:
• No more than one replica is placed on any one node.
• No more than two replicas are placed on the same rack.
• The number of racks used for block replication should always be smaller than the number of replicas.
Hadoop Commands

• version: returns the Hadoop version.
  version

• mkdir: creates a directory in HDFS.
  mkdir <path>
  Eg. hdfs dfs -mkdir /new_dir

• ls: displays the list of files and directories in HDFS, i.e. the contents of the directory specified by <path>. It shows the name, permissions, owner, size and modification date of each entry.
  Eg. hdfs dfs -ls /user/dataflair
• fsck: checks the health of the Hadoop file system.
  hdfs fsck /

• touchz: creates a file in HDFS with a file size of 0 bytes.
  hdfs dfs -touchz /directory/filename
  Eg. hdfs dfs -touchz /new_edu/sample
Note: here we create a file named "sample" in the directory "new_edu" of HDFS, with a file size of 0 bytes.

• du: checks the file size.
  hdfs dfs -du -s /directory/filename
  Eg. hdfs dfs -du -s /new_edua/sample1

• cat: reads a file on HDFS and prints its content to standard output.
  hdfs dfs -cat /path/to/file_in_hdfs
  Eg. hdfs dfs -cat /new_edu/test

• text: takes a source file and outputs the file in text format.
  hdfs dfs -text /directory/filename
  Eg. hdfs dfs -text /new_edu/test

• copyFromLocal: copies a file from the local file system to HDFS.
  hdfs dfs -copyFromLocal <localsrc> <hdfs destination>
  Eg. hdfs dfs -copyFromLocal /home/edu/test /new_edu1
Note: here test is a file present in the local directory /home/edu; after the command executes, the test file is copied into the /new_edu1 directory of HDFS.

• copyToLocal: copies a file from HDFS to the local file system.
  hdfs dfs -copyToLocal <hdfs source> <localdst>
  Eg. hdfs dfs -copyToLocal /new_edu/test /home/edu1
Note: here test is a file present in the new_edu directory of HDFS; after the command executes, the test file is copied to the local directory /home/edu1.

• put: copies a single source, or multiple sources, from the local file system to the destination file system.
  hdfs dfs -put <localsrc> <destination>
  Eg. hdfs dfs -put /home/edureka/test /user
Note: the command copyFromLocal is similar to put, except that its source is restricted to a local file reference.

• get: copies files from HDFS to the local file system.
  hdfs dfs -get <src> <localdst>
  Eg. hdfs dfs -get /user/test /home/edureka
Note: the command copyToLocal is similar to get, except that its destination is restricted to a local file reference.


• count: counts the number of directories, files, and bytes under the paths that match the specified file pattern.
  hdfs dfs -count <path>
  Eg. hdfs dfs -count /user


• rm: removes a file from HDFS.
  hdfs dfs -rm <path>
  Eg. hdfs dfs -rm /new_edureka/test

• rm -r: removes an entire directory and all of its content from HDFS.
  hdfs dfs -rm -r <path>
  Eg. hdfs dfs -rm -r /new_edureka

• cp: copies files from source to destination. This command also allows multiple sources, in which case the destination must be a directory.
  hdfs dfs -cp <src> <dest>
  Eg. hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2
  Eg. hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir
• mv: moves files from source to destination. This command also allows multiple sources, in which case the destination needs to be a directory.
  hdfs dfs -mv <src> <dest>
  Eg. hdfs dfs -mv /user/hadoop/file1 /user/hadoop/file2

• expunge: empties the trash.
  hdfs dfs -expunge

• rmdir: removes a directory.
  hdfs dfs -rmdir <path>
  Eg. hdfs dfs -rmdir /user/Hadoop

• usage: returns the help for an individual command.
  hdfs dfs -usage <command>
  Eg. hdfs dfs -usage mkdir
Note: using the usage command you can get information about any command.
• help: displays help for a given command, or for all commands if none is specified.
  Eg. hdfs dfs -help
MapReduce
Introduction to MapReduce
• MapReduce is mainly the data processing component of Hadoop.
• It is a programming model for processing large data sets.
• It divides the data processing job into tasks and distributes those tasks across the nodes of the cluster.
• It consists of two phases:
– Map: Map converts an input dataset into another set of data in which individual elements are broken into key-value pairs.
– Reduce: the Reduce task takes the output of the Map phase as its input and combines those data tuples into a smaller set of tuples. It always executes after the Map job is done.
– In between Map and Reduce there is a small phase called shuffle and sort.
What is MapReduce?
MapReduce in a Nutshell
Advantage 1: Parallel Processing
Advantage 2: Data Locality – Processing to Storage
Election Vote Counting – the MapReduce Way
The MapReduce Way
Components of MapReduce
• The MapReduce process is divided into two applications: the Job Tracker and the Task Tracker.
Job Tracker:
– It is responsible for scheduling job runs and managing computational resources across the cluster.
– It runs on only one node of the cluster.
– It oversees the progress of each TaskTracker as they complete their individual tasks.
Task Tracker:
– Every MapReduce job is split into a number of tasks, which are assigned to the various TaskTrackers depending on which data is stored on each node.
– The TaskTracker runs on every slave node in the cluster.
JobHistoryServer:
– It saves historical information about completed tasks and applications.
Mapper
Reducers
• The basic unit of information used in MapReduce is a (key, value) pair.
• All types of structured and unstructured data need to be translated into this basic unit before the data is fed to the MapReduce model.
• As the name suggests, the MapReduce model consists of two separate routines, namely the Map function and the Reduce function.
• The computation on an input (i.e. on a set of pairs) in the MapReduce model occurs in three stages:
Step 1: the map stage
Step 2: the shuffle stage
Step 3: the reduce stage
• The map and shuffle phases distribute the data, and the reduce phase performs the computation.
• In the map stage, the mapper takes a single (key, value) pair as input and produces any number of (key, value) pairs as output.
• The shuffle stage is automatically handled by the MapReduce framework.
• In the reduce stage, the reducer takes all of the values associated with a single key k and outputs any number of (key, value) pairs.
• All of the maps need to finish before the reduce stage can begin. Since the reducer has access to all the values with the same key, it can perform sequential computations on these values.
• To summarize: for the reduce phase, the user designs a function that takes as input a list of values associated with a single key and outputs any number of pairs.
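In type terms, this flow is commonly summarized as follows (keys and values treated abstractly; the concrete types are chosen per job):

map:    (k1, v1)       -> list of (k2, v2)
reduce: (k2, list(v2)) -> list of (k3, v3)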
• Overall, a program in the MapReduce paradigm can consist of many rounds (usually called jobs) of different map and reduce functions, performed sequentially one after another.
Example 1:
• To understand MapReduce in depth, consider the following 3 sentences:
1. The quick brown fox
2. The fox ate the mouse
3. How now brown cow
• Our objective is to count the frequency of each word across all the sentences. Imagine that each of these sentences occupies a huge amount of memory and is therefore allotted to a different data node.
Execution:
• The Mapper takes over this unstructured data and creates key-value pairs.
• In this case the key is the word and the value is the count of this word in the text available at the data node.
• For instance, the 1st map node generates 4 key-value pairs: (the, 1), (brown, 1), (fox, 1), (quick, 1).
• The first 3 key-value pairs go to the first Reducer and the last key-value pair goes to the second Reducer.
• Similarly, the 2nd and 3rd map functions do the mapping for the other two sentences.
• Through shuffling, all the occurrences of the same word come to the same end.
• Once the key-value pairs are sorted, the reducer function operates on this structured data to come up with a summary.
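For example, tracing the three sentences through the pipeline (treating words case-insensitively, as in the description above) gives:

Map outputs:
  node 1: (the, 1), (quick, 1), (brown, 1), (fox, 1)
  node 2: (the, 2), (fox, 1), (ate, 1), (mouse, 1)
  node 3: (how, 1), (now, 1), (brown, 1), (cow, 1)

After shuffle and reduce (summing the values per word):
  (ate, 1), (brown, 2), (cow, 1), (fox, 2), (how, 1), (mouse, 1), (now, 1), (quick, 1), (the, 3)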
Example: MapReduce Way – Word Count Process
Some examples of MapReduce usage in industry:
At Google:
⮚ Index building for Google Search
⮚ Article clustering for Google News
⮚ Statistical machine translation
At Yahoo!:
⮚ Index building for Yahoo! Search
⮚ Spam detection for Yahoo! Mail
At Facebook:
⮚ Data mining
⮚ Ad optimization
⮚ Spam detection
At Amazon:
⮚ Product clustering
⮚ Statistical machine translation
Speculative Execution

• In MapReduce, jobs are broken into tasks and the tasks are run in parallel to make the overall job execution time smaller than it would otherwise be if the tasks ran sequentially. Now, among the divided tasks, if one task takes more time than desired, the overall execution time of the job increases.


• Tasks may be slow for various reasons, including hardware degradation or software misconfiguration, but the causes may be hard to detect since the tasks may still complete successfully, only after a longer time than expected.
• Apache Hadoop does not fix or diagnose slow-running tasks. Instead, it tries to detect when a task is running slower than expected and launches another, equivalent task as a backup (the backup task is called a speculative task). This process is called speculative execution in MapReduce.
• Speculative execution in Hadoop does not mean launching duplicate tasks at the same time so they can race; that would waste resources in the cluster. Rather, a speculative task is launched only after a task has run for a significant amount of time and the framework detects that it is running slowly compared to the other tasks of the same job.
• When a task completes successfully, any duplicate tasks that are still running are killed, since they are no longer needed.
• If the original task finishes before the speculative task, the speculative task is killed; on the other hand, if the speculative task finishes first, the original one is killed. Speculative execution in Hadoop is just an optimization; it is not a feature to make jobs run more reliably.
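If needed, speculative execution can also be switched off per job. A minimal sketch using the classic mapred API used elsewhere in this unit (the setters below are standard JobConf methods, but verify them against your Hadoop version):

JobConf jobConf = new JobConf(WordCount.class);
// do not launch backup (speculative) copies of map or reduce tasks for this job
jobConf.setMapSpeculativeExecution(false);
jobConf.setReduceSpeculativeExecution(false);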
• To summarize: the speed of a MapReduce job is dominated by its slowest task. MapReduce first detects slow tasks and then runs redundant (speculative) tasks, which will hopefully commit before the corresponding stragglers. This process is known as speculative execution. Only one copy of a straggler is allowed to be speculated; whichever copy of the task commits first becomes the definitive copy, and the other copy is killed by the framework.
Partitioner in MapReduce
• The Partitioner in MapReduce controls the partitioning of the keys of the intermediate mapper output.
• A hash function on the key (or a subset of the key) is used to derive the partition.
• The total number of partitions depends on the number of reduce tasks.
• Each mapper's output is partitioned according to the key, and records having the same key value go into the same partition (within each mapper); each partition is then sent to a reducer.
• The Partitioner class determines which partition a given (key, value) pair will go to.
• The partition phase takes place after the map phase and before the reduce phase.
Need for a Partitioner in Hadoop MapReduce
• A MapReduce job takes an input data set and produces a list of key-value pairs as the result of the map phase. The output from the map phase is then sent to the reduce tasks, which process the user-defined reduce function on the map outputs.
• Before the reduce phase, the map output is partitioned on the basis of the key and then sorted.
• This partitioning specifies that all the values for each key are grouped together, and makes sure that all the values of a single key go to the same reducer, thus allowing even distribution of the map output over the reducers.
• The Partitioner in Hadoop MapReduce redirects the mapper output to the reducers by determining which reducer is responsible for a particular key.
Default MapReduce Partitioner

• The default partitioner in Hadoop MapReduce is the HashPartitioner, which computes a hash value for the key and assigns the partition based on this result.
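Concretely, the hash-based assignment described above boils down to something like the following sketch (the class name here is made up; the built-in default is org.apache.hadoop.mapred.lib.HashPartitioner):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Sketch of hash partitioning: mask off the sign bit of the key's hash code
// and take the remainder modulo the number of reduce tasks.
public class SimpleHashPartitioner<K, V> implements Partitioner<K, V> {

  public void configure(JobConf job) {
    // nothing to configure for this sketch
  }

  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}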


How many Partitioners are there in Hadoop?
• The total number of Partitioners that run in Hadoop is equal to the number of reducers, i.e. the Partitioner divides the data according to the number of reducers, which is set by the JobConf.setNumReduceTasks() method.
• Thus, the data from a single partition is processed by a single reducer, and the Partitioner is created only when there are multiple reducers.
Poor Partitioning in Hadoop MapReduce

• If, in the input data, one key appears far more often than any other key, two mechanisms can be used to send data to the partitions:
– The key that appears most often is sent to one partition.
– All the other keys are sent to partitions according to their hashCode().
• However, if the hashCode() method does not distribute the other keys uniformly over the partition range, the data will not be sent evenly to the reducers.
• Poor partitioning of data means that some reducers will have more input data than others, i.e. they will have more work to do than the other reducers. The entire job then waits for one reducer to finish its extra-large share of the load.
• How to overcome poor partitioning in MapReduce?
To overcome poor partitioning in Hadoop MapReduce, we can create a custom Partitioner, which allows distributing the workload uniformly across the reducers; a sketch of such a Partitioner follows below.
• In conclusion, the Hadoop Partitioner allows even distribution of the map output over the reducers.
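A minimal sketch of such a custom Partitioner in the classic mapred API. The hot key "the" and the class name are made up for illustration; the point is that the frequent key gets its own partition while every other key is spread by hash code, as described above:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class HotKeyPartitioner implements Partitioner<Text, IntWritable> {

  private static final String HOT_KEY = "the";  // assumed frequent key, for illustration

  public void configure(JobConf job) {
    // no configuration needed for this sketch
  }

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (numPartitions == 1 || HOT_KEY.equals(key.toString())) {
      return 0;                                  // reserve partition 0 for the hot key
    }
    // spread the remaining keys over partitions 1 .. numPartitions-1 by hash code
    return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
  }
}

The driver would register it with jobConf.setPartitionerClass(HotKeyPartitioner.class).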


Hadoop Combiner

• The Hadoop Combiner is also known as a "Mini-Reducer": it summarizes the Mapper output records with the same key before they are passed to the Reducer.
• When we run a MapReduce job on a large dataset, the Mapper generates large chunks of intermediate data, and this intermediate data is passed to the Reducer for further processing, which leads to enormous network congestion.
• The MapReduce framework provides a function known as the Hadoop Combiner that plays a key role in reducing this network congestion.

• The primary job of the Combiner is to process the output data from the Mapper before passing it to the Reducer. It runs after the Mapper and before the Reducer, and its use is optional.

• Combiner: when the reduce function is both associative and commutative (e.g. sum, max), some of the work of the reduce function can be assigned to the combiner.
• Instead of sending all the Mapper data to the reducer, some values are computed on the map side itself by the combiner and then sent to the reducer.
• For example, if a particular word w appears k times among all the documents assigned to a map process, there will be k (w, 1) key-value pairs as a result of the Map execution, which the combiner can group into the single pair (w, k) before it is provided to the reduce task (see the configuration sketch below).
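Enabling a combiner is a one-line change in the driver. A sketch of the relevant configuration lines, reusing the WordCountReducer from the WordCount job shown later in this unit (safe here only because summing counts is associative and commutative):

JobConf jobConf = new JobConf(WordCount.class);
jobConf.setMapperClass(WordCountMapper.class);
jobConf.setCombinerClass(WordCountReducer.class);  // runs on the map side, combining (word, 1) pairs
jobConf.setReducerClass(WordCountReducer.class);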
MapReduce program without a Combiner
MapReduce program with a Combiner between the Mapper and the Reducer
Advantages of the MapReduce Combiner
• The Hadoop Combiner reduces the time taken for data transfer between mapper and reducer.
• It decreases the amount of data that needs to be processed by the reducer.
• The Combiner improves the overall performance of the reducer.
Disadvantages of the Hadoop Combiner in MapReduce
• The role of the combiner is only to reduce network congestion.
• MapReduce jobs cannot depend on the Hadoop combiner's execution because there is no guarantee that it will run.
• Hadoop may or may not execute a combiner; if required, it may also execute it more than once. So, MapReduce jobs should not depend on the Combiner's execution.

• In conclusion, the MapReduce Combiner plays a key role in reducing network congestion. It improves the overall performance of the reducer by summarizing the output of the Mapper.
Details of MapReduce Execution:
1. Run-time coordination in MapReduce
2. Responsibilities of MapReduce framework
3. MapReduce execution pipeline
4. Process pipeline
Driver class:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // configure the job: mapper, reducer, key/value types, input and output paths
    JobConf jobConf = new JobConf(WordCount.class);
    jobConf.setMapperClass(WordCountMapper.class);
    jobConf.setReducerClass(WordCountReducer.class);
    jobConf.setMapOutputKeyClass(Text.class);
    jobConf.setMapOutputValueClass(IntWritable.class);
    jobConf.setOutputKeyClass(Text.class);
    jobConf.setOutputValueClass(IntWritable.class);
    FileInputFormat.setInputPaths(jobConf, new Path(args[0]));
    FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));
    JobClient.runJob(jobConf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new WordCount(), args);
    System.exit(exitCode);
  }
}
Mapper class:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter rep) throws IOException {
    // split each input line into words and emit (word, 1) for every non-empty word
    String row = value.toString();
    String[] words = row.split(" ");
    for (String word : words) {
      if (word.length() > 0) {
        output.collect(new Text(word), new IntWritable(1));
      }
    }
  }
}
Reducer class:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter rep) throws IOException {
    // sum the counts emitted for this word and emit (word, total)
    int count = 0;
    while (values.hasNext()) {
      count += values.next().get();
    }
    output.collect(key, new IntWritable(count));
  }
}
1. Run-time coordination in MapReduce:
• Once the user submits the job JAR file, MapReduce transparently handles execution on the cluster.
• It takes care of:
– Scheduling: by using speculative execution, and
– Synchronization: to synchronize the Map and Reduce processes – the reduce phase cannot start until all Map processes are completed.
• It ensures that the jobs submitted by all users get a fair share of the cluster's execution.
2. Responsibilities of the MapReduce framework:
It takes care of scheduling, monitoring and rescheduling of failed tasks:
i. Provides overall coordination of execution
ii. Selects nodes for running mappers
iii. Starts and monitors mapper execution
iv. Chooses locations for reducer execution
v. Delivers the output of mappers to the reducer nodes
vi. Starts and monitors reducer execution
3. MapReduce execution pipeline: the main components are:
i. Driver:
– It is the main program that initializes the MapReduce job and gets back the status of the job execution.
– For each job it provides the specification and configuration of all its components (Mapper, Reducer, Combiner and Partitioner) along with the input and output formats.
Question: What do you mean by InputFormat? What is the role of the RecordReader in Hadoop MapReduce?

ii. Input Data: it resides in HDFS or HBase.

InputFormat:
✔ InputFormat defines how the input files are split and read.
✔ InputFormat creates the InputSplits.
✔ Based on the splits, InputFormat defines the number of map tasks in the mapping phase.
✔ The job driver invokes the InputFormat directly to decide the number of InputSplits and the locations of the map task execution.
InputSplit:
✔ It is the logical representation of the data.
✔ It represents the data that is processed by an individual Mapper. When you save a file in Hadoop, the file is broken down into blocks of 128 MB (the default configuration).
✔ HDFS is designed as a master-slave configuration, so the blocks of data are stored on the slaves (DataNodes) and the metadata about the data is stored on the master (NameNode).
✔ One map task is created for each InputSplit.
✔ The split is divided into records, and each record is processed by the mapper.
RecordReader:
✔ It communicates with the InputSplit and converts the data into key-value pairs suitable for reading by the mapper.
✔ By default, it uses TextInputFormat for converting the data into key-value pairs.
✔ The RecordReader communicates with the InputSplit until the file reading is completed.
✔ It assigns a byte offset (a unique number) to each line present in the file. These key-value pairs are then sent to the mapper for further processing.
iii. Mapper:
✔ For each map task, a new instance of the Mapper is instantiated.
✔ The Mapper processes each input record and generates an intermediate key-value pair.
✔ The output of a map task is the full collection of all these <key, value> pairs.
✔ In the event of a node failure before the map output is consumed by the reduce task, Hadoop reruns the map task on another node and re-creates the map output.

No. of Mappers = (total data size) / (input split size)

The mapper's output is passed to the combiner for further processing.
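As a worked instance of the formula above: with 1 GB (1024 MB) of input and the default 128 MB split size, the framework creates 1024 / 128 = 8 map tasks.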


iv. Combiner:
✔ The Combiner is also known as a 'Mini-Reducer'.
✔ The Combiner is optional and performs local aggregation on the mapper's output, which helps to minimize the data transfer between Mapper and Reducer, thereby improving the overall performance of the Reducer.
✔ The output of the Combiner is then passed to the Partitioner.
v. Partitioner:
✔ The Partitioner comes into the picture when we are working with more than one reducer.
✔ The Partitioner takes the output from the Combiners and performs partitioning.
✔ Partitioning of the output takes place on the basis of the key, and the output is then sorted.
✔ The HashPartitioner is the default Partitioner in MapReduce; it computes a hash value for the key and assigns the partition based on this result.
✔ The total number of Partitioners that run in Hadoop is equal to the number of reducers.
vi. Shuffling and Sorting:
Shuffling can start even before the map phase has finished, which saves some time and completes the tasks in less time.
The keys generated by the mapper are automatically sorted by MapReduce. Values passed to each reducer are not sorted and can be in any order.
Sorting helps the reducer to easily distinguish when a new reduce task should start, which saves time for the Reducer: the reducer starts a new reduce task when the next key in the sorted input data is different from the previous one.
Each reduce task takes key-value pairs as input and generates key-value pairs as output.
vii. Reducer:
✔ It takes the set of intermediate key-value pairs produced by the mappers as input and then runs a reducer function on each of them to generate the output.
✔ The output of the reducer is the final output, which is stored in HDFS.
✔ Reducers run in parallel as they are independent of one another.
✔ The user decides the number of reducers; by default, the number of reducers is 1.
viii. RecordWriter:
✔ It writes the output key-value pairs from the Reducer phase to the output files.
✔ The implementation used to write the output files of the job is defined by the OutputFormat.
ix. Output Files: the output is stored in these output files, which are generally stored in HDFS.
x. OutputFormat:
✔ The way these output key-value pairs are written to the output files by the RecordWriter is determined by the OutputFormat.
✔ The final output of the reducer is written to HDFS by the OutputFormat.
✔ The output files are stored in a file system.
4. Process pipeline:
1. The job driver uses the InputFormat to partition the map execution and initiates a JobClient.
2. The JobClient communicates with the JobTracker and submits the job for execution.
3. The JobTracker creates one map task for each split, as well as a set of reducer tasks.
4. A TaskTracker, present on every node of the cluster, controls the actual execution of the map tasks.
5. Once the TaskTracker starts the map job, it periodically sends a heartbeat message to the JobTracker to communicate that it is alive and also to indicate that it is ready to accept a new job for execution.
6. The JobTracker then uses a scheduler to allocate a task to the TaskTracker, using the heartbeat return value.
7. Once the task is assigned to the TaskTracker, it copies the job JAR file to the TaskTracker's local file system, along with the other files needed for the execution of the job, and creates an instance of a task runner (a child process).
8. The child process informs the parent (the TaskTracker) about the task's progress every few seconds until it completes the task.
9. When the last task of the job is complete, the JobTracker receives the notification and changes the status of the job to "Completed".
10. By periodically polling the JobTracker, the JobClient learns the job status.
Coping with Node Failure
• There is only ONE JobTracker per Hadoop cluster, and the JobTracker runs in its own JVM.
• All slave nodes are configured with the JobTracker node location.
• So if the JobTracker fails, all the jobs running on its slaves are halted and the whole MapReduce job is restarted.
• If a node on which a TaskTracker runs fails, the JobTracker, which monitors all TaskTrackers, detects the failure. Only the tasks that were running on this node are restarted; even a map task that had already completed is restarted, since its output was stored locally on the failed node.
• The JobTracker has to inform all the Reduce tasks that their input will be available from another location.
• If there is a failure at a Reduce node, the JobTracker sets it to "idle" and reschedules the Reduce task on another node.
Algorithms Using MapReduce

Matrix and vector multiplication

Reference: http://www.mathcs.emory.edu/~cheung/Courses/554/Syllabus/9-parallel/matrix-mult.html
Matrix Multiplication
Matrix Data Model for MapReduce
• We represent matrix M as a relation M(I, J, V), with tuples (i, j, m_ij), and matrix N as a relation N(J, K, W), with tuples (j, k, n_jk). Most matrices are sparse, so a large number of cells have the value zero. When we represent matrices in this form, we do not need to keep entries for the cells whose value is zero, which saves a large amount of disk space.
Matrix Multiplication – Mapper function
Let us consider a matrix multiplication example to visualize MapReduce. Consider the following 2x2 matrices:

A = | 1 2 |        B = | 5 6 |
    | 3 4 |            | 7 8 |

Mapper for matrix A: (k, v) = ((i, k), (A, j, Aij)) for all k
Mapper for matrix B: (k, v) = ((i, k), (B, j, Bjk)) for all i

Here i, j and k each run over 2 values, so when k = 1, i can take the two values 1 and 2, and each case has the two further values j = 1 and j = 2. Substituting all values in the formula:

Computing the mapper for matrix A:
k=1  i=1  j=1  ((1, 1), (A, 1, 1))
          j=2  ((1, 1), (A, 2, 2))
     i=2  j=1  ((2, 1), (A, 1, 3))
          j=2  ((2, 1), (A, 2, 4))
k=2  i=1  j=1  ((1, 2), (A, 1, 1))
          j=2  ((1, 2), (A, 2, 2))
     i=2  j=1  ((2, 2), (A, 1, 3))
          j=2  ((2, 2), (A, 2, 4))

Computing the mapper for matrix B:
i=1  j=1  k=1  ((1, 1), (B, 1, 5))
          k=2  ((1, 2), (B, 1, 6))
     j=2  k=1  ((1, 1), (B, 2, 7))
          k=2  ((1, 2), (B, 2, 8))
i=2  j=1  k=1  ((2, 1), (B, 1, 5))
          k=2  ((2, 2), (B, 1, 6))
     j=2  k=1  ((2, 1), (B, 2, 7))
          k=2  ((2, 2), (B, 2, 8))
Matrix Multiplication – Reducer function

The formula for the Reducer is:
Reducer(k, v): for each key (i, k), make a sorted Alist and Blist, compute the sum over j of (Aij * Bjk), and output ((i, k), sum).

(1, 1) => Alist = {(A, 1, 1), (A, 2, 2)}
          Blist = {(B, 1, 5), (B, 2, 7)}
          Now Aij x Bjk: [(1*5) + (2*7)] = 19  -------(i)
(1, 2) => Alist = {(A, 1, 1), (A, 2, 2)}
          Blist = {(B, 1, 6), (B, 2, 8)}
          Now Aij x Bjk: [(1*6) + (2*8)] = 22  -------(ii)
(2, 1) => Alist = {(A, 1, 3), (A, 2, 4)}
          Blist = {(B, 1, 5), (B, 2, 7)}
          Now Aij x Bjk: [(3*5) + (4*7)] = 43  -------(iii)
(2, 2) => Alist = {(A, 1, 3), (A, 2, 4)}
          Blist = {(B, 1, 6), (B, 2, 8)}
          Now Aij x Bjk: [(3*6) + (4*8)] = 50  -------(iv)

From (i), (ii), (iii) and (iv) we conclude that the final matrix is:
((1, 1), 19), ((1, 2), 22), ((2, 1), 43), ((2, 2), 50)
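A minimal MapReduce sketch of this algorithm in the classic mapred API used by the WordCount example. The input format (one non-zero cell per line, e.g. "A,1,2,2" or "B,2,1,7"), the class names, and the hard-coded 2x2 dimensions are assumptions made for this example; a real job would read the dimensions from the JobConf:

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MatrixMultiply {

  // Mapper: input lines are "A,i,j,value" or "B,j,k,value" (one non-zero cell per line).
  public static class MultiplyMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {

    private static final int I = 2;  // number of rows of A (hard-coded for the 2x2 example)
    private static final int K = 2;  // number of columns of B (hard-coded for the 2x2 example)

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> output, Reporter rep) throws IOException {
      String[] p = line.toString().split(",");
      int x = Integer.parseInt(p[1]);   // i for A entries, j for B entries
      int y = Integer.parseInt(p[2]);   // j for A entries, k for B entries
      String v = p[3];
      if (p[0].equals("A")) {
        // A(i,j) contributes to every cell (i, k): emit ((i,k), (A, j, Aij)) for all k
        for (int k = 1; k <= K; k++) {
          output.collect(new Text(x + "," + k), new Text("A," + y + "," + v));
        }
      } else {
        // B(j,k) contributes to every cell (i, k): emit ((i,k), (B, j, Bjk)) for all i
        for (int i = 1; i <= I; i++) {
          output.collect(new Text(i + "," + y), new Text("B," + x + "," + v));
        }
      }
    }
  }

  // Reducer: for one cell (i, k), gather the A and B entries by j and sum Aij * Bjk.
  public static class MultiplyReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, IntWritable> {

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, IntWritable> output, Reporter rep) throws IOException {
      Map<Integer, Integer> aList = new HashMap<Integer, Integer>();
      Map<Integer, Integer> bList = new HashMap<Integer, Integer>();
      while (values.hasNext()) {
        String[] p = values.next().toString().split(",");
        int j = Integer.parseInt(p[1]);
        int v = Integer.parseInt(p[2]);
        if (p[0].equals("A")) {
          aList.put(j, v);
        } else {
          bList.put(j, v);
        }
      }
      int sum = 0;
      for (Map.Entry<Integer, Integer> e : aList.entrySet()) {
        Integer b = bList.get(e.getKey());
        if (b != null) {
          sum += e.getValue() * b;   // Aij * Bjk, summed over j
        }
      }
      output.collect(key, new IntWritable(sum));   // ((i,k), sum)
    }
  }
}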
Relational Algebra Operations
Question: How are relational algebra operations executed using MapReduce?
• Selection
• Projection
• Union, Intersection, and Difference
• Natural Join
• Grouping and Aggregation
Assumptions:

⮚ R, S – relations
⮚ t, t′ – tuples
⮚ C – a condition for selection
⮚ A, B, C – subsets of attributes
⮚ a, b, c – attribute values for a given subset of attributes
Selection

• Map: for each tuple t in R, test whether it satisfies C. If so, produce the key-value pair (t, t). That is, both the key and the value are t.
• Reduce: the Reduce function is the identity. It simply passes each key-value pair to the output.


• Selection: σC(R) – apply condition C to each tuple of relation R.
• Produce as output a relation containing only the tuples that satisfy C; a minimal mapper sketch follows below.
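A minimal sketch of the selection mapper in the classic mapred API. The tuple encoding (one comma-separated tuple per line) and the concrete condition C (second attribute equals "Mumbai") are made up for illustration:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Each input line is one tuple t of R, stored as comma-separated attribute values.
public class SelectionMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable offset, Text tuple,
                  OutputCollector<Text, Text> output, Reporter rep) throws IOException {
    String[] attrs = tuple.toString().split(",");
    // condition C (illustrative): the second attribute equals "Mumbai"
    if (attrs.length > 1 && attrs[1].equals("Mumbai")) {
      output.collect(tuple, tuple);   // emit (t, t) only for tuples satisfying C
    }
  }
}

The Reduce function is the identity, so the job can use org.apache.hadoop.mapred.lib.IdentityReducer (or simply run as a map-only job).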
Projection

• A MapReduce implementation of πS(R):
• Map:
– For each tuple t in R, construct a tuple t′ by eliminating those components whose attributes are not in S.
– Emit a key/value pair (t′, t′).
• Reduce:
– For each key t′ produced by any of the Map tasks, fetch (t′, [t′, ..., t′]).
– Emit a key/value pair (t′, t′).
• Projection: πS(R)
• Given a subset S of the attributes of relation R,
• produce as output a relation containing only the tuples restricted to the attributes in S.
Union, Intersection and Difference
• These operations apply to sets of tuples in two relations that have the same schema.
• Union in MapReduce:
Suppose relations R and S have the same schema.
✔ Map tasks will be assigned chunks from either R or S.
✔ The mappers don't do much; they just pass the tuples on to the reducers.
✔ The reducers do duplicate elimination.
A MapReduce implementation of Union:
Map: for each tuple t in R or S, emit a key/value pair (t, t).
Reduce: for each key t, emit a key/value pair (t, t).
Intersection in MapReduce
• Suppose relations R and S have the same schema.
• The map function is the same (an identity mapper) as for union.
• The reduce function must produce a tuple only if both relations have that tuple.
• A MapReduce implementation of Intersection:
Map: for each tuple t in R or S, emit a key/value pair (t, t).
Reduce: if key t has the value list [t, t], emit a key/value pair (t, t); otherwise, emit a key/value pair (t, NULL).
Difference in MapReduce
• We have two relations R and S with the same schema.
• The only way a tuple t can appear in the output is if it is in R but not in S.
• The map function passes tuples from R and S to the reducer.
• NOTE: it must inform the reducer whether the tuple came from R or S.
• A MapReduce implementation of Difference:
Map: for a tuple t in R, emit a key/value pair (t, 'R'); for a tuple t in S, emit a key/value pair (t, 'S').
Reduce: if key t has the value list ['R'], emit a key/value pair (t, t); otherwise (i.e. the list is ['R', 'S'], ['S', 'R'] or ['S']), emit a key/value pair (t, NULL).
Grouping and Aggregation
• Grouping and Aggregation: γX(R)
✔ Given a relation R, partition its tuples according to their values in one set of attributes G. The set G is called the grouping attributes.
✔ Then, for each group, aggregate the values in certain other attributes. Aggregation functions: SUM, COUNT, AVG, MIN, MAX, ...
• In the notation, X is a list of elements, each of which can be:
✔ a grouping attribute, or
✔ an expression θ(A), where θ is one of the (five) aggregation functions and A is an attribute NOT among the grouping attributes.
Grouping and Aggregation: γX(R)
• The result of this operation is a relation with one tuple for each group.
• Example: imagine that a social-networking site has a relation Friends(User, Friend). The tuples are pairs (a, b) such that b is a friend of a.
– Question: compute the number of friends each member has, i.e. satisfy the query γUser, COUNT(Friend)(Friends).
Answer:
✔ This operation groups all the tuples by the value in their first component, so there is one group for each user.
✔ Then, for each group, the COUNT function counts the number of tuples in the group, i.e. the number of friends; a minimal MapReduce sketch of this query follows below.
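A minimal sketch of this grouping-and-aggregation query in the classic mapred API. The input encoding (one "user,friend" pair per line) and the class names are assumptions for this example:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class FriendCount {

  // Mapper: each input line is one Friends(User, Friend) tuple, "user,friend".
  public static class GroupMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, IntWritable> output, Reporter rep) throws IOException {
      String[] t = line.toString().split(",");
      // group by the first component (User); the 1 stands for one friend tuple
      output.collect(new Text(t[0]), ONE);
    }
  }

  // Reducer: COUNT(Friend) – count the tuples in each user's group.
  public static class CountReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text user, Iterator<IntWritable> ones,
                       OutputCollector<Text, IntWritable> output, Reporter rep) throws IOException {
      int friends = 0;
      while (ones.hasNext()) {
        ones.next();
        friends++;
      }
      output.collect(user, new IntWritable(friends));
    }
  }
}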
Natural Join by MapReduce

• To compute the natural join of relation R(A, B) with S(B, C), we need to find tuples that agree on their B components, that is, on the second component of tuples from R and the first component of tuples from S.
• Using the B value of a tuple from either relation as the key, the value will be the other component along with the name of the relation, so that the Reduce function can know where each tuple came from; a sketch follows below.
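A minimal reduce-side join sketch in the classic mapred API. The input encoding (R tuples as "R,a,b" lines and S tuples as "S,b,c" lines) and the class names are assumptions for this example:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class NaturalJoin {

  // Mapper: R tuples arrive as lines "R,a,b" and S tuples as lines "S,b,c".
  public static class JoinMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> output, Reporter rep) throws IOException {
      String[] t = line.toString().split(",");
      if (t[0].equals("R")) {
        // key = the B value, value = relation name plus the other component (a)
        output.collect(new Text(t[2]), new Text("R," + t[1]));
      } else {
        // key = the B value, value = relation name plus the other component (c)
        output.collect(new Text(t[1]), new Text("S," + t[2]));
      }
    }
  }

  // Reducer: for one B value, pair every a from R with every c from S.
  public static class JoinReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text b, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter rep) throws IOException {
      List<String> aValues = new ArrayList<String>();
      List<String> cValues = new ArrayList<String>();
      while (values.hasNext()) {
        String[] v = values.next().toString().split(",");
        if (v[0].equals("R")) {
          aValues.add(v[1]);
        } else {
          cValues.add(v[1]);
        }
      }
      // emit one joined tuple (a, b, c) per matching pair, keyed by the shared B value
      for (String a : aValues) {
        for (String c : cValues) {
          output.collect(b, new Text(a + "," + c));
        }
      }
    }
  }
}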
End of Unit 2
