Unit 2: HDFS and MapReduce
MCA31: Big Data Analytics and Visualization
HDFS has two main layers:
Namespace
– Namespace manages directories, files and blocks. It
supports file system operations such as creation,
modification, deletion and listing of files and directories.
Block Storage
– Block Storage has two parts:
• Block Management maintains the membership of
datanodes in the cluster. It supports block-related
operations such as creation, deletion, modification
and getting the location of blocks. It also takes care of
replica placement and replication.
• Physical Storage stores the blocks and provides
read/write access to them.
Benefits of HDFS Federation:
Namespace Scalability:
• While HDFS cluster storage scales horizontally with the addition of DataNodes,
the namespace does not.
• Currently the namespace can only be scaled vertically, on a single NameNode.
• The NameNode stores the entire file system metadata in memory.
• This limits the number of blocks, files, and directories supported on the file
system to what can be accommodated in the memory of a single NameNode.
• A large deployment (or a deployment with lots of small files) benefits from
scaling the namespace by adding more NameNodes to the cluster.
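As a rough illustration (using the commonly cited figure of about 150 bytes of
NameNode heap per file/directory/block object, an approximation that is not from
these slides): 100 million files of one block each is roughly 200 million
namespace objects, i.e. 200,000,000 × 150 B ≈ 30 GB of heap on a single NameNode.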
Performance:
• File system operation throughput is limited by the single NameNode.
• Adding more NameNodes to the cluster scales the file system read/write
operation throughput.
Isolation:
• A single NameNode offers no isolation in a multi-user environment.
• An experimental application that overloads the NameNode can slow
down other production applications.
• With multiple NameNodes, different categories of applications and users
can be isolated to different namespaces.
• In order to scale the name service horizontally, federation uses multiple
independent namenodes/namespaces.
• The datanodes are used as common storage for blocks by all the namenodes.
• Datanodes send periodic heartbeats and block reports to, and handle commands
from, the namenodes.
• Each namespace has its own block pool, so namenodes can generate Block IDs
for new blocks without the need for coordination with the other namespaces.
The failure of one namenode does not prevent the datanodes from serving the
other namenodes in the cluster; when a namenode/namespace is deleted, its
corresponding block pool at the datanodes is deleted.
HDFS shell commands:
du: displays the size of a given file or directory (with -s, an aggregate summary).
hdfs dfs -du -s /directory/filename
for eg: hdfs dfs -du -s /new_edua/sample1
cat: reads a file on HDFS and prints the content of that file to the
standard output.
hdfs dfs -cat /path/to/file_in_hdfs
for eg: hdfs dfs -cat /new_edu/test
text: takes a source file and outputs the file in text format.
hdfs dfs -text /directory/filename
for eg: hdfs dfs -text /new_edu/test
copyFromLocal: copies a file from the local file system to HDFS.
hdfs dfs -copyFromLocal <localsrc> <hdfs destination>
for eg: hdfs dfs -copyFromLocal /home/edu/test /new_edu1
Note: Here test is the file present in the local directory /home/edu;
after the command executes, the test file is copied into the
/new_edu1 directory of HDFS.
MapReduce
Introduction to MapReduce
• MapReduce is mainly the data processing component of Hadoop.
• It is a programming model for processing large data sets.
• It breaks a data-processing job into individual tasks and distributes those tasks across the nodes of the cluster.
• It consists of two phases –
– Map: converts the input dataset into another set of data, where individual elements are
broken down into key-value pairs.
– Reduce: takes the output of the Map tasks as its input and combines those data tuples
into a smaller set of tuples. It always executes after the map job is done.
– In between Map and Reduce there is a small phase called Shuffle and Sort.
(Figure slides: What is MapReduce?; MapReduce in a Nutshell; Advantage 1: Parallel Processing; Advantage 2: Data Locality – Processing to Storage; Election Votes Counting – MapReduce Way)
Components of MapReduce
• The MapReduce process is divided between two daemons: the JobTracker and the TaskTracker.
Job Tracker:
– It is responsible for scheduling job runs and managing computational resources across the
cluster.
– It runs on only one node of the cluster.
– It oversees the progress of each TaskTracker as they complete their individual tasks.
Task Tracker:
– Every MapReduce job is split into a number of tasks, which are assigned to the various
TaskTrackers depending on which data is stored on that node.
– The TaskTracker runs on every slave node in the cluster.
JobHistoryServer:
– It saves historical information about completed tasks and applications.
(Figure slides: Mapper; Reducers)
• The basic unit of information used in MapReduce is a (key, value) pair.
• All types of structured and unstructured data need to be translated into this basic unit before feeding the data to the MapReduce model.
• As the name suggests, the MapReduce model consists of two separate routines, namely the Map function and the Reduce function.
• The computation on an input (i.e. on a set of (key, value) pairs) in the MapReduce model occurs in three stages:
Step 1: the map stage
Step 2: the shuffle stage
Step 3: the reduce stage
• The map and shuffle phases distribute the data, and the reduce phase
performs the computation.
• In the map stage, the mapper takes a single (key, value) pair as input
and produces any number of (key, value) pairs as output.
• The shuffle stage is automatically handled by the MapReduce
framework.
• In the reduce stage, the reducer takes all of the values associated with
a single key k and outputs any number of (key, value) pairs.
• All of the maps need to finish before the reduce stage can begin.
Since the reducer has access to all the values with the same key,
it can perform sequential computations on these values.
• To summarize: for the reduce phase, the user designs a function that
takes as input a list of values associated with a single key and outputs
any number of (key, value) pairs.
• Overall, a program in the MapReduce paradigm can consist of
many rounds (usually called jobs) of different map and reduce
functions, performed sequentially one after another.
Example 1:
• To understand MapReduce in depth, consider the following 3
sentences:
1. The quick brown fox
2. The fox ate the mouse
3. How now brown cow
• Our objective is to count the frequency of each word across all the
sentences. Imagine that each of these sentences occupies a huge
amount of storage and is therefore allotted to a different data node.
Execution:
• The Mapper takes over this unstructured data and creates key-value pairs.
• In this case the key is the word and the value is the count of that word in the
text available at that data node.
• For instance, the 1st Map node generates 4 key-value pairs:
(the, 1), (brown, 1), (fox, 1), (quick, 1).
• The first 3 key-value pairs go to the first Reducer and the last key-value
pair goes to the second Reducer.
• Similarly, the 2nd and 3rd map functions do the mapping for the other
two sentences.
• Through shuffling, all occurrences of the same word arrive at the same Reducer.
• Once the key-value pairs are sorted, the reducer function operates
on this structured data to produce the summary.
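Tracing the three sentences above: the Map outputs are (the, 1), (quick, 1),
(brown, 1), (fox, 1) from node 1; (the, 2), (fox, 1), (ate, 1), (mouse, 1) from
node 2; and (how, 1), (now, 1), (brown, 1), (cow, 1) from node 3. After the
shuffle, the reducer for "the" receives [1, 2] and outputs (the, 3); "fox"
receives [1, 1] and outputs (fox, 2); "brown" receives [1, 1] and outputs
(brown, 2); every other word has a single value, e.g. (quick, 1).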
(Figure: Example – MapReduce Way Word Count Process)
Some examples of MapReduce usage in industry:
At Google:
⮚ Index building for Google Search
⮚ Article clustering for Google News
⮚ Statistical machine translation
At Yahoo!:
⮚ Index building for Yahoo! Search
⮚ Spam detection for Yahoo! Mail
At Facebook:
⮚ Data mining
⮚ Ad optimization
⮚ Spam detection
At Amazon:
⮚ Product clustering
⮚ Statistical machine translation
Speculative Execution
• In MapReduce, jobs are broken into tasks and the tasks are run in
parallel to make the overall job execution time smaller than it would
be if the tasks ran sequentially.
• If one of the tasks takes more time than desired (a straggler), Hadoop
speculatively launches a duplicate of that task on another node and uses
the output of whichever copy finishes first.
• The Partitioner computes a hash value for each intermediate key and
assigns the key to one of the reducers.
• Instead of sending all the Mapper data to the reducer, some values
are pre-aggregated on the map side itself by using a Combiner and then
they are sent to the reducer.
(Figure: MapReduce program without Combiner)
(Figure: MapReduce program with a Combiner in between Mapper and Reducer)
Advantages of MapReduce Combiner
• The Hadoop Combiner reduces the time taken for data transfer
between mapper and reducer.
• It decreases the amount of data that needs to be processed by
the reducer.
• The Combiner improves the overall performance of the reducer.
Disadvantages of Hadoop Combiner in MapReduce
• The role of the combiner is only to reduce network congestion.
• MapReduce jobs cannot depend on the Hadoop Combiner's execution
because there is no guarantee it will run.
• Hadoop may or may not execute a combiner; if required, it may also
execute it more than once. So the output of a MapReduce job must not
depend on whether (or how many times) the Combiner runs.
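A sketch of how a Combiner is plugged in, using the old org.apache.hadoop.mapred
API and the WordCountMapper/WordCountReducer classes from the WordCount example
later in this unit (because summing counts is associative and commutative, the
reducer itself can serve as the combiner):
jobConf.setMapperClass(WordCountMapper.class);
// The combiner performs map-side partial aggregation;
// Hadoop may run it zero, one or more times per map output
jobConf.setCombinerClass(WordCountReducer.class);
jobConf.setReducerClass(WordCountReducer.class);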
Details of MapReduce execution:
1. Run-time coordination in MapReduce
2. Responsibilities of MapReduce framework
3. MapReduce execution pipeline
4. Process pipeline
Driver class:
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class WordCount extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    // Configure the job: mapper, reducer, and key/value types
    JobConf jobConf = new JobConf(WordCount.class);
    jobConf.setMapperClass(WordCountMapper.class);
    jobConf.setReducerClass(WordCountReducer.class);
    jobConf.setMapOutputKeyClass(Text.class);
    jobConf.setMapOutputValueClass(IntWritable.class);
    jobConf.setOutputKeyClass(Text.class);
    jobConf.setOutputValueClass(IntWritable.class);
    // Input and output paths come from the command-line arguments
    FileInputFormat.setInputPaths(jobConf, new Path(args[0]));
    FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));
    JobClient.runJob(jobConf);
    return 0;
  }
  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new WordCount(), args);
    System.exit(exitCode);
  }
}
Map class:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output,
      Reporter rep) throws IOException {
    // Split each input line into words and emit (word, 1) for each one
    String row = value.toString();
    String[] words = row.split(" ");
    for (String word : words) {
      if (word.length() > 0) {
        output.collect(new Text(word), new IntWritable(1));
      }
    }
  }
}
Reducer class:
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Reducer;
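(The reducer body is cut off in the original slides; the following is a minimal
sketch consistent with the driver configuration above, i.e. Text keys and
IntWritable values in the old mapred API, rather than the original code.)
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output,
      Reporter rep) throws IOException {
    // Sum all the counts emitted for this word
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}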
Details of MapReduce execution:
1. Run-time coordination in MapReduce:
– The scheduler ensures that all jobs submitted by all users get a fairly
equal share of the cluster's execution resources.
Details of MapReduce execution:
2. Responsibilities of the MapReduce framework:
It takes care of scheduling, monitoring and rescheduling of failed
tasks:
i. Provides overall coordination of execution
ii. Selects nodes for running mappers
iii. Starts and monitors mapper execution
iv. Chooses locations for reducer execution
v. Delivers the output of mappers to the reducer nodes
vi. Starts and monitors reducer execution
Details of MapReduce execution:
3. MapReduce execution pipeline: Following are the main
components:
i. Driver:
– It is the main program that initializes the MapReduce job and gets
back the status of the job execution.
– For each job it provides the specification and configuration of
all its components (Mapper, Reducer, Combiner and
Partitioner) along with the input and output formats.
Details of MapReduce execution: 3. MapReduce execution pipeline:
Question: What do you mean by InputFormat? What is the role of
RecordReader in Hadoop MapReduce?
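(In brief, from standard Hadoop behaviour: an InputFormat describes how the
input is split into InputSplits, one per Mapper, and supplies a RecordReader
for each split; the RecordReader converts the raw bytes of a split into the
(key, value) records fed to the map function. For example, TextInputFormat's
LineRecordReader produces the byte offset of each line as the key and the
line's contents as the value.)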
From (i), (ii), (iii) and (iv) we conclude that the result pairs are ((1, 1), 19), ((1, 2), 22), ((2, 1), 43), ((2, 2), 50).
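(The worked example this fragment belongs to is missing from the slides. The
numbers are consistent with the standard one-round MapReduce matrix-multiplication
scheme applied to A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]]; the
reconstruction below is an assumption.)
• Map: for each element a(i, k) of A, emit ((i, j), (‘A’, k, a(i, k))) for every
column j of B; for each element b(k, j) of B, emit ((i, j), (‘B’, k, b(k, j)))
for every row i of A.
• Reduce: for each key (i, j), pair the ‘A’ and ‘B’ values that share the same k,
multiply, and sum: (1, 1): 1×5 + 2×7 = 19; (1, 2): 1×6 + 2×8 = 22;
(2, 1): 3×5 + 4×7 = 43; (2, 2): 3×6 + 4×8 = 50.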
Relational algebra operations
Question: How are relational algebra operations executed using
MapReduce?
• Selection
• Projection
Notation:
⮚ R, S – relations
⮚ t, t′ – tuples
⮚ C – a condition of selection
⮚ A, B, C – subsets of attributes
• Selection: σC(R)
– Map: For each tuple t in R, test whether it satisfies C. If so, produce the
key-value pair (t, t). That is, both the key and value are t.
– Reduce: The identity; it simply passes each key-value pair to the output.
• Projection: πS(R)
– Map: For each tuple t in R, construct a tuple t′ by eliminating those
components whose attributes are not in S. Emit the key-value pair (t′, t′).
– Reduce: For each key t′, emit one pair (t′, t′), eliminating duplicates.
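For example (an illustration, not from the slides): if R(A, B) = {(1, 2), (3, 4)}
and C is A > 1, the map stage emits only ((3, 4), (3, 4)), so σC(R) = {(3, 4)};
for πA(R), the mappers emit ((1), (1)) and ((3), (3)), and the reducers output
{(1), (3)} with duplicates removed.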
Union, Intersection and Difference
• These apply to the sets of tuples in two relations that have the same schema.
• Union in MapReduce:
• Map: For each tuple t in R or S, emit a key/value pair (t, t)
• Reduce: Emit the key/value pair (t, t) whether key t has one value or two
• Intersection in MapReduce:
• The map function is the same (an identity mapper) as for union
• The reduce function must produce a tuple only if both relations
have that tuple
• A MapReduce implementation of Intersection:
• Map: For each tuple t in R or S, emit a key/value pair (t, t)
• Reduce: If key t has value list [t, t], emit a key/value pair (t, t);
otherwise, emit a key/value pair (t, NULL)
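For example (an illustration): with R = {t1, t2} and S = {t2, t3}, union outputs
t1, t2 and t3 (key t2 simply arrives with the value list [t2, t2]); intersection
outputs only (t2, t2), since t2 is the only key whose value list is [t, t],
while keys t1 and t3 yield (t, NULL).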
Difference in MapReduce
• Suppose we have two relations R and S with the same schema.
• The only way a tuple t can appear in the output is if it is in R but not in S.
• The map function can pass tuples from R and S to the reducer.
• NOTE: it must inform the reducer whether the tuple came from R or S.
• A MapReduce implementation of Difference:
• Map: For a tuple t in R, emit a key/value pair (t, ‘R’); for a tuple t in S,
emit a key/value pair (t, ‘S’)
• Reduce: If key t has value list [‘R’], emit a key/value pair (t, t);
otherwise (i.e. the value list is [‘R’, ‘S’], [‘S’, ‘R’] or [‘S’]),
emit a key/value pair (t, NULL)
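For example (an illustration): with R = {t1, t2} and S = {t2, t3}, key t1 has
value list [‘R’] and yields (t1, t1); key t2 has [‘R’, ‘S’] and key t3 has [‘S’],
so both yield (t, NULL). Hence R − S = {t1}.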
Grouping and Aggregation
• Grouping and Aggregation: γG(R)
✔ Given a relation R, partition its tuples according to their values in one set
of attributes G. The set G is called the grouping attributes.
✔ Then, for each group, aggregate the values in certain other attributes.
Aggregation functions: SUM, COUNT, AVG, MIN, MAX, ...
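For example (an illustration): for R(A, B) = {(a, 1), (a, 2), (b, 5)} and
γA, SUM(B)(R), the mappers emit (a, 1), (a, 2) and (b, 5); the reducer for key a
computes SUM([1, 2]) = 3 and the reducer for key b computes 5, giving
{(a, 3), (b, 5)}.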
• For computing the natural join of relation R(A, B) with S(B, C), we need
to find tuples that agree on their B components, that is, the
second component of a tuple of R and the first component of a tuple
of S.
• Using the B value of tuples from either relation as the key, the
value will be the other component along with the name of the
relation, so that the Reduce function can know where each tuple
came from.
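For example (an illustration): with R(A, B) = {(1, 2), (3, 4)} and
S(B, C) = {(2, 5), (4, 6)}, the mappers emit (2, (R, 1)) and (4, (R, 3)) from R,
and (2, (S, 5)) and (4, (S, 6)) from S; the reducer for key 2 pairs (R, 1) with
(S, 5) to output (1, 2, 5), and the reducer for key 4 outputs (3, 4, 6).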
End of Unit 2