BSc In Information Technology

(Data Science)
SLIIT – 2019 (Semester 2)

Massive or BIG Data Processing

Introduction to Map Reduce
BIG Data Processing and Abstraction
MapReduce Overview
• A method for distributing computation across multiple nodes
• Each node processes the data that is stored at that node
• Consists of two main phases
• Map
• Reduce

MapReduce Features
• Automatic parallelization and distribution
• Fault-Tolerance
• Provides a clean abstraction for programmers to use
MapReduce Algorithm
• Iterate over a large number of records
• Extract something of interest from each (MAP)
• Shuffle and sort intermediate results
• Aggregate intermediate results (REDUCE)
• Generate final output
Key idea: provide a functional abstraction for these two operations

Programmers specify two functions:

map (k1, v1) → [(k2, v2)]
reduce (k2, [v2]) → [(k3, v3)]
• All values with the same key are sent to the same reducer
The execution framework handles everything else…
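The two signatures above can be illustrated with a minimal single-process sketch in Python. This is not Hadoop code: the names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical, and the "framework" here is only a dictionary that groups values by key.

```python
from collections import defaultdict

def map_fn(k1, v1):
    """map (k1, v1) -> [(k2, v2)]: emit (word, 1) for every word in a line."""
    return [(word, 1) for word in v1.split()]

def reduce_fn(k2, values):
    """reduce (k2, [v2]) -> [(k3, v3)]: sum the counts for one word."""
    return [(k2, sum(values))]

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy stand-in for the execution framework (one process, in memory)."""
    groups = defaultdict(list)
    for k1, v1 in records:                  # map phase
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)           # shuffle: group values by key
    output = []
    for k2 in sorted(groups):               # sort: keys are processed in order
        output.extend(reduce_fn(k2, groups[k2]))
    return output

# Input records are (line offset, line text) pairs; the sample text is made up.
records = [(0, "Dear Bear River"), (16, "Car Car River"), (30, "Dear Car Bear")]
result = run_mapreduce(records, map_fn, reduce_fn)
print(result)
```

A real framework runs the map and reduce calls on many nodes and handles failures; the sketch only shows the data flow between the two user-supplied functions.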
MapReduce Algorithm
• Handles scheduling
• Assigns workers to map and reduce tasks
• Handles “data distribution”
• Moves processes to data
• Handles synchronization
• Gathers, sorts, and shuffles intermediate data
• Handles errors and faults
• Detects worker failures and restarts the failed tasks
• Everything happens on top of a distributed FS (HDFS)
MapReduce Algorithm
The Mapper
• Reads data as key/value pairs
• The key is often discarded
• Outputs zero or more key/value pairs
Shuffle and Sort
• Output from the mapper is sorted by key
• All values with the same key are guaranteed to go to the same machine
The Reducer
• Called once for each unique key
• Gets a list of all values associated with a key as input
• The reducer outputs zero or more final key/value pairs
• Usually just one output per input key
The Combiner
• Called once for each unique key
• Gets a list of all values associated with a key as input
• The combiner outputs zero or more key/value pairs
• Usually just one output per input key
• Example: local counting for Word Count:
    def combiner(key, values):
        output(key, sum(values))
Partitions
• In MapReduce, intermediate output values are not usually all reduced together
• All values with the same key are presented to a single Reducer together
• More specifically, a different subset of the intermediate key space is assigned to each Reducer
• These subsets are known as partitions
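The local-counting combiner for Word Count can be tried out with a small Python sketch. The helper `combine_map_output` is hypothetical and stands in for the framework step that groups one mapper's output by key before the combiner is called.

```python
from collections import defaultdict

def combiner(key, values):
    """Local counting for Word Count: collapse one mapper's values for a key."""
    return (key, sum(values))

def combine_map_output(pairs):
    """Group a single mapper's (word, 1) pairs by key and apply the combiner."""
    local = defaultdict(list)
    for key, value in pairs:
        local[key].append(value)
    return [combiner(key, values) for key, values in sorted(local.items())]

# Output of one mapper for the (made-up) line "Car Car River".
mapper_output = [("Car", 1), ("Car", 1), ("River", 1)]
combined = combine_map_output(mapper_output)
print(combined)
```

Three pairs leave the mapper but only two cross the network, which is the point of the optimization.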
MapReduce Algorithm
• Programmers specify two functions:
• map (k1, v1) → [(k2, v2)]
• reduce (k2, [v2]) → [(k3, v3)]
• All values with the same key are reduced together
• The execution framework handles everything else…
• Not quite…usually, programmers also specify:
• partition (k2, number of partitions) → partition for k2
• Often a simple hash of the key, e.g., hash(k2) mod n
• Divides up key space for parallel reduce operations
• combine (k2, [v2]) → [(k2, v2’)]
• Mini-reducers that run in memory after the map
• Used as an optimization to reduce network traffic
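A hash partitioner of the form hash(k2) mod n can be sketched in a few lines of Python. The use of `zlib.crc32` as the hash is an assumption made here so that the result is stable across runs (Python's built-in `hash` for strings is randomized per process); Hadoop uses its own hash function.

```python
import zlib

def partition(k2, num_partitions):
    """Assign a key to a partition: a stable hash of the key, mod n."""
    return zlib.crc32(k2.encode("utf-8")) % num_partitions

keys = ["Bear", "Car", "Dear", "River", "Bear", "Car"]
assignments = [(k, partition(k, 3)) for k in keys]
for key, part in assignments:
    print(key, "-> reducer", part)
```

Because the partition depends only on the key, every occurrence of a key is routed to the same reducer.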
Word Count Example
MapReduce Program
A MapReduce program consists of the following 3 parts:

• Driver (the main method, which triggers the map and reduce methods)
• Mapper
• Reducer

It is better to place the map, reduce, and main methods in 3 different classes.
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            // Reuse the Text object to hold the current word, then emit (word, 1).
            value.set(tokenizer.nextToken());
            context.write(value, new IntWritable(1));
        }
    }
}

Input:
The key is the byte offset of each line in the text file: LongWritable
The value is each individual line (as shown in the figure at the right): Text
Output:
The key is a tokenized word: Text
The value is hardcoded to 1 in our case: IntWritable
Example – Dear 1, Bear 1, etc.
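The mapper's input and output pairs described above can be reproduced with a short Python sketch. The function `text_input_records` is hypothetical; it only mimics the idea of handing the mapper one (byte offset, line) pair at a time, and the sample text is made up.

```python
def text_input_records(data: bytes):
    """Yield (byte offset, line) pairs, one per line of the input."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)  # the next line starts after this line's bytes

data = b"Dear Bear River\nCar Car River\nDear Car Bear\n"
records = list(text_input_records(data))

# The mapper ignores the offset key and emits (word, 1) for each token.
mapper_output = [(word, 1) for _, line in records for word in line.split()]
print(records)
print(mapper_output[:2])
```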
Reducer Code:
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all the counts emitted for this word.
        int sum = 0;
        for (IntWritable x : values) {
            sum += x.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Both the input and the output of the Reducer are key-value pairs.
Input:
The key is one of the unique words generated after the sorting and shuffling phase: Text
The value is a list of integers corresponding to that key: IntWritable
Example – Bear, [1, 1], etc.
Output:
The key is each unique word present in the input text file: Text
The value is the number of occurrences of that word: IntWritable
Example – Bear, 2; Car, 3, etc.
We have aggregated the values in the list corresponding to each key and produced the final answer.
• In the driver class, we set the configuration of our MapReduce job to run in Hadoop.
• We specify the name of the job and the data types of the input/output of the mapper and reducer.
• We specify the names of the mapper and reducer classes.
• We specify the paths of the input and output folders.
• The method setInputFormatClass() specifies how a Mapper will read the input data, i.e. what the unit of work will be. Here, we have chosen TextInputFormat so that the mapper reads a single line at a time from the input text file.
• The main() method is the entry point for the driver. In this method, we instantiate a new Configuration object for the job.

Configuration conf = new Configuration();
// Job.getInstance replaces the deprecated Job(conf, name) constructor.
Job job = Job.getInstance(conf, "My Word Count Program");
job.setJarByClass(WordCount.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
Path outputPath = new Path(args[1]);
// Configuring the input/output path from the filesystem into the job
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, outputPath);
// Submit the job and wait for it to finish.
System.exit(job.waitForCompletion(true) ? 0 : 1);

hadoop jar hadoop-mapreduce-example.jar WordCount /sample/input /sample/output
HDFS Architecture
Distributed File System
• Don’t move data to workers… move workers to the data!
• Store data on the local disks of nodes in the cluster
• Start up the workers on the node that has the data local
• Why DFS?
• Not enough RAM to hold all the data in memory
• Disk access is slow, but disk throughput is reasonable
DFS - Features
• Single Namespace for entire cluster
• Data Coherency
• Write-once-read-many access model
• Client can only append to existing files
• Files are broken up into blocks
• Typically 128 MB block size
• Each block replicated on multiple DataNodes
• Intelligent Client
• Client can find location of blocks
• Client accesses data directly from DataNode
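The block and replication bullets above imply some simple arithmetic, sketched here in Python. The function `block_layout` is hypothetical, and the replication factor of 3 and the 1000 MB file size are assumptions chosen for the example.

```python
import math

BLOCK_SIZE_MB = 128  # typical block size quoted in the text

def block_layout(file_size_mb, replication=3, block_size_mb=BLOCK_SIZE_MB):
    """Return how a file of the given size is split and replicated."""
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    last_block_mb = file_size_mb - (num_blocks - 1) * block_size_mb
    return {
        "blocks": num_blocks,                       # logical blocks in the file
        "last_block_mb": last_block_mb,             # the final block may be partial
        "replicas": num_blocks * replication,       # block copies across DataNodes
        "raw_storage_mb": file_size_mb * replication,
    }

layout = block_layout(1000)
print(layout)
```

A 1000 MB file therefore occupies 8 blocks, the last of which holds only 104 MB.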
NameNode – Metadata
• Meta-data in Memory
• The entire metadata is in main memory
• No demand paging of meta-data
• Types of Metadata
• List of files
• List of Blocks for each file
• List of DataNodes for each block
• File attributes, e.g. creation time, replication factor
• A Transaction Log
• Records file creations, file deletions, etc.
NameNode – Responsibilities
• Managing the file system namespace:
• Holds file/directory structure, metadata, file-to-block mapping, access permissions, etc.
• Coordinating file operations:
• Directs clients to datanodes for reads and writes
• No data is moved through the namenode
• Maintaining overall health:
• Periodic communication with the datanodes
• Block re-replication and rebalancing
• Garbage collection
DataNode
• A Block Server
• Stores data in the local file system
• Stores meta-data of a block
• Serves data and meta-data to Clients
• Block Report
• Periodically sends a report of all existing blocks to the NameNode
• Facilitates Pipelining of Data
• Forwards data to other specified DataNodes
Block Placement
• Current Strategy
• One replica on local node
• Second replica on a remote rack
• Third replica on same remote rack
• Additional replicas are randomly placed
• Clients read from nearest replica
• Would like to make this policy pluggable
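The block placement strategy listed above can be written out as a small Python sketch. This is a simplified illustration, not HDFS code: `place_replicas` and the rack/node names are hypothetical, and the sketch assumes at least two racks and at least two nodes in the chosen remote rack.

```python
import random

def place_replicas(writer, topology, replication=3, rng=random):
    """Choose (rack, node) targets: local node, remote rack, same remote rack, then random."""
    local_rack, local_node = writer
    placements = [(local_rack, local_node)]                  # 1st replica: local node
    if replication >= 2:
        remote_rack = rng.choice([rack for rack in topology if rack != local_rack])
        second = rng.choice(topology[remote_rack])
        placements.append((remote_rack, second))             # 2nd replica: a remote rack
        if replication >= 3:
            third = rng.choice([n for n in topology[remote_rack] if n != second])
            placements.append((remote_rack, third))          # 3rd replica: same remote rack
    extra = replication - len(placements)
    if extra > 0:                                            # additional replicas: random nodes
        remaining = [(rack, node) for rack, nodes in topology.items()
                     for node in nodes if (rack, node) not in placements]
        placements.extend(rng.sample(remaining, min(extra, len(remaining))))
    return placements

topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
replicas = place_replicas(("rack1", "n1"), topology, replication=4, rng=random.Random(0))
print(replicas)
```

Putting two replicas on one remote rack limits cross-rack traffic during the write while still surviving the loss of the writer's whole rack.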
Data Correctness
• Use Checksums to validate data
• Use CRC32
• File Creation
• Client computes a checksum per 512 bytes
• DataNode stores the checksum
• File access
• Client retrieves the data and checksum from DataNode
• If Validation fails, Client tries other replicas
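The checksum scheme above (one CRC32 per 512 bytes, with fallback to another replica on failure) can be sketched in Python using the standard `zlib.crc32`. The helper names and the in-memory "replicas" are hypothetical; real HDFS stores the checksums alongside each block on the DataNode.

```python
import zlib

CHUNK = 512  # bytes covered by each checksum, as in the text

def compute_checksums(data: bytes):
    """File creation: the client computes one CRC32 per 512-byte chunk."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verify(data: bytes, checksums):
    """File access: recompute the checksums and compare with the stored ones."""
    return compute_checksums(data) == checksums

def read_with_fallback(replicas, checksums):
    """Return the first replica that validates; try other replicas on failure."""
    for data in replicas:
        if verify(data, checksums):
            return data
    raise IOError("all replicas failed checksum validation")

original = bytes(range(256)) * 5            # 1280 bytes, so 3 chunks
checksums = compute_checksums(original)
corrupted = b"\xff" + original[1:]          # simulate a damaged first byte
recovered = read_with_fallback([corrupted, original], checksums)
print(len(checksums), recovered == original)
```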

NameNode Failure
• A single point of failure
• Transaction Log stored in multiple directories
• A directory on the local file system
• A directory on a remote file system (NFS/CIFS)
Writing a File to HDFS
1. Client consults Name Node
2. Name Node replies with the location of the Data Node
3. Client writes block directly to one Data Node
• Data Nodes replicate the block
• Cycle repeats for next block
4. Data Node replies with acknowledgement
5. Client sends to the Name Node a request to close the file
Reading a File from HDFS
1. An application client wishing to read a file must first contact the Name Node to determine where the actual data is stored.
2. In response to the client request the Name Node returns:
• The relevant block ids
• The locations where the blocks are held
3. The client then contacts the Data Nodes to retrieve data.
• Important features of the design:
• Data is never moved through the Name Node
• All data transfer occurs directly between clients and the Data Nodes
• Communication with the Name Node only involves transfer of metadata
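The three read steps can be modelled with toy Python classes. Everything here is hypothetical (class names, block ids, the file path); the sketch only shows that the Name Node returns metadata while the bytes come from the Data Nodes.

```python
class DataNode:
    """Stores block contents, keyed by block id."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}

class NameNode:
    """Holds metadata only: file -> block ids, block id -> Data Node names."""
    def __init__(self):
        self.file_blocks = {}
        self.block_locations = {}

    def get_block_locations(self, path):
        return [(b, self.block_locations[b]) for b in self.file_blocks[path]]

def read_file(path, namenode, datanodes):
    data = b""
    # Steps 1-2: ask the Name Node for block ids and their locations.
    for block_id, locations in namenode.get_block_locations(path):
        # Step 3: fetch the block directly from one Data Node that holds it.
        node = datanodes[locations[0]]
        data += node.blocks[block_id]
    return data

datanodes = {"dn1": DataNode("dn1"), "dn2": DataNode("dn2")}
datanodes["dn1"].blocks["blk_1"] = b"hello "
datanodes["dn2"].blocks["blk_1"] = b"hello "
datanodes["dn2"].blocks["blk_2"] = b"world"

namenode = NameNode()
namenode.file_blocks["/sample/input/a.txt"] = ["blk_1", "blk_2"]
namenode.block_locations = {"blk_1": ["dn1", "dn2"], "blk_2": ["dn2"]}

content = read_file("/sample/input/a.txt", namenode, datanodes)
print(content)
```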
Introduction to YARN
MapReduce Vs YARN

• Limits Scalability
• Maximum cluster size: 4,000 nodes
• Maximum concurrent tasks: 40,000
• Availability – Job Tracker is SPOF (Single Point of Failure)
• Problem with Resource Utilization
• Predefined number of map slots and reduce slots for each TaskTracker
• Underutilization when more map tasks or reducer tasks are running
• Runs only MapReduce applications
Advantages of YARN

• YARN utilizes resources efficiently

• Centralized resource management
• Multiple applications in Hadoop, all sharing a common resource
• No more fixed map-reduce slots
• Supports applications that do not follow the MapReduce model
• Apache Spark, Apache Giraph, Tez
• Most JobTracker functions moved to Application Master
• One cluster can have many Application Masters
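The difference between fixed map/reduce slots and generic containers can be shown with a little arithmetic in Python. Both functions are hypothetical, and the sketch assumes every task fits in exactly one slot or one equally sized container.

```python
def runnable_with_slots(pending_maps, pending_reduces, map_slots, reduce_slots):
    """Fixed slots: map tasks may only use map slots, reduce tasks only reduce slots."""
    return min(pending_maps, map_slots) + min(pending_reduces, reduce_slots)

def runnable_with_containers(pending_maps, pending_reduces, containers):
    """Generic containers: any pending task may use any free container."""
    return min(pending_maps + pending_reduces, containers)

# A node with capacity for 4 tasks and a workload of 4 map tasks, 0 reduce tasks.
with_slots = runnable_with_slots(4, 0, map_slots=2, reduce_slots=2)
with_containers = runnable_with_containers(4, 0, containers=4)
print(with_slots, with_containers)
```

With fixed slots the two reduce slots sit idle during a map-heavy phase, while generic containers let all four tasks run at once.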
Components of YARN
Resource Manager (RM)
• Runs on Master Node
• Global resource scheduler
• Arbitrates system resources between competing applications

Node Manager (NM)

• Runs on slave nodes
• Communicates with RM

Container

• Created by the RM upon request
• Allocates a certain amount of resources (memory, CPU) on a slave node
• Applications run in one or more containers

Application Master

• One per application
• Framework/application specific
• Runs in a container
• Requests more containers to run the application
Fault Tolerance
• Task (Container) – Handled just like MRv1
• MR APPMaster will re-attempt tasks that complete with exceptions or stop responding (4 times by default)
• Applications with too many failed tasks are considered failed

• Application Master
• If the application fails or if the AM stops sending heartbeats, RM will re-attempt the whole application (2 times by default)
• MR AppMaster optional setting: Job Recovery
• If false, all tasks will re-run
• If true, MR APPMaster retrieves state of tasks when it restarts; only incomplete tasks will be re-run

• NodeManager
• If NM stops sending heartbeats to RM, it is removed from list of active nodes
• Tasks on the node will be treated as failed by MR AppMaster
• If the AppMaster node fails, it will be treated as a failed application

• Resource Manager

• No application or tasks can be launched if RM is unavailable

• Can be configured with High Availability
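The task-level retry behaviour can be sketched in Python. This is an illustration only: `run_with_retries` is hypothetical, and "4 times by default" is interpreted here as at most four attempts in total.

```python
MAX_TASK_ATTEMPTS = 4  # default quoted in the text, read as total attempts

def run_with_retries(task, max_attempts=MAX_TASK_ATTEMPTS):
    """Re-attempt a task that raises an exception, up to max_attempts times."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt), attempt
        except Exception as err:       # the task completed with an exception
            last_error = err
    # Too many failed attempts: the task (and its application) is considered failed.
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error

def flaky_task(attempt):
    """Made-up task that fails twice and then succeeds."""
    if attempt < 3:
        raise ValueError("simulated task exception")
    return "done"

result, attempts_used = run_with_retries(flaky_task)
print(result, attempts_used)
```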
MapReduce Program
• Every mapper class must extend the MapReduceBase class and implement the Mapper interface.
• The main part of the Mapper class is the 'map()' method, which accepts four arguments.
• At every call to the 'map()' method, a key-value pair ('key' and 'value' in this code) is passed.
• The 'map()' method begins by splitting the input text, which is received as an argument. It uses the tokenizer to split these lines into words.
• After this, a pair is formed using the record at the 7th index of the array 'SingleCountryData' and the value '1'.
• We select the 7th index because the Country data is located at the 7th index in the array 'SingleCountryData'.
• Please note that the input data is in the format below (where Country is at the 7th index, with 0 as the starting index):
• Transaction_date,Product,Price,Payment_Type,Name,City,State,Country,Account_Created,Last_Login,Latitude,Long
• The output of the mapper is again a key-value pair, which is emitted using the 'collect()' method of 'OutputCollector'.
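The mapper logic described above can be mirrored in Python. The function `map_country` is hypothetical and the sample row is invented for illustration; a plain split on commas is assumed to be enough, which would not hold for fields that contain quoted commas.

```python
COUNTRY_INDEX = 7  # Country is at the 7th index, counting from 0

def map_country(key, value):
    """Emit (country, 1) for one comma-separated input line."""
    single_country_data = value.split(",")
    return [(single_country_data[COUNTRY_INDEX], 1)]

# Invented row following the column order given in the text.
line = ("2009-01-02,Product1,1200,Visa,Alice,Dubai,Dubai,"
        "United Arab Emirates,2009-01-01,2009-01-03,25.2,55.3")
pairs = map_country(0, line)
print(pairs)
```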
MapReduce Program
• The input to the reduce() method is a key with a list of multiple values.
• For example, in our case, it will be: <United Arab Emirates, 1>, <United Arab Emirates, 1>, <United Arab Emirates, 1>, <United Arab Emirates, 1>, <United Arab Emirates, 1>, <United Arab Emirates, 1>.
• This is given to the reducer as <United Arab Emirates, [1, 1, 1, 1, 1, 1]>
• So, to accept arguments of this form, the first two data types used are Text and Iterator<IntWritable>. Text is the data type of the key and Iterator<IntWritable> is the data type of the list of values for that key.
• The next argument is of type OutputCollector<Text,IntWritable>, which collects the output of the reducer phase.
• The reduce() method begins by copying the key value and initializing the frequency count to 0.
• Then, a 'while' loop is used to iterate through the list of values associated with the key and calculate the final frequency by summing up all the values.
• Finally, the result is pushed to the output collector in the form of the key and the obtained frequency count.
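The reducer steps above translate directly into a short Python sketch. The function `reduce_country` and the list used as an output collector are hypothetical stand-ins for the Hadoop types named in the text.

```python
def reduce_country(key, values, output):
    """Sum the values for one key and push (key, frequency) to the collector."""
    country = key                 # copy the key
    frequency = 0                 # initialise the frequency count to 0
    for value in values:          # iterate through the values for this key
        frequency += value
    output.append((country, frequency))

collected = []
reduce_country("United Arab Emirates", iter([1, 1, 1, 1, 1, 1]), collected)
print(collected)
```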

