
MapReduce

Agenda: MapReduce
1. Data Flow
2. Map
3. Shuffle
4. Sort
5. Reduce
6. Hadoop Streaming
7. mrjob
8. Installation
9. Wordcount in mrjob
10. Executing mrjob
What is MapReduce?
History:
MapReduce was developed at Google in 2004 by Jeffrey Dean and
Sanjay Ghemawat (Dean & Ghemawat, 2004). Their paper,
"MapReduce: Simplified Data Processing on Large Clusters," was
inspired by the map and reduce functions commonly used in
functional programming.
What is MapReduce?
Hadoop MapReduce is the data processing layer of Hadoop. It processes the huge
amounts of structured and unstructured data stored in HDFS. MapReduce
processes data in parallel by dividing a job into a set of independent tasks,
and this parallel processing improves speed and reliability.
Hadoop MapReduce data processing takes place in two phases: the Map phase and
the Reduce phase.
•Map phase- The first phase of data processing. In this phase, we place all
the complex logic, business rules, and costly code.
•Reduce phase- The second phase of processing. In this phase, we perform
lightweight processing such as aggregation and summation.
MapReduce programming offers several benefits that help you gain
valuable insights from your big data:

•Scalability. Businesses can process petabytes of data stored in the
Hadoop Distributed File System (HDFS).
•Flexibility. Hadoop enables easier access to multiple sources of data
and multiple types of data.
•Speed. With parallel processing and minimal data movement, Hadoop
offers fast processing of massive amounts of data.
•Simplicity. Developers can write code in a choice of languages, including
Java, C++ and Python.
How does MapReduce work?
The MapReduce operations are:

•Map: The first phase of a MapReduce application is the map phase.
Within the map phase, a function (called the mapper) processes a
series of key-value pairs. The mapper processes each key-value pair
individually, producing zero or more output key-value pairs.
•As an example, consider a mapper whose purpose is to transform
sentences into words. The input to this mapper would be strings that
contain sentences, and the mapper's function would be to split the
sentences into words and output the words.
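The sentence-splitting mapper described above can be sketched in plain Python. This is a minimal, framework-free illustration (the sample sentences are invented for the example); in a real job, the framework feeds each input record to the mapper and collects the emitted pairs:

```python
def mapper(sentence):
    """Map phase: turn one input sentence into (word, 1) pairs."""
    for word in sentence.strip().lower().split():
        yield (word, 1)

# Example input records and the pairs the mapper emits
sentences = ["Deer Bear River", "Car Car River"]
pairs = [pair for s in sentences for pair in mapper(s)]
# pairs → [('deer', 1), ('bear', 1), ('river', 1), ('car', 1), ('car', 1), ('river', 1)]
```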
How does MapReduce work?
•Shuffle: This phase consumes the output of the Map phase. Its task is
to consolidate the relevant records from the Map phase output. In our
example, identical words are grouped together along with their
respective frequencies.
•Reduce: In this phase, the output values from the Shuffle phase are
aggregated. It combines the values from the Shuffle phase and returns a
single output value. In short, this phase summarizes the complete
dataset. In our example, it aggregates the values from the Shuffle
phase, i.e., calculates the total occurrences of each word.
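Continuing the word-count example, the shuffle and reduce steps can be sketched in plain Python. This is a single-process stand-in for what the framework does across nodes; the input pairs below are the kind of output a sentence-splitting mapper would emit:

```python
from collections import defaultdict

def shuffle(pairs):
    """Shuffle phase: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(word, counts):
    """Reduce phase: sum the grouped counts for one word."""
    return (word, sum(counts))

pairs = [('deer', 1), ('bear', 1), ('river', 1), ('car', 1), ('car', 1), ('river', 1)]
grouped = shuffle(pairs)   # {'deer': [1], 'bear': [1], 'river': [1, 1], 'car': [1, 1]}
counts = dict(reducer(w, c) for w, c in grouped.items())
# counts → {'deer': 1, 'bear': 1, 'river': 2, 'car': 2}
```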
Data Flow In MapReduce
MapReduce is used to process huge amounts of data. To handle the incoming data in a parallel and
distributed form, the data flows through several phases.

Phases of MapReduce data flow


Input reader
The input reader reads the incoming data and splits it into data blocks of an appropriate size (64 MB to 128
MB). Each data block is associated with a Map function.
Once the input reader has read the data, it generates the corresponding key-value pairs. The input files reside in HDFS.
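As an illustration, an input reader for text data can be sketched along the lines of Hadoop's default TextInputFormat, which emits one key-value pair per line, keyed by the line's byte offset within the split (the sample text is invented for the example):

```python
def input_reader(split_text):
    """Toy input reader: emit (byte offset, line) key-value pairs,
    as Hadoop's TextInputFormat does for each line of a split."""
    offset = 0
    for line in split_text.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line)

records = list(input_reader("Deer Bear River\nCar Car River\n"))
# records → [(0, 'Deer Bear River'), (16, 'Car Car River')]
```

Each record is then handed to a Map function, whose output keys need not resemble these offset keys at all.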
Data Flow In MapReduce
Phases of MapReduce data flow (continued)
Map function
The map function processes the incoming key-value pairs and generates the corresponding output key-value
pairs. The map input and output types may differ from each other.
Partition function
The partition function assigns the output of each Map function to the appropriate reducer. It is given the
key and value, and it returns the index of the reducer.
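A typical default partitioner hashes the key modulo the number of reducers (this is a sketch of the common hash-partitioning scheme, not Hadoop's exact implementation), so every pair with the same key is routed to the same reducer:

```python
def partition(key, num_reducers):
    """Hash partitioner: map a key to a reducer index in [0, num_reducers)."""
    return hash(key) % num_reducers

# Every occurrence of the same key lands on the same reducer,
# so the reduce phase sees all values for that key together.
index = partition("river", 4)
```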
Shuffling and Sorting
The data is shuffled between and within nodes so that it moves out of the map tasks and becomes ready for
the reduce function. Sometimes, shuffling the data can take considerable computation time.
A sorting operation is performed on the input data for the Reduce function. Here, the keys are compared
using a comparison function and arranged in sorted order.
Data Flow In MapReduce

Phases of MapReduce data flow (continued)


Reduce function
The Reduce function is invoked for each unique key. These keys are already arranged in sorted order. The
Reduce function iterates over the values associated with each key and generates the corresponding output.
Output writer
Once the data has flowed through all of the above phases, the output writer executes. The role of the output
writer is to write the Reduce output to stable storage.
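Putting the phases together, the whole data flow can be sketched as a single-process simulation. This is a toy stand-in for the distributed framework (the mapper, reducer, and sample input belong to the word-count example, not to any real Hadoop API):

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer, num_reducers=2):
    """Toy single-process simulation of the MapReduce data flow:
    map -> partition -> shuffle -> sort -> reduce."""
    # Map: each input record yields zero or more (key, value) pairs.
    pairs = [pair for record in records for pair in mapper(record)]

    # Partition + shuffle: route each pair to a reducer bucket by key
    # hash, grouping the values that share a key.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[hash(key) % num_reducers][key].append(value)

    # Sort + reduce: each reducer processes its keys in sorted order;
    # the output writer would then persist these results to storage.
    output = []
    for bucket in buckets:
        for key in sorted(bucket):
            output.append(reducer(key, bucket[key]))
    return output

word_mapper = lambda line: [(w.lower(), 1) for w in line.split()]
word_reducer = lambda word, counts: (word, sum(counts))

result = run_mapreduce(["Deer Bear River", "Car Car River"], word_mapper, word_reducer)
# dict(result) → {'bear': 1, 'car': 2, 'deer': 1, 'river': 2}
```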
