Chapter 3: HDFS, MapReduce, YARN
If a DataNode fails to send its heartbeat messages, the NameNode marks it as a dead node.
If the number of replicas of a file's blocks falls below the replication factor, the NameNode re-replicates those blocks to other DataNodes, as sketched below.
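A rough sketch of this bookkeeping in Python (the timeout value, the dict layout, and the schedule_replication helper are illustrative assumptions, not Hadoop's actual code):

import time

DEAD_NODE_TIMEOUT = 630  # seconds; roughly Hadoop's default of ~10.5 minutes

def check_cluster(datanodes, blocks, replication_factor=3):
    """Hypothetical sketch of the NameNode's heartbeat and replica checks."""
    now = time.time()
    # mark DataNodes that stopped sending heartbeats as dead
    dead = {n["id"] for n in datanodes
            if now - n["last_heartbeat"] > DEAD_NODE_TIMEOUT}
    for block in blocks:
        live = [r for r in block["replicas"] if r not in dead]
        missing = replication_factor - len(live)
        if missing > 0:
            # re-replicate the block to `missing` healthy DataNodes
            schedule_replication(block["id"], live, missing)  # hypothetical helper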
HDFS Architecture
The input and output of a MapReduce job, and even the intermediate output, are all in the form of <Key, Value> pairs.
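For instance (a hypothetical word-count job, not from the slides), the pairs at each stage could look like this:

# input pairs:        (byte offset of line, line of text)
map_input = [(0, "the quick fox"), (14, "the lazy dog")]
# intermediate pairs: emitted by Map, then sorted and grouped by key
intermediate = [("dog", [1]), ("fox", [1]), ("lazy", [1]),
                ("quick", [1]), ("the", [1, 1])]
# output pairs:       emitted by Reduce
final_output = [("dog", 1), ("fox", 1), ("lazy", 1),
                ("quick", 1), ("the", 2)]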
Map Abstraction
Takes a key/value pair as input
Key is a reference to the input value.
Value is the data set on which to operate.
Processing
A function defined by the user
Applied to every value in the input
Produces a new list of key/value pairs
The output of Map is called the intermediate output
Can be a different type from the input pair
The output is stored on local disk
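A minimal sketch of a user-defined map function in plain Python, using word count as an assumed example (not from the slides):

def word_count_map(key, value):
    # key: a reference to the input value (e.g., the line's byte offset)
    # value: the data to operate on (here, one line of text)
    # returns a new list of intermediate key/value pairs
    return [(word, 1) for word in value.split()]

print(word_count_map(0, "the quick fox the"))
# [('the', 1), ('quick', 1), ('fox', 1), ('the', 1)]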
Reduce Abstraction
Takes intermediate key/value pairs as input
Generated by Map
Key/value pairs are sorted by key
Processing
A function defined by the user
An iterator supplies the values for a given key to the reduce function.
Produces the final list of key/value pairs
The output of Reduce is called the final output
Can be a different type from the input pair
The output is stored in HDFS
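The matching reduce sketch for the word-count example assumed above; the framework sorts the intermediate pairs and supplies an iterator of values per key:

def word_count_reduce(key, values):
    # key: one word; values: an iterator over every count emitted for it
    # yields the final key/value pair, which is written to HDFS
    yield key, sum(values)

print(list(word_count_reduce("the", iter([1, 1]))))
# [('the', 2)]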
The MapReduce architecture
An mrjob example (the imports and the mapper/reducer bodies below are filled in so the code runs; they assume MovieLens-style tab-separated input of userID, movieID, rating, timestamp):

from mrjob.job import MRJob
from mrjob.step import MRStep

class RatingsBreakdown(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_ratings,
                   reducer=self.reducer_count_ratings)
        ]
    def mapper_get_ratings(self, _, line):
        # assumes tab-separated input: userID, movieID, rating, timestamp
        yield line.split('\t')[2], 1
    def reducer_count_ratings(self, rating, counts):
        yield rating, sum(counts)

if __name__ == '__main__':
    RatingsBreakdown.run()
MapReduce Example
100 files with daily temperatures for two cities.
Each file has 10,000 entries.
For example, one file may have entries such as (Toronto, 20), (New York, 30), …
Our goal is to compute the maximum temperature for each of the two cities.
Assign the task to 100 Map processors, each working on one file.
Each processor outputs a list of key/value pairs, e.g., (Toronto, 30), (New York, 65), …
Now we have 100 lists, each with two elements. We give these lists to two Reducers: one for Toronto and another for New York.
The Reducers produce the final answer: (Toronto, 55), (New York, 65).
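A sketch of this job with mrjob (the input format of "City temperature" per line and the class name are our assumptions):

from mrjob.job import MRJob

class MaxTemperature(MRJob):
    def mapper(self, _, line):
        # assumes each line looks like: "Toronto 20" or "New York 30"
        city, temp = line.rsplit(' ', 1)
        yield city, int(temp)
    def reducer(self, city, temps):
        # one call per city; emits that city's maximum temperature
        yield city, max(temps)

if __name__ == '__main__':
    MaxTemperature.run()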
Hadoop MapReduce
Is highly scalable, fault tolerant, and designed to run on commodity hardware.
The MapReduce architecture has a master JobTracker and multiple worker TaskTracker processes on the nodes.
MapReduce jobs are broken into multistep processes: Mapper, Shuffle, Sort, and Reducer, plus the auxiliary Combiner and Partitioner (see the sketch after this list).
MapReduce jobs need a lot of data transfer, for which Hadoop uses the Writable and WritableComparable interfaces.
For file handling, MapReduce provides the InputFormat interface with a RecordReader, and the OutputFormat interface with a RecordWriter, to improve processing efficiency.
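To make the Mapper/Combiner/Reducer steps concrete, here is a minimal word-count sketch in mrjob (our illustration, not from the slides); the Combiner pre-aggregates each mapper's output locally so less data crosses the network during the Shuffle:

from mrjob.job import MRJob

class WordCount(MRJob):
    def mapper(self, _, line):
        for word in line.split():
            yield word.lower(), 1
    def combiner(self, word, counts):
        # runs on the map side to shrink the data that gets shuffled
        yield word, sum(counts)
    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    WordCount.run()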
Data Locality
Move computation close to the data, rather than data to the computation.