Introduction to the Hadoop MapReduce Platform

Presented by: Monzur Morshed, Habibur Rahman


Hadoop is an open source implementation of the MapReduce platform and distributed file system, written in Java. This module explains the basics of how to begin using Hadoop to experiment and learn from the rest of this tutorial. It covers setting up the platform and connecting other tools to use it.

What Hadoop is
Inspired by Google:
- Distributed file system similar to Google File System
- Parallel programming model similar to Google MapReduce
- Parallel database similar to Google Bigtable

Open source Java project

Hadoop was created by Doug Cutting, who named it after his son's toy elephant. It was originally developed to support distribution for the Nutch search engine project.

Distributed file system (HDFS)

Distributed execution framework (MapReduce)

Query language (Pig)

Distributed, column-oriented data store (HBase)

Machine learning (Mahout)

Hadoop Distributed File System

Cluster filing system

Designed for huge files (many GBs)

Designed for lots of streaming reads and infrequent writes

Not a POSIX file system: requires client help

What Hadoop isn't

Hadoop is not a classical grid solution

HDFS is not a POSIX file system

HDFS is not designed for low latency access to a huge number of small files

Hadoop MapReduce is not designed for interactive applications

HBase is not a relational database and does not have transactions or SQL support

HDFS and HBase are not focused on security, encryption or multi-tenancy

HDFS, MapReduce

Typical Hadoop Cluster

Commodity Hardware

Typically in 2 level architecture
- Nodes are commodity PCs
- 30-40 nodes/rack
- Uplink from rack is 3-4 gigabit
- Rack-internal is 1 gigabit

HDFS Architecture

[Figure: HDFS architecture diagram; surviving labels: Cluster Membership, Secondary NameNode]

NameNode: Maps a file to a file-id and a list of DataNodes
DataNode: Maps a block-id to a physical location on disk
SecondaryNameNode: Periodic merge of the transaction log
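The division of metadata between the NameNode and the DataNodes can be pictured with a small in-memory sketch. The class and method names below (`ToyNameNode`, `ToyDataNode`, `add_file`, `locate`) and the disk paths are hypothetical and are not part of Hadoop's API; real HDFS stores and replicates this metadata very differently.

```python
# Toy model of HDFS metadata; every name here is an illustrative assumption.

class ToyDataNode:
    """Maps a block-id to a physical location on local disk."""
    def __init__(self, name):
        self.name = name
        self.block_paths = {}          # block_id -> path on this node's disk

    def store(self, block_id):
        self.block_paths[block_id] = f"/data/{self.name}/blk_{block_id}"


class ToyNameNode:
    """Maps a file name to its blocks and the DataNodes holding each block."""
    def __init__(self):
        self.files = {}                # filename -> list of block ids
        self.block_locations = {}      # block_id -> list of DataNode names

    def add_file(self, filename, block_ids, datanodes):
        self.files[filename] = list(block_ids)
        for block_id in block_ids:
            for node in datanodes:
                node.store(block_id)
            self.block_locations[block_id] = [n.name for n in datanodes]

    def locate(self, filename):
        """Return [(block_id, [datanode names])] for a file."""
        return [(b, self.block_locations[b]) for b in self.files[filename]]


nodes = [ToyDataNode("dn1"), ToyDataNode("dn2")]
namenode = ToyNameNode()
namenode.add_file("/logs/a.txt", [1, 2], nodes)
```

A client would call `namenode.locate("/logs/a.txt")` to learn which nodes hold each block, then read the block data from those nodes directly.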


Data Flow

[Figure: data flow diagram; surviving labels: Web Servers, Scribe Servers, Network Storage, Oracle RAC, Hadoop Cluster]


Image Source:

HDFS: Hadoop Distributed File System

Very Large Distributed File System
- 10K nodes, 100 million files, 10 PB

Assumes Commodity Hardware
- Files are replicated to handle hardware failure
- Detect failures and recover from them

Optimized for Batch Processing
- Data locations exposed so that computations can move to where data resides
- Provides very high aggregate bandwidth

User Space, runs on heterogeneous OS

Distributed File System

Data Coherency
- Write-once-read-many access model
- Client can only append to existing files

Files are broken up into blocks
- Typically 128 MB block size
- Each block replicated on multiple Data Nodes

Intelligent Client
- Client can find location of blocks
- Client accesses data directly from Data Node
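The block arithmetic above can be sketched in a few lines. This is only an illustration: the replication factor of 3, the node names, and the round-robin placement are assumptions made for the example, and real HDFS placement also takes racks into account.

```python
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, the typical block size quoted above
REPLICATION = 3                  # assumed replication factor for illustration

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the byte length of each block for a file of file_size bytes."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])

def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` nodes, round-robin.

    The nodes are distinct as long as replication <= len(datanodes).
    """
    placement = {}
    for block in range(num_blocks):
        placement[block] = [datanodes[(block + r) % len(datanodes)]
                            for r in range(replication)]
    return placement

blocks = split_into_blocks(300 * 1024 * 1024)       # a 300 MB file
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

A 300 MB file yields two full 128 MB blocks and one 44 MB block, and each block ends up on three different nodes.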

MapReduce Paradigm

Simple data-parallel programming model designed for scalability and fault-tolerance

Framework for distributed processing of large data sets

Originally designed by Google

Pluggable user code runs in generic framework

Pioneered by Google
- Processes 20 petabytes of data per day

What is MapReduce used for?

At Google:
- Index construction for Google Search
- Article clustering for Google News
- Statistical machine translation

At Yahoo!:
- Web map powering Yahoo! Search
- Spam detection for Yahoo! Mail

At Facebook:
- Data mining
- Ad optimization
- Spam detection

In research:
- Astronomical image analysis (Washington)
- Bioinformatics (Maryland)
- Analyzing Wikipedia conflicts (PARC)
- Natural language processing (CMU)
- Particle physics (Nebraska)
- Ocean climate simulation (Washington)

MapReduce processing model

What the final multi-node cluster will look like

Who uses Hadoop?

Amazon/A9, Facebook, Google, IBM, Joost, New York Times, PowerSet, Veoh, Yahoo!

MapReduce Programming Model

Data type: key-value records

Map function: (K_in, V_in) -> list(K_inter, V_inter)

Reduce function: (K_inter, list(V_inter)) -> list(K_out, V_out)

Example: Word Count

# output() stands for the framework's call that emits a (key, value) pair
def mapper(line):
    for word in line.split():
        output(word, 1)

def reducer(key, values):
    output(key, sum(values))
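To try the word count without a cluster, the map, shuffle, and reduce phases can be simulated in one process. This is a minimal sketch, not Hadoop code: `run_mapreduce` is a hypothetical helper, and the mapper and reducer use `yield` in place of the framework's emit call.

```python
from collections import defaultdict

def mapper(line):
    """Emit (word, 1) for every word in the line."""
    for word in line.split():
        yield word, 1

def reducer(key, values):
    """Emit (word, total count)."""
    yield key, sum(values)

def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle (group by key), and reduce in a single process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)          # shuffle: collect values per key
    output = []
    for key in sorted(groups):                 # reducers see keys in sorted order
        output.extend(reducer(key, groups[key]))
    return output

counts = run_mapreduce(["the quick fox", "the lazy dog the end"], mapper, reducer)
```

`counts` is a list of `(word, count)` pairs ordered by word. The same driver works for any mapper and reducer that follow the key-value signatures of the programming model.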

MapReduce Execution Details

Single master controls job execution on multiple slaves

Mappers preferentially placed on same node or same rack as their input block
- Minimizes network usage

Mappers save outputs to local disk before serving them to reducers
- Allows recovery if a reducer crashes
- Allows having more reducers than nodes
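The placement preference for mappers can be illustrated with a small scheduling function. This is a sketch under assumptions: `pick_node`, the node names, and the `rack_of` table are hypothetical, and Hadoop's actual scheduler is considerably more involved.

```python
def pick_node(block_replicas, free_nodes, rack_of):
    """Choose a node for a map task: node-local, then rack-local, then any."""
    # 1. A free node that already stores the input block.
    for node in free_nodes:
        if node in block_replicas:
            return node, "node-local"
    # 2. A free node in the same rack as one of the replicas.
    replica_racks = {rack_of[n] for n in block_replicas}
    for node in free_nodes:
        if rack_of[node] in replica_racks:
            return node, "rack-local"
    # 3. Fall back to any free node (the block must cross racks).
    return free_nodes[0], "off-rack"

rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
choice = pick_node(block_replicas=["n1"], free_nodes=["n3", "n2"], rack_of=rack_of)
```

Here the block lives on `n1`, which is busy, so the task goes to `n2` in the same rack rather than to `n3` in another rack.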

Fault Tolerance in MapReduce

1. If a task crashes:
- Retry on another node
- OK for a map because it has no dependencies
- OK for reduce because map outputs are on disk
- If the same task fails repeatedly, fail the job or ignore that input block (user-controlled)
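The retry behaviour for a crashed task can be sketched as a loop over candidate nodes. The helper `run_with_retries`, the limit of four attempts, and the `flaky_task` example are assumptions for illustration only.

```python
def run_with_retries(task, nodes, max_attempts=4):
    """Run task(node) on successive nodes until it succeeds or attempts run out."""
    failures = []
    for attempt in range(max_attempts):
        node = nodes[attempt % len(nodes)]      # move to the next node each time
        try:
            return task(node)
        except Exception as err:                # the task crashed on this node
            failures.append((node, err))
    # Repeated failure: give up and fail the job.
    raise RuntimeError(f"task failed {max_attempts} times: {failures}")

def flaky_task(node):
    """Hypothetical task that crashes on node 'n1' only."""
    if node == "n1":
        raise IOError("disk error")
    return f"done on {node}"

result = run_with_retries(flaky_task, ["n1", "n2", "n3"])
```

The first attempt fails on `n1`, the second succeeds on `n2`. A job that prefers to skip the bad input block instead of failing would catch the final `RuntimeError` and continue.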

2. If a node crashes:
- Re-launch its current tasks on other nodes
- Re-run any maps the node previously ran
- Necessary because their output files were lost along with the crashed node

3. If a task is going slowly (straggler):
- Launch second copy of task on another node (speculative execution)
- Take the output of whichever copy finishes first, and kill the other

Surprisingly important in large clusters
- Stragglers occur frequently due to failing hardware, software bugs, misconfiguration, etc.
- Single straggler may noticeably slow down a job
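Speculative execution can be imitated with Python threads. This is a simplified sketch: both copies start at the same time, whereas a real scheduler launches the second copy only after noticing that the first is slow. The function names and the node labels `slow` and `fast` are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def speculative_run(task, nodes):
    """Run one copy of task per node and return the first result to finish."""
    pool = ThreadPoolExecutor(max_workers=len(nodes))
    futures = [pool.submit(task, node) for node in nodes]
    done, pending = wait(futures, return_when=FIRST_COMPLETED)
    for future in pending:
        future.cancel()        # best effort; a thread that is running is not killed
    pool.shutdown(wait=False)  # do not block on the slower copy
    return next(iter(done)).result()

def task(node):
    """Hypothetical task; node 'slow' stands in for a straggler."""
    time.sleep(0.5 if node == "slow" else 0.01)
    return f"output from {node}"

winner = speculative_run(task, ["slow", "fast"])
```

The job continues with the output of the fast copy and no longer waits for the straggler.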

By providing a data-parallel programming model, MapReduce can control job execution in useful ways:

- Automatic division of job into tasks
- Automatic placement of computation near data
- Automatic load balancing
- Recovery from failures & stragglers

Some practical MapReduce examples


1. Search

Input: (lineNumber, line) records
Output: lines matching a given pattern

Map: if (line matches pattern): output(line)

Reduce: identity function
Alternative: no reducer (map-only job)
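The pattern search can be written as a map-only job. The sketch below runs in a single process; `grep_mapper`, `run_map_only`, and the sample records are made up for the example.

```python
import re

def grep_mapper(record, pattern):
    """Emit the line if it matches the pattern; emit nothing otherwise."""
    line_number, line = record
    if re.search(pattern, line):
        yield line

def run_map_only(records, pattern):
    """Map-only job: no shuffle and no reducer, mapper output is the result."""
    return [out for record in records for out in grep_mapper(record, pattern)]

records = [(1, "error: disk full"), (2, "all good"), (3, "fatal error")]
matches = run_map_only(records, r"error")
```

Because each line is tested independently, the records can be split across any number of mappers without coordination.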

2. Sort

Input: (key, value) records
Output: same records, sorted by key

Map: identity function
Reduce: identity function

Trick: Pick partitioning function h such that k1 < k2 => h(k1) <= h(k2)
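The partitioning trick can be demonstrated with a range partitioner. This is a sketch under assumptions: the split points are chosen by hand, `distributed_sort` is a hypothetical helper, and the call to `sorted` stands in for the per-reducer key ordering that the framework provides.

```python
import bisect

def make_partitioner(split_points):
    """Return h(key) = index of the key's range; h never decreases as key grows."""
    return lambda key: bisect.bisect_right(split_points, key)

def distributed_sort(records, split_points):
    """Partition by key range, sort each partition, then concatenate."""
    h = make_partitioner(split_points)
    partitions = [[] for _ in range(len(split_points) + 1)]
    for key, value in records:                 # identity map plus partitioning
        partitions[h(key)].append((key, value))
    output = []
    for part in partitions:                    # each "reducer" sorts its own range
        output.extend(sorted(part, key=lambda kv: kv[0]))
    return output

records = [(42, "a"), (7, "b"), (99, "c"), (15, "d"), (63, "e")]
result = distributed_sort(records, split_points=[20, 60])
```

Since every key in partition i is no larger than any key in partition i+1, concatenating the sorted partitions gives a globally sorted result with no merge step.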

3. Inverted Index
Input: (filename, text) records
Output: list of files containing each word

Map:
    for each word in text.split():
        output(word, filename)

Combine: deduplicate file names for each word

Reduce:
    def reduce(word, filenames):
        output(word, sort(filenames))
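A single-process version of the inverted index is sketched below. The helper names and sample files are made up, and for brevity the combine step (removing duplicate file names) is folded into the reducer.

```python
from collections import defaultdict

def index_mapper(filename, text):
    """Emit (word, filename) for every word in the file."""
    for word in text.split():
        yield word, filename

def index_reducer(word, filenames):
    """Return the word with its sorted, de-duplicated file list."""
    return word, sorted(set(filenames))

def build_inverted_index(files):
    groups = defaultdict(list)                 # shuffle: group file names by word
    for filename, text in files:
        for word, name in index_mapper(filename, text):
            groups[word].append(name)
    return dict(index_reducer(w, names) for w, names in groups.items())

files = [("hamlet.txt", "to be or not to be"),
         ("12th.txt", "be not afraid of greatness")]
index = build_inverted_index(files)
```

Looking up `index["be"]` returns both file names, while `index["afraid"]` returns only `12th.txt`.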

Inverted Index Example

4. Most Popular Words
Input: (filename, text) records
Output: top 100 words occurring in the most files

Two-stage solution:

Job 1:
- Create inverted index, giving (word, list(file)) records

Job 2:
- Map each (word, list(file)) to (count, word)
- Sort these records by count as in sort job
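The two-stage solution can be chained in a few lines. This is an in-memory sketch with hypothetical function names; sorting in descending order and breaking ties alphabetically are choices made for the example, and `n` is set to 2 only to keep the sample small.

```python
from collections import defaultdict

def job1_inverted_index(files):
    """Job 1: produce (word, list(file)) records."""
    index = defaultdict(set)
    for filename, text in files:
        for word in text.split():
            index[word].add(filename)
    return [(word, sorted(names)) for word, names in index.items()]

def job2_top_words(index_records, n=100):
    """Job 2: map to (count, word), sort by count descending, keep the top n."""
    counted = [(len(file_list), word) for word, file_list in index_records]
    counted.sort(key=lambda cw: (-cw[0], cw[1]))   # ties broken alphabetically
    return counted[:n]

files = [("a.txt", "big data big cluster"),
         ("b.txt", "big cluster"),
         ("c.txt", "big")]
top = job2_top_words(job1_inverted_index(files), n=2)
```

The count is the number of files containing the word, not the number of occurrences, so the repeated "big" in `a.txt` is counted once.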

MapReduce in Hadoop

Three ways to write jobs in Hadoop:
- Java API
- Hadoop Streaming (for Python, Perl, etc.)
- Pipes API (C++)
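With Hadoop Streaming, the mapper and reducer are ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output, and the reducer receives its input sorted by key. The sketch below shows word count in that style and simulates the pipeline locally; the function names are hypothetical, and the in-memory streams stand in for standard input and output.

```python
import io
from itertools import groupby

def stream_mapper(lines, out):
    """Mapper: read text lines, write one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            out.write(f"{word}\t1\n")

def stream_reducer(sorted_lines, out):
    """Reducer: input lines arrive sorted by key; sum the counts per word."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        out.write(f"{word}\t{sum(int(count) for _, count in group)}\n")

# Local simulation of map -> sort -> reduce. On a cluster the two functions
# would live in separate scripts reading sys.stdin and writing sys.stdout.
mapped = io.StringIO()
stream_mapper(["the cat", "the hat"], mapped)
reduced = io.StringIO()
stream_reducer(sorted(mapped.getvalue().splitlines(keepends=True)), reduced)
```

Because the reducer relies only on its input being grouped by key, it can process the stream with constant memory per key.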

MapReduce architecture

Scope of MapReduce

Hadoop MapReduce Tutorial

We introduced the MapReduce programming model for processing large-scale data.
We discussed the supporting Hadoop Distributed File System.
The concepts were illustrated using a simple example.
We reviewed some important parts of the source code for the example.

Gfarm file system for POSIX & MPI-IO support

HDFS is not a POSIX file system, but the Gfarm file system can be used instead of HDFS. Gfarm has the advantage that it supports not only MapReduce applications but also POSIX and MPI-IO applications. Reference articles:
a) Hadoop MapReduce on Gfarm File System

b) Using the Gfarm File System as a POSIX compatible storage platform for Hadoop MapReduce applications

Thank You

