
Introduction To Hadoop/Big Data



Data is Exploding
 We live in the data age.

 An IDC estimate put the size of the “digital universe” at 0.18 zettabytes in 2006.

 Later forecasts projected growth to roughly 40 zettabytes by 2020.

 A zettabyte is 10^21 bytes, or equivalently one thousand exabytes, one million petabytes, or one billion terabytes.

 That’s roughly the same order of magnitude as one disk drive for every person in the world.

!!! BIG DATA !!!
Today, we’re surrounded by data.
 People upload videos, take pictures on their cell phones, text friends, update their
Facebook status, leave comments around the web, click on ads, and so forth.

 Machines, too, are generating and keeping more and more data.

 The exponential growth of data first presented challenges to cutting-edge businesses
such as Google, Yahoo, Amazon, and Microsoft.



Cont’d
 They needed to go through terabytes and petabytes of data to figure out which
websites were popular, what books were in demand, and what kinds of ads
appealed to people.

 Existing tools were becoming inadequate to process such large data sets.



Data Sources
This flood of data is coming from many sources. Consider the following:

 The New York Stock Exchange generates about one terabyte of new trade data per day.
 Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.
 More importantly, the amount of data generated by machines will be even greater than that
generated by people.
Ex: Machine logs, sensor networks, vehicle GPS traces, retail transactions

Examples:
– Facebook: over 500 TB of data per day

– eBay: more than 500 TB of data per day



Traditional Large Scale Systems
 Imagine you have 1GB of data that you need to process.

 The data is stored in a relational database on your desktop computer, and this desktop computer
has no problem handling the load.

 Then your company starts growing very quickly, and that data grows to 10 GB, and then 100 GB.



Traditional Large Scale Systems
 And you start to reach the limits of your current desktop computer.

 So you scale up by investing in a larger computer, and you are then OK for a few more
months.

 Then your data grows to 10 TB, and then 100 TB.

 You are fast approaching the limits of that computer.


Traditional Large Scale Systems

 Moreover, you are now asked to feed your application with unstructured data coming from
sources like Facebook, Twitter, RFID readers, sensors, and so on.

 You need to derive information from both the relational data and the unstructured data, and
want this information as soon as possible.



Traditional Large Scale Systems



Traditional Large Scale Systems
 Traditionally, computation has been processor-bound.

 For decades, the primary push was to increase the computing power of a single machine:

faster processors, more RAM.

 A distributed system consists of multiple computers that communicate through
a network. The computers interact with each other in order to achieve a common
goal.

 Distributed systems evolved to allow developers to use multiple machines for a single job.



Distributed Systems: Problems
 Programming for traditional distributed systems is complex


Finite bandwidth is available

Data exchange requires synchronization

It is difficult to deal with partial failures of the system

Distributed Systems: Data Storage



Typically, data for a distributed system is stored on a SAN

At compute time, data is copied to the compute nodes

Fine for relatively limited amounts of data



 A SAN has no concept of data locality: every time, the data must be brought
across the network to the compute nodes for processing.



 The traditional data processing model has data stored in 'storage cluster', data is
copied over to 'compute cluster' for processing and the results are written back
to storage cluster.

 This model, however, doesn’t quite work for Big Data, because copying so
much data out to the compute cluster may be too time-consuming, or even impossible. So
what is the answer?

 One solution is to process Big Data ‘in place’ – that is, the storage cluster doubles as the
compute cluster.
Hadoop’s History
 Hadoop is based on work done by Google in the late 1990s/early
2000s
– Specifically, on papers describing the Google File System (GFS)
published in 2003, and MapReduce published in 2004
 This work took a radically new approach to the problem of
distributed computing
– Meets all the requirements we have for reliability and scalability
 Core concept: distribute the data as it is initially stored in the
system
– Individual nodes can work on data local to those nodes
– No data transfer over the network is required for initial processing
What is HADOOP?
Hadoop is a free, Java-based programming framework that supports the processing of large
data sets in a distributed computing environment.

Hadoop Components
 Hadoop consists of two core components
– The Hadoop Distributed File System (HDFS)
– MapReduce

 There are many other projects based around core Hadoop


– Often referred to as the ‘Hadoop Ecosystem’
– Pig, Hive, HBase, Flume, Oozie, Sqoop, etc
Cont’d

A set of machines running HDFS and MapReduce is known as a


Hadoop Cluster
– Individual machines are known as nodes
– A cluster can have as few as one node, or as many as several
thousand
– More nodes = better performance!



Requirements of New Approach
 Partial Failure Support

The system must support partial failure
– Failure of a component should result in a graceful degradation
of application performance
– Not complete failure of the entire system

 Data Recoverability

If a component of the system fails, its workload should be
assumed by still-functioning units in the system
– Failure should not result in the loss of any data



 Component Recovery
 If a component of the system fails and then recovers, it should
be able to rejoin the system

– Without requiring a full restart of the entire system


 Consistency

Component failures during execution of a job should not affect


the outcome of the job



Cont’d
Scalability
Adding load to the system should result in a graceful decline in performance of

individual jobs
– Not failure of the system

Increasing resources should support a proportional increase in


load capacity



 Data locality

• Hadoop brings the data locality concept by effectively

utilizing the underlying hardware and network.



Recap

 What is Hadoop?
 What are the core components in Hadoop?
 Advantages of Hadoop?



HADOOP Core Concepts:



Hadoop Distributed File System
(HDFS)


HDFS (Hadoop Distributed File System)


• Filesystems that manage the storage across a network of
machines are called distributed filesystems.
• HDFS is a filesystem designed for storing very large files
with streaming data access patterns, running on clusters
of nodes with commodity hardware.


• Very large files :“Very large” in this context means files that are
hundreds of megabytes, gigabytes, or terabytes in size. There
are Hadoop clusters running today that store petabytes of data.

• Streaming data access :HDFS is built around the idea that the
most efficient data processing pattern is a write-once, read-
many-times pattern.

• A dataset is typically generated or copied from source, then


various analyses are performed on that dataset over time.

• Each analysis will involve a large proportion, if not all, of the


dataset, so the time to read the whole dataset is more
important than the latency in reading the first record.


 Commodity hardware: Hadoop doesn’t require expensive, highly

reliable hardware to run on.

 It’s designed to run on clusters of commodity hardware (hardware commonly

available from multiple vendors) for which the chance of node failure across
the cluster is high, at least for large clusters.

 HDFS is designed to carry on working without a noticeable


interruption to the user in the face of such failure.


HDFS Basic Concepts

• HDFS is a filesystem written in Java


• Based on Google’s GFS

• Sits on top of a native filesystem


• Such as ext3, ext4 or xfs

• Provides redundant storage for massive amounts of data


• Using cheap, unreliable computers


 Files are split into blocks


– Each block is usually 64MB or 128MB

 Data is distributed across many machines at load time


– Different blocks from the same file will be stored on different machines
– This provides for efficient MapReduce processing (see later)

 Blocks are replicated across multiple machines, known as DataNodes


– Default replication is three-fold
– Meaning that each block exists on three different machines

 A master node called the NameNode keeps track of which blocks make
up a file, and where those blocks are located
– Known as the metadata
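
A quick worked example (assuming the 128 MB block size and three-fold replication mentioned above): a 1 GB file is split into 8 blocks of 128 MB each; with replication, 24 block replicas are spread across the DataNodes, and the NameNode’s metadata records which three DataNodes hold each of the 8 blocks.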


Recap

• What is distributed system?


• What is HDFS?
• What is the HDFS block size?
• What is default replication factor in Hadoop?


HDFS Architecture


HDFS Architecture Cont’d


Functions of a NameNode
 Manages File System Namespace
 Maps a file name to a set of blocks
 Maps a block to the Data Nodes where it resides
 Cluster Configuration Management
 Replication Engine for Blocks


NameNode Metadata
 Metadata in Memory
 The entire metadata is in main memory
 No demand paging of metadata
 Types of metadata
 List of files
 List of Blocks for each file
 List of Data Nodes for each block
 File attributes, e.g. creation time, replication factor
 A Transaction Log
 Records file creations, file deletions etc.
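
A purely illustrative sketch of what this metadata might look like (file names, block IDs, and DataNode names are made up):

/user/training/abc.txt -> blocks [blk_1, blk_2]
blk_1 -> DataNodes {dn1, dn4, dn7}
blk_2 -> DataNodes {dn2, dn5, dn8}
attributes: replication = 3, creation time, permissions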


Recap

• What is the functionality of NameNode?


• What is metadata?



HDFS - Blocks
 File Blocks

– 64MB (default), 128MB (recommended) – compare to 4KB in UNIX

– Behind the scenes, 1 HDFS block is supported by multiple operating


system (OS) blocks

(Diagram: one 128 MB HDFS block mapped onto many underlying OS blocks)



Advantages of blocks:

– Fixed size – easy to calculate how many fit on a disk

– A file can be larger than any single disk in the network

– If a file or a chunk of the file is smaller than the block size, only
needed space is used. E.g.: 420MB file is split as:

128MB 128MB 128MB 36MB


DataNode
 A Block Server
 Stores data in the local file system (e.g. ext3)
 Stores metadata of a block (e.g. CRC)
 Serves data and metadata to Clients
 Block Report
 Periodically sends a report of all existing blocks to the Name Node
 Facilitates Pipelining of Data
 Forwards data to other specified Data Nodes


Block Placement
 Current Strategy
 Two replicas stored in same rack
 Third replica stored in other rack
 Additional replicas are randomly placed
 Clients read from nearest replicas
 Would like to make this policy pluggable


Recap

• What is the block size in Hadoop?


• What is the functionality of DataNode?
• How is data (a block) replicated in a Hadoop
cluster?


Secondary NameNode

File system image and Edit log

 fsimage – a snapshot of the filesystem metadata taken when the NameNode started

 Edit log – the sequence of changes made to the filesystem after the NameNode started


Recap

• What is the functionality of Secondary NameNode?


• What is fsimage and editlog ?


Writing files to HDFS


How data is Replicated


Rack Awareness


Client Read File From HDFS


Hadoop Cluster


Recap

• Why does the client always contact the NameNode when writing a file?
• Is the NameNode involved in writing the data after the client has the
metadata? If not, why not?
• What is replication pipelining?
• Why do we need to maintain a rack-awareness policy?
• Why does the client always contact the NameNode when reading a file?
• What happens if the metadata is lost?


Hadoop Configuration Files

hadoop-env.sh (Bash script) – Environment variables that are used in the scripts that run Hadoop.

core-site.xml (Hadoop configuration XML) – Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.

hdfs-site.xml (Hadoop configuration XML) – Configuration settings for the HDFS daemons: the namenode, the secondary namenode, and the datanodes.

mapred-site.xml (Hadoop configuration XML) – Configuration settings for the MapReduce daemons: the jobtracker and the tasktrackers.


HDFS Parameters
fs.default.name (URI; default: file:///) – The default filesystem. The URI defines the hostname and port that the namenode’s RPC server runs on; the default port is 8020. This property should be set in core-site.xml.

dfs.name.dir (comma-separated directory names; default: ${hadoop.tmp.dir}/dfs/name) – The list of directories where the namenode stores its persistent metadata. The namenode stores a copy of the metadata in each directory in the list.

dfs.data.dir (comma-separated directory names; default: ${hadoop.tmp.dir}/dfs/data) – A list of directories where the datanode stores blocks. Each block is stored in only one of these directories.

fs.checkpoint.dir (comma-separated directory names; default: ${hadoop.tmp.dir}/dfs/namesecondary) – A list of directories where the secondary namenode stores checkpoints. It stores a copy of the checkpoint in each directory in the list.
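
These properties are normally set in core-site.xml and hdfs-site.xml rather than in code, but the following minimal sketch (the override value is hypothetical) shows how a client can read or set them through Hadoop’s Configuration API:

import org.apache.hadoop.conf.Configuration;

public class ShowConfig {
  public static void main(String[] args) {
    // Loads configuration resources (e.g. core-site.xml) found on the classpath
    Configuration conf = new Configuration();

    // Read the default filesystem URI (falls back to file:/// if unset)
    System.out.println("fs.default.name = " + conf.get("fs.default.name", "file:///"));

    // May print null if hdfs-site.xml is not on the classpath
    System.out.println("dfs.name.dir = " + conf.get("dfs.name.dir"));

    // Hypothetical programmatic override; normally this lives in core-site.xml
    conf.set("fs.default.name", "hdfs://namenode:8020");
  }
}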


High Availability in Namenode

• Before CDH4, the NameNode was a single point of failure (SPOF) in an HDFS cluster.
• Each cluster had a single NameNode, and if that machine or process became
unavailable, the cluster as a whole would be unavailable until the NameNode was
either restarted or brought up on a separate machine. The Secondary NameNode
did not provide failover capability.
This reduced the total availability of the HDFS cluster in two major ways:
• In the case of an unplanned event such as a machine crash, the cluster would be
unavailable until an operator restarted the NameNode.
• Planned maintenance events such as software or hardware upgrades on the
NameNode machine would result in periods of cluster downtime.



HDFS Commands
 Copy file abc.txt from local disk to the user’s directory in HDFS

hadoop fs -copyFromLocal abc.txt abc.txt

– This will copy the file to /user/training/abc.txt


 Get a directory listing of the user’s home directory in HDFS

hadoop fs -ls

 Get a directory listing of the HDFS root directory

hadoop fs -ls /



HDFS Commands (cont’d)
 Display the contents of the HDFS file /user/fred/bar.txt

hadoop fs -cat /user/fred/bar.txt

 Copy that file to the local disk, naming it baz.txt

hadoop fs -copyToLocal /user/fred/bar.txt baz.txt

 Create a directory called input under the user’s home directory in HDFS

hadoop fs -mkdir input



HDFS Commands (cont’d)
 Delete the directory input_old and all its contents

hadoop fs -rmr input_old
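
The same operations are also available programmatically through the HDFS Java API. The following is only a minimal sketch (paths and file names are just examples) using the FileSystem class:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsOps {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // hadoop fs -copyFromLocal abc.txt abc.txt
    fs.copyFromLocalFile(new Path("abc.txt"), new Path("abc.txt"));

    // hadoop fs -ls   (list the user's home directory)
    for (FileStatus status : fs.listStatus(fs.getHomeDirectory())) {
      System.out.println(status.getPath());
    }

    // hadoop fs -copyToLocal /user/fred/bar.txt baz.txt
    fs.copyToLocalFile(new Path("/user/fred/bar.txt"), new Path("baz.txt"));

    // hadoop fs -mkdir input
    fs.mkdirs(new Path("input"));

    // hadoop fs -rmr input_old   (true = delete recursively)
    fs.delete(new Path("input_old"), true);

    fs.close();
  }
}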



Recap
What is the functionality of NameNode High Availability?
How do we copy data from the local filesystem to HDFS?
How do we copy data from HDFS to the local filesystem?
How do we create a directory in HDFS?

What is MapReduce?
 Simplified Massive Parallel Processing of Data on Large Clusters
 Splitting a Big Problem/Data into Little Pieces
 Everything is in the form of Key-Value pairs for flexibility
 MapReduce is a method for distributing a task across multiple nodes
 Each node processes data stored on that node
– Where possible
 Consists of two phases:
– Map
– Reduce



Features of MapReduce
 Automatic parallelization and distribution

 Fault-tolerance

 A clean abstraction for programmers

– MapReduce programs are usually written in Java


– Can be written in any scripting language using Hadoop Streaming
– All of Hadoop is written in Java

 MapReduce abstracts all the ‘housekeeping’ away from the developer

– Developer can concentrate simply on writing the Map and Reduce functions

 For the Mapper and Reducer, the input must be in the form of (key, value) pairs,
and their output is also in the form of (key, value) pairs.



Recap

What is the functionality of MapReduce?


How many phases are there in MapReduce?
What are the features in MapReduce?


MapReduce - Mapperflow


MapReduce- Shuffle and Sort


MapReduce – Reducers to Output


Recap

• What is Input Split?


• What is RecordReader?
• What is Map?
• What is Partitioner?
• What is Shuffle?
• What is Sort?
• What is RecordWriter?
• What is Reducer?
MapReduce: The JobTracker
 MapReduce jobs are controlled by a software daemon known as
the JobTracker

 The JobTracker resides on a ‘master node’


– Clients submit MapReduce jobs to the JobTracker

– The JobTracker assigns Map and Reduce tasks to other nodes


on the cluster

– These nodes each run a software daemon known as the


TaskTracker

– The TaskTracker is responsible for actually instantiating the Map


or Reduce task, and reporting progress back to the JobTracker



MapReduce: The TaskTracker
 A TaskTracker is a node in the cluster that accepts tasks - Map, Reduce
and Shuffle operations - from a JobTracker

 Although there is a single TaskTracker per slave node, each TaskTracker


can spawn multiple JVMs to handle many map or reduce tasks in parallel



MapReduce-Job & Task Tracker

JobTracker and TaskTracker interaction. After a client calls the JobTracker to begin a data
processing job, the JobTracker partitions the work and assigns different map and reduce tasks to
each TaskTracker in the cluster



MapReduce: The Mapper
 Hadoop attempts to ensure that Mappers run on nodes which
hold their portion of the data locally, to avoid network traffic

– Multiple Mappers run in parallel, each processing a portion of the


input data

 The Mapper reads data in the form of key/value pairs

 It outputs zero or more key/value pairs

map(in_key, in_value) ->


(inter_key, inter_value) list



MapReduce: The Mapper (cont’d)
 The Mapper may use or completely ignore the input key

– For example, a standard pattern is to read a line of a file at a time


– The key is the byte offset into the file at which the line starts
– The value is the contents of the line itself
– Typically the key is considered irrelevant

If the Mapper writes anything out, the output must be in the form of
key/value pairs



Example Mapper: Upper Case Mapper
Turn input into upper case (pseudo-code):

let map(k, v) =
emit(k.toUpper(), v.toUpper())

('foo', 'bar') -> ('FOO', 'BAR')


('foo', 'other') -> ('FOO', 'OTHER')
('baz', 'more data') -> ('BAZ', 'MORE DATA')



Example Mapper: Explode Mapper
Output each input character separately (pseudo-code):

let map(k, v) =
foreach char c in v:
emit (k, c)

('foo', 'bar') -> ('foo', 'b'), ('foo', 'a'),


('foo', 'r')

('baz', 'other') -> ('baz', 'o'), ('baz', 't'),


('baz', 'h'), ('baz', 'e'),
('baz', 'r')



MapReduce: The Reducer
 After the Map phase is over, all the intermediate values for a given intermediate key are combined
together into a list

 This list is given to a Reducer

– There may be a single Reducer, or multiple Reducers


– This is specified as part of the job configuration (see later)
– All values associated with a particular intermediate key are
guaranteed to go to the same Reducer
– The intermediate keys, and their value lists, are passed to the
Reducer in sorted key order
– This step is known as the ‘shuffle and sort’

 The Reducer outputs zero or more final key/value pairs

– These are written to HDFS


– In practice, the Reducer usually emits a single key/value pair for
each input key
Example Reducer: Sum Reducer
Add up all the values associated with each intermediate key
(pseudo-code):

let reduce(k, vals) =


sum = 0
foreach int i in vals:
sum += i
emit(k, sum)

('bar', [9, 3, -17, 44]) -> ('bar', 39)


('foo', [123, 100, 77]) -> ('foo', 300)



Example Reducer: Identity Reducer
The Identity Reducer is very common (pseudo-code):

let reduce(k, vals) =


foreach v in vals:
emit(k, v)

('foo', [9, 3, -17, 44]) -> ('foo', 9), ('foo', 3),


('foo', -17), ('foo', 44)

('bar', [123, 100, 77]) -> ('bar', 123), ('bar', 100),


('bar', 77)



Recap

 What is the functionality of Jobtracker?


 What is the Functionality of Tasktracker?



Some Standard InputFormats
FileInputFormat
– The base class used for all file-based InputFormats
TextInputFormat
– The default
– Treats each \n-terminated line of a file as a value
– Key is the byte offset within the file of that line
KeyValueTextInputFormat
– Maps \n-terminated lines as ‘key SEP value’
– By default, separator is a tab
SequenceFileInputFormat
– Binary file of (key, value) pairs with some additional metadata
SequenceFileAsTextInputFormat
– Similar, but maps (key.toString(), value.toString())
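
As a hedged illustration using the old org.apache.hadoop.mapred API that the rest of these slides use, the InputFormat is chosen in the driver’s run() method (TextInputFormat is the default, so the first line is only needed for a different format):

conf.setInputFormat(KeyValueTextInputFormat.class);
// optional: change the key/value separator from the default tab
conf.set("key.value.separator.in.input.line", ",");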

Keys and Values are Objects
 Keys and values in Hadoop are Objects

 Values are objects which implement Writable

 Keys are objects which implement WritableComparable

 Hadoop defines its own ‘box classes’ for strings, integers and so on

– IntWritable for ints


– LongWritable for longs
– FloatWritable for floats
– DoubleWritable for doubles
– Text for strings
– Etc.
 The Writable interface makes serialization quick and easy for Hadoop

 Any value’s type must implement the Writable interface
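
A small illustrative fragment (values are made up) showing the box classes in use:

Text word = new Text("hadoop");        // wraps a Java String
IntWritable one = new IntWritable(1);  // wraps a Java int
String s = word.toString();            // back to a String
int n = one.get();                     // back to an int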

What is WritableComparable?
 A WritableComparable is a Writable which is also Comparable

– Two WritableComparables can be compared against each other to determine


their ‘order’

– Keys must be WritableComparables because they are passed to the Reducer in


sorted order

 Note that despite their names, all Hadoop box classes implement both Writable
and WritableComparable

– For example, IntWritable is actually a WritableComparable



Recap

What is TextInputFormat?
What is KeyValueTextInputFormat?
What is Writable?
What is WritableComparable?


How many Maps and Reduces


 We can’t set the number of mappers arbitrarily, because the
number of mappers depends on the number of input splits
 Each input split is allocated to a single mapper
 We can set the number of reducers arbitrarily:
conf.setNumReduceTasks(number);
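
For example, with a 128 MB block size, a 1 GB input file typically produces 8 input splits and therefore 8 map tasks, while conf.setNumReduceTasks(4); would run exactly 4 reduce tasks.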


Word Count Example


• Mapper
– Input: value: lines of text of input
– Output: key: word, value: 1
• Reducer
– Input: key: word, value: set of counts
– Output: key: word, value: sum
• Launching program
– Defines this job
– Submits job to cluster


Word Count Dataflow


Word Count Example


• Jobs are controlled by configuring JobConfs
• JobConfs are maps from attribute names to string values
• The framework defines attributes to control how the job is executed
– conf.set(“mapred.job.name”, “MyApp”);
• Applications can add arbitrary values to the JobConf
– conf.set(“my.string”, “foo”);
– conf.setInt(“my.integer”, 12);
• JobConf is available to all tasks



The Driver Code: Introduction
 The driver code runs on the client machine
 It configures the job, then submits it to the cluster

Driver Code:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool
{
  public int run(String[] args) throws Exception
  {
    if (args.length != 2)
    {
      System.out.println("Number of arguments must be two");
      return -1;
    }
    JobConf conf = new JobConf(getConf(), WordCount.class);
    conf.setJobName(this.getClass().getName());

    conf.setMapperClass(WordMapper.class);
    conf.setReducerClass(WordReducer.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(IntWritable.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception
  {
    int exitCode = ToolRunner.run(new WordCount(), args);
    System.exit(exitCode);
  }
}

Mapper Code:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
  public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException
  {
    String s = value.toString();
    for (String word : s.split(" "))
    {
      if (word.length() > 0)
        output.collect(new Text(word), new IntWritable(1));
    }
  }
}
Reducer Code:
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.IntWritable;
import java.util.Iterator;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>
{
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
  {
    int sum = 0;
    while (values.hasNext())
    {
      IntWritable value = values.next();
      sum += value.get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
The Combiner
 Often, Mappers produce large amounts of intermediate data
– That data must be passed to the Reducers
– This can result in a lot of network traffic

 It is often possible to specify a Combiner


– Like a ‘mini-Reducer’
– Runs locally on a single Mapper’s output
– Output from the Combiner is sent to the Reducers

 Combiner and Reducer code are often identical


– Technically, this is possible if the operation performed is
commutative and associative
– In this case, input and output data types for the Combiner/
Reducer must be identical



Specifying a Combiner
 To specify the Combiner class to be used in your MapReduce
code, put the following line in your Driver:

conf.setCombinerClass(YourCombinerClass.class);

 The Combiner uses the same interface as the Reducer

– Takes in a key and a list of values


– Outputs zero or more (key, value) pairs
– The actual method called is the reduce method in the class
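
For instance, since the word-count reduce operation (summing) is commutative and associative, the WordReducer class from the earlier example could also be reused as the Combiner in the WordCount driver:

conf.setCombinerClass(WordReducer.class);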



Recap

What is Combiner?
How to specify the Combiner?


Partitioners
• Partitioners are application code that defines how keys are assigned to
reducers
• Default partitioning spreads keys evenly, but randomly
– Uses key.hashCode() % num_reduces
• Custom partitioning is often required, for example, to produce a total order
in the output
– Should implement Partitioner interface
– Set by calling conf.setPartitionerClass(MyPart.class)
– To get a total order, sample the map output keys and pick values to divide the keys into
roughly equal buckets and use that in your partitioner



What Does The Partitioner Do?
 The Partitioner divides up the keyspace

– Controls which Reducer each intermediate key and its associated


values goes to
 Often, the default behavior is fine
– Default is the HashPartitioner

public class HashPartitioner<K2, V2> implements Partitioner<K2, V2> {

  public void configure(JobConf job) {}

  public int getPartition(K2 key, V2 value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}



Creating a Custom Partitioner
 To create a custom partitioner:

1. Create a class for the partitioner


– Should implement the Partitioner interface

2. Create a method in the class called getPartition


– Receives the key, the value, and the number of Reducers
– Should return an int between 0 and one less than the number of
Reducers
– e.g., if it is told there are 10 Reducers, it should return an int
between 0 and 9
3. Specify the custom partitioner in your driver code

conf.setPartitionerClass(MyPartitioner.class);
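
A minimal illustrative sketch of such a class, using the old org.apache.hadoop.mapred API from these slides (the class name and partitioning rule are hypothetical, not part of the original example):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {

  public void configure(JobConf job) {}

  // Route each key to a Reducer based on its first character,
  // returning an int between 0 and numReduceTasks - 1
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    String s = key.toString();
    if (s.length() == 0) {
      return 0;
    }
    return (Character.toLowerCase(s.charAt(0)) & Integer.MAX_VALUE) % numReduceTasks;
  }
}

In the driver, this class would then be registered with conf.setPartitionerClass(FirstLetterPartitioner.class);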
Shuffle & Sort
 Shuffle
- After the first map tasks have completed, the nodes may still be performing
several more map tasks each. But they also begin exchanging the intermediate
outputs from the map tasks to where they are required by the reducers. This
process of moving map outputs to the reducers is known as shuffling.

 Sort

- Each reduce task is responsible for reducing the values associated with
several intermediate keys. The set of intermediate keys on a single node is
automatically sorted by Hadoop before they are presented to the Reducer



Recap

 What is Partitioner?
 What is the default Partitioner?
 How can we specify the Partitioner?
 What is Shuffle?
 What is Sort?

