
Introduction To Hadoop/Big Data



Data is Exploding
 We live in the data age.

 An IDC estimate put the size of the “digital universe” at 0.18 zettabytes in 2006.

 Later forecasts projected growth to roughly 40 zettabytes by 2020.

 A zettabyte is 10^21 bytes, or equivalently one thousand exabytes, one million petabytes, or one billion terabytes.

 That’s roughly the same order of magnitude as one disk drive for every person in the world.

!!! BIG DATA !!!
Today, we’re surrounded by data.
 People upload videos, take pictures on their cell phones, text friends, update their
Facebook status, leave comments around the web, click on ads, and so forth.

 Machines, too, are generating and keeping more and more data.

 The exponential growth of data first presented challenges to cutting-edge businesses
such as Google, Yahoo, Amazon, and Microsoft.



Cont’d
 They needed to go through terabytes and petabytes of data to figure out which
websites were popular, what books were in demand, and what kinds of ads
appealed to people.

 Existing tools were becoming inadequate to process such large data sets.



Data Sources
This flood of data is coming from many sources. Consider the following:

 The New York Stock Exchange generates about one terabyte of new trade data per day.
 Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.
 More importantly, the amount of data generated by machines will be even greater than that
generated by people.
Ex: Machine logs, sensor networks, vehicle GPS traces, retail transactions

Examples:
– Facebook: over 500 TB of data per day

– eBay: more than 500 TB of data per day



Traditional Large Scale Systems
 Imagine you have 1GB of data that you need to process.

 The data is stored in a relational database on your desktop computer, and this desktop computer
has no problem handling the load.

 Then your company starts growing very quickly, and that data grows to 10 GB, and then 100 GB.



Traditional Large Scale Systems
 And you start to reach the limits of your current desktop computer.

 So you scale up by investing in a larger computer, and you are then OK for a few more
months.

 Then your data grows to 10 TB, and then 100 TB.

 You are fast approaching the limits of that computer.


Traditional Large Scale Systems

 Moreover, you are now asked to feed your application with unstructured data coming from
sources like Facebook, Twitter, RFID readers, sensors, and so on.

 You need to derive information from both the relational data and the unstructured data, and
want this information as soon as possible.



Traditional Large Scale Systems



Traditional Large Scale Systems
 Traditionally, computation has been processor-bound.

 For decades, the primary push was to increase the computing power of a single machine:

faster processors, more RAM.

 A distributed system consists of multiple computers that communicate through
a network. The computers interact with each other in order to achieve a common
goal.

 Distributed systems evolved to allow developers to use multiple machines for a single job.



Distributed Systems: Problems
 Programming for traditional distributed systems is complex


Finite bandwidth is available

Data exchange requires synchronization

It is difficult to deal with partial failures of the system

Distributed Systems: Data Storage



Typically, data for a distributed system is stored on a SAN

At compute time, data is copied to the compute nodes

Fine for relatively limited amounts of data



 A SAN has no concept of data locality: every time, the data must be brought
across the network to the compute nodes for processing.



 The traditional data processing model has data stored in 'storage cluster', data is
copied over to 'compute cluster' for processing and the results are written back
to storage cluster.

 This model, however, doesn’t quite work for Big Data, because copying so
much data out to the compute cluster may be too time-consuming, or even impossible. So
what is the answer?

 One solution is to process Big Data ‘in place’ – that is, the storage cluster doubles as the
compute cluster.
Hadoop’s History
 Hadoop is based on work done by Google in the late 1990s/early
2000s
– Specifically, on papers describing the Google File System (GFS)
published in 2003, and MapReduce published in 2004
 This work took a radically new approach to the problem of
distributed computing
– Meets all the requirements we have for reliability and scalability
 Core concept: distribute the data as it is initially stored in the
system
– Individual nodes can work on data local to those nodes
– No data transfer over the network is required for initial processing
What is HADOOP?
Hadoop is a free, Java-based programming framework that supports the processing of large
data sets in a distributed computing environment.

Hadoop Components
 Hadoop consists of two core components
– The Hadoop Distributed File System (HDFS)
– MapReduce

 There are many other projects based around core Hadoop


– Often referred to as the ‘Hadoop Ecosystem’
– Pig, Hive, HBase, Flume, Oozie, Sqoop, etc
Cont’d

A set of machines running HDFS and MapReduce is known as a


Hadoop Cluster
– Individual machines are known as nodes
– A cluster can have as few as one node, or as many as several
thousand
– More nodes = better performance!



Requirements of New Approach
 Partial Failure Support

The system must support partial failure
– Failure of a component should result in a graceful degradation
of application performance
– Not complete failure of the entire system

 Data Recoverability

If a component of the system fails, its workload should be
assumed by still-functioning units in the system
– Failure should not result in the loss of any data



 Component Recovery
 If a component of the system fails and then recovers, it should
be able to rejoin the system

– Without requiring a full restart of the entire system


 Consistency

Component failures during execution of a job should not affect


the outcome of the job



Cont’d
Scalability
Adding load to the system should result in a graceful decline in performance of

individual jobs
– Not failure of the system

Increasing resources should support a proportional increase in


load capacity



 Data locality

• Hadoop brings the data locality concept by effectively

utilizing the underlying hardware and network.



Recap

 What is Hadoop?
 What are the core components in Hadoop?
 Advantages of Hadoop?



HADOOP Core Concepts:



Hadoop Distributed File System
(HDFS)


HDFS (Hadoop Distributed File System)


• Filesystems that manage the storage across a network of
machines are called distributed filesystems.
• HDFS is a filesystem designed for storing very large files
with streaming data access patterns, running on clusters
of nodes with commodity hardware.


• Very large files :“Very large” in this context means files that are
hundreds of megabytes, gigabytes, or terabytes in size. There
are Hadoop clusters running today that store petabytes of data.

• Streaming data access :HDFS is built around the idea that the
most efficient data processing pattern is a write-once, read-
many-times pattern.

• A dataset is typically generated or copied from source, then


various analyses are performed on that dataset over time.

• Each analysis will involve a large proportion, if not all, of the


dataset, so the time to read the whole dataset is more
important than the latency in reading the first record.


 Commodity hardware: Hadoop doesn’t require expensive, highly

reliable hardware to run on.

 It’s designed to run on clusters of commodity hardware (hardware commonly

available from multiple vendors) for which the chance of node failure across
the cluster is high, at least for large clusters.

 HDFS is designed to carry on working without a noticeable


interruption to the user in the face of such failure.


HDFS Basic Concepts

• HDFS is a filesystem written in Java


• Based on Google’s GFS

• Sits on top of a native filesystem


• Such as ext3, ext4 or xfs

• Provides redundant storage for massive amounts of data


• Using cheap, unreliable computers


 Files are split into blocks


– Each block is usually 64MB or 128MB

 Data is distributed across many machines at load time


– Different blocks from the same file will be stored on different machines
– This provides for efficient MapReduce processing (see later)

 Blocks are replicated across multiple machines, known as DataNodes


– Default replication is three-fold
– Meaning that each block exists on three different machines

 A master node called the NameNode keeps track of which blocks make
up a file, and where those blocks are located
– Known as the metadata
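
A quick worked example (assuming the 128 MB block size and three-fold replication mentioned above): a 1 GB file is split into 8 blocks of 128 MB each; with replication, 24 block replicas are spread across the DataNodes, and the NameNode’s metadata records which three DataNodes hold each of the 8 blocks.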


Recap

• What is distributed system?


• What is HDFS?
• What is the HDFS block size?
• What is default replication factor in Hadoop?


HDFS Architecture


HDFS Architecture Cont’d


Functions of a NameNode
 Manages File System Namespace
 Maps a file name to a set of blocks
 Maps a block to the Data Nodes where it resides
 Cluster Configuration Management
 Replication Engine for Blocks


NameNode Metadata
 Metadata in Memory
 The entire metadata is in main memory
 No demand paging of metadata
 Types of metadata
 List of files
 List of Blocks for each file
 List of Data Nodes for each block
 File attributes, e.g. creation time, replication factor
 A Transaction Log
 Records file creations, file deletions etc.
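
A purely illustrative sketch of what this metadata might look like (file names, block IDs, and DataNode names are made up):

/user/training/abc.txt -> blocks [blk_1, blk_2]
blk_1 -> DataNodes {dn1, dn4, dn7}
blk_2 -> DataNodes {dn2, dn5, dn8}
attributes: replication = 3, creation time, permissions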


Recap

• What is the functionality of NameNode?


• What is metadata?



HDFS - Blocks
 File Blocks

– 64MB (default), 128MB (recommended) – compare to 4KB in UNIX

– Behind the scenes, 1 HDFS block is supported by multiple operating


system (OS) blocks

(Diagram: one 128 MB HDFS block mapped onto many underlying OS blocks)



Advantages of blocks:

– Fixed size – easy to calculate how many fit on a disk

– A file can be larger than any single disk in the network

– If a file or a chunk of the file is smaller than the block size, only
needed space is used. E.g.: 420MB file is split as:

128MB 128MB 128MB 36MB


DataNode
 A Block Server
 Stores data in the local file system (e.g. ext3)
 Stores metadata of a block (e.g. CRC)
 Serves data and metadata to Clients
 Block Report
 Periodically sends a report of all existing blocks to the Name Node
 Facilitates Pipelining of Data
 Forwards data to other specified Data Nodes


Block Placement
 Current Strategy
 Two replicas stored in same rack
 Third replica stored in other rack
 Additional replicas are randomly placed
 Clients read from nearest replicas
 Would like to make this policy pluggable


Recap

• What is the block size in Hadoop?


• What is the functionality of DataNode?
• How is data (a block) replicated in a Hadoop
cluster?


Secondary NameNode

File system image and Edit log

 fsimage – a snapshot of the filesystem metadata taken when the NameNode started

 Edit log – the sequence of changes made to the filesystem after the NameNode started


Recap

• What is the functionality of Secondary NameNode?


• What is fsimage and editlog ?


Writing files to HDFS


How data is Replicated


Rack Awareness


Client Read File From HDFS


Hadoop Cluster


Recap

• Why does the client always contact the NameNode when writing a file?
• Is the NameNode involved in writing the data after the client has the
metadata? If not, why not?
• What is replication pipelining?
• Why do we need to maintain a rack-awareness policy?
• Why does the client always contact the NameNode when reading a file?
• What happens if the metadata is lost?


Hadoop Configuration Files

hadoop-env.sh (Bash script) – Environment variables that are used in the scripts that run Hadoop.

core-site.xml (Hadoop configuration XML) – Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.

hdfs-site.xml (Hadoop configuration XML) – Configuration settings for the HDFS daemons: the namenode, the secondary namenode, and the datanodes.

mapred-site.xml (Hadoop configuration XML) – Configuration settings for the MapReduce daemons: the jobtracker and the tasktrackers.


HDFS Parameters
fs.default.name (URI; default: file:///) – The default filesystem. The URI defines the hostname and port that the namenode’s RPC server runs on; the default port is 8020. This property should be set in core-site.xml.

dfs.name.dir (comma-separated directory names; default: ${hadoop.tmp.dir}/dfs/name) – The list of directories where the namenode stores its persistent metadata. The namenode stores a copy of the metadata in each directory in the list.

dfs.data.dir (comma-separated directory names; default: ${hadoop.tmp.dir}/dfs/data) – A list of directories where the datanode stores blocks. Each block is stored in only one of these directories.

fs.checkpoint.dir (comma-separated directory names; default: ${hadoop.tmp.dir}/dfs/namesecondary) – A list of directories where the secondary namenode stores checkpoints. It stores a copy of the checkpoint in each directory in the list.
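
These properties are normally set in core-site.xml and hdfs-site.xml rather than in code, but the following minimal sketch (the override value is hypothetical) shows how a client can read or set them through Hadoop’s Configuration API:

import org.apache.hadoop.conf.Configuration;

public class ShowConfig {
  public static void main(String[] args) {
    // Loads configuration resources (e.g. core-site.xml) found on the classpath
    Configuration conf = new Configuration();

    // Read the default filesystem URI (falls back to file:/// if unset)
    System.out.println("fs.default.name = " + conf.get("fs.default.name", "file:///"));

    // May print null if hdfs-site.xml is not on the classpath
    System.out.println("dfs.name.dir = " + conf.get("dfs.name.dir"));

    // Hypothetical programmatic override; normally this lives in core-site.xml
    conf.set("fs.default.name", "hdfs://namenode:8020");
  }
}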


High Availability in Namenode

• Before CDH4, the NameNode was a single point of failure (SPOF) in an HDFS cluster.
• Each cluster had a single NameNode, and if that machine or process became
unavailable, the cluster as a whole would be unavailable until the NameNode was
either restarted or brought up on a separate machine. The Secondary NameNode
did not provide failover capability.
This reduced the total availability of the HDFS cluster in two major ways:
• In the case of an unplanned event such as a machine crash, the cluster would be
unavailable until an operator restarted the NameNode.
• Planned maintenance events such as software or hardware upgrades on the
NameNode machine would result in periods of cluster downtime.



HDFS Commands
 Copy file abc.txt from local disk to the user’s directory in HDFS

hadoop fs -copyFromLocal abc.txt abc.txt

– This will copy the file to /user/training/abc.txt


 Get a directory listing of the user’s home directory in HDFS

hadoop fs -ls

 Get a directory listing of the HDFS root directory

hadoop fs -ls /



HDFS Commands (cont’d)
 Display the contents of the HDFS file /user/fred/bar.txt

hadoop fs -cat /user/fred/bar.txt

 Copy that file to the local disk, naming it baz.txt

hadoop fs -copyToLocal /user/fred/bar.txt baz.txt

 Create a directory called input under the user’s home directory in HDFS

hadoop fs -mkdir input



HDFS Commands (cont’d)
 Delete the directory input_old and all its contents

hadoop fs -rmr input_old
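
The same operations are also available programmatically through the HDFS Java API. The following is only a minimal sketch (paths and file names are just examples) using the FileSystem class:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsOps {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // hadoop fs -copyFromLocal abc.txt abc.txt
    fs.copyFromLocalFile(new Path("abc.txt"), new Path("abc.txt"));

    // hadoop fs -ls   (list the user's home directory)
    for (FileStatus status : fs.listStatus(fs.getHomeDirectory())) {
      System.out.println(status.getPath());
    }

    // hadoop fs -copyToLocal /user/fred/bar.txt baz.txt
    fs.copyToLocalFile(new Path("/user/fred/bar.txt"), new Path("baz.txt"));

    // hadoop fs -mkdir input
    fs.mkdirs(new Path("input"));

    // hadoop fs -rmr input_old   (true = delete recursively)
    fs.delete(new Path("input_old"), true);

    fs.close();
  }
}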



Recap
What is the functionality of NameNode High Availability?
How do we copy data from the local filesystem to HDFS?
How do we copy data from HDFS to the local filesystem?
How do we create a directory in HDFS?

What is MapReduce?
 Simplified Massive Parallel Processing of Data on Large Clusters
 Splitting a Big Problem/Data into Little Pieces
 Everything is in the form of Key-Value pairs for flexibility
 MapReduce is a method for distributing a task across multiple nodes
 Each node processes data stored on that node
– Where possible
 Consists of two phases:
– Map
– Reduce



Features of MapReduce
 Automatic parallelization and distribution

 Fault-tolerance

 A clean abstraction for programmers

– MapReduce programs are usually written in Java


– Can be written in any scripting language using Hadoop Streaming
– All of Hadoop is written in Java

 MapReduce abstracts all the ‘housekeeping’ away from the developer

– Developer can concentrate simply on writing the Map and Reduce functions

 For the Mapper and Reducer, the input must be in the form of (key, value) pairs,
and their output is also in the form of (key, value) pairs.



Recap

What is the functionality of MapReduce?


How many phases are there in MapReduce?
What are the features in MapReduce?


MapReduce - Mapperflow


MapReduce- Shuffle and Sort


MapReduce – Reducers to Output


Recap

• What is Input Split?


• What is RecordReader?
• What is Map?
• What is Partitioner?
• What is Shuffle?
• What is Sort?
• What is RecordWriter?
• What is Reducer?
MapReduce: The JobTracker
 MapReduce jobs are controlled by a software daemon known as
the JobTracker

 The JobTracker resides on a ‘master node’


– Clients submit MapReduce jobs to the JobTracker

– The JobTracker assigns Map and Reduce tasks to other nodes


on the cluster

– These nodes each run a software daemon known as the


TaskTracker

– The TaskTracker is responsible for actually instantiating the Map


or Reduce task, and reporting progress back to the JobTracker



MapReduce: The TaskTracker
 A TaskTracker is a node in the cluster that accepts tasks - Map, Reduce
and Shuffle operations - from a JobTracker

 Although there is a single TaskTracker per slave node, each TaskTracker


can spawn multiple JVMs to handle many map or reduce tasks in parallel



MapReduce-Job & Task Tracker

JobTracker and TaskTracker interaction. After a client calls the JobTracker to begin a data
processing job, the JobTracker partitions the work and assigns different map and reduce tasks to
each TaskTracker in the cluster



MapReduce: The Mapper
 Hadoop attempts to ensure that Mappers run on nodes which
hold their portion of the data locally, to avoid network traffic

– Multiple Mappers run in parallel, each processing a portion of the


input data

 The Mapper reads data in the form of key/value pairs

 It outputs zero or more key/value pairs

map(in_key, in_value) ->


(inter_key, inter_value) list



MapReduce: The Mapper (cont’d)
 The Mapper may use or completely ignore the input key

– For example, a standard pattern is to read a line of a file at a time


– The key is the byte offset into the file at which the line starts
– The value is the contents of the line itself
– Typically the key is considered irrelevant

If the Mapper writes anything out, the output must be in the form of
key/value pairs



Example Mapper: Upper Case Mapper
Turn input into upper case (pseudo-code):

let map(k, v) =
emit(k.toUpper(), v.toUpper())

('foo', 'bar') -> ('FOO', 'BAR')


('foo', 'other') -> ('FOO', 'OTHER')
('baz', 'more data') -> ('BAZ', 'MORE DATA')



Example Mapper: Explode Mapper
Output each input character separately (pseudo-code):

let map(k, v) =
foreach char c in v:
emit (k, c)

('foo', 'bar') -> ('foo', 'b'), ('foo', 'a'),


('foo', 'r')

('baz', 'other') -> ('baz', 'o'), ('baz', 't'),


('baz', 'h'), ('baz', 'e'),
('baz', 'r')



MapReduce: The Reducer
 After the Map phase is over, all the intermediate values for a given intermediate key are combined
together into a list

 This list is given to a Reducer

– There may be a single Reducer, or multiple Reducers


– This is specified as part of the job configuration (see later)
– All values associated with a particular intermediate key are
guaranteed to go to the same Reducer
– The intermediate keys, and their value lists, are passed to the
Reducer in sorted key order
– This step is known as the ‘shuffle and sort’

 The Reducer outputs zero or more final key/value pairs

– These are written to HDFS


– In practice, the Reducer usually emits a single key/value pair for
each input key
Example Reducer: Sum Reducer
Add up all the values associated with each intermediate key
(pseudo-code):

let reduce(k, vals) =


sum = 0
foreach int i in vals:
sum += i
emit(k, sum)

('bar', [9, 3, -17, 44]) -> ('bar', 39)


('foo', [123, 100, 77]) -> ('foo', 300)



Example Reducer: Identity Reducer
The Identity Reducer is very common (pseudo-code):

let reduce(k, vals) =


foreach v in vals:
emit(k, v)

('foo', [9, 3, -17, 44]) -> ('foo', 9), ('foo', 3),


('foo', -17), ('foo', 44)

('bar', [123, 100, 77]) -> ('bar', 123), ('bar', 100),


('bar', 77)



Recap

 What is the functionality of Jobtracker?


 What is the Functionality of Tasktracker?



Some Standard InputFormats
FileInputFormat
– The base class used for all file-based InputFormats
TextInputFormat
– The default
– Treats each \n-terminated line of a file as a value
– Key is the byte offset within the file of that line
KeyValueTextInputFormat
– Maps \n-terminated lines as ‘key SEP value’
– By default, separator is a tab
SequenceFileInputFormat
– Binary file of (key, value) pairs with some additional metadata
SequenceFileAsTextInputFormat
– Similar, but maps (key.toString(), value.toString())
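
As a hedged illustration using the old org.apache.hadoop.mapred API that the rest of these slides use, the InputFormat is chosen in the driver’s run() method (TextInputFormat is the default, so the first line is only needed for a different format):

conf.setInputFormat(KeyValueTextInputFormat.class);
// optional: change the key/value separator from the default tab
conf.set("key.value.separator.in.input.line", ",");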

Keys and Values are Objects
 Keys and values in Hadoop are Objects

 Values are objects which implement Writable

 Keys are objects which implement WritableComparable

 Hadoop defines its own ‘box classes’ for strings, integers and so on

– IntWritable for ints


– LongWritable for longs
– FloatWritable for floats
– DoubleWritable for doubles
– Text for strings
– Etc.
 The Writable interface makes serialization quick and easy for Hadoop

 Any value’s type must implement the Writable interface
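
A small illustrative fragment (values are made up) showing the box classes in use:

Text word = new Text("hadoop");        // wraps a Java String
IntWritable one = new IntWritable(1);  // wraps a Java int
String s = word.toString();            // back to a String
int n = one.get();                     // back to an int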

What is WritableComparable?
 A WritableComparable is a Writable which is also Comparable

– Two WritableComparables can be compared against each other to determine


their ‘order’

– Keys must be WritableComparables because they are passed to the Reducer in


sorted order

 Note that despite their names, all Hadoop box classes implement both Writable
and WritableComparable

– For example, IntWritable is actually a WritableComparable



Recap

What is TextInputFormat?
What is KeyValueTextInputFormat?
What is Writable?
What is WritableComparable?


How many Maps and Reduces


 We can’t set the number of mappers arbitrarily, because the
number of mappers depends on the number of input splits
 Each input split is allocated to a single mapper
 We can set the number of reducers arbitrarily:
conf.setNumReduceTasks(number);
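
For example, with a 128 MB block size, a 1 GB input file typically produces 8 input splits and therefore 8 map tasks, while conf.setNumReduceTasks(4); would run exactly 4 reduce tasks.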


Word Count Example


• Mapper
– Input: value: lines of text of input
– Output: key: word, value: 1
• Reducer
– Input: key: word, value: set of counts
– Output: key: word, value: sum
• Launching program
– Defines this job
– Submits job to cluster


Word Count Dataflow


Word Count Example


• Jobs are controlled by configuring JobConfs
• JobConfs are maps from attribute names to string values
• The framework defines attributes to control how the job is executed
– conf.set(“mapred.job.name”, “MyApp”);
• Applications can add arbitrary values to the JobConf
– conf.set(“my.string”, “foo”);
– conf.setInt(“my.integer”, 12);
• JobConf is available to all tasks



The Driver Code: Introduction
 The driver code runs on the client machine
 It configures the job, then submits it to the cluster

Driver Code:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool
{
  public int run(String[] args) throws Exception
  {
    if (args.length != 2)
    {
      System.out.println("Number of arguments must be two");
      return -1;
    }
    JobConf conf = new JobConf(getConf(), WordCount.class);
    conf.setJobName(this.getClass().getName());

    conf.setMapperClass(WordMapper.class);
    conf.setReducerClass(WordReducer.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(IntWritable.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception
  {
    int exitCode = ToolRunner.run(new WordCount(), args);
    System.exit(exitCode);
  }
}

Mapper Code:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
  public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException
  {
    String s = value.toString();
    for (String word : s.split(" "))
    {
      if (word.length() > 0)
        output.collect(new Text(word), new IntWritable(1));
    }
  }
}
Reducer Code:
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.IntWritable;
import java.util.Iterator;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>
{
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
  {
    int sum = 0;
    while (values.hasNext())
    {
      IntWritable value = values.next();
      sum += value.get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
The Combiner
 Often, Mappers produce large amounts of intermediate data
– That data must be passed to the Reducers
– This can result in a lot of network traffic

 It is often possible to specify a Combiner


– Like a ‘mini-Reducer’
– Runs locally on a single Mapper’s output
– Output from the Combiner is sent to the Reducers

 Combiner and Reducer code are often identical


– Technically, this is possible if the operation performed is
commutative and associative
– In this case, input and output data types for the Combiner/
Reducer must be identical



Specifying a Combiner
 To specify the Combiner class to be used in your MapReduce
code, put the following line in your Driver:

conf.setCombinerClass(YourCombinerClass.class);

 The Combiner uses the same interface as the Reducer

– Takes in a key and a list of values


– Outputs zero or more (key, value) pairs
– The actual method called is the reduce method in the class
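
For instance, since the word-count reduce operation (summing) is commutative and associative, the WordReducer class from the earlier example could also be reused as the Combiner in the WordCount driver:

conf.setCombinerClass(WordReducer.class);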



Recap

What is Combiner?
How to specify the Combiner?


Partitioners
• Partitioners are application code that defines how keys are assigned to
reducers
• Default partitioning spreads keys evenly, but randomly
– Uses key.hashCode() % num_reduces
• Custom partitioning is often required, for example, to produce a total order
in the output
– Should implement Partitioner interface
– Set by calling conf.setPartitionerClass(MyPart.class)
– To get a total order, sample the map output keys and pick values to divide the keys into
roughly equal buckets and use that in your partitioner



What Does The Partitioner Do?
 The Partitioner divides up the keyspace

– Controls which Reducer each intermediate key and its associated


values goes to
 Often, the default behavior is fine
– Default is the HashPartitioner

public class HashPartitioner<K2, V2> implements Partitioner<K2, V2> {

  public void configure(JobConf job) {}

  public int getPartition(K2 key, V2 value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}



Creating a Custom Partitioner
 To create a custom partitioner:

1. Create a class for the partitioner


– Should implement the Partitioner interface

2. Create a method in the class called getPartition


– Receives the key, the value, and the number of Reducers
– Should return an int between 0 and one less than the number of
Reducers
– e.g., if it is told there are 10 Reducers, it should return an int
between 0 and 9
3. Specify the custom partitioner in your driver code

conf.setPartitionerClass(MyPartitioner.class);
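
A minimal illustrative sketch of such a class, using the old org.apache.hadoop.mapred API from these slides (the class name and partitioning rule are hypothetical, not part of the original example):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {

  public void configure(JobConf job) {}

  // Route each key to a Reducer based on its first character,
  // returning an int between 0 and numReduceTasks - 1
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    String s = key.toString();
    if (s.length() == 0) {
      return 0;
    }
    return (Character.toLowerCase(s.charAt(0)) & Integer.MAX_VALUE) % numReduceTasks;
  }
}

In the driver, this class would then be registered with conf.setPartitionerClass(FirstLetterPartitioner.class);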
Shuffle & Sort
 Shuffle
- After the first map tasks have completed, the nodes may still be performing
several more map tasks each. But they also begin exchanging the intermediate
outputs from the map tasks to where they are required by the reducers. This
process of moving map outputs to the reducers is known as shuffling.

 Sort

- Each reduce task is responsible for reducing the values associated with
several intermediate keys. The set of intermediate keys on a single node is
automatically sorted by Hadoop before they are presented to the Reducer



Recap

 What is Partitioner?
 What is the default Partitioner?
 How can we specify the Partitioner?
 What is Shuffle?
 What is Sort?

