Bigdata Hadoop PPTs
BIG DATA !!!
University Institute of Engineering (UIE)
Today, we’re surrounded by data.
People upload videos, take pictures on their cell phones, text friends, update their
Facebook status, leave comments around the web, click on ads, and so forth.
Machines, too, are generating and keeping more and more data.
Existing tools were becoming inadequate to process such large data sets.
The New York Stock Exchange generates about one terabyte of new trade data per day.
Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.
More importantly, the amount of data generated by machines will be even greater than that
generated by people.
Ex: Machine logs, sensor networks, vehicle GPS traces, retail transactions
Examples:
– Facebook: over 500 TB of data per day
Suppose the data is stored in a relational database on your desktop computer, and this desktop
computer has no problem handling the load.
Then your company starts growing very quickly, and that data grows to 10 GB, and then 100 GB.
So you scale up by investing in a larger computer, and you are then OK for a few more months.
Moreover, you are now asked to feed your application with unstructured data coming from
sources like Facebook, Twitter, RFID readers, sensors, and so on.
We need to derive information from both the relational data and the unstructured data, and
we want this information as soon as possible.
For decades, the primary push was to increase the computing power of a single machine:
• Faster processors, more RAM
Distributed systems evolved to allow developers to use multiple machines for a single job, but they bring their own problems:
– Finite bandwidth is available
– Data exchange requires synchronization
– It is difficult to deal with partial failures of the system
This model, however, doesn't quite work for Big Data, because copying so much data out to a
compute cluster might be too time-consuming or even impossible. So what is the answer?
One solution is to process Big Data 'in place': the storage cluster doubles as the compute cluster.
Hadoop’s History
Hadoop is based on work done by Google in the late 1990s/early
2000s
– Specifically, on papers describing the Google File System (GFS)
published in 2003, and MapReduce published in 2004
This work takes a radically new approach to the problem of
distributed computing
– Meets all the requirements we have for reliability and scalability
Core concept: distribute the data as it is initially stored in the
system
– Individual nodes can work on data local to those nodes
– No data transfer over the network is required for initial processing
What is HADOOP?
Hadoop is a free, Java-based programming framework that supports the processing of large
data sets in a distributed computing environment.
Hadoop Components
Hadoop consists of two core components
– The Hadoop Distributed File System (HDFS)
– MapReduce
Data Recoverability
If a component of the system fails, its workload should be
assumed by still-functioning units in the system
– Failure should not result in the loss of any data
Scalability
Adding load to the system should result in a graceful decline in the performance of
individual jobs
– Not failure of the system
– Increasing resources should support a proportional increase in load capacity
What is Hadoop?
What are the core components in Hadoop?
Advantages of Hadoop?
• Very large files :“Very large” in this context means files that are
hundreds of megabytes, gigabytes, or terabytes in size. There
are Hadoop clusters running today that store petabytes of data.
• Streaming data access :HDFS is built around the idea that the
most efficient data processing pattern is a write-once, read-
many-times pattern.
A master node called the NameNode keeps track of which blocks make
up a file, and where those blocks are located
– Known as the metadata
Recap
HDFS Architecture
Functions of a NameNode
Manages File System Namespace
Maps a file name to a set of blocks
Maps a block to the Data Nodes where it resides
Cluster Configuration Management
Replication Engine for Blocks
NameNode Metadata
Metadata in Memory
The entire metadata is in main memory
No demand paging of metadata
Types of metadata
List of files
List of Blocks for each file
List of Data Nodes for each block
File attributes, e.g. creation time, replication factor
A Transaction Log
Records file creations, file deletions etc.
Recap
Blocks
– HDFS splits files into blocks that are much larger than typical OS blocks
– If a file or a chunk of the file is smaller than the block size, only the
needed space is used. E.g., with a 64 MB block size, a 420 MB file is split as six full 64 MB blocks plus one 36 MB block.
DataNode
A Block Server
Stores data in the local file system (e.g. ext3)
Stores metadata of a block (e.g. CRC)
Serves data and metadata to Clients
Block Report
Periodically sends a report of all existing blocks to the Name Node
Facilitates Pipelining of Data
Forwards data to other specified Data Nodes
Block Placement
Current Strategy
Two replicas stored in the same rack
Third replica stored in another rack
Additional replicas are randomly placed
Clients read from nearest replicas
Would like to make this policy pluggable
Recap
Secondary NameNode
Recap
Rack Awareness
Hadoop Cluster
Recap
HDFS Parameters
fs.default.name
– Type: URI
– Default value: file:///
– Description: The default filesystem. The URI defines the hostname and port that the namenode's RPC server runs on. The default port is 8020. This property should be set in core-site.xml.

dfs.name.dir
– Type: comma-separated directory names
– Default value: ${hadoop.tmp.dir}/dfs/name
– Description: The list of directories where the namenode stores its persistent metadata. The namenode stores a copy of the metadata in each directory in the list.

dfs.data.dir
– Type: comma-separated directory names
– Default value: ${hadoop.tmp.dir}/dfs/data
– Description: A list of directories where the datanode stores blocks. Each block is stored in only one of these directories.

fs.checkpoint.dir
– Type: comma-separated directory names
– Default value: ${hadoop.tmp.dir}/dfs/namesecondary
– Description: A list of directories where the secondary namenode stores checkpoints. It stores a copy of the checkpoint in each directory in the list.
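These properties are normally set in the cluster's configuration files (fs.default.name in core-site.xml, as noted above). As an illustration only, a minimal sketch of setting them programmatically on a Hadoop Configuration object; the hostname namenode.example.com and the /data/1, /data/2 directories are made-up values, not defaults from these slides:

import org.apache.hadoop.conf.Configuration;

public class HdfsConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Default filesystem: hostname and port of the namenode's RPC server (hypothetical host)
        conf.set("fs.default.name", "hdfs://namenode.example.com:8020");
        // Namenode keeps a copy of its persistent metadata in each listed directory
        conf.set("dfs.name.dir", "/data/1/dfs/name,/data/2/dfs/name");
        // Datanode block storage; each block lives in only one of these directories
        conf.set("dfs.data.dir", "/data/1/dfs/data,/data/2/dfs/data");
        // Secondary namenode checkpoint directories
        conf.set("fs.checkpoint.dir", "/data/1/dfs/namesecondary");
        System.out.println("Default FS: " + conf.get("fs.default.name"));
    }
}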
• Before CDH4, the NameNode was a single point of failure (SPOF) in an HDFS cluster.
• Each cluster had a single NameNode, and if that machine or process became
unavailable, the cluster as a whole would be unavailable until the NameNode was
either restarted or brought up on a separate machine. The Secondary NameNode
did not provide failover capability.
This reduced the total availability of the HDFS cluster in two major ways:
• In the case of an unplanned event such as a machine crash, the cluster would be
unavailable until an operator restarted the NameNode.
• Planned maintenance events such as software or hardware upgrades on the
NameNode machine would result in periods of cluster downtime.
hadoop fs -ls      (lists the contents of the user's HDFS home directory)
hadoop fs -ls /    (lists the contents of the HDFS root directory)
Fault-tolerance
– Developer can concentrate simply on writing the Map and Reduce functions
For the mapper and the reducer, the input must be in the form of (key, value) pairs,
and their outputs must also be (key, value) pairs.
MapReduce - Mapper Flow
Recap
Datanodes
JobTracker and TaskTracker interaction. After a client calls the JobTracker to begin a data
processing job, the JobTracker partitions the work and assigns different map and reduce tasks to
each TaskTracker in the cluster
If the Mapper writes anything out, the output must be in the form of
key/value pairs
let map(k, v) =
    emit(k.toUpper(), v.toUpper())

let map(k, v) =
    foreach char c in v:
        emit(k, c)
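The first of these pseudocode mappers, written out as Java against the old (mapred) API used elsewhere in these slides, might look like the sketch below. This is an illustration, not code from the slides; it assumes KeyValueTextInputFormat so that both the input key and value arrive as Text:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Upper-case mapper: emits the key and value converted to upper case
public class UpperCaseMapper extends MapReduceBase
        implements Mapper<Text, Text, Text, Text> {

    public void map(Text key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // The output is written as a (key, value) pair, as required
        output.collect(new Text(key.toString().toUpperCase()),
                       new Text(value.toString().toUpperCase()));
    }
}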
Keys and Values are Objects
Keys and values in Hadoop are Objects
Hadoop defines its own ‘box classes’ for strings, integers and so on
What is WritableComparable?
A WritableComparable is a Writable which is also Comparable
Note that despite their names, all Hadoop box classes implement both Writable
and WritableComparable
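A small illustrative snippet (not from the slides) showing two of these box classes, Text and IntWritable, being created, unwrapped, and compared; compareTo() is available because both classes implement WritableComparable:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class BoxClassDemo {
    public static void main(String[] args) {
        // Box classes wrap plain Java types so Hadoop can serialize them
        Text word = new Text("hadoop");
        IntWritable count = new IntWritable(1);

        // get() / toString() unwrap the underlying Java values
        System.out.println(word.toString() + " -> " + count.get());

        // compareTo() comes from the Comparable side of WritableComparable
        System.out.println(word.compareTo(new Text("hive")));   // negative: "hadoop" sorts before "hive"
    }
}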
What is TextInputFormat?
What is KeyValueTextInputFormat?
What is Writable?
What is WritableComparable?
Driver Code:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class WordCount extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.println("Number of arguments must be two");
            return -1;
        }

        // Configure the job: mapper, reducer, input/output paths and key/value types
        JobConf conf = new JobConf(getConf(), WordCount.class);
        conf.setJobName(this.getClass().getName());
        conf.setMapperClass(WordMapper.class);
        conf.setReducerClass(WordReducer.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(IntWritable.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        // Submit the job and wait for it to complete
        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new WordCount(), args);
        System.exit(exitCode);
    }
}
Mapper Code:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
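The body of the mapper class is missing from this slide; continuing from the imports above, a sketch of the WordMapper referenced by the driver code (standard word count, old mapred API) could look like this:

public class WordMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // Split each input line into words and emit a (word, 1) pair for each word
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                output.collect(word, one);
            }
        }
    }
}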
What is a Combiner?
– A 'mini-reducer' that runs locally on each node over the map output before the shuffle, reducing the amount of intermediate data sent across the network to the reducers
How do you specify the Combiner?
conf.setCombinerClass(YourCombinerClass.class);
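The WordReducer class used by the driver is not shown on these slides either; below is a sketch (old mapred API) that sums the counts for each word. Because addition is associative, the same class could also be registered as the combiner via conf.setCombinerClass(WordReducer.class):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // Sum all the counts emitted for this word and write a single total
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}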
Partitioners
• Partitioners are application code that defines how keys are assigned to
reducers
• Default partitioning spreads keys evenly, but randomly
– Uses key.hashCode() % num_reduces
• Custom partitioning is often required, for example, to produce a total order
in the output
– Should implement Partitioner interface
– Set by calling conf.setPartitionerClass(MyPart.class)
– To get a total order, sample the map output keys and pick values to divide the keys into
roughly equal buckets and use that in your partitioner
conf.setPartitionerClass(MyPartitioner.class);
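For illustration (not from the slides), a minimal custom partitioner implementing the old (mapred) Partitioner interface; the MyPartitioner name matches the conf.setPartitionerClass() call above, and the split-on-first-letter rule is just a made-up example:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class MyPartitioner implements Partitioner<Text, IntWritable> {

    public void configure(JobConf job) {
        // No per-job configuration needed for this simple example
    }

    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Send keys starting with a-m to one partition and the rest to another,
        // instead of the default key.hashCode() % numPartitions behaviour
        String s = key.toString();
        char first = s.isEmpty() ? 'a' : Character.toLowerCase(s.charAt(0));
        return (first <= 'm' ? 0 : 1) % numPartitions;
    }
}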
Shuffle & Sort
Shuffle
- After the first map tasks have completed, the nodes may still be performing
several more map tasks each. But they also begin exchanging the intermediate
outputs from the map tasks to where they are required by the reducers. This
process of moving map outputs to the reducers is known as shuffling.
Sort
- Each reduce task is responsible for reducing the values associated with
several intermediate keys. The set of intermediate keys on a single node is
automatically sorted by Hadoop before they are presented to the Reducer
What is Partitioner?
What is the default Partitioner?
How can we specify the Partitioner?
What is Shuffle?
What is Sort?