
Apache Hadoop

- Kumaresan Manickavelu
Problems With Scale

 Failure is the defining difference between distributed and local programming
 If components fail, their workload must be picked up by still-functioning units
 Nodes that fail and restart must be able to rejoin the group activity without a full group restart
 Increased load should cause graceful decline, not sudden failure
 Increasing resources should support a proportional increase in load capacity
 Data must be stored and shared with the processing units

2
Hadoop Ecosystem
 Apache Hadoop is a collection of open-source software
for reliable, scalable, distributed computing.
 Hadoop Common: The common utilities that support the
other Hadoop subprojects.
 HDFS: A distributed file system that provides high
throughput access to application data.
 MapReduce: A software framework for distributed
processing of large data sets on compute clusters.
 Pig: A high-level data-flow language and execution
framework for parallel computation.
 HBase: A scalable, distributed database that supports
structured data storage for large tables.

3
HDFS
 Based on Google’s GFS
 Redundant storage of massive amounts of data on cheap and
unreliable computers
 Optimized for huge files that are mostly appended and read
 Architecture
 HDFS has a master/slave architecture
 An HDFS cluster consists of a single NameNode and a number of
DataNodes
 HDFS is built using the Java language; any machine that supports
Java can run the NameNode or the DataNode software
 The NameNode executes file system namespace operations like
opening, closing, and renaming files and directories. It also
determines the mapping of blocks to DataNodes.
 The DataNodes are responsible for serving read and write
requests from the file system’s clients.
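
The NameNode's role described above can be illustrated with a toy sketch. This is plain Python, not real HDFS code; the class, method names, and round-robin placement policy are all hypothetical simplifications of how the NameNode maps blocks to DataNodes:

```python
import itertools

class ToyNameNode:
    """Hypothetical sketch: maps each block to `replication` DataNodes."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = list(datanodes)
        self.replication = replication
        self.block_map = {}               # block id -> DataNodes holding a replica
        self._rr = itertools.cycle(self.datanodes)

    def allocate_block(self, block_id):
        # Pick distinct DataNodes round-robin for this block's replicas.
        targets = []
        while len(targets) < min(self.replication, len(self.datanodes)):
            dn = next(self._rr)
            if dn not in targets:
                targets.append(dn)
        self.block_map[block_id] = targets
        return targets

nn = ToyNameNode(["dn1", "dn2", "dn3", "dn4"], replication=3)
replicas = nn.allocate_block("blk_0001")   # three distinct DataNodes
```

Real HDFS placement is rack-aware and the NameNode tracks replicas via DataNode heartbeats, but the core idea is the same: metadata and block locations live on the NameNode, data lives on the DataNodes.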

4
Map Reduce

 Provides a clean abstraction for programmers to write distributed applications
 Factors out many reliability concerns from application logic
 A batch data processing system
 Automatic parallelization & distribution
 Fault-tolerance
 Status and monitoring tools
5
Programming Model

 Programmer has to implement an interface of two functions:

– map (in_key, in_value) -> (out_key, intermediate_value) list

– reduce (out_key, intermediate_value list) -> out_value list
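
The two-function interface can be simulated locally. This is a minimal word-count sketch in plain Python, not the Hadoop Java API; the driver that shuffles intermediate pairs by key stands in for what the framework does between the map and reduce phases:

```python
from collections import defaultdict

def map_fn(in_key, in_value):
    # (line number, line text) -> list of (word, 1) intermediate pairs
    return [(word, 1) for word in in_value.split()]

def reduce_fn(out_key, intermediate_values):
    # (word, [1, 1, ...]) -> list of output values
    return [sum(intermediate_values)]

def run_mapreduce(inputs, mapper, reducer):
    # Toy driver: run all maps, group intermediate values by key, run reduces.
    shuffled = defaultdict(list)
    for k, v in inputs:
        for ok, ov in mapper(k, v):
            shuffled[ok].append(ov)
    return {k: reducer(k, vs) for k, vs in shuffled.items()}

result = run_mapreduce([("100", "I Love India"), ("101", "I Love eBay")],
                       map_fn, reduce_fn)
# result["I"] == [2], result["India"] == [1]
```

In real Hadoop the shuffle also sorts keys and moves data across the network, but the programmer's contract is exactly these two functions.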

6
MapReduce Flow

7
Mapper (indexing example)
 Input is the line number and the actual line.

 Input 1: (“100”, “I Love India”)
 Output 1: (“I”, “100”), (“Love”, “100”), (“India”, “100”)

 Input 2: (“101”, “I Love eBay”)
 Output 2: (“I”, “101”), (“Love”, “101”), (“eBay”, “101”)
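
The indexing mapper above can be sketched in a few lines of plain Python (a local illustration, not the Hadoop API):

```python
def index_map(line_no, line):
    # Emit one (word, line number) pair per word in the line.
    return [(word, line_no) for word in line.split()]

out1 = index_map("100", "I Love India")
# [("I", "100"), ("Love", "100"), ("India", "100")]
```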

8
Reducer (indexing example)
 Input is the word and the line numbers.

 Input 1: (“I”, “100”, ”101”)
 Input 2: (“Love”, “100”, ”101”)
 Input 3: (“India”, “100”)
 Input 4: (“eBay”, “101”)

 Output: the words are stored along with their line numbers.
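
The matching reducer is equally small. This sketch simply materializes each word with its list of line numbers, forming one posting of an inverted index:

```python
def index_reduce(word, line_nos):
    # (word, ["100", "101"]) -> one output record: the word and its line numbers
    return [(word, list(line_nos))]

posting = index_reduce("I", ["100", "101"])
# [("I", ["100", "101"])]
```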

9
Google PageRank example

 Mapper
 Input is a link and the HTML content of the page
 Output is a list of outgoing links, each paired with the pagerank of this page

 Reducer
 Input is a link and a list of pageranks of pages linking to this page
 Output is the pagerank of this page, the weighted average of all input pageranks
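
One PageRank iteration can be sketched in the same map/reduce shape. This is a simplified local simulation: the "weighted average" on the slide is implemented here in the standard damped-sum form, and the damping factor 0.85 and the toy three-page graph are assumptions for illustration:

```python
from collections import defaultdict

DAMPING = 0.85  # assumed damping factor, the conventional choice

def pr_map(page, value):
    # value = (current rank, list of outgoing links);
    # split this page's rank evenly among its outlinks.
    rank, outlinks = value
    share = rank / len(outlinks) if outlinks else 0.0
    return [(link, share) for link in outlinks]

def pr_reduce(page, contributions):
    # Combine incoming rank shares into this page's new rank.
    return (1 - DAMPING) + DAMPING * sum(contributions)

graph = {"A": (1.0, ["B", "C"]), "B": (1.0, ["C"]), "C": (1.0, ["A"])}
shuffled = defaultdict(list)
for page, value in graph.items():
    for target, share in pr_map(page, value):
        shuffled[target].append(share)
new_ranks = {page: pr_reduce(page, c) for page, c in shuffled.items()}
```

In practice the full computation repeats this map/reduce round until the ranks converge, which is why PageRank was an early motivating workload for MapReduce.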

10
Hadoop at Yahoo

 World's largest Hadoop production application
 The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster of more than 10,000 cores
 Biggest contributor to Hadoop
 Converting all of its batch processing to Hadoop

11
Hadoop at Amazon
 Hadoop can be run on Amazon Elastic Compute Cloud
(EC2) and Amazon Simple Storage Service (S3)
 The New York Times used 100 Amazon EC2 instances
and a Hadoop application to process 4TB of raw image
TIFF data (stored in S3) into 11 million finished PDFs in
the space of 24 hours at a computation cost of about
$240
 Amazon Elastic MapReduce is a new web service that
enables businesses, researchers, data analysts, and
developers to easily and cost-effectively process vast
amounts of data. It utilizes a hosted Hadoop framework.

12
Thanks

Questions?

kumaresan . manickavelu @ gmail.com

13
