- Kumaresan Manickavelu
Problems With Scale
Work done by failed units must be picked up by still-functioning units
Nodes that fail and restart must be able to rejoin
Hadoop Ecosystem
Apache Hadoop is a collection of open-source software for reliable, scalable, distributed computing.
Hadoop Common: The common utilities that support the other Hadoop subprojects.
HDFS: A distributed file system that provides high-throughput access to application data.
MapReduce: A software framework for distributed processing of large data sets on compute clusters.
Pig: A high-level data-flow language and execution framework for parallel computation.
HBase: A scalable, distributed database that supports structured data storage for large tables.
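The MapReduce entry above can be sketched in plain Python (no Hadoop APIs), using the canonical word-count example; the input lines here are made up for illustration:

```python
# A minimal sketch of the MapReduce programming model in plain Python.
# Real Hadoop jobs implement Mapper/Reducer classes in Java; this only
# shows the map -> shuffle -> reduce data flow.
from collections import defaultdict

def map_phase(lines):
    """Mapper: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Group all values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reducer: sum the counts emitted for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

counts = reduce_phase(shuffle(map_phase(["I Love India", "I Love eBay"])))
```

The framework, not user code, performs the shuffle/grouping step between map and reduce; that is what lets the phases run on different machines.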
HDFS
Based on Google’s GFS
Redundant storage of massive amounts of data on cheap and unreliable computers
Optimized for huge files that are mostly appended to and read
Architecture
HDFS has a master/slave architecture.
An HDFS cluster consists of a single NameNode and a number of DataNodes.
HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software.
The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.
The DataNodes are responsible for serving read and write requests from the file system’s clients.
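A toy sketch (not real HDFS code) of the division of labor just described: the NameNode tracks which blocks of which file live on which DataNodes, while DataNodes hold the block data itself. The block size and class names are invented for illustration; HDFS's real default block size is far larger (128 MB in recent versions).

```python
# Toy NameNode/DataNode model: files are split into fixed-size blocks,
# each block is replicated on several DataNodes, and only the NameNode
# knows the block-to-DataNode mapping.
BLOCK_SIZE = 4  # bytes, tiny for illustration

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}          # (filename, block index) -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

class NameNode:
    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication
        self.block_map = {}       # filename -> [(block index, [DataNode, ...])]

    def create(self, filename, data):
        blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
        placements = []
        for i, block in enumerate(blocks):
            # place each replica on a different DataNode, round-robin
            targets = [self.datanodes[(i + r) % len(self.datanodes)]
                       for r in range(self.replication)]
            for dn in targets:
                dn.store((filename, i), block)
            placements.append((i, targets))
        self.block_map[filename] = placements

dns = [DataNode(f"dn{i}") for i in range(4)]
nn = NameNode(dns, replication=3)
nn.create("/logs/a.txt", b"hello world!")   # 12 bytes -> 3 blocks
```

Because every block lives on several DataNodes, losing one cheap machine loses no data, which is the point of the redundant-storage design above.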
Map Reduce
Map Reduce Flow
Mapper (indexing example)
Input is the line number and the actual line; output is a (word, line number) pair for each word on the line.
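As a plain-Python sketch (not Hadoop API code) of this indexing mapper; the example lines are assumptions chosen to match the reducer inputs shown on the next slide:

```python
# Indexing mapper: takes (line number, line text) and emits one
# (word, line number) pair per word, ready to be grouped by word.
def index_mapper(line_no, line):
    for word in line.split():
        yield (word, line_no)

pairs = list(index_mapper("100", "I Love India"))
pairs += list(index_mapper("101", "I Love eBay"))
```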
Reducer (indexing example)
Input is a word and the list of line numbers on which it appears.
Input 1 : ("I", "100", "101")
Input 2 : ("Love", "100", "101")
Input 3 : ("India", "100")
Input 4 : ("eBay", "101")
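The matching reducer can be sketched in plain Python: for each word it receives the line numbers emitted by the mappers and outputs one entry of the inverted index.

```python
# Indexing reducer: the framework has already grouped line numbers
# by word; emit the word with its sorted, de-duplicated posting list.
def index_reducer(word, line_nos):
    return (word, sorted(set(line_nos)))

# the four inputs from the slide
postings = [
    index_reducer("I", ["100", "101"]),
    index_reducer("Love", ["100", "101"]),
    index_reducer("India", ["100"]),
    index_reducer("eBay", ["101"]),
]
```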
Google PageRank example
Mapper
Input is a link and the HTML content of the page; output is a PageRank contribution for each link found in the page.
Reducer
Input is a link and a list of PageRanks of pages linking to this page.
Output is the PageRank of this page, which is the damped sum of the contributions from pages linking to it.
Hadoop at Yahoo
Hadoop at Amazon
Hadoop can be run on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw TIFF image data (stored in S3) into 11 million finished PDFs in the space of 24 hours, at a computation cost of about $240.
Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework.
Thanks
Questions?