
HADOOP

Introduction
Hadoop is an open-source software project (written in Java) designed to let
developers write and run applications that process huge amounts of data. While it has the potential to
improve a wide range of other software, the ecosystem supporting its adoption is still
developing.

HDFS, the Hadoop Distributed File System, is a distributed file system designed to
hold very large amounts of data (terabytes or even petabytes) and to provide high-throughput
access to this information. Files are stored in a redundant fashion across multiple machines to
ensure their durability in the face of failures and their availability to highly parallel applications.

Hadoop is a large-scale distributed batch processing infrastructure. While it can be
used on a single machine, its true power lies in its ability to scale to hundreds or thousands of
computers, each with several processor cores. Hadoop is also designed to efficiently distribute
large amounts of work across a set of machines.
How large an amount of work? Orders of magnitude larger than many existing
systems can handle. Hundreds of gigabytes of data constitute the low end of Hadoop-scale;
Hadoop is built to process "web-scale" data on the order of hundreds of gigabytes to
terabytes or petabytes. At this scale, the input data set will likely not even fit on a single
computer's hard drive, much less in memory. Hadoop therefore includes a distributed file system which
breaks up input data and sends fractions of the original data to several machines in your cluster to
hold. The problem can then be processed in parallel using all of the machines in the
cluster, computing output results as efficiently as possible.

The Hadoop Approach

Hadoop is designed to efficiently process large volumes of information by connecting many
commodity computers together to work in parallel. The theoretical 1000-CPU machine described
earlier would cost a very large amount of money, far more than 1,000 single-CPU or 250 quad-core
machines. Hadoop will tie these smaller and more reasonably priced machines together into
a single cost-effective compute cluster.

Hadoop History

• Dec 2004 – Google paper published
• July 2005 – Nutch uses new MapReduce implementation
• Jan 2006 – Doug Cutting joins Yahoo!
• Feb 2006 – Hadoop becomes a new Lucene subproject
• Apr 2007 – Yahoo! running Hadoop on 1000-node cluster
• Jan 2008 – An Apache Top Level Project
• Feb 2008 – Yahoo! production search index with Hadoop
• July 2008 – First reports of a 4000-node cluster
Versions of Hadoop:

The versioning strategy used is major.minor.revision. Increments to the major version number
represent large differences in operation or interface and possibly significant incompatible changes.
At the time of this writing (September 2008), there have been no major upgrades; all Hadoop
versions have their major version set to 0. The minor version represents a large set of feature
improvements and enhancements. Hadoop instances with different minor versions may use
different versions of the HDFS file formats and protocols, requiring a DFS upgrade to migrate from
one to the next. Revisions are used to provide bug fixes. Within a minor version, the most recent
revision contains the most stable patches.

Versions of Hadoop       Date of Release       Status
Release 0.18.3 (Latest)  29 January, 2009      available
Release 0.19.0           21 November, 2008     available
Release 0.18.2           3 November, 2008      available
Release 0.18.1           17 September, 2008    available
Release 0.18.0           22 August, 2008       available
Release 0.17.2           19 August, 2008       available
Release 0.17.1           23 June, 2008         available
Release 0.17.0           20 May, 2008          available
Release 0.16.4           5 May, 2008           available
Release 0.16.3           16 April, 2008        available
Release 0.16.2           2 April, 2008         available
Release 0.16.1           13 March, 2008        available
Release 0.16.0           7 February, 2008      available
Release 0.15.3           18 January, 2008      available
Release 0.15.2           2 January, 2008       available
Release 0.15.1           27 November, 2007     available
Release 0.14.4           26 November, 2007     available
Release 0.15.0           29 October, 2007      available
Release 0.14.3           19 October, 2007      available
Release 0.14.1           4 September, 2007     available

What is a Distributed File System?

A distributed file system is designed to hold a large amount of data and provide access to this data
to many clients distributed across a network. There are a number of distributed file systems that
solve this problem in different ways.

NFS, the Network File System, is the most ubiquitous* distributed file system. It is
one of the oldest still in use. While its design is straightforward, it is also very constrained. NFS
provides remote access to a single logical volume stored on a single machine. An NFS server
makes a portion of its local file system visible to external clients. The clients can then mount this
remote file system directly into their own Linux file system, and interact with it as though it were
part of the local drive. One of the primary advantages of this model is its transparency. Clients do
not need to be particularly aware that they are working on files stored remotely.

But as a distributed file system, NFS is limited in its power. The files in an NFS volume
all reside on a single machine. This means that it will only store as much information as can be
stored on one machine, and it provides no reliability guarantees if that machine goes down
(for example, it does not replicate files to other servers). Finally, because all the data is stored on a
single machine, all clients must go to this machine to retrieve their data.
Hadoop Distributed File System (HDFS)

HDFS is designed to be robust to a number of the problems that other DFSs such
as NFS are vulnerable to. In particular:
• HDFS is designed to store a very large amount of information (terabytes or petabytes). This requires spreading the data across a large number of machines. It also supports much larger file sizes than NFS.
• HDFS should store data reliably. If individual machines in the cluster malfunction, data should still be available.
• HDFS should provide fast, scalable access to this information. It should be possible to serve a larger number of clients by simply adding more machines to the cluster.
• HDFS should integrate well with Hadoop MapReduce, allowing data to be read and computed upon locally when possible.
But while HDFS is very scalable, its high-performance design also restricts it to a particular class of applications; it
is not as general-purpose as NFS. There are a large number of additional decisions and trade-offs
that were made with HDFS.

In particular:
• Applications that use HDFS are assumed to perform long sequential streaming reads from files. HDFS is optimized to provide streaming read performance; this comes at the expense of random seek times to arbitrary positions in files.
• Data will be written to HDFS once and then read several times; updates to files after they have already been closed are not supported. (An extension to Hadoop will provide support for appending new data to the ends of files; it is scheduled to be included in Hadoop 0.19 but is not available yet.)
• Due to the large size of files and the sequential nature of reads, the system does not provide a mechanism for local caching of data. The overhead of caching is great enough that data should simply be re-read from the HDFS source.
• Individual machines are assumed to fail on a frequent basis, both permanently and intermittently. The cluster must be able to withstand the complete failure of several machines, possibly many happening at the same time (e.g., if a whole rack fails). While performance may degrade in proportion to the number of machines lost, the system as a whole should not become overly slow, nor should information be lost. Data replication strategies combat this problem.

HDFS Architecture

The design of HDFS is based on the design of GFS, the Google File System. HDFS
is a block-structured file system: individual files are broken into blocks of a fixed size. These
blocks are stored across a cluster of one or more machines with data storage capacity. Individual
machines in the cluster are referred to as DataNodes. A file can be made up of several blocks, and
they are not necessarily stored on the same machine; the target machines which hold each block
are chosen randomly on a block-by-block basis. Thus access to a file may require the cooperation
of multiple machines, but this supports file sizes far larger than a single-machine DFS can handle; individual files
can require more space than a single hard drive could hold. If several machines must be involved
in the serving of a file, then a file could be rendered unavailable by the loss of any one of those
machines. HDFS combats this problem by replicating each block across a number of machines (three, by default).
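
To make the block structure concrete, the sketch below uses the Java FileSystem API to ask which DataNodes hold each block of a file. This is an illustrative sketch, not part of the original tutorial: the file path is a placeholder, and the method names (getFileStatus, getFileBlockLocations) are from the public FileSystem API, which may differ slightly between Hadoop versions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical file path; replace with a file that exists in your HDFS.
    Path file = new Path("/user/someone/largefile.dat");
    FileStatus status = fs.getFileStatus(file);

    // Ask the NameNode which DataNodes hold each block of the file.
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());

    for (BlockLocation block : blocks) {
      System.out.println("offset " + block.getOffset()
          + ", length " + block.getLength()
          + ", hosts " + java.util.Arrays.toString(block.getHosts()));
    }
  }
}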

NameNode and DataNodes

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master
server that manages the file system namespace and regulates access to files by clients. In
addition, there are a number of DataNodes, usually one per node in the cluster, which manage
storage attached to the nodes that they run on. HDFS exposes a file system namespace and
allows user data to be stored in files. Internally, a file is split into one or more blocks and these
blocks are stored in a set of DataNodes. The NameNode executes file system namespace
operations like opening, closing, and renaming files and directories. It also determines the
mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write
requests from the file system’s clients. The DataNodes also perform block creation, deletion, and
replication upon instruction from the NameNode.

The NameNode and DataNode are pieces of software designed to run on
commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is
built using the Java language; any machine that supports Java can run the NameNode or the
DataNode software. Usage of the highly portable Java language means that HDFS can be
deployed on a wide range of machines. A typical deployment has a dedicated machine that runs
only the NameNode software. Each of the other machines in the cluster runs one instance of the
DataNode software. The architecture does not preclude running multiple DataNodes on the same
machine but in a real deployment that is rarely the case.

The existence of a single NameNode in a cluster greatly simplifies the architecture of the system.
The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in
such a way that user data never flows through the NameNode.

The File System Namespace

HDFS supports a traditional hierarchical file organization. A user or an application can create
directories and store files inside these directories. The file system namespace hierarchy is similar
to most other existing file systems; one can create and remove files, move a file from one
directory to another, or rename a file. HDFS does not yet implement user quotas or access
permissions. HDFS does not support hard links or soft links. However, the HDFS architecture
does not preclude implementing these features.
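
As an illustration of these namespace operations, the hedged sketch below creates a directory, moves a file, and deletes it through the Java FileSystem API. The paths are hypothetical placeholders; the calls (mkdirs, rename, delete) are standard FileSystem methods.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOperations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Create a directory (creates parent directories as needed).
    fs.mkdirs(new Path("/user/someone/reports"));

    // Move/rename a file from one directory to another.
    fs.rename(new Path("/user/someone/draft.txt"),
              new Path("/user/someone/reports/final.txt"));

    // Remove a file; the boolean enables recursive deletion for directories.
    fs.delete(new Path("/user/someone/reports/final.txt"), false);
  }
}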
The NameNode maintains the file system namespace. Any change to the
file system namespace or its properties is recorded by the NameNode. An application can specify
the number of replicas of a file that should be maintained by HDFS. The number of copies of a file
is called the replication factor of that file. This information is stored by the NameNode.

Data Replication

HDFS is designed to reliably store very large files across machines in a large
cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the
same size. The blocks of a file are replicated for fault tolerance. The block size and replication
factor are configurable per file. An application can specify the number of replicas of a file. The
replication factor can be specified at file creation time and can be changed later. Files in HDFS are
write-once and have strictly one writer at any time.
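
A brief sketch of how an application might control the replication factor through the Java API follows. The file name and replication values are illustrative; setReplication and the create overload that accepts a replication factor are standard FileSystem methods.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/someone/important.dat");

    // Specify a replication factor of 3 at file creation time.
    FSDataOutputStream out = fs.create(file, true,
        conf.getInt("io.file.buffer.size", 4096),
        (short) 3,                      // replication factor
        fs.getDefaultBlockSize());
    out.writeUTF("replicated data");
    out.close();

    // The replication factor can also be changed after the file is written.
    fs.setReplication(file, (short) 2);
  }
}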

The NameNode makes all decisions regarding replication of blocks. It periodically
receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a
Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all
blocks on a DataNode.

Most block-structured file systems use a block size on the order of 4 or 8 KB. By
contrast, the default block size in HDFS is 64MB -- orders of magnitude larger. HDFS expects to
store a modest number of very large files: hundreds of megabytes, or gigabytes each. After all, a
100 MB file is not even two full blocks. Files on your computer may also frequently be accessed
"randomly," with applications cherry-picking small amounts of information from several different
locations in a file which are not sequentially laid out. By contrast, HDFS expects to read a block
start-to-finish for a program. This makes it particularly useful to the MapReduce style of
programming. Because HDFS stores files as a set of large blocks across several machines, these
files are not part of the ordinary file system; listing the local file system on a DataNode machine
will not show the files stored inside HDFS. This is because HDFS runs in a separate namespace,
isolated from the contents of your local files. The files inside HDFS (or, more accurately, the blocks
that make them up) are stored in a particular directory managed by the DataNode service, but the
files there are named only with block ids.

HDFS does come with its own utilities for file management, which act very similarly to
the familiar command-line tools. It is important for this file system to store its metadata reliably. Furthermore,
while the file data is accessed in a write-once, read-many model, the metadata structures
(e.g., the names of files and directories) can be modified by a large number of clients concurrently.
It is important that this information is never desynchronized. Therefore, it is all handled by a single
machine, called the NameNode. The NameNode stores all the metadata for the file system.
Because of the relatively low amount of metadata per file (it only tracks file names, permissions,
and the locations of each block of each file), all of this information can be stored in the main
memory of the NameNode machine, allowing fast access to the metadata. Of course, NameNode
information must be preserved even if the NameNode machine fails; there are multiple redundant
systems that allow the NameNode to preserve the file system's metadata even if the NameNode
itself crashes irrecoverably. NameNode failure is more severe for the cluster than DataNode
failure. While individual DataNodes may crash and the entire cluster will continue to operate, the
loss of the NameNode will render the cluster inaccessible until it is manually restored.

Hadoop is an open source implementation of the MapReduce platform and
distributed file system, written in Java. Developing for Hadoop requires a Java programming
environment. Hadoop requires the Java Standard Edition (Java SE), version 6, which is the most
current version at the time of this writing. Users of Linux, Mac OS X, or other Unix-like
environments are able to install Hadoop and run it on one (or more) machines with no additional
software beyond Java.
What hardware scales best for Hadoop?

The short answer is dual-processor/dual-core machines with 4-8 GB of RAM using ECC memory
(Error-Correcting Code memory, a type of memory that includes special circuitry for testing the
accuracy of data as it passes in and out of memory). Machines should be moderately
high-end commodity machines to be most cost-effective; they typically cost 1/2 to 2/3 the cost of
normal production application servers but are not desktop-class machines. This cost tends to be
$2-5K per machine. For a more detailed discussion, see the MachineScaling page.

Hadoop - Typical Node configuration

• Dual-processor, dual-core CPUs
• 4-8 GB RAM; ECC preferred, though more expensive
• 2 x 250GB SATA drives
• Cost is about $2-5K per machine
• 1-5 TB external storage

Running Hadoop

Running Hadoop on top of Windows requires installing Cygwin, a Linux-like
environment that runs within Windows. Hadoop works reasonably well on Cygwin, but it is officially
supported for "development purposes only." Hadoop on Cygwin may be unstable, and installing Cygwin itself
can be cumbersome.

To aid developers in getting started easily with Hadoop, a virtual
machine image containing a preconfigured Hadoop installation is provided. The virtual machine image runs
inside a "sandbox" environment in which another operating system can run. The OS inside
the sandbox does not know that there is another operating environment outside of it; it acts as
though it is on its own computer. This sandbox environment is referred to as the "guest machine"
running a "guest operating system."
The actual physical machine running the VM software is referred to as the "host
machine" and it runs the "host operating system." The virtual machine provides host-machine
applications with the appearance that another physical computer is available on the
same network. Applications running on the host machine see the VM as a separate machine with
its own IP address, and can interact with the programs inside the VM in this fashion.
Application developers do not need to use the virtual machine to run Hadoop.
Developers on Linux typically use Hadoop in their native development environment, and Windows
users often install cygwin for Hadoop development. The virtual machine allows users a convenient
alternative development platform with a minimum of configuration required. Another advantage of
the virtual machine is its easy reset functionality. If your experiments break the Hadoop
configuration or render the operating system unusable, you can always simply copy the virtual
machine image from the CD back to where you installed it on your computer, and start from a
known-good state.

Configuring HDFS
The HDFS for your cluster can be configured in a very short amount of time.
First we will fill out the relevant sections of the Hadoop configuration file, then format the
NameNode.

Cluster configuration
The HDFS configuration is located in a set of XML files in the Hadoop configuration
directory: conf/ under the main Hadoop install directory (wherever you unzipped Hadoop). The
conf/hadoop-default.xml file contains default values for every parameter in Hadoop. This
file is considered read-only. You override this configuration by setting new values in
conf/hadoop-site.xml. This file should be replicated consistently across all machines in the
cluster. (It is also possible, though not advisable, to host it on NFS.)
The following settings are necessary to configure HDFS:
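
A minimal sketch of what these entries look like in conf/hadoop-site.xml is shown below; the hostname, port, and directory paths are placeholder values to be replaced for your own cluster.

<configuration>
  <property>
    <!-- URI of the NameNode for this cluster (placeholder host and port) -->
    <name>fs.default.name</name>
    <value>hdfs://your.namenode.hostname:9000</value>
  </property>
  <property>
    <!-- Local path where each DataNode stores its blocks (placeholder) -->
    <name>dfs.data.dir</name>
    <value>/home/username/hdfs/data</value>
  </property>
  <property>
    <!-- Local path where the NameNode stores its metadata (placeholder) -->
    <name>dfs.name.dir</name>
    <value>/home/username/hdfs/name</value>
  </property>
</configuration>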

These settings are described individually below:


fs.default.name - This is the URI (protocol specifier, hostname, and port) that describes the
NameNode for the cluster. Each node in the system on which Hadoop is expected to operate
needs to know the address of the NameNode. The DataNode instances will register with this
NameNode, and make their data available through it. Individual client programs will connect to this
address to retrieve the locations of actual file blocks.

dfs.data.dir - This is the path on the local file system in which the DataNode instance should store
its data. It is not necessary that all DataNode instances store their data under the same local path
prefix, as they will all be on separate machines; it is acceptable that these machines are
heterogeneous. However, it will simplify configuration if this directory is standardized throughout
the system. By default, Hadoop will place this under /tmp. This is fine for testing purposes, but is
an easy way to lose actual data in a production system, and thus must be overridden.

dfs.name.dir - This is the path on the local file system of the NameNode instance where the
NameNode metadata is stored. It is only used by the NameNode instance to find its information,
and does not exist on the DataNodes. The caveat above about /tmp applies to this as well; this
setting must be overridden in a production system.
Another configuration parameter, not listed above, is dfs.replication. This is
the default replication factor for each block of data in the file system. For a production cluster, this
should usually be left at its default value of 3. (You are free to increase the replication factor,
though this may be unnecessary and use more space than is required. Fewer than three replicas
can impact the availability of information, and possibly the reliability of its storage.)

Starting HDFS
Now we must format the file system that we just configured:

user@namenode:hadoop$ bin/hadoop namenode -format

This process should only be performed once. When it is complete, we are free to start the
distributed file system:

user@namenode:hadoop$ bin/start-dfs.sh

This command will start the NameNode server on the master machine (which is where the
start-dfs.sh script was invoked). It will also start the DataNode instances on each of the slave
machines. In a single-machine "cluster," this is the same machine as the NameNode instance. On
a real cluster of two or more machines, this script will ssh into each slave machine and start a
DataNode instance.

Interacting With HDFS

The VMware image will expose a single-node HDFS instance for your use in MapReduce
applications. If you are logged in to the virtual machine, you can interact with HDFS using the
command-line tools. You can also manipulate HDFS through the MapReduce plugin.

● Using the Command Line

The bulk of commands that communicate with the cluster are performed by a
monolithic script named bin/hadoop. This loads the Hadoop system inside a
Java virtual machine and executes a user command. The commands are specified
in the following form:

user@machine:hadoop$ bin/hadoop moduleName -cmd args...

The moduleName tells the program which subset of Hadoop functionality to use.
-cmd is the name of a specific command within this module to execute. Its
arguments follow the command name. Two such modules are relevant to HDFS: dfs
and dfsadmin.

● Using the MapReduce Plugin For Eclipse

An easier way to manipulate files in HDFS may be through the Eclipse plugin. In
the DFS location viewer, right-click on any folder to see a list of actions available.
You can create new subdirectories, upload individual files or whole subdirectories,
or download files and directories to the local disk.
FS Shell

HDFS allows user data to be organized in the form of files and directories. It provides a
commandline interface called FS shell that lets a user interact with the data in HDFS. The syntax
of this command set is similar to other shells (e.g. bash, csh) that users are already familiar with.
Here are some sample action/command pairs:
Action                                                  Command
Create a directory named /foodir                        bin/hadoop dfs -mkdir /foodir
Remove a directory named /foodir                        bin/hadoop dfs -rmr /foodir
View the contents of a file named /foodir/myfile.txt    bin/hadoop dfs -cat /foodir/myfile.txt
FS shell is targeted for applications that need a scripting language to interact with the stored data.

DFSAdmin

The DFSAdmin command set is used for administering an HDFS cluster. These are commands
that are used only by an HDFS administrator. Here are some sample action/command pairs:
Action                                      Command
Put the cluster in Safemode                 bin/hadoop dfsadmin -safemode enter
Generate a list of DataNodes                bin/hadoop dfsadmin -report
Recommission or decommission DataNode(s)    bin/hadoop dfsadmin -refreshNodes

Hadoop DFS Read/Write Example

A simple example of reading from and writing to the Hadoop DFS.

Reading from and writing to Hadoop DFS is no different from how it is done with other file
systems. The example HadoopDFSFileReadWrite.java reads a file from HDFS and writes it to
another file on HDFS (copy command).

The Hadoop FileSystem API describes the methods available to the user. Let us walk through the code to
understand how it is done.

Create a FileSystem instance by passing a new Configuration object. Please note that the
following example code assumes that the Configuration object will automatically load the
hadoop-default.xml and hadoop-site.xml configuration files. You may need to explicitly add these
resource paths if you are not running inside of the Hadoop runtime environment.

// Imports used throughout this walkthrough.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Given an input/output file name as a string, we construct inFile/outFile Path objects. Most of the
FileSystem APIs accept Path objects.

Path inFile = new Path(argv[0]);
Path outFile = new Path(argv[1]);
Validate the input/output paths before reading/writing.
if (!fs.exists(inFile))
printAndExit("Input file not found");
if (!fs.isFile(inFile))
printAndExit("Input should be a file");
if (fs.exists(outFile))
printAndExit("Output already exists");

Open inFile for reading.

FSDataInputStream in = fs.open(inFile);

Open outFile for writing.

FSDataOutputStream out = fs.create(outFile);

Read from input stream and write to output stream until EOF.

byte[] buffer = new byte[4096]; // transfer buffer
int bytesRead;
while ((bytesRead = in.read(buffer)) > 0) {
  out.write(buffer, 0, bytesRead);
}

Close the streams when done.

in.close();
out.close();

Browser Interface

A typical HDFS install configures a web server to expose the HDFS namespace through a
configurable TCP port. This allows a user to navigate the HDFS namespace and view the contents
of its files using a web browser.

• Job - In Hadoop, the combination of all of the JAR files and classes needed to run a map/reduce
program is called a job. All of these components are themselves collected into a JAR which is
usually referred to as a job file. To execute a job, you submit it to a JobTracker. On the command
line, this is done with the command:

hadoop jar your-job-file-goes-here.jar

This assumes that your job file has a main class that is defined as if it were executable from the
command line and that this main class defines a JobConf data structure that is used to carry all of
the configuration information about your program around. The wordcount example shows how a
typical map/reduce program is written. Be warned, however, that the wordcount program is not
usually run directly, but instead there is a single example driver program that provides a main
method that then calls the wordcount main method itself. This added complexity decreases the
number of JARs involved in the example structure, but doesn't really serve any other purpose; a
minimal driver sketch in this style appears after the Task definition below.

• Task - Whereas a job describes all of the inputs, outputs, classes and libraries used in a
map/reduce program, a task is the program that executes the individual map and reduce
steps. They are executed on TaskTracker nodes chosen by the JobTracker.
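
Below is a hedged sketch of such a driver, written against the JobConf/JobClient API in use at the time of this document. The class names WordCountMapper and WordCountReducer and the input/output paths are illustrative placeholders (see the Mapper/Reducer sketch in the MapReduce section).

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    // The JobConf carries all configuration information about the program.
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // Illustrative mapper/reducer classes; see the later sketch.
    conf.setMapperClass(WordCountMapper.class);
    conf.setReducerClass(WordCountReducer.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Submit the job to the JobTracker and wait for completion.
    JobClient.runJob(conf);
  }
}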
How well does Hadoop scale?
Hadoop has been demonstrated on clusters of up to 2000 nodes. Sort performance on 900 nodes
is good (sorting 9TB of data on 900 nodes takes around 1.8 hours) and improves further with these
non-default configuration values:
dfs.block.size = 134217728
dfs.namenode.handler.count = 40
mapred.reduce.parallel.copies = 20
mapred.child.java.opts = -Xmx512m
fs.inmemory.size.mb = 200
io.sort.factor = 100
io.sort.mb = 200
io.file.buffer.size = 131072

Sort performance on 1400 and 2000 nodes is also good: sorting 14TB of data on a
1400-node cluster takes 2.2 hours, and sorting 20TB on a 2000-node cluster takes 2.5 hours, with the
following updates to the above configuration:

mapred.job.tracker.handler.count = 60
mapred.reduce.parallel.copies = 50
tasktracker.http.threads = 50
mapred.child.java.opts = -Xmx1024m

JRE 1.6 or higher is highly recommended; for example, it handles large numbers of connections much
more efficiently.

MapReduce
Map/reduce is the style in which most programs running on Hadoop are written. In this style, input
is broken into small pieces which are processed independently (the map part). The results of these
independent processes are then collated into groups and processed as groups (the reduce part).
MapReduce is a programming model and an associated implementation for
processing and generating large data sets. Users specify a map function that processes a
key/value pair to generate a set of intermediate key/value pairs, and a reduce function that
merges all intermediate values associated with the same intermediate key. Many real world tasks
are expressible in this model, as shown in the paper.

The computation takes a set of input key/value pairs, and produces a set of output
key/value pairs. The user of the MapReduce library expresses the computation as two functions:
Map and Reduce. Map, written by the user, takes an input pair and produces a set of intermediate
key/value pairs. The MapReduce library groups together all intermediate values associated with
the same intermediate key I and passes them to the Reduce function. The Reduce function, also
written by the user, accepts an intermediate key I and a set of values for that key. It merges
together these values to form a possibly smaller set of values. Typically just zero or one output
value is produced per Reduce invocation. The intermediate values are supplied to the user's
reduce function via an iterator. This allows us to handle lists of values that are too large to fit in
memory.
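
As an illustration of the two user-written functions, here is a hedged sketch of the classic word count Mapper and Reducer, written against the org.apache.hadoop.mapred API current at the time of this document; the class names are illustrative and match the driver sketch shown earlier.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Map: emit (word, 1) for every word in the input line.
public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

// Reduce: sum the counts emitted for each word.
class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}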

Who are Using Hadoop?

● Yahoo - 10,000+ nodes, multi PB of storage
● Google
● Facebook - 320 nodes, 1.3PB - reporting, analytics
● MySpace - 80 nodes, 80TB - analytics
● Hadoop Korean user group - 50 nodes
● IBM - in conjunction with universities
● Joost - analytics
● Koubei.com - community and search, China - analytics
● Last.fm - 55 nodes, 85TB - chart calculation, analytics
● New York Times - image conversion
● Powerset - Search - 400 instances on EC2, S3
● Veoh - analytics, search, recommendations
● Rackspace - processing "several hundred gigabytes of email log data" every day
● Amazon/A9
● Intel Research

What is Hadoop used for?

● Search
Yahoo, Amazon, Zvents
● Log processing
Facebook, Yahoo, ContextWeb, Joost, Last.fm
● Recommendation Systems
Facebook
● Data Warehouse
Facebook, AOL
● Video and Image Analysis
New York Times, Eyealike
Problems with Hadoop
● Bandwidth to data
○ Need to process 100TB datasets
○ On a 1000-node cluster reading from remote storage (on LAN)
■ Scanning @ 10MB/s = 165 min
○ On a 1000-node cluster reading from local storage
■ Scanning @ 50-200MB/s = 33-8 min
(100TB across 1000 nodes is 100GB per node; at 10MB/s that is roughly 165 minutes, and at 50-200MB/s roughly 33 down to 8 minutes.)
○ Moving computation is more efficient than moving data
■ Need visibility into data placement
● Scaling reliably is hard
○ Need to store petabytes of data
■ On 1000s of nodes
■ MTBF < 1 day (Mean Time Between Failures)
■ With so many disks, nodes, and switches, something is always broken
○ Need a fault-tolerant store
■ Handle hardware faults transparently and efficiently
■ Provide reasonable availability guarantees

*Ubiquitous distributed file system
In Ubiquitous Computing (Ubicomp) or Pervasive Computing environments, people are surrounded
by a multitude of different computing devices. Typical representatives are PDAs, PCs and, more and
more, embedded sensor nodes. Platforms are able to communicate, preferably wirelessly, and
exchange information with each other. By collecting and interpreting information from sensors and
the network, such devices can improve the functionality of existing applications or even provide new
functionality to the user. For example, by interpreting incoming information as a hint to the current
situation, applications are able to adapt to the user's requirements and to support them in various tasks.
Prominent examples where such technology is developed and tested
are Aware-Home and Home Media Space. An important area within Ubicomp is the embedding of
sensor nodes in mundane everyday objects and environments. Such applications were explored
for instance in a restaurant context at PLAY Research.
At their core, all models of ubiquitous computing share a vision of
small, inexpensive, robust networked processing devices, distributed at all scales throughout
everyday life and generally turned to distinctly common-place ends. For example, a domestic
ubiquitous computing environment might interconnect lighting and environmental controls with
personal biometric monitors woven into clothing so that illumination and heating conditions in a
room might be modulated, continuously and imperceptibly. Another common scenario posits
refrigerators "aware" of their suitably-tagged contents, able to both plan a variety of menus from
the food actually on hand, and warn users of stale or spoiled food.
