
Apache Hadoop v2 is not just a new major release number; it represents a generational
shift in the architecture of Apache Hadoop. With YARN, Apache Hadoop is recast as a
significantly more powerful platform, one that takes Hadoop beyond merely batch
applications and establishes it as a data operating system.
To recap, Apache Hadoop v1 comprised HDFS and MapReduce.
With HDFS one could store data of all kinds; however, MapReduce was the only
algorithm you could use to process that data in parallel. That was very limiting since
MapReduce, although very general, proved inadequate to satisfy all the demands being
placed on Apache Hadoop.
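
To make that constraint concrete, here is a minimal sketch of the shape every MapReduce computation has to take: a map step, a grouping step, and a reduce step. It is plain Python, not the Hadoop API, and the names map_phase, shuffle, reduce_phase and run_word_count are made up for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, the step that sits between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def run_word_count(lines):
    """Run the three steps in order over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(run_word_count(["store all data", "process all data in parallel"]))
```

Anything that does not fit this map, group, reduce shape had to be bent into it on Hadoop v1, which is the limitation described above.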
As Apache Hadoop crystallizes into a key component of a Modern Data Architecture,
users and customers want to store all data in HDFS and interact with that data in
multiple ways:

Real-time processing of events (sensor, telecommunications, fraud etc.) even
before it lands on HDFS

Interactive query capabilities for interrogating new data for data analysts (SQL)
and data scientists (SQL plus scripting etc.)

The need to productionize the insight i.e. batch-processing, reporting etc. in a
well-defined and timely manner

The community has worked together to make HDFS itself a much more scalable,
efficient and enterprise-friendly storage platform by addressing key functionality:
High Availability for the HDFS NameNode, Federation for scaling, and HDFS Snapshots,
to list a few.
With YARN, Apache Hadoop now clearly delineates the system (resource management,
security, SLAs etc.) from the application framework (e.g. MapReduce) and allows for
multiple ways to interact with the data in HDFS (batch with MapReduce, streaming with
Apache Storm, interactive SQL with Apache Hive and Apache Tez).

We are already seeing the benefits of this vision in the form of many and varied
applications and services being re-vectored on top of YARN such as Apache Storm for
event processing, Apache Giraph for graph processing, Apache Tez for interactive SQL
queries, HOYA for running services such as Apache HBase and Apache Accumulo on
YARN and so on. Exciting times indeed!
As a result, the Hadoop stack looks very different with Hadoop v2.

Personally, it's a huge thrill to see this baby grow up and reach adulthood since
the original Jira ticket (MAPREDUCE-279) was opened more than 5 years ago!

Apache Hadoop v2
As a lot of people are aware, Apache Hadoop 2 landed the Beta tag a few months ago.
Since then the community has spent a lot of time validating the APIs, protocols and the
system itself. As a result we are now very confident in our ability to not only handle the
workloads that will be thrown at Apache Hadoop, but also in our ability to do so in a
forward compatible manner such that Apache Hadoop v2 represents a stable base atop
which the ecosystem can flourish in the future.
For those who, like me, are more comfortable with simplified lists (*smile*), here are the
enhancements and major features:

YARN

High Availability for HDFS

HDFS Federation

HDFS Snapshots

NFSv3 access to data in HDFS

Binary Compatibility for MapReduce applications between Hadoop v1 and
Hadoop v2 to ease migration

Performance

Support for running Hadoop on Microsoft Windows

Integration testing for the entire Apache Hadoop ecosystem at the ASF.

Onwards
Although it's a major milestone and a big reason to celebrate, the Apache Hadoop
community will continue to drive it forward under the aegis of the ASF. There are
ever more things to do, use cases to fulfill and users to thrill. The HDFS community is
striving hard to finish up the addition of symlinks to HDFS, which just didn't make the cut
at the last minute. On the YARN side we plan to add more enhancements such as
advanced scheduling features, high availability for the YARN ResourceManager, enhanced
support for long-running services, and generally making it easier to run other applications
such as Apache Storm within YARN. Stay tuned!

Terminology and Architecture


MapReduce from Hadoop 1 (MapReduce 1) has been split into two components. The cluster resource
management capabilities have become YARN (Yet Another Resource Negotiator), while the MapReduce-specific capabilities remain MapReduce. In the MapReduce 1 architecture, the cluster was managed by a
service called the JobTracker. TaskTracker services lived on each node and would launch tasks on behalf
of jobs. The JobTracker would serve information about completed jobs. In MapReduce 2, the functions of
the JobTracker have been split between three services. The ResourceManager is a persistent YARN
service that receives and runs applications (a MapReduce job is an application) on the cluster. It contains
the scheduler, which, as previously, is pluggable. The MapReduce-specific capabilities of the JobTracker
have been moved into the MapReduce Application Master, one of which is started to manage each
MapReduce job and terminated when the job completes. The JobTracker's function of serving information
about completed jobs has been moved to the JobHistoryServer. The TaskTracker has been replaced with
the NodeManager, a YARN service that manages resources and deployment on a node. It is responsible
for launching containers, each of which can house a map or reduce task.
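
The division of labor described above can be sketched as a toy model. This is an illustration only, written in plain Python rather than against any Hadoop API: the class names mirror the services named in the text, but every method name, the round-robin placement, and the instant task completion are assumptions made to keep the example short. In particular, it skips the scheduler and resource negotiation entirely.

```python
class JobHistoryServer:
    """Serves information about completed jobs."""

    def __init__(self):
        self.completed = []

    def record(self, job_name, task_count):
        self.completed.append({"job": job_name, "tasks": task_count})


class NodeManager:
    """Per-node service that launches containers (replaces the TaskTracker)."""

    def __init__(self, name):
        self.name = name
        self.launched = 0  # containers launched on this node so far

    def launch_container(self, task):
        # Each container houses one map or reduce task.
        self.launched += 1
        return (task, self.name)


class ApplicationMaster:
    """Holds the job-specific logic; one instance exists per job."""

    def __init__(self, job_name, tasks):
        self.job_name = job_name
        self.tasks = tasks

    def run(self, node_managers, history):
        # Assumption: naive round-robin placement, tasks finish instantly.
        placements = [
            node_managers[i % len(node_managers)].launch_container(task)
            for i, task in enumerate(self.tasks)
        ]
        history.record(self.job_name, len(placements))
        return placements


class ResourceManager:
    """Persistent service that receives applications and runs them."""

    def __init__(self, node_managers, history):
        self.node_managers = node_managers
        self.history = history
        self.active_masters = {}

    def submit(self, job_name, tasks):
        master = ApplicationMaster(job_name, tasks)  # started for this job
        self.active_masters[job_name] = master
        placements = master.run(self.node_managers, self.history)
        del self.active_masters[job_name]  # terminated when the job completes
        return placements


if __name__ == "__main__":
    nodes = [NodeManager("node-1"), NodeManager("node-2")]
    history = JobHistoryServer()
    rm = ResourceManager(nodes, history)
    print(rm.submit("wordcount", ["map-0", "map-1", "reduce-0"]))
    print(history.completed)
```

The point of the sketch is the lifetimes: the ResourceManager, NodeManagers and JobHistoryServer persist across jobs, while an ApplicationMaster exists only for the duration of the job it manages.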

The new architecture has its advantages. First, by breaking up the JobTracker into a few different
services, it avoids many of the scaling issues faced by MapReduce in Hadoop 1. More importantly, it
makes it possible to run frameworks other than MapReduce on a Hadoop cluster. For example, Impala
can also run on YARN and share resources on a cluster with MapReduce.
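
As a rough illustration of what sharing a cluster between frameworks means, the sketch below splits a fixed pool of containers between competing requests, one container at a time in round-robin order. This is not one of YARN's actual schedulers; the function name share_containers and the framework names and numbers in the example are hypothetical.

```python
def share_containers(total, demands):
    """Split `total` containers across frameworks, round-robin.

    `demands` maps a framework name to the number of containers it wants.
    No framework receives more than it asked for.
    """
    granted = {name: 0 for name in demands}
    remaining = total
    while remaining > 0:
        progressed = False
        for name, wanted in demands.items():
            if remaining > 0 and granted[name] < wanted:
                granted[name] += 1
                remaining -= 1
                progressed = True
        if not progressed:
            break  # every demand is satisfied; leftover containers stay idle
    return granted


if __name__ == "__main__":
    print(share_containers(10, {"mapreduce": 8, "impala": 3}))
```

With a single-framework cluster there is nothing to arbitrate; once several frameworks run side by side, some policy like this has to decide who gets the next container, which is the role the pluggable scheduler plays.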

http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
