Real-Time Processing of Events (Sensor, Telecommunications, Fraud Etc.) Even
Interactive query capabilities for interrogating new data for data analysts (SQL)
and data scientists (SQL plus scripting etc.)
The community has worked together to make HDFS itself a much more scalable,
efficient and enterprise-friendly storage platform by adding key functionality: High
Availability for the HDFS NameNode, Federation for scaling, and HDFS Snapshots, to name a
few.
With YARN, Apache Hadoop now clearly delineates the system (resource management,
security, SLAs etc.) from the application framework (e.g. MapReduce) and allows for
multiple ways to interact with the data in HDFS (batch with MapReduce, streaming with
Apache Storm, interactive SQL with Apache Hive and Apache Tez).
We are already seeing the benefits of this vision in the form of many and varied
applications and services being re-vectored on top of YARN such as Apache Storm for
event processing, Apache Giraph for graph processing, Apache Tez for interactive SQL
queries, HOYA for running services such as Apache HBase and Apache Accumulo on
YARN and so on. Exciting times indeed!
As a result, the Hadoop stack looks very different with Hadoop v2.
Personally, it's a huge thrill to see this baby grow up and reach adulthood since
the original Jira ticket (MAPREDUCE-279) was opened more than 5 years ago!
Apache Hadoop v2
As a lot of people are aware, Apache Hadoop 2 landed the Beta tag a few months ago.
Since then the community has spent a lot of time validating the APIs, protocols and the
system itself. As a result we are now very confident not only in our ability to handle the
workloads that will be thrown at Apache Hadoop, but also in our ability to do so in a
forward-compatible manner, such that Apache Hadoop v2 represents a stable base atop
which the ecosystem can flourish in the future.
For those who, like me, are more comfortable with simplified lists (*smile*), here are the
enhancements and major features:
YARN
HDFS Federation
HDFS Snapshots
Performance
Integration testing for the entire Apache Hadoop ecosystem at the ASF.
Onwards
Although it's a major milestone and a big reason to celebrate, the Apache Hadoop
community will continue to drive it forward under the aegis of the ASF. There are
ever more things to do, use cases to fulfill and users to thrill. The HDFS community is
striving hard to finish adding symlinks to HDFS, which just didn't make the cut
at the last minute. On the YARN side we plan to add more enhancements such as
advanced scheduling features, high availability for the YARN ResourceManager, enhanced
support for long-running services, and generally to make it easier to run other applications
such as Apache Storm within YARN. Stay tuned!
The new architecture has its advantages. First, by breaking up the JobTracker into a few different
services, it avoids many of the scaling issues faced by MapReduce in Hadoop 1. More importantly, it
makes it possible to run frameworks other than MapReduce on a Hadoop cluster. For example, Impala
can also run on YARN and share resources on a cluster with MapReduce.
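To make the resource-sharing idea concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions: `ToyResourceManager`, its methods and the memory figures are hypothetical names invented for illustration, not the YARN API. It shows only the core idea that one cluster-wide service hands out capacity to any framework that asks, rather than being tied to MapReduce.

```python
from dataclasses import dataclass, field


@dataclass
class ToyResourceManager:
    """Hypothetical stand-in for a cluster-wide resource manager (NOT the YARN API)."""

    total_memory_mb: int
    # Maps framework name -> memory currently granted to it, in MB.
    allocations: dict = field(default_factory=dict)

    @property
    def free_memory_mb(self) -> int:
        return self.total_memory_mb - sum(self.allocations.values())

    def request(self, framework: str, memory_mb: int) -> bool:
        """Grant the request only if enough memory is still free."""
        if memory_mb <= 0 or memory_mb > self.free_memory_mb:
            return False
        self.allocations[framework] = self.allocations.get(framework, 0) + memory_mb
        return True

    def release(self, framework: str) -> None:
        """Return everything a framework holds to the shared pool."""
        self.allocations.pop(framework, None)


# Illustrative cluster size and request sizes (assumed values).
rm = ToyResourceManager(total_memory_mb=8192)
print(rm.request("mapreduce", 4096))  # True
print(rm.request("impala", 3072))     # True
print(rm.request("storm", 2048))      # False: only 1024 MB is free
rm.release("mapreduce")
print(rm.request("storm", 2048))      # True
print(rm.free_memory_mb)              # 3072
```

The manager knows nothing about what each framework does with its share. That separation is what lets MapReduce and other frameworks coexist on one cluster in this simplified picture.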
http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html