IBM Hadoop

What is Apache Hadoop?

Apache Hadoop® is an open source software framework that provides highly
reliable distributed processing of large data sets using simple programming
models. Hadoop, known for its scalability, is built on clusters of commodity
computers, providing a cost-effective solution for storing and processing
massive amounts of structured, semi-structured and unstructured data with no
format requirements.

A data lake architecture including Hadoop can offer a flexible data management
solution for your big data analytics initiatives. Because Hadoop is an open
source software project and follows a distributed computing model, it can offer a
lower total cost of ownership for a big data software and storage solution.

Hadoop can also be installed on cloud servers to better manage the compute
and storage resources required for big data. Leading cloud vendors such as
Amazon Web Services (AWS) and Microsoft Azure offer managed Hadoop services. Cloudera
supports Hadoop workloads both on-premises and in the cloud, including
options for one or more public cloud environments from multiple vendors.

The Hadoop ecosystem

The Hadoop framework, built by the Apache Software Foundation, includes:

- Hadoop Common: The common utilities and libraries that support the
  other Hadoop modules. Also known as Hadoop Core.
- Hadoop HDFS (Hadoop Distributed File System): A distributed file
  system for storing application data on commodity hardware. It provides
  high-throughput access to data and high fault tolerance. The HDFS
  architecture features a NameNode to manage the file system namespace
  and file access and multiple DataNodes to manage data storage.
- Hadoop YARN: A framework for managing cluster resources and
  scheduling jobs. YARN stands for Yet Another Resource Negotiator. It
  supports workloads beyond MapReduce, such as interactive SQL, advanced
  modeling and real-time streaming.
- Hadoop MapReduce: A YARN-based system for parallel processing of
  large data sets.
- Hadoop Ozone: A scalable, redundant and distributed object store
  designed for big data applications.

IBM India Pvt Ltd | No.12, Subramanya Arcade, Bannerghatta Main Road,
Bengaluru India - 560 029
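
The division of labor in HDFS, where a NameNode tracks the namespace and DataNodes hold the data, can be illustrated with a toy model. The sketch below is not HDFS code: the class name, the tiny block size and the round-robin replica placement are all assumptions chosen for readability, whereas real HDFS uses far larger blocks and rack-aware placement.

```python
from itertools import cycle


class ToyNameNode:
    """Hypothetical stand-in for a NameNode: it keeps metadata only."""

    def __init__(self, datanodes, block_size=4, replication=2):
        if replication > len(datanodes):
            raise ValueError("replication cannot exceed the number of DataNodes")
        # Each toy DataNode is a dict of block_id -> bytes.
        self.datanodes = {name: {} for name in datanodes}
        self.block_size = block_size
        self.replication = replication
        # Namespace: path -> ordered list of (block_id, [replica locations]).
        self.namespace = {}
        self._next_node = cycle(datanodes)

    def write(self, path, data):
        """Split data into blocks and store each block on several DataNodes."""
        blocks = []
        for offset in range(0, len(data), self.block_size):
            block_id = f"{path}#{offset // self.block_size}"
            chunk = data[offset:offset + self.block_size]
            targets = [next(self._next_node) for _ in range(self.replication)]
            for node in targets:
                self.datanodes[node][block_id] = chunk
            blocks.append((block_id, targets))
        self.namespace[path] = blocks

    def read(self, path, failed=()):
        """Reassemble a file, skipping DataNodes listed as failed."""
        out = b""
        for block_id, nodes in self.namespace[path]:
            live = [n for n in nodes if n not in failed]
            if not live:
                raise OSError(f"all replicas lost for {block_id}")
            out += self.datanodes[live[0]][block_id]
        return out


nn = ToyNameNode(["dn1", "dn2", "dn3"])
nn.write("/logs/a.txt", b"hello hadoop")
recovered = nn.read("/logs/a.txt", failed={"dn1"})
```

Because every block is stored on two of the three toy DataNodes, the read still succeeds when one node is marked as failed, which is the basic idea behind the fault tolerance mentioned in the list.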

Enhance Hadoop with additional software projects

What is HBase?

HBase is a column-oriented non-relational database management system that runs
on top of Hadoop Distributed File System (HDFS). HBase provides a fault-tolerant
way of storing sparse data sets, which are common in many big data use cases. It is
well suited for real-time data processing or random read/write access to large
volumes of data.

Unlike relational database systems, HBase does not support a structured query
language like SQL; in fact, HBase isn’t a relational data store at all. HBase
applications are written in Java™, much like a typical Hadoop MapReduce
application. HBase also supports writing applications through Apache Avro, REST
and Thrift interfaces.

An HBase system is designed to scale linearly. It comprises a set of standard tables
with rows and columns, much like a traditional database. Each table must have an
element defined as a primary key (the row key, in HBase terms), and all access
attempts to HBase tables must use this primary key.
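
The access pattern described here, sparse rows reached only through their key, can be sketched with a small in-memory model. This is not the HBase client API: the class and method names are hypothetical, and the "family:qualifier" column naming is used only as an illustrative convention.

```python
class ToyTable:
    """Hypothetical in-memory model of a sparse, key-addressed table."""

    def __init__(self):
        # row key -> {column name -> value}; absent cells take no space.
        self.rows = {}

    def put(self, row_key, column, value):
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column=None):
        """Fetch a whole row, or one cell; every lookup goes through the key."""
        row = self.rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)

    def scan(self, start, stop):
        """Yield rows in key order, start inclusive and stop exclusive."""
        for key in sorted(self.rows):
            if start <= key < stop:
                yield key, self.rows[key]


table = ToyTable()
table.put("user#001", "info:name", "Asha")
table.put("user#001", "info:city", "Bengaluru")
table.put("user#002", "info:name", "Ravi")  # no city stored for this row
scanned_keys = [key for key, _ in table.scan("user#001", "user#002")]
```

The second row simply has no city cell, which is what makes the layout suitable for sparse data sets.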

Avro, as a component, supports a rich set of primitive data types, including numeric
types, binary data and strings, and a number of complex types, including arrays,
maps, enumerations and records. A sort order can also be defined for the data.
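
Avro schemas are written in JSON, so the types listed above can be shown as a plain Python dictionary serialized with the standard library. The record, field and enum names below are hypothetical; only the type keywords and the per-field "order" attribute come from the Avro schema format.

```python
import json

# Hypothetical record combining primitive and complex Avro types.
user_schema = {
    "type": "record",
    "name": "User",
    "fields": [
        # Primitive types: numeric, string and binary data.
        {"name": "id", "type": "long", "order": "ascending"},  # sort order
        {"name": "name", "type": "string"},
        {"name": "avatar", "type": "bytes"},
        # Complex types: array, map and enumeration.
        {"name": "tags", "type": {"type": "array", "items": "string"}},
        {"name": "scores", "type": {"type": "map", "values": "double"}},
        {
            "name": "status",
            "type": {
                "type": "enum",
                "name": "Status",
                "symbols": ["ACTIVE", "INACTIVE"],
            },
        },
    ],
}

# The JSON text is what would be stored in a schema file.
schema_json = json.dumps(user_schema, indent=2)
```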

HBase relies on ZooKeeper for high-performance coordination. ZooKeeper is built
into HBase, but if you’re running a production cluster, it’s suggested that you have a
dedicated ZooKeeper cluster that’s integrated with your HBase cluster.

HBase works well with Hive, a query engine for batch processing of big data, to
enable fault-tolerant big data applications.

What is Apache Hive?


Apache Hive is open source data warehouse software for reading, writing and
managing large data sets that are stored directly in either the Apache Hadoop
Distributed File System (HDFS) or other data storage systems such as Apache
HBase. Hive enables SQL developers to write Hive Query Language (HQL)
statements that are similar to standard SQL statements for data query and analysis.

Hive is designed to make MapReduce programming easier because you don’t have to
know and write lengthy Java code. Instead, you can write queries more simply in
HQL, and Hive can then create the map and reduce functions.

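
To make the idea concrete, the sketch below pairs a one-line query with the kind of map and reduce functions a developer would otherwise write by hand. It is a conceptual illustration only, not Hive's actual query planner: the query text, the `employees` table and the small `run` driver are all hypothetical.

```python
from collections import defaultdict

# Hypothetical HQL statement that the functions below imitate.
HQL = "SELECT dept, COUNT(*) FROM employees GROUP BY dept"


def map_fn(row):
    """Map step: emit (grouping key, 1) for each input row."""
    yield row["dept"], 1


def reduce_fn(key, values):
    """Reduce step: collapse all values for one key into a count."""
    return key, sum(values)


def run(rows):
    """Tiny driver standing in for the shuffle between map and reduce."""
    groups = defaultdict(list)
    for row in rows:
        for key, value in map_fn(row):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


employees = [
    {"name": "a", "dept": "eng"},
    {"name": "b", "dept": "ops"},
    {"name": "c", "dept": "eng"},
]
counts = run(employees)
```
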
Included with the installation of Hive is the Hive metastore, which enables you to
apply a table structure onto large amounts of unstructured data. Once you create a
Hive table, defining the columns, rows, data types, etc., all of this information is
stored in the metastore and becomes part of the Hive architecture. Other tools such
as Apache Spark and Apache Pig can then access the data in the metastore.

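
Applying a table structure to raw data at read time can be sketched as follows. The table definition is a hypothetical stand-in for what a metastore might record (name, delimiter, column names and types); it does not reflect the metastore's real storage format or API.

```python
# Hypothetical table definition, loosely modeled on metastore metadata.
TABLE_DEF = {
    "name": "page_views",
    "delimiter": "\t",
    "columns": [("user_id", int), ("url", str), ("seconds", float)],
}


def read_rows(raw_lines, table):
    """Interpret raw delimited text using the table's column definitions."""
    for line in raw_lines:
        parts = line.rstrip("\n").split(table["delimiter"])
        yield {
            name: cast(value)
            for (name, cast), value in zip(table["columns"], parts)
        }


raw = ["7\t/home\t1.5\n", "9\t/cart\t12\n"]
rows = list(read_rows(raw, TABLE_DEF))
```

The raw lines stay untouched; only the definition decides how they are interpreted, which is why several tools can share one set of table metadata.
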
As with any database management system (DBMS), you can run your Hive queries
from a command-line interface (known as the Hive shell), or from a Java™ Database
Connectivity (JDBC) or Open Database Connectivity (ODBC) application,
using the Hive JDBC/ODBC drivers. You can run a Hive Thrift Client within
applications written in C++, Java, PHP, Python or Ruby, similar to using these
client-side languages with embedded SQL to access a database such as IBM Db2®
or IBM Informix®.

Hive looks like traditional database code with SQL access. However, because Hive is
built on Apache Hadoop, it differs from a traditional database in key ways. First,
Hadoop is intended for long sequential scans and, because Hive is based on Hadoop,
queries have very high latency (often many minutes). This means Hive is less
appropriate for applications that need very fast response times. Second, Hive is
read-based and therefore not appropriate for transaction processing that typically
involves a high percentage of write operations. It is better suited for data
warehousing tasks such as extract/transform/load (ETL), reporting and data analysis
and includes tools that enable easy access to data via SQL.

If you're interested in SQL on Hadoop, in addition to Hive, IBM offers IBM Db2 Big
SQL, which makes accessing Hive data sets faster and more secure.

Oozie

Apache Oozie is a Java-based workflow scheduler for managing Hadoop jobs.

Hadoop for developers

Apache Hadoop was written in Java, but depending on the big data project,
developers can program in their choice of language, such as Python, R or Scala.
The included Hadoop Streaming utility allows developers to create and execute
MapReduce jobs with any script or executable as the mapper or the reducer.
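
With Hadoop Streaming, the mapper and reducer exchange plain text: each reads lines from standard input and writes tab-separated key/value lines to standard output, and the framework sorts the mapper output by key before the reducer sees it. The word-count sketch below follows that convention; the function names and the `map`/`reduce` command-line argument are assumptions made for this example.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit 'word<TAB>1' for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum counts per word; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Run as a script with a 'map' or 'reduce' argument (an assumed convention).
if __name__ == "__main__" and len(sys.argv) > 1:
    stage = mapper if sys.argv[1] == "map" else reducer
    for out_line in stage(sys.stdin):
        print(out_line)
```

Locally, sorting the mapper output and feeding it to the reducer imitates what the framework does between the two stages. On a cluster, the same script would be supplied to the streaming utility through its `-mapper` and `-reducer` options.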

Spark versus Hadoop

Apache Spark is often compared to Hadoop as it is also an open source framework for
big data processing. In fact, Spark was initially built to improve the processing
performance and extend the types of computations possible with Hadoop MapReduce.
Spark uses in-memory processing, which can make it much faster than MapReduce,
which reads from and writes to disk between processing stages.

While Hadoop is best for batch processing of huge volumes of data, Spark supports
both batch and real-time data processing and is ideal for streaming data and graph
computations. Both Hadoop and Spark have machine learning libraries, but again,
because of the in-memory processing, Spark’s machine learning is much faster.

Better data-driven decisions

Integrate real-time data (streaming audio, video, social media sentiment and
clickstream data) and other semi-structured and unstructured data not used in a
data warehouse or relational database. More comprehensive data supports more
accurate decisions.

IBM and Hadoop, better together

Support predictive and prescriptive analytics for today’s AI. Combine Apache’s
enterprise-grade Hadoop distribution with a single ecosystem of integrated products
and services from both IBM and Hadoop to improve data discovery, testing, ad hoc
and near real-time queries. Take advantage of the collaboration between IBM and
Apache to deliver enterprise Hadoop solutions.
