BDALAB Experiment08

Name: Amogh Prabhu        Subject: BDA Lab

Roll No: 201080044        Professor: Prof. Vaibhav Dhore

Experiment 08:

Aim: Set up and install HBase and Oozie, and execute basic commands.

Theory:

HBase is a distributed, column-oriented data model similar to Google's Bigtable. It is an open-source database developed by the Apache Software Foundation, written in Java, and is an essential part of the Hadoop ecosystem. HBase runs on top of HDFS (Hadoop Distributed File System), can store massive amounts of data, from terabytes to petabytes, and scales horizontally.

Applications of Apache HBase:


1. Real-time analytics: HBase is an excellent choice for real-time analytics
applications that require low-latency data access. It provides fast read and
write performance and can handle large amounts of data, making it suitable
for real-time data analysis.
2. Social media applications: HBase is an ideal database for social
media applications that require high scalability and performance. It can
handle the large volume of data generated by social media platforms and
provide real-time analytics capabilities.
3. IoT applications: HBase can be used for Internet of Things (IoT)
applications that require storing and processing large volumes of sensor
data. HBase’s scalable architecture and fast write performance make it
a suitable choice for IoT applications that require low-latency data
processing.
4. Online transaction processing: HBase can be used as an online
transaction processing (OLTP) database, providing high availability,
consistency, and low-latency data access. HBase’s distributed architecture
and automatic failover capabilities make it a good fit for OLTP
applications that require high availability.
5. Ad serving and clickstream analysis: HBase can be used to store and
process large volumes of clickstream data for ad serving and clickstream
analysis. HBase’s column-oriented data storage and indexing capabilities
make it a good fit for these types of applications.

Features of HBase:
1. It is linearly and modularly scalable, since tables are divided across various nodes.
2. HBase provides consistent reads and writes.
3. It provides atomic reads and writes at the row level: while one process is reading or writing a row, other processes are prevented from modifying that row.
4. It provides an easy-to-use Java API for client access.
5. It supports Thrift and REST APIs for non-Java front ends, with XML, Protobuf, and binary data encoding options.
6. It supports a Block Cache and Bloom Filters for real-time queries and for
high volume query optimization.
7. HBase provides automatic failover support between Region Servers.
8. It supports exporting metrics with the Hadoop metrics subsystem to files.
9. It doesn’t enforce relationships within your data.
10. It is a platform for storing and retrieving data with random access.
Architecture of HBase:

The three components, HMaster, Region Server, and ZooKeeper, are described below:

HMaster:
HMaster is the implementation of the Master Server in HBase. It is the process that assigns regions to Region Servers and handles DDL operations such as creating and deleting tables. It monitors all Region Server instances present in the cluster. In a distributed environment, the Master runs several background threads. HMaster also handles tasks such as load balancing and failover.

Region Server:
HBase tables are divided horizontally by row-key range into Regions. Regions are the basic building blocks of an HBase cluster: they hold the distributed portions of tables and are composed of column families. A Region Server runs on an HDFS DataNode in the Hadoop cluster and is responsible for handling, managing, and executing read and write operations on its set of regions. The default maximum region size is 10 GB in recent HBase releases (older versions used 256 MB).

Zookeeper:
ZooKeeper acts as a coordinator in HBase. It provides services such as maintaining configuration information, naming, distributed synchronization, and server-failure notification. Clients consult ZooKeeper to locate region servers and then communicate with those region servers directly.

Advantages of HBase:
1. Can store very large data sets
2. Data is automatically sharded into regions across the cluster
3. Cost-effective from gigabytes to petabytes
4. High availability through failover and replication

Disadvantages of HBase:
1. No native SQL support
2. No multi-row transaction support (atomicity is per row)
3. Data is sorted and indexed only on the row key
4. Can be memory-intensive across the cluster
Installation:

1. Choose an HBase release from https://www.apache.org/dyn/closer.lua/hbase/. Version 2.5.8 from the stable distribution is used here.

2. Extract the downloaded file


$ tar xzvf hbase-2.5.8-bin.tar.gz
3. Rename the extracted directory to Hbase and move it to the /usr/local directory

4. Set the JAVA_HOME environment variable in conf/hbase-env.sh before starting HBase


# Set environment variables here.
# The java implementation to use.
export JAVA_HOME=/usr/jdk64/jdk1.8.0_112
5. Add the following lines to the hbase-site.xml file.
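The configuration lines appeared as a screenshot in the original and are not reproduced here. For a setup that stores data on HDFS, the properties usually look like the following (the HDFS URL and the ZooKeeper data directory are assumptions; adjust them to your environment):

```xml
<configuration>
  <!-- Where HBase stores its data; assumes HDFS is listening on localhost:9000 -->
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
  </property>
  <!-- Local directory for the built-in ZooKeeper's data (illustrative path) -->
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/usr/local/Hbase/zookeeper</value>
  </property>
</configuration>
```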

6. Edit the .bashrc for hadoopuser
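The .bashrc additions were also shown as a screenshot; they typically export HBASE_HOME and put the HBase binaries on the PATH. A sketch, assuming HBase was moved to /usr/local/Hbase in step 3:

```shell
# Illustrative .bashrc additions for hadoopuser; the install path is an assumption.
export HBASE_HOME=/usr/local/Hbase
export PATH=$PATH:$HBASE_HOME/bin
echo "$HBASE_HOME"
```

After editing, run `source ~/.bashrc` so the changes take effect in the current session.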


7. First start HDFS using the start-all.sh command and then run the start-hbase.sh command to start the HMaster.

We have used the standalone HBase setup, so a ZooKeeper instance is created automatically.

8. Running HBase commands on shell
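The commands run in the original screenshots are not reproduced. A typical basic session looks like the following (the table name 'student' and column family 'info' are illustrative); these are HBase shell (JRuby) commands, entered after launching the shell with `hbase shell`:

```
create 'student', 'info'                   # create a table with one column family
put 'student', 'r1', 'info:name', 'Amogh'  # insert a cell into row r1
get 'student', 'r1'                        # read one row
scan 'student'                             # read all rows
list                                       # list tables
disable 'student'                          # required before dropping a table
drop 'student'                             # delete the table
```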


Apache Oozie is a tool with which all sorts of programs can be pipelined in a desired order to run in Hadoop's distributed environment. Oozie also provides a mechanism to run a job on a given schedule.

One of the main advantages of Oozie is that it is tightly integrated with the Hadoop stack, supporting various Hadoop jobs such as Hive, Pig, and Sqoop, as well as system-specific jobs such as Java and shell actions.

Oozie is an open-source Java web application available under the Apache License 2.0. It is responsible for triggering workflow actions, which in turn use the Hadoop execution engine to actually execute the tasks. Hence, Oozie is able to leverage the existing Hadoop machinery for load balancing, failover, and so on.

Oozie detects completion of tasks through callback and polling. When Oozie starts a task, it provides a unique callback HTTP URL to the task, which notifies that URL when it completes. If the task fails to invoke the callback URL, Oozie can poll the task for completion.
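The callback-then-poll detection described above can be sketched as a small shell simulation. Everything here is hypothetical, not an Oozie API: a "callback" is recorded in a variable, and "polling" reads the task's state directly.

```shell
# Hypothetical simulation of Oozie-style completion detection.
callbacks=""          # task ids whose callback URL was invoked

run_task() {          # $1 = task id, $2 = "yes" if the callback fires
  eval "state_$1=DONE"                            # the task finishes either way
  if [ "$2" = "yes" ]; then callbacks="$callbacks $1"; fi
}

is_complete() {       # $1 = task id
  case " $callbacks " in
    *" $1 "*) echo "via-callback"; return ;;      # fast path: callback was seen
  esac
  eval "s=\$state_$1"                             # fallback: poll the task state
  [ "$s" = "DONE" ] && echo "via-polling"
}

run_task t1 yes       # well-behaved task: invokes its callback URL
run_task t2 no        # callback lost: Oozie must fall back to polling
is_complete t1        # prints: via-callback
is_complete t2        # prints: via-polling
```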

The following three types of jobs are common in Oozie:

1. Oozie Workflow Jobs: represented as Directed Acyclic Graphs (DAGs) that specify a sequence of actions to be executed.
2. Oozie Coordinator Jobs: workflow jobs triggered by time and data availability.
3. Oozie Bundle: a package of multiple coordinator and workflow jobs.
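A minimal workflow definition, the DAG referred to in job type 1, has the following shape (the workflow name, action name, and arguments are illustrative; ${jobTracker} and ${nameNode} come from the job's properties file):

```xml
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
  <start to="demo-action"/>
  <action name="demo-action">
    <shell xmlns="uri:oozie:shell-action:0.3">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>echo</exec>
      <argument>hello</argument>
    </shell>
    <ok to="end"/>       <!-- on success, transition to the end node -->
    <error to="fail"/>   <!-- on failure, transition to the kill node -->
  </action>
  <kill name="fail">
    <message>Shell action failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```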
Architecture:

Let us see the components of Apache Oozie architecture.

1. Oozie Client: clients interact with the Oozie server through the Oozie command-line tool, the Oozie Java client API, or the Oozie HTTP REST API. The command-line tool and the Java API themselves use the HTTP REST API to communicate with the Oozie server.
2. Oozie Server: the Apache Oozie server is a Java web application that runs in a Java servlet container. The Oozie server does not store any user or job information in memory; it maintains all of this information, such as running and completed jobs, in a SQL database. When a user requests that a job be processed, the Oozie server fetches the corresponding job state from the SQL database, performs the requested operation, and updates the SQL database with the new state of the job.
3. Oozie Database: the Apache Oozie database stores all of the stateful information, such as workflow definitions and running and completed jobs, which the server reads and updates as described above. Oozie provides support for databases such as Derby, MySQL, Oracle, and PostgreSQL.
Installation:

1. Download Apache Oozie 5.2.1 from the link below.

$ wget https://downloads.apache.org/oozie/5.2.1/oozie-5.2.1.tar.gz

2. Extract the downloaded Apache Oozie tar using the command below.
3. Install the Maven tool to compile the Apache Oozie source, using the command below.
4. Now compile Apache Oozie to create binary files for the distro. Go to the "bin" directory of the Oozie home and run the command below.
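The commands for steps 2 to 4 were shown as screenshots in the original. On a Debian/Ubuntu system they would typically be along these lines (the apt package name and the -DskipTests flag are the usual choices, but verify against your platform):

```
$ tar -xzvf oozie-5.2.1.tar.gz     # step 2: extract the source
$ sudo apt-get install maven       # step 3: install Maven
$ cd oozie-5.2.1/bin               # step 4: go to the bin directory
$ ./mkdistro.sh -DskipTests        # build the Oozie distribution
```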

5. After this, create a "libext" directory under the Oozie directory, go into it, and download the "ext-2.2.zip" file.

$ wget http://archive.cloudera.com/gplextras/misc/ext-2.2.zip

6. After this, copy all Hadoop libraries into the "libext" folder.
7. The Oozie binary file will now be in the target directory.

8. Set the OOZIE_HOME and JAVA_HOME paths in the ".bashrc" file.
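A sketch of the .bashrc entries for step 8 (the Oozie install path is an assumption; point OOZIE_HOME at the binary distribution produced in step 7, and JAVA_HOME at the same JDK used for HBase above):

```shell
# Illustrative paths; adjust to where the built distro was placed.
export JAVA_HOME=/usr/jdk64/jdk1.8.0_112
export OOZIE_HOME=/usr/local/oozie
export PATH=$PATH:$OOZIE_HOME/bin
echo "$OOZIE_HOME"
```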


9. Go to the Oozie home directory and run the command below to set up Oozie.

10. Now start the Oozie server using the command below.
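The setup and start commands for steps 9 and 10 were shown as screenshots; they are typically the following (the HDFS URL for the sharelib is an assumption for a local single-node cluster):

```
$ bin/oozie-setup.sh sharelib create -fs hdfs://localhost:9000   # step 9: set up the sharelib
$ bin/oozied.sh start                                            # step 10: start the Oozie server
$ bin/oozie admin -oozie http://localhost:11000/oozie -status    # verify the server status
```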

Conclusion:
Through this experiment, I learned how to set up and install HBase and Oozie, as well as
how to execute basic commands using these tools. I performed tasks such as configuring
the installations and running simple commands to interact with the databases. These tools
are used for managing large volumes of data efficiently, with HBase serving as a NoSQL
database for real-time read/write access, and Oozie acting as a workflow scheduler for
Hadoop jobs, facilitating automation and coordination of data processing tasks.
