BDALAB Experiment08
Experiment 08:
Aim: Set up and install HBase and Oozie. Execute basic commands.
Theory:
Features of HBase:
1. It is linearly and modularly scalable, because data is divided across the
nodes of the cluster.
2. HBase provides consistent reads and writes.
3. It provides atomic reads and writes: a read or write on a row either
completes entirely or not at all, so other processes never see a partially
applied operation on that row.
4. It provides an easy-to-use Java API for client access.
5. It supports Thrift and REST APIs for non-Java front ends, with XML,
Protobuf and binary data encoding options.
6. It supports a Block Cache and Bloom Filters for real-time queries and for
high-volume query optimization.
7. HBase provides automatic failover between Region Servers.
8. It supports exporting metrics to files through the Hadoop metrics subsystem.
9. It doesn’t enforce relationships within your data.
10. It is a platform for storing and retrieving data with random access.
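To illustrate the Bloom Filter idea from feature 6, here is a minimal Python sketch. It is not HBase's implementation; the class name, the bit-array size and the hashing scheme are assumptions chosen for illustration. The point is that a small bit array can answer "this key is definitely absent" cheaply, which lets a store skip files that cannot contain a requested row.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Sizes are illustrative assumptions, not HBase defaults.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


bloom = BloomFilter()
for row_key in ("row-001", "row-002", "row-003"):
    bloom.add(row_key)

print(bloom.might_contain("row-002"))  # True: added keys are always reported
```

A key that was never added usually returns False, but a True answer for such a key is possible; that is the false-positive trade-off that keeps the filter small.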
Architecture of HBase:
HMaster:
HMaster is the implementation of the Master Server in HBase. It assigns regions
to Region Servers and handles DDL operations (creating and deleting tables). It
monitors all Region Server instances present in the cluster. In a distributed
environment, the Master runs several background threads. HMaster is also
responsible for tasks such as load balancing and failover.
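The assignment and failover roles described above can be sketched in a few lines of Python. This is a simplified illustration, not HBase's actual balancer: the function names, the round-robin policy and the "least-loaded survivor" rule are all assumptions made for the example.

```python
from collections import defaultdict


def assign_regions(regions, servers):
    """Spread regions over servers round-robin; returns server -> regions."""
    assignment = defaultdict(list)
    for i, region in enumerate(regions):
        assignment[servers[i % len(servers)]].append(region)
    return dict(assignment)


def reassign_on_failure(assignment, failed_server):
    """Move the failed server's regions onto the least-loaded survivors."""
    survivors = {s: list(r) for s, r in assignment.items() if s != failed_server}
    for region in assignment.get(failed_server, []):
        # Pick the surviving server that currently hosts the fewest regions.
        target = min(survivors, key=lambda s: len(survivors[s]))
        survivors[target].append(region)
    return survivors


# Hypothetical cluster: six regions on three region servers.
initial = assign_regions(["r1", "r2", "r3", "r4", "r5", "r6"], ["rs1", "rs2", "rs3"])
after_failure = reassign_on_failure(initial, "rs2")
print(initial)
print(after_failure)
```

After the simulated failure of "rs2", its two regions are split between the remaining servers, so each ends up hosting three regions.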
Region Server:
HBase tables are divided horizontally by row key range into Regions. Regions are
the basic building blocks of an HBase cluster: they hold the distributed portions
of tables and are composed of column families. A Region Server runs on an HDFS
DataNode in the Hadoop cluster and is responsible for handling, managing and
executing read and write operations on its set of regions. In older HBase
releases the default maximum region size was 256 MB; the limit is configurable
(hbase.hregion.max.filesize) and newer releases use a much larger default.
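Because regions partition a table by sorted row key range, finding the region for a given row is a search over the region start keys. The following Python sketch shows the idea; the start keys and server names are hypothetical, and real HBase clients resolve this through the cluster's metadata rather than a hard-coded list.

```python
import bisect

# Hypothetical layout: each region covers [start_key, next region's start_key).
# The first region starts at the empty key, so every row key falls in a region.
REGION_START_KEYS = ["", "g", "n", "t"]
REGION_SERVERS = ["rs1", "rs2", "rs1", "rs3"]


def find_region(row_key):
    """Return (region index, region server) for the region covering row_key."""
    # bisect_right finds the first start key greater than row_key;
    # the region just before it is the one whose range contains the key.
    index = bisect.bisect_right(REGION_START_KEYS, row_key) - 1
    return index, REGION_SERVERS[index]


for key in ("apple", "grape", "zebra"):
    print(key, find_region(key))
```

Keeping start keys sorted is what makes the lookup a binary search, and it is also why rows in HBase are ordered only by row key.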
ZooKeeper:
ZooKeeper acts as a coordinator in HBase. It provides services such as
maintaining configuration information, naming, distributed synchronization and
server failure notification. Clients use ZooKeeper to locate region servers and
then communicate with them.
Advantages of HBase:
1. Can store large data sets
2. Database can be shared
3. Cost-effective from gigabytes to petabytes
4. High availability through failover and replication
Disadvantages of HBase:
1. No support for SQL structure
2. No transaction support
3. Sorted only on key
4. Memory issues on the cluster
Installation:
One of the main advantages of Oozie is that it is tightly integrated with the
Hadoop stack, supporting various Hadoop jobs like Hive, Pig and Sqoop as well as
system-specific jobs like Java and Shell.
Oozie detects completion of tasks through callback and polling. When Oozie
starts a task, it provides the task with a unique callback HTTP URL, and the task
notifies that URL when it completes. If the task fails to invoke the callback
URL, Oozie can poll the task for completion.
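The callback-then-polling behaviour described above can be sketched as follows. This is an illustration of the pattern only, not Oozie code: the Task class, the timeouts and the use of an in-process function call in place of an HTTP callback URL are all assumptions.

```python
import threading
import time


class Task:
    """Hypothetical stand-in for a Hadoop job launched by the scheduler."""

    def __init__(self, name, sends_callback):
        self.name = name
        self.sends_callback = sends_callback
        self.done = False

    def run(self, on_complete):
        self.done = True                # the work itself finishes
        if self.sends_callback:
            on_complete(self.name)      # stands in for calling the callback URL

    def status(self):
        return "SUCCEEDED" if self.done else "RUNNING"


def wait_for_completion(task, callback_timeout=0.2, poll_interval=0.05, max_polls=20):
    """Return how completion was detected: 'callback', 'polling' or 'timeout'."""
    notified = threading.Event()
    worker = threading.Thread(target=task.run, args=(lambda _name: notified.set(),))
    worker.start()

    # First, wait for the task to notify the scheduler.
    if notified.wait(timeout=callback_timeout):
        worker.join()
        return "callback"

    # No callback arrived in time: fall back to asking the task for its status.
    for _ in range(max_polls):
        if task.status() == "SUCCEEDED":
            worker.join()
            return "polling"
        time.sleep(poll_interval)
    worker.join()
    return "timeout"


print(wait_for_completion(Task("hive-job", sends_callback=True)))
print(wait_for_completion(Task("pig-job", sends_callback=False)))
```

The first task is detected through its callback; the second never calls back, so the scheduler discovers its completion by polling.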
2. Extract the downloaded Apache Oozie tar using the below command.
3. We will install the Maven tool to compile the Apache Oozie source. Use the
below command to install Maven.
4. Now we will compile Apache Oozie to create binary files for the distro. Go to
the “/bin” directory of Oozie home and run the below command.
5. After this, create a "libext" directory under the Oozie directory, then go to
the “libext” directory and download the “ext-2.2.zip” file.
$ wget http://archive.cloudera.com/gplextras/misc/ext-2.2.zip
6. After this, copy all Hadoop libraries into the “libext” folder.
7. The Oozie binary file will now be available in the target directory.
10. Now start the Oozie server using the below command.
Conclusion:
Through this experiment, I learned how to set up and install HBase and Oozie, as well as
how to execute basic commands using these tools. I performed tasks such as configuring
the installations and running simple commands to interact with them. These tools are
used for managing large volumes of data efficiently: HBase serves as a NoSQL database
for real-time read/write access, and Oozie acts as a workflow scheduler for Hadoop jobs,
automating and coordinating data processing tasks.