
HBASE

Q) What is HBase?

HBase is an open-source, column-oriented, distributed database system that runs in a Hadoop environment. It is modeled on Google's Bigtable and is written primarily in Java. Apache HBase is needed for real-time Big Data applications.

HBase can store massive amounts of data, from terabytes to petabytes. The tables present in HBase can consist of billions of rows having millions of columns.

HBase is a column-oriented database that can manage both structured and unstructured data, and as a NoSQL tool it provides access to huge amounts of data through a non-relational data model.

HBase is a column-oriented NoSQL database. Although it looks similar to a relational database in that it contains rows and columns, it is not a relational database: relational databases are row-oriented, while HBase is column-oriented. So, let us first understand the difference between column-oriented and row-oriented databases.

Row-oriented vs column-oriented Databases:


Row-oriented databases store table records in a sequence of rows, whereas column-oriented databases store table records in a sequence of columns, i.e. the entries in a column are stored in contiguous locations on disk.
To better understand this, let us take an example and consider the table below.

If this table is stored in a row-oriented database, it will store the records as shown below:
1, Paul Walker, US, 231, Gallardo
2, Vin Diesel, Brazil, 520, Mustang
In row-oriented databases, data is stored row by row (as tuples), as you can see above.
A column-oriented database would store the same data as:
1, 2, Paul Walker, Vin Diesel, US, Brazil, 231, 520, Gallardo, Mustang
In a column-oriented database, all the values of a column are stored together: the first column's values are stored together, then the second column's values, and so on for the remaining columns.
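The two physical layouts described above can be sketched in a few lines of Python (this is only an illustration of the on-disk ordering, using the sample table from the text, not how any particular database serializes records):

```python
# Sample table from the text: (id, name, country, points, car)
rows = [
    (1, "Paul Walker", "US", 231, "Gallardo"),
    (2, "Vin Diesel", "Brazil", 520, "Mustang"),
]

# Row-oriented layout: each record is stored contiguously.
row_layout = [value for record in rows for value in record]

# Column-oriented layout: each column's values are stored contiguously.
col_layout = [value for column in zip(*rows) for value in column]

print(row_layout)  # [1, 'Paul Walker', 'US', 231, 'Gallardo', 2, ...]
print(col_layout)  # [1, 2, 'Paul Walker', 'Vin Diesel', 'US', 'Brazil', ...]
```

Reading one whole record touches one contiguous run in the row layout, while reading one whole column touches one contiguous run in the column layout, which is exactly the trade-off the comparison below turns on.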

 When the amount of data is very huge, in terms of petabytes or exabytes, we use the column-oriented approach, because the data of a single column is stored together and can be accessed faster.
 The row-oriented approach handles a comparatively smaller number of rows and columns efficiently, as a row-oriented database stores data in a structured format.
 When we need to process and analyze a large set of semi-structured or unstructured data, we use the column-oriented approach, as in applications dealing with Online Analytical Processing (OLAP) such as data mining, data warehousing, and analytics.
 Online Transactional Processing (OLTP) applications, such as those in the banking and finance domains, handle structured data and require transactional (ACID) properties, so they use the row-oriented approach.
Relational Databases vs. HBase

When talking of data stores, we first think of relational databases, with structured data storage and a sophisticated query engine. However, a relational database incurs a big performance penalty as the data size increases. HBase, on the other hand, is designed from the ground up to provide scalability and partitioning, enabling efficient data structure serialization, storage, and retrieval. Broadly, the differences between a relational database and HBase are:

HDFS vs. HBase

HDFS is a distributed file system that is well suited for storing large files. It is designed to support batch processing of data but does not provide fast individual record lookups. HBase is built on top of HDFS and is designed to provide access to single rows of data in large tables. Overall, the differences between HDFS and HBase are:
Q) Explain HBase Architecture and its components
HBase has three major components: the HMaster Server, the HBase Region Servers (which serve the regions), and ZooKeeper.

HBase architecture has a single HBase master node (HMaster) and several slaves i.e. region
servers. Each region server (slave) serves a set of regions, and a region can be served only by
a single region server. Whenever a client sends a write request, HMaster receives the
request and forwards it to the corresponding region server.
The below figure explains the hierarchy of the HBase Architecture. We will talk about each
one of them individually.

Region

A region contains all the rows between the start key and the end key assigned to that region. HBase tables can be divided into a number of regions in such a way that all the columns of a column family are stored in one region. Each region holds its rows in sorted order.

Many regions are assigned to a Region Server, which is responsible for handling, managing, and executing read and write operations on that set of regions.

 A table can be divided into a number of regions. A Region is a sorted range of rows
storing data between a start key and an end key.
 A Region has a default size of 256MB which can be configured according to the need.
 A Group of regions is served to the clients by a Region Server.
 A Region Server can serve approximately 1000 regions to the client.
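The bullet points above describe a region as a sorted range of row keys. A minimal sketch of how looking up the region for a row key could work (a toy illustration with made-up region boundaries, not HBase's actual lookup code):

```python
import bisect

# Hypothetical regions as sorted, non-overlapping row-key ranges.
# Each region is (start_key, end_key); an empty end key means "to the end".
regions = [("", "g"), ("g", "p"), ("p", "")]

def find_region(row_key):
    """Return the region whose [start_key, end_key) range holds row_key."""
    starts = [start for start, _ in regions]
    # Rightmost region whose start key is <= row_key.
    idx = bisect.bisect_right(starts, row_key) - 1
    return regions[idx]

print(find_region("apple"))      # ('', 'g')
print(find_region("hbase"))      # ('g', 'p')
print(find_region("zookeeper"))  # ('p', '')
```

Because the ranges are sorted and non-overlapping, a binary search over the start keys is all that is needed to route any row key to its region.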
Now, starting from the top of the hierarchy, let us first look at the HMaster Server, which acts similarly to the NameNode in HDFS.
A Region Server runs on an HDFS DataNode and consists of the following components –

 Block Cache – This is the read cache. The most frequently read data is stored in the read cache, and when the block cache is full, the least recently used data is evicted.
 MemStore – This is the write cache. It stores new data that has not yet been written to disk; every column family in a region has its own MemStore.
 Write Ahead Log (WAL) – A file that stores new data that has not yet been persisted to permanent storage, used for recovery in case of failure.
 HFile is the actual storage file that stores the rows as sorted key values on a disk.
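The interplay of the MemStore and HFiles listed above can be sketched as a toy write path (these are simplified stand-ins, not HBase's actual classes):

```python
# Minimal sketch: new cells land in an in-memory MemStore and are
# periodically flushed to an immutable, sorted "HFile".
memstore = {}   # write cache: {(row, column): value}
hfiles = []     # each flush produces one sorted, immutable file

def put(row, column, value):
    """Buffer a new cell in the write cache."""
    memstore[(row, column)] = value

def flush():
    """Write the MemStore out as a sorted key-value file, then clear it."""
    hfiles.append(sorted(memstore.items()))
    memstore.clear()

put("row2", "cf:name", "Vin Diesel")
put("row1", "cf:name", "Paul Walker")
flush()
print(hfiles[0])  # keys come out sorted: row1 before row2
```

Note how the flush sorts the buffered cells, which is why an HFile can store its rows as sorted key-values regardless of the order in which writes arrived.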

HMaster
As in the below image, you can see that the HMaster handles a collection of Region Servers which reside on DataNodes. Let us understand how HMaster does that.

 HBase HMaster performs DDL operations (create and delete tables) and assigns regions to the Region Servers, as you can see in the above image.
 It coordinates and manages the Region Servers (similar to how the NameNode manages DataNodes in HDFS).
 It assigns regions to the Region Servers on startup and re-assigns regions to Region Servers during recovery and load balancing.
ZooKeeper
This below image explains the ZooKeeper’s coordination mechanism.

 ZooKeeper acts like a coordinator inside the HBase distributed environment. It helps in maintaining server state inside the cluster by communicating through sessions.
 Every Region Server, along with the HMaster Server, sends continuous heartbeats at regular intervals to ZooKeeper, which checks which servers are alive and available, as shown in the above image. It also provides server failure notifications so that recovery measures can be executed.
 Referring to the above image, you can see there is an inactive server, which acts as a backup for the active server. If the active server fails, it comes to the rescue.
 The active HMaster sends heartbeats to ZooKeeper, while the inactive HMaster listens for notifications sent by the active HMaster. If the active HMaster fails to send a heartbeat, the session is deleted and the inactive HMaster becomes active.
 If a Region Server fails to send a heartbeat, its session expires and all listeners are notified. The HMaster then performs suitable recovery actions, which we will discuss later.
 ZooKeeper also maintains the path of the META table, which helps any client search for regions. A client first checks the META table to find which Region Server a region belongs to, and from it gets the path of that Region Server.

Q) Explain Schema Design in HBase?

HBase schema design is very different from relational database schema design. Below are some general concepts that should be followed while designing a schema in HBase:
 Row key: Each HBase table is indexed on its row key, and data is sorted lexicographically by this row key. There are no secondary indexes available on an HBase table.
 Atomicity: Avoid designing tables that require atomicity across rows; all operations on HBase rows are atomic only at the row level.
 Even distribution: Reads and writes should be uniformly distributed across all nodes available in the cluster. Design the row key in such a way that related entities are stored in adjacent rows, to increase read efficiency.
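Because row keys sort lexicographically, numeric identifiers need care. A quick Python illustration of the ordering pitfall and the usual fixed-width padding fix:

```python
# Row keys sort lexicographically as strings/bytes, not numerically,
# so numeric IDs need fixed-width padding to keep their natural order.
ids = [1, 2, 10, 20]

naive = sorted(str(i) for i in ids)
padded = sorted(str(i).zfill(8) for i in ids)

print(naive)   # ['1', '10', '2', '20']  -- 10 sorts before 2!
print(padded)  # ['00000001', '00000002', '00000010', '00000020']
```

Zero-padding to a fixed width makes the lexicographic order match the numeric order, so related rows stay adjacent as intended.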
HBase schema size limits: row key, column family, column qualifier, and cell value
Consider the size limits below when designing a schema in HBase:
 Row keys: 4 KB per key
 Column families: not more than 10 column families per table
 Column qualifiers: 16 KB per qualifier
 Individual values: less than 10 MB per cell
 All values in a single row: max 10 MB

Reverse Domain Names

If you are storing data that is represented by domain names, consider using the reverse domain name as the row key for your HBase tables, for example com.company.name.
This technique works fine when your data is spread across multiple reverse domains. If you have very few reverse domains, you may end up storing most of the data on a single node, causing hotspotting.
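A reversed-domain row key can be derived with a one-line helper (the function name is made up here, only to illustrate the technique):

```python
def reverse_domain(domain):
    """Turn 'name.company.com' into 'com.company.name' for use as a row key."""
    return ".".join(reversed(domain.split(".")))

print(reverse_domain("name.company.com"))  # com.company.name
```

Reversing the labels puts the most general part first, so all rows for one organization sort next to each other.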

Hashing
When your data is identified by a string identifier, that identifier is a good basis for your HBase row key. Use a hash of the string identifier as the row key instead of the raw string. For example, if you are storing user data identified by user IDs, a hash of the user ID is a better choice for the row key.
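A hashed row key might be derived as follows (MD5 is used here only as an example digest; any stable hash spreads keys in the same way):

```python
import hashlib

def hashed_row_key(user_id):
    """Hash a string identifier so row keys spread evenly across regions."""
    return hashlib.md5(user_id.encode("utf-8")).hexdigest()

# Adjacent user IDs hash to unrelated keys, avoiding write hotspots.
print(hashed_row_key("user12345"))
print(hashed_row_key("user12346"))
```

The trade-off is that hashing destroys the natural ordering, so range scans over the original identifier are no longer possible.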

Timestamps
When you retrieve data based on the time it was stored, it is best to include a timestamp in your row key. For example, if you are storing machine logs identified by machine number, append the timestamp to the machine number when designing the row key, e.g. machine001#1435310751234.
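Such a composite machine#timestamp key can be built with a small helper (the function name is made up for illustration):

```python
def machine_log_key(machine_id, ts_millis):
    """Append the timestamp to the machine number, e.g. machine001#1435310751234."""
    return f"{machine_id}#{ts_millis}"

print(machine_log_key("machine001", 1435310751234))  # machine001#1435310751234
```

With this layout, all log rows for one machine sort together, ordered by time within the machine.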

Combined Row Key

You can combine multiple keys to design the row key for your HBase table, based on your requirements.
HBase Column Families and Column Qualifiers
Below is some guidance on column families and column qualifiers:
Column Families
In HBase, use at most about 10 column families to get the best performance out of the HBase cluster. If your row contains multiple values that are related to each other, place them in the same column family. Also, keep the names of your column families short, since they are included in the data that is transferred for each request.
Column Qualifiers
You can create as many column qualifiers as you need in each row; empty cells in a row do not consume any space. The names of your column qualifiers should be short, since they are included in the data that is transferred for each request.
Creating HBase Schema Design
You can create the schema using the Apache HBase shell or the Java API.
Below is an example of creating a table schema with the shell:

hbase(main):001:0> create 'test_table_schema', 'cf'

0 row(s) in 2.7740 seconds

=> Hbase::Table - test_table_schema

Q) What is Apache ZooKeeper?

Apache ZooKeeper is a software project of the Apache Software Foundation. It is an open-source technology that maintains configuration information and provides synchronization and group services, and it is deployed on Hadoop clusters to administer the infrastructure.

The ZooKeeper framework was originally built at Yahoo! to give applications easier, more reliable access to services, but later on ZooKeeper was used for organizing services used by distributed frameworks like Hadoop and HBase, and Apache ZooKeeper became a standard. It was designed to be a robust service that enables application developers to focus mainly on their application logic rather than on coordination.
In a distributed environment, coordinating and managing a service is a difficult process. Apache ZooKeeper solves this problem with its simple architecture and API, which allow developers to implement common coordination tasks like electing a master server, managing group membership, and managing metadata.
Apache ZooKeeper is used for maintaining centralized configuration information, naming,
providing distributed synchronization, and providing group services in a simple interface so
that we don’t have to write it from scratch. Apache Kafka also uses ZooKeeper to manage
configuration. ZooKeeper allows developers to focus on the core application logic, and it
implements various protocols on the cluster so that the applications need not implement
them on their own.
ZooKeeper Architecture
Apache ZooKeeper works on a client-server architecture in which the clients are the machines that use the service and the servers are the ZooKeeper nodes.
The following figure shows the relationship between the servers and their clients. In it, we can see that each client uses the client library and communicates with any of the ZooKeeper nodes.

Components of the ZooKeeper architecture are explained in the following table.

Part: Description
Client: A node in our distributed application cluster that accesses information from the server. It sends a message to the server to let the server know that it is alive, and if there is no response from the connected server, the client automatically resends the message to another server.
Server: Gives an acknowledgement to the client to inform it that the server is alive, and provides all services to clients.
Leader: A server node that performs automatic recovery if any of the server nodes fails.
Follower: A server node which follows the instructions given by the leader.
Working of Apache ZooKeeper
 The first thing that happens as soon as the ensemble (a group of ZooKeeper servers) starts is that it waits for clients to connect to the servers.
 After that, the clients in the ZooKeeper ensemble connect to one of the nodes. That node can be either a leader node or a follower node.
 Once the client is connected to a particular node, the node assigns a session ID to
the client and sends an acknowledgement to that particular client.
 If the client does not get any acknowledgement from the node, then it resends the
message to another node in the ZooKeeper ensemble and tries to connect with it.
 On receiving the acknowledgement, the client makes sure that the connection is not
lost by sending the heartbeats to the node at regular intervals.
 Finally, the client can perform functions like read, write, or store the data as per the
need.
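The session and heartbeat flow described in the steps above can be sketched as a toy simulation (this is not the real ZooKeeper protocol, just the session-expiry logic in miniature):

```python
# Toy sketch: a node expires a client's session if heartbeats stop arriving.
SESSION_TIMEOUT = 3  # ticks of silence before a session expires

sessions = {}  # session_id -> ticks since last heartbeat

def connect(session_id):
    sessions[session_id] = 0          # node assigns a session and acks

def heartbeat(session_id):
    sessions[session_id] = 0          # a heartbeat resets the timer

def tick():
    """One unit of time passes; expire sessions that went quiet."""
    for sid in list(sessions):
        sessions[sid] += 1
        if sessions[sid] >= SESSION_TIMEOUT:
            del sessions[sid]         # listeners would be notified here

connect("client-1")
tick(); tick()
heartbeat("client-1")                  # regular heartbeats keep the session alive
tick(); tick(); tick()                 # silence: the session expires
print("client-1" in sessions)  # False
```

This mirrors the text: as long as heartbeats arrive at regular intervals the session survives, and once they stop the session is expired and recovery can begin.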

Features of Apache ZooKeeper


Apache ZooKeeper provides a wide range of useful features:

 Updating the node's status: Apache ZooKeeper is capable of updating every node, which allows it to store updated information about each node across the cluster.
 Managing the cluster: ZooKeeper can manage the cluster in such a way that the status of each node is maintained in real time, leaving less room for errors and ambiguity.
 Naming service: ZooKeeper attaches a unique identifier to every node, quite similar to how DNS helps identify hosts.
 Automatic failure recovery: Apache ZooKeeper locks data while it is being modified, which helps the cluster recover automatically if a failure occurs.

Benefits of Apache ZooKeeper

 Simplicity: Coordination is done with the help of a shared hierarchical namespace.
 Reliability: The system keeps performing even if more than one node fails.
 Order: ZooKeeper keeps track of updates by stamping each one with a number denoting its order.
 Speed: It is fastest in read-dominant workloads, performing best at read-to-write ratios of around 10:1.
 Scalability: Performance can be enhanced by deploying more machines.
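The "Order" benefit can be illustrated with a toy update log where every change is stamped with a monotonically increasing number (ZooKeeper's real counter is the zxid; this sketch only mimics the idea):

```python
import itertools

# Every update gets the next value from a monotonically increasing counter,
# so any two updates can be compared by their stamp to decide which came first.
counter = itertools.count(1)
log = []

def update(change):
    log.append((next(counter), change))

update("create /config")
update("set /config value=1")
print(log)  # [(1, 'create /config'), (2, 'set /config value=1')]
```

Because the stamps are strictly increasing, replaying the log in stamp order reproduces the exact sequence of changes.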

ZooKeeper Use Cases


There are many use cases of ZooKeeper. Some of the most prominent of them are as
follows:

 Managing the configuration


 Naming services
 Choosing the leader
 Queuing messages
 Managing the notification system
 Synchronization

One of the ways in which we can communicate with the ZooKeeper ensemble is by using the ZooKeeper Command Line Interface (CLI). The CLI gives us access to the various operations ZooKeeper supports and is also relied on heavily for debugging.

Applications of Zookeeper

a. Apache Solr

For leader election and centralized configuration, Apache Solr uses Zookeeper.
b. Apache Mesos

Apache Mesos is a cluster manager that offers efficient resource isolation and sharing across distributed applications. Mesos uses ZooKeeper for its fault-tolerant replicated master.
c. Yahoo!

As we all know, ZooKeeper was originally built at Yahoo!, which designed it for requirements like robustness, data transparency, centralized configuration, better performance, and coordination.
d. Apache Hadoop
As we know, Apache Hadoop is the driving force behind the growth of the Big Data industry, and Hadoop relies on ZooKeeper for configuration management and coordination.

Multiple ZooKeeper servers support large Hadoop clusters, and each client machine communicates with one of the ZooKeeper servers to retrieve and update its synchronization information. One example is the Human Genome Project: as it holds terabytes of data, it uses the Hadoop MapReduce framework to analyze the dataset and find interesting facts for human development.
e. Apache HBase

Apache HBase is an open-source, distributed, NoSQL database used for real-time read/write access to large datasets. Installing HBase as a distributed application depends on a running ZooKeeper cluster.
In addition, Apache HBase uses ZooKeeper to track the status of distributed data across the master and region servers. Some use cases of HBase are −
i. Telecom

One of them is the telecom industry, which stores billions of mobile call records and accesses them in real time. It uses HBase to process all the records in real time, easily and efficiently.
ii. Social network
Social networks like Twitter, LinkedIn, and Facebook receive huge volumes of data on a daily basis, and they also use HBase to find recent trends and other interesting facts.
f. Apache Accumulo

Moreover, Apache Accumulo, another sorted, distributed key/value store, is built on top of Apache ZooKeeper (and Apache Hadoop).
g. Neo4j

Neo4j is a distributed graph database that uses ZooKeeper for write-master selection and read-slave coordination.
h. Cloudera

Basically, Cloudera Search integrates search functionality with Hadoop and uses ZooKeeper for centralized configuration management.

Q) Explain Big Data Applications or Big Data Analytics ?


GOVERNMENT

Big data analytics has proven to be very useful in the government sector. Big data analysis played a large role in Barack Obama's successful 2012 re-election campaign, and more recently it was a major factor in the BJP and its allies winning the Indian General Election of 2014. The Indian Government utilizes numerous techniques to ascertain how the Indian electorate is responding to government action, as well as to gather ideas for policy augmentation.

SOCIAL MEDIA ANALYTICS

The advent of social media has led to an outburst of big data. Various solutions have been built to analyze social media activity; for example, IBM's Cognos Consumer Insights, a point solution running on IBM's BigInsights Big Data platform, can make sense of the chatter. Social media can provide valuable real-time insights into how the market is responding to products and campaigns. With the help of these insights, companies can adjust their pricing, promotions, and campaign placements accordingly. Before utilizing big data, some preprocessing needs to be done on it in order to derive intelligent and valuable results. Thus, to know the consumer mindset, applying intelligent decisions derived from big data is necessary.

TECHNOLOGY

The technological applications of big data involve companies which deal with huge amounts of data every day and put them to use for business decisions as well. For example, eBay.com uses two data warehouses, at 7.5 petabytes and 40PB, as well as a 40PB Hadoop cluster, for search, consumer recommendations, and merchandising, with roughly 90PB warehoused in total. Amazon.com handles millions of back-end operations every day, as
well as queries from more than half a million third-party sellers. The core technology that
keeps Amazon running is Linux-based and as of 2005, they had the world’s three largest
Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB. Facebook handles 50 billion
photos from its user base. Windermere Real Estate uses anonymous GPS signals from nearly
100 million drivers to help new home buyers determine their typical drive times to and from
work throughout various times of the day.

CALL CENTER ANALYTICS

Now we turn to the customer-facing Big Data application examples, of which call center
analytics are particularly powerful. What’s going on in a customer’s call center is often a
great barometer and influencer of market sentiment, but without a Big Data solution, much
of the insight that a call center can provide will be overlooked or discovered too late. Big
Data solutions can help identify recurring problems or customer and staff behavior patterns
on the fly not only by making sense of time/quality resolution metrics but also by capturing
and processing call content itself.
BANKING

The use of customer data invariably raises privacy issues. By uncovering hidden connections between seemingly unrelated pieces of data, big data analytics could potentially reveal sensitive personal information. Research indicates that 62% of bankers are cautious in their use of big data due to privacy issues. Further, outsourcing data analysis activities or distributing customer data across departments to generate richer insights also amplifies security risks; there have been cases where customers' earnings, savings, mortgages, and insurance policies ended up in the wrong hands. Such incidents reinforce concerns about data privacy and discourage customers from sharing personal information in exchange for customized offers.

AGRICULTURE

A biotechnology firm uses sensor data to optimize crop efficiency. It plants test crops and
runs simulations to measure how plants react to various changes in condition. Its data
environment constantly adjusts to changes in the attributes of various data it collects,
including temperature, water levels, soil composition, growth, output, and gene sequencing
of each plant in the test bed. These simulations allow it to discover the optimal
environmental conditions for specific gene types.

MARKETING

Marketers have begun to use facial recognition software to learn how well their advertising
succeeds or fails at stimulating interest in their products. A recent study published in the
Harvard Business Review looked at what kinds of advertisements compelled viewers to
continue watching and what turned viewers off. Among their tools was “a system that
analyses facial expressions to reveal what viewers are feeling.” The research was designed
to discover what kinds of promotions induced watchers to share the ads with their social
network, helping marketers create ads most likely to “go viral” and improve sales.

SMART PHONES

Perhaps more impressively, people now carry facial recognition technology in their pockets. Users of iPhone and Android smartphones have applications at their fingertips that use facial recognition technology for various tasks. For example, Android users with the remember app can snap a photo of someone and then bring up stored information about that person based on their image when their own memory lets them down, a potential boon for salespeople.

TELECOM

Nowadays big data is used in many different fields, and it plays an important role in telecom as well. Operators face an uphill challenge when they need to deliver new, compelling, revenue-generating services without overloading their networks and while keeping their running costs under control. The market demands a new set of data management and analysis capabilities that can help service providers make accurate decisions by taking into account customer and network context and other critical aspects of their businesses. Most of these decisions must be made in real time, placing additional pressure on the operators. Real-time predictive analytics can help operators leverage the data that resides in their multitude of systems, make it immediately accessible, and correlate it to generate insight that helps them drive their business forward.

HEALTHCARE

Traditionally, the healthcare industry has lagged behind other industries in the use of big data. Part of the problem stems from resistance to change: providers are accustomed to making treatment decisions independently, using their own clinical judgment, rather than relying on protocols based on big data. Other obstacles are more structural in nature, which makes healthcare one of the best places to set an example for Big Data applications. Even within a single hospital, payor, or pharmaceutical company, important information often remains siloed within one group or department because organizations lack procedures for integrating data and communicating findings.

Health care stakeholders now have access to promising new threads of knowledge. This information is a form of "big data," so called not only for its sheer volume but for its complexity, diversity, and timeliness. Pharmaceutical industry experts, payers, and providers are now beginning to analyze big data to obtain insights. Recent technological advances in the industry have improved their ability to work with such data, even though the files are enormous and often have different database structures and technical characteristics.

Q) Explain about Big Data Drivers

While most definitions of Big Data focus on the new forms of unstructured data flowing through businesses with new levels of "volume, velocity, variety, and complexity", I tend to answer the question using a simple equation:

Big Data = Transactions + Interactions + Observations

7 KEY DRIVERS BEHIND THE BIG DATA MARKET

Business
1. Opportunity to enable innovative new business models
2. Potential for new insights that drive competitive advantage

Technical
1. Data collected and stored continues to grow exponentially
2. Data is increasingly everywhere and in many formats
3. Traditional solutions are failing under new requirements

Financial
1. Cost of data systems, as a percentage of IT spend, continues to grow
2. Cost advantages of commodity hardware and open source software

Q) What are the Benefits or Advantages of Big Data?

Following are the benefits or advantages of Big Data:

➨Big data analysis derives innovative solutions; it helps in understanding and targeting customers and in optimizing business processes.
➨It helps in improving science and research.
➨It improves healthcare and public health through the availability of patient records.
➨It helps in financial trading, sports, polling, security/law enforcement, etc.
➨Anyone can access vast amounts of information via surveys and get an answer to any query.

Drawbacks or disadvantages of Big Data

Following are the drawbacks or disadvantages of Big Data:
➨Traditional storage can cost a lot of money when used to store big data.
➨Lots of big data is unstructured.
➨Big data analysis can violate principles of privacy.
➨It can be used for manipulation of customer records.
➨It may increase social stratification.
➨Big data analysis is not useful in the short run; data needs to be analyzed over a longer duration to leverage its benefits.
