Q) What is HBase?
HBase can store massive amounts of data, from terabytes to petabytes. Tables in
HBase can consist of billions of rows with millions of columns.
HBase is a column-oriented NoSQL database that can manage both structured and unstructured data,
providing access to huge amounts of data through a non-relational data model.
Consider a table of records with an ID, name, country, number, and car. If this table is stored in a row-oriented database, it will store the records as shown below:
1, Paul Walker, US, 231, Gallardo
2, Vin Diesel, Brazil, 520, Mustang
In row-oriented databases, data is stored on the basis of rows, or tuples, as you can see
above.
While the column-oriented databases store this data as:
1,2, Paul Walker, Vin Diesel, US, Brazil, 231, 520, Gallardo, Mustang
In a column-oriented database, all the values of a column are stored together: first all the
values of the first column, then all the values of the second column, and the data in the
other columns is stored in a similar manner.
When the amount of data is very huge, in terms of petabytes or exabytes, we
use the column-oriented approach, because the data of a single column is stored
together and can be accessed faster.
The row-oriented approach, by comparison, handles a smaller number of rows and
columns efficiently, as a row-oriented database stores data in a structured format.
When we need to process and analyze a large set of semi-structured or unstructured
data, we use the column-oriented approach. Examples include applications dealing with
Online Analytical Processing (OLAP), such as data mining, data warehousing, and
analytics.
Whereas Online Transactional Processing (OLTP) applications, such as those in the banking
and finance domains, which handle structured data and require transactional (ACID)
properties, use the row-oriented approach.
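As a rough sketch (plain Python, not HBase itself), the two layouts of the example records above can be written out like this:

```python
# Toy illustration: the same two records laid out row-wise and column-wise,
# mirroring the Paul Walker / Vin Diesel example above.

records = [
    (1, "Paul Walker", "US", 231, "Gallardo"),
    (2, "Vin Diesel", "Brazil", 520, "Mustang"),
]

# Row-oriented: each record's fields are stored contiguously.
row_layout = [field for record in records for field in record]

# Column-oriented: all values of one column are stored contiguously.
column_layout = [record[i] for i in range(len(records[0])) for record in records]

print(row_layout)
print(column_layout)
```

Note how `column_layout` reproduces the column-oriented line from the text: IDs first, then names, and so on.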
Relational Databases vs. HBase
When talking of data stores, we first think of relational databases, with structured data
storage and a sophisticated query engine. However, a relational database pays a growing
performance penalty as the data size increases. HBase, on the other hand, is
designed from the ground up to provide scalability and partitioning, enabling efficient
data serialization, storage, and retrieval. Broadly, the differences between a
relational database and HBase are: a relational database enforces a fixed schema and full
ACID transactions and typically scales vertically, while HBase fixes only the column
families of a table, guarantees atomicity only at the row level, and scales horizontally
across commodity hardware.
HDFS is a distributed file system that is well suited for storing large files. It is designed to
support batch processing of data but does not provide fast individual record lookups. HBase
is built on top of HDFS and is designed to provide fast access to single rows of data in large
tables. Overall, HDFS offers high-throughput sequential access to large files, while HBase
adds low-latency random reads and writes of individual rows on top of that storage.
Q) Explain HBase Architecture and its components
HBase has three major components: the HMaster server, the Region Servers (which serve
Regions), and ZooKeeper.
HBase architecture has a single HBase master node (HMaster) and several slaves, i.e., region
servers. Each region server (slave) serves a set of regions, and a region can be served only by
a single region server. When a client sends a read or write request, it locates the responsible
region server (via the hbase:meta catalog table) and sends the request directly to that
server; the HMaster stays off the read/write path.
The figure below explains the hierarchy of the HBase architecture. We will talk about each
component individually.
Region
A region contains all the rows between the start key and the end key assigned to that
region. HBase tables can be divided into a number of regions in such a way that all the
columns of a column family are stored in one region. Each region contains its rows in
sorted order.
Many regions are assigned to a Region Server, which is responsible for handling, managing,
and executing read and write operations on that set of regions.
A table can be divided into a number of regions. A Region is a sorted range of rows
storing data between a start key and an end key.
A Region had a default maximum size of 256 MB in older HBase versions (recent releases
default to 10 GB), and the size can be configured according to need.
A group of regions is served to the clients by a Region Server.
A Region Server can serve approximately 1,000 regions.
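The start-key/end-key idea above can be sketched in a few lines of Python; the region boundaries here are invented for illustration:

```python
import bisect

# Toy model of region assignment (illustrative, not HBase internals):
# each region covers [start_key, next_start_key); region start keys are sorted.
region_starts = ["", "row-0500", "row-1000", "row-1500"]  # "" marks the first region

def find_region(row_key):
    """Return the index of the region whose key range contains row_key."""
    # bisect_right gives the insertion point; the owning region is just before it.
    return bisect.bisect_right(region_starts, row_key) - 1

print(find_region("row-0042"))  # first region (index 0)
print(find_region("row-1234"))  # third region (index 2)
```

Because region boundaries are sorted row keys, locating the region serving a row is a simple binary search.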
Now, starting from the top of the hierarchy, I would first like to explain the HMaster
server, which acts similarly to the NameNode in HDFS.
A Region Server runs on an HDFS DataNode and consists of the following components –
Block Cache – the read cache. The most frequently read data is stored here, and
whenever the block cache is full, the least recently used data is evicted.
MemStore – the write cache. It stores new data that has not yet been written to
disk; every column family in a region has its own MemStore.
Write Ahead Log (WAL) – a file that stores new data that has not yet been persisted to
permanent storage, used for recovery if a server fails.
HFile – the actual storage file that stores the rows as sorted key-values on disk.
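The interplay of the WAL, MemStore, and HFile described above can be sketched as a toy Python model (an illustration of the flow, not HBase's actual implementation; the class and the flush threshold are invented):

```python
# Toy write path: append to the WAL, buffer in the MemStore, and flush
# sorted key-values to an "HFile" when the MemStore fills up.

class ToyRegionStore:
    def __init__(self, memstore_limit=3):
        self.wal = []            # write-ahead log: durable record of every write
        self.memstore = {}       # in-memory write cache
        self.hfiles = []         # flushed, sorted key-value "files"
        self.memstore_limit = memstore_limit

    def put(self, row_key, value):
        self.wal.append((row_key, value))      # 1. persist to the WAL first
        self.memstore[row_key] = value         # 2. buffer in the MemStore
        if len(self.memstore) >= self.memstore_limit:
            self.flush()                       # 3. flush when the cache is full

    def flush(self):
        # HFiles store rows as *sorted* key-values on disk.
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}

store = ToyRegionStore()
for key in ["r3", "r1", "r2", "r4"]:
    store.put(key, "v-" + key)

print(store.hfiles)    # one flushed file with keys sorted r1, r2, r3
print(store.memstore)  # r4 still buffered in the write cache
```

Note that keys arrive out of order but each flushed file is sorted, which is what makes HFiles cheap to scan and merge.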
HMaster
As you can see in the image below, the HMaster handles a collection of Region Servers
which reside on DataNodes. Let us understand how HMaster does that.
HBase HMaster performs DDL operations (create and delete tables) and
assigns regions to the Region Servers, as you can see in the above image.
It coordinates and manages the Region Servers (similar to how the NameNode manages
DataNodes in HDFS).
It assigns regions to the Region Servers on startup and re-assigns regions to Region
Servers during recovery and load balancing.
ZooKeeper
The image below explains ZooKeeper's coordination mechanism.
HBase schema design is very different from relational database schema design. Below are
some general concepts that should be followed while designing a schema in HBase:
Row key: Each HBase table is indexed on its row key, and data is sorted
lexicographically by this row key. There are no secondary indexes available on an HBase
table.
Atomicity: Avoid designing tables that require atomicity across rows; all
operations on HBase rows are atomic at the row level only.
Even distribution: Reads and writes should be uniformly distributed across all nodes
available in the cluster. Design the row key in such a way that related entities are
stored in adjacent rows, to increase read efficiency.
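One consequence of lexicographic row-key sorting is worth a tiny demonstration: numeric IDs need fixed-width (zero-padded) keys to sort in the intended numeric order. A minimal Python sketch:

```python
# Row keys sort lexicographically as strings, not numerically.
# Zero-padding keeps numerically adjacent rows lexicographically adjacent.

unpadded = sorted(["row9", "row10", "row2"])
padded = sorted(["row0009", "row0010", "row0002"])

print(unpadded)  # "row10" sorts before "row2" lexicographically
print(padded)    # zero-padding restores the intended numeric order
```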
HBase schema size limits: row key, column family, column qualifier, and cell value
Consider the size limits below when designing a schema in HBase:
Row keys: 4 KB per key
Column families: not more than 10 column families per table
Column qualifiers: 16 KB per qualifier
Individual values: less than 10 MB per cell
All values in a single row: max 10 MB
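As a sketch, the limits above can be encoded as simple checks; note these are design guidelines, not errors HBase itself enforces, and the function name and structure are illustrative:

```python
# Guideline checks derived from the size limits listed above.
LIMITS = {
    "row_key": 4 * 1024,             # 4 KB per key
    "column_qualifier": 16 * 1024,   # 16 KB per qualifier
    "cell_value": 10 * 1024 * 1024,  # less than 10 MB per cell
}

def check_cell(row_key: bytes, qualifier: bytes, value: bytes) -> list:
    """Return a list of guideline violations for one cell (empty list = OK)."""
    problems = []
    if len(row_key) > LIMITS["row_key"]:
        problems.append("row key over 4 KB")
    if len(qualifier) > LIMITS["column_qualifier"]:
        problems.append("qualifier over 16 KB")
    if len(value) > LIMITS["cell_value"]:
        problems.append("cell value over 10 MB")
    return problems

print(check_cell(b"user42", b"info:name", b"Paul Walker"))  # no violations
```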
Hashing
When your data is identified by a string identifier, that identifier is a good candidate for
your HBase table row key, but use a hash of the string instead of the raw string. For
example, if you are storing user data that is identified by user IDs, then a hash of the
user ID is a better choice for your row key.
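A minimal Python sketch of this idea, assuming MD5 is an acceptable hash for key distribution (any stable hash would do; MD5 is used here for spreading keys, not for security):

```python
import hashlib

# Hash the string identifier so that sequential user IDs map to
# well-distributed row keys instead of clustering in one region.

def row_key_for_user(user_id: str) -> str:
    # md5 gives a stable 32-character hex digest; illustrative choice only.
    return hashlib.md5(user_id.encode("utf-8")).hexdigest()

print(row_key_for_user("user1001"))
print(row_key_for_user("user1002"))  # an adjacent ID maps to a distant key
```

The trade-off is that hashed keys destroy the natural sort order, so range scans over raw IDs are no longer possible.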
Timestamps
When you retrieve data based on the time it was stored, it is best to include the
timestamp in your row key. For example, if you are storing machine logs identified
by machine number, append the timestamp to the machine number when designing the
row key, e.g., machine001#1435310751234.
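A sketch of such a row key in Python; the reversed-timestamp variant shown here is a common HBase pattern for newest-first scans (rows only sort ascending), and is an addition beyond the example above:

```python
# Build a machine-log row key like machine001#1435310751234.
# With (MAX_LONG - timestamp) instead, the newest entry sorts first.

MAX_LONG = 2**63 - 1

def log_row_key(machine_id: str, ts_millis: int, newest_first: bool = False) -> str:
    if newest_first:
        ts_millis = MAX_LONG - ts_millis  # reverse so later times sort earlier
    return f"{machine_id}#{ts_millis}"

print(log_row_key("machine001", 1435310751234))

# With the reversed form, the later timestamp produces the smaller key:
keys = sorted(
    log_row_key("machine001", ts, newest_first=True)
    for ts in (1435310751234, 1435310799999)
)
print(keys[0])
```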
The ZooKeeper framework was originally built at Yahoo! to make it easier to access their
applications, but later ZooKeeper was used for organizing services used by distributed
frameworks like Hadoop and HBase, and Apache ZooKeeper became a standard. It was designed
to be a robust service that enables application developers to focus mainly on their
application logic rather than on coordination.
In a distributed environment, coordinating and managing a service has become a difficult
process. Apache ZooKeeper was used to solve this problem because of its simple
architecture, as well as API, that allows developers to implement common coordination
tasks like electing a master server, managing group membership, and managing metadata.
Apache ZooKeeper is used for maintaining centralized configuration information, naming,
providing distributed synchronization, and providing group services in a simple interface so
that we don’t have to write it from scratch. Apache Kafka also uses ZooKeeper to manage
configuration. ZooKeeper allows developers to focus on the core application logic, and it
implements various protocols on the cluster so that the applications need not implement
them on their own.
ZooKeeper Architecture
Apache ZooKeeper works on a client–server architecture, in which the clients are the
machine nodes that use the service and the servers are the ZooKeeper nodes.
The following figure shows the relationship between the servers and their clients. In it, we
can see that each client uses the client library and then communicates with any of
the ZooKeeper nodes.
The components of the ZooKeeper architecture are explained in the following table.
Part – Description
Client – A client node in our distributed application cluster accesses information from the
server. It sends a message to the server to let the server know that it is alive, and
if there is no response from the connected server, the client automatically resends the
message to another server.
Server – The server acknowledges the client to inform it that the server is alive, and
it provides all services to clients.
Leader – If any of the server nodes fails, this server node performs automatic recovery.
Follower – A server node that follows the instructions given by the leader.
Working of Apache ZooKeeper
As soon as the ensemble (a group of ZooKeeper servers) starts, it waits for
clients to connect to the servers.
After that, the clients in the ZooKeeper ensemble will connect to one of the nodes.
That node can be either a leader node or a follower node.
Once the client is connected to a particular node, the node assigns a session ID to
the client and sends an acknowledgement to that particular client.
If the client does not get any acknowledgement from the node, then it resends the
message to another node in the ZooKeeper ensemble and tries to connect with it.
On receiving the acknowledgement, the client makes sure the connection is not
lost by sending heartbeats to the node at regular intervals.
Finally, the client can perform functions like read, write, or store the data as per the
need.
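The connection steps above can be sketched as a toy simulation (plain Python, not the real ZooKeeper protocol; the node names and session counter are invented for illustration):

```python
import itertools

# Toy simulation of the connection flow: the client tries ensemble nodes
# until one acknowledges, then receives a session ID from that node.

class ToyNode:
    _session_counter = itertools.count(1)

    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive

    def connect(self):
        # A live node (leader or follower) acknowledges and assigns a session ID.
        if self.alive:
            return next(ToyNode._session_counter)
        return None  # no acknowledgement from a dead node

def connect_to_ensemble(nodes):
    # The client resends its connection request to the next node on failure.
    for node in nodes:
        session_id = node.connect()
        if session_id is not None:
            return node.name, session_id
    raise RuntimeError("no ensemble node acknowledged the connection")

ensemble = [ToyNode("zk1", alive=False), ToyNode("zk2"), ToyNode("zk3")]
result = connect_to_ensemble(ensemble)
print(result)  # the dead node zk1 is skipped; zk2 assigns the session
```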
Apache ZooKeeper is capable of updating every node, which allows it to store up-to-date
information about each node across the cluster.
Managing the Cluster: This technology can manage the cluster in such a way that the
status of each node is maintained in real time, leaving fewer chances for errors and
ambiguity.
Naming Service: ZooKeeper attaches a unique identification to every node, quite
similar to how DNS helps identify a host.
Automatic Failure Recovery: Apache ZooKeeper locks the data while it is being modified,
which helps the cluster recover automatically if a failure occurs.
One of the ways in which we can communicate with the ZooKeeper ensemble is the
ZooKeeper Command Line Interface (CLI). It gives us the ability to use various options, and
it is also relied on heavily for debugging.
Applications of Zookeeper
a. Apache Solr
For leader election and centralized configuration, Apache Solr uses Zookeeper.
b. Apache Mesos
Apache Mesos is a cluster manager that offers efficient resource isolation and sharing
across distributed applications. Mesos uses ZooKeeper for its fault-tolerant replicated
master.
c. Yahoo!
As we all know, ZooKeeper was originally built at Yahoo!, designed to meet requirements
like robustness, data transparency, centralized configuration, better performance, and
coordination.
d. Apache Hadoop
As we know, Apache Hadoop is the driving force behind the growth of the big data industry,
and Hadoop relies on ZooKeeper for configuration management and coordination.
Multiple ZooKeeper servers support large Hadoop clusters; each client machine communicates
with one of the ZooKeeper servers to retrieve and update its synchronization information.
An example is the Human Genome Project: as it involves terabytes of data, it uses the
Hadoop MapReduce framework to analyze the dataset and find interesting facts for human
development.
e. Apache HBase
HBase uses ZooKeeper for distributed configuration and coordination, and its use cases
span several industries.
i. Telecom industry
The telecom industry stores billions of mobile call records and needs to access them in
real time; it uses HBase to process all these records in real time, easily and efficiently.
ii. Social network
Social networks like Twitter, LinkedIn, and Facebook receive huge volumes of data on a
daily basis, so HBase is also used to find recent trends and other interesting facts.
f. Apache Accumulo
Moreover, on top of Apache ZooKeeper (and Apache Hadoop), another sorted distributed
key/value store “Apache Accumulo” is built.
g. Neo4j
For write master selection and read slave coordination, No4j is a distributed graph
database which uses ZooKeeper.
h. Cloudera
Cloudera's Hadoop distribution bundles Apache ZooKeeper for centralized configuration and
coordination of the services it ships.
GOVERNMENT
Big data analytics has proven to be very useful in the government sector. Big data analysis
played a large role in Barack Obama's successful 2012 re-election campaign, and most
recently it was a major factor in the victory of the BJP and its allies in the 2014 Indian
General Election. The Indian Government utilizes numerous techniques to ascertain how the
Indian electorate is responding to government action, as well as to gather ideas for policy
augmentation.
SOCIAL MEDIA
The advent of social media has led to an outburst of big data. Various solutions have been
built in order to analyze social media activity; for example, IBM's Cognos Consumer Insights,
a point solution running on IBM's BigInsights Big Data platform, can make sense of the chatter.
Social media can provide valuable real-time insights into how the market is responding to
products and campaigns. With the help of these insights, companies can adjust their
pricing, promotion, and campaign placements accordingly. Before utilizing big data, some
preprocessing must be done on it in order to derive intelligent and valuable results; it is
this application of intelligent decisions derived from big data that makes it possible to
understand the consumer mindset.
TECHNOLOGY
The technological applications of big data involve companies which deal
with huge amounts of data every day and put it to use for business decisions as well. For
example, eBay.com uses two data warehouses at 7.5 PB and 40 PB, as well as a 40 PB
Hadoop cluster, for search, consumer recommendations, and merchandising, roughly 90 PB
of data in all. Amazon.com handles millions of back-end operations every day, as
well as queries from more than half a million third-party sellers. The core technology that
keeps Amazon running is Linux-based, and as of 2005 they had the world's three largest
Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB. Facebook handles 50 billion
photos from its user base. Windermere Real Estate uses anonymous GPS signals from nearly
100 million drivers to help new home buyers determine their typical drive times to and from
work throughout various times of the day.
Now we turn to customer-facing Big Data application examples, of which call center
analytics are particularly powerful. What is going on in a company's call center is often a
great barometer and influencer of market sentiment, but without a Big Data solution, much
of the insight that a call center can provide will be overlooked or discovered too late. Big
Data solutions can help identify recurring problems or customer and staff behavior patterns
on the fly, not only by making sense of time/quality resolution metrics but also by capturing
and processing call content itself.
BANKING
The use of customer data invariably raises privacy issues. By uncovering hidden connections
between seemingly unrelated pieces of data, big data analytics could potentially reveal
sensitive personal information. Research indicates that 62% of bankers are cautious in their
use of big data due to privacy issues. Further, outsourcing of data analysis activities or
distribution of customer data across departments for the generation of richer insights also
amplifies security risks; there have been cases in which customer data such as earnings,
savings, mortgages, and insurance policies ended up in the wrong hands. Such incidents
reinforce concerns about data privacy and discourage customers from sharing personal
information in exchange for customized offers.
AGRICULTURE
A biotechnology firm uses sensor data to optimize crop efficiency. It plants test crops and
runs simulations to measure how plants react to various changes in condition. Its data
environment constantly adjusts to changes in the attributes of various data it collects,
including temperature, water levels, soil composition, growth, output, and gene sequencing
of each plant in the test bed. These simulations allow it to discover the optimal
environmental conditions for specific gene types.
MARKETING
Marketers have begun to use facial recognition software to learn how well their advertising
succeeds or fails at stimulating interest in their products. A recent study published in the
Harvard Business Review looked at what kinds of advertisements compelled viewers to
continue watching and what turned viewers off. Among their tools was “a system that
analyses facial expressions to reveal what viewers are feeling.” The research was designed
to discover what kinds of promotions induced watchers to share the ads with their social
network, helping marketers create ads most likely to “go viral” and improve sales.
SMART PHONES
Perhaps more impressive, people now carry facial recognition technology in their pockets.
Users of iPhone and Android smartphones have applications at their fingertips that use
facial recognition technology for various tasks. For example, Android users with the
Remember app can snap a photo of someone, then bring up stored information about that
person based on their image when their own memory lets them down, a potential boon for
salespeople.
TELECOM
Nowadays big data is used in many different fields, and in telecom it also plays an
important role. Operators face an uphill challenge when they need to deliver new,
compelling, revenue-generating services without overloading their networks, while keeping
their running costs under control. The market demands a new set of data management and
analysis capabilities that can help service providers make accurate decisions by taking into
account customer, network context, and other critical aspects of their businesses. Most of
these decisions must be made in real time, placing additional pressure on the operators.
Real-time predictive analytics can help leverage the data that resides in their multitude of
systems, make it immediately accessible, and help correlate that data to generate insight
that can help them drive their business forward.
HEALTHCARE
Traditionally, the healthcare industry has lagged behind other industries in the use of big
data. Part of the problem stems from resistance to change: providers are accustomed to
making treatment decisions independently, using their own clinical judgment, rather than
relying on protocols based on big data. Other obstacles are more structural in nature, which
makes this one of the best places to set an example for big data applications. Even within a
single hospital, payer, or pharmaceutical company, important information often remains
siloed within one group or department because organizations lack procedures for integrating
data and communicating findings.
Health care stakeholders now have access to promising new threads of knowledge. This
information is a form of "big data," so called not only for its sheer volume but for its
complexity, diversity, and timeliness. Pharmaceutical industry experts, payers, and providers
are now beginning to analyze big data to obtain insights. Recent technological advances in
the industry have improved their ability to work with such data, even though the files are
enormous and often have different database structures and technical characteristics.
➨Big data analysis drives innovative solutions. It helps in understanding and
targeting customers and in optimizing business processes.
➨It helps in improving science and research.
➨It improves healthcare and public health through the availability of patient records.
➨It helps in financial trading, sports, polling, security/law enforcement, etc.
➨Anyone can access vast information via surveys and deliver answers to any query.