Zookeeper HBase SPARK
Distributed Application
Benefits of ZooKeeper
Part Description
Hierarchical Namespace
Sessions
Watches
--------------------------------------------------------------------------------------
HBase
What is HBase?
One can store data in HDFS either directly or through HBase. A data
consumer reads and accesses the data in HDFS randomly using HBase.
HBase sits on top of the Hadoop File System and provides random read
and write access.
HDFS: a distributed file system suitable for storing large files.
HBase: a database built on top of HDFS.
Row key: Each table in HBase is indexed on the row key. Data is
sorted lexicographically by this row key. There are no secondary
indices available on an HBase table.
Atomicity: Avoid designing a table that requires atomicity across
all rows. All operations on HBase rows are atomic at the row level.
Even distribution: Reads and writes should be uniformly distributed
across all nodes available in the cluster. Design the row key in such
a way that related entities are stored in adjacent rows to increase
read efficiency.
HBase Schema: Row Key, Column Family, Column Qualifier, and Row
Value Size Limit
When choosing a row key for HBase tables, design the table in such a
way that there is no hotspotting. To get the best performance out of
an HBase cluster, design a row key that allows the system to write
evenly across all the nodes.
A poorly designed row key can cause a full table scan when you
request some data out of the table.
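One common way to spread writes evenly is salting: prefix each key with a small, stable hash-derived bucket so that sequential raw keys land on different regions. The sketch below is illustrative only; the bucket count and the "#" separator are assumptions, not HBase requirements.

```python
import hashlib

NUM_BUCKETS = 8  # assumed: roughly one bucket per region server


def salted_row_key(raw_key):
    """Prefix the key with a stable hash bucket so that sequential
    raw keys spread across regions instead of hitting one node."""
    digest = hashlib.md5(raw_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % NUM_BUCKETS
    return "{:02d}#{}".format(bucket, raw_key)


print(salted_row_key("machine001"))
```

Because the salt is derived from the key itself, the same raw key always maps to the same bucket, so point lookups remain possible; range scans, however, must fan out over all buckets.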
If you are storing data that is represented by domain names, consider
using the reverse domain name as the row key for your HBase tables.
For example, com.company.name.
This technique works well when you have data spread across many
distinct domains. If you have very few distinct domains, you may end
up storing most data on a single node, causing hotspotting.
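Reversing a domain is a simple string operation. A minimal sketch (the function name is illustrative):

```python
def reverse_domain_key(domain):
    """Turn 'name.company.com' into 'com.company.name' so rows for
    the same organization sort together lexicographically."""
    return ".".join(reversed(domain.split(".")))


print(reverse_domain_key("name.company.com"))  # com.company.name
```

With this key shape, all rows under com.company.* become one contiguous lexicographic range, which makes prefix scans over an organization cheap.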
Hashing
When your data is identified by a string identifier, a hash of that
identifier is a good choice for the HBase table row key. Use the hash
of the string identifier as the row key instead of the raw string. For
example, if you are storing user data identified by user IDs, then a
hash of the user ID is a better choice for your row key.
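A hashed key can be produced with any stable hash function; the choice of SHA-256 below is an assumption for illustration, not a requirement:

```python
import hashlib


def hashed_user_key(user_id):
    """Hash the raw ID so keys distribute evenly even when the
    underlying IDs are sequential (user00001, user00002, ...)."""
    return hashlib.sha256(user_id.encode("utf-8")).hexdigest()


print(hashed_user_key("user12345"))
```

Note the trade-off: hashing destroys lexicographic ordering, so you gain even distribution but lose the ability to range-scan over consecutive IDs.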
Timestamps
When you retrieve data based on the time it was stored, it is best to
include a timestamp in your row key. For example, if you are storing
machine logs identified by machine number, then append the timestamp
to the machine number when designing the row
key: machine001#1435310751234.
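The machine001#1435310751234 pattern can be built as follows; the epoch-millisecond default and the "#" separator mirror the example above:

```python
import time


def machine_log_key(machine_id, ts_millis=None):
    """Append an epoch-millisecond timestamp to the machine number,
    so a prefix scan on the machine returns its logs in time order."""
    if ts_millis is None:
        ts_millis = int(time.time() * 1000)
    return "{}#{}".format(machine_id, ts_millis)


print(machine_log_key("machine001", 1435310751234))  # machine001#1435310751234
```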
Combined Row Key
You can combine multiple keys to design the row key for your HBase
table based on your requirements.
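Combining fields is just ordered concatenation with a separator; the most significant field goes first so that related rows sort adjacently. A small sketch (separator and field order are illustrative assumptions):

```python
def combined_row_key(*parts):
    """Join several identifying fields into one row key, most
    significant field first, so related rows sort adjacently."""
    return "#".join(str(p) for p in parts)


print(combined_row_key("com.company.name", "page", 1435310751234))
# com.company.name#page#1435310751234
```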
Column Families
Column Qualifiers
You can create as many column qualifiers as you need in each row.
Empty cells in a row do not consume any space. The names of your
column qualifiers should be short, since they are included in the
data that is transferred for each request.
Column-Oriented vs. Row-Oriented Databases
HBase RDBMS
Features of HBase
Applications of HBase
Year       Event
Nov 2006   Google released the paper on BigTable.
Jan 2008   HBase became a sub-project of Hadoop.
Oct 2008   HBase 0.18.1 was released.
Jan 2009   HBase 0.19.0 was released.
Sept 2009  HBase 0.20.0 was released.
HBase has three major components: HMaster, Region Server, and
Zookeeper.
HMaster –
The implementation of the Master Server in HBase is HMaster. It is
the process that assigns regions to region servers and handles DDL
(create, delete table) operations. It monitors all Region Server
instances present in the cluster. In a distributed environment, the
Master runs several background threads. HMaster has many
responsibilities, such as controlling load balancing and failover.
Region Server
Zookeeper –
It acts as a coordinator in HBase. It provides services such as
maintaining configuration information, naming, distributed
synchronization, and server-failure notification. Clients communicate
with region servers via Zookeeper.
--------------------------------------------------------------------------------------
SPARK
SPARK ARCHITECTURE
Spark Core
o The Spark Core is the heart of Spark and performs the core
functionality.
o It holds the components for task scheduling, fault recovery,
interacting with storage systems and memory management.
Spark SQL
Spark Streaming
MLlib
GraphX
What is RDD?
Parallelized Collections
External Datasets
o Transformation
o Action
Transformation
Transformation Description
Action
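The key distinction between the two is laziness: a transformation only describes a computation, while an action forces it to run. This can be illustrated in plain Python with a generator, no Spark cluster required; this is a conceptual sketch of the RDD idea, not real Spark API:

```python
data = [1, 2, 3, 4, 5]

# "Transformation": building a generator computes nothing yet,
# analogous to rdd.map(...) returning a new lazy RDD.
squared = (x * x for x in data)

# "Action": materializing the results forces evaluation,
# analogous to rdd.collect() triggering the job.
result = list(squared)
print(result)  # [1, 4, 9, 16, 25]
```

In real Spark, chained transformations build a lineage graph, and only an action such as collect(), count(), or saveAsTextFile() submits a job to the cluster.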
NoSQL