Hadoop and HBase
Introduction
Hadoop and HBase
• Hadoop
– What is map-reduce?
– What is Hadoop?
– What is HDFS?
• HBase
– What are NoSQL databases?
– What is HBase?
Map Reduce
• Programming model developed at Google
• Sort/merge based distributed computing
• Initially intended for Google's internal search/indexing application, it is
now used extensively by many organizations (e.g., Yahoo, Amazon.com, IBM).
• It is a functional style of programming (as in LISP) that is naturally
parallelizable across a large cluster of workstations or PCs.
• The underlying system takes care of partitioning the input data,
scheduling the program’s execution across several machines, handling
machine failures, and managing the required inter-machine communication.
(This is the key to Hadoop’s success.)
Word Count Problem
Can we do word count in parallel?
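One way to see that the answer is yes is to write word count as a map step, a sort/merge (shuffle) step, and a reduce step. The following is a minimal single-process sketch, not Hadoop's actual API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`) and the sample input are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle/sort: group all values by key, in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(documents):
    # Each map_phase call touches only its own split, so a real cluster
    # could run these calls on different machines at the same time.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return dict(reduce_phase(key, values) for key, values in shuffle_phase(mapped))


# Hypothetical input splits, one string per split.
splits = ["the quick brown fox", "the lazy dog", "the fox"]
counts = word_count(splits)
print(counts)
```

Because the map calls share no state and each reduce call sees only one key, both phases can be spread over many machines; only the shuffle step needs to move data between them.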
• Voldemort
• Scalaris
• Redis
• Schema-less, which comes in multiple flavors: column-based, document-based, and graph-based
– Cassandra (column-based)
– CouchDB (document-based)
– MongoDB (document-based)
– Neo4j (graph-based)
– HBase (column-based)
HBase
HBase is an open-source, distributed, colu
HBase Benefits
• No real indexes
• Automatic partitioning
• Scale linearly and automatically with new nodes
• Commodity hardware
• Fault tolerance
• Batch processing
HBase Data Model
(Column Oriented)
• Tables are sorted by row key
• A table schema only defines its column families
• Each family consists of any number of columns
• Each column consists of any number of versions
• Columns only exist when inserted; NULLs are free
• Columns within a family are sorted and stored together
• Everything except table names is byte[]
• (Row, Family:Column, Timestamp) → Value
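The data model above can be pictured as a nested map from row to family to column to timestamped versions. The sketch below is a toy in-memory model under that assumption, not the HBase client API; the `Table` class, its methods, and the sample table are hypothetical.

```python
import time


class Table:
    """Toy model of (row, family:column, timestamp) -> value."""

    def __init__(self, name, families):
        self.name = name                # table names are plain strings
        self.families = set(families)   # the schema defines only the families
        self.rows = {}                  # row -> family -> column -> {timestamp: value}

    def put(self, row, family, column, value, timestamp=None):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family!r}")
        ts = timestamp if timestamp is not None else time.time_ns()
        # A column comes into existence only when something is inserted,
        # so an absent column (a NULL) takes up no space at all.
        versions = (self.rows.setdefault(row, {})
                             .setdefault(family, {})
                             .setdefault(column, {}))
        versions[ts] = value

    def get(self, row, family, column, timestamp=None):
        versions = self.rows.get(row, {}).get(family, {}).get(column, {})
        if not versions:
            return None
        if timestamp is None:
            return versions[max(versions)]  # newest version wins by default
        return versions.get(timestamp)

    def scan(self):
        # Rows come back in sorted row-key order.
        for row in sorted(self.rows):
            yield row, self.rows[row]


# Everything except the table name is bytes, mirroring byte[].
table = Table("webtable", [b"contents", b"anchor"])
table.put(b"com.example", b"contents", b"html", b"<v1>", timestamp=1)
table.put(b"com.example", b"contents", b"html", b"<v2>", timestamp=2)
table.put(b"com.apache", b"anchor", b"home", b"Apache", timestamp=1)

latest = table.get(b"com.example", b"contents", b"html")
older = table.get(b"com.example", b"contents", b"html", timestamp=1)
row_order = [row for row, _ in table.scan()]
print(latest, older, row_order)
```

Reading without a timestamp returns the newest version, while passing a timestamp selects one specific version of the same cell.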
Regions
• Regions are the basic element of availability and distribution for
HBase Members
• Master
– Responsible for monitoring region servers
– Load balancing for regions
– Redirects clients to the correct region servers
– The current SPOF (single point of failure)
• Region server slaves
– Serve client requests (write/read/scan)
– Send heartbeats to the Master
– Throughput and the number of regions scale with the number of region servers
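The two Master duties above, routing a client to the right region server and monitoring heartbeats, can be sketched in a few lines. This is an illustration only and does not reflect HBase's internal implementation; it assumes each region covers a contiguous range of sorted row keys, and the server names, split points, and timeout are made up.

```python
from bisect import bisect_right

# Hypothetical layout: each region covers [start_key, next start_key).
REGIONS = [
    (b"", "regionserver-1"),
    (b"g", "regionserver-2"),
    (b"p", "regionserver-3"),
]
START_KEYS = [start for start, _ in REGIONS]


def locate(row_key):
    """Return the server holding the region whose key range contains row_key."""
    index = bisect_right(START_KEYS, row_key) - 1
    return REGIONS[index][1]


def live_servers(last_heartbeat, now, timeout=3.0):
    """Return the servers whose last heartbeat is no older than the timeout."""
    return sorted(name for name, seen in last_heartbeat.items()
                  if now - seen <= timeout)


print(locate(b"apple"), locate(b"g"), locate(b"zebra"))

# Hypothetical heartbeat times, in seconds.
heartbeats = {"regionserver-1": 10.0, "regionserver-2": 4.0, "regionserver-3": 9.5}
print(live_servers(heartbeats, now=11.0))
```

Because rows are sorted, a binary search over region start keys is enough to find the owner of any row key.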
HBase Architecture
ZooKeeper
A highly available, scalable, distributed,
ZooKeeper
• Znode
– In-memory data node in the ZooKeeper data
– Has a hierarchical namespace
– UNIX-like notation for paths
• Types of Znode
– Regular
– Ephemeral
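The difference between regular and ephemeral znodes shows up when a client session ends: ephemeral nodes created by that session disappear, while regular nodes stay. The sketch below is a toy model of that behavior, not the ZooKeeper client API; the `ZNodeTree` class, the paths, and the session names are hypothetical.

```python
class ZNodeTree:
    """Toy hierarchical namespace with regular and ephemeral znodes."""

    def __init__(self):
        # path -> (data, owning session, or None for a regular znode)
        self.nodes = {"/": (b"", None)}

    def create(self, path, data=b"", ephemeral_owner=None):
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError(f"parent {parent} does not exist")
        if path in self.nodes:
            raise KeyError(f"{path} already exists")
        self.nodes[path] = (data, ephemeral_owner)

    def children(self, path):
        prefix = path.rstrip("/") + "/"
        return sorted(p for p in self.nodes
                      if p != path
                      and p.startswith(prefix)
                      and "/" not in p[len(prefix):])

    def close_session(self, session):
        # Ephemeral znodes vanish together with the session that created them.
        for path in [p for p, (_, owner) in self.nodes.items() if owner == session]:
            del self.nodes[path]


tree = ZNodeTree()
tree.create("/hbase")                      # regular znode
tree.create("/hbase/rs")                   # regular znode
tree.create("/hbase/rs/server-1", b"host1", ephemeral_owner="session-1")
tree.create("/hbase/rs/server-2", b"host2", ephemeral_owner="session-2")

before = tree.children("/hbase/rs")
tree.close_session("session-1")            # e.g. server-1 crashes
after = tree.children("/hbase/rs")
print(before, after)
```

This is why ephemeral znodes are a natural fit for membership tracking: a server that dies stops appearing in the listing without anyone having to delete its entry.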
ZooKeeper
[Figure: ZooKeeper Service, showing a group of servers (one of them the leader) with clients connected to the servers]
• All servers store a copy of the data (in memory)
• A leader is elected at startup
• Followers service clients; all updates go through the leader
• Update responses are sent when a majority of servers have persisted the change
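The majority rule for updates can be illustrated with a small simulation. This is a simplified sketch and not ZooKeeper's actual protocol: leader election is replaced by picking the first running server, there is no rollback of partial writes, and the `Server` and `Ensemble` classes are hypothetical.

```python
class Server:
    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.data = {}  # every server keeps its own full copy of the data

    def persist(self, key, value):
        """Store the change and acknowledge it; a server that is down cannot."""
        if not self.up:
            return False
        self.data[key] = value
        return True


class Ensemble:
    def __init__(self, servers):
        self.servers = servers
        # Stand-in for leader election: take the first server that is up.
        self.leader = next(server for server in servers if server.up)

    def update(self, key, value):
        """Send the update to every server; succeed only on a majority of acks."""
        acks = sum(server.persist(key, value) for server in self.servers)
        return acks > len(self.servers) // 2


# Five servers with two down: three acks is a majority, so the update succeeds.
healthy = Ensemble([Server(f"s{i}", up=(i < 3)) for i in range(5)])
ok = healthy.update("/config", "v1")

# Five servers with three down: two acks is not a majority, so it fails.
degraded = Ensemble([Server(f"s{i}", up=(i < 2)) for i in range(5)])
failed = degraded.update("/config", "v1")
print(ok, failed)
```

With five servers the service keeps accepting updates while any three are running, which is why ensembles are usually sized with an odd number of servers.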
Hadoop/HBase Deployment
[Figure: deployment diagram showing the master node and the slave nodes]
Questions