
HBase

Understanding MapReduce
Unit-2
P-2
HBase Data Model
• The HBase data model consists of tables, rows, column families,
columns, cells, and versions. Each table contains rows and column
families, and every row is identified by a row key that serves as its
primary key. A column represents an attribute of the object stored in the row.
• The HBase data model consists of the following elements:
1. A set of tables.
2. Each table has column families and rows.
3. Each table must have an element defined as the primary key.
4. The row key acts as the primary key in HBase.
5. Any access to an HBase table uses this primary key.
6. Each column in HBase denotes an attribute of the corresponding object.
Storage Mechanism in HBase
• HBase is a column-oriented database, and data is
stored in tables.
• The rows of a table are sorted by row key.
• The table schema defines only column families, whose
contents are stored as key-value pairs.
• Each column family can hold any number of
columns.
• The column values are stored on disk.
• Each cell of the table has its own metadata, such as
a timestamp.
• The following key terms describe a table schema
in HBase:
• Table: A collection of rows.
• Row: A collection of column families.
• Column Family: A collection of columns.
• Column: A collection of key-value pairs.
• Namespace: A logical grouping of tables.
• Cell: A {row, column, version} tuple that exactly
specifies a cell in HBase.
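The terms above can be illustrated with a small in-memory model. This is a minimal sketch in plain Python, not the HBase client API: the class `ToyTable` and its methods are hypothetical names, and the nested dictionaries only mimic how a {row, column, version} tuple addresses one cell and how rows come back sorted by row key.

```python
from collections import defaultdict


class ToyTable:
    """Toy in-memory model of an HBase-style table (hypothetical, not the HBase API)."""

    def __init__(self, column_families):
        # Column families are fixed up front, as part of the schema.
        self.column_families = set(column_families)
        # row key -> "family:qualifier" -> {timestamp: value}
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, qualifier, value, timestamp):
        """Store one version of a cell; columns are created on the fly."""
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][f"{family}:{qualifier}"][timestamp] = value

    def get(self, row_key, family, qualifier, timestamp=None):
        """Return the requested version, or the newest one if no timestamp is given."""
        versions = self.rows.get(row_key, {}).get(f"{family}:{qualifier}", {})
        if not versions:
            return None
        if timestamp is None:
            return versions[max(versions)]
        return versions.get(timestamp)

    def row_keys(self):
        """Row keys in sorted (lexicographic) order."""
        return sorted(self.rows)


table = ToyTable(["info"])
table.put("row2", "info", "name", "alice", timestamp=1)
table.put("row10", "info", "name", "bob", timestamp=1)
table.put("row2", "info", "name", "alicia", timestamp=2)
print(table.get("row2", "info", "name"))     # newest version of the cell
print(table.get("row2", "info", "name", 1))  # an older version
print(table.row_keys())                      # "row10" sorts before "row2"
```

Note that the ordering is lexicographic, so "row10" comes before "row2"; this is why row keys are often zero-padded when numeric order matters.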
HBase Read and Write Data
• Step 1) To write data, the client first contacts the Region
server that hosts the target region, and then the region itself.
• Step 2) The region passes the data to the MemStore associated
with the column family.
• Step 3) The data is first stored in the MemStore, where it is
sorted, and is later flushed to an HFile. The main reason for
using the MemStore is so that data reaches the distributed file
system already sorted by row key. The MemStore resides in the
Region server's main memory, while HFiles are written to HDFS.
• Step 4) To read data, the client sends the request to the region.
• Step 5) The region first looks for the data in the MemStore,
which holds the most recent writes.
• Step 6) If the data is not in the MemStore, it is read from the
HFiles and returned to the client.
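The write and read steps above can be sketched with a toy region in plain Python. This is an illustration under simplifying assumptions, not HBase code: `ToyRegion` and `flush_threshold` are hypothetical names, the "HFiles" are just sorted in-memory lists, and other details of a real cluster (such as logging and caching) are left out.

```python
class ToyRegion:
    """Toy region: writes go to a MemStore, which is flushed to sorted 'HFiles'."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # hypothetical flush trigger
        self.memstore = {}   # in-memory buffer: row key -> value
        self.hfiles = []     # flushed files, oldest first; each is a sorted list

    def put(self, row_key, value):
        """Write path: buffer in memory, flush once the buffer is full."""
        self.memstore[row_key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the MemStore contents out, sorted by row key, and empty it."""
        if self.memstore:
            self.hfiles.append(sorted(self.memstore.items()))
            self.memstore = {}

    def get(self, row_key):
        """Read path: check the MemStore first, then HFiles from newest to oldest."""
        if row_key in self.memstore:
            return self.memstore[row_key]
        for hfile in reversed(self.hfiles):
            for key, value in hfile:
                if key == row_key:
                    return value
        return None


region = ToyRegion(flush_threshold=2)
region.put("b", 1)
region.put("a", 2)        # second write fills the MemStore and triggers a flush
region.put("a", 3)        # newer value stays in the MemStore for now
print(region.hfiles)      # one flushed file, sorted by row key
print(region.get("a"))    # served from the MemStore
print(region.get("b"))    # served from the flushed file
```

Checking the MemStore before the files is what lets a read see the most recent write even though older values for the same row key may still sit in earlier files.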
REST and Thrift
• In Apache HBase, both REST and Thrift are interfaces that
allow external applications to interact with the HBase cluster
programmatically.
• REST (Representational State Transfer):
– RESTful APIs in HBase provide a simple HTTP-based interface for
accessing and manipulating HBase data.
– The REST API allows users to perform CRUD (Create, Read,
Update, Delete) operations on HBase tables, rows, and cells using
HTTP methods like GET, PUT, POST, and DELETE.
– RESTful endpoints in HBase follow a URI pattern that
represents HBase resources such as tables, rows, and cells. For
example, /{table_name}/{row_key} can be used to
access a specific row in a table.
– The REST interface is suitable for web applications, mobile
applications, and other clients that can communicate over HTTP.
• Thrift:
– Thrift is a framework and set of tools for building
cross-language RPC (Remote Procedure Call) services.
– Thrift APIs in HBase offer a more efficient and
lightweight alternative to the RESTful interface,
especially for high-throughput and low-latency use
cases.
– Thrift APIs define a set of service methods for
performing operations such as scanning, getting,
putting, deleting, and batch processing on HBase
tables.
– Thrift enables developers to use HBase from languages
such as Java, Python, Ruby, and C++.
Features of HBase
• HBase is linearly scalable.
• It has automatic failover support.
• It provides consistent reads and writes.
• It integrates with Hadoop, both as a source
and a destination.
• It has an easy-to-use Java API for clients.
• It provides data replication across clusters.
Other tools
• Apache Hive:
– Apache Hive is a data warehouse infrastructure built on top
of Hadoop. It provides a SQL-like interface (HiveQL) for
querying and managing large datasets stored in Hadoop's
HDFS. Hive translates HiveQL queries into MapReduce or
Tez jobs for execution.
• Apache Pig:
– Apache Pig is a high-level platform for creating
MapReduce programs with a simple scripting language
called Pig Latin. It abstracts the complexities of
MapReduce programming and allows for data
manipulation and analysis tasks to be expressed concisely.
• Apache Flume:
– Apache Flume is a distributed, reliable, and available system for
efficiently collecting, aggregating, and moving large
amounts of log data from various sources to centralized
data storage like Hadoop's HDFS. It is part of the Apache
Software Foundation's ecosystem and is often used for
ingesting streaming data into Hadoop for further
processing and analysis.
• Apache ZooKeeper:
– Apache ZooKeeper is a centralized service for
maintaining configuration information, naming, and
providing distributed synchronization and group services.
It is used by Hadoop and other distributed systems for
coordination and consensus.
• Apache Sqoop:
– Apache Sqoop is a tool designed for efficiently
transferring bulk data between Hadoop and
structured data stores such as relational
databases. It supports parallel data transfer and
integrates with Hadoop ecosystem components.
• Apache Oozie:
– Apache Oozie is a workflow scheduler system for
managing Hadoop jobs. It allows users to define,
schedule, and execute workflows composed of
Hadoop jobs, Pig scripts, Hive queries, and other
actions.
MapReduce
• MapReduce is the heart of Apache Hadoop.
• The term "MapReduce" refers to two separate and
distinct tasks that Hadoop programs perform.
• The first is the map job, which takes a set of data
and converts it into another set of data, where
individual elements are broken down into tuples
(key/value pairs).
• The reduce job takes the output from a map as
input and combines those data tuples into a smaller
set of tuples.
Working of MapReduce approach
Logical flow of data in MapReduce
Map function
Reducer function
Data analysis in MapReduce model
Word count example
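A word count can be sketched in a single Python process to show the map, shuffle, and reduce phases. This is only an illustration of the model, not Hadoop code: the function names are hypothetical, and a real job would run the map and reduce functions in parallel across a cluster.

```python
from collections import defaultdict


def map_words(line):
    """Map: turn one line of text into (word, 1) key/value pairs."""
    return [(word, 1) for word in line.lower().split()]


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce: combine the values for one key into a single (word, total) pair."""
    return (word, sum(counts))


def word_count(lines):
    """Run map over every line, group by word, then reduce each group."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(word, counts) for word, counts in shuffle(pairs).items())


lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]
print(word_count(lines))
```

The map step emits many small tuples, and the reduce step collapses them into one tuple per word, which is the "smaller set of tuples" that the reduce job produces.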
HBase installation
