
Zookeeper

Apache ZooKeeper is an open-source distributed coordination service
that helps to manage a large set of hosts. Management and
coordination in a distributed environment is tricky. ZooKeeper
automates this process and allows developers to focus on building
software features rather than worrying about the application's
distributed nature.

Distributed Application

A distributed application can run on multiple systems in a network at
a given time (simultaneously) by coordinating among themselves to
complete a particular task in a fast and efficient manner. Complex and
time-consuming tasks that would take hours to complete in a
non-distributed application (running on a single system) can be done
in minutes by a distributed application, by using the computing
capabilities of all the systems involved.

The time to complete the task can be further reduced by configuring
the distributed application to run on more systems. A group of
systems in which a distributed application is running is called
a Cluster and each machine running in a cluster is called a Node.

A distributed application has two parts, the Server and the Client
application. Server applications are actually distributed and have a
common interface so that clients can connect to any server in the
cluster and get the same result. Client applications are the tools
used to interact with a distributed application.
Benefits of Distributed Applications

 Reliability − Failure of a single or a few systems does not make
the whole system fail.
 Scalability − Performance can be increased as and when needed
by adding more machines, with minor changes in the
configuration of the application and with no downtime.
 Transparency − Hides the complexity of the system and presents
itself as a single entity / application.

Challenges of Distributed Applications

 Race condition − Two or more machines trying to perform a
particular task, which actually needs to be done only by a single
machine at any given time. For example, shared resources
should only be modified by a single machine at any given time.
 Deadlock − Two or more operations waiting for each other to
complete indefinitely.
 Inconsistency − Partial failure of data.

What is Apache ZooKeeper Meant For?

Apache ZooKeeper is a service used by a cluster (group of nodes) to
coordinate between themselves and maintain shared data with robust
synchronization techniques. ZooKeeper is itself a distributed
application providing services for writing a distributed application.

The common services provided by ZooKeeper are as follows −

 Naming service − Identifying the nodes in a cluster by name. It
is similar to DNS, but for nodes.
 Configuration management − Latest and up-to-date
configuration information of the system for a joining node.
 Cluster management − Joining / leaving of a node in a cluster
and node status in real time.
 Leader election − Electing a node as leader for coordination
purposes.
 Locking and synchronization service − Locking the data while
modifying it. This mechanism helps in automatic failure
recovery while connecting with other distributed applications like
Apache HBase (see the sketch after this list).
 Highly reliable data registry − Availability of data even when
one or a few nodes are down.
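
As a minimal sketch of how the locking and leader-election services are
typically consumed, the snippet below uses the third-party Python client
kazoo; the client library, connection address, and znode paths are
illustrative assumptions, not something prescribed by ZooKeeper itself.

# A minimal sketch of distributed locking and leader election, assuming the
# third-party "kazoo" Python client and a ZooKeeper server on localhost:2181.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Locking and synchronization: only one client at a time runs the protected block.
lock = zk.Lock("/myapp/locks/shared-resource", identifier="worker-1")
with lock:                       # blocks until the lock is acquired
    print("holding the lock; modifying the shared resource")

# Leader election: the callback runs only in the client that wins the election.
def lead():
    print("this process is now the leader")

election = zk.Election("/myapp/election", identifier="worker-1")
election.run(lead)               # blocks; calls lead() if this client is elected

zk.stop()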

Distributed applications offer a lot of benefits, but they throw a few
complex and hard-to-crack challenges as well. The ZooKeeper framework
provides a complete mechanism to overcome all of these challenges. Race
conditions and deadlocks are handled using a fail-safe synchronization
approach. Another main drawback is inconsistency of data, which
ZooKeeper resolves with atomicity.

Benefits of ZooKeeper

Here are the benefits of using ZooKeeper −

 Simple distributed coordination process
 Synchronization − Mutual exclusion and co-operation between
server processes. This helps Apache HBase with configuration
management.
 Ordered Messages
 Serialization − Encode the data according to specific rules and
ensure your application runs consistently. This approach can be
used in MapReduce to coordinate queues and execute running
threads.
 Reliability
 Atomicity − Data transfer either succeeds or fails completely;
no transaction is partial.
Architecture of ZooKeeper

Take a look at the following diagram. It depicts the “Client-Server
Architecture” of ZooKeeper.

Each one of the components that is a part of the ZooKeeper
architecture is explained below.

Client − A client is one of the nodes in our distributed application
cluster and accesses information from the server. At a particular time
interval, every client sends a message to the server to let the server
know that the client is alive. Similarly, the server sends an
acknowledgement when a client connects. If there is no response from
the connected server, the client automatically redirects the message
to another server.

Server − A server is one of the nodes in our ZooKeeper ensemble and
provides all the services to clients. It gives an acknowledgement to
the client to inform it that the server is alive.

Ensemble − A group of ZooKeeper servers. The minimum number of nodes
required to form an ensemble is 3.

Leader − The server node which performs automatic recovery if any of
the connected nodes fails. Leaders are elected on service startup.

Follower − A server node which follows the leader's instructions.

Zookeeper Data Model

Hierarchical Namespace

The following diagram depicts the tree structure of the ZooKeeper file
system used for memory representation. A ZooKeeper node is referred to
as a znode. Every znode is identified by a name and separated by a
sequence of path separators (/).

 In the diagram, first you have a root znode separated by “/”.
Under root, you have two logical namespaces, config and workers.
 The config namespace is used for centralized configuration
management and the workers namespace is used for naming.
 Under the config namespace, each znode can store up to 1 MB of
data. This is similar to the UNIX file system except that the parent
znode can store data as well. The main purpose of this structure
is to store synchronized data and describe the metadata of the
znode. This structure is called the ZooKeeper Data Model.
Every znode in the ZooKeeper data model maintains a stat structure.
A stat simply provides the metadata of a znode. It consists of version
number, Access Control List (ACL), timestamp, and data length.

 Version number − Every znode has a version number, which
means every time the data associated with the znode changes,
its corresponding version number also increases. The version
number is important when multiple ZooKeeper clients are trying
to perform operations on the same znode (a short sketch follows
this list).
 Access Control List (ACL) − ACL is basically an authentication
mechanism for accessing the znode. It governs all the znode read
and write operations.
 Timestamp − Timestamp represents the time elapsed since znode
creation and modification. It is usually represented in
milliseconds. ZooKeeper identifies every change to a znode by its
“Transaction ID” (zxid). The zxid is unique and maintains time
for each transaction so that you can easily identify the time
elapsed from one request to another.
 Data length − The total amount of data stored in a znode is the
data length. You can store a maximum of 1 MB of data.
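
To make the role of the version number concrete, here is a small
sketch (again assuming the third-party kazoo client and a local
ensemble) that reads a znode together with its stat structure and then
performs a conditional write; the write is applied only if no other
client changed the znode in between.

# Optimistic concurrency with the znode version number (kazoo client assumed).
from kazoo.client import KazooClient
from kazoo.exceptions import BadVersionError

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()
zk.ensure_path("/myapp/config")

data, stat = zk.get("/myapp/config")          # stat holds version, zxid, timestamps
print("version:", stat.version, "data length:", len(data))

try:
    # The set() is applied only if the znode is still at stat.version.
    zk.set("/myapp/config", b"new-value", version=stat.version)
except BadVersionError:
    print("another client modified the znode first; re-read and retry")

zk.stop()
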
Types of Znodes

Znodes are categorized as persistent, sequential, and ephemeral. A
short example follows the list below.

 Persistent znode − A persistent znode is alive even after the
client which created that particular znode is disconnected. By
default, all znodes are persistent unless otherwise specified.
 Ephemeral znode − Ephemeral znodes are active only as long as
the client is alive. When a client gets disconnected from the
ZooKeeper ensemble, its ephemeral znodes get deleted
automatically. For this reason, ephemeral znodes are not allowed
to have children. If an ephemeral znode is deleted, then the next
suitable node will fill its position. Ephemeral znodes play an
important role in leader election.
 Sequential znode − Sequential znodes can be either persistent
or ephemeral. When a new znode is created as a sequential
znode, ZooKeeper sets the path of the znode by attaching a
10-digit sequence number to the original name. For example, if a
znode with path /myapp is created as a sequential znode,
ZooKeeper will change the path to /myapp0000000001 and set
the next sequence number as 0000000002. If two sequential
znodes are created concurrently, ZooKeeper never uses the same
number for both. Sequential znodes play an important role in
locking and synchronization.
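
The three znode types map directly onto the flags passed when a znode
is created. A short sketch, assuming the third-party kazoo Python
client and illustrative paths:

# Creating persistent, ephemeral, and sequential znodes (kazoo client assumed).
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Persistent znode (the default): survives after this client disconnects.
zk.create("/myapp/config-node", b"persistent data", makepath=True)

# Ephemeral znode: deleted automatically when this client's session ends.
zk.create("/myapp/workers/worker-a", b"alive", ephemeral=True, makepath=True)

# Sequential znode: ZooKeeper appends a 10-digit counter to the name,
# e.g. /myapp/queue/task-0000000001.
path = zk.create("/myapp/queue/task-", b"job", sequence=True, makepath=True)
print("created:", path)

zk.stop()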

Sessions

Sessions are very important for the operation of ZooKeeper. Requests
in a session are executed in FIFO order. Once a client connects to a
server, the session will be established and a session id is assigned to
the client.

The client sends heartbeats at a particular time interval to keep the
session valid. If the ZooKeeper ensemble does not receive heartbeats
from a client for more than the period (session timeout) specified at
the start of the service, it decides that the client has died.

Session timeouts are usually represented in milliseconds. When a
session ends for any reason, the ephemeral znodes created during that
session also get deleted.

Watches

Watches are a simple mechanism for the client to get notifications
about changes in the ZooKeeper ensemble. Clients can set watches
while reading a particular znode. Watches send a notification to the
registered client for any change to the znode on which the client
registered.

Znode changes are modifications of the data associated with the znode
or changes in the znode's children. Watches are triggered only once.
If a client wants a notification again, it must be re-registered
through another read operation. When a connection session expires, the
client is disconnected from the server and the associated watches are
also removed.
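
The sketch below (kazoo client assumed) registers a one-time watch
while reading a znode; the callback fires once on the next change and
has to be re-registered through another read if further notifications
are needed, exactly as described above.

# Setting a one-time watch on a znode (kazoo client assumed).
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()
zk.ensure_path("/myapp/config")

def on_change(event):
    # Called at most once; event.type reports the kind of change (e.g. CHANGED, DELETED).
    print("znode changed:", event.path, event.type)

data, stat = zk.get("/myapp/config", watch=on_change)   # read and register the watch
zk.set("/myapp/config", b"new-value")                   # triggers the watch once

zk.stop()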

--------------------------------------------------------------------------------------

HBase

What is HBase?

HBase is a distributed column-oriented database built on top of the
Hadoop file system. It is an open-source project and is horizontally
scalable.

HBase is a data model, similar to Google's Bigtable, designed to
provide quick random access to huge amounts of structured data. It
leverages the fault tolerance provided by the Hadoop Distributed File
System (HDFS).

It is a part of the Hadoop ecosystem that provides random real-time
read/write access to data in the Hadoop File System.

One can store data in HDFS either directly or through HBase. Data
consumers read/access the data in HDFS randomly using HBase. HBase
sits on top of the Hadoop File System and provides read and write
access.

HBase and HDFS

 HDFS is a distributed file system suitable for storing large files,
whereas HBase is a database built on top of HDFS.
 HDFS does not support fast individual record lookups, whereas
HBase provides fast lookups for larger tables.
 HDFS provides high-latency batch processing, whereas HBase
provides low-latency access to single rows from billions of records
(random access).
 HDFS provides only sequential access to data, whereas HBase
internally uses hash tables, provides random access, and stores its
data in indexed HDFS files for faster lookups.

Storage Mechanism in HBase/HBase schema Design

HBase is a column-oriented database and the tables in it are sorted
by row. The table schema defines only column families, which are the
key-value pairs. A table has multiple column families and each
column family can have any number of columns. Subsequent column
values are stored contiguously on the disk. Each cell value of the table
has a timestamp. In short, in HBase (a brief example follows this list):

 A table is a collection of rows.
 A row is a collection of column families.
 A column family is a collection of columns.
 A column is a collection of key-value pairs.
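
Here is that hierarchy sketched with the third-party Python client
happybase, which talks to HBase through its Thrift gateway; the client
choice, table name, and column names are illustrative assumptions, not
part of HBase itself.

# Table -> row -> column family -> column, sketched with the "happybase" client
# (assumes an HBase Thrift server running on localhost).
import happybase

connection = happybase.Connection("localhost")

# The schema defines only the column families; columns appear on write.
connection.create_table("users", {"personal": dict(), "contact": dict()})

table = connection.table("users")

# One row, two column families, three columns (each column is a key-value pair).
table.put(b"user-001", {
    b"personal:name": b"Alice",
    b"personal:city": b"Pune",
    b"contact:email": b"alice@example.com",
})

row = table.row(b"user-001")
print(row[b"personal:name"])    # b'Alice'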

HBase Table Schema Design General Concepts

HBase schema design is very different from relational database schema
design. Below are some general concepts that should be followed while
designing a schema in HBase:

 Row key: Each table in HBase is indexed on the row key. Data is
sorted lexicographically by this row key. There are no secondary
indices available on an HBase table.
 Atomicity: Avoid designing tables that require atomicity across
all rows. All operations on HBase rows are atomic at the row level.
 Even distribution: Reads and writes should be uniformly distributed
across all nodes available in the cluster. Design the row key in such
a way that related entities are stored in adjacent rows to increase
read efficiency.
HBase Schema Row Key, Column Family, Column Qualifier, Individual Value
and Row Size Limits

Below are the size limits to consider when designing a schema in HBase:

 Row keys: 4 KB per key
 Column families: not more than 10 column families per table
 Column qualifiers: 16 KB per qualifier
 Individual values: less than 10 MB per cell
 All values in a single row: max 10 MB

HBase Row Key Design

When choosing a row key for HBase tables, you should design the table
in such a way that there is no hotspotting. To get the best performance
out of an HBase cluster, you should design a row key that allows the
system to write evenly across all the nodes.

A poorly designed row key can cause a full table scan when you
request some data out of the table.

Type of HBase Row Keys

There are some commonly used HBase row keys:

Reverse Domain Names

If you are storing data that is represented by domain names, then
consider using a reverse domain name as the row key for your HBase
tables. For example, com.company.name.

This technique works perfectly fine when you have data spread across
multiple reverse domains. If you have very few reverse domains, then
you may end up storing data on a single node, causing hotspotting.
Hashing

When you have data that is represented by a string identifier, a hash
of that identifier is a good choice for your HBase table row key. Use
the hash of the string identifier as the row key instead of the raw
string. For example, if you are storing user data that is identified by
user IDs, then a hash of the user ID is a better choice for your row key.

Timestamps
When you retrieve data based on the time it was stored, it is best to
include a timestamp in your row key. For example, if you are storing
machine logs identified by machine number, append the timestamp to the
machine number when designing the row key,
e.g. machine001#1435310751234.
Combined Row Key

You can combine multiple keys to design the row key for your HBase
table based on your requirements.
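
The row-key strategies above are straightforward to express in plain
Python; the helper names and sample values below are purely
illustrative.

# Illustrative helpers for the row-key strategies described above.
import hashlib
import time

def reverse_domain_key(domain):
    # "www.name.company.com" -> "com.company.name.www"
    return ".".join(reversed(domain.split(".")))

def hashed_key(user_id):
    # Hash a string identifier so writes spread evenly across regions.
    return hashlib.md5(user_id.encode("utf-8")).hexdigest()

def machine_log_key(machine_id):
    # Combined key: machine number plus timestamp, e.g. machine001#1435310751234.
    return "%s#%d" % (machine_id, int(time.time() * 1000))

print(reverse_domain_key("www.name.company.com"))
print(hashed_key("user-42"))
print(machine_log_key("machine001"))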

HBase Column Families and Column Qualifiers

Below is some guidance on column families and column qualifiers:

Column Families

In HBase, you should use at most 10 column families to get the best
performance out of the HBase cluster. If your row contains multiple
values that are related to each other, you should place them in the
same column family. Also, the names of your column families should be
short, since they are included in the data that is transferred for each
request.

Column Qualifiers

You can create as many column qualifiers as you need in each row.
Empty cells in a row do not consume any space. The names of
your column qualifiers should be short, since they are included in the
data that is transferred for each request.

Column Oriented and Row Oriented

Column-oriented databases are those that store data tables as
sections of columns of data, rather than as rows of data. In short,
they have column families.

Row-Oriented Database − It is suitable for Online Transaction
Processing (OLTP). Such databases are designed for a small number of
rows and columns.

Column-Oriented Database − It is suitable for Online Analytical
Processing (OLAP). Such databases are designed for huge tables.

HBase and RDBMS

 HBase is schema-less; it doesn't have the concept of a fixed
column schema and defines only column families. An RDBMS is governed
by its schema, which describes the whole structure of its tables.
 HBase is built for wide tables and is horizontally scalable. An
RDBMS is thin and built for small tables, and is hard to scale.
 There are no transactions in HBase. An RDBMS is transactional.
 HBase has de-normalized data. An RDBMS has normalized data.
 HBase is good for semi-structured as well as structured data. An
RDBMS is good for structured data.

Features of HBase

 HBase is linearly scalable.
 It has automatic failure support.
 It provides consistent reads and writes.
 It integrates with Hadoop, both as a source and a destination.
 It has an easy Java API for clients.
 It provides data replication across clusters.

Applications of HBase

 It is used whenever there is a need for write-heavy applications.
 HBase is used whenever we need to provide fast random access
to available data.
 Companies such as Facebook, Twitter, Yahoo, and Adobe use
HBase internally.
HBase History

Nov 2006 − Google released the paper on BigTable.
Feb 2007 − The initial HBase prototype was created as a Hadoop
contribution.
Oct 2007 − The first usable HBase was released along with Hadoop 0.15.0.
Jan 2008 − HBase became a sub-project of Hadoop.
Oct 2008 − HBase 0.18.1 was released.
Jan 2009 − HBase 0.19.0 was released.
Sept 2009 − HBase 0.20.0 was released.
May 2010 − HBase became an Apache top-level project.

HBase architecture has 3 main components: HMaster, Region Server, and
Zookeeper.

HMaster

The implementation of the Master Server in HBase is HMaster. It is the
process that assigns regions to Region Servers and handles DDL
(create, delete table) operations. It monitors all Region Server
instances present in the cluster. In a distributed environment, the
Master runs several background threads. HMaster has many features like
controlling load balancing, failover, etc.

Region Server

HBase tables are divided horizontally by row key range into Regions.
Regions are the basic building elements of an HBase cluster; they hold
the distribution of tables and are comprised of column families. A
Region Server runs on an HDFS DataNode present in the Hadoop cluster.
The regions of a Region Server are responsible for several things, like
handling, managing, and executing read and write HBase operations on
that set of regions. The default size of a region is 256 MB.

Zookeeper


It is like a coordinator in HBase. It provides services like maintaining
configuration information, naming, providing distributed
synchronization, server failure notification, etc. Clients communicate
with region servers via Zookeeper.

--------------------------------------------------------------------------------------

SPARK

Apache Spark is an open-source unified analytics engine for large-scale
data processing. Spark provides an interface for programming
clusters with implicit data parallelism and fault tolerance.
Evolution of Apache Spark

Spark started as one of Hadoop's sub-projects, developed in 2009 in UC
Berkeley's AMPLab by Matei Zaharia. It was open-sourced in 2010
under a BSD license. It was donated to the Apache Software Foundation
in 2013, and Apache Spark became a top-level Apache project in
February 2014.

Features of Apache Spark

Apache Spark has the following features.

 Speed − Spark helps to run an application on a Hadoop cluster up
to 100 times faster in memory, and 10 times faster when
running on disk. This is possible by reducing the number of
read/write operations to disk; it stores the intermediate
processing data in memory.
 Supports multiple languages − Spark provides built-in APIs in
Java, Scala, and Python, so you can write applications in
different languages. Spark also provides 80 high-level operators
for interactive querying.
 Advanced analytics − Spark not only supports 'Map' and
'Reduce'. It also supports SQL queries, streaming data, machine
learning (ML), and graph algorithms.

SPARK ARCHITECTURE

Spark Core
o The Spark Core is the heart of Spark and performs the core
functionality.
o It holds the components for task scheduling, fault recovery,
interacting with storage systems and memory management.

Spark SQL

o Spark SQL is built on top of Spark Core. It provides
support for structured data.
o It allows querying the data via SQL (Structured Query Language)
as well as the Apache Hive variant of SQL called HQL (Hive
Query Language). A short sketch follows this list.
o It supports JDBC and ODBC connections that establish a
relation between Java objects and existing databases, data
warehouses and business intelligence tools.
o It also supports various sources of data like Hive tables, Parquet,
and JSON.
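
A short PySpark sketch of the points above; the file name, view name,
and column names are assumptions used only for illustration.

# Querying semi-structured JSON data with Spark SQL (PySpark).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

people = spark.read.json("people.json")          # assumed local JSON source

# Register a temporary view and query it with plain SQL.
people.createOrReplaceTempView("people")
adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()

spark.stop()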

Spark Streaming

o Spark Streaming is a Spark component that supports scalable
and fault-tolerant processing of streaming data.
o It uses Spark Core's fast scheduling capability to perform
streaming analytics.
o It accepts data in mini-batches and performs RDD
transformations on that data.
o Its design ensures that applications written for streaming
data can be reused to analyze batches of historical data with
little modification.
o The log files generated by web servers can be considered a
real-time example of a data stream. A minimal sketch follows this list.
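
A minimal sketch of the mini-batch model using the PySpark DStream
API; the socket source on localhost:9999 is an assumption used only to
have some input to read.

# Word count over 5-second mini-batches of streaming text (PySpark DStream API).
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-word-count")
ssc = StreamingContext(sc, batchDuration=5)        # each mini-batch covers 5 seconds

lines = ssc.socketTextStream("localhost", 9999)    # assumed text source
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                    # print each mini-batch result

ssc.start()
ssc.awaitTermination()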

MLlib

o MLlib is a machine learning library that contains various
machine learning algorithms.
o These include correlations and hypothesis testing, classification
and regression, clustering, and principal component analysis.
o It is nine times faster than the disk-based implementation used
by Apache Mahout.

GraphX

o GraphX is a library used to manipulate graphs and
perform graph-parallel computations.
o It facilitates creating a directed graph with arbitrary properties
attached to each vertex and edge.
o To manipulate graphs, it supports various fundamental operators
like subgraph, joinVertices, and aggregateMessages.

What is RDD?

The RDD (Resilient Distributed Dataset) is Spark's core
abstraction. It is a collection of elements, partitioned across the
nodes of the cluster, so that we can execute various parallel
operations on it.

There are two ways to create RDDs:

Parallelized Collections

o To create a parallelized collection, call SparkContext's parallelize
method on an existing collection in the driver program. Each
element of the collection is copied to form a distributed dataset that
can be operated on in parallel.

External Datasets

o In Spark, distributed datasets can be created from any type
of storage source supported by Hadoop, such as HDFS,
Cassandra, HBase and even our local file system. Spark provides
support for text files, SequenceFiles, and other types of
Hadoop InputFormat.
o SparkContext's textFile method can be used to create an RDD from a
text file. This method takes a URI for the file (either a local path
on the machine or an hdfs:// URI) and reads the data of the file. A
short sketch of both creation paths follows.
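
Both creation paths in a short PySpark sketch; the file name data.txt
is an assumption.

# Creating RDDs from a parallelized collection and from an external text file.
from pyspark import SparkContext

sc = SparkContext(appName="rdd-creation-demo")

# 1. Parallelized collection: distribute an existing driver-side collection.
numbers = sc.parallelize([1, 2, 3, 4, 5])
print(numbers.sum())                 # 15

# 2. External dataset: read a text file from the local filesystem or HDFS.
lines = sc.textFile("data.txt")      # an hdfs:// URI works the same way
print(lines.count())                 # number of lines in the file

sc.stop()
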
RDD Operations

RDDs support two types of operations:

o Transformation
o Action

Transformation

o In Spark, the role of a transformation is to create a new dataset
from an existing one. Transformations are considered lazy, as
they are only computed when an action requires a result to be
returned to the driver program.

Some of the frequently used RDD transformations are listed below; a
short example follows the table.

map(func) − Returns a new distributed dataset formed by passing each
element of the source through a function func.

filter(func) − Returns a new dataset formed by selecting those elements
of the source on which func returns true.

flatMap(func) − Similar to map, but each input item can be mapped to
zero or more output items, so func should return a sequence rather than
a single item.

mapPartitions(func) − Similar to map, but runs separately on each
partition (block) of the RDD, so func must be of type
Iterator<T> => Iterator<U> when running on an RDD of type T.

mapPartitionsWithIndex(func) − Similar to mapPartitions, but also
provides func with an integer value representing the index of the
partition, so func must be of type (Int, Iterator<T>) => Iterator<U>
when running on an RDD of type T.

sample(withReplacement, fraction, seed) − Samples a fraction of the
data, with or without replacement, using a given random number
generator seed.

union(otherDataset) − Returns a new dataset that contains the union of
the elements in the source dataset and the argument.

intersection(otherDataset) − Returns a new RDD that contains the
intersection of elements in the source dataset and the argument.

distinct([numPartitions]) − Returns a new dataset that contains the
distinct elements of the source dataset.

groupByKey([numPartitions]) − When called on a dataset of (K, V) pairs,
returns a dataset of (K, Iterable<V>) pairs.
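
A short PySpark example chaining a few of the transformations listed
above; note that nothing runs until an action such as collect() is
called.

# Lazy transformations: flatMap, filter, map, groupByKey.
from pyspark import SparkContext

sc = SparkContext(appName="transformation-demo")

lines = sc.parallelize(["spark is fast", "spark is general purpose"])

words = lines.flatMap(lambda line: line.split(" "))   # split lines into words
long_words = words.filter(lambda w: len(w) > 4)       # keep words longer than 4 chars
pairs = long_words.map(lambda w: (w, 1))              # (word, 1) pairs
grouped = pairs.groupByKey()                          # (word, iterable of counts)

# The collect() action forces evaluation of the whole chain.
print(sorted((w, sum(ones)) for w, ones in grouped.collect()))
# [('general', 1), ('purpose', 1), ('spark', 2)]

sc.stop()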

Action

An Apache Spark RDD (Resilient Distributed Dataset) action is a Spark
operation that triggers computation and returns a value to the driver
program or writes data to external storage, rather than producing a
new RDD.

Some of the frequently used RDD actions are listed below; a short
example follows the table.


reduce(func) − Aggregates the elements of the dataset using a function
func (which takes two arguments and returns one). The function should
be commutative and associative so that it can be computed correctly in
parallel.

collect() − Returns all the elements of the dataset as an array at the
driver program. This is usually useful after a filter or other
operation that returns a sufficiently small subset of the data.

count() − Returns the number of elements in the dataset.

first() − Returns the first element of the dataset (similar to take(1)).

take(n) − Returns an array with the first n elements of the dataset.

takeSample(withReplacement, num, [seed]) − Returns an array with a
random sample of num elements of the dataset, with or without
replacement, optionally pre-specifying a random number generator seed.

takeOrdered(n, [ordering]) − Returns the first n elements of the RDD
using either their natural order or a custom comparator.

saveAsTextFile(path) − Writes the elements of the dataset as a text
file (or set of text files) in a given directory in the local
filesystem, HDFS or any other Hadoop-supported file system. Spark calls
toString on each element to convert it to a line of text in the file.

saveAsSequenceFile(path) (Java and Scala) − Writes the elements of the
dataset as a Hadoop SequenceFile in a given path in the local
filesystem, HDFS or any other Hadoop-supported file system.

saveAsObjectFile(path) (Java and Scala) − Writes the elements of the
dataset in a simple format using Java serialization, which can then be
loaded using SparkContext.objectFile().

countByKey() − Only available on RDDs of type (K, V). Returns a hashmap
of (K, Int) pairs with the count of each key.
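
A few of the actions above in a short PySpark example:

# Common RDD actions: count, first, take, takeOrdered, reduce, countByKey.
from pyspark import SparkContext

sc = SparkContext(appName="action-demo")

rdd = sc.parallelize([3, 1, 4, 1, 5, 9, 2, 6])

print(rdd.count())                        # 8
print(rdd.first())                        # 3
print(rdd.take(3))                        # [3, 1, 4]
print(rdd.takeOrdered(3))                 # [1, 1, 2]
print(rdd.reduce(lambda a, b: a + b))     # 31

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(dict(pairs.countByKey()))           # {'a': 2, 'b': 1}

sc.stop()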

NoSQL

An architecture pattern is a logical way of categorizing data that will
be stored in a database. NoSQL is a type of database that helps to
perform operations on big data and store it in a valid format. It is
widely used because of its flexibility and the wide variety of services
it offers.

Architecture Patterns of NoSQL:

The data is stored in NoSQL in any of the following four data
architecture patterns.
1. Key-Value Store Database
2. Column Store Database
3. Document Database
4. Graph Database
These are explained as following below.
1. Key-Value Store Database:

This model is one of the most basic models of NoSQL databases. As
the name suggests, the data is stored in the form of key-value pairs.
The key is usually a sequence of strings, integers or characters but
can also be a more advanced data type. The value is typically linked or
co-related to the key. Key-value pair storage databases generally
store data as a hash table where each key is unique. The value can be
of any type (JSON, BLOB (Binary Large Object), strings, etc.). This
type of pattern is usually used in shopping websites or e-commerce
applications.
Advantages:
 Can handle large amounts of data and heavy load.
 Easy retrieval of data by keys.
Limitations:
 Complex queries may involve multiple key-value pairs, which can
degrade performance.
 Data can involve many-to-many relationships, which may collide.
Examples:
 DynamoDB
 Berkeley DB

2. Column Store Database:

Rather than storing data in relational tuples, the data is stored in
individual cells which are further grouped into columns. Column-
oriented databases work only on columns. They store large amounts of
data in columns together. The format and titles of the columns can
diverge from one row to another. Every column is treated separately.
But still, each individual column may contain multiple other columns
like traditional databases.
Basically, columns are the mode of storage in this type.
Advantages:
 Data is readily available
 Queries like SUM, AVERAGE, COUNT can be easily performed on
columns.
Examples:
 HBase
 Bigtable by Google
 Cassandra
3. Document Database:
The document database fetches and accumulates data in the form of key-
value pairs, but here the values are called Documents. A document can
be described as a complex data structure. A document here can be in the
form of text, arrays, strings, JSON, XML or any such format. The use
of nested documents is also very common. It is very effective, as most
of the data created is usually in the form of JSON and is unstructured.
A short sketch using one of the example databases follows the figure
below.
Advantages:
 This type of format is very useful and apt for semi-structured data.
 Storage, retrieval and management of documents is easy.
Limitations:
 Handling multiple documents is challenging
 Aggregation operations may not work accurately.
Examples:
 MongoDB
 CouchDB
Figure – Document Store Model in form of JSON documents
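
As a small sketch of the document model, the snippet below stores and
queries one nested document using MongoDB (one of the examples listed
above) through the pymongo driver; the database, collection, and field
names are assumptions.

# Storing and querying a nested JSON-like document (pymongo driver assumed).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed local server
db = client["shop"]

# A document is a nested structure; no fixed schema is required.
db.orders.insert_one({
    "order_id": 1001,
    "customer": {"name": "Alice", "city": "Pune"},
    "items": [{"sku": "A-1", "qty": 2}, {"sku": "B-7", "qty": 1}],
})

order = db.orders.find_one({"customer.name": "Alice"})
print(order["items"])
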
4. Graph Databases:
Clearly, this architecture pattern deals with the storage and
management of data in graphs. Graphs are basically structures that
depict connections between two or more objects in some data. The
objects or entities are called nodes and are joined together by
relationships called edges. Each edge has a unique identifier. Each
node serves as a point of contact for the graph. This pattern is very
commonly used in social networks, where there are a large number of
entities and each entity has one or many characteristics which are
connected by edges. The relational database pattern has tables that
are loosely connected, whereas graphs are often very strong and rigid
in nature.
Advantages:
 Fastest traversal because of connections.
 Spatial data can be easily handled.
Limitations:
 Wrong connections may lead to infinite loops.
Examples:
 Neo4J
 FlockDB( Used by Twitter)

Figure – Graph model format of NoSQL Databases
