
Big Data

Agenda
• Hadoop Ecosystem
• Hadoop Distributed File System
• Concepts of Blocks in HDFS Architecture
• Role of Namenodes and Datanodes

Presented by : A.H.Shanthakumara

Hadoop Ecosystem:
➢ Hadoop plays an integral part in almost all big data processes.
➢ It is almost impossible to work with big data without the tools and
techniques of Hadoop.
➢ The Hadoop ecosystem is a framework of various complex and evolving
tools and components, such as HDFS and its architecture, MapReduce,
YARN, HBase and Hive.
➢ These components may differ greatly from one another in architecture,
but they all derive their functionality from the scalability and
power of Hadoop.
➢ They enable users to process large data sets in real time and provide
tools to support various types of Hadoop projects.


Various elements of Hadoop are involved at various stages of processing data: (diagram)

Hadoop Distributed File System:


Some terms/concepts related to HDFS:
➢ Huge files: 'huge' means files that are gigabytes, terabytes or even
petabytes in size.
➢ Streaming data access: HDFS favours high throughput of data access
over low latency of data access.
➢ Appliance (commodity) hardware: Hadoop does not require large,
exceptionally dependable hardware to run.
➢ Lots of small files: the number of files a file system can hold is
governed by the amount of memory on the (Namenode) server.


Hadoop Architecture:
➢ HDFS follows a master-slave architecture: the Namenode is the master
that manages the various Datanodes.
➢ The Namenode manages the HDFS cluster metadata, while the Datanodes
store the data.
➢ Files and directories are presented by clients to the Namenode, and
operations on files and directories are performed by the Namenode.
➢ A file is divided into one or more blocks, which are stored in a
group of Datanodes; the Datanodes serve read and write requests from
the client.
➢ Datanodes can execute operations like the creation, deletion and
replication of blocks, depending on the instructions from the
Namenode.

Concepts of Blocks in HDFS Architecture:


➢ HDFS blocks are large (megabytes) in contrast to disk blocks, to
minimize the cost of seek operations.
➢ The default size is 64 MB, although numerous HDFS installations use
128 MB blocks (the sketch below shows how to check a file's block size).
➢ Tasks typically work on one block at a time.
➢ Performance is improved through the distribution of data and through
fault tolerance.
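As a hedged illustration only, here is a minimal Java sketch (the HDFS path /data/sample.txt and a cluster configuration on the classpath are assumptions) that reads the block size recorded for a file and derives how many blocks it occupies:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();     // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/sample.txt");      // hypothetical HDFS file

        FileStatus st = fs.getFileStatus(file);        // metadata comes from the Namenode
        long blockSize = st.getBlockSize();            // block size chosen when the file was written
        long length    = st.getLen();
        long blocks    = (length + blockSize - 1) / blockSize;   // number of blocks the file occupies

        System.out.println("Block size: " + blockSize + " bytes, blocks: " + blocks);
    }
}

The block size is a per-file property fixed at write time, which is why the sketch reads it from the file's status rather than from a global constant.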


Concepts of Blocks in HDFS Architecture:


➢ Blocks facilitate a number of tasks for failure management:
Monitoring: the Datanodes and the Namenode communicate through
continuous heartbeat signals.
➢ If the signal is not heard by either of the two, the node is considered
to have failed.
➢ The failed node is replaced by a replica, and the replication scheme is
adjusted accordingly (a sketch of inspecting and changing a file's
replication factor follows below).
Rebalancing: blocks are shifted from one location to another wherever
free space is available.
Meta-data replication: a replica of the corresponding metadata file is
maintained on the same HDFS.
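To make the replication scheme concrete, here is a small hedged sketch (again assuming a reachable cluster and the hypothetical path /data/sample.txt) that reads a file's current replication factor and asks the Namenode to change it; the Namenode then schedules the additional copies in the background:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/sample.txt");            // hypothetical file

        short current = fs.getFileStatus(file).getReplication();
        System.out.println("Current replication factor: " + current);

        // Ask the Namenode to set the replication factor to 3; the Namenode
        // then instructs Datanodes to copy or delete replicas as needed.
        fs.setReplication(file, (short) 3);
    }
}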


➢ Illustration of the Hadoop heartbeat message: (diagram)

Concepts of Blocks in HDFS Architecture:


➢ Advantages of abstracting a block:
1. A file can be bigger than any single disk in the system; it is
possible to store it on an HDFS cluster with its blocks spread across
the disks of the entire cluster.
2. Making the unit of abstraction a block instead of a file simplifies
the storage subsystem.
• The storage subsystem manages blocks, simplifying storage
management and dispensing with metadata concerns (file metadata is
handled separately by the Namenode).


Role of Namenodes and Datanodes:


Namenodes:
➢ Manages the file system namespace.
➢ Stores the metadata for all the files and directories in the file system.
➢ This metadata is stored on the local disk as two files: the file system
image and the edit log.
➢ Is aware of the Datanodes on which all the blocks of a given file are
located.
➢ It does not store block locations persistently, but knows how to
reconstruct the files from the blocks reported by the Datanodes.
Datanodes:
➢ The work-horses of the file system.
➢ Store and retrieve blocks when they are asked to.
➢ Maintain connectivity with the Namenode by sending periodic heartbeat
messages (block locations for a file can be queried through the Java API,
as sketched below).
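The Namenode's knowledge of block placement can be queried directly. The following sketch is illustrative only (the path is hypothetical); it lists, for each block of a file, the Datanode hosts that hold a replica:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/sample.txt"));  // hypothetical file

        // The Namenode answers this query from its in-memory block map,
        // which is rebuilt from Datanode block reports at startup.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset " + b.getOffset() + " length " + b.getLength()
                    + " hosts " + String.join(",", b.getHosts()));
        }
    }
}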


Agenda
• Hadoop Ecosystem (continued)
– HDFS High availability
– Features of HDFS
– Other tools in Hadoop Ecosystem
• MapReduce
• YARN
• HIVE
• PIG


HDFS High availability:


➢ The Namenode was a single point of failure in earlier versions of
Hadoop.
➢ The availability of an HDFS cluster depends upon the availability of
the Namenode.
➢ This limited total availability in mainly two ways:
1. After an unplanned failure, the cluster would remain unavailable
until an operator restarted the Namenode.
2. Software or hardware upgrades of the Namenode machine would
result in cluster downtime.


HDFS High availability:


➢ These problems are addressed by the HDFS High Availability feature:
1. Running two redundant Namenodes in the same cluster in an
active/passive configuration.
2. Two separate machines are configured as Namenodes so that one is
active and the other is on standby (High Availability).
➢ We can deploy an HA cluster with:
➢ Namenode machines: the active and standby Namenodes should have
similar hardware configurations.
➢ Shared storage: both must have read/write access to a shared
directory (see the client configuration sketch below).
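As a sketch of how a client addresses such an active/standby pair, the snippet below sets the standard HA client properties programmatically; the nameservice name 'mycluster', the Namenode IDs nn1/nn2 and the hostnames are all illustrative assumptions, and in practice these values normally live in hdfs-site.xml rather than in code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HaClientConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Logical name for the pair of Namenodes (name, IDs and hosts are illustrative).
        conf.set("fs.defaultFS", "hdfs://mycluster");
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");
        // Proxy provider that lets the client fail over from the active to the standby Namenode.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        // The client addresses the logical nameservice, not a single Namenode host.
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Connected to: " + fs.getUri());
    }
}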

Features of HDFS:
➢ Three key features: data replication, data resilience and data
integrity.
➢ A file is divided into blocks, and replicated blocks are distributed
across the different Datanodes of a cluster.
➢ Replication automatically provides resilience to data in case of
unexpected loss or damage.
➢ HDFS ensures data integrity throughout the cluster with the help of:
1. Maintaining transaction logs: helps to monitor every
operation and carry out effective auditing and recovery.


Features of HDFS:
2. Validating checksums: an effective error-detection technique.
➢ The receiver of a message verifies its checksum to ensure that it is
the same as the checksum of the sent message.
➢ The checksum is hidden from the user to avoid tampering.
3. Creating data blocks: replicated copies of data blocks are maintained
to avoid corruption of a file due to the failure of a server.
➢ The Datanodes storing data blocks are sometimes called block servers
and perform the following functions:
1. Storage and retrieval of data on the local file system
2. Storage of the metadata of a block on the local file system
3. Periodic validation of file checksums

Features of HDFS:
4. Reporting to the Namenode on a regular basis about the availability of
blocks
5. On-demand supply of metadata and data
6. Movement of data to connected nodes on the basis of the pipelining
model
➢ A connection between multiple Datanodes that supports the movement of
data across servers is termed a pipeline.


MapReduce:
➢ MapReduce takes data as input, processes it, generates the output and
returns the required answers.
➢ It is based on a parallel programming framework that processes large
amounts of data dispersed across different systems.
➢ It facilitates the processing and analysis of both unstructured and
semi-structured data collected from different sources.
➢ It primarily supports two operations: map and reduce.
➢ It works on a master-worker approach, in which the master process
controls and directs the entire activity.
➢ The master collects, segregates and delegates the data among the
different workers.

MapReduce:
➢ The processing can be summed up in the following steps (a minimal
WordCount sketch follows the list):
1. A worker receives data from the master, processes it and sends the
generated result back to the master.
2. Workers run the same code on the data they receive; however, they
are not aware of other co-workers and do not communicate or
interact with them.
3. The master receives the result from each worker process, integrates
and processes them, and generates the final output.
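The classic illustration of this map/reduce split is word counting. The sketch below follows the standard Hadoop MapReduce Java API; the input and output paths are supplied on the command line and are assumptions of the example:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: each worker turns a line of text into (word, 1) pairs.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: the framework groups pairs by word; each reducer sums the counts.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // e.g. /input
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // e.g. /output (must not already exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Each map task runs independently on its block of input, exactly as in step 2 above, and the framework groups the (word, count) pairs before the reducers sum them.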



Other tools in Hadoop Ecosystem:


Hadoop YARN (Yet Another Resource Negotiator):
➢ Supports two major services:
➢ Global resource management (the ResourceManager) and per-application
management (the ApplicationMaster).
Hive:
➢ A data warehousing layer created with the core elements of Hadoop to
support batch-oriented processes.
➢ Supports both SQL-like access to structured data and sophisticated
analysis of big data using MapReduce.
➢ Used effectively for data mining and intensive analysis that does not
involve real-time processing (a small JDBC sketch follows below).
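As a small, hedged example of Hive's SQL-like access, the sketch below queries a hypothetical web_logs table through the HiveServer2 JDBC driver (the hostname, credentials, table and column names are all assumptions):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        // Older hive-jdbc versions may need the driver registered explicitly.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // HiveServer2 JDBC endpoint; host, port, user and table are illustrative.
        String url = "jdbc:hive2://hiveserver.example.com:10000/default";
        try (Connection con = DriverManager.getConnection(url, "hadoop", "");
             Statement stmt = con.createStatement()) {
            // Hive compiles this SQL-like query into batch jobs (e.g. MapReduce) behind the scenes.
            ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page");
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}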

Other tools in Hadoop Ecosystem:


Apache Pig:
➢ Apache Pig is an abstraction over MapReduce.
➢ It is a tool/platform used to analyze large sets of data by
representing them as data flows.
➢ Pig provides a high-level language known as Pig Latin in
which programmers can develop their own functions for
reading, writing, and processing data.
➢ Apache Pig has a component known as Pig Engine that
accepts the Pig Latin scripts as input and converts those
scripts into MapReduce jobs.


Agenda
• Hadoop Ecosystem (continued)
– Introducing HBase
– HBase Architecture
– HBase and HDFS
– HBase and RDBMS
– HBase Read and Write


Introducing HBase:
➢ A distributed, column-oriented database built on top of the Hadoop
file system.
➢ A canonical example is the webtable: a table of web pages keyed by
web page URL.
➢ The webtable is large, containing over a billion rows; parsing and
batch analytics are MapReduce jobs that run continuously against it.
➢ HBase is not relational, but it still has the capacity to do what an
RDBMS cannot:
➢ host large, sparsely populated tables on clusters built from
commodity (appliance) hardware.

Introducing HBase:
➢ HBase stores data in tables with rows and columns; the intersection of
a row and a column is called a cell.
➢ Each cell in an HBase table has an associated attribute termed a
"version", a timestamp that uniquely identifies the cell.
➢ It facilitates reading/writing big data randomly and efficiently in
real time.
➢ Versioning helps in keeping track of, and retrieving, previous versions
of the cell contents.
➢ It provides various useful data-processing features:
➢ support for distributed environments and multidimensional maps
➢ storage of results for later analytical processing

HBase Architecture:
➢ The HBase architecture consists mainly of four components:
➢ HMaster, HRegionServer, HRegions and ZooKeeper
(diagram of the HBase architecture)


HBase Architecture:
HMaster :
➢ It is the implementation of a Master server in HBase architecture.
➢ It acts as a monitoring agent to monitor all Region Server instances
of the cluster and acts as an interface for all the metadata changes.
➢ The following are important roles performed by HMaster in HBase.
1. Plays a vital role in terms of performance and in maintaining the nodes
in the cluster.
2. HMaster provides administrative functions and distributes these services
across the different region servers.
3. HMaster assigns regions to region servers.
4. HMaster controls load balancing and handles the load over the nodes
present in the cluster.
5. HMaster takes responsibility when a client wants to change a schema or
perform any metadata operations.

HBase Architecture:
HRegions:
➢ Tables are automatically partitioned horizontally into regions by
HBase
➢ Regions are units that get spread over a cluster in HBase
➢ Each region consists of a subset of rows of a table
➢ It contains multiple stores, one for each column family.
➢ It consists mainly of two components: the MemStore and HFiles.


HBase Architecture:
HRegionserver:
➢ When a Region Server receives write and read requests from the client,
it assigns the request to the specific region where the actual column
family resides.
➢ It is responsible for serving and managing the regions (data) present in
the distributed cluster.
➢ The region servers run on the DataNodes present in the Hadoop cluster.
➢ The HMaster can contact multiple HRegionServers; the region servers
perform the following functions:
✓ Hosting and managing regions
✓ Splitting regions automatically
✓ Handling read and write requests
✓ Communicating with the client directly

HBase Architecture:
Zookeeper :
➢ It is a centralized monitoring server which maintains configuration
information and provides distributed synchronization.
➢ If a client wants to communicate with a region server, it has to
approach ZooKeeper first.
➢ Services provided by ZooKeeper:
✓ Maintains configuration information
✓ Provides distributed synchronization
✓ Establishes client communication with region servers
✓ Provides ephemeral nodes, which represent different
region servers
✓ Master servers use ephemeral nodes to discover available
servers in the cluster
✓ Tracks server failures and network partitions

HBase and HDFS:


HDFS: a distributed file system suitable for storing large files.
HBase: a database built on top of HDFS.

HDFS: does not support fast individual record lookups.
HBase: provides fast lookups for large tables.

HDFS: geared to high-latency batch processing, with no concept of interactive access.
HBase: provides low-latency access to single rows from billions of records (random access).

HDFS: provides only sequential access to data.
HBase: internally uses hash tables and provides random access; it stores the data in indexed HDFS files for faster lookups.


HBase and RDBMS:


HBase: schema-less; it does not have the concept of a fixed-column schema and defines only column families.
RDBMS: governed by its schema, which describes the whole structure of the tables.

HBase: built for wide tables and horizontally scalable.
RDBMS: thin and built for small tables; hard to scale.

HBase: has no transactions.
RDBMS: transactional.

HBase: holds de-normalized data.
RDBMS: holds normalized data.

HBase: good for semi-structured as well as structured data.
RDBMS: good for structured data.


HBase Read and Write:


1. When a client wants to write data, it first communicates with the
Region Server and then with the regions.
2. The region contacts the MemStore associated with the column family for
storing the data.
3. The data is first stored in the MemStore, where it is sorted, and after
that it is flushed into an HFile. The main reason for using the MemStore
is to store data destined for the distributed file system sorted by row
key. The MemStore is kept in the Region Server's main memory, while
HFiles are written into HDFS.
4. When a client wants to read data from the regions:
5. the client can access the MemStore and request the data from there;
6. the client then approaches the HFiles to get the data; the data is
fetched and retrieved by the client.
(A minimal Java client sketch of this write/read path follows below.)
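A minimal client-side sketch of this path, using the standard HBase Java API, is shown below; the table name 'webtable', the 'contents' column family and the row key are assumptions borrowed from the webtable example earlier:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml (ZooKeeper quorum etc.)
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("webtable"))) {   // hypothetical table

            // Write: the client locates the owning Region Server; the region buffers
            // the cell in its MemStore and later flushes it to an HFile on HDFS.
            Put put = new Put(Bytes.toBytes("com.example/index.html"));       // row key
            put.addColumn(Bytes.toBytes("contents"), Bytes.toBytes("html"),
                          Bytes.toBytes("<html>...</html>"));
            table.put(put);

            // Read: served from the MemStore if still buffered, otherwise from HFiles on HDFS.
            Get get = new Get(Bytes.toBytes("com.example/index.html"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("contents"), Bytes.toBytes("html"));
            System.out.println(Bytes.toString(value));
        }
    }
}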

Features of HBase:
1. Consistency: though not strictly an ACID implementation, HBase supports
consistent read and write operations.
2. Sharding: allows distribution of data using the underlying file
system.
3. High availability: the implementation of region servers ensures
recoverability.
4. Client API: supports programmatic access using Java APIs.
5. Support for IT operations: provides a set of built-in web pages to
view detailed operational insights.


Agenda
• Hadoop Ecosystem (continued)
– Read and Write Operation In HDFS
– Access Using COMMAND-LINE INTERFACE
– Access Using JAVA API


Read and Write Operation In HDFS:


➢ An HDFS cluster primarily consists of a NameNode that manages the
file system metadata and DataNodes that store the actual data.
• NameNode: The NameNode can be considered the master of the system. It maintains the file
system tree and the metadata for all the files and directories present in the system. Two
files, the 'namespace image' and the 'edit log', are used to store the metadata information.
The NameNode has knowledge of all the DataNodes containing data blocks for a given file;
however, it does not store block locations persistently. This information is reconstructed
from the DataNodes every time the system starts.
• DataNode: DataNodes are slaves which reside on each machine in a cluster and provide
the actual storage. They are responsible for serving read and write requests from the clients.

➢ Read/write operations in HDFS operate at the block level. Data
files in HDFS are broken into block-sized chunks, which are
stored as independent units. The default block size is 64 MB.

Read and Write Operation In HDFS:


Read Operation: A data read request is served by HDFS, the NameNode and the DataNodes.
Let us call the reader a 'client'. (Diagram: file read operation in Hadoop.)


Read and Write Operation In HDFS:


Read Operation :
1. The client initiates the read request by calling the 'open()' method of the FileSystem object;
it is an object of type DistributedFileSystem.
2. This object connects to the NameNode using RPC and gets metadata information such as the
locations of the blocks of the file. Note that these addresses are of the first few blocks of
the file.
3. In response to this metadata request, the addresses of the DataNodes having a copy of each
block are returned.
4. Once the addresses of the DataNodes are received, an object of type FSDataInputStream is
returned to the client. FSDataInputStream contains a DFSInputStream which takes care of
interactions with the DataNodes and the NameNode. In step 4 shown in the diagram, the client
invokes the 'read()' method, which causes DFSInputStream to establish a connection with the
first DataNode holding the first block of the file.
5. Data is read in the form of streams, with the client invoking the 'read()' method repeatedly.
This read() process continues till it reaches the end of the block.
6. Once the end of a block is reached, DFSInputStream closes that connection and moves on to
locate the next DataNode for the next block.
7. Once the client is done with the reading, it calls the close() method (a minimal sketch using
this API follows below).
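A hedged, minimal Java sketch of the same read path is shown below; it reads the /temp.txt file used in the command-line examples later in this deck (assumed to exist) and copies it to standard output:

import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadDemo {
    public static void main(String[] args) throws Exception {
        // A DistributedFileSystem instance when fs.defaultFS points at hdfs://
        FileSystem fs = FileSystem.get(new Configuration());
        // open() triggers the RPC to the NameNode for block locations (steps 1-3);
        // the returned FSDataInputStream then reads each block from a DataNode.
        try (InputStream in = fs.open(new Path("/temp.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);   // repeated read() calls (step 5)
        }                                                      // close() (step 7)
    }
}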
Read and Write Operation In HDFS:

Write Operation: (diagram: file write operation in Hadoop)

1. The client initiates the write operation by calling the 'create()' method of the
DistributedFileSystem object, which creates a new file (step 1 in the diagram).
2. The DistributedFileSystem object connects to the NameNode using an RPC call and initiates the
new file creation. However, this file-create operation does not associate any blocks with the
file. It is the responsibility of the NameNode to verify that the file (which is being created)
does not already exist and that the client has the correct permissions to create a new file. If
the file already exists or the client does not have sufficient permission, then an IOException
is thrown to the client. Otherwise, the operation succeeds and a new record for the file is
created by the NameNode.
3. Once the new record is created in the NameNode, an object of type FSDataOutputStream is
returned to the client. The client uses it to write data into HDFS. The data write method is
invoked (step 3 in the diagram).
4. FSDataOutputStream contains a DFSOutputStream object which looks after communication
with the DataNodes and the NameNode. While the client continues writing
data, DFSOutputStream continues creating packets from this data. These packets are
enqueued into a queue called the DataQueue.

Read and Write Operation In HDFS:


Write Operation :
5. There is one more component, called the DataStreamer, which consumes this DataQueue.
The DataStreamer also asks the NameNode for the allocation of new blocks, thereby picking
desirable DataNodes to be used for replication.
6. Now the process of replication starts by creating a pipeline using DataNodes. In our case,
we have chosen a replication level of 3, so there are 3 DataNodes in the pipeline.
7. The DataStreamer pours packets into the first DataNode in the pipeline.
8. Every DataNode in the pipeline stores each packet it receives and forwards the same to the
second DataNode in the pipeline.
9. Another queue, the 'Ack Queue', is maintained by DFSOutputStream to store packets that are
waiting for acknowledgement from the DataNodes.
10. Once acknowledgement for a packet in the queue is received from all DataNodes in the
pipeline, it is removed from the 'Ack Queue'. In the event of any DataNode failure, packets
from this queue are used to reinitiate the operation.
11. After the client is done with writing data, it calls the close() method (step 9), which results
in flushing the remaining data packets to the pipeline, followed by waiting for acknowledgement.
12. Once the final acknowledgement is received, the NameNode is contacted to tell it that the
file write operation is complete (a minimal sketch of writing a file through the FileSystem
API follows below).
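A corresponding minimal write sketch is shown below; the output path is hypothetical, and the pipeline, packet and acknowledgement machinery described above all happen inside the returned stream:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // create() asks the NameNode to record the new file (steps 1-2);
        // blocks are allocated later as DFSOutputStream/DataStreamer need them.
        try (FSDataOutputStream out = fs.create(new Path("/output/demo.txt"))) {  // hypothetical path
            out.writeUTF("hello HDFS");   // data is packetized and pushed down the DataNode pipeline
        }                                 // close() flushes remaining packets and waits for acks (steps 11-12)
    }
}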

Access Using COMMAND-LINE INTERFACE:


➢ The command-line interface supports filesystem operations such as reading files, creating
directories, moving files, deleting data, and listing directories.
➢ '$HADOOP_HOME/bin/hdfs dfs -help' gives detailed help on every command.
➢ '$HADOOP_HOME/bin/hdfs dfs -copyFromLocal temp.txt /' copies a file from the local
filesystem to HDFS.
➢ '$HADOOP_HOME/bin/hdfs dfs -ls /' lists the files present in a directory.
➢ '$HADOOP_HOME/bin/hdfs dfs -copyToLocal /temp.txt' copies a file from HDFS to the local
filesystem.
➢ '$HADOOP_HOME/bin/hdfs dfs -mkdir /mydirectory' creates a new directory.


Access Using JAVA API :


➢ Hadoop provides multiple Java classes. The package
org.apache.hadoop.fs contains classes useful for manipulating a file
in Hadoop's filesystem. These operations include open, read, write,
and close. In fact, the file API for Hadoop is generic and can be
extended to interact with filesystems other than HDFS.
➢ A java.net.URL object can be used for reading the contents of a file. To begin
with, we need to make Java recognize Hadoop's hdfs URL scheme.
➢ This is done by calling the setURLStreamHandlerFactory method on the
URL class (a minimal sketch follows below).
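A minimal sketch of this approach is shown below; the Namenode hostname and the /temp.txt path are illustrative, and FsUrlStreamHandlerFactory is the Hadoop-provided factory that teaches java.net.URL the hdfs:// scheme:

import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class UrlCat {
    static {
        // May be called at most once per JVM, so it is done in a static initializer.
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        // Host and path are illustrative; any hdfs:// URL readable by the client works.
        try (InputStream in = new URL("hdfs://namenode.example.com:8020/temp.txt").openStream()) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}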
