
WHAT IS HADOOP?

WHERE HDFS FITS IN
LET’S FIRST UNDERSTAND BUZZWORDS IN THE HADOOP WORLD
REPLICATION
FAULT TOLERANCE
LOAD BALANCING
RELIABILITY
CLUSTERING
IT’S TIME FOR A DEEP DIVE…
HDFS ARCHITECTURE
• Name Node
• Data Node
• Task Tracker
• Job Tracker
• Image and Journal
• HDFS Client
• Checkpoint Node
• Backup Node
[Diagram: HDFS architecture showing the Backup Node, the Name Node (Image, Journal, Checkpoint, Job Tracker), the HDFS Client, and Data Nodes 1 to N, each with a Task Tracker]


NAME NODE

[Diagram: Name Node components: Job Tracker, Inode, Image, Journal, Checkpoint]
• Inode - Files and directories are represented on the NameNode by inodes, which record attributes such as permissions, modification and access times, and namespace and disk space quotas.

• Image - The inode data and the list of blocks belonging to each file.

• Checkpoint - The persistent record of the image, stored in the local host’s native file system.

• Journal - The write-ahead commit log for changes to the file system that must be persistent.
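The relationship between the four terms above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and method names (`Inode`, `NameNodeSketch`, `create_file`, `save_checkpoint`, `recover`) are hypothetical, the checkpoint is a JSON string standing in for a file on the local file system, and only a single "create" operation is modeled.

```python
from dataclasses import dataclass, field
import json


@dataclass
class Inode:
    """Hypothetical inode holding a few of the attributes named above."""
    path: str
    permissions: int = 0o644
    mtime: float = 0.0
    atime: float = 0.0
    blocks: list = field(default_factory=list)  # block IDs belonging to the file


class NameNodeSketch:
    """Toy model of image, journal, and checkpoint (not the real NameNode)."""

    def __init__(self):
        self.image = {}          # in-memory image: path -> Inode
        self.journal = []        # write-ahead log of namespace changes
        self.checkpoint = "{}"   # stands in for the persistent on-disk record

    def create_file(self, path, blocks):
        # Write-ahead: record the change in the journal first, then apply it.
        self.journal.append({"op": "create", "path": path, "blocks": list(blocks)})
        self.image[path] = Inode(path, blocks=list(blocks))

    def save_checkpoint(self):
        # Persist the current image; the journal can then start empty.
        self.checkpoint = json.dumps({p: i.blocks for p, i in self.image.items()})
        self.journal.clear()

    @staticmethod
    def recover(checkpoint, journal):
        # Restart: load the checkpoint, then replay the journal on top of it.
        nn = NameNodeSketch()
        for path, blocks in json.loads(checkpoint).items():
            nn.image[path] = Inode(path, blocks=blocks)
        for entry in journal:
            if entry["op"] == "create":
                nn.image[entry["path"]] = Inode(entry["path"], blocks=entry["blocks"])
        return nn
```

Replaying the journal over the last checkpoint is what lets the namespace be rebuilt after a restart without writing the whole image on every change.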
DATA NODE
On Start Up…

[Diagram: Data Node and Name Node at start-up]
DATA NODE
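[Diagram: the Data Node reports its total storage capacity, fraction of storage in use, and number of data transfers in progress to the Name Node; the Name Node sends commands back to the Data Node]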
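The report-and-reply exchange on this slide can be sketched as follows. The names `Heartbeat`, `NameNodeMonitor`, `queue_command`, and `receive` are hypothetical, and commands are plain strings here; the point is only that the Name Node hands out queued commands in its reply to a Data Node's report.

```python
from dataclasses import dataclass


@dataclass
class Heartbeat:
    """Hypothetical report carrying the three fields shown on the slide."""
    node_id: str
    capacity_bytes: int          # total storage capacity
    used_fraction: float         # fraction of storage in use
    transfers_in_progress: int   # number of data transfers in progress


class NameNodeMonitor:
    """Toy Name Node side: remembers reports and replies with commands."""

    def __init__(self):
        self.last_seen = {}         # node_id -> time of last report
        self.last_report = {}       # node_id -> most recent Heartbeat
        self.pending_commands = {}  # node_id -> commands waiting to be sent

    def queue_command(self, node_id, command):
        self.pending_commands.setdefault(node_id, []).append(command)

    def receive(self, hb, now):
        # Record the report, then return any queued commands as the reply.
        self.last_seen[hb.node_id] = now
        self.last_report[hb.node_id] = hb
        return self.pending_commands.pop(hb.node_id, [])
```

For example, after `queue_command("dn1", "replicate blk_1")`, the next `receive` call for `dn1` returns that command and later calls return an empty list until something new is queued.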
HDFS CLIENT
IMAGE & JOURNAL

Flush & Sync Operation


CHECKPOINT NODE
BACKUP NODE
FILE I/O OPERATIONS
Single Writer

Multiple Reader
DATA WRITE OPERATION
[Diagram: write pipeline. The Client contacts the Name Node, then sends packet1 through packet5 along the pipeline client, DN1, DN2, DN3, between a setup step and a close step; DN4 is also shown]
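A minimal sketch of the pipeline in this diagram, assuming a simplified synchronous model: each node stores a packet and forwards it to the next node, and the return value plays the role of an acknowledgement. The names `split_into_packets`, `DataNodeSketch`, and `write_block` are hypothetical, and the tiny packet size is chosen only so the example yields five packets.

```python
def split_into_packets(data: bytes, packet_size: int):
    """Cut the data into fixed-size packets (the last one may be shorter)."""
    return [data[i:i + packet_size] for i in range(0, len(data), packet_size)]


class DataNodeSketch:
    """Toy Data Node that stores a packet and forwards it downstream."""

    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream
        self.stored = bytearray()

    def receive(self, packet: bytes) -> bool:
        self.stored.extend(packet)
        if self.downstream is None:
            return True                         # last node acknowledges
        return self.downstream.receive(packet)  # ack travels back up the chain


def write_block(data: bytes, pipeline_names, packet_size=4):
    # Setup: chain the nodes so that DN1 -> DN2 -> DN3.
    node, nodes = None, []
    for name in reversed(pipeline_names):
        node = DataNodeSketch(name, downstream=node)
        nodes.append(node)
    nodes.reverse()
    # Stream the packets through the head of the pipeline.
    for packet in split_into_packets(data, packet_size):
        if not nodes[0].receive(packet):
            raise IOError("pipeline did not acknowledge the packet")
    return nodes  # close: every node now holds a full copy
```

The client only ever talks to the first node in the chain, which is the main design point the diagram conveys.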
DATA WRITE/READ OPERATION

• Single-writer, multiple-reader model
• Lease management (soft limit and hard limit)
• Pipelining, buffering, and hflush
• Checksums for data integrity
• Choosing nodes for the read operation

[Diagram: Client, Name Node, and DN1]
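The lease bullet above can be illustrated with a small sketch. It assumes the commonly described behavior: until the soft limit passes, the lease holder has exclusive write access; after it, another client may take over the lease; after the hard limit, the lease is reclaimed automatically. The values of 60 seconds and one hour, and the names `LeaseManager`, `acquire`, `renew`, and `expire_hard`, are assumptions made for illustration.

```python
SOFT_LIMIT_S = 60      # assumed soft limit, in seconds
HARD_LIMIT_S = 3600    # assumed hard limit, in seconds


class LeaseManager:
    """Toy single-writer lease table: path -> (holder, last renewal time)."""

    def __init__(self):
        self.leases = {}

    def acquire(self, path, client, now):
        current = self.leases.get(path)
        if current is not None:
            holder, renewed = current
            # Another client is refused while the soft limit has not expired.
            if holder != client and now - renewed <= SOFT_LIMIT_S:
                return False
        self.leases[path] = (client, now)
        return True

    def renew(self, path, client, now):
        holder, _ = self.leases.get(path, (None, None))
        if holder != client:
            return False
        self.leases[path] = (client, now)
        return True

    def expire_hard(self, now):
        # Reclaim leases whose holder has been silent past the hard limit.
        expired = sorted(p for p, (_, t) in self.leases.items()
                         if now - t > HARD_LIMIT_S)
        for path in expired:
            del self.leases[path]
        return expired
```

Readers are not modeled because they do not need a lease; only the single writer does.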
BLOCK PLACEMENT
[Diagram: the Client sends Add(data) to the Name Node (Inode, Image, Journal, checkpoint), which returns the Data Nodes for the replicas. RACK1: DN1 to DN5; RACK2: DN6 to DN10; RACK3: DN11 to DN15]
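One way to picture how the Name Node might pick Data Nodes for three replicas is the sketch below. It assumes the commonly described default policy (first replica on the writer's node, the other two on different nodes of a single remote rack); the slide itself does not spell the policy out, and `choose_replica_nodes` and the topology dictionary are hypothetical.

```python
import random


def choose_replica_nodes(writer, topology, rng=None):
    """Pick three nodes: the writer plus two nodes on one other rack.

    topology maps a rack name to the list of Data Node names on that rack.
    """
    rng = rng or random.Random()
    rack_of = {n: r for r, nodes in topology.items() for n in nodes}
    local_rack = rack_of[writer]
    # Only racks that can hold two distinct replicas qualify as the remote rack.
    remote_racks = [r for r, nodes in topology.items()
                    if r != local_rack and len(nodes) >= 2]
    if not remote_racks:
        raise ValueError("need another rack with at least two nodes")
    remote_rack = rng.choice(remote_racks)
    second, third = rng.sample(topology[remote_rack], 2)
    return [writer, second, third]
```

With this choice no node holds more than one replica and no rack holds more than two, so losing a whole rack still leaves at least one copy.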


REPLICATION MANAGEMENT
[Diagram: the Name Node (Inode, Image, Journal, checkpoint) tracking replicas across RACK1 (DN1 to DN5), RACK2 (DN6 to DN10), and RACK3 (DN11 to DN15), with some blocks marked Over Replicated and others Under Replicated]
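The over-replicated and under-replicated labels in this diagram can be expressed as a simple classification over the Name Node's block-to-location map. This is a sketch under assumptions: the target of three replicas, the function name `classify_blocks`, and the rule of handling the blocks with the fewest replicas first are illustrative choices, not details taken from the slide.

```python
def classify_blocks(block_locations, target=3):
    """Split blocks into under- and over-replicated lists.

    block_locations maps a block ID to the Data Nodes holding a replica.
    Under-replicated blocks are ordered with the fewest replicas first.
    """
    under, over = [], []
    for block_id, nodes in block_locations.items():
        replicas = len(set(nodes))
        if replicas < target:
            under.append((replicas, block_id))
        elif replicas > target:
            over.append(block_id)
    under.sort()
    return [block_id for _, block_id in under], sorted(over)
```

Under-replicated blocks would then be scheduled for copying to new nodes, and over-replicated blocks would have a surplus replica removed.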
BALANCER
• Balances disk space utilization across individual Data Nodes.
• Works from a utilization threshold.
• Balancing moves still follow the block placement policy.
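The threshold idea can be sketched as a comparison of each node's utilization with the utilization of the whole cluster. The function name `balance_report`, the input layout, and the threshold of 0.10 (expressed as a fraction) are assumptions made for illustration.

```python
def balance_report(nodes, threshold=0.10):
    """Return (cluster utilization, over-utilized nodes, under-utilized nodes).

    nodes maps a Data Node name to a (used_bytes, capacity_bytes) pair.
    A node is out of balance when its utilization differs from the cluster
    utilization by more than the threshold.
    """
    total_used = sum(used for used, _ in nodes.values())
    total_capacity = sum(cap for _, cap in nodes.values())
    cluster_util = total_used / total_capacity
    over = sorted(n for n, (u, c) in nodes.items()
                  if u / c - cluster_util > threshold)
    under = sorted(n for n, (u, c) in nodes.items()
                   if cluster_util - u / c > threshold)
    return cluster_util, over, under
```

A balancer would then move block replicas from nodes in the first list to nodes in the second, choosing only moves that keep the block placement policy satisfied.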


SCANNER
• The scanner verifies data integrity using block checksums.
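Checksum-based verification can be sketched with CRC32 from Python's standard library. The 512-byte chunk size and the names `compute_checksums` and `scan_block` are assumptions for illustration; the sketch recomputes a checksum per chunk and reports which chunks no longer match the stored values.

```python
import zlib

CHUNK = 512  # assumed number of bytes covered by each checksum


def compute_checksums(data: bytes, chunk: int = CHUNK):
    """Return one CRC32 value per chunk of the block."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]


def scan_block(data: bytes, stored_checksums, chunk: int = CHUNK):
    """Return the indices of chunks whose checksum no longer matches."""
    actual = compute_checksums(data, chunk)
    if len(actual) != len(stored_checksums):
        raise ValueError("block length does not match the stored checksums")
    return [i for i, (a, s) in enumerate(zip(actual, stored_checksums)) if a != s]
```

An empty result means the block is intact; a non-empty result identifies the damaged chunks, so the replica can be reported as corrupt.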
