Welcome To Hadoop Distributed File System (HDFS)
Components
Main Component
HDFS
Big Data Problem - Storage
Big Data Problem - Storage - Approach
Take 100 drives of 1 TB each, create 100 subfolders, and mount one drive on each subfolder.
Big Data Problem - Storage - Challenges
HDFS - Namenodes and Datanodes
Namenode
A. Maintains metadata
B. Keeps the metadata in RAM (persisted on local disk)
C. Keeps the index of “what is where”
Datanodes
Store the actual data
HDFS - Design
3. Commodity hardware
HDFS - Limitations
HDFS - Blocks
HDFS - Blocks - Advantages
1. Helps fit big files onto small disks
2. Leaves less unused space on the disk
3. Optimises the transfer
4. Distributes the load across multiple machines
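To make the idea of blocks concrete, here is a minimal sketch of how a file's size maps onto fixed-size blocks. The 128 MB block size and the helper name `split_into_blocks` are assumptions chosen for illustration; the real block size is a cluster setting and this is not a Hadoop API.

```python
# Hypothetical helper for illustration only; not part of any Hadoop API.
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size of 128 MB


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file would be cut into."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block holds only the leftover bytes.
        sizes.append(remainder)
    return sizes


MB = 1024 * 1024
blocks = split_into_blocks(300 * MB)
print(len(blocks), [b // MB for b in blocks])  # 3 [128, 128, 44]
```

Each entry in the list could be stored on a different machine, which is what lets a large file fit across small disks and spreads the load.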
HDFS - Replication
Each block has multiple copies.
The number of copies is called the replication factor; the default is 3.
No two copies of a block are on the same datanode.
If a datanode fails, the lost blocks are copied to other nodes.
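The replication rules above can be sketched as a small simulation. This is only an illustration under simplifying assumptions: the node names, the round-robin placement, and the functions `place_blocks` and `handle_node_failure` are hypothetical and do not reflect how the namenode actually schedules copies.

```python
REPLICATION_FACTOR = 3  # default replication factor stated on the slide


def place_blocks(block_ids, nodes, replication=REPLICATION_FACTOR):
    """Assign each block to `replication` distinct nodes (simple round-robin)."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    placement = {}
    for i, block in enumerate(block_ids):
        # A set guarantees no two copies of a block share a node.
        placement[block] = {nodes[(i + k) % len(nodes)] for k in range(replication)}
    return placement


def handle_node_failure(placement, failed, live_nodes, replication=REPLICATION_FACTOR):
    """Drop the failed node and copy under-replicated blocks to other live nodes."""
    for holders in placement.values():
        holders.discard(failed)
        candidates = sorted(n for n in live_nodes if n not in holders)
        while len(holders) < replication and candidates:
            holders.add(candidates.pop(0))
    return placement


nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical datanode names
placement = place_blocks(["b1", "b2"], nodes)
handle_node_failure(placement, "dn2", [n for n in nodes if n != "dn2"])
print({block: sorted(holders) for block, holders in placement.items()})
```

After the simulated failure of `dn2`, every block is back to three copies, all on distinct surviving nodes.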
HDFS - Replica Placement Policy
[Diagram: a data center with two racks (Rack 1 and Rack 2), each holding several datanodes, and a local machine writing a block]
First copy: on the same node (local machine)
Second copy: on the same rack
Third copy: off the rack
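The placement rule as described on this slide (first copy on the writer's node, second on the same rack, third off the rack) can be sketched as follows. The rack layout, node names, and the function `choose_replica_nodes` are hypothetical; the policy a real cluster applies depends on its Hadoop version and configuration.

```python
def choose_replica_nodes(writer_node, racks):
    """Pick three nodes following the slide's rule.

    `racks` maps a rack name to the list of node names in that rack.
    Returns [first, second, third].
    """
    # Find the rack that contains the writer's node.
    local_rack = next(r for r, members in racks.items() if writer_node in members)
    first = writer_node  # first copy: same node
    # Second copy: a different node on the same rack.
    second = next(n for n in racks[local_rack] if n != writer_node)
    # Third copy: any node on a different rack.
    third = next(
        n for r, members in racks.items() if r != local_rack for n in members
    )
    return [first, second, third]


# Hypothetical two-rack layout for illustration.
racks = {"rack1": ["dn1", "dn2", "dn3"], "rack2": ["dn4", "dn5"]}
print(choose_replica_nodes("dn1", racks))  # ['dn1', 'dn2', 'dn4']
```

Keeping one copy off the rack means the block survives the loss of a whole rack, while the two local copies keep most write traffic within one rack.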
HDFS - Replication Factor
HDFS - Replication - Failover
[Diagram: Namenode and Datanodes]
HDFS - Driver
[Diagram: the Client/Driver sends a request for a file to the namenode and receives metadata, then accesses the actual data on the datanodes]
HDFS - If multiple namenodes?
[Diagram: the Client/Driver sends "Create myfile.txt" to namenode1, which works with the datanodes]
Anatomy of a File Write
Anatomy of a File Read
Namenode - Metadata
1. Keep a backup of the namespace image and edit logs in two locations
[Diagram: metadata in namenode memory; local storage holds the Namespace Image and Edit Log1, Edit Log2, Edit Log3]
Namenode - Single Point of Failure
1. To recover the metadata, the namespace image has to be merged with the edit logs.
Namespace Image + Edit Log1 + Edit Log2 + Edit Log3 = Metadata in memory
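Merging the image with the edit logs amounts to replaying the logged changes, in order, on top of the last saved snapshot. The sketch below illustrates that idea only: the record format `(operation, path, blocks)` and the function `replay_edit_logs` are hypothetical and far simpler than the on-disk formats a real namenode uses.

```python
def replay_edit_logs(namespace_image, edit_logs):
    """Rebuild in-memory metadata from a snapshot plus ordered edit logs."""
    metadata = dict(namespace_image)  # start from a copy of the snapshot
    for log in edit_logs:
        for operation, path, blocks in log:
            if operation == "create":
                metadata[path] = blocks
            elif operation == "delete":
                metadata.pop(path, None)
            else:
                raise ValueError(f"unknown operation: {operation}")
    return metadata


# Hypothetical snapshot and three edit logs, oldest first.
image = {"/a.txt": ["blk_1"]}
logs = [
    [("create", "/b.txt", ["blk_2", "blk_3"])],
    [("delete", "/a.txt", None)],
    [("create", "/c.txt", ["blk_4"])],
]
print(replay_edit_logs(image, logs))
# {'/b.txt': ['blk_2', 'blk_3'], '/c.txt': ['blk_4']}
```

The longer the edit logs grow, the longer this replay takes, which is one reason the image and logs are merged periodically.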
Namenode - Single Point of Failure
1. Keep a backup of the namespace image and edit logs in two locations
[Diagram: the Namenode holds the metadata (metadata = namespace image + edit log) on local storage; a Secondary Namenode with its own copy of the metadata is manually brought up in case of emergency]
Namenode - Single Point of Failure?
3. HDFS high availability (HA) - Quorum journal manager (QJM) (in 2.x)
[Diagram: End User, Client Library, ZK (ZooKeeper), Namenode, StandBy namenode, Datanodes]
HDFS Federation (2.x)
Multiple namenodes - a different namenode for different subfolders
[Diagram: client libraries use mount tables to pick a namenode; all namenodes share one pool of datanodes]
/mydata1 -> Namenode1 (with StandBy1)
/mydata2, /mydata3 -> Namenode2 (with StandBy2)
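A mount table is essentially a lookup from a path prefix to the namenode responsible for it. The sketch below shows one way a client library might do that lookup. The mapping is a reading of the diagram (in particular, placing /mydata3 on Namenode2 is an assumption), and `resolve_namenode` is a hypothetical function, not a Hadoop API.

```python
# Assumed mount table, read off the slide's diagram.
MOUNT_TABLE = {
    "/mydata1": "Namenode1",
    "/mydata2": "Namenode2",
    "/mydata3": "Namenode2",  # assumption: shares Namenode2 with /mydata2
}


def resolve_namenode(path, mount_table=MOUNT_TABLE):
    """Return the namenode for the longest mount point that covers `path`."""
    best = None
    for mount_point in mount_table:
        # Match whole path components so /mydata1 does not cover /mydata10.
        if path == mount_point or path.startswith(mount_point + "/"):
            if best is None or len(mount_point) > len(best):
                best = mount_point
    if best is None:
        raise KeyError(f"no mount point covers {path}")
    return mount_table[best]


print(resolve_namenode("/mydata1/logs/app.log"))  # Namenode1
print(resolve_namenode("/mydata3"))               # Namenode2
```

Because the lookup happens on the client side, each namenode only needs to hold the metadata for its own subfolders.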
Namenode Metadata
Meta-data
1. The entire metadata is in main memory.
Types of Metadata
1. List of files
2. List of Blocks for each file
3. List of DataNodes for each block
4. File attributes, e.g. access time, replication factor, file size, file name, directory name
HDFS - Hands-on
How to Transfer Files?
[Diagram: files move between the Laptop and the CloudxLab Linux Console (/home/abhinav9884) via SCP, SFTP, or WinSCP, and into HDFS (/user/abhinav9884) via Hue or hadoop fs]
HDFS - Access files
4. hadoop fs -ls hdfs://ip-172-31-53-48.ec2.internal:8020/user/abhinav9884/sample.txt
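The full HDFS address used in the command above has the same parts as any URI. As a small illustration, the sketch below splits it with Python's standard library; reading the host and port as the address of the namenode is an interpretation, not something the slide states.

```python
from urllib.parse import urlsplit

# The address from the slide, with the wrapped file name joined back together.
uri = "hdfs://ip-172-31-53-48.ec2.internal:8020/user/abhinav9884/sample.txt"
parts = urlsplit(uri)

print(parts.scheme)    # hdfs
print(parts.hostname)  # ip-172-31-53-48.ec2.internal
print(parts.port)      # 8020
print(parts.path)      # /user/abhinav9884/sample.txt
```

The path component is the location of the file inside HDFS, under the user's HDFS home directory.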
Where are the blocks?
Set Replication
# Check blocks
Summary
● HDFS
● HDFS - Design
● Hands-on demos
Thank You!
TODAY’S CLASS
● Recap
○ HDFS Architecture
○ Hadoop 1.0 Architecture
○ Hadoop 2.0 / Yarn
● Cluster Overview
○ Hue
● Using HDFS, Hive, Pig, Oozie
○ From Web
○ From Command