BDP 2024 07
Jiaul Paik
Lecture 7
Storing Big Data in Cluster
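[Figure: HDFS architecture]
• The application asks the HDFS client for a file, e.g. /foo/bar
• HDFS client → HDFS namenode: (file name, block id)
• HDFS namenode → HDFS client: (block id, block location); the namenode holds the file namespace, e.g. /foo/bar → block 3df2
• HDFS namenode → HDFS datanode: instructions to datanode
• HDFS datanode → HDFS namenode: datanode state
• HDFS client → HDFS datanode: (block id, byte range)
• HDFS datanode → HDFS client: block data
• Each HDFS datanode runs on a slave node and stores its blocks in the local Linux file system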
HDFS
Reading and Writing
Dataflow: Reading data from HDFS
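[Figure: HDFS client, DistributedFileSystem, FSDataInputStream, NameNode (on the namenode), and datanodes]
• 1: open (HDFS client → DistributedFileSystem)
• 2: get block locations (DistributedFileSystem → NameNode)
• 3: read (HDFS client → FSDataInputStream)
• 4: read (FSDataInputStream → datanode)
• 5: read (FSDataInputStream → next datanode)
• 6: close (HDFS client → FSDataInputStream)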
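The read steps above can be sketched with a toy in-memory model. This is only an illustration, not the Hadoop API: `ToyNameNode`, `ToyDataNode`, `read_file`, the block ids, and the datanode names are all hypothetical, and the sketch assumes the client simply reads from the first replica listed for each block.

```python
class ToyNameNode:
    """Holds metadata only: the blocks of each file and where replicas live."""

    def __init__(self):
        self.file_blocks = {}      # path -> ordered list of block ids
        self.block_locations = {}  # block id -> list of datanode names

    def get_block_locations(self, path):
        # Step 2: return (block id, replica locations) for every block of the file.
        return [(b, self.block_locations[b]) for b in self.file_blocks[path]]


class ToyDataNode:
    """Holds block contents only; knows nothing about file names."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(namenode, datanodes, path):
    data = bytearray()
    for block_id, locations in namenode.get_block_locations(path):
        # Steps 4 and 5: read each block in order from one replica.
        # Assumption: use the first listed replica.
        data += datanodes[locations[0]].read_block(block_id)
    return bytes(data)


# Hypothetical two-block file spread over three datanodes.
namenode = ToyNameNode()
datanodes = {name: ToyDataNode(name) for name in ("dn1", "dn2", "dn3")}
namenode.file_blocks["/foo/bar"] = ["blk_1", "blk_2"]
namenode.block_locations = {"blk_1": ["dn1", "dn2"], "blk_2": ["dn3", "dn1"]}
for block_id, content in (("blk_1", b"hello "), ("blk_2", b"world")):
    for name in namenode.block_locations[block_id]:
        datanodes[name].blocks[block_id] = content

print(read_file(namenode, datanodes, "/foo/bar"))
```

The point of the split is that file data never passes through the namenode: it only answers the metadata lookup, and the bytes come straight from the datanodes.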
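[Figure: dataflow for writing data to HDFS, with HDFS client, DistributedFileSystem, FSDataOutputStream, Namenode, and a pipeline of three datanodes]
• 1. Create (HDFS client → DistributedFileSystem)
• 2. Create (DistributedFileSystem → Namenode)
• 3. Write (HDFS client → FSDataOutputStream)
• 4, 5: exchanged along the pipeline of datanodes
• 6. Close (HDFS client → FSDataOutputStream)
• 7. Complete (DistributedFileSystem → Namenode)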
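A rough sketch of the pipeline part of the write path, assuming steps 4 and 5 stand for packets being forwarded from datanode to datanode and acknowledgements travelling back, which is how the HDFS write pipeline is usually described. The class `PipelineDataNode`, the function `write_block`, and the tiny packet size are hypothetical and chosen only for illustration.

```python
class PipelineDataNode:
    """Toy datanode that stores a packet, then forwards it downstream."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytearray

    def receive(self, block_id, packet, downstream):
        # Store the packet locally.
        self.blocks.setdefault(block_id, bytearray()).extend(packet)
        acks = []
        if downstream:
            # Forward the same packet to the next datanode in the pipeline.
            acks = downstream[0].receive(block_id, packet, downstream[1:])
        # Acknowledgements travel back: last datanode first, this one last.
        return acks + [self.name]


def write_block(pipeline, block_id, data, packet_size=4):
    """Send data packet by packet through the pipeline; return packets sent."""
    sent = 0
    for start in range(0, len(data), packet_size):
        packet = data[start:start + packet_size]
        acks = pipeline[0].receive(block_id, packet, pipeline[1:])
        if len(acks) != len(pipeline):
            raise IOError("packet was not acknowledged by every datanode")
        sent += 1
    return sent


pipeline = [PipelineDataNode(n) for n in ("dn1", "dn2", "dn3")]
packets_sent = write_block(pipeline, "blk_1", b"hello world")
print(packets_sent, [bytes(dn.blocks["blk_1"]) for dn in pipeline])
```

The client only ever talks to the first datanode; replication to the other two happens inside the pipeline.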
• Cluster Balancing
• Data Caching
Node failures
• Namenode failure
• All the files in the filesystem are lost
• Without the namenode metadata, the files cannot be reconstructed from their blocks
• Datanode failure
• Won’t be a problem
• Each data block is replicated on several machines
• A lost block can be recovered from another machine that holds a replica
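Recovery from a datanode failure can be sketched as follows. This is a toy model rather than HDFS code: `handle_datanode_failure`, the node names, and the target of three replicas per block are assumptions made for the example.

```python
def handle_datanode_failure(block_locations, live_nodes, failed, replication=3):
    """Drop the failed node and plan copies for under-replicated blocks.

    block_locations maps block id -> list of datanode names and is updated
    in place. Returns {block id: (source node, [target nodes])}.
    """
    live = [n for n in live_nodes if n != failed]
    plan = {}
    for block_id, holders in block_locations.items():
        survivors = [n for n in holders if n != failed]
        if not survivors:
            # No replica left anywhere: the block cannot be recovered.
            plan[block_id] = (None, [])
            block_locations[block_id] = []
            continue
        missing = max(0, replication - len(survivors))
        candidates = [n for n in live if n not in survivors]
        targets = candidates[:missing]
        block_locations[block_id] = survivors + targets
        if targets:
            # Copy from any surviving replica to the new holders.
            plan[block_id] = (survivors[0], targets)
    return plan


block_locations = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn4"],
}
plan = handle_datanode_failure(
    block_locations, ["dn1", "dn2", "dn3", "dn4"], failed="dn2"
)
print(plan)
print(block_locations)
```

Because every block still has at least one surviving replica, both blocks return to three copies without any data loss.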
Tackling Namenode failure
• If the namenode fails, all metadata is lost
• The files cannot be reconstructed from the blocks
• How to handle?
• For a very large cluster, the namenode may run out of memory to hold the metadata
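A back-of-the-envelope estimate shows why memory becomes the limit. The figure of roughly 150 bytes of namenode memory per file or block object is a commonly quoted rule of thumb, not a number from these slides, and the function name `namenode_memory_bytes` is hypothetical. Directories are ignored for simplicity.

```python
BYTES_PER_OBJECT = 150  # assumption: rule-of-thumb cost per file or block object


def namenode_memory_bytes(num_files, blocks_per_file,
                          bytes_per_object=BYTES_PER_OBJECT):
    """Estimate namenode memory: one object per file plus one per block."""
    objects = num_files + num_files * blocks_per_file
    return objects * bytes_per_object


# One million single-block files, then one billion.
for files in (1_000_000, 1_000_000_000):
    estimate = namenode_memory_bytes(files, blocks_per_file=1)
    print(f"{files:>13,} files -> about {estimate / 1e9:.1f} GB")
```

Under this assumption a million small files cost about 0.3 GB, while a billion would need about 300 GB on a single namenode.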
• Why keep the cluster balanced?
• HDFS works best when blocks are spread evenly across the cluster
• Example:
• In distcp, if -m is 1, a single map task does all the copying
• It will be slow
• It makes poor use of cluster resources
• Data caching: job schedulers try to run a task on the datanode where its block is cached
Filesystem Operations
• Major filesystem operations:
• reading files, creating directories, moving files, deleting data, and listing directories
hadoop fs -help
• Copying a file from the local filesystem to HDFS