
Big Data Processing

Jiaul Paik
Lecture 7
Storing Big Data in Cluster

Hadoop Distributed Filesystem


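HDFS (Hadoop) Architecture
[Figure: HDFS architecture; the namenode is the master node]
• The application uses the HDFS client to access a file such as /foo/bar
• HDFS client → namenode: (file name, block id)
• Namenode (holds the file namespace, e.g. /foo/bar → block 3df2) → HDFS client: (block id, block location)
• Namenode → datanodes: instructions to datanode
• Datanodes → namenode: datanode state
• HDFS client → datanode: (block id, byte range)
• Datanode → HDFS client: block data
• Each HDFS datanode stores its blocks on the local Linux file system

(Ghemawat et al., SOSP 2003)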
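The two lookups in the architecture above (file name to block ids, block id to block locations) can be sketched as a pair of in-memory maps. This is only a toy model, not Hadoop code: the class `ToyNamenode`, the block id `a91c`, and the datanode names `dn1` to `dn4` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNamenode:
    """Holds metadata only; block contents live on the datanodes."""
    # file name -> ordered list of block ids
    namespace: dict = field(default_factory=dict)
    # block id -> list of datanodes holding a replica
    locations: dict = field(default_factory=dict)

    def add_file(self, name, blocks):
        """Register a file; `blocks` maps block id -> list of datanode names."""
        self.namespace[name] = list(blocks)
        for block_id, nodes in blocks.items():
            self.locations[block_id] = list(nodes)

    def get_block_locations(self, name):
        """Answer a client lookup with (block id, datanodes) pairs in file order."""
        return [(b, self.locations[b]) for b in self.namespace[name]]


namenode = ToyNamenode()
# Hypothetical file with two blocks, each replicated on three datanodes.
namenode.add_file("/foo/bar", {"3df2": ["dn1", "dn2", "dn3"],
                               "a91c": ["dn2", "dn3", "dn4"]})
lookup = namenode.get_block_locations("/foo/bar")
print(lookup)
```

The client would then contact the listed datanodes directly for block data; the namenode never serves file contents itself.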


HDFS
[Figure: daemons running in a Hadoop cluster]
• Namenode: runs the namenode daemon
• Job submission node: runs the jobtracker
• Each slave node: runs a tasktracker and a datanode daemon on top of the Linux file system
HDFS
Reading and Writing
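Dataflow: Reading data from HDFS
[Figure: read dataflow between the HDFS client, the namenode, and the datanodes]
• 1: open (HDFS client → DistributedFileSystem)
• 2: get block locations (DistributedFileSystem → NameNode)
• 3: read (HDFS client → FSDataInputStream)
• 4: read (FSDataInputStream → first DataNode)
• 5: read (FSDataInputStream → next DataNode)
• 6: close (HDFS client → FSDataInputStream)

Adapted from: Hadoop: The Definitive Guide, 4th ed., Tom White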
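The read dataflow above can be mimicked with plain dictionaries. This is a rough sketch under stated assumptions, not the Hadoop client: `ToyDatanode`, `read_file`, the block ids, and the datanode names are all hypothetical. The sketch also assumes the client simply tries the next listed replica when a datanode is unavailable.

```python
class ToyDatanode:
    """Holds block contents; a stand-in for a real datanode."""

    def __init__(self, blocks):
        self.blocks = blocks  # block id -> bytes


def read_file(path, namespace, locations, datanodes):
    """Look up the blocks of `path`, then read each block from a live replica."""
    data = b""
    for block_id in namespace[path]:          # block list obtained from the namenode
        for node in locations[block_id]:      # try replicas in the order given
            dn = datanodes.get(node)
            if dn is not None and block_id in dn.blocks:
                data += dn.blocks[block_id]   # read the block from this datanode
                break
        else:
            raise IOError(f"no live replica for block {block_id}")
    return data                               # the caller closes the stream here


namespace = {"/foo/bar": ["b1", "b2"]}
locations = {"b1": ["dn1", "dn2"], "b2": ["dn1", "dn3"]}
# dn1 is deliberately missing to imitate a failed datanode.
datanodes = {
    "dn2": ToyDatanode({"b1": b"hello "}),
    "dn3": ToyDatanode({"b2": b"world"}),
}
contents = read_file("/foo/bar", namespace, locations, datanodes)
print(contents)
```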


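Writing data to HDFS
[Figure: write dataflow between the HDFS client, the namenode, and a pipeline of datanodes]
• 1. Create (HDFS client → DistributedFileSystem)
• 2. Create (DistributedFileSystem → NameNode)
• 3. Write (HDFS client → FSDataOutputStream)
• 4. Write packet (FSDataOutputStream → first DataNode, then forwarded along the pipeline of datanodes)
• 5. Ack packet (returned along the pipeline to FSDataOutputStream)
• 6. Close (HDFS client → FSDataOutputStream)
• 7. Complete (DistributedFileSystem → NameNode)

Adapted from: Hadoop: The Definitive Guide, 4th ed., Tom White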
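The packet pipeline in the write dataflow above (steps 4 and 5) can be approximated as follows. This is a simplified sketch, not the real protocol: the function names, the packet size, and the datanode names are hypothetical, and failures in the pipeline are not modelled.

```python
def write_packet(packet, pipeline, storage):
    """Store one packet on every datanode in pipeline order, then return the acks."""
    for node in pipeline:                       # step 4: each datanode stores and forwards
        storage.setdefault(node, []).append(packet)
    return list(reversed(pipeline))             # step 5: acks travel back toward the client


def write_file(data, packet_size, pipeline):
    """Split `data` into packets and push each one through the datanode pipeline."""
    storage = {}   # datanode name -> list of packets it holds
    acked = []     # packets acknowledged by every datanode
    for start in range(0, len(data), packet_size):
        packet = data[start:start + packet_size]
        acks = write_packet(packet, pipeline, storage)
        # A packet counts as written only once all datanodes have acknowledged it.
        if set(acks) == set(pipeline):
            acked.append(packet)
    return storage, acked


pipeline = ["dn1", "dn2", "dn3"]
data = b"abcdefghij"
storage, acked = write_file(data, 4, pipeline)
print(storage["dn3"])
```

Every datanode in the pipeline ends up with the same sequence of packets, which is what gives each block its replicas.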


Managing Hadoop: Other Key Issues
• Node failure

• HDFS federation (for namenode memory limits)

• Cluster Balancing

• Data Caching
Node failures
• Namenode failure
• All the files in the filesystem are effectively lost
• Without the namenode's metadata, files cannot be reconstructed from their blocks

• Datanode failure
• Usually not a problem
• Each data block is replicated on several machines
• A lost block can be recovered from a replica on another machine
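Recovery from a datanode failure can be sketched as re-replication of the affected blocks. This is an illustrative model only: the function name and node names are hypothetical, and the target of 3 replicas is an assumption (it matches the usual HDFS default, but the slides do not state a number).

```python
def handle_datanode_failure(failed, locations, live_nodes, replication=3):
    """Drop the failed node and top up under-replicated blocks from live nodes."""
    for block_id, nodes in locations.items():
        if failed in nodes:
            nodes.remove(failed)
        for candidate in live_nodes:
            if len(nodes) >= replication:
                break
            if candidate not in nodes:
                # Stands in for copying the block from a surviving replica.
                nodes.append(candidate)
    return locations


locations = {"b1": ["dn1", "dn2", "dn3"], "b2": ["dn2", "dn3", "dn4"]}
live_nodes = ["dn2", "dn3", "dn4", "dn5"]   # dn1 has failed
locations = handle_datanode_failure("dn1", locations, live_nodes)
print(locations)
```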
Tackling Namenode failure
• If the namenode fails, all metadata is lost
• The files can no longer be reconstructed from their blocks

• How to handle?

• Maintain a replica of the metadata on another, passive machine

• If the active namenode fails, start the passive namenode

• The passive namenode must load the namespace into memory before it can start serving requests
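The idea of keeping a metadata replica and loading it into memory on failover can be sketched like this. It is a toy model under stated assumptions: `ActiveNamenode`, `start_standby`, and the list `shared_log` (standing in for the replicated metadata) are hypothetical names, and real HDFS uses a more involved mechanism.

```python
class ActiveNamenode:
    """Keeps the namespace in memory and records every change in a shared log."""

    def __init__(self, shared_log):
        self.namespace = {}           # file name -> block ids (in memory only)
        self.shared_log = shared_log  # replica of the metadata on another machine

    def create(self, name, block_ids):
        self.namespace[name] = list(block_ids)
        self.shared_log.append(("create", name, list(block_ids)))


def start_standby(shared_log):
    """Rebuild the namespace in memory by replaying the replicated metadata."""
    namespace = {}
    for op, name, block_ids in shared_log:
        if op == "create":
            namespace[name] = list(block_ids)
    return namespace


shared_log = []
active = ActiveNamenode(shared_log)
active.create("/foo/bar", ["b1", "b2"])
active.create("/foo/baz", ["b3"])

# The active namenode fails; the passive one loads the namespace before serving.
standby_namespace = start_standby(shared_log)
print(standby_namespace)
```

The replay step is why failover is not instantaneous: the larger the namespace, the longer the passive namenode needs before it can answer requests.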


HDFS Federation

• The namenode keeps a reference to every file and block in the filesystem in memory

• For a very large cluster, the namenode may run out of memory to hold the metadata

• Solution: add more namenodes to the cluster, each managing a portion of the filesystem namespace
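With several namenodes, each request has to reach the namenode that owns the relevant part of the namespace. A minimal sketch of that routing, assuming each namenode is responsible for one directory prefix; the prefixes `/user` and `/share`, the names `nn1` and `nn2`, and the function `pick_namenode` are hypothetical.

```python
def pick_namenode(path, mount_table):
    """Return the namenode whose namespace prefix is the longest match for `path`."""
    matches = [
        prefix for prefix in mount_table
        if path == prefix or path.startswith(prefix.rstrip("/") + "/")
    ]
    if not matches:
        raise KeyError(f"no namenode manages {path}")
    return mount_table[max(matches, key=len)]


# Hypothetical split of the namespace between two namenodes.
mount_table = {"/user": "nn1", "/share": "nn2"}
owner = pick_namenode("/user/alice/data.txt", mount_table)
print(owner)
```

Each namenode then only needs enough memory for the files and blocks under its own prefix.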


HDFS Cluster Balancing
• When copying data into HDFS, it is important to keep data storage balanced across the cluster

• Why?
• HDFS works best when blocks are spread evenly across the datanodes

• Examples:
• In distcp, if -m is 1, a single map task does all the copying
• The copy will be slow
• Cluster resources are poorly utilized

• The default value of -m (the maximum number of map tasks) is 20 in Hadoop.
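The effect of m on the copy can be illustrated by spreading files over m copy tasks and looking at how much each task has to move. This is not distcp's actual splitting logic; the greedy assignment, the function name, and the file sizes are assumptions made only for illustration.

```python
def split_among_maps(file_sizes, m):
    """Greedily give each file to the currently least-loaded of m copy tasks."""
    loads = [0] * m
    for size in sorted(file_sizes, reverse=True):
        loads[loads.index(min(loads))] += size
    return loads


file_sizes = [400, 300, 200, 100]   # hypothetical file sizes in MB

one_task = split_among_maps(file_sizes, 1)
two_tasks = split_among_maps(file_sizes, 2)
print(one_task, two_tasks)
```

With m = 1 the single task carries the whole 1000 MB; with m = 2 the heaviest task carries half of that, so the copy can proceed in parallel.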


Block Caching
• Normally, a datanode reads blocks from disk

• Frequently accessed blocks can be cached in the datanode's RAM

• By default, a block is cached in only one datanode's memory

• Job schedulers try to run tasks on the datanode where the block is cached
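The scheduling preference described above can be sketched as a simple lookup. The function `choose_node`, the map `cached_on`, and the node names are hypothetical; a real scheduler weighs many more factors.

```python
def choose_node(block_id, replica_nodes, cached_on):
    """Prefer the datanode caching the block in RAM; otherwise use a disk replica."""
    cached = cached_on.get(block_id)
    if cached in replica_nodes:
        return cached, "memory"
    return replica_nodes[0], "disk"


replicas = {"b1": ["dn1", "dn2", "dn3"], "b2": ["dn2", "dn3", "dn4"]}
cached_on = {"b1": "dn3"}   # b1 is cached in dn3's memory; b2 is not cached

hot = choose_node("b1", replicas["b1"], cached_on)
cold = choose_node("b2", replicas["b2"], cached_on)
print(hot, cold)
```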
Filesystem Operations
• Major Filesystem operations:
• reading files, creating directories, moving files, deleting data, and listing
directories.

• One can run a Hadoop command from command line

• To see the details of every command:

hadoop fs -help
Filesystem Operations
• Copying a file from the local filesystem to HDFS (file-1 is the local source, file-2 is the HDFS destination)

hadoop fs -copyFromLocal file-1 file-2

• Copying a file from HDFS to the local filesystem

hadoop fs -copyToLocal source-file dest-file
Filesystem Operations
• Creating a directory
hadoop fs -mkdir mydir

• Listing the files

hadoop fs -ls
