Read and Write in HDFS


Design of HDFS:

1. The Hadoop Distributed File System (HDFS) was designed for Big Data processing.

2. Although it can support many users simultaneously, HDFS is not designed as a true parallel
file system.

3. Its design assumes a large file write-once/read-many model.

4. This enables other optimizations and relaxes many of the concurrency and coherence overhead
requirements of a true parallel file system.

5. The design of HDFS is based on the design of the Google File System (GFS).

6. HDFS is designed for data streaming where large amounts of data are read from disk in bulk.

7. The HDFS block size is typically 64 MB or 128 MB.

8. In addition, due to the sequential nature of the data, there is no local caching mechanism.

9. The most interesting aspect of HDFS is its data locality.

10. A principal design aspect of Hadoop MapReduce is the emphasis on moving the computation
to the data rather than moving the data to the computation.

11. This distinction is reflected in how Hadoop clusters are implemented.

12. HDFS is designed to work on the same hardware as the compute portion of the cluster.

13. That is, a single server node in the cluster is both a computation engine and a storage engine
for the application.

14. HDFS has a redundant design that can tolerate system failure and still provide the data
needed by the compute part of the program.

HDFS Concepts

Following are the various HDFS concepts:
A. Blocks:
1. A block is the minimum amount of data that can be read or written.
2. HDFS blocks are 128 MB by default, and this is configurable (see the sketch after this list).
3. Files in HDFS are broken into block-sized chunks, which are stored as independent units.
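As a minimal sketch of this configurability, the snippet below overrides the standard dfs.blocksize property from a Java client before creating a file. The NameNode address, file path, and 256 MB value are hypothetical placeholders.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // "dfs.blocksize" is normally set cluster-wide in hdfs-site.xml;
            // it can also be overridden per client, as here (256 MB).
            conf.setLong("dfs.blocksize", 256L * 1024 * 1024);
            // Hypothetical NameNode address and file path.
            FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
            // Files created through this handle are split into 256 MB blocks.
            fs.create(new Path("/user/demo/bigfile.dat")).close();
        }
    }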
B. NameNode:
1. NameNode is the centerpiece of HDFS.
2. NameNode is also known as the Master.
3. NameNode stores only the metadata of HDFS (the directory tree of all files in the filesystem)
and tracks the files across the cluster.
4. NameNode does not store the actual data of the dataset. The data itself is stored in the
DataNodes.
5. NameNode knows the list of blocks and their locations for any given file in HDFS. With this
information, NameNode knows how to reconstruct the file from its blocks (see the sketch after
this list).
6. NameNode is so critical to HDFS that when the NameNode is down, the Hadoop cluster is
inaccessible and considered down.
7. NameNode is a single point of failure in a Hadoop cluster.
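As a minimal illustrative sketch of this lookup (the NameNode address and file path are hypothetical), a client can ask the NameNode for the block locations of a file through the public FileSystem API:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
            // Hypothetical file path.
            FileStatus status = fs.getFileStatus(new Path("/user/demo/input.txt"));
            // The NameNode answers this query from its metadata: for each
            // block, the DataNodes holding a replica.
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
            }
        }
    }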

C. DataNode:
1. DataNode is responsible for storing the actual data in HDFS.
2. DataNode is also known as the Slave.
3. NameNode and DataNode are in constant communication.
4. When a DataNode starts up, it announces itself to the NameNode along with the list of blocks
it is responsible for (the sketch after this list shows how a client can list the registered DataNodes).
5. When a DataNode is down, it does not affect the availability of data or the cluster.
6. NameNode will arrange for the blocks managed by the unavailable DataNode to be replicated
to other DataNodes.
7. DataNode is usually configured with a lot of hard disk space, because the actual data is stored
in the DataNode.
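As a sketch of this master/slave relationship, and assuming the client-side FileSystem is backed by HDFS (so the cast below succeeds), the DataNodes registered with the NameNode can be listed via DistributedFileSystem.getDataNodeStats():

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

    public class ListDataNodes {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address.
            FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
            // getDataNodeStats() is specific to DistributedFileSystem.
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            for (DatanodeInfo node : dfs.getDataNodeStats()) {
                System.out.println(node.getHostName()
                    + " capacity=" + node.getCapacity());
            }
        }
    }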
Benefits of HDFS:
1. Scalable:
i. Hadoop is a highly scalable storage platform, because it can store and distribute very large data
sets across hundreds of inexpensive servers that operate in parallel.
ii. Unlike traditional relational databases, which cannot scale to process large amounts of data,
Hadoop enables businesses to run applications on thousands of nodes involving thousands of
terabytes of data.
2. Cost effective:
i. Hadoop also offers a cost-effective storage solution for businesses' exploding data sets.
ii. The problem with traditional relational database management systems is that it is extremely
expensive to scale them to such a degree in order to process such massive volumes of data.
3. Flexible:
i. Hadoop enables businesses to easily access new data sources and tap into different types of
data (both structured and unstructured) to generate value from that data.
ii. This means businesses can use Hadoop to derive valuable business insights from data sources
such as social media, email conversations or clickstream data.
4. Fast:
i. Hadoop's unique storage method is based on a distributed filesystem that basically 'maps' data
wherever it is located on a cluster.
ii. The tools for data processing are often on the same servers where the data is located, resulting
in much faster data processing.
5. Resilient to failure:
i. A key advantage of using Hadoop is its fault tolerance.
ii. When data is sent to an individual node, that data is also replicated to other nodes in the
cluster, which means that in the event of failure, there is another copy available for use.
Note: A namespace ensures that all of a given set of objects have unique names so that they can
be easily identified.
Read and Write Files in HDFS

1. Read Files:

Step 1: The client opens the file it wishes to read by calling open() on the FileSystem
object.

Step 2: The DistributedFileSystem (DFS) calls the name node to determine the locations of the first
few blocks in the file. For each block, the name node returns the addresses of the data nodes that
have a copy of that block. The DFS returns an FSDataInputStream to the client for it to read data
from.

Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the data
node addresses for the first few blocks in the file, then connects to the first (closest)
data node for the first block in the file.

Step 4: Data is streamed from the data node back to the client, which calls read() repeatedly on
the stream.

Step 5: When the end of the block is reached, DFSInputStream closes the connection to the
data node, then finds the best data node for the next block.

Step 6: When the client has finished reading the file, it calls close() on the
FSDataInputStream (a minimal Java sketch of these steps follows).
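As a minimal sketch of the read path, assuming a NameNode at hdfs://localhost:9000 and a hypothetical file path, the calls below correspond to the steps above; the block-by-block streaming is handled inside the stream returned by open():

    import java.io.InputStream;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical NameNode address and file path.
            String uri = "hdfs://localhost:9000/user/demo/input.txt";
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            InputStream in = null;
            try {
                // Steps 1-2: open() asks the name node for block locations
                // and returns an FSDataInputStream.
                in = fs.open(new Path(uri));
                // Steps 3-5: read() streams bytes from the closest data node
                // holding each block; moving between blocks happens internally.
                IOUtils.copyBytes(in, System.out, 4096, false);
            } finally {
                // Step 6: close() on the stream.
                IOUtils.closeStream(in);
            }
        }
    }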
2. Write Files:
Step 1: The client creates the file by calling create() on the DistributedFileSystem (DFS).
Step 2: DFS makes an RPC call to the name node to create a new file in the file system’s
namespace, with no blocks associated with it. The name node performs various checks to make
sure the file doesn’t already exist and that the client has the right permissions to create the file. If
these checks pass, the name node prepares a record of the new file; otherwise, the file can’t be
created. The DFS returns an FSDataOutputStream for the client to start writing data to the
file.
Step 3: As the client writes data, the DFSOutputStream splits it into packets, which it
writes to an internal queue called the data queue. The data queue is consumed by the
DataStreamer, which is responsible for asking the name node to allocate new blocks by picking a
list of suitable data nodes to store the replicas. The list of data nodes forms a pipeline. The
DataStreamer streams the packets to the first data node in the pipeline, which stores each
packet and forwards it to the second data node in the pipeline.

Step 4: Similarly, the second data node stores the packet and forwards it to the third (and last)
data node in the pipeline.
Step 5: The DFSOutputStream sustains an internal queue of packets that are waiting to be
acknowledged by data nodes, called an “ack queue”.
Step 6: When the client has finished writing data, it calls close() on the stream. This flushes
all the remaining packets to the data node pipeline and waits for acknowledgements before
contacting the name node to signal that the file is complete (a minimal Java sketch of the
write path follows).
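As a minimal sketch of the write path under the same assumptions (hypothetical NameNode address and target path), create(), the writes, and close() correspond to the steps above; the packet pipeline runs behind the scenes:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical NameNode address and target path.
            String uri = "hdfs://localhost:9000/user/demo/output.txt";
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            // Steps 1-2: create() makes the RPC call to the name node and
            // returns an FSDataOutputStream.
            FSDataOutputStream out = fs.create(new Path(uri));
            try {
                // Steps 3-5: written bytes are split into packets and pushed
                // through the data node pipeline behind the scenes.
                out.writeUTF("Hello, HDFS!");
            } finally {
                // Step 6: close() flushes the remaining packets and waits for
                // acknowledgements before the file is marked complete.
                out.close();
            }
        }
    }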
3. Store Files:
1. HDFS divides files into blocks and stores each block on a DataNode.
2. Multiple DataNodes are linked to the master node of the cluster, the NameNode.
3. The master node distributes replicas of these data blocks across the cluster.
4. It also tells the client where to locate the requested data.
5. Before the NameNode can help store and manage the data, the file first needs to be partitioned
into smaller, manageable data blocks.
6. This process is called data block splitting (a worked example follows below).
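As a worked example of block splitting, a hypothetical 500 MB file with the default 128 MB block size splits into four blocks: three full 128 MB blocks plus one 116 MB block.

    public class BlockSplitExample {
        public static void main(String[] args) {
            long fileSize = 500L * 1024 * 1024;   // a hypothetical 500 MB file
            long blockSize = 128L * 1024 * 1024;  // the 128 MB default block size
            long fullBlocks = fileSize / blockSize;                    // 3 full blocks
            long remainderMb = (fileSize % blockSize) / (1024 * 1024); // 116 MB left over
            long totalBlocks = fullBlocks + (remainderMb > 0 ? 1 : 0); // 4 blocks total
            System.out.println(totalBlocks + " blocks; last block is "
                    + remainderMb + " MB");
        }
    }

Note that HDFS does not pad the final block: the 116 MB last block occupies only 116 MB of storage, not a full 128 MB.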
