Explain Distributed File System and Features of Hadoop

Q) Explain Distributed File System?

A distributed file system (DFS) is a file system whose data is stored on one or more servers but is accessed and processed as if it were stored on the local client machine. A DFS makes it convenient to share information and files among users on a network in a controlled and authorized way.
Example
The following scenario compares how long it takes to read 1 terabyte (TB) of data on a single machine with 4 I/O channels, each providing about 100 MB/s of throughput, against reading the same data in a distributed environment.

A distributed file system provides reliability and availability through replication of the data. Replication ensures that the data is stored in multiple locations, so even if there is an issue with one particular copy, processing is not disturbed because the same file is available on another machine. Distributed environments also follow the principle of a single system image: all the systems in the distributed environment have the same characteristics, so the collection of machines appears to the user as one uniform system.

In the normal single-machine environment, the read operation on 1 TB of data takes roughly 45 minutes; performing the same task in a distributed environment takes only about 4.5 minutes. The same performance benefit carries through to all other key operations, such as searching.
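
A rough sketch of where these figures come from is below. It assumes 100 MB/s of throughput per I/O channel and, for the distributed case, 10 machines reading disjoint parts of the data in parallel; the machine count is not stated above, and the exact arithmetic comes out slightly under the rounded 45-minute and 4.5-minute figures quoted in the text.

public class ReadTimeEstimate {
    public static void main(String[] args) {
        double dataMB = 1_000_000.0;   // 1 TB expressed in MB (decimal units)
        double channelMBps = 100.0;    // assumed throughput of one I/O channel
        int channels = 4;              // I/O channels on the single machine

        // Single machine: 4 channels x 100 MB/s = 400 MB/s aggregate throughput
        double singleMachineMinutes = dataMB / (channels * channelMBps) / 60.0;
        System.out.printf("1 machine   : ~%.0f minutes%n", singleMachineMinutes);

        // Distributed case: assume 10 machines each read a disjoint share in parallel
        int machines = 10;
        System.out.printf("%d machines : ~%.1f minutes%n", machines, singleMachineMinutes / machines);
    }
}
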
Q) Explain the features of Apache Hadoop?

1. Open Source
Apache Hadoop is an open-source project, which means its code can be modified according to business requirements.
2. Distributed Processing
Because data is stored in a distributed manner in HDFS across the cluster, it is processed in parallel on the cluster of nodes.
3. Fault Tolerance
This is one of the most important features of Hadoop. By default, 3 replicas of each block are stored across the cluster, and the replication factor can be changed as required (a short sketch of changing it appears after this list). So if any node goes down, the data on that node can easily be recovered from other nodes. Failures of nodes or tasks are recovered automatically by the framework. This is how Hadoop is fault tolerant.
4. Reliability
Due to replication of data in the cluster, data is stored reliably on the cluster of machines despite machine failures. Even if a machine goes down, the data remains safely stored because of this characteristic of Hadoop.
5. High Availability
Data is highly available and accessible despite hardware failures because multiple copies of it exist. If a machine or some piece of hardware crashes, the data can still be accessed through another path.
6. Scalability
Hadoop is highly scalable in that new hardware can easily be added to existing nodes. It also provides horizontal scalability, which means new nodes can be added on the fly without any downtime.
7. Economic
Apache Hadoop is not very expensive, as it runs on a cluster of commodity hardware; no specialized machines are needed. Hadoop also provides large cost savings because it is very easy to add more nodes on the fly. So if requirements increase, you can add nodes without downtime and without much pre-planning.

8. Easy to use
The client does not need to deal with the details of distributed computing; the framework takes care of all of that. This is what makes Hadoop easy to use.
9. Data Locality
This is a unique feature of Hadoop that allows it to handle Big Data easily. Hadoop works on the data locality principle, which states: move the computation to the data instead of moving the data to the computation. When a client submits a MapReduce job, the algorithm is moved to the data in the cluster rather than bringing the data to the location where the job was submitted and processing it there.
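
As a minimal sketch of how the replication factor mentioned under Fault Tolerance can be changed, assuming the standard Hadoop FileSystem API; the file path and the value 2 are illustrative assumptions, not taken from the text above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication factor for new files written through this configuration;
        // the cluster-wide default normally lives in hdfs-site.xml as dfs.replication.
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(conf);

        // Change the replication factor of an existing file to 2.
        // The path below is a hypothetical example.
        Path file = new Path("/user/example/input/data.txt");
        fs.setReplication(file, (short) 2);

        fs.close();
    }
}

The same change can also be made from the command line with hdfs dfs -setrep.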
