
An Overview - Google File System (GFS) and Hadoop Distributed File System (HDFS)

Snehsudha Popatrao Dhage1, Tanpure Renuka Subhash2, Rutuja Vilas Kotkar2, Prachi Dattatray Varpe2, Sonali Sanjay Pardeshi1
1 Loknete Ramdas Patil Dhumal Arts, Science and Commerce College, Rahuri, Ahmednagar, India
2 PIRENS Institute of Computer Technology, Loni (Bk), Ahmednagar, India

Publication Info
Article history: Received: 16 February 2020; Accepted: 23 May 2020
Keywords: GFS, HDFS, MapReduce, Apache Hadoop, File System
*Corresponding author: Snehsudha Popatrao Dhage, e-mail: snehsudhadhage@gmail.com

Abstract
Every day a huge quantity of data is produced in text, audio, image, and video formats by many sources such as social media, YouTube, and websites. An essential part of handling this data is storing it in a file system, and such data-intensive workloads require cluster-based storage systems. The Hadoop Distributed File System (HDFS) and the Google File System (GFS) are the two major file systems used to store this data. This paper gives an overview of GFS and HDFS and also compares the two file systems.

1. INTRODUCTION

The major problem many organizations face nowadays is storing large amounts of data and processing it. Search engines handle petabytes of data per week, and large research and advertisement datasets are handled by websites like Yahoo [1]. In addition, many e-commerce companies handle large amounts of data and need high-capacity storage. YouTube streams an enormous number of videos per week and also needs to store this data. The first requirement of these applications is a cluster-based storage system that is reliable, highly available, and scalable.

HDFS and GFS are used to store such huge quantities of data. The Google File System was developed by Google in the early 2000s. Thousands of storage clusters are utilized by GFS, built from low-cost commodity components; these systems offer many users petabytes of storage for their varied requirements [2]. A key concern of the GFS designers was the consistency of a system exposed to hardware failures, application errors, system software errors, and human errors.

The important aspects of the GFS design are:
• Reliability and scalability
• File sizes from gigabytes to terabytes
• Operations such as file appends and random writes to a file
• Sequential read operations
• Users are not concerned about response time; they process the data in bulk.

Based on this analysis, several design choices were made:
• Files are divided into large chunks.
• An atomic file-append operation is provided that permits several applications to append to the same file concurrently.
• Clusters are built around a high-bandwidth rather than low-latency interconnection network. Control flow is separated from data flow; high-bandwidth data flow is achieved by pipelining data transfers over Transmission Control Protocol (TCP) connections to decrease response time, and the network topology is exploited by forwarding data to the nearest neighbouring node in the network.
• Client-side caching is eliminated, since streaming workloads gain little from caches and their absence simplifies the system.
• Critical operations are channelled through a master that monitors the full system, in order to guarantee consistency.
• The master's involvement in file-access operations is minimized to avoid hot-spot contention and to ensure scalability.
• Efficient checkpointing and fast recovery mechanisms are supported.
• Effective support is provided for garbage collection.

Google developed the MapReduce programming model [3][4] to meet the fast-growing demands of its web-search indexing process.
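The first design choice above, splitting a file into large fixed-size chunks, can be sketched in a few lines. This is only an illustration of the idea, not GFS code; the function name is hypothetical.

```python
# Minimal sketch (illustrative, not GFS code): splitting a file's bytes
# into fixed-size chunks, as GFS does with its 64 MB chunk size.
CHUNK_SIZE = 64 * 1024 * 1024  # GFS default chunk size: 64 MB

def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Return the list of fixed-size chunks covering `data`.

    Every chunk has exactly `chunk_size` bytes except possibly the last,
    which holds whatever remains.
    """
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

# Demo with a tiny chunk size so the example is easy to follow.
chunks = split_into_chunks(b"abcdefghij", chunk_size=4)
print(chunks)  # [b'abcd', b'efgh', b'ij']
```

Each chunk then becomes an independent unit of placement and replication across chunk servers.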

S-JPSET: Vol. 12, Issue Supplementary-1, ISSN: 2229-7111 (Print) and ISSN: 2454-5767 (Online), copyright © Samriddhi, 2010-2019

GFS is used to perform the MapReduce computations [5].

Hadoop is an open-source framework developed by drawing inspiration from the success of GFS and MapReduce [6]. It is used to build large clusters. HDFS follows the distributed file system design: it runs on commodity hardware and is extremely fault tolerant. The Hadoop file system was originally developed by Yahoo, but later it became an open-source framework. HDFS stores large amounts of data and offers fast, easy access. Data is stored in files, and the files are distributed across many machines. Files are stored redundantly to protect the system from data loss in case of failure. HDFS also makes applications amenable to parallel processing.

The features of HDFS are:
• Appropriate for distributed storage and processing.
• Hadoop offers a command interface for interacting with HDFS.
• It has two built-in servers, the namenode and the datanode, which help the user check the status of the cluster.
• It provides streaming access to file system data.
• HDFS offers file permissions and an authentication facility.

2. GOOGLE FILE SYSTEM

The Google File System (GFS) is organized as clusters of computers. A cluster is formed by a network of computers, and every cluster may contain thousands of machines. There are three main components in each GFS cluster: chunk servers, clients, and master servers. Figure 1 shows the general architecture of GFS.

Figure 1: General Architecture of GFS [7]

The client makes file requests, such as retrieving a file, manipulating an existing file, or creating a new file. A client can be a computer or a computer application; we can say that a client is a customer of the GFS.

The master server works as a coordinator for the cluster. It maintains the operation log, which records the cluster's activity. The operation log keeps service interruption to a minimum: in the event of a master server crash, a server that has been monitoring the operation log can be substituted for the crashed server. The master server also tracks metadata, which describes the chunks: the metadata tells the master server which file each chunk belongs to and where it fits in the overall file. At startup, the master server polls all the chunk servers in the cluster, and each chunk server replies with an inventory of its chunks. From that moment on, the master server keeps track of the chunk locations within the cluster.

At any one time there is one active master server per cluster, with multiple copies of its state kept so the cluster can deal with its failure. A single master server per cluster could become a bottleneck; to avoid this, GFS keeps the messages that the master server sends and receives very small.

The chunk servers, not the master server, actually handle the data; they are the workhorses of the system. The 64 MB file chunks are stored on the chunk servers, and they do not pass through the master server: a client requests chunks from a chunk server, and the chunk server sends the requested chunks directly. GFS copies each chunk several times and keeps the copies on different chunk servers; each copy is known as a replica. By default GFS creates three replicas of each chunk, but users can change this setting and create additional replicas if they want.

3. HADOOP DISTRIBUTED FILE SYSTEM

HDFS uses a master-slave architecture. The namenode, the datanodes, and blocks are the main components of HDFS. Figure 2 shows the architecture of HDFS.

Figure 2: HDFS Architecture [8]

The HDFS namespace contains the file and directory hierarchy. Files and directories are represented on the namenode by inodes, which record attributes such as access times, modifications, permissions, and disk-space quotas.

The namenode runs on commodity hardware: it comprises a GNU/Linux operating system and the namenode software. The system that hosts the namenode is the master server. The master server manages the file namespace and regulates file access; file operations such as renaming, opening, and closing a file or directory are executed by the master server.


Table 1: Comparison between HDFS and GFS

• Platform: HDFS is cross-platform, while GFS runs on Linux.
• Developer: HDFS was developed by Yahoo and is now an open-source framework; GFS was developed by Google.
• Servers: HDFS uses a namenode and datanodes; GFS uses a master node and chunk servers.
• Hardware used: both run on commodity hardware.
• Read and write: HDFS follows a write-once, read-many model; GFS is a multiple-read, multiple-write model.
• File deletion: in HDFS, deleted files are renamed into a specific folder and afterwards removed by garbage collection; in GFS, a deleted file is renamed into a hidden namespace, cannot be reclaimed instantly, and is deleted after three days if it is not in use.
• Network stack issue: no for HDFS; yes for GFS.
• Availability: HDFS uses a journal (edit log); GFS uses an operation log.
• Other operations: in HDFS only append is possible; GFS allows random file writes.
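The lazy-deletion behaviour described in the file-deletion row of Table 1 can be sketched as follows. This is a hypothetical helper, not GFS code: deletion is merely a rename into a hidden namespace, and garbage collection reclaims a file only after the grace period expires.

```python
# Sketch of lazy deletion with a grace period, in the spirit of the
# GFS policy from Table 1 (hypothetical helper, not real GFS code).
GRACE_PERIOD = 3 * 24 * 3600  # three days, in seconds

class Namespace:
    def __init__(self):
        self.files = {}    # visible name -> contents
        self.hidden = {}   # hidden name -> (contents, deletion time)

    def delete(self, name, now):
        # "Deletion" is just a rename; the data stays recoverable.
        self.hidden[".deleted/" + name] = (self.files.pop(name), now)

    def undelete(self, name):
        contents, _ = self.hidden.pop(".deleted/" + name)
        self.files[name] = contents

    def garbage_collect(self, now):
        # Reclaim only hidden files whose grace period has expired.
        expired = [n for n, (_, t) in self.hidden.items()
                   if now - t > GRACE_PERIOD]
        for n in expired:
            del self.hidden[n]
        return expired

ns = Namespace()
ns.files["a.txt"] = b"data"
ns.delete("a.txt", now=0)
print(ns.garbage_collect(now=3600))              # [] -- still recoverable
print(ns.garbage_collect(now=GRACE_PERIOD + 1))  # ['.deleted/a.txt']
```

Deferring reclamation this way makes deletion cheap and reversible, at the cost of temporarily holding storage for files nobody can see.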

The datanodes also run on commodity hardware with a GNU/Linux operating system and the datanode software. There is a datanode for each node in the cluster, and these nodes manage the storage attached to their system. A datanode performs read and write operations on the file system as per the client's request, and also performs block creation, deletion, and replication according to the instructions of the namenode.

User data is kept in the files of HDFS. Each file is divided into one or more segments, which are stored on individual datanodes. These segments are known as blocks; a block is the least quantity of data that HDFS reads or writes. The default block size is 64 MB, and it can be changed according to requirements.

Table 1 shows the contrast between HDFS and GFS.

4. CONCLUSION

This paper has given an overview of the Hadoop Distributed File System and the Google File System, and has also compared the two file systems. Google uses its own file system, GFS, and HDFS was inspired by GFS. Both file systems use a master-slave architecture. GFS works on the Linux platform, while HDFS works across platforms. GFS has two kinds of servers, the master node and the chunk servers, whereas HDFS has namenode and datanode servers.

5. REFERENCES

[1] GFS vs HDFS. (2020). Retrieved 22 August 2020, from https://sensaran.wordpress.com/2015/11/24/gfs-vs-hdfs/
[2] Bonner, S., Kureshi, I., Brennan, J., & Theodoropoulos, G. (2017). Exploring the Evolution of Big Data Technologies.
[3] Jeffrey Dean and Sanjay Ghemawat. (2004). MapReduce: Simplified Data Processing on Large Clusters. Google.
[4] Hadoop - HDFS Overview. (2020). Retrieved from https://www.tutorialspoint.com/hadoop/hadoop_hdfs_overview.htm
[5] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. (2003). The Google File System. Google.
[6] Hadoop wiki page documentation. Retrieved from http://wiki.apache.org/hadoop/
[7] Richa Pandey and S. P. Sah. (2016). A Review on Google File System. International Journal of Computer Science Trends and Technology, Volume 4, Issue 4.
[8] Anup Suresh Talwalkar. (2020). HadoopT - Breaking the Scalability Limits of Hadoop (Doctoral thesis, Rochester Institute of Technology, Rochester, New York). Retrieved from https://www.researchgate.net/publication/265403224_HadoopT_-Breaking_the_Scalability_Limits_of_Hadoop

