An Overview - Google File System (GFS) and Hadoop Distributed File System (HDFS)
Snehsudha Popatrao Dhage¹, Tanpure Renuka Subhash², Rutuja Vilas Kotkar², Prachi Dattatray Varpe², Sonali Sanjay Pardeshi¹
¹Loknete Ramdas Patil Dhumal Arts, Science and Commerce College, Rahuri, Ahmednagar, India
²PIRENS Institute of Computer Technology, Loni (Bk), Ahmednagar, India
Copyright © Samriddhi, 2010-2019. S-JPSET: Vol. 12, Issue Supplementary-1, ISSN: 2229-7111 (Print), ISSN: 2454-5767 (Online)
system is used to perform the MapReduce computations [5]. Hadoop is an open-source framework developed by drawing inspiration from the success of GFS and MapReduce [6]. It is used to build large clusters. HDFS follows the distributed file system design: it runs on commodity hardware and is extremely fault tolerant. The Hadoop file system was originally developed at Yahoo, but it later became an open-source framework. HDFS stores large amounts of data and offers fast, easy access. Data is stored in files, and the files are stored across many machines. The files are stored redundantly to protect the system from data loss in case of failure. HDFS also makes applications available for parallel processing.

The features of HDFS are:
• It is appropriate for distributed storage and processing.
• Hadoop offers a command interface to interact with HDFS.
• It has two built-in servers, the namenode and the datanode, which help the user check the status of the cluster.
• It provides streaming access to file system data.
• HDFS offers file permissions and authentication.

2. GOOGLE FILE SYSTEM

The Google File System (GFS) is a cluster of computers. A cluster is formed by a network of computers, and every cluster can contain thousands of machines. There are three main components in each Google File System: chunk servers, clients, and master servers. Figure 1 shows the general architecture of GFS.

Figure 1: General Architecture of GFS [7]

The client makes file requests such as retrieving a file, manipulating an existing file, or creating a new file. A client can be a computer or a computer application; we can say that a client is a customer of the GFS.

The master server works as a coordinator for the cluster. The master server maintains the operation log, which tracks the activity of the master's cluster. The operation log causes minimal service interruption: in the event of a master server crash, a server that monitors the operation log is substituted in place of the crashed server. The master server also tracks the metadata, which describes the chunks. The metadata tells the master server which file each chunk belongs to and where it fits in the overall file. At startup, the master server polls all the chunks in the cluster, and each chunk server replies with the contents of its inventory. From that moment on, the chunk locations within the cluster are tracked by the master server. There is one active master server per cluster at a time, but multiple copies of the master server exist to deal with failure. A single master server per cluster may become a bottleneck; to mitigate this, GFS keeps the messages sent and received by the master server very small.

The chunk servers, not the master server, actually handle the data; they are the workhorses of GFS. Files are stored in the chunk servers as 64 MB chunks. Chunk servers do not send chunks through the master server: the client requests chunks from a chunk server, and the chunk server sends the requested chunks directly. GFS copies each chunk many times and keeps the copies on different chunk servers; each copy is known as a replica. By default, GFS creates three replicas of each chunk, but users can change this setting and make as many additional replicas as they want.

3. HADOOP DISTRIBUTED FILE SYSTEM

HDFS uses a master-slave architecture. The namenode, the datanode, and the block are the main components of HDFS. Figure 2 shows the architecture of HDFS. The HDFS namespace contains the file and directory hierarchy. Files and directories are represented on the namenode by inodes, which record attributes such as access time, modification time, permissions, and disk space quotas.

The namenode runs on commodity hardware: it comprises a GNU/Linux operating system and the namenode software. The system that hosts the namenode is the master server. The master server manages the file namespace and regulates file access. File operations such as renaming, opening, and closing a file or directory are executed by the master server.

Figure 2: HDFS Architecture [8]