The Hadoop Distributed File System (HDFS) follows a master-worker architecture. The master is the NameNode, and the workers are DataNodes that run on low-cost commodity hardware. The DataNodes store the actual data. The architecture has a single NameNode and multiple DataNodes.
What is the task of DataNodes?
• DataNodes are the main storage for data. Hadoop uses low-cost hardware to store data.
• DataNodes store blocks and create, replicate, and delete them according to the instructions of the NameNode.
• DataNodes periodically send a health report (heartbeat) to the NameNode. The default interval is 3 seconds, so each DataNode reports to the NameNode every 3 seconds.
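The periodic health report described above can be sketched as a small table kept on the NameNode side. This is only an illustration, not Hadoop's implementation: the class name `HeartbeatMonitor` and the threshold `DEAD_AFTER_MISSED` are hypothetical, and only the 3-second interval comes from the text.

```python
from dataclasses import dataclass, field

HEARTBEAT_INTERVAL_S = 3.0  # default interval quoted in the text
DEAD_AFTER_MISSED = 10      # hypothetical threshold, chosen for illustration


@dataclass
class HeartbeatMonitor:
    """Toy NameNode-side table of the last heartbeat seen per DataNode."""

    last_seen: dict = field(default_factory=dict)

    def record(self, datanode_id: str, now: float) -> None:
        # Called whenever a heartbeat arrives from a DataNode.
        self.last_seen[datanode_id] = now

    def dead_nodes(self, now: float) -> list:
        # A node is presumed dead once it has been silent for longer
        # than DEAD_AFTER_MISSED heartbeat intervals.
        limit = HEARTBEAT_INTERVAL_S * DEAD_AFTER_MISSED
        return sorted(n for n, t in self.last_seen.items() if now - t > limit)


monitor = HeartbeatMonitor()
monitor.record("dn1", now=0.0)
monitor.record("dn2", now=0.0)
monitor.record("dn1", now=27.0)  # dn1 keeps reporting, dn2 goes silent
print(monitor.dead_nodes(now=31.0))
```

Timestamps are passed in explicitly rather than read from the clock, which keeps the sketch deterministic and easy to test.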
What is the Secondary NameNode?
• Secondary NameNode: a dedicated node that takes checkpoints of the file system. It is not a substitute for the primary NameNode. It helps the NameNode but cannot replace it.
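A checkpoint can be pictured as folding a log of recent namespace changes into a saved namespace image. The sketch below assumes a deliberately simplified model: the image is a dictionary keyed by path and the log is a list of `(operation, path)` pairs. The function name `apply_edits` and both data shapes are hypothetical and do not reflect Hadoop's on-disk formats.

```python
def apply_edits(fsimage: dict, edit_log: list) -> dict:
    """Return a new namespace image with the logged edits applied."""
    image = dict(fsimage)  # copy, so the old checkpoint stays intact
    for op, path in edit_log:
        if op == "create":
            image[path] = {"blocks": []}
        elif op == "delete":
            image.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return image


old_image = {"/logs/a.txt": {"blocks": []}}
edits = [("create", "/logs/b.txt"), ("delete", "/logs/a.txt")]
new_image = apply_edits(old_image, edits)
print(sorted(new_image))
```

Once a new image exists, the log entries it already covers no longer need to be replayed, which is why checkpointing helps the NameNode without making the checkpointing node a standby.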
GFS
Google File System (GFS or GoogleFS) is a proprietary distributed file system developed by Google to provide efficient, reliable access to data using large clusters of commodity hardware. The latest version of Google File System, codenamed Colossus, was released in 2010.
GFS is not implemented in the kernel of an operating system, but is instead provided as a userspace library.
The Need
• Component failures are the norm
– Due to clustered computing
• Files are huge
– By traditional standards (many TB)
• Most mutations are appends
– Not random access overwrite
• Co-designing apps & file system
• Typical: 1000 nodes & 300 TB
Desiderata
• Workload
• Familiar
– Create, delete, open, close, read, write
• Novel
– Snapshot
• Low cost
– Record append
[Figure: GFS architecture. Many clients exchange metadata only with the master and data only with the many chunk servers.]
Architecture
• Files are stored in fixed-size chunks
– 64 MB
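With fixed-size 64 MB chunks, translating a byte offset in a file into a chunk index is plain integer arithmetic. The sketch below shows that arithmetic only. The function names `locate` and `chunks_needed` are hypothetical and are not part of any GFS API.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size quoted above


def locate(byte_offset: int) -> tuple:
    """Map a file byte offset to (chunk index, offset inside that chunk)."""
    if byte_offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(byte_offset, CHUNK_SIZE)


def chunks_needed(file_size: int) -> int:
    """Number of chunks required to hold file_size bytes (ceiling division)."""
    if file_size < 0:
        raise ValueError("size must be non-negative")
    return -(-file_size // CHUNK_SIZE)


MB = 1024 * 1024
print(locate(200 * MB))         # chunk index and offset for byte 200 MB
print(chunks_needed(200 * MB))  # chunks needed for a 200 MB file
```

A client that knows the chunk size can compute the chunk index locally and needs to ask only for the location of that chunk.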
Architecture
Client
• GFS code implements API
• Cache only metadata
• Cloud File Storage is a method for storing data in the cloud
that provides servers and applications access to data through
shared file systems. This compatibility makes cloud file storage
ideal for workloads that rely on shared file systems and
provides simple integration without code changes.
Load balancing: the file system can distribute file access requests across multiple computers to improve performance and reliability.
Data replication: the file system can store copies of files on multiple computers to ensure that the files are available even if one of the computers fails.
Security: the file system can enforce access control policies to ensure that only authorized users can access files.
Scalability: the file system can support a large number of users and a large number of files.
Working of DFS
There are two ways in which DFS can be implemented:
Standalone DFS namespace - It allows only those DFS roots that exist on the local computer and are not using Active Directory.
Domain-based namespace - It stores the configuration of DFS in Active Directory, making the DFS namespace root accessible.
Applications
NFS (Network File System)
CIFS (Common Internet File System)
Hadoop
Advantages:
• DFS allows multiple users to access or store the data.
• It improves file availability, access time, and network efficiency.
Disadvantages:
• Nodes and connections in a distributed file system need to be secured, so security is at stake.
• Messages and data can be lost in the network while moving from one node to another.
• Database connections in a distributed file system are complicated.
Comparison of Features of GFS and HDFS
Features of GFS
GFS has several unique features that are tailored to Google’s
requirements.
Atomic Record Append: Allows multiple clients to append data to the same file without conflicts.
Snapshotting: Enables fast and consistent backups of the entire file system.
Garbage collection: Frees up space by deleting unused chunks and replicas.
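The atomic record append listed above can be illustrated with a single-process analogy in which the file, not the caller, picks the offset for each record. This is only a sketch: the class `AppendOnlyFile` is hypothetical, a lock stands in for the coordination GFS performs across machines, and the real system gives at-least-once semantics that may leave padding or duplicate records.

```python
import threading


class AppendOnlyFile:
    """Toy in-memory file where the file chooses each record's offset."""

    def __init__(self) -> None:
        self._data = bytearray()
        self._lock = threading.Lock()

    def record_append(self, record: bytes) -> int:
        # The whole record is written at an offset chosen under the lock,
        # so concurrent appends can never interleave their bytes.
        with self._lock:
            offset = len(self._data)
            self._data.extend(record)
            return offset

    def read(self, offset: int, length: int) -> bytes:
        return bytes(self._data[offset:offset + length])

    def size(self) -> int:
        return len(self._data)


f = AppendOnlyFile()
records = [f"rec-{i};".encode() for i in range(8)]
offsets = {}


def worker(rec: bytes) -> None:
    offsets[rec] = f.record_append(rec)


threads = [threading.Thread(target=worker, args=(r,)) for r in records]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every record can be read back intact from the offset it was given.
print(all(f.read(offsets[r], len(r)) == r for r in records))
```

The order of the records in the file depends on thread scheduling, but each record always occupies one contiguous range.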
Features of HDFS
HDFS also has some distinctive features that are useful for big data processing.
Replication Factor: Determines how many copies of each block should be stored across the cluster.
Block Placement Policy: Decides where to place those replicas based on factors like rack awareness and data locality.
Federation: Allows multiple NameNodes to manage different parts of the namespace and distribute the load.
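The block placement policy mentioned above can be sketched for a replication factor of 3. The sketch assumes the commonly described default: the first replica goes on the writer's node, the second on a node in a different rack, and the third on another node in the second replica's rack. The function `place_replicas` and the node-to-rack dictionary are hypothetical and are not Hadoop's API.

```python
import random


def place_replicas(writer, topology, replication=3, rng=None):
    """Pick nodes for one block; topology maps node name -> rack name."""
    rng = rng or random.Random()
    chosen = [writer]  # data locality: first replica on the writer's node
    writer_rack = topology[writer]

    # Rack awareness: the second replica goes to a different rack.
    remote = [n for n, r in topology.items() if r != writer_rack]
    if replication >= 2 and remote:
        second = rng.choice(remote)
        chosen.append(second)
        # Third replica: another node on the second replica's rack.
        peers = [n for n, r in topology.items()
                 if r == topology[second] and n not in chosen]
        if replication >= 3 and peers:
            chosen.append(rng.choice(peers))

    # Fall back to any unused nodes if the rules above came up short.
    others = [n for n in topology if n not in chosen]
    rng.shuffle(others)
    chosen.extend(others[:max(0, replication - len(chosen))])
    return chosen[:replication]


topology = {"n1": "rackA", "n2": "rackA", "n3": "rackB", "n4": "rackB"}
replicas = place_replicas("n1", topology, rng=random.Random(0))
print(replicas)
```

Spreading replicas over two racks protects against the loss of a whole rack, while keeping two of the three copies on one rack limits cross-rack write traffic.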