What is HDFS Architecture?

The Hadoop Distributed File System (HDFS) follows a master-worker architecture. The master is the NameNode, and the workers are the DataNodes, which run on low-cost commodity hardware and store the actual data. In this architecture there is a single NameNode and multiple DataNodes.
What is the task of DataNodes?
•DataNodes are the main storage for data. Hadoop uses low-cost hardware to store data.
•DataNodes are responsible for storing, replicating, creating, and deleting data according to the instructions of the NameNode.
•DataNodes periodically send a health report (heartbeat) to the NameNode. The default interval is 3 seconds.
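The heartbeat mechanism can be illustrated with a minimal sketch of the bookkeeping a NameNode might do. This is not Hadoop code: the class `HeartbeatMonitor`, its methods, and the 30-second `DEAD_AFTER_S` timeout are hypothetical names and values chosen for illustration; only the 3-second interval comes from the slide.

```python
import time
from dataclasses import dataclass, field

HEARTBEAT_INTERVAL_S = 3.0  # default interval stated on the slide
DEAD_AFTER_S = 30.0         # hypothetical timeout, chosen only for this sketch


@dataclass
class HeartbeatMonitor:
    """Toy NameNode-side record of when each DataNode last reported."""
    last_seen: dict = field(default_factory=dict)

    def record_heartbeat(self, datanode_id, now=None):
        # Store the time of the most recent heartbeat from this DataNode.
        self.last_seen[datanode_id] = time.monotonic() if now is None else now

    def dead_nodes(self, now=None):
        # A node counts as dead once it has been silent longer than the timeout.
        now = time.monotonic() if now is None else now
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > DEAD_AFTER_S)


monitor = HeartbeatMonitor()
monitor.record_heartbeat("dn1", now=0.0)
monitor.record_heartbeat("dn2", now=0.0)
monitor.record_heartbeat("dn1", now=27.0)  # dn1 keeps reporting, dn2 goes silent
print(monitor.dead_nodes(now=33.0))        # ['dn2']
```

Passing `now` explicitly keeps the example deterministic; a real monitor would rely on the clock.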
What is the Secondary NameNode?
•Secondary NameNode: The Secondary NameNode is a separate, dedicated node that takes checkpoints of the file system. It is not a substitute for the primary NameNode: it assists the NameNode but cannot replace it.
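A checkpoint can be pictured as folding a log of recent namespace changes into a saved snapshot of the namespace. The sketch below is a simplified model under that assumption, not the Hadoop implementation: the function names and the two-operation edit format are hypothetical, and the namespace is reduced to a set of paths.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to a mutable set of file paths."""
    op, path = edit
    if op == "create":
        namespace.add(path)
    elif op == "delete":
        namespace.discard(path)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(image, edit_log):
    """Merge the edit log into the saved image.

    Returns the new image and an empty log, since every logged
    edit is now reflected in the image.
    """
    new_image = set(image)
    for edit in edit_log:
        apply_edit(new_image, edit)
    return frozenset(new_image), []


image = frozenset({"/data/a.txt"})
edits = [("create", "/data/b.txt"), ("delete", "/data/a.txt")]
image, edits = checkpoint(image, edits)
print(sorted(image), edits)  # ['/data/b.txt'] []
```

Doing this merge on a separate node keeps the log short without making the NameNode itself pause to do the work, which is consistent with the slide's point that the Secondary NameNode helps but does not replace the NameNode.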
GFS

Google File System (GFS or GoogleFS) is a proprietary distributed file system developed by Google to provide efficient, reliable access to data using large clusters of commodity hardware. The latest version of Google File System, codenamed Colossus, was released in 2010.

GFS is not implemented in the kernel of an operating system, but is instead provided as a userspace library.
The Need

• Component failures normal
– Due to clustered computing
• Files are huge
– By traditional standards (many TB)
• Most mutations are appends
– Not random access overwrite
• Co-Designing apps & file system
• Typical: 1000 nodes & 300 TB
Desiderata

• Must monitor & recover from component failures
• Modest number of large files
• Workload
– Large streaming reads + small random reads
– Many large sequential writes
• Need semantics for concurrent appends
• Random access overwrites don’t need to be efficient
• High sustained bandwidth
– More important than low latency
Interface

• Familiar
– Create, delete, open, close, read, write
• Novel
– Snapshot
• Low cost
– Record append
• Atomicity with multiple concurrent writes
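The idea behind record append is that the file system, not the caller, chooses the offset, so concurrent writers never interleave inside a record. The sketch below models only that idea in a single process with a lock. `AppendOnlyFile` and its methods are hypothetical names, and real GFS semantics (replication, padding, retries) are not modeled.

```python
import threading


class AppendOnlyFile:
    """In-memory file whose appends are atomic with respect to each other."""

    def __init__(self):
        self._data = bytearray()
        self._lock = threading.Lock()

    def record_append(self, record: bytes) -> int:
        # The file picks the offset; the whole record lands contiguously.
        with self._lock:
            offset = len(self._data)
            self._data += record
            return offset

    def read(self, offset: int, length: int) -> bytes:
        with self._lock:
            return bytes(self._data[offset:offset + length])

    def __len__(self):
        with self._lock:
            return len(self._data)


log = AppendOnlyFile()
results = []  # (offset, record) pairs reported back to the writers


def writer(tag):
    for i in range(100):
        record = f"{tag}{i:03d};".encode()
        results.append((log.record_append(record), record))


threads = [threading.Thread(target=writer, args=(t,)) for t in "abcd"]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every record is intact at the offset the file returned for it.
print(all(log.read(off, len(rec)) == rec for off, rec in results))
```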


Architecture

[Figure: GFS architecture. Many clients exchange metadata only with the master and data only with the many chunk servers.]
Architecture

• Chunk servers store all files
– In fixed-size chunks
• 64 MB
• 64-bit unique handle
• Triple redundancy
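The fixed 64 MB chunk size and triple redundancy lead to some simple arithmetic, sketched below. The function names are hypothetical, and the raw-storage figure is a simplification that just multiplies the file size by the replica count.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as on the slide
REPLICAS = 3                   # triple redundancy


def chunk_count(file_size: int) -> int:
    """Number of chunks needed for a file (ceiling division)."""
    return -(-file_size // CHUNK_SIZE)


def locate(offset: int) -> tuple:
    """Map a byte offset in a file to (chunk index, offset within chunk)."""
    return divmod(offset, CHUNK_SIZE)


def raw_storage(file_size: int) -> int:
    """Bytes stored across the cluster once every chunk is replicated."""
    return file_size * REPLICAS


one_gib = 1024 ** 3
print(chunk_count(one_gib))   # 16
print(locate(200_000_000))    # (2, 65782272)
print(raw_storage(one_gib))   # 3221225472
```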
Architecture

• Master stores all metadata
– Namespace
– Access-control information
– Chunk locations
– ‘Lease’ management
• Heartbeats
• Having one master gives global knowledge
– Allows better placement / replication
– Simplifies design
Architecture

• GFS client code implements the API
• Clients cache only metadata
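The division of labor among master, chunk servers, and clients can be sketched as a read path: the client asks the master where a chunk lives, caches that answer, and then fetches the bytes from a chunk server. All class and method names below are hypothetical, everything runs in one process, and reads that span two chunks are not handled.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks


class Master:
    """Holds metadata only: (path, chunk index) -> (handle, server names)."""

    def __init__(self):
        self.chunk_table = {}
        self.lookups = 0  # counts how often clients had to ask

    def lookup(self, path, chunk_index):
        self.lookups += 1
        return self.chunk_table[(path, chunk_index)]


class ChunkServer:
    """Holds data only: chunk handle -> bytes."""

    def __init__(self):
        self.chunks = {}

    def read(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]


class Client:
    def __init__(self, master, servers):
        self.master = master
        self.servers = servers
        self.metadata_cache = {}  # the client caches metadata, never file data

    def read(self, path, offset, length):
        index, within = divmod(offset, CHUNK_SIZE)
        key = (path, index)
        if key not in self.metadata_cache:
            self.metadata_cache[key] = self.master.lookup(path, index)
        handle, locations = self.metadata_cache[key]
        # Data flows directly from a chunk server, not through the master.
        return self.servers[locations[0]].read(handle, within, length)


master = Master()
cs1 = ChunkServer()
cs1.chunks[0x1] = b"hello gfs"
master.chunk_table[("/logs/a", 0)] = (0x1, ["cs1", "cs2", "cs3"])

client = Client(master, {"cs1": cs1})
print(client.read("/logs/a", 0, 5))  # b'hello'
print(client.read("/logs/a", 6, 3))  # b'gfs'
print(master.lookups)                # 1, the second read used the cache
```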
Cloud File System

• Cloud File Storage is a method for storing data in the cloud
that provides servers and applications access to data through
shared file systems. This compatibility makes cloud file storage
ideal for workloads that rely on shared file systems and
provides simple integration without code changes.

• A Cloud File System is a hierarchical storage system in the
cloud that provides shared access to file data. Users can
create, delete, modify, read, and write files, as well as organize
them logically in directory trees for intuitive access.

• Cloud File Sharing is a service that provides simultaneous
access for multiple users to a common set of files stored in the
cloud. Security for online file storage is managed with user and
group permissions so that administrators can control access to
the shared file data.
• A Cloud File System is a file system that creates a hub
and spoke method of distributing data. The “hub” is the
central storage area, typically located at a public cloud
provider like Amazon AWS, Microsoft Azure or Google
Cloud. At each spoke a software or hardware appliance is
installed and it acts as a cache for that location’s most
active data.

• A file system in the cloud is exactly what it sounds like. The
vendor creates a file system that offers traditional file
protocols like NFS or SMB to cloud hosted applications.

• The goal with these file systems is to speed the migration
of applications to the cloud. By using a file system in the
cloud the organization does not need to rewrite the
storage I/O components of the application.
Distributed File System

• Distributed file systems allow users of distributed computers to share data and storage resources.
• The goal is to present transparency to the users and the system.
• The main purpose of the Distributed File System is to allow users of physically distributed systems to share their data and resources by using a Common File System.
• Distributed File Systems support the sharing of information in the form of files and hardware resources.
• DFS has two components:
1. Location Transparency –
Location transparency is achieved through the namespace component.
2. Redundancy –
Redundancy is achieved through a file replication component.
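The two components can be sketched together: a namespace maps one logical path to several physical copies, and a lookup returns whichever copy is on a reachable server. The names below (`Namespace`, `publish`, `resolve`, the server and path strings) are hypothetical and only illustrate the idea.

```python
class Namespace:
    """Maps a logical path to the physical replicas that hold the file."""

    def __init__(self):
        self._replicas = {}  # logical path -> list of (server, physical path)

    def publish(self, logical_path, replicas):
        self._replicas[logical_path] = list(replicas)

    def resolve(self, logical_path, is_up):
        # Location transparency: the caller only ever names the logical path.
        # Redundancy: if one server is down, another replica is returned.
        for server, physical_path in self._replicas[logical_path]:
            if is_up(server):
                return server, physical_path
        raise LookupError(f"no reachable replica for {logical_path}")


ns = Namespace()
ns.publish("/shared/report.txt",
           [("srv-a", "/disk1/r.txt"), ("srv-b", "/data/r.txt")])

print(ns.resolve("/shared/report.txt", lambda s: s in {"srv-a", "srv-b"}))
print(ns.resolve("/shared/report.txt", lambda s: s in {"srv-b"}))  # srv-a is down
```

The user's path stays the same in both calls even though the bytes come from different machines.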
Properties
• File transparency: users can access files without knowing where they are physically stored on the network.
• Load balancing: the file system can distribute file access requests across multiple computers to improve performance and reliability.
• Data replication: the file system can store copies of files on multiple computers to ensure that the files are available even if one of the computers fails.
• Security: the file system can enforce access control policies to ensure that only authorized users can access files.
• Scalability: the file system can support a large number of users and a large number of files.

Working of DFS
There are two ways in which DFS can be implemented:
• Standalone DFS namespace - It supports only DFS roots that exist on the local computer and do not use Active Directory.
• Domain-based namespace - It stores the configuration of DFS in Active Directory, making the DFS namespace root accessible.
Applications
• NFS (Network File System)
• SMB (Server Message Block)
• CIFS (Common Internet File System)
• Hadoop
Advantages:
• DFS allows multiple users to access or store the data.
• It allows the data to be shared remotely.
• It improves the availability of files, access time, and network efficiency.
Disadvantages:
• In a Distributed File System, nodes and connections need to be secured, so security is at stake.
• Messages and data may be lost in the network while moving from one node to another.
• Database connection in a Distributed File System is complicated.
Comparison of Features of GFS and HDFS
Features of GFS
GFS has several unique features that are tailored to Google’s
requirements.
Atomic Record Append : Allows multiple clients to append data to the
same file without conflicts.
Snapshotting : Enables fast and consistent backups of the entire file
system.
Garbage collection : Frees up space by deleting unused chunks and
replicas.
Features of HDFS
HDFS also has some distinctive features that are useful for big data
processing.
Replication Factor: Determines how many copies of each block should be
stored across the cluster.
Block Placement Policy : Decides where to place those replicas based
on factors like rack awareness and data locality.
Federation: Allows multiple NameNodes to manage different parts of the
namespace and distribute the load.
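Replication factor and rack-aware placement can be sketched as a small function. It follows the commonly described HDFS default for three replicas (first on the writer's node, second on a node in a different rack, third on another node in that second rack), but it is an illustration rather than Hadoop's code: `place_replicas` and the topology dictionary are hypothetical, and candidates are picked in sorted order instead of randomly so the output is reproducible.

```python
def place_replicas(writer, topology, replication_factor=3):
    """Choose DataNodes for one block.

    topology maps each node name to its rack name; writer must be a key.
    """
    chosen = [writer]  # data locality: first replica on the writer's node
    if replication_factor >= 2:
        remote = sorted(n for n, rack in topology.items()
                        if rack != topology[writer])
        if not remote:
            raise ValueError("need a second rack for rack-aware placement")
        chosen.append(remote[0])  # survives the loss of the writer's rack
    if replication_factor >= 3:
        second_rack = topology[chosen[1]]
        peers = sorted(n for n, rack in topology.items()
                       if rack == second_rack and n not in chosen)
        if not peers:
            raise ValueError("need two nodes in the remote rack")
        chosen.append(peers[0])  # same rack as the second replica
    # Any further replicas go to whichever nodes are still unused.
    rest = sorted(n for n in topology if n not in chosen)
    chosen.extend(rest[:max(0, replication_factor - len(chosen))])
    return chosen[:replication_factor]


topology = {"dn1": "rackA", "dn2": "rackA", "dn3": "rackB", "dn4": "rackB"}
print(place_replicas("dn1", topology))  # ['dn1', 'dn3', 'dn4']
```

With this layout the block survives the failure of either rack, while two of the three copies share a rack to limit cross-rack write traffic.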
