
Hadoop is an open-source solution for processing large data sets in a distributed computing environment. It is based on the Google File System (GFS).

 HDFS provides a storage layer for Hadoop. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4,500 servers, supporting close to a billion files and blocks (Cloudera).

 HDFS is based on a master-slave architecture, with the Name Node (NN) as the master and the Data Nodes (DN) as the slaves.
 The NN stores only the metadata about the files; the actual data is stored on the DNs.

[Diagram: the Name Node (NN) on one physical machine; four Data Nodes (DN), each on its own physical machine]
 Neither the NN nor the DNs require high-end hardware; commodity servers suffice.
 The DNs use the underlying OS file system to read and write data.
 HDFS clients never send data through the NN, so the Name Node never becomes a bottleneck for any data I/O in the cluster.

Illustration of putting a local text file into the Hadoop cluster:
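As a minimal sketch of this step, the copy can be done with Hadoop's Java FileSystem API (the equivalent of the hdfs dfs -put shell command); the cluster address is assumed to be configured in core-site.xml, and the paths used here are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutFile {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (the Name Node address) from core-site.xml
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy a local text file into the cluster (hypothetical paths)
        fs.copyFromLocalFile(new Path("/tmp/local.txt"),
                             new Path("/user/demo/data.txt"));
        fs.close();
    }
}
```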

The metadata saved on the Name Node includes, e.g., the replication factor of 3.

• Place the 1st replica somewhere – on the writer's node if it is inside the cluster, otherwise on a random node.
• Place the 2nd replica in a different rack.
• Place the 3rd replica in the same rack as the second replica.
• If there are more replicas, spread them across the rest of the racks.
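A small sketch of how the replication factor, being NN metadata, can be inspected and changed through the Java FileSystem API (the path is hypothetical; where the replicas land follows the placement policy above and is handled by the Name Node):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationInfo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/data.txt"); // hypothetical path

        // The replication factor is part of the metadata kept by the NN
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Replication: " + status.getReplication());

        // Ask the NN to keep 5 replicas instead; DNs copy blocks accordingly
        fs.setReplication(file, (short) 5);
        fs.close();
    }
}
```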

• HDFS is referred to as a WORM (Write Once, Read Many) filesystem, and one of its most important properties is IMMUTABILITY.
• Data in HDFS basically cannot be updated in place.
• Mutable objects present several problems, e.g., dealing with concurrency requires additional programming to ensure an object is only updated by a single source at a time.

• Case example:
• Suppose you have loaded data into HDFS as data.v1. The next week you need to change part of the data. You load the modified data into HDFS again and keep it as the latest version, data.v2. Later you can retire the old version of the data.
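A minimal sketch of this versioning pattern with the Java FileSystem API, assuming data.v2 has already been loaded (the directory layout is hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RetireOldVersion {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // data.v2 has already been loaded alongside data.v1 (hypothetical paths)
        Path oldVersion = new Path("/data/dataset/data.v1");
        Path newVersion = new Path("/data/dataset/data.v2");

        // Readers switch to data.v2; data.v1 is never modified in place
        if (fs.exists(newVersion)) {
            fs.delete(oldVersion, true); // retire the old version (recursive)
        }
        fs.close();
    }
}
```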

 YARN manages distributed applications. It consists of a central Resource Manager (RM), which arbitrates all available cluster resources, and a per-node Node Manager (NM), which takes direction from the RM. The NM is responsible for managing the resources available on a single node.

• YARN is based on a master-slave architecture, with the Resource Manager (RM) as the master and the Node Managers (NM) as the slaves.
• The RM keeps metadata about which jobs are running on which NM and how much memory and CPU they consume (see the sketch below).
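As a sketch, this RM-side view can be queried with the YarnClient API, assuming a running cluster whose RM address is configured in yarn-site.xml:

```java
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterNodes {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration()); // reads the RM address from yarn-site.xml
        yarn.start();

        // The RM reports, per NM, the resources in use and the total capability
        for (NodeReport node : yarn.getNodeReports(NodeState.RUNNING)) {
            System.out.println(node.getNodeId()
                    + " used=" + node.getUsed()
                    + " capability=" + node.getCapability());
        }
        yarn.stop();
    }
}
```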

[Diagram: the Resource Manager (RM) on one physical machine; four Node Managers (NM), each on its own physical machine]

• Jobs run on the NMs and never execute on the RM, so the RM never becomes a bottleneck for job execution.
• Neither the RM nor the NMs require high-end hardware.
• A container is a logical abstraction for CPU and RAM (see the sketch below).
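As a sketch of the container abstraction, a container request (normally issued by an ApplicationMaster through AMRMClient) just bundles RAM and CPU; the sizes below are arbitrary examples:

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ContainerSpec {
    public static void main(String[] args) {
        // A container is just "this much RAM + this many cores" (example sizes)
        Resource capability = Resource.newInstance(1024 /* MB */, 2 /* vcores */);

        // Ask the RM for a container anywhere in the cluster at this priority
        ContainerRequest request =
                new ContainerRequest(capability, null, null, Priority.newInstance(1));
        System.out.println("Requested: " + request.getCapability());
    }
}
```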

• The NN and RM are (preferably) on different machines.
• The DN and NM processes are co-located on the same hosts.
• A file is saved onto HDFS (the DNs); to process it in a distributed way, one writes a YARN application using the YARN client, and to read the data one uses the HDFS client (see the read sketch below).
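A minimal sketch of the HDFS-client side, reading a file back from the DNs (hypothetical path):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // The client asks the NN for block locations, then streams from the DNs
        try (FSDataInputStream in = fs.open(new Path("/user/demo/data.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}
```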

[Diagram: the Name Node and the Resource Manager each on their own physical machine; a co-located DN + NM pair on each of four worker machines]

• A distributed application can fetch the file locations (metadata from the NN) and ask the RM to provide containers on the hosts that hold the file blocks (see the sketch below).
• With the short-circuit read optimization provided by HDFS, if the distributed job gets a container on a host that holds the file block and reads it, the read is local and not over the network.
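A sketch of the first step: fetching block locations (NN metadata) so that containers can be requested on the hosts holding the blocks; the path is hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockHosts {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/data.txt"));

        // The NN returns, per block, the DN hosts holding a replica;
        // these host names are what the application asks the RM for
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```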

• The same file, which read sequentially would take 4 seconds (at 100 MB/sec), can be read in 1 second, because the distributed process runs in parallel in different YARN containers (on the NMs), reading 100 MB/sec × 4 = 400 MB in one second.

 Zookeeper provides a distributed configuration service, a synchronization service, and a naming registry for distributed systems. It is used by many daemons (including YARN's) to manage their peers in a multi-node setup for high availability.
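A minimal sketch of the kind of coordination primitive ZooKeeper provides: an ephemeral znode that registers a peer and vanishes if the peer dies. The connection string and paths are hypothetical, and the parent path is assumed to exist:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class RegisterPeer {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper ensemble (hypothetical address);
        // a robust client would wait for the connection event first
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 5000, event -> { });

        // An EPHEMERAL znode disappears automatically if this peer dies,
        // which is how daemons detect failed peers for high availability
        zk.create("/app/peers/node-1", "host1".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        Thread.sleep(10000); // the znode exists while this session is alive
        zk.close();
    }
}
```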

 Other components are self-described, as shown below.
