2b Hadoop Ecosystem PDF
HDFS provides the storage layer for Hadoop. It has demonstrated
production scalability of up to 200 PB of storage and a single
cluster of 4,500 servers, supporting close to a billion files
and blocks (Cloudera).
HDFS is based on a master-slave architecture, with the Name Node (NN)
as the master and the Data Nodes (DN) as the slaves.
The NN stores only the meta information about the files; the actual
data is stored on the DNs.
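The split between metadata and data can be sketched with a toy model in Python. This is only an illustration, not the HDFS API: the class names, the put and get helpers, the 4-byte block size, and the round-robin placement are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Holds the actual block bytes (a stand-in for files on the local OS file system)."""
    name: str
    blocks: dict = field(default_factory=dict)  # block_id -> bytes


@dataclass
class NameNode:
    """Holds only metadata: which blocks make up a file and where they live."""
    block_map: dict = field(default_factory=dict)  # path -> [(block_id, datanode_name)]
    bytes_received: int = 0  # stays 0: no file data ever passes through the NN

    def allocate(self, path, num_blocks, datanode_names):
        # Round-robin placement is a simplification for illustration only.
        plan = [(f"{path}#blk{i}", datanode_names[i % len(datanode_names)])
                for i in range(num_blocks)]
        self.block_map[path] = plan
        return plan


def put(namenode, datanodes, path, data, block_size=4):
    """Client-side write: ask the NN where to write, then send bytes to the DNs directly."""
    chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    plan = namenode.allocate(path, len(chunks), sorted(datanodes))
    for (block_id, dn_name), chunk in zip(plan, chunks):
        datanodes[dn_name].blocks[block_id] = chunk


def get(namenode, datanodes, path):
    """Client-side read: look up block locations on the NN, fetch bytes from the DNs."""
    return b"".join(datanodes[dn].blocks[blk] for blk, dn in namenode.block_map[path])


nn = NameNode()
dns = {name: DataNode(name) for name in ("dn1", "dn2", "dn3", "dn4")}
put(nn, dns, "/data/sample.txt", b"hello hadoop")
print(get(nn, dns, "/data/sample.txt"))   # b'hello hadoop'
print(nn.bytes_received)                  # 0
```

In this sketch the NN only ever sees paths, block ids, and DN names, while the file contents travel between the client and the DNs.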
[Figure: a Name Node on a physical machine and four Data Nodes (DN).]
Neither the NN nor the DNs need super fancy hardware.
The DN uses the underlying OS file system to read and write data.
HDFS clients never send data to the NN, so the Name Node never
becomes a bottleneck for data I/O in the cluster.
Illustration of putting a local .txt file into the Hadoop cluster
The meta info saved on the Name Node, e.g., a replication factor of 3
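As a rough worked example of what a replication factor of 3 implies, the Python sketch below computes the storage footprint of a file and picks three distinct Data Nodes per block. The 128 MB block size is an assumption (it is the default in recent Hadoop releases, but it is configurable), and the rotation in place_replicas is a stand-in for illustration, not HDFS's actual placement policy.

```python
import math


def storage_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Return (number of blocks, number of block replicas, raw MB stored on Data Nodes)."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    raw_mb = file_size_mb * replication  # the last block only occupies its actual size
    return blocks, replicas, raw_mb


def place_replicas(block_index, datanodes, replication=3):
    """Pick `replication` distinct Data Nodes for one block (simple rotation, not the real policy)."""
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds the number of Data Nodes")
    start = block_index % len(datanodes)
    return [datanodes[(start + i) % len(datanodes)] for i in range(replication)]


print(storage_footprint(300))                             # (3, 9, 900)
print(place_replicas(0, ["dn1", "dn2", "dn3", "dn4"]))     # ['dn1', 'dn2', 'dn3']
print(place_replicas(3, ["dn1", "dn2", "dn3", "dn4"]))     # ['dn4', 'dn1', 'dn2']
```

A 300 MB file therefore becomes 3 blocks, 9 block replicas, and about 900 MB of raw storage across the Data Nodes.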
• HDFS is referred to as a WORM (Write Once, Read Many)
filesystem, and one of its most important properties is
immutability.
• Data in HDFS basically cannot be updated in place.
• Mutable objects present several problems, e.g., handling
concurrency requires additional programming to ensure that an
object is updated by only a single source at a time.
• Case example: you have loaded the data into HDFS as data.v1.
The next week, you need to change part of the data. You can load
the modified data into HDFS again as data.v2 and keep it as the
latest version. Later, you can retire the old version of the data.
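The versioning workflow in the case example can be sketched with a toy write-once store in Python. WriteOnceStore and the /warehouse paths are hypothetical names for illustration; the class only models the write-once rule described above and is not an HDFS client.

```python
class WriteOnceStore:
    """Toy write-once, read-many store: paths can be created, read, and deleted, never updated."""

    def __init__(self):
        self._files = {}

    def put(self, path, data):
        if path in self._files:
            raise FileExistsError(f"{path} already exists and cannot be overwritten")
        self._files[path] = bytes(data)

    def read(self, path):
        return self._files[path]

    def delete(self, path):
        del self._files[path]

    def listing(self):
        return sorted(self._files)


store = WriteOnceStore()
store.put("/warehouse/data.v1", b"id,price\n1,10\n")

# Next week part of the data changes: build the new version outside the store
# and load it under a new name instead of editing data.v1 in place.
updated = store.read("/warehouse/data.v1").replace(b"1,10", b"1,12")
store.put("/warehouse/data.v2", updated)

store.delete("/warehouse/data.v1")   # retire the old version later
print(store.listing())               # ['/warehouse/data.v2']
```

Because no path is ever modified after it is written, readers of data.v1 are never affected by the preparation of data.v2, which is the concurrency benefit the bullets point to.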
YARN manages distributed applications. It consists of a central
Resource Manager (RM), which arbitrates all available
cluster resources, and a per-node Node Manager (NM),
which takes direction from the RM. The NM is responsible
for managing the available resources on a single node.
• YARN is based on a master-slave architecture, with the Resource Manager (RM)
as the master and the Node Managers (NM) as the slaves.
• The RM keeps the meta info about which jobs are running on which NM
and how much memory and CPU they consume.
[Figure: a Resource Manager on a physical machine and four Node Managers (NM).]
• Jobs run on the NMs and never execute on the RM, so the RM
never becomes a bottleneck for job execution.
• Neither the RM nor the NMs need super fancy hardware.
• A container is a logical abstraction for CPU and RAM.
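The roles above can be sketched as a toy model in Python, in which a container is just a bundle of CPU and RAM granted on one node. The class names, the vcores and mem_mb fields, and the first-fit strategy are assumptions for illustration; this is not the YARN API or its scheduler.

```python
from dataclasses import dataclass


@dataclass
class NodeManager:
    """Tracks the free resources of a single node."""
    name: str
    free_vcores: int
    free_mem_mb: int


@dataclass(frozen=True)
class Container:
    """A logical bundle of CPU and RAM granted on one node."""
    node: str
    vcores: int
    mem_mb: int


class ResourceManager:
    """Arbitrates cluster resources; it hands out containers but runs no job code itself."""

    def __init__(self, node_managers):
        self.node_managers = list(node_managers)
        self.running = {}  # job_id -> list of Container (the RM's meta info)

    def allocate(self, job_id, vcores, mem_mb):
        # First-fit is a simplification; real schedulers are more elaborate.
        for nm in self.node_managers:
            if nm.free_vcores >= vcores and nm.free_mem_mb >= mem_mb:
                nm.free_vcores -= vcores
                nm.free_mem_mb -= mem_mb
                container = Container(nm.name, vcores, mem_mb)
                self.running.setdefault(job_id, []).append(container)
                return container
        return None  # no node currently has enough free CPU and RAM


rm = ResourceManager([NodeManager("nm1", 4, 8192), NodeManager("nm2", 4, 8192)])
print(rm.allocate("job-1", vcores=3, mem_mb=6144))  # fits on nm1
print(rm.allocate("job-2", vcores=2, mem_mb=4096))  # nm1 is too full, goes to nm2
print(rm.allocate("job-3", vcores=8, mem_mb=1024))  # None: no node has 8 free vcores
```

The RM in this sketch only does bookkeeping (which job holds which container on which node), which mirrors why it does not become a bottleneck for job execution.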
• The NN and RM are (preferably) on different machines.
• The DN and NM processes are co-located on the same host.
• A file is saved onto HDFS (on the DNs). To access a file in a distributed way,
one can write a YARN application using the YARN client and read the data
using the HDFS client.
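One practical consequence of co-locating the DN and NM processes is that a task can often be run on a host that already stores the block it needs. The Python sketch below illustrates that idea under stated assumptions: plan_tasks, the block ids, and the host names are hypothetical, and the selection rule is a simplification rather than YARN's actual scheduling logic.

```python
def plan_tasks(block_locations, node_manager_hosts):
    """Assign one task per block, preferring a host that already stores that block.

    block_locations: block id -> list of hosts whose DN holds a replica
    node_manager_hosts: hosts that run an NM and can execute tasks
    """
    plan = {}
    for block_id, hosts in block_locations.items():
        local = [h for h in hosts if h in node_manager_hosts]
        # Fall back to any NM host (a remote read) if no replica host runs an NM.
        plan[block_id] = local[0] if local else sorted(node_manager_hosts)[0]
    return plan


# Hypothetical metadata, as an HDFS client might obtain it from the NN.
blocks = {
    "blk_0": ["host1", "host2", "host3"],
    "blk_1": ["host2", "host3", "host4"],
    "blk_2": ["host9"],            # replica on a host without an NM
}
nm_hosts = {"host1", "host2", "host3", "host4"}

print(plan_tasks(blocks, nm_hosts))
# {'blk_0': 'host1', 'blk_1': 'host2', 'blk_2': 'host1'}
```

Only blk_2 needs a remote read in this example, because its single replica lives on a host that runs no NM.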
[Figure: the Name Node and the Resource Manager on separate physical machines; each of four worker hosts runs a DN and an NM.]
The other components are self-explanatory, as shown below