
Hadoop is an open-source solution for processing large data sets in a distributed computing environment. It is based on the Google File System (GFS).

 HDFS provides a storage layer for Hadoop. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4,500 servers, supporting close to a billion files and blocks (Cloudera).

 HDFS is based on a master-slave architecture, with the Name Node (NN) as the master and the Data Nodes (DN) as the slaves.
 The NN stores only the metadata about the files; the actual data is stored on the DNs.

[Diagram: the Name Node (NN) on one physical machine; four Data Nodes (DN), each on its own physical machine]
 Neither the NN nor the DNs require high-end hardware; commodity servers suffice.
 The DNs use the underlying OS file system to read and write data.
 HDFS clients never send data through the NN, so the Name Node never becomes a bottleneck for any data I/O in the cluster.

Illustration of putting a local text file into the Hadoop cluster:
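As a minimal sketch of this step, the copy can be done with Hadoop's Java FileSystem API (the equivalent of the hdfs dfs -put shell command); the cluster address is assumed to be configured in core-site.xml, and the paths used here are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutFile {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (the Name Node address) from core-site.xml
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy a local text file into the cluster (hypothetical paths)
        fs.copyFromLocalFile(new Path("/tmp/local.txt"),
                             new Path("/user/demo/data.txt"));
        fs.close();
    }
}
```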

The metadata saved on the Name Node includes, e.g., the replication factor of 3.

• Place the 1st replica somewhere – on the writer's node if it is inside the cluster, otherwise on a random node.
• Place the 2nd replica in a different rack.
• Place the 3rd replica in the same rack as the second replica.
• If there are more replicas, spread them across the rest of the racks.
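A small sketch of how the replication factor, being NN metadata, can be inspected and changed through the Java FileSystem API (the path is hypothetical; where the replicas land follows the placement policy above and is handled by the Name Node):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationInfo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/data.txt"); // hypothetical path

        // The replication factor is part of the metadata kept by the NN
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Replication: " + status.getReplication());

        // Ask the NN to keep 5 replicas instead; DNs copy blocks accordingly
        fs.setReplication(file, (short) 5);
        fs.close();
    }
}
```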

• HDFS is referred to as a WORM (Write Once, Read Many) filesystem, and one of its most important properties is IMMUTABILITY.
• Data in HDFS basically cannot be updated in place.
• Mutable objects present several problems, e.g., dealing with concurrency requires additional programming to ensure an object is only updated by a single source at a time.

• Case example:
• Suppose you have loaded data into HDFS as data.v1. The next week you need to change part of the data. You load the modified data into HDFS again and keep it as the latest version, data.v2. Later you can retire the old version of the data.
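A minimal sketch of this versioning pattern with the Java FileSystem API, assuming data.v2 has already been loaded (the directory layout is hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RetireOldVersion {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // data.v2 has already been loaded alongside data.v1 (hypothetical paths)
        Path oldVersion = new Path("/data/dataset/data.v1");
        Path newVersion = new Path("/data/dataset/data.v2");

        // Readers switch to data.v2; data.v1 is never modified in place
        if (fs.exists(newVersion)) {
            fs.delete(oldVersion, true); // retire the old version (recursive)
        }
        fs.close();
    }
}
```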

 YARN manages distributed applications. It consists of a central Resource Manager (RM), which arbitrates all available cluster resources, and a per-node Node Manager (NM), which takes direction from the RM. The NM is responsible for managing the resources available on a single node.

• YARN is based on a master-slave architecture, with the Resource Manager (RM) as the master and the Node Managers (NM) as the slaves.
• The RM keeps metadata about which jobs are running on which NM and how much memory and CPU they consume (see the sketch below).
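As a sketch, this RM-side view can be queried with the YarnClient API, assuming a running cluster whose RM address is configured in yarn-site.xml:

```java
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterNodes {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration()); // reads the RM address from yarn-site.xml
        yarn.start();

        // The RM reports, per NM, the resources in use and the total capability
        for (NodeReport node : yarn.getNodeReports(NodeState.RUNNING)) {
            System.out.println(node.getNodeId()
                    + " used=" + node.getUsed()
                    + " capability=" + node.getCapability());
        }
        yarn.stop();
    }
}
```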

[Diagram: the Resource Manager (RM) on one physical machine; four Node Managers (NM), each on its own physical machine]

• Jobs run on the NMs and never execute on the RM, so the RM never becomes a bottleneck for job execution.
• Neither the RM nor the NMs require high-end hardware.
• A container is a logical abstraction for CPU and RAM (see the sketch below).
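As a sketch of the container abstraction, a container request (normally issued by an ApplicationMaster through AMRMClient) just bundles RAM and CPU; the sizes below are arbitrary examples:

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ContainerSpec {
    public static void main(String[] args) {
        // A container is just "this much RAM + this many cores" (example sizes)
        Resource capability = Resource.newInstance(1024 /* MB */, 2 /* vcores */);

        // Ask the RM for a container anywhere in the cluster at this priority
        ContainerRequest request =
                new ContainerRequest(capability, null, null, Priority.newInstance(1));
        System.out.println("Requested: " + request.getCapability());
    }
}
```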

• The NN and RM are (preferably) on different machines.
• The DN and NM processes are co-located on the same hosts.
• A file is saved onto HDFS (the DNs); to process it in a distributed way, one writes a YARN application using the YARN client, and to read the data one uses the HDFS client (see the read sketch below).
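A minimal sketch of the HDFS-client side, reading a file back from the DNs (hypothetical path):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // The client asks the NN for block locations, then streams from the DNs
        try (FSDataInputStream in = fs.open(new Path("/user/demo/data.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}
```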

[Diagram: the Name Node and the Resource Manager each on their own physical machine; a co-located DN + NM pair on each of four worker machines]

• A distributed application can fetch the file locations (metadata from the NN) and ask the RM to provide containers on the hosts that hold the file blocks (see the sketch below).
• With the short-circuit read optimization provided by HDFS, if the distributed job gets a container on a host that holds the file block and reads it, the read is local and not over the network.
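A sketch of the first step: fetching block locations (NN metadata) so that containers can be requested on the hosts holding the blocks; the path is hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockHosts {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/data.txt"));

        // The NN returns, per block, the DN hosts holding a replica;
        // these host names are what the application asks the RM for
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```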

• The same file, which read sequentially would take 4 seconds (at 100 MB/sec), can be read in 1 second, because the distributed process runs in parallel in different YARN containers (on the NMs), reading 100 MB/sec × 4 = 400 MB in one second.

 Zookeeper provides a distributed configuration service, a synchronization service, and a naming registry for distributed systems. It is used by many daemons (including YARN's) to manage their peers in a multi-node setup for high availability.
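A minimal sketch of the kind of coordination primitive ZooKeeper provides: an ephemeral znode that registers a peer and vanishes if the peer dies. The connection string and paths are hypothetical, and the parent path is assumed to exist:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class RegisterPeer {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper ensemble (hypothetical address);
        // a robust client would wait for the connection event first
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 5000, event -> { });

        // An EPHEMERAL znode disappears automatically if this peer dies,
        // which is how daemons detect failed peers for high availability
        zk.create("/app/peers/node-1", "host1".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        Thread.sleep(10000); // the znode exists while this session is alive
        zk.close();
    }
}
```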

 Other components are self-described, as shown below.
