
Module-2

Hadoop Architecture and HDFS

Course Topics
 Module 1 » Understanding Big Data and Hadoop
 Module 2 » Hadoop Architecture and HDFS
 Module 3 » Hadoop MapReduce Framework
 Module 4 » Advanced MapReduce
 Module 5 » PIG
 Module 6 » HIVE
 Module 7 » Advanced HIVE and HBase
 Module 8 » Advanced HBase
 Module 9 » Processing Distributed Data with Apache Spark
 Module 10 » Oozie and Hadoop Project
Objectives
At the end of this module, you will be able to:

 Analyze Hadoop 2.x Cluster Architecture – Federation

 Analyze Hadoop 2.x Cluster Architecture – High Availability

 Run Hadoop in different cluster modes

 Implement basic Hadoop commands on the terminal

 Prepare Hadoop 2.x configuration files and analyze the parameters in them

 Analyze the dump of a MapReduce program

 Implement different data loading techniques

Let’s Revise
 Hadoop Core Components

 HDFS Architecture

 What is HDFS?

 Hadoop vs. Traditional Systems

 NameNode and Secondary NameNode

[Diagram: a Hadoop 2.x cluster – YARN (ResourceManager with a NodeManager on each node) running on top of HDFS (NameNode with a DataNode on each node)]
Pre-Class Questions
Annie’s Question
The default replication factor is:
a. 2
b. 4
c. 5
d. 3

Annie’s Answer

Ans. Option d.
If you move a file to HDFS, then by default 3 copies
of the file are stored on different DataNodes.
Annie’s Question
In a multi-node cluster, every slave node has two
daemons running on it: DataNode and NodeManager.
a. TRUE
b. FALSE

Annie’s Answer

Ans. TRUE.
The DataNode daemon serves HDFS, and the NodeManager
handles processing.
Annie’s Question

A block is replicated on 4 nodes: K, L, M, and N. If
M, K, and N fail, a client can still read the data.
a. TRUE
b. FALSE
Annie’s Answer

Ans. TRUE.
The remaining node ‘L’ will still contain the block in
question.
Hadoop Cluster: A Typical Use Case
Secondary NameNode (optional)
» RAM: 64 GB; Hard disk: 1 TB; Processor: Xeon with 4 cores; Ethernet: 3 x 10 Gb/s; OS: 64-bit CentOS; Power: redundant power supply

Active NameNode
» RAM: 32 GB; Hard disk: 1 TB; Processor: Xeon with 8 cores; Ethernet: 3 x 10 Gb/s; OS: 64-bit CentOS; Power: redundant power supply

Standby NameNode
» RAM: 128 GB; Hard disk: 1 TB; Processor: Xeon with 8 cores; Ethernet: 3 x 10 Gb/s; OS: 64-bit CentOS; Power: redundant power supply

DataNodes (each)
» RAM: 16 GB; Hard disk: 6 x 2 TB; Processor: Xeon with 2 cores; Ethernet: 3 x 10 Gb/s; OS: 64-bit CentOS
Hadoop 2.x Cluster Architecture

Master
» NameNode – http://master:50070/
» ResourceManager – http://master:8088

Slaves (Slave01 … Slave05), each running:
» DataNode
» NodeManager
Hadoop 2.x Cluster Architecture (Contd.)

Client

» HDFS: a NameNode (master) with a DataNode on every slave node
» YARN: a ResourceManager (master) with a NodeManager on every slave node

Each slave node runs both a DataNode and a NodeManager side by side.
Hadoop 2.x Cluster Architecture - Federation

Hadoop 1.0
» A single NameNode provides the namespace (NS) and block management, with DataNodes providing storage.

Hadoop 2.0
» Namespace layer: multiple independent NameNodes, each managing its own namespace (NS).
» Block Storage layer: block management plus storage, spread across a shared pool of DataNodes (Datanode … Datanode).
http://hadoop.apache.org/docs/stable2/hadoop-project-dist/hadoop-hdfs/Federation.html
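A minimal sketch of how federation is wired up in hdfs-site.xml (the nameservice IDs and hostnames below are assumptions, not values from the slide):

<configuration>
  <!-- Two independent namespaces, each served by its own NameNode -->
  <property>
    <name>dfs.nameservices</name>
    <value>ns1,ns2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns1</name>
    <value>nn1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns2</name>
    <value>nn2.example.com:8020</value>
  </property>
</configuration>

Every DataNode registers with all of the NameNodes and stores blocks for each block pool.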
Annie’s Question
How does HDFS Federation help HDFS scale horizontally?
a. It reduces the load on any single NameNode by using
multiple, independent NameNodes to manage individual
parts of the file system namespace.
b. It provides cross-data-centre (non-local) support for
HDFS, allowing a cluster administrator to split the Block
Storage outside the local cluster.
Annie’s Answer

Ans. Option (a)

In order to scale the name service horizontally, HDFS
Federation uses multiple independent NameNodes. The
NameNodes are federated; that is, they are independent
and do not require coordination with each other.
Annie’s Question

You have configured two NameNodes to manage
/marketing and /finance respectively. What will happen if
you try to put a file in the /accounting directory?
Annie’s Answer

Ans. The put will fail. Neither namespace will manage the
file, and you will get an IOException with a "No such file
or directory" error.
Hadoop 2.x – High Availability
HDFS HIGH AVAILABILITY

» An Active NameNode and a Standby NameNode run side by side; with HA in place it is not necessary to configure a Secondary NameNode.
» All namespace edits are logged to shared NFS storage, with fencing ensuring a single writer.
» The Standby NameNode reads the edit logs and applies them to its own namespace.

[Diagram: Client → Active and Standby NameNodes sharing edit logs over NFS; the DataNodes below report to both NameNodes]
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html
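A minimal sketch of the NFS-based HA settings in hdfs-site.xml (the nameservice ID, hostnames, and NFS mount path below are assumptions, not values from the slide):

<configuration>
  <!-- Logical name for the HA pair -->
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <!-- The two NameNodes in the pair -->
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>nn1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>nn2.example.com:8020</value>
  </property>
  <!-- Shared edits directory on the NFS filer -->
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>file:///mnt/filer1/dfs/ha-name-dir-shared</value>
  </property>
  <!-- Fencing method so only one NameNode can write -->
  <property>
    <name>dfs.ha.fencing.methods</name>
    <value>sshfence</value>
  </property>
</configuration>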
Hadoop 2.x – Resource Management
HDFS HIGH AVAILABILITY + YARN

» On the HDFS side, the same HA arrangement applies: the Active and Standby NameNodes share edit logs over NFS (single writer, fencing), the Standby applies the edits to its own namespace, and no Secondary NameNode is needed.
» On the YARN side, the ResourceManager (the next-generation MapReduce) manages cluster resources, and a NodeManager on each slave hosts Containers and a per-application App Master.

[Diagram: Client → HDFS (Active/Standby NameNodes over DataNodes) and YARN (ResourceManager over NodeManagers, each with Containers and an App Master); each slave runs a DataNode and a NodeManager]
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html
Hadoop 2.x – Resource Management (Contd.)

Client

Masters
» Resource Manager = Scheduler + Applications Manager (AsM)

Slaves
» A Node Manager on each slave, hosting Containers and an App Master, alongside the DataNode

YARN – Yet Another Resource Negotiator
Annie’s Question
HDFS HA was developed to overcome which of the
following disadvantages in Hadoop 1.0?
a. Single Point of Failure of the NameNode
b. Only one version can be run in classic MapReduce
c. Too much burden on the JobTracker
Annie’s Answer

Ans. Option a. Single Point of Failure of the NameNode.
Hadoop Cluster: Facebook

 We use Hadoop to store copies of internal log and dimension data sources, and use
it as a source for reporting/analytics and machine learning.

 Currently we have 2 major clusters:

» An 1100-machine cluster with 8800 cores and about 12 PB of raw storage.
» A 300-machine cluster with 2400 cores and about 3 PB of raw storage.
» Each (commodity) node has 8 cores and 12 TB of storage.
» We are heavy users of both streaming as well as the Java APIs. We have
built a higher-level data warehousing framework using these features called
Hive (see http://Hadoop.apache.org/hive/). We have also developed a
FUSE implementation over HDFS.
Hadoop Cluster Modes
Hadoop can run in any of the following three modes:

Standalone (or Local) Mode

 No daemons; everything runs in a single JVM.
 Suitable for running MapReduce programs during development.
 Has no DFS; jobs read and write the local file system (see the sketch after this list).

Pseudo-Distributed Mode

 Hadoop daemons run on the local machine.

Fully-Distributed Mode

 Hadoop daemons run on a cluster of machines.
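For instance, here is a sketch of a standalone-mode run, assuming a Hadoop 2.2.0 installation directory and the bundled examples jar (the paths are illustrative):

# Standalone mode: no daemons; input and output are plain local directories
mkdir input
cp etc/hadoop/*.xml input
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount input output
cat output/part-r-00000   # results are written to the local file system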

Terminal Commands

[Screenshots: terminal command demos]
Hadoop FS Shell Commands

 HDFS organizes its data in files and directories

 It provides a command-line interface called the FS shell that lets the user interact with data in HDFS

 The syntax of the commands is similar to bash, as the sketch below shows
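A few representative FS shell commands (the paths are illustrative):

hadoop fs -ls /                                  # list the HDFS root directory
hadoop fs -mkdir /user/edureka/demo              # create a directory on HDFS
hadoop fs -put weather.txt /user/edureka/demo    # copy a local file into HDFS
hadoop fs -cat /user/edureka/demo/weather.txt    # print a file stored on HDFS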

Terminal Commands
[Screenshot: listing of files present on HDFS]

[Screenshot: listing of files present in the bin directory]

Hadoop 2.x Configuration Files

Configuration Filename – Description

hadoop-env.sh – Environment variables that are used in the scripts to run Hadoop.
core-site.xml – Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.
hdfs-site.xml – Configuration settings for HDFS daemons: the NameNode, the Secondary NameNode, and the DataNodes.
mapred-site.xml – Configuration settings for MapReduce applications.
yarn-site.xml – Configuration settings for the ResourceManager and NodeManager.
masters – A list of machines (one per line) that each run a Secondary NameNode.
slaves – A list of machines (one per line) that each run a DataNode and a NodeManager.
Hadoop 2.x Configuration Files – Apache Hadoop

Core » core-site.xml
HDFS » hdfs-site.xml
YARN » yarn-site.xml
MapReduce » mapred-site.xml
core-site.xml

-------------------------------------------------core-site.xml-----------------------------------------------------

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

-------------------------------------------------core-site.xml-----------------------------------------------------

fs.defaultFS: the name of the default file system. The URL's authority is used to determine the host, port, etc. for a filesystem.
hdfs-site.xml
---------------------------------------------------------hdfs-site.xml-------------------------------------------------------------

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/edureka/hadoop-2.2.0/hadoop2_data/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/edureka/hadoop-2.2.0/hadoop2_data/hdfs/datanode</value>
  </property>
</configuration>

---------------------------------------------------------hdfs-site.xml-------------------------------------------------------------

dfs.replication: determines the number of replicas of each block kept in HDFS (here, 1).
dfs.permissions: if "true", enables permission checking in HDFS; if "false", permission checking is turned off.
dfs.namenode.name.dir: determines where on the local filesystem the DFS NameNode stores the name table (fsimage).
dfs.datanode.data.dir: determines where on the local filesystem a DFS DataNode stores its blocks.
mapred-site.xml

-----------------------------------------------mapred-site.xml---------------------------------------------------

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

-----------------------------------------------mapred-site.xml---------------------------------------------------

mapreduce.framework.name: the runtime framework for executing MapReduce jobs; can be one of local, classic, or yarn.
yarn-site.xml

-----------------------------------------------yarn-site.xml---------------------------------------------------

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- yarn-site.xml -->
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>

-----------------------------------------------yarn-site.xml---------------------------------------------------

yarn.nodemanager.aux-services: the auxiliary service name.
yarn.nodemanager.aux-services.mapreduce_shuffle.class: the auxiliary service class to use.
All Properties

1. http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-common/core-default.xml

2. http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml

3. http://hadoop.apache.org/docs/r2.2.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml

4. http://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
Slaves and Masters
Two files are used by the startup and shutdown commands:

Slaves

 Contains a list of hosts, one per line, that are to host DataNode and
NodeManager servers.

Masters

 Contains a list of hosts, one per line, that are to host Secondary
NameNode servers.
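For example (the hostnames below are hypothetical), both files are plain lists with one host per line:

# masters – host(s) running the Secondary NameNode
snn.example.com

# slaves – hosts each running a DataNode and a NodeManager
slave01.example.com
slave02.example.com
slave03.example.com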

Per-Process RunTime Environment

hadoop-env.sh sets the per-process runtime environment for each Hadoop JVM, for example the JAVA_HOME parameter.

 This file also offers a way to provide custom parameters for each of the servers.

 hadoop-env.sh is sourced by all of the Hadoop Core scripts, which live in the etc/hadoop directory of the
Hadoop installation (hadoop-2.2.0/etc/hadoop).

 Examples of environment variables that you can specify:

export HADOOP_HEAPSIZE="512"

export HADOOP_DATANODE_HEAPSIZE="128"
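A typical minimal addition is pointing the daemons at the JDK; the path below is an assumption for an OpenJDK install, not a value from the slide:

# In hadoop-env.sh; the JDK path is hypothetical
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64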

Hadoop Daemons

 NameNode daemon
» Runs on the master node of the Hadoop Distributed File System (HDFS)
» Directs DataNodes to perform their low-level I/O tasks

 DataNode daemon
» Runs on each slave machine in the HDFS
» Does the low-level I/O work

 ResourceManager
» Runs on the master node of the data-processing system (MapReduce)
» Global resource scheduler

 NodeManager
» Runs on each slave node of the data-processing system
» Platform for the data-processing tasks

 JobHistoryServer
» Responsible for servicing all job-history-related requests from clients
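Once a cluster is started, the JDK's jps tool gives a quick check of which daemons are running on a node (the process IDs below are illustrative):

$ jps
2881 NameNode
3171 DataNode
3472 ResourceManager
3684 NodeManager
3912 JobHistoryServer
4201 Jps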

Hadoop Web UI Parts
Service | Servers | Default Port Used | Protocol | Description

NameNode WebUI | Master nodes (NameNode and any back-up NameNodes) | 50070 | HTTP | Web UI to look at the current status of HDFS and explore the file system
DataNode | All slave nodes | 50075 | HTTP | Web UI to access status, logs, etc.
ResourceManager WebUI | Cluster-level resource manager | 8088 | HTTP | Web UI for the ResourceManager and for application submissions
NodeManager | Monitors resources on the DataNode | 8042 | HTTP | Node information, list of applications, and list of containers
JobHistoryServer | Status of finished applications | 19888 | HTTP | Logs of important events in MapReduce job execution and associated profiling metrics
Web UI URLs

 NameNode status: http://localhost:50070/dfshealth.jsp

 ResourceManager status: http://localhost:8088/cluster

 MapReduce JobHistory Server status: http://localhost:19888/jobhistory

Annie’s Question
Which of the following files is used to specify the
NameNode's heap size?
a. bashrc
b. hadoop-env.sh
c. hdfs-site.sh
d. core-site.xml
Annie’s Answer

Ans. hadoop-env.sh.
This file specifies environment variables that affect the
JDK used by the Hadoop daemons (bin/hadoop).
Annie’s Question

It is necessary to define all the properties in core-site.xml,
hdfs-site.xml, yarn-site.xml, and mapred-site.xml.
a. TRUE
b. FALSE
Annie’s Answer

Ans. FALSE.
A detailed answer will be given after the next question.
Annie’s Question

Standalone mode uses the default configuration.
a. TRUE
b. FALSE
Annie’s Answer
Ans. TRUE.
In standalone mode, Hadoop runs with the default
configuration (empty configuration files, i.e. no
configuration settings in core-site.xml, hdfs-site.xml,
mapred-site.xml, and yarn-site.xml). If properties are not
defined in the configuration files, Hadoop runs with the
default values for the corresponding properties.
Sample Examples List
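Running the examples jar with no arguments prints the list of bundled example programs (the jar path assumes a Hadoop 2.2.0 install):

hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar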

Running the Teragen Example
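A sketch of the TeraGen invocation (the row count and output path are illustrative); TeraGen writes the requested number of 100-byte rows into HDFS:

hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar teragen 1000 /user/edureka/terasort-input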

Checking the Output
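The output directory can then be inspected with the FS shell (paths follow the TeraGen sketch above):

hadoop fs -ls /user/edureka/terasort-input                      # _SUCCESS marker plus part-m-* files
hadoop fs -cat /user/edureka/terasort-input/part-m-00000 | head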

Annie’s Question

The output of an MR job is stored on HDFS:
a. TRUE
b. FALSE
Annie’s Answer

Ans. TRUE.
It is stored in separate part files, e.g. part-m-00000,
part-m-00001, and so on.
Annie’s Question

To run an MR job, the data should be present on HDFS:
a. TRUE
b. FALSE
Annie’s Answer

Ans. TRUE.
In order to process data in parallel, the data must be on
HDFS so that MR can work on chunks of it in parallel.
Data Loading Techniques and Data Analysis
Data Analysis
» Using Pig
» Using HIVE

HDFS

Data Loading
» Using Flume
» Using Sqoop
» Using Hadoop Copy Commands

Hadoop Copy Commands
put: Copies file(s) from the local file system to the destination file system. It can also read from stdin and write to
the destination file system.

hadoop dfs -put weather.txt hdfs://<target Namenode>

copyFromLocal: Similar to the "put" command, except that the source is restricted to a local file reference.

hadoop dfs -copyFromLocal weather.txt hdfs://<target Namenode>

distcp: Distributed copy, to move data between clusters; used for backup and recovery.

hadoop distcp hdfs://<source NN> hdfs://<target NN>
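A quick round trip with these commands (the NameNode URI and paths are illustrative):

hadoop dfs -put weather.txt hdfs://master:9000/user/edureka/weather.txt
hadoop dfs -ls hdfs://master:9000/user/edureka
hadoop dfs -cat hdfs://master:9000/user/edureka/weather.txt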

Demo on Copy Commands

Data Loading Using Flume
Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of streaming event data. (The demo will be
covered in Module 10.)

[Diagram: Twitter Streaming API → Flume (Twitter Source → Memory Channel → HDFS Sink) → HDFS]
Data Loading Using Sqoop
Apache Sqoop (TM) is a tool designed for efficiently transferring bulk data
between Apache Hadoop and structured data stores such as relational
databases. (The demo will be covered in Module 10.)

 Imports individual tables or entire databases to HDFS.

 Generates Java classes that allow you to interact with your imported data.

 Provides the ability to import from SQL databases straight into your Hive
data warehouse, as the sketch below illustrates.
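A sketch of a Sqoop import (the JDBC URL, username, table name, and target directory are hypothetical):

sqoop import \
  --connect jdbc:mysql://dbhost.example.com/sales \
  --username edureka -P \
  --table call_records \
  --target-dir /user/edureka/call_records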

Annie’s Question
Your website hosts a group of more than 300 sub-
websites. You want analytics on the shopping patterns of
different visitors. What is the best way to collect that
information from the web logs?
a. SQOOP
b. FLUME
Annie’s Answer

Ans. FLUME.

Annie’s Question
You want to join data collected from two sources. One
source, collected from a big database of call records, is
already available in HDFS. The other source is available
in a database table. The best way to move that data into
HDFS is:
a. SQOOP import
b. PIG script
c. Hive Query
Annie’s Answer

Ans. SQOOP import.

Assignment
 Go through the Edureka VM and explore it
 Check the working condition of the Hadoop ecosystem in the Edureka VM

Follow this document to install the Edureka VM.
Further Reading
 Hadoop Cluster Setup

http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-common/ClusterSetup.html

 Hadoop on Amazon AWS EC2

http://www.edureka.in/blog/install-apache-hadoop-cluster/

 Hadoop Hardware Selection

http://blog.cloudera.com/blog/2013/08/how-to-select-the-right-hardware-for-your-new-hadoop-cluster/

http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-Win-1.3.0/bk_cluster-planning-guide/content/ch_hardware-recommendations.html

 Hadoop Cluster Configuration

http://www.edureka.in/blog/hadoop-cluster-configuration-files/
Further Reading
 MapReduce Job Execution

http://www.edureka.in/blog/anatomy-of-a-mapreduce-job-in-apache-hadoop/

 Add/Remove Nodes in a Cluster

http://www.edureka.in/blog/commissioning-and-decommissioning-nodes-in-a-hadoop-cluster/

 Secondary NameNode

https://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Secondary_NameNode
Pre-work for next Class
Set up the Hadoop development environment using the documents present in the LMS.

 Refresh your Java skills using the Java Essential for Hadoop tutorial.

 Review the interview questions on setting up a Hadoop cluster:

http://www.edureka.in/blog/hadoop-interview-questions-hadoop-cluster/
Agenda for Next Class
 Use Cases of MapReduce
 Traditional vs MapReduce Way
 Hadoop 2.x MapReduce Components and Architecture
 YARN Execution Flow
 MapReduce Concepts

Survey
Your feedback is important to us, be it a compliment, a suggestion, or a complaint. It helps us make
the course better!

Please spare a few minutes to take the survey after the webinar.
