2-HadoopArchitecture HDFS
• Please log in 10 minutes before the class starts and check your internet connection to avoid any network issues during the LIVE session
• All participants will be muted by default to avoid background noise; the instructor will unmute you if required. Please use the “Questions” tab in your webinar tool to interact with the instructor at any point during the class
• Feel free to ask and answer questions to make your learning interactive. The instructor will address your queries at the end of the ongoing topic
• If you want to connect to your Personal Learning Manager (PLM), dial +91 7618772501
• We have a dedicated support team to assist with all your queries. You can reach us anytime on the numbers below:
US: 1855 818 0063 (Toll-Free) | India: +91 9019117772
• Your feedback is much appreciated. Please share feedback after each class, which will help us enhance your learning experience
▪ HDFS Architecture
▪ What is HDFS?
[Diagram: a fully-distributed Hadoop cluster. The Master node runs the NameNode (web UI at http://master:50070/) and the ResourceManager (web UI at http://master:8088). Each of the slave nodes, Slave01 through Slave05, runs a DataNode and a NodeManager.]
[Diagram: HDFS Federation alongside YARN. On the HDFS side, multiple NameNodes each manage an independent Namespace, with a ViewFS map routing paths to namespaces; the Storage layer is shared, and each DataNode (DN1 … DNn) stores blocks for the namespaces. On the YARN side, the ResourceManager manages compute.]
Ans. The put will fail. None of the namespaces will manage the file, and you
will get an IOException with a “No such file or directory” error.
[Diagram: HDFS High Availability. All namespace edits are logged by the single writer, the Active NameNode, to Shared Edit Logs on shared NFS storage (fencing ensures only one writer); the Standby NameNode reads the edit logs and applies them to its own namespace. The Client interacts with the Active NameNode. A Secondary NameNode is shown alongside for comparison.]
[Diagram: YARN architecture. The Masters run the ResourceManager, which comprises the Applications Manager (AsM) and the Scheduler. The Slaves each run a DataNode and host Containers, one of which runs the Application Master for a job.]
▪ We use Hadoop to store copies of internal log and dimension data sources and use
it as a source for reporting/analytics and machine learning.
• Standalone (local) mode has no DFS; everything runs against the local file system.
Fully-Distributed Mode
Configuration
Hadoop Configuration Files
Filename | Description
hadoop-env.sh | Environment variables that are used in the scripts to run Hadoop.
core-site.xml | Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.
hdfs-site.xml | Configuration settings for the HDFS daemons: the NameNode, the Secondary NameNode and the DataNodes.
masters | A list of machines (one per line) that each run a Secondary NameNode.
slaves | A list of machines (one per line) that each run a DataNode and a NodeManager.
HDFS hdfs-site.xml
YARN yarn-site.xml
MapReduce mapred-site.xml
<value>hdfs://nameservice1</value>
</property>
</configuration>
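The XML above is only the tail of a configuration entry; the property name has been cut off. As a hedged sketch of what a complete entry of this shape looks like (the property name fs.defaultFS is the standard Hadoop 2.x key for the default file system; the nameservice ID hdfs://nameservice1 is taken from the fragment itself), a core-site.xml pointing clients at a logical HA nameservice might read:

```xml
<?xml version="1.0"?>
<!-- core-site.xml: point clients at the logical HA nameservice
     rather than at a single NameNode host. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://nameservice1</value>
  </property>
</configuration>
```

Using the nameservice ID instead of a hostname lets clients fail over transparently between the Active and Standby NameNodes.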
2. https://hadoop.apache.org/docs/r2.8.5/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
3. https://hadoop.apache.org/docs/r2.8.5/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
4. https://hadoop.apache.org/docs/r2.8.5/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
Slaves
▪ Contains a list of hosts, one per line, that are to host DataNode
and NodeManager services.
Masters
▪ Contains a list of hosts, one per line, that are to host Secondary
NameNode servers.
▪ This file also offers a way to provide custom parameters for each of the servers.
▪ hadoop-env.sh is sourced by all of the Hadoop Core scripts, found inside the Hadoop configuration directory:
/opt/cloudera/parcels/CDH/lib/hadoop/etc/hadoop
▪ Examples of environment variables that you can specify:
▪ export HADOOP_HEAPSIZE="512"
▪ export HADOOP_DATANODE_HEAPSIZE="128"
Daemon | Runs on | Port | Protocol | Description
NameNode | Master nodes (NameNode and any back-up NameNodes) | 50070 | http | Web UI to look at the current status of HDFS and explore the file system
DataNode | All slave nodes | 50075 | http | DataNode Web UI to access the status, logs, etc.
ResourceManager | Master nodes | 8088 | http | Web UI to track applications and cluster resources
JobHistoryServer | Master nodes | 19888 | http | Web UI to browse completed MapReduce job history
Ans. False.
A detailed answer will be given after the next question.
Ans. True.
It is stored in different part files, e.g. part-m-00000, part-m-00001,
and so on.
HDFS
Data Loading
copyFromLocal: Similar to “put” command, except that the source is restricted to a local file reference.
distcp: Distributed Copy to move data between clusters, used for backup and recovery
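As a sketch of the data-loading commands above (the paths and cluster hostnames here are hypothetical, and a running HDFS cluster is assumed):

```shell
# put: copy a local file (or stdin) into HDFS
hdfs dfs -put /tmp/events.log /data/logs/

# copyFromLocal: like put, but the source must be a local file reference
hdfs dfs -copyFromLocal /tmp/events.log /data/logs/

# distcp: distributed copy of data between clusters (backup and recovery)
hadoop distcp hdfs://clusterA:8020/data/logs hdfs://clusterB:8020/backup/logs
```

distcp runs as a MapReduce job, so large copies are parallelised across the cluster rather than streamed through a single client.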
[Diagram: streaming data ingestion. Data flows from Twitter through the Streaming API into Flume, which writes it to HDFS.]
Ans. FLUME.
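As a hedged sketch of how such a Twitter-to-HDFS pipeline is wired in Flume (the agent name a1, channel sizing, and HDFS path are hypothetical placeholders; the Twitter source class is the experimental one shipped with Apache Flume, and real credentials must be supplied):

```properties
# flume.conf — one agent: Twitter source -> memory channel -> HDFS sink
a1.sources = twitter
a1.channels = mem
a1.sinks = hdfs-sink

# Experimental Twitter source shipped with Apache Flume
a1.sources.twitter.type = org.apache.flume.source.twitter.TwitterSource
a1.sources.twitter.consumerKey = <your-key>
a1.sources.twitter.consumerSecret = <your-secret>
a1.sources.twitter.accessToken = <your-token>
a1.sources.twitter.accessTokenSecret = <your-token-secret>

# Buffer events in memory between source and sink
a1.channels.mem.type = memory
a1.channels.mem.capacity = 10000

# Write events into date-partitioned HDFS directories
a1.sinks.hdfs-sink.type = hdfs
a1.sinks.hdfs-sink.hdfs.path = hdfs://nameservice1/flume/tweets/%Y/%m/%d
a1.sinks.hdfs-sink.hdfs.fileType = DataStream
a1.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true

# Wire source and sink to the channel
a1.sources.twitter.channels = mem
a1.sinks.hdfs-sink.channel = mem
```

The agent is started with flume-ng agent --name a1 --conf-file flume.conf; the channel decouples the ingest rate from the HDFS write rate.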
http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-common/ClusterSetup.html
http://www.edureka.in/blog/install-apache-hadoop-cluster/
http://blog.cloudera.com/blog/2013/08/how-to-select-the-right-hardware-for-your-new-hadoop-cluster/
http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-Win-1.3.0/bk_cluster-planning-guide/content/ch_hardware-recommendations.html
http://www.edureka.in/blog/hadoop-cluster-configuration-files/
http://www.edureka.in/blog/anatomy-of-a-mapreduce-job-in-apache-hadoop/
http://www.edureka.in/blog/commissioning-and-decommissioning-nodes-in-a-hadoop-cluster/
▪ Secondary NameNode
https://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Secondary_NameNode
http://www.edureka.in/blog/hadoop-interview-questions-hadoop-cluster/