02 Hadoop Architecture and HDFS
www.edureka.co/big-data-and-hadoop
Course Topics
» Module 1: Understanding Big Data and Hadoop
» Module 2: Hadoop Architecture and HDFS
» Module 3: Hadoop MapReduce Framework
» Module 4: Advance MapReduce
» Module 5: PIG
» Module 6: HIVE
» Module 7: Advance HIVE and HBase
» Module 8: Advance HBase
» Module 9: Processing Distributed Data with Apache Spark
» Module 10: Oozie and Hadoop Project
Slide 2 www.edureka.co/big-data-and-hadoop
Objectives
At the end of this module, you will be able to:
» Analyse Hadoop 2.x Cluster Architecture – Federation
Slide 3 www.edureka.co/big-data-and-hadoop
Let’s Revise
HDFS Architecture
What is HDFS?
Slide 4 www.edureka.co/big-data-and-hadoop
Pre-Class Questions
Slide 5 www.edureka.co/big-data-and-hadoop
Annie’s Question
The default replication factor is:
a. 2
b. 4
c. 5
d. 3
Slide 6 www.edureka.co/big-data-and-hadoop
Annie’s Answer
Ans. Option d.
If you move a file to HDFS, then by default 3 copies of
the file are stored on different DataNodes.
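The default can be overridden per cluster in hdfs-site.xml; a minimal sketch, where the property name is standard but the value 2 is only an illustration:

```xml
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```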
Slide 7 www.edureka.co/big-data-and-hadoop
Annie’s Question
In a multi-node cluster, every slave node has two
daemons running on it: DataNode and NodeManager.
a. TRUE
b. FALSE
Slide 8 www.edureka.co/big-data-and-hadoop
Annie’s Answer
Ans. TRUE
The DataNode daemon serves HDFS storage, and the
NodeManager handles processing.
Slide 9 www.edureka.co/big-data-and-hadoop
Annie’s Question
Slide 10 www.edureka.co/big-data-and-hadoop
Annie’s Answer
Ans. TRUE.
As the remaining node ‘L’ will contain the block in
question.
Slide 11 www.edureka.co/big-data-and-hadoop
Hadoop Cluster: A Typical Use Case
Optional
Slide 12 www.edureka.co/big-data-and-hadoop
Hadoop 2.x Cluster Architecture
Master:
» NameNode – web UI at http://master:50070/
» ResourceManager – web UI at http://master:8088
Slave01 … Slave05 (each):
» DataNode
» NodeManager
Slide 13 www.edureka.co/big-data-and-hadoop
Hadoop 2.x Cluster Architecture (Contd.)
Client requests go to the NameNode (HDFS) and the ResourceManager (YARN).
Slide 14 www.edureka.co/big-data-and-hadoop
Hadoop 2.x Cluster Architecture - Federation
Namenode – Namespace (NS) and Block Management
Datanode … Datanode – common Storage layer
http://hadoop.apache.org/docs/stable2/hadoop-project-dist/hadoop-hdfs/Federation.html
Slide 15 www.edureka.co/big-data-and-hadoop
Annie’s Question
How does HDFS Federation help HDFS scale horizontally?
a. It reduces the load on any single NameNode by using
multiple, independent NameNodes to manage individual
parts of the file system namespace.
b. It provides cross-data-centre (non-local) support for
HDFS, allowing a cluster administrator to split the Block
Storage outside the local cluster.
Slide 16 www.edureka.co/big-data-and-hadoop
Annie’s Answer
Slide 17 www.edureka.co/big-data-and-hadoop
Annie’s Question
Slide 18 www.edureka.co/big-data-and-hadoop
Annie’s Answer
Ans. The put will fail. None of the namespaces will manage
the file, and you will get an IOException with a "No such file
or directory" error.
Slide 19 www.edureka.co/big-data-and-hadoop
Hadoop 2.x – High Availability
HDFS HIGH AVAILABILITY
» All namespace edits are logged to shared NFS storage; only a single writer is allowed (fencing).
» The Standby NameNode reads the shared edit logs and applies them to its own namespace.
» Clients talk to the Active NameNode; in an HA setup the Standby NameNode takes over the checkpointing role of the Secondary NameNode.
» App Masters and Containers run on the slave nodes.
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html
Slide 20 www.edureka.co/big-data-and-hadoop
Hadoop 2.x – Resource Management
HDFS and YARN
» HDFS side: Active and Standby NameNodes share edit logs on NFS storage (single writer, with fencing); the Standby reads the edit logs and applies them to its own namespace.
» YARN side: the ResourceManager drives the next-generation MapReduce runtime.
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html
Slide 21 www.edureka.co/big-data-and-hadoop
Hadoop 2.x – Resource Management (Contd.)
Client
Masters:
» Resource Manager – Scheduler and Applications Manager (AsM)
Slaves:
» DataNodes running App Masters and Containers
Slide 22 www.edureka.co/big-data-and-hadoop
Annie’s Question
HDFS HA was developed to overcome which of the
following disadvantages in Hadoop 1.0?
a. Single Point of Failure of the NameNode
b. Only one version can be run in classic MapReduce
c. Too much burden on the JobTracker
Slide 23 www.edureka.co/big-data-and-hadoop
Annie’s Answer
Slide 24 www.edureka.co/big-data-and-hadoop
Hadoop Cluster: Facebook
Facebook
We use Hadoop to store copies of internal log and dimension data sources and use
it as a source for reporting/analytics and machine learning.
Slide 25 www.edureka.co/big-data-and-hadoop
Hadoop Cluster Modes
Hadoop can run in any of the following three modes:
» Standalone (Local) Mode
» Pseudo-Distributed Mode
» Fully-Distributed Mode
Slide 26 www.edureka.co/big-data-and-hadoop
Terminal Commands
Slide 27 www.edureka.co/big-data-and-hadoop
Terminal Commands
Slide 28 www.edureka.co/big-data-and-hadoop
Hadoop FS Shell Commands
Hadoop provides a command-line interface called the FS shell that lets the user interact with data in HDFS.
Slide 29 www.edureka.co/big-data-and-hadoop
Terminal Commands
Listing of files present on HDFS
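A few common FS shell commands, as a sketch; the paths under /user/edureka are placeholders, and a running cluster is assumed:

```shell
hadoop fs -ls /                                   # list files at the HDFS root
hadoop fs -mkdir /user/edureka/input              # create a directory on HDFS
hadoop fs -put localfile.txt /user/edureka/input  # copy a local file into HDFS
hadoop fs -cat /user/edureka/input/localfile.txt  # print a file's contents
```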
Slide 30 www.edureka.co/big-data-and-hadoop
Hadoop 2.x Configuration Files
Configuration Filename | Description
hadoop-env.sh | Environment variables that are used in the scripts to run Hadoop.
core-site.xml | Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.
hdfs-site.xml | Configuration settings for the HDFS daemons: the NameNode, the Secondary NameNode and the DataNodes.
mapred-site.xml | Configuration settings for MapReduce applications.
yarn-site.xml | Configuration settings for the ResourceManager and NodeManager.
masters | A list of machines (one per line) that each run a Secondary NameNode.
slaves | A list of machines (one per line) that each run a DataNode and a NodeManager.
Slide 31 www.edureka.co/big-data-and-hadoop
Hadoop 2.x Configuration Files – Apache Hadoop
Slide 32 www.edureka.co/big-data-and-hadoop
Hadoop 2.x Configuration Files – Apache Hadoop
Core – core-site.xml
HDFS – hdfs-site.xml
YARN – yarn-site.xml
MapReduce – mapred-site.xml
Slide 33 www.edureka.co/big-data-and-hadoop
core-site.xml
-------------------------------------------------core-site.xml-----------------------------------------------------
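The slide's screenshot is not reproduced here; as a minimal Hadoop 2.x sketch, where the host name "master" and port 8020 are assumptions:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- core-site.xml: minimal sketch; "master:8020" is an assumed NameNode address -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:8020</value>
  </property>
</configuration>
```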
------------------------------------------------core-site.xml-----------------------------------------------------
Slide 34 www.edureka.co/big-data-and-hadoop
hdfs-site.xml
---------------------------------------------------------hdfs-site.xml-------------------------------------------------------------
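The slide's screenshot is not reproduced here; a minimal sketch, where the local storage paths are assumptions:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- hdfs-site.xml: minimal sketch; the file:// paths below are assumed locations -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/edureka/hadoop/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/edureka/hadoop/datanode</value>
  </property>
</configuration>
```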
Slide 35 www.edureka.co/big-data-and-hadoop
mapred-site.xml
-----------------------------------------------mapred-site.xml---------------------------------------------------
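The slide's screenshot is not reproduced here; a minimal sketch with the single standard property that switches MapReduce onto YARN:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- mapred-site.xml: minimal sketch -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```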
-----------------------------------------------mapred-site.xml---------------------------------------------------
Slide 36 www.edureka.co/big-data-and-hadoop
yarn-site.xml
-----------------------------------------------yarn-site.xml---------------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- yarn-site.xml -->
<configuration>
  <!-- The auxiliary service name -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <!-- The auxiliary service class to use -->
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
-----------------------------------------------yarn-site.xml---------------------------------------------------
Slide 37 www.edureka.co/big-data-and-hadoop
All Properties
1. http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-common/core-default.xml
2. http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
3. http://hadoop.apache.org/docs/r2.2.0/hadoop-mapreduce-client/hadoop-mapreduce-client-
core/mapred-default.xml
4. http://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
Slide 38 www.edureka.co/big-data-and-hadoop
Slaves and Masters
Two files are used by the startup and shutdown commands:
Slaves
Contains a list of hosts, one per line, that are to host DataNode and
NodeManager servers.
Masters
Contains a list of hosts, one per line, that are to host Secondary
NameNode servers.
Slide 39 www.edureka.co/big-data-and-hadoop
Per-Process RunTime Environment
hadoop-env.sh is sourced by all of the Hadoop Core scripts; it lives in the configuration directory of the
Hadoop installation (hadoop-2.2.0/etc/hadoop). It also offers a way to provide custom parameters for each of
the servers.
export HADOOP_HEAPSIZE="512"
export HADOOP_DATANODE_HEAPSIZE="128"
Slide 40 www.edureka.co/big-data-and-hadoop
Hadoop Daemons
Slide 41 www.edureka.co/big-data-and-hadoop
Hadoop Daemons
NameNode daemon
» Runs on master node of the Hadoop Distributed File System (HDFS)
» Directs Data Nodes to perform their low-level I/O tasks
DataNode daemon
» Runs on each slave machine in the HDFS
» Does the low-level I/O work
Resource Manager
» Runs on master node of the Data processing System(MapReduce)
» Global resource Scheduler
Node Manager
» Runs on each slave node of Data processing System
» Platform for the Data processing tasks
Job HistoryServer
» JobHistoryServer is responsible for servicing all job history related requests from client
Slide 42 www.edureka.co/big-data-and-hadoop
Hadoop Web UI Parts
Service | Servers | Default Ports Used | Protocol | Description
Slide 43 www.edureka.co/big-data-and-hadoop
Web UI URLs
Slide 44 www.edureka.co/big-data-and-hadoop
Annie’s Question
Which of the following files is used to specify the
NameNode's heap size?
a. bashrc
b. hadoop-env.sh
c. hdfs-site.sh
d. core-site.xml
Slide 45 www.edureka.co/big-data-and-hadoop
Annie’s Answer
Ans. hadoop-env.sh.
This file specifies environment variables that affect the
JDK used by the Hadoop daemons (bin/hadoop).
Slide 46 www.edureka.co/big-data-and-hadoop
Annie’s Question
Slide 47 www.edureka.co/big-data-and-hadoop
Annie’s Answer
Ans. False.
Detailed answer will be given after the next question.
Slide 48 www.edureka.co/big-data-and-hadoop
Annie’s Question
Slide 49 www.edureka.co/big-data-and-hadoop
Annie’s Answer
Ans. True.
In standalone mode, Hadoop runs with the default
configuration (empty configuration files, i.e. no
configuration settings in core-site.xml, hdfs-site.xml,
mapred-site.xml and yarn-site.xml). If properties are not
defined in the configuration files, Hadoop runs with the
default values for the corresponding properties.
Slide 50 www.edureka.co/big-data-and-hadoop
Sample Examples List
Slide 51 www.edureka.co/big-data-and-hadoop
Running the Teragen Example
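The demo screenshot is not reproduced here; as a sketch, TeraGen is run from the bundled examples jar, where the jar path and output directory are assumptions:

```shell
# Generate 1,000,000 rows of 100-byte synthetic records into HDFS
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar \
    teragen 1000000 /user/edureka/teragen-out

# List the generated part files
hadoop fs -ls /user/edureka/teragen-out
```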
Slide 52 www.edureka.co/big-data-and-hadoop
Checking the Output
Slide 53 www.edureka.co/big-data-and-hadoop
Checking the Output
Slide 54 www.edureka.co/big-data-and-hadoop
Annie’s Question
Slide 55 www.edureka.co/big-data-and-hadoop
Annie’s Answer
Ans. True.
The output is stored in different part files, e.g.
part-m-00000, part-m-00001 and so on.
Slide 56 www.edureka.co/big-data-and-hadoop
Annie’s Question
Slide 57 www.edureka.co/big-data-and-hadoop
Annie’s Answer
Ans. True.
To process data in parallel, it must be present on HDFS
so that MapReduce can work on chunks of the data in
parallel.
Slide 58 www.edureka.co/big-data-and-hadoop
Data Loading Techniques and Data Analysis
Data is loaded into HDFS for analysis using commands such as:
copyFromLocal: Similar to the "put" command, except that the source is restricted to a local file reference.
distcp: Distributed Copy, used to move data between clusters, e.g. for backup and recovery.
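A sketch of these copy commands; the host names and paths are placeholders, and a running cluster is assumed:

```shell
hadoop fs -put data.txt /user/edureka/data.txt            # source may be any file reference
hadoop fs -copyFromLocal data.txt /user/edureka/data.txt  # source must be a local file
hadoop distcp hdfs://cluster1:8020/data hdfs://cluster2:8020/backup   # copy between clusters
```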
Slide 60 www.edureka.co/big-data-and-hadoop
Demo on Copy Commands
Slide 61 www.edureka.co/big-data-and-hadoop
Data Loading Using Flume
Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of streaming event data.
Demo will be covered in Module 10.
Twitter Streaming API → Flume → HDFS
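A minimal Flume agent configuration sketch; the agent, source, channel and sink names are placeholders, and a netcat source stands in for a real streaming source such as the Twitter API:

```properties
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Stand-in streaming source (a real deployment would use a Twitter source)
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444

agent1.channels.ch1.type = memory

# Deliver events into HDFS; the path is an assumption
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://master:8020/user/edureka/flume/events

agent1.sources.src1.channels = ch1
agent1.sinks.sink1.channel = ch1
```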
Slide 62 www.edureka.co/big-data-and-hadoop
Data Loading Using Sqoop
Apache Sqoop (TM) is a tool designed for efficiently transferring bulk data
between Apache Hadoop and structured data stores such as relational
databases. Demo will be covered in Module 10.
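A sketch of a Sqoop import into HDFS; the JDBC URL, credentials, table name and target directory are all placeholders:

```shell
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username edureka -P \
  --table call_records \
  --target-dir /user/edureka/call_records \
  -m 4          # use 4 parallel map tasks
```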
Slide 63 www.edureka.co/big-data-and-hadoop
Annie’s Question
Your website hosts a group of more than 300 sub-
websites. You want analytics on the shopping patterns
of different visitors. What is the best way to collect
that information from the weblogs?
a. SQOOP
b. FLUME
Slide 64 www.edureka.co/big-data-and-hadoop
Annie’s Answer
Ans. FLUME.
Slide 65 www.edureka.co/big-data-and-hadoop
Annie’s Question
You want to join data collected from two sources. One
source, a big database of call records, is already
available in HDFS. The other source is available in a
database table. The best way to move that data into
HDFS is:
a. SQOOP import
b. PIG script
c. Hive Query
Slide 66 www.edureka.co/big-data-and-hadoop
Annie’s Answer
Slide 67 www.edureka.co/big-data-and-hadoop
Assignment
» Go through the Edureka VM and explore it
» Check the working condition of the Hadoop ecosystem in the Edureka VM
Follow this document to install the Edureka VM.
Slide 68 www.edureka.co/big-data-and-hadoop
Further Reading
Hadoop Cluster Setup
http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-common/ClusterSetup.html
http://www.edureka.in/blog/install-apache-hadoop-cluster/
http://blog.cloudera.com/blog/2013/08/how-to-select-the-right-hardware-for-your-new-
hadoop-cluster/
http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-Win-1.3.0/bk_cluster-planning-
guide/content/ch_hardware-recommendations.html
http://www.edureka.in/blog/hadoop-cluster-configuration-files/
Slide 69 www.edureka.co/big-data-and-hadoop
Further Reading
MapReduce Job execution
http://www.edureka.in/blog/anatomy-of-a-mapreduce-job-in-apache-hadoop/
http://www.edureka.in/blog/commissioning-and-decommissioning-nodes-in-a-hadoop-cluster/
Secondary Namenode
https://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-
hdfs/HdfsUserGuide.html#Secondary_NameNode
Slide 70 www.edureka.co/big-data-and-hadoop
Pre-work for next Class
Setup the Hadoop development environment using the documents present in the LMS.
Refresh your Java Skills using Java Essential for Hadoop Tutorial
http://www.edureka.in/blog/hadoop-interview-questions-hadoop-cluster/
Slide 71 www.edureka.co/big-data-and-hadoop
Agenda for Next Class
Use Cases of MapReduce
Traditional vs MapReduce Way
Hadoop 2.x MapReduce Components and Architecture
YARN Execution Flow
MapReduce Concepts
Slide 72 www.edureka.co/big-data-and-hadoop
Survey
Your feedback is important to us, be it a compliment, a suggestion or a complaint. It helps us make
the course better!
Please spare a few minutes to take the survey after the webinar.
Slide 73 www.edureka.co/big-data-and-hadoop