
Module-2

Hadoop Architecture and HDFS

Course Topics
 Module 1 » Understanding Big Data and Hadoop
 Module 2 » Hadoop Architecture and HDFS
 Module 3 » Hadoop MapReduce Framework
 Module 4 » Advanced MapReduce
 Module 5 » PIG
 Module 6 » HIVE
 Module 7 » Advanced HIVE and HBase
 Module 8 » Advanced HBase
 Module 9 » Processing Distributed Data with Apache Spark
 Module 10 » Oozie and Hadoop Project
Objectives
At the end of this module, you will be able to:

 Analyze Hadoop 2.x Cluster Architecture – Federation

 Analyze Hadoop 2.x Cluster Architecture – High Availability

 Run Hadoop in different cluster modes

 Implement basic Hadoop commands on the terminal

 Prepare Hadoop 2.x configuration files and analyze the parameters in them

 Analyze the dump of a MapReduce program

 Implement different data loading techniques

Let’s Revise
 Hadoop Core Components

 HDFS Architecture

 What is HDFS?

 Hadoop vs. Traditional Systems

 NameNode and Secondary NameNode

[Diagram: a Hadoop 2.x cluster – YARN (ResourceManager with a NodeManager on each node) running on top of HDFS (NameNode with a DataNode on each node)]
Pre-Class Questions
Annie’s Question
The default replication factor is:
a. 2
b. 4
c. 5
d. 3

Annie’s Answer

Ans. Option d.
If you move a file to HDFS, then by default 3 copies
of the file are stored on different DataNodes.
Annie’s Question
In a multi-node cluster, every slave node has two
daemons running on it: DataNode and NodeManager.
a. TRUE
b. FALSE

Annie’s Answer

Ans. TRUE.
The DataNode daemon serves HDFS, and the NodeManager
handles processing.
Annie’s Question

A block is replicated on 4 nodes: K, L, M, and N. If
M, K, and N fail, a client can still read the data.
a. TRUE
b. FALSE
Annie’s Answer

Ans. TRUE.
The remaining node ‘L’ will still contain the block in
question.
Hadoop Cluster: A Typical Use Case
Secondary NameNode (optional)
» RAM: 64 GB; Hard disk: 1 TB; Processor: Xeon with 4 cores; Ethernet: 3 x 10 Gb/s; OS: 64-bit CentOS; Power: redundant power supply

Active NameNode
» RAM: 32 GB; Hard disk: 1 TB; Processor: Xeon with 8 cores; Ethernet: 3 x 10 Gb/s; OS: 64-bit CentOS; Power: redundant power supply

Standby NameNode
» RAM: 128 GB; Hard disk: 1 TB; Processor: Xeon with 8 cores; Ethernet: 3 x 10 Gb/s; OS: 64-bit CentOS; Power: redundant power supply

DataNodes (each)
» RAM: 16 GB; Hard disk: 6 x 2 TB; Processor: Xeon with 2 cores; Ethernet: 3 x 10 Gb/s; OS: 64-bit CentOS
Hadoop 2.x Cluster Architecture

Master
» NameNode – http://master:50070/
» ResourceManager – http://master:8088

Slaves (Slave01 … Slave05), each running:
» DataNode
» NodeManager
Hadoop 2.x Cluster Architecture (Contd.)

Client

» HDFS: a NameNode (master) with a DataNode on every slave node
» YARN: a ResourceManager (master) with a NodeManager on every slave node

Each slave node runs both a DataNode and a NodeManager side by side.
Hadoop 2.x Cluster Architecture - Federation

Hadoop 1.0
» A single NameNode provides the namespace (NS) and block management, with DataNodes providing storage.

Hadoop 2.0
» Namespace layer: multiple independent NameNodes, each managing its own namespace (NS).
» Block Storage layer: block management plus storage, spread across a shared pool of DataNodes (Datanode … Datanode).
http://hadoop.apache.org/docs/stable2/hadoop-project-dist/hadoop-hdfs/Federation.html
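A minimal sketch of how federation is wired up in hdfs-site.xml (the nameservice IDs and hostnames below are assumptions, not values from the slide):

<configuration>
  <!-- Two independent namespaces, each served by its own NameNode -->
  <property>
    <name>dfs.nameservices</name>
    <value>ns1,ns2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns1</name>
    <value>nn1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns2</name>
    <value>nn2.example.com:8020</value>
  </property>
</configuration>

Every DataNode registers with all of the NameNodes and stores blocks for each block pool.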
Annie’s Question
How does HDFS Federation help HDFS scale horizontally?
a. It reduces the load on any single NameNode by using
multiple, independent NameNodes to manage individual
parts of the file system namespace.
b. It provides cross-data-centre (non-local) support for
HDFS, allowing a cluster administrator to split the Block
Storage outside the local cluster.
Annie’s Answer

Ans. Option (a)

In order to scale the name service horizontally, HDFS
Federation uses multiple independent NameNodes. The
NameNodes are federated; that is, they are independent
and do not require coordination with each other.
Annie’s Question

You have configured two NameNodes to manage
/marketing and /finance respectively. What will happen if
you try to put a file in the /accounting directory?
Annie’s Answer

Ans. The put will fail. Neither namespace will manage the
file, and you will get an IOException with a "No such file
or directory" error.
Hadoop 2.x – High Availability
HDFS HIGH AVAILABILITY

» An Active NameNode and a Standby NameNode run side by side; with HA in place it is not necessary to configure a Secondary NameNode.
» All namespace edits are logged to shared NFS storage, with fencing ensuring a single writer.
» The Standby NameNode reads the edit logs and applies them to its own namespace.

[Diagram: Client → Active and Standby NameNodes sharing edit logs over NFS; the DataNodes below report to both NameNodes]
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html
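A minimal sketch of the NFS-based HA settings in hdfs-site.xml (the nameservice ID, hostnames, and NFS mount path below are assumptions, not values from the slide):

<configuration>
  <!-- Logical name for the HA pair -->
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <!-- The two NameNodes in the pair -->
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>nn1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>nn2.example.com:8020</value>
  </property>
  <!-- Shared edits directory on the NFS filer -->
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>file:///mnt/filer1/dfs/ha-name-dir-shared</value>
  </property>
  <!-- Fencing method so only one NameNode can write -->
  <property>
    <name>dfs.ha.fencing.methods</name>
    <value>sshfence</value>
  </property>
</configuration>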
Hadoop 2.x – Resource Management
HDFS HIGH AVAILABILITY + YARN

» On the HDFS side, the same HA arrangement applies: the Active and Standby NameNodes share edit logs over NFS (single writer, fencing), the Standby applies the edits to its own namespace, and no Secondary NameNode is needed.
» On the YARN side, the ResourceManager (the next-generation MapReduce) manages cluster resources, and a NodeManager on each slave hosts Containers and a per-application App Master.

[Diagram: Client → HDFS (Active/Standby NameNodes over DataNodes) and YARN (ResourceManager over NodeManagers, each with Containers and an App Master); each slave runs a DataNode and a NodeManager]
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html
Hadoop 2.x – Resource Management (Contd.)

Client

Masters
» Resource Manager = Scheduler + Applications Manager (AsM)

Slaves
» A Node Manager on each slave, hosting Containers and an App Master, alongside the DataNode

YARN – Yet Another Resource Negotiator
Annie’s Question
HDFS HA was developed to overcome which of the
following disadvantages in Hadoop 1.0?
a. Single Point of Failure of the NameNode
b. Only one version can be run in classic MapReduce
c. Too much burden on the JobTracker
Annie’s Answer

Ans. Option a. Single Point of Failure of the NameNode.
Hadoop Cluster: Facebook

 We use Hadoop to store copies of internal log and dimension data sources, and use
it as a source for reporting/analytics and machine learning.

 Currently we have 2 major clusters:

» An 1100-machine cluster with 8800 cores and about 12 PB of raw storage.
» A 300-machine cluster with 2400 cores and about 3 PB of raw storage.
» Each (commodity) node has 8 cores and 12 TB of storage.
» We are heavy users of both streaming as well as the Java APIs. We have
built a higher-level data warehousing framework using these features called
Hive (see http://Hadoop.apache.org/hive/). We have also developed a
FUSE implementation over HDFS.
Hadoop Cluster Modes
Hadoop can run in any of the following three modes:

Standalone (or Local) Mode

 No daemons; everything runs in a single JVM.
 Suitable for running MapReduce programs during development.
 Has no DFS; jobs read and write the local file system (see the sketch after this list).

Pseudo-Distributed Mode

 Hadoop daemons run on the local machine.

Fully-Distributed Mode

 Hadoop daemons run on a cluster of machines.
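For instance, here is a sketch of a standalone-mode run, assuming a Hadoop 2.2.0 installation directory and the bundled examples jar (the paths are illustrative):

# Standalone mode: no daemons; input and output are plain local directories
mkdir input
cp etc/hadoop/*.xml input
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount input output
cat output/part-r-00000   # results are written to the local file system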

Terminal Commands

[Screenshots: terminal command demos]
Hadoop FS Shell Commands

 HDFS organizes its data in files and directories

 It provides a command-line interface called the FS shell that lets the user interact with data in HDFS

 The syntax of the commands is similar to bash, as the sketch below shows
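A few representative FS shell commands (the paths are illustrative):

hadoop fs -ls /                                  # list the HDFS root directory
hadoop fs -mkdir /user/edureka/demo              # create a directory on HDFS
hadoop fs -put weather.txt /user/edureka/demo    # copy a local file into HDFS
hadoop fs -cat /user/edureka/demo/weather.txt    # print a file stored on HDFS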

Terminal Commands
[Screenshot: listing of files present on HDFS]

[Screenshot: listing of files present in the bin directory]

Hadoop 2.x Configuration Files

Configuration Filename – Description

hadoop-env.sh – Environment variables that are used in the scripts to run Hadoop.
core-site.xml – Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.
hdfs-site.xml – Configuration settings for HDFS daemons: the NameNode, the Secondary NameNode, and the DataNodes.
mapred-site.xml – Configuration settings for MapReduce applications.
yarn-site.xml – Configuration settings for the ResourceManager and NodeManager.
masters – A list of machines (one per line) that each run a Secondary NameNode.
slaves – A list of machines (one per line) that each run a DataNode and a NodeManager.
Hadoop 2.x Configuration Files – Apache Hadoop

Core » core-site.xml
HDFS » hdfs-site.xml
YARN » yarn-site.xml
MapReduce » mapred-site.xml
core-site.xml

-------------------------------------------------core-site.xml-----------------------------------------------------

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

-------------------------------------------------core-site.xml-----------------------------------------------------

fs.defaultFS: the name of the default file system. The URL's authority is used to determine the host, port, etc. for a filesystem.
hdfs-site.xml
---------------------------------------------------------hdfs-site.xml-------------------------------------------------------------

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/edureka/hadoop-2.2.0/hadoop2_data/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/edureka/hadoop-2.2.0/hadoop2_data/hdfs/datanode</value>
  </property>
</configuration>

---------------------------------------------------------hdfs-site.xml-------------------------------------------------------------

dfs.replication: determines the number of replicas of each block kept in HDFS (here, 1).
dfs.permissions: if "true", enables permission checking in HDFS; if "false", permission checking is turned off.
dfs.namenode.name.dir: determines where on the local filesystem the DFS NameNode stores the name table (fsimage).
dfs.datanode.data.dir: determines where on the local filesystem a DFS DataNode stores its blocks.
mapred-site.xml

-----------------------------------------------mapred-site.xml---------------------------------------------------

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

-----------------------------------------------mapred-site.xml---------------------------------------------------

mapreduce.framework.name: the runtime framework for executing MapReduce jobs; can be one of local, classic, or yarn.
yarn-site.xml

-----------------------------------------------yarn-site.xml---------------------------------------------------

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- yarn-site.xml -->
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>

-----------------------------------------------yarn-site.xml---------------------------------------------------

yarn.nodemanager.aux-services: the auxiliary service name.
yarn.nodemanager.aux-services.mapreduce_shuffle.class: the auxiliary service class to use.
All Properties

1. http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-common/core-default.xml

2. http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml

3. http://hadoop.apache.org/docs/r2.2.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml

4. http://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
Slaves and Masters
Two files are used by the startup and shutdown commands:

Slaves

 Contains a list of hosts, one per line, that are to host DataNode and
NodeManager servers.

Masters

 Contains a list of hosts, one per line, that are to host Secondary
NameNode servers.
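For example (the hostnames below are hypothetical), both files are plain lists with one host per line:

# masters – host(s) running the Secondary NameNode
snn.example.com

# slaves – hosts each running a DataNode and a NodeManager
slave01.example.com
slave02.example.com
slave03.example.com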

Per-Process RunTime Environment

hadoop-env.sh sets the per-process runtime environment for each Hadoop JVM, for example the JAVA_HOME parameter.

 This file also offers a way to provide custom parameters for each of the servers.

 hadoop-env.sh is sourced by all of the Hadoop Core scripts, which live in the etc/hadoop directory of the
Hadoop installation (hadoop-2.2.0/etc/hadoop).

 Examples of environment variables that you can specify:

export HADOOP_HEAPSIZE="512"

export HADOOP_DATANODE_HEAPSIZE="128"
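A typical minimal addition is pointing the daemons at the JDK; the path below is an assumption for an OpenJDK install, not a value from the slide:

# In hadoop-env.sh; the JDK path is hypothetical
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64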

Hadoop Daemons

 NameNode daemon
» Runs on the master node of the Hadoop Distributed File System (HDFS)
» Directs DataNodes to perform their low-level I/O tasks

 DataNode daemon
» Runs on each slave machine in the HDFS
» Does the low-level I/O work

 ResourceManager
» Runs on the master node of the data-processing system (MapReduce)
» Global resource scheduler

 NodeManager
» Runs on each slave node of the data-processing system
» Platform for the data-processing tasks

 JobHistoryServer
» Responsible for servicing all job-history-related requests from clients
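Once a cluster is started, the JDK's jps tool gives a quick check of which daemons are running on a node (the process IDs below are illustrative):

$ jps
2881 NameNode
3171 DataNode
3472 ResourceManager
3684 NodeManager
3912 JobHistoryServer
4201 Jps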

Hadoop Web UI Parts
Service | Servers | Default Port Used | Protocol | Description

NameNode WebUI | Master nodes (NameNode and any back-up NameNodes) | 50070 | HTTP | Web UI to look at the current status of HDFS and explore the file system
DataNode | All slave nodes | 50075 | HTTP | Web UI to access status, logs, etc.
ResourceManager WebUI | Cluster-level resource manager | 8088 | HTTP | Web UI for the ResourceManager and for application submissions
NodeManager | Monitors resources on the DataNode | 8042 | HTTP | Node information, list of applications, and list of containers
JobHistoryServer | Status of finished applications | 19888 | HTTP | Logs of important events in MapReduce job execution and associated profiling metrics
Web UI URLs

 NameNode status: http://localhost:50070/dfshealth.jsp

 ResourceManager status: http://localhost:8088/cluster

 MapReduce JobHistory Server status: http://localhost:19888/jobhistory

Annie’s Question
Which of the following files is used to specify the
NameNode's heap size?
a. bashrc
b. hadoop-env.sh
c. hdfs-site.sh
d. core-site.xml
Annie’s Answer

Ans. hadoop-env.sh.
This file specifies environment variables that affect the
JDK used by the Hadoop daemons (bin/hadoop).
Annie’s Question

It is necessary to define all the properties in core-site.xml,
hdfs-site.xml, yarn-site.xml, and mapred-site.xml.
a. TRUE
b. FALSE
Annie’s Answer

Ans. FALSE.
A detailed answer will be given after the next question.
Annie’s Question

Standalone mode uses the default configuration.
a. TRUE
b. FALSE
Annie’s Answer
Ans. TRUE.
In standalone mode, Hadoop runs with the default
configuration (empty configuration files, i.e. no
configuration settings in core-site.xml, hdfs-site.xml,
mapred-site.xml, and yarn-site.xml). If properties are not
defined in the configuration files, Hadoop runs with the
default values for the corresponding properties.
Sample Examples List
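Running the examples jar with no arguments prints the list of bundled example programs (the jar path assumes a Hadoop 2.2.0 install):

hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar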

Running the Teragen Example
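A sketch of the TeraGen invocation (the row count and output path are illustrative); TeraGen writes the requested number of 100-byte rows into HDFS:

hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar teragen 1000 /user/edureka/terasort-input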

Checking the Output
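The output directory can then be inspected with the FS shell (paths follow the TeraGen sketch above):

hadoop fs -ls /user/edureka/terasort-input                      # _SUCCESS marker plus part-m-* files
hadoop fs -cat /user/edureka/terasort-input/part-m-00000 | head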

Annie’s Question

The output of an MR job is stored on HDFS:
a. TRUE
b. FALSE
Annie’s Answer

Ans. TRUE.
It is stored in separate part files, e.g. part-m-00000,
part-m-00001, and so on.
Annie’s Question

To run an MR job, the data should be present on HDFS:
a. TRUE
b. FALSE
Annie’s Answer

Ans. TRUE.
In order to process data in parallel, the data must be on
HDFS so that MR can work on chunks of it in parallel.
Data Loading Techniques and Data Analysis
Data Analysis
» Using Pig
» Using HIVE

HDFS

Data Loading
» Using Flume
» Using Sqoop
» Using Hadoop Copy Commands

Hadoop Copy Commands
put: Copies file(s) from the local file system to the destination file system. It can also read from stdin and write to
the destination file system.

hadoop dfs -put weather.txt hdfs://<target Namenode>

copyFromLocal: Similar to the "put" command, except that the source is restricted to a local file reference.

hadoop dfs -copyFromLocal weather.txt hdfs://<target Namenode>

distcp: Distributed copy, to move data between clusters; used for backup and recovery.

hadoop distcp hdfs://<source NN> hdfs://<target NN>
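A quick round trip with these commands (the NameNode URI and paths are illustrative):

hadoop dfs -put weather.txt hdfs://master:9000/user/edureka/weather.txt
hadoop dfs -ls hdfs://master:9000/user/edureka
hadoop dfs -cat hdfs://master:9000/user/edureka/weather.txt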

Demo on Copy Commands

Data Loading Using Flume
Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of streaming event data. (The demo will be
covered in Module 10.)

[Diagram: Twitter Streaming API → Flume (Twitter Source → Memory Channel → HDFS Sink) → HDFS]
Data Loading Using Sqoop
Apache Sqoop (TM) is a tool designed for efficiently transferring bulk data
between Apache Hadoop and structured data stores such as relational
databases. (The demo will be covered in Module 10.)

 Imports individual tables or entire databases to HDFS.

 Generates Java classes that allow you to interact with your imported data.

 Provides the ability to import from SQL databases straight into your Hive
data warehouse, as the sketch below illustrates.
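A sketch of a Sqoop import (the JDBC URL, username, table name, and target directory are hypothetical):

sqoop import \
  --connect jdbc:mysql://dbhost.example.com/sales \
  --username edureka -P \
  --table call_records \
  --target-dir /user/edureka/call_records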

Annie’s Question
Your website hosts a group of more than 300 sub-
websites. You want analytics on the shopping patterns of
different visitors. What is the best way to collect that
information from the web logs?
a. SQOOP
b. FLUME
Annie’s Answer

Ans. FLUME.

Annie’s Question
You want to join data collected from two sources. One
source, collected from a big database of call records, is
already available in HDFS. The other source is available
in a database table. The best way to move that data into
HDFS is:
a. SQOOP import
b. PIG script
c. Hive Query
Annie’s Answer

Ans. SQOOP import.

Assignment
 Go through the Edureka VM and explore it
 Check the working condition of the Hadoop ecosystem in the Edureka VM

Follow this document to install the Edureka VM.
Further Reading
 Hadoop Cluster Setup

http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-common/ClusterSetup.html

 Hadoop on Amazon AWS EC2

http://www.edureka.in/blog/install-apache-hadoop-cluster/

 Hadoop Hardware Selection

http://blog.cloudera.com/blog/2013/08/how-to-select-the-right-hardware-for-your-new-hadoop-cluster/

http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-Win-1.3.0/bk_cluster-planning-guide/content/ch_hardware-recommendations.html

 Hadoop Cluster Configuration

http://www.edureka.in/blog/hadoop-cluster-configuration-files/
Further Reading
 MapReduce Job Execution

http://www.edureka.in/blog/anatomy-of-a-mapreduce-job-in-apache-hadoop/

 Add/Remove Nodes in a Cluster

http://www.edureka.in/blog/commissioning-and-decommissioning-nodes-in-a-hadoop-cluster/

 Secondary NameNode

https://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Secondary_NameNode
Pre-work for next Class
Set up the Hadoop development environment using the documents present in the LMS.

 Refresh your Java skills using the Java Essential for Hadoop tutorial.

 Review the interview questions on setting up a Hadoop cluster:

http://www.edureka.in/blog/hadoop-interview-questions-hadoop-cluster/
Agenda for Next Class
 Use Cases of MapReduce
 Traditional vs MapReduce Way
 Hadoop 2.x MapReduce Components and Architecture
 YARN Execution Flow
 MapReduce Concepts

Survey
Your feedback is important to us, be it a compliment, a suggestion, or a complaint. It helps us make
the course better!

Please spare a few minutes to take the survey after the webinar.
