Unit 3.1
Hadoop Distributed File System (HDFS)
HDFS - Introduction
The Design of HDFS
HDFS Concepts
• Blocks
• Namenode
• Datanode
• HDFS High-Availability
Blocks
Advantages of HDFS Blocks
• The blocks are of a fixed size, so it is very easy to calculate the number of blocks that can be stored on a disk (see the worked example after this list).
• The HDFS block concept simplifies storage management on the datanodes.
• Datanodes do not need to be concerned with block metadata such as file permissions; the namenode maintains the metadata of all the blocks.
• If the size of a file is less than the HDFS block size, the file does not occupy the complete block's storage.
• Because a file is chunked into blocks, it is easy to store a file that is larger than any single disk: the data blocks are distributed and stored on multiple nodes in the Hadoop cluster.
• Blocks are easy to replicate between the datanodes, and thus provide fault tolerance and high availability.
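For example, assuming the common default block size of 128 MB, a 500 MB file is stored as ceil(500 / 128) = 4 blocks, and the last block holds only 116 MB of data rather than a full block's worth.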
Namenode (HDFS Master)
Functions of HDFS NameNode
Datanode
• This is the daemon that runs on the slave nodes. These are the actual worker nodes that store the data.
• There can be any number of slaves, or DataNodes, in the Hadoop Distributed File System, and they manage the storage of data.
• They perform block creation, deletion, and replication upon instruction from the NameNode.
• Once a block is written on a DataNode, the DataNode replicates it to another DataNode, and the process continues until the required number of replicas is created.
• DataNodes run on commodity hardware with an average configuration.
HDFS Architecture
HDFS Data Write Operation
HDFS Data Read Operation
1. To open the required file, the client calls the open() method on the FileSystem object.
2. DistributedFileSystem then calls the NameNode using RPC to get the locations of the first few blocks of the file.
3. DistributedFileSystem returns an FSDataInputStream to the client, from which the client can read the data.
4. The client then calls the read() method on the FSDataInputStream object.
5. Upon reaching the end of a block, DFSInputStream closes the connection with that DataNode and finds the best-suited DataNode for the next block; when the end of the file is reached, the client closes the stream. (A Java sketch of this flow follows.)
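A minimal Java sketch of this read path, assuming a reachable HDFS configuration; the path /user/demo/input.txt is a hypothetical example:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // step 1: obtain the FileSystem object
        Path file = new Path("/user/demo/input.txt");  // hypothetical path
        try (FSDataInputStream in = fs.open(file)) {   // steps 2-3: open() returns an FSDataInputStream
            byte[] buffer = new byte[4096];
            int bytesRead;
            while ((bytesRead = in.read(buffer)) > 0) { // step 4: read() pulls data from the datanodes
                System.out.write(buffer, 0, bytesRead);
            }
        }                                              // step 5: closing releases the DataNode connection
        System.out.flush();
    }
}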
Command-Line Interface
• The File System (FS) shell includes various shell-like commands that
directly interact with the Hadoop Distributed File System (HDFS) as
well as other file systems that Hadoop supports, such as Local FS,
HFTP FS, S3 FS, and others.
• https://hadoop.apache.org/docs/r2.7.2/hadoop-project-
dist/hadoop-common/FileSystemShell.html
Hadoop HDFS Commands
• With the help of HDFS commands, we can perform Hadoop HDFS file operations such as the following (see the examples after this list):
• changing file permissions,
• viewing file contents,
• creating files or directories,
• copying a file/directory from the local file system to HDFS or vice versa, etc.
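A few illustrative commands (the paths and file names are hypothetical):

hdfs dfs -chmod 644 /user/demo/file.txt   # change file permissions
hdfs dfs -cat /user/demo/file.txt         # view file contents
hdfs dfs -mkdir /user/demo/newdir         # create a directory
hdfs dfs -put local.txt /user/demo/       # copy from the local file system to HDFS
hdfs dfs -get /user/demo/file.txt .       # copy from HDFS to the local file system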
Hadoop HDFS Commands
14. fsck - The Hadoop fsck command is used to check the health of the HDFS file system.
$ hadoop fsck <path> -files
15. help - help shows help for all the commands or for the specified command, as shown below.
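For example:

hadoop fs -help      # help for all commands
hadoop fs -help ls   # help for the ls command only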
Hadoop file system interfaces
• Thrift
• The Thrift API comes with a number of pre-generated stubs for a variety of languages, including C++, Perl, PHP, Python, and Ruby.
• Thrift has support for versioning, so it is a good choice if you want to access different versions of a Hadoop filesystem from the same client code.
• C
• Hadoop provides a C library called libhdfs that mirrors the Java FileSystem interface.
• The C API is very similar to the Java one, but it typically lags the Java API, so some newer features may not be supported.
• Hadoop comes with prebuilt libhdfs binaries for 32-bit Linux.
• FUSE
• Filesystem in Userspace (FUSE) allows filesystems that are implemented in user
space to be integrated as a Unix filesystem.
• Hadoop’s Fuse-DFS contrib module allows any Hadoop filesystem (but typically
HDFS) to be mounted as a standard filesystem.
• You can then use Unix utilities (such as ls and cat) to interact with the filesystem.
• WebDAV
• WebDAV is a set of extensions to HTTP to support editing and updating files.
• WebDAV shares can be mounted as filesystems on most operating systems, so by
exposing HDFS over WebDAV, it's possible to access HDFS as a standard filesystem.
• HTTP
• HDFS defines a read-only interface for retrieving directory listings and data
over HTTP.
• Directory listings are served by the namenode’s embedded web server
(which runs on port 50070) in XML format, while file data is streamed from
datanodes by their web servers (running on port 50075).
• FTP
• There is an FTP interface to HDFS, which permits the use of the FTP
protocol to interact with HDFS.
• This interface is a convenient way to transfer data into and out of HDFS
using existing FTP clients.
Data Ingest with Flume and Sqoop
• Data ingestion is critical and should be emphasized in any big data project, as the volume of data is usually in terabytes or petabytes, maybe even exabytes.
• Handling huge amounts of data is always challenging and critical. So, rather than writing an application to move data into HDFS, we can use existing tools for ingesting data, because they cover many of the common requirements, i.e., Flume and Sqoop.
Apache Flume
Advantages of Flume
Apache Sqoop
Why do we need Sqoop?
Sqoop vs. Flume
Sqoop:
• Used for importing data from structured data sources such as RDBMS.
• Has a connector-based architecture: connectors know how to connect to the respective data source and fetch the data.
• Data load is not event-driven.
• To import data from structured data sources, one has to use Sqoop commands only, because its connectors know how to interact with structured data sources and fetch data from them.
Flume:
• Used for moving bulk streaming data into HDFS.
• Has an agent-based architecture: code is written (called an 'agent') that takes care of fetching the data.
• Data load can be driven by an event.
• To load streaming data, such as tweets generated on Twitter or the log files of a web server, Flume should be used; Flume agents are built for fetching streaming data.
Hadoop Archives
• Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS blocks more efficiently, thereby reducing namenode memory usage while still allowing transparent access to files.
• Hadoop Archives can be used as input to MapReduce.
• This makes HAR files a good option for storing a large number of small files in HDFS (see the sketch below).
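A typical creation-and-access sketch (the directory and archive names are hypothetical):

hadoop archive -archiveName demo.har -p /user/demo dir1 dir2 /user/demo/archives
hdfs dfs -ls har:///user/demo/archives/demo.har   # browse the archive transparently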
Limitations of HAR Files
• Creating a HAR file makes a copy of the original files, so we need as much additional disk space as the size of the files being archived. We can delete the original files after the archive is created to release disk space.
• Archives are immutable. Once an archive is created, we must re-create the archive to add or remove files.
• HAR files can be used as input to MapReduce, but there is no archive-aware InputFormat that can pack multiple files into a single MapReduce split, so processing lots of small files, even in a HAR file, still requires lots of map tasks, which is inefficient.
Hadoop I/O
Data Integrity
• Every I/O operation on the disk or network carries a small chance of introducing errors into the data that it is reading or writing.
• When the volumes of data flowing through the system are as large as the ones Hadoop is capable of handling, the chance of data corruption occurring is high.
• The usual way of detecting corrupted data is by computing a checksum for the data.
• This technique doesn't offer any way to fix the data; it provides error detection only.
• Note that it is possible that it is the checksum that is corrupt, not the data, but this is very unlikely, because the checksum is much smaller than the data.
• A commonly used error-detecting code is CRC-32, which computes a 32-bit integer checksum for input of any size (see the sketch below).
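A small Java sketch of computing a CRC-32 checksum with the standard library (the input string is arbitrary sample data):

import java.util.zip.CRC32;

public class Crc32Example {
    public static void main(String[] args) {
        byte[] data = "hello hadoop".getBytes();               // arbitrary sample input
        CRC32 crc = new CRC32();
        crc.update(data);                                      // feed the bytes into the checksum
        System.out.printf("CRC-32: 0x%08X%n", crc.getValue()); // 32-bit integer checksum
    }
}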
Data Integrity in HDFS
• HDFS transparently checksums all data written to it and by default verifies checksums when reading data. A separate checksum is created for every io.bytes.per.checksum bytes of data. The default is 512 bytes, and since a CRC-32 checksum is 4 bytes long, the storage overhead is less than 1%.
• Datanodes are responsible for verifying the data they receive before storing the data and its checksum. This applies to data that they receive from clients and from other datanodes during replication. If a corrupted block is detected while a client is reading, the client receives a ChecksumException, a subclass of IOException.
Compression
Serialization
Avro
Avro Data Types
Hadoop Environment
Setting up a Hadoop Cluster
• The number of hosted options is too large to list here, but even if
you choose to build a Hadoop cluster yourself, there are still a
number of installation options:
• Apache tarballs
• Packages - RPM and Debian packages
Cluster Specification
How large should your cluster be?
• There isn’t an exact answer to this question, but the beauty of Hadoop is that
you can start with a small cluster (say, 10 nodes) and grow it as your storage
and computational needs grow.
• For a small cluster, it is usually acceptable to run the namenode and the jobtracker on a single master.
• But as the cluster and the number of files stored in HDFS grow, the namenode needs more memory, so the namenode and jobtracker should be moved onto separate machines.
Network Topology
Rack awareness
CLUSTER SETUP AND INSTALLATION (Single Node)
Recommended Platform
Prerequisites
Install Java
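A typical installation on Ubuntu, assuming OpenJDK 8 (a Java version supported by Hadoop 3.2):

sudo apt-get update
sudo apt-get install openjdk-8-jdk
java -version   # verify the installation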
Configure Password-less SSH
Install SSH:
sudo apt-get install ssh
Generate a key pair:
ssh-keygen -t rsa -P ""
Configure password-less SSH:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Check by SSH to localhost:
ssh localhost
Install Hadoop
• Download Hadoop:
• https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.2.2/hadoop-3.2.2-src.tar.gz
• Note that the -src archive contains only the source code; for a ready-to-run installation, download the binary tarball (hadoop-3.2.2.tar.gz) from the same mirror page.
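A typical unpack sequence, assuming the binary tarball and the ~/hadoop location used in the configuration steps below:

tar -xzf hadoop-3.2.2.tar.gz
mv hadoop-3.2.2 ~/hadoop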
Hadoop Configuration
• hadoop-env.sh - sets environment variables used by the Hadoop daemons, most importantly JAVA_HOME.
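For example, pointing hadoop-env.sh at the JDK (the path shown is a typical Ubuntu location and may differ on your system):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64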
core-site.xml
• This file contains configuration settings common to the Hadoop core, such as the URI of the default filesystem (fs.defaultFS).
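A minimal single-node example, assuming the same file layout as the other configuration files (port 9000 is a common choice):

nano ~/hadoop/etc/hadoop/core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>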
hdfs-site.xml
• This file contains the configuration settings for the HDFS daemons: the NameNode, the Secondary NameNode, and the DataNodes.
nano ~/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
mapred-site.xml
• This file contains the configuration settings for the MapReduce daemons: the JobTracker and the TaskTrackers.
nano ~/hadoop/etc/hadoop/mapred-site.xml
The mapred.job.tracker parameter is a hostname (or IP address) and port pair on which the JobTracker listens for RPC communication.
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
yarn-site.xml
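A minimal single-node yarn-site.xml, assuming the same file layout as the other configuration files; it enables the shuffle service that MapReduce on YARN requires:

nano ~/hadoop/etc/hadoop/yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>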
Open the .bashrc file in the nano editor using the following command:
nano .bashrc
Edit the .bashrc file located in the user's home directory and add the following parameters:
export HADOOP_HOME="/home/hadoop/hadoop"
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
Then run source ~/.bashrc to apply the changes.
Format HDFS
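The namenode must be formatted once before the filesystem is used for the first time:

hdfs namenode -format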
• Start the HDFS services: sbin/start-dfs.sh
• Open the HDFS web console: localhost:9870
Or
• start-all.sh - starts all Hadoop daemons: the namenode, datanodes, the jobtracker, and tasktrackers.
• stop-all.sh - stops all Hadoop daemons.
Hadoop Security
• Data Protection - addresses the question, "How can I encrypt the data at rest and over the wire?"
Authentication
Authorization
• Authorization is the second stage; it defines what individual users can do after they have been authenticated.
• Authorization controls what a particular user can do to a specific file.
• It determines whether the user has permission to access the data or not.
Auditing
Data Protection
Types of Hadoop Security
• Kerberos Security
• HDFS Encryption
• Traffic Encryption
• HDFS file and directory permissions
Kerberos Security
The client performs three steps when using Hadoop with Kerberos (a session sketch follows the list).
• Authentication: In Kerberos, the client first authenticates itself to
the authentication server. The authentication server provides the
timestamped Ticket-Granting Ticket (TGT) to the client.
• Authorization: The client then uses TGT to request a service ticket
from the Ticket-Granting Server.
• Service Request: On receiving the service ticket, the client directly
interacts with the Hadoop cluster daemons such as NameNode and
ResourceManager.
HDFS Encryption
Traffic Encryption
HDFS file and directory permissions
• To authorize the user, Hadoop HDFS checks the file and directory permissions after the user has been authenticated.
• Every file and directory in HDFS has an owner and a group (illustrative commands for setting these follow the list).
• HDFS performs a permission check for the file or directory accessed by the client as follows:
• If the user name of the client process matches the owner of the file or directory, HDFS tests the owner permissions;
• If the group of the file/directory matches any member of the group list of the client process, HDFS tests the group permissions;
• Otherwise, HDFS tests the other permissions of the file/directory.
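Illustrative commands for setting owner, group, and permission bits (the user, group, and path are hypothetical):

hdfs dfs -chown alice:analysts /user/alice/data   # set owner and group
hdfs dfs -chmod 750 /user/alice/data              # owner: rwx, group: r-x, others: none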
Hadoop Administration
HDFS Monitoring
MapReduce job monitoring
Let’s put your knowledge to the test
Q & A Time
We have 10 Minutes for Q&A