Unit 3.1
Hadoop Distributed File System (HDFS)
HDFS - Introduction
The Design of HDFS
HDFS Concepts
• Blocks
• Namenode
• Datanode
• HDFS High-Availability
Blocks
Advantages of HDFS Blocks
• The blocks are of a fixed size, so it is very easy to calculate the number of blocks that can be stored on a disk (see the worked example after this list).
• The HDFS block concept simplifies storage management on the datanodes.
• Datanodes do not need to be concerned with block metadata such as file permissions; the namenode maintains the metadata of all the blocks.
• If the size of a file is less than the HDFS block size, the file does not occupy the complete block's storage.
• Because a file is chunked into blocks, it is easy to store a file that is larger than any single disk: the data blocks are distributed and stored on multiple nodes in the Hadoop cluster.
• Blocks are easy to replicate between the datanodes, and thus provide fault tolerance and high availability.
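For example, assuming the common default block size of 128 MB, a 500 MB file is stored as ceil(500 / 128) = 4 blocks, and the last block holds only 116 MB of data rather than a full block's worth.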
Namenode (HDFS Master)
Functions of HDFS NameNode
Datanode
• This is the daemon that runs on the slave nodes. These are the actual worker nodes that store the data.
• There can be any number of slaves, or DataNodes, in the Hadoop Distributed File System, and they manage the storage of data.
• They perform block creation, deletion, and replication upon instruction from the NameNode.
• Once a block is written on a DataNode, the DataNode replicates it to another DataNode, and the process continues until the required number of replicas is created.
• DataNodes run on commodity hardware with an average configuration.
HDFS Architecture
HDFS Data Write Operation
HDFS Data Read Operation
1. To open the required file, the client calls the open() method on the FileSystem object.
2. DistributedFileSystem then calls the NameNode using RPC to get the locations of the first few blocks of the file.
3. DistributedFileSystem returns an FSDataInputStream to the client, from which the client can read the data.
4. The client then calls the read() method on the FSDataInputStream object.
5. Upon reaching the end of a block, DFSInputStream closes the connection with that DataNode and finds the best-suited DataNode for the next block; when the end of the file is reached, the client closes the stream. (A Java sketch of this flow follows.)
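A minimal Java sketch of this read path, assuming a reachable HDFS configuration; the path /user/demo/input.txt is a hypothetical example:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // step 1: obtain the FileSystem object
        Path file = new Path("/user/demo/input.txt");  // hypothetical path
        try (FSDataInputStream in = fs.open(file)) {   // steps 2-3: open() returns an FSDataInputStream
            byte[] buffer = new byte[4096];
            int bytesRead;
            while ((bytesRead = in.read(buffer)) > 0) { // step 4: read() pulls data from the datanodes
                System.out.write(buffer, 0, bytesRead);
            }
        }                                              // step 5: closing releases the DataNode connection
        System.out.flush();
    }
}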
Command-Line Interface
• The File System (FS) shell includes various shell-like commands that
directly interact with the Hadoop Distributed File System (HDFS) as
well as other file systems that Hadoop supports, such as Local FS,
HFTP FS, S3 FS, and others.
• https://hadoop.apache.org/docs/r2.7.2/hadoop-project-
dist/hadoop-common/FileSystemShell.html
Hadoop HDFS Commands
• With the help of HDFS commands, we can perform Hadoop HDFS file operations such as the following (see the examples after this list):
• changing file permissions,
• viewing file contents,
• creating files or directories,
• copying a file/directory from the local file system to HDFS or vice versa, etc.
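A few illustrative commands (the paths and file names are hypothetical):

hdfs dfs -chmod 644 /user/demo/file.txt   # change file permissions
hdfs dfs -cat /user/demo/file.txt         # view file contents
hdfs dfs -mkdir /user/demo/newdir         # create a directory
hdfs dfs -put local.txt /user/demo/       # copy from the local file system to HDFS
hdfs dfs -get /user/demo/file.txt .       # copy from HDFS to the local file system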
Hadoop HDFS Commands
14. fsck - The Hadoop fsck command is used to check the health of the HDFS file system.
$ hadoop fsck <path> -files
15. help - help shows help for all the commands or for the specified command, as shown below.
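For example:

hadoop fs -help      # help for all commands
hadoop fs -help ls   # help for the ls command only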
Hadoop file system interfaces
• Thrift
• The Thrift API comes with a number of pre-generated stubs for a variety of languages, including C++, Perl, PHP, Python, and Ruby.
• Thrift has support for versioning, so it is a good choice if you want to access different versions of a Hadoop filesystem from the same client code.
• C
• Hadoop provides a C library called libhdfs that mirrors the Java FileSystem interface.
• The C API is very similar to the Java one, but it typically lags the Java API, so some newer features may not be supported.
• Hadoop comes with prebuilt libhdfs binaries for 32-bit Linux.
• FUSE
• Filesystem in Userspace (FUSE) allows filesystems that are implemented in user
space to be integrated as a Unix filesystem.
• Hadoop’s Fuse-DFS contrib module allows any Hadoop filesystem (but typically
HDFS) to be mounted as a standard filesystem.
• You can then use Unix utilities (such as ls and cat) to interact with the filesystem.
• WebDAV
• WebDAV is a set of extensions to HTTP to support editing and updating files.
• WebDAV shares can be mounted as filesystems on most operating systems, so by
exposing HDFS over WebDAV, it's possible to access HDFS as a standard filesystem.
• HTTP
• HDFS defines a read-only interface for retrieving directory listings and data
over HTTP.
• Directory listings are served by the namenode’s embedded web server
(which runs on port 50070) in XML format, while file data is streamed from
datanodes by their web servers (running on port 50075).
• FTP
• There is an FTP interface to HDFS, which permits the use of the FTP
protocol to interact with HDFS.
• This interface is a convenient way to transfer data into and out of HDFS
using existing FTP clients.
Data Ingest with Flume and Sqoop
• Data ingestion is critical and should be emphasized in any big data project, as the volume of data is usually in terabytes or petabytes, maybe even exabytes.
• Handling huge amounts of data is always challenging and critical. So, rather than writing an application to move data into HDFS, we can use existing tools for ingesting data, because they cover many of the common requirements, i.e., Flume and Sqoop.
Apache Flume
Advantages of Flume
Apache Sqoop
Why do we need Sqoop?
Sqoop vs. Flume
Sqoop:
• Used for importing data from structured data sources such as RDBMS.
• Has a connector-based architecture: connectors know how to connect to the respective data source and fetch the data.
• Data load is not event-driven.
• To import data from structured data sources, one has to use Sqoop commands only, because its connectors know how to interact with structured data sources and fetch data from them.
Flume:
• Used for moving bulk streaming data into HDFS.
• Has an agent-based architecture: code is written (called an 'agent') that takes care of fetching the data.
• Data load can be driven by an event.
• To load streaming data, such as tweets generated on Twitter or the log files of a web server, Flume should be used; Flume agents are built for fetching streaming data.
Hadoop Archives
• Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS blocks more efficiently, thereby reducing namenode memory usage while still allowing transparent access to files.
• Hadoop Archives can be used as input to MapReduce.
• This makes HAR files a good option for storing a large number of small files in HDFS (see the sketch below).
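A typical creation-and-access sketch (the directory and archive names are hypothetical):

hadoop archive -archiveName demo.har -p /user/demo dir1 dir2 /user/demo/archives
hdfs dfs -ls har:///user/demo/archives/demo.har   # browse the archive transparently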
Limitations of HAR Files
• Creating a HAR file makes a copy of the original files, so we need as much additional disk space as the size of the files being archived. We can delete the original files after the archive is created to release disk space.
• Archives are immutable. Once an archive is created, we must re-create the archive to add or remove files.
• HAR files can be used as input to MapReduce, but there is no archive-aware InputFormat that can pack multiple files into a single MapReduce split, so processing lots of small files, even in a HAR file, still requires lots of map tasks, which is inefficient.
Hadoop I/O
Data Integrity
• Every I/O operation on the disk or network carries a small chance of introducing errors into the data that it is reading or writing.
• When the volumes of data flowing through the system are as large as the ones Hadoop is capable of handling, the chance of data corruption occurring is high.
• The usual way of detecting corrupted data is by computing a checksum for the data.
• This technique doesn't offer any way to fix the data; it provides error detection only.
• Note that it is possible that it is the checksum that is corrupt, not the data, but this is very unlikely, because the checksum is much smaller than the data.
• A commonly used error-detecting code is CRC-32, which computes a 32-bit integer checksum for input of any size (see the sketch below).
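A small Java sketch of computing a CRC-32 checksum with the standard library (the input string is arbitrary sample data):

import java.util.zip.CRC32;

public class Crc32Example {
    public static void main(String[] args) {
        byte[] data = "hello hadoop".getBytes();               // arbitrary sample input
        CRC32 crc = new CRC32();
        crc.update(data);                                      // feed the bytes into the checksum
        System.out.printf("CRC-32: 0x%08X%n", crc.getValue()); // 32-bit integer checksum
    }
}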
Data Integrity in HDFS
• HDFS transparently checksums all data written to it and by default verifies checksums when reading data. A separate checksum is created for every io.bytes.per.checksum bytes of data. The default is 512 bytes, and since a CRC-32 checksum is 4 bytes long, the storage overhead is less than 1%.
• Datanodes are responsible for verifying the data they receive before storing the data and its checksum. This applies to data that they receive from clients and from other datanodes during replication. If a corrupted block is detected while a client is reading, the client receives a ChecksumException, a subclass of IOException.
Compression
Serialization
Avro
Avro Data Types
Hadoop Environment
Setting up a Hadoop Cluster
• The number of hosted options is too large to list here, but even if
you choose to build a Hadoop cluster yourself, there are still a
number of installation options:
• Apache tarballs
• Packages - RPM and Debian packages
Cluster Specification
How large should your cluster be?
• There isn’t an exact answer to this question, but the beauty of Hadoop is that
you can start with a small cluster (say, 10 nodes) and grow it as your storage
and computational needs grow.
• For a small cluster, it is usually acceptable to run the namenode and the jobtracker on a single master.
• But as the cluster and the number of files stored in HDFS grow, the namenode needs more memory, so the namenode and jobtracker should be moved onto separate machines.
Network Topology
Rack awareness
CLUSTER SETUP AND INSTALLATION (Single Node)
Recommended Platform
Prerequisites
Install Java
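A typical installation on Ubuntu, assuming OpenJDK 8 (a Java version supported by Hadoop 3.2):

sudo apt-get update
sudo apt-get install openjdk-8-jdk
java -version   # verify the installation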
Configure Password-less SSH
Install SSH:
sudo apt-get install ssh
Generate a key pair:
ssh-keygen -t rsa -P ""
Configure password-less SSH:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Check by SSH to localhost:
ssh localhost
Install Hadoop
• Download Hadoop:
• https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.2.2/hadoop-3.2.2-src.tar.gz
• Note that the -src archive contains only the source code; for a ready-to-run installation, download the binary tarball (hadoop-3.2.2.tar.gz) from the same mirror page.
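A typical unpack sequence, assuming the binary tarball and the ~/hadoop location used in the configuration steps below:

tar -xzf hadoop-3.2.2.tar.gz
mv hadoop-3.2.2 ~/hadoop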
Hadoop Configuration
• hadoop-env.sh - sets environment variables used by the Hadoop daemons, most importantly JAVA_HOME.
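For example, pointing hadoop-env.sh at the JDK (the path shown is a typical Ubuntu location and may differ on your system):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64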
core-site.xml
• This file contains configuration settings common to the Hadoop core, such as the URI of the default filesystem (fs.defaultFS).
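A minimal single-node example, assuming the same file layout as the other configuration files (port 9000 is a common choice):

nano ~/hadoop/etc/hadoop/core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>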
hdfs-site.xml
• This file contains the configuration settings for the HDFS daemons: the NameNode, the Secondary NameNode, and the DataNodes.
nano ~/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
mapred-site.xml
• This file contains the configuration settings for the MapReduce daemons: the JobTracker and the TaskTrackers.
nano ~/hadoop/etc/hadoop/mapred-site.xml
The mapred.job.tracker parameter is a hostname (or IP address) and port pair on which the JobTracker listens for RPC communication.
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
yarn-site.xml
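A minimal single-node yarn-site.xml, assuming the same file layout as the other configuration files; it enables the shuffle service that MapReduce on YARN requires:

nano ~/hadoop/etc/hadoop/yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>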
Open the .bashrc file in the nano editor using the following command:
nano .bashrc
Edit the .bashrc file located in the user's home directory and add the following parameters:
export HADOOP_HOME="/home/hadoop/hadoop"
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
Then run source ~/.bashrc to apply the changes.
Format HDFS
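The namenode must be formatted once before the filesystem is used for the first time:

hdfs namenode -format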
• Start the HDFS services: sbin/start-dfs.sh
• Open the HDFS web console: localhost:9870
Or
• start-all.sh - starts all Hadoop daemons: the namenode, datanodes, the jobtracker, and tasktrackers.
• stop-all.sh - stops all Hadoop daemons.
Hadoop Security
• Data Protection - addresses the question, "How can I encrypt the data at rest and over the wire?"
Authentication
Authorization
• Authorization is the second stage; it defines what individual users can do after they have been authenticated.
• Authorization controls what a particular user can do to a specific file.
• It determines whether the user has permission to access the data or not.
Auditing
Data Protection
Types of Hadoop Security
• Kerberos Security
• HDFS Encryption
• Traffic Encryption
• HDFS file and directory permissions
Kerberos Security
The client performs three steps when using Hadoop with Kerberos (a session sketch follows the list).
• Authentication: In Kerberos, the client first authenticates itself to
the authentication server. The authentication server provides the
timestamped Ticket-Granting Ticket (TGT) to the client.
• Authorization: The client then uses TGT to request a service ticket
from the Ticket-Granting Server.
• Service Request: On receiving the service ticket, the client directly
interacts with the Hadoop cluster daemons such as NameNode and
ResourceManager.
HDFS Encryption
Traffic Encryption
HDFS file and directory permissions
• To authorize the user, Hadoop HDFS checks the file and directory permissions after the user has been authenticated.
• Every file and directory in HDFS has an owner and a group (illustrative commands for setting these follow the list).
• HDFS performs a permission check for the file or directory accessed by the client as follows:
• If the user name of the client process matches the owner of the file or directory, HDFS tests the owner permissions;
• If the group of the file/directory matches any member of the group list of the client process, HDFS tests the group permissions;
• Otherwise, HDFS tests the other permissions of the file/directory.
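Illustrative commands for setting owner, group, and permission bits (the user, group, and path are hypothetical):

hdfs dfs -chown alice:analysts /user/alice/data   # set owner and group
hdfs dfs -chmod 750 /user/alice/data              # owner: rwx, group: r-x, others: none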
Hadoop Administration
HDFS Monitoring
MapReduce job monitoring
Let’s put your knowledge to the test
Q & A Time
We have 10 Minutes for Q&A