
LAKIREDDY BALI REDDY COLLEGE OF ENGINEERING

(AUTONOMOUS)
Accredited by NAAC with 'A' Grade & NBA (Under Tier - I), ISO 9001:2015 Certified Institution
Approved by AICTE, New Delhi and Affiliated to JNTUK, Kakinada
L.B. REDDY NAGAR, MYLAVARAM, NTR DIST., A.P.-521 230.
hodcse@lbrce.ac.in, cseoffice@lbrce.ac.in, Phone: 08659-222933, Fax: 08659-222931
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

UNIT – II

Hadoop Distributed File System


TOPICS: The Design of HDFS, HDFS Concepts, Command Line Interface,
Hadoop file system interfaces, Data flow, Data Ingestion with Sqoop and
Hadoop archives, Hadoop I/O: Compression, Serialization, Avro and File-Based
Data structures.

HDFS:
Hadoop comes with a distributed file system called HDFS. In HDFS, data is
distributed over several machines and replicated to ensure durability against
failures and high availability to parallel applications.
It is cost effective as it uses commodity hardware. It involves the concepts of
blocks, data nodes and the name node.

Where to use HDFS


1. Very Large Files: Files should be of hundreds of megabytes, gigabytes or
more.
2. Streaming Data Access: The time to read the whole data set is more important
than the latency of reading the first record. HDFS is built on a write-once,
read-many-times pattern.
3. Commodity Hardware: It works on low-cost hardware.

Where not to use HDFS


1. Low Latency Data Access: Applications that require very low latency to access
the first record should not use HDFS, as it gives importance to reading the whole
data set rather than to the time to fetch the first record.
2. Lots of Small Files: The name node holds the metadata of all files in memory,
so a very large number of small files consumes a disproportionate amount of name
node memory, which is not feasible.
3. Multiple Writes: HDFS should not be used when files must be written to
multiple times or by multiple writers.
HDFS ARCHITECTURE

HDFS Concepts
 BLOCKS
 NAME NODE
 DATA NODE
 SECONDARY NAME NODE
 Blocks: A block is the minimum amount of data that HDFS can read or write.
HDFS blocks are 128 MB by default, and this is configurable. Files in HDFS are
broken into block-sized chunks, which are stored as independent units. Unlike an
ordinary file system, a file in HDFS that is smaller than the block size does not
occupy a full block's worth of storage; for example, a 5 MB file stored in HDFS
with a block size of 128 MB takes only 5 MB of space. The HDFS block size is
large in order to minimize the cost of seeks.
 Name Node: HDFS works in a master-worker pattern where the name node
acts as the master. The name node is the controller and manager of HDFS, as it
knows the status and the metadata of all the files in HDFS; the metadata includes
file permissions, names and the location of each block. The metadata is small, so
it is stored in the memory of the name node, allowing fast access to it. Moreover,
the HDFS cluster is accessed by multiple clients concurrently, so all this
information is handled by a single machine. File system operations like opening,
closing and renaming are executed by it.
 Data Node: Data nodes store and retrieve blocks when they are told to, by a
client or by the name node. They report back to the name node periodically with
the list of blocks that they are storing. The data node, being commodity
hardware, also does the work of block creation, deletion and replication as
directed by the name node. (A short Java sketch follows this list showing how a
client can query this block metadata.)
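The interplay between the name node's metadata and the data nodes' blocks can be
seen from a client program. The following is a minimal illustrative sketch using
Hadoop's Java FileSystem API (not part of the original notes); the path
/user/demo/big.log is a hypothetical example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfoExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/demo/big.log");        // hypothetical file in HDFS

    FileStatus status = fs.getFileStatus(file);
    System.out.println("block size  : " + status.getBlockSize());
    System.out.println("replication : " + status.getReplication());

    // The name node answers this query from its in-memory metadata;
    // each BlockLocation lists the data nodes holding one block.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      System.out.println("offset " + b.getOffset() + " length " + b.getLength()
          + " hosts " + String.join(",", b.getHosts()));
    }
  }
}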

Figure: HDFS DataNode and NameNode

Figure: HDFS read operation

COMMAND LINE INTERFACE


The HDFS can be manipulated through a Java API or through a command-line
interface. The File System (FS) shell includes various shell-like commands that
directly interact with the Hadoop Distributed File System (HDFS) as well as
other file systems that Hadoop supports, such as the Local FS, HFTP FS, S3 FS,
and others. For complete documentation please refer to the link: FileSystemShell.html
 Below are the commands supported (a short Java example after the list shows
how the same commands can be invoked programmatically):
 appendToFile: Append the content of one or more local files to a file in HDFS.
 cat: Copies source paths to stdout.
 checksum: Returns the checksum information of a file.
 chgrp : Change group association of files. The user must be the owner
of files, or else a super-user.
 chmod : Change the permissions of files. The user must be the owner
of the file, or else a super-user.
 chown: Change the owner of files. The user must be a super-user.
 copyFromLocal: Copies files from the local file system (for example, an edge
node) into HDFS; similar to put.
 copyToLocal: Copies files from HDFS to the local file system; similar to get.
 count: Count the number of directories, files and bytes under the paths
that match the specified file pattern.
 cp: Copy files from source to destination. This command allows
multiple sources as well in which case the destination must be a
directory.
 createSnapshot: HDFS Snapshots are read-only point-in-time copies
of the file system. Snapshots can be taken on a subtree of the file
system or the entire file system. Some common use cases of snapshots
are data backup, protection against user errors and disaster recovery.
 deleteSnapshot: Delete a snapshot from a snapshot table directory.
This operation requires the owner privilege of the snapshottable
directory.
 df: Displays free space.
 du: Displays sizes of files and directories contained in the given
directory, or the length of a file in case it is just a file.
 expunge: Empty the Trash.
 find: Finds all files that match the specified expression and applies
selected actions to them. If no path is specified then defaults to
the current working directory. If no expression is specified then
defaults to -print.
 get: Copy files to the local file system.
 getfacl: Displays the Access Control Lists (ACLs) of files and
directories. If a directory has a default ACL, then getfacl also
displays the default ACL.
 getfattr: Displays the extended attribute names and values for a
file or directory.
 getmerge : Takes a source directory and a destination file as input and
concatenates files in src into the destination local file.
 help: Return usage output.
 ls: Lists files and directories.
 lsr: Recursive version of ls.
 mkdir: Takes path URIs as arguments and creates directories.
 moveFromLocal: Similar to put command, except that the source
localsrc is deleted after it’s copied.
 moveToLocal: Displays a “Not implemented yet” message.
 mv: Moves files from source to destination. This command allows
multiple sources as well in which case the destination needs to be
a directory.
 put : Copy single src, or multiple srcs from local file system to the
destination file system. Also reads input from stdin and writes to
destination file system.
 renameSnapshot : Rename a snapshot. This operation requires the
owner privilege of the snapshottable directory.
 rm : Delete files specified as args.
 rmdir : Delete a directory.
 rmr : Recursive version of delete.
 setfacl : Sets Access Control Lists (ACLs) of files and directories.
 setfattr : Sets an extended attribute name and value for a file or
directory.
 setrep: Changes the replication factor of a file. If the path is a
directory then the command recursively changes the replication
factor of all files under the directory tree rooted at the path.
 stat : Print statistics about the file/directory at <path> in the
specified format.
 tail: Displays the last kilobyte of the file to stdout.
 test: Tests whether a path is a directory, exists, is a file, is non-empty,
or has zero length (hadoop fs -test -[defsz] URI).
 text: Takes a source file and outputs the file in text format. The
allowed formats are zip and TextRecordInputStream.
 touchz: Create a file of zero length.
 truncate: Truncate all files that match the specified file pattern to the
specified length.
 usage: Return the help for an individual command.
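The same FS shell commands can also be driven from Java. The sketch below is a
minimal illustration, not taken from these notes: it invokes the shell through
Hadoop's FsShell and ToolRunner classes, and the directory /user/demo is a
hypothetical example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FsShell;
import org.apache.hadoop.util.ToolRunner;

public class FsShellExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
    // Equivalent of: hadoop fs -mkdir -p /user/demo
    ToolRunner.run(conf, new FsShell(conf), new String[] {"-mkdir", "-p", "/user/demo"});
    // Equivalent of: hadoop fs -ls /user/demo
    int exitCode = ToolRunner.run(conf, new FsShell(conf), new String[] {"-ls", "/user/demo"});
    System.exit(exitCode);
  }
}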

Hadoop file system interfaces

HDFS Storage
One of Hadoop’s key features is its implementation of a distributed file system.
A distributed file system offers a logical file system that is implemented over a
cluster of machines rather than on a single machine. Additional details on
HDFS are available on the Apache Hadoop website. Hadoop offers access to
HDFS through a command line interface.
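Besides the command line, the same operations are available through Hadoop's
Java FileSystem API. The following minimal sketch (an illustrative example with
the hypothetical path /user/demo/notes.txt) writes a small file to HDFS and
reads it back; it assumes the cluster configuration files are on the classpath.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();       // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/notes.txt");   // hypothetical HDFS path

    // Write a small file (overwrite if it already exists)
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read it back
    try (BufferedReader in = new BufferedReader(
             new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }
  }
}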

The Syncfusion Big Data Studio offers Windows Explorer-like access to HDFS. It
allows common tasks such as folder and file management to be performed
interactively.
Features

 Create new file - Click the New button in the HDFS tab, enter the name of
the file, and click the Create button in the prompt box to create a new
HDFS file.
 Upload files/folder - You can upload files or folders from your local file
system to HDFS by clicking the Upload button in the HDFS tab. In the
prompt, select File or Folder, browse for the item, and click Upload.
 Set Permission - Click the Permission option in the HDFS tab and a popup
window will be displayed. In the popup window, change the permission mode
and click the Change button to set the permission for files in HDFS.
 Copy - Select the file or folder from the HDFS directory and click the Copy
button in the HDFS tab; in the prompt displayed, enter or browse for the
target directory and click Copy to copy the selected HDFS item.
 Move - Select the file or folder from the HDFS directory and click the Move
button in the HDFS tab; in the prompt displayed, enter or browse for the
target directory and click Move to move the selected HDFS item.
 Rename - Select the file or folder from the HDFS directory, click the
Rename button in the HDFS tab, enter the new name and click Apply to
rename it in HDFS.
 Delete - Select the file or folder from the HDFS directory and click Delete
to perform the delete operation.
 Drag and Drop - Select the file or folder from the HDFS directory and
drag it to another folder in the HDFS directory or to Windows Explorer,
and vice versa.
 HDFS File viewer - You can view the content of HDFS files directly
by double-clicking the file.

Data flow

Input reader
The input reader reads the incoming data and splits it into data blocks of
the appropriate size (64 MB to 128 MB). Each data block is associated with a
Map function. Once the input reader has read the data, it generates the
corresponding key-value pairs. The input files reside in HDFS.

Map function
The Map function processes the incoming key-value pairs and generates the
corresponding output key-value pairs. The Map input and output types may
differ from each other.
Partition function
The partition function assigns the output of each Map function to the
appropriate reducer. It is given the key and value and returns the index
of the reducer.
Shuffling and Sorting
The data is shuffled between and within nodes so that it moves out of the
Map tasks and is ready to be processed by the Reduce function. Sometimes
the shuffling of data can take considerable computation time.
Reduce function
The Reduce function is invoked once for each unique key. The keys arrive in
sorted order. The Reduce function iterates over the values associated with
each key and generates the corresponding output.
Output writer
Once the data has flowed through all the above phases, the output writer
executes. The role of the output writer is to write the Reduce output to
stable storage. (A compact word-count job after this section illustrates
these phases.)
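As a concrete illustration of this data flow, here is the standard word-count
job written as a minimal sketch against Hadoop's Java MapReduce API. It is an
illustrative example rather than part of the original notes; the input and
output paths are supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: each input split is fed to a mapper, which emits (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);   // output pairs go to the partition function
      }
    }
  }

  // Reduce phase: all values for one key arrive together, keys in sorted order.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));   // picked up by the output writer
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));     // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));   // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}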

Data Ingestion
Data ingestion in Hadoop is the beginning of your data pipeline
in a data lake. It means taking data from various siloed
databases and files and putting it into Hadoop.
Sqoop
 Apache Sqoop (a portmanteau of "SQL-to-Hadoop") is an
open-source tool that allows users to extract data from
a structured data store into Hadoop for further
processing. This processing can be done with MapReduce
programs or with other higher-level tools such as Hive,
Pig or Spark.
 Sqoop can automatically create Hive tables from data
imported from an RDBMS (Relational Database Management
System) table.
 Sqoop can also be used to send data from Hadoop to a
relational database, useful for sending results processed
in Hadoop to an operational transaction processing
system.

Sqoop includes tools for the following operations:
 Listing databases and tables on a database system
 Importing a single table from a database system,
including specifying which columns to import and
specifying which rows to import using a WHERE clause
 Importing data from one or more tables using a SELECT
statement
 Incremental imports from a table on a database system
(importing only what has changed since a known
previous state)
 Exporting of data from HDFS to a table on a remote
database system
Sqoop Import
The import tool imports individual tables from RDBMS to HDFS. Each row in a table
is treated as a record in HDFS. All records are stored as text data in text files or as
binary data in Avro and Sequence files.

Sqoop Export
The export tool exports a set of files from HDFS back to an RDBMS. The files given
as input to Sqoop contain records, which are called rows in the table. They are read
and parsed into a set of records, delimited with a user-specified delimiter. (A
sketch of a programmatic import follows.)
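The import tool is normally run from the command line, but it can also be driven
from Java. The following is a hedged sketch, not part of the original notes: the
JDBC URL, credentials, table name and target directory are hypothetical
placeholders, and it assumes Sqoop 1.x, whose org.apache.sqoop.Sqoop class
exposes runTool. An export would pass the "export" tool with an --export-dir
instead.

import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
  public static void main(String[] args) {
    String[] importArgs = {
        "import",                                   // the Sqoop tool to run
        "--connect", "jdbc:mysql://dbhost/sales",   // hypothetical source database
        "--username", "etl_user",
        "--password-file", "/user/etl/.dbpass",     // keep credentials out of the command
        "--table", "orders",                        // RDBMS table to import
        "--target-dir", "/user/etl/orders",         // destination directory in HDFS
        "--num-mappers", "4"                        // parallel map tasks
    };
    int exitCode = Sqoop.runTool(importArgs);
    System.exit(exitCode);
  }
}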

Hadoop I/O: Compression


 In the Hadoop framework, where large data sets are stored and
processed, you will need storage for large files.
 These files are divided into blocks, and those blocks are stored on
different nodes across the cluster, so a lot of I/O and network data
transfer is also involved.
 In order to reduce the storage requirements and the time spent in network
transfer, you can use data compression in the Hadoop framework.
 Using data compression in Hadoop, you can compress files at various
steps; at each of these steps it helps to reduce the storage used and the
quantity of data transferred.
 You can compress the input file itself. That helps you reduce storage
space in HDFS.
 You can also configure the output of a MapReduce job to be compressed
(a configuration sketch follows). That helps in reducing storage space if
you are archiving the output or sending it to some other application for
further processing.
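The following minimal sketch shows one way to request compression from the Java
job configuration. It is an illustrative example, not taken from these notes;
the codec choices (Snappy for map output, gzip for job output) are assumptions,
not a recommendation.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionConfig {
  public static Job newCompressedJob() throws Exception {
    Configuration conf = new Configuration();

    // Compress intermediate map output to shrink shuffle traffic; Snappy favours speed.
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
        SnappyCodec.class, CompressionCodec.class);

    Job job = Job.getInstance(conf, "compressed output demo");

    // Compress the final job output files with gzip.
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    return job;
  }
}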

gzip:
gzip is naturally supported by Hadoop. gzip is based on the DEFLATE algorithm,
which is a combination of LZ77 and Huffman Coding.

bzip2:
bzip2 is a freely available, patent free (see below), high-quality data compressor. It
typically compresses files to within 10% to 15% of the best available techniques (the
PPM family of statistical compressors), whilst being around twice as fast at
compression and six times faster at decompression.

LZO:
The LZO compression format is composed of many smaller (~256K) blocks of
compressed data, allowing jobs to be split along block boundaries. Moreover, it was
designed with speed in mind: it decompresses about twice as fast as gzip, meaning
it’s fast enough to keep up with hard drive read speeds. It doesn’t compress quite as
well as gzip — expect files that are on the order of 50% larger than their gzipped
version. But that is still 20-50% of the size of the files without any compression at all,
which means that IO-bound jobs complete the map phase about four times faster.

Snappy:
Snappy is a compression/decompression library. It does not aim for maximum
compression, or compatibility with any other compression library; instead, it aims for
very high speeds and reasonable compression. For instance, compared to the
fastest mode of zlib, Snappy is an order of magnitude faster for most inputs, but the
resulting compressed files are anywhere from 20% to 100% bigger. On a single core
of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or
more and decompresses at about 500 MB/sec or more. Snappy is widely used inside
Google, in everything from BigTable and MapReduce to Google's internal RPC systems.

Some tradeoffs:
All compression algorithms exhibit a space/time trade-off: faster compression and
decompression speeds usually come at the expense of smaller space savings. The
tools listed above typically give some control over this trade-off at
compression time by offering nine different options: -1 means optimize for speed
and -9 means optimize for space.

The different tools have very different compression characteristics. Gzip is a general
purpose compressor, and sits in the middle of the space/time trade-off. Bzip2
compresses more effectively than gzip, but is slower. Bzip2’s decompression speed
is faster than its compression speed, but it is still slower than the other formats. LZO
and Snappy, on the other hand, both optimize for speed and are around an order of
magnitude faster than gzip, but compress less effectively. Snappy is also
significantly faster than LZO for decompression.
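Regardless of which codec a file was written with, Hadoop's Java API can pick
the right decompressor from the file extension. The sketch below is an
illustrative example (the path /data/logs/events.gz is hypothetical): it reads
a compressed file from HDFS and copies the decompressed bytes to standard output.

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class DecompressExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path input = new Path("/data/logs/events.gz");   // hypothetical compressed file in HDFS

    // Infer the codec from the file extension (.gz, .bz2, ...)
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(input);
    if (codec == null) {
      System.err.println("No codec found for " + input);
      return;
    }

    try (InputStream in = codec.createInputStream(fs.open(input))) {
      IOUtils.copyBytes(in, System.out, 4096, false);   // write decompressed data to stdout
    }
  }
}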

Hadoop I/O: Serialization


 Data serialization is the process of converting structured (in-memory)
data into a stream of bytes; deserialization converts that stream back
to the original form.
 We serialize to translate data structures into a stream of data, which
can then be transmitted over the network or stored in a database
regardless of the system architecture.
 Isn't simply storing information in binary form, or as a stream of
bytes, already the right approach?
 Serialization does the same, but in a way that is not dependent on
the architecture.

Serialization for Storage Formats

 The RPC serialization format is required to be as follows:

o Compact − To make the best use of network bandwidth, which is the most
scarce resource in a data center.
o Fast − Since the communication between the nodes is crucial in
distributed systems, the serialization and deserialization process
should be quick, producing less overhead.
o Extensible − Protocols change over time to meet new requirements, so it
should be straightforward to evolve the protocol in a controlled manner
for clients and servers.
o Interoperable − The message format should support the nodes that are
written in different languages.

Serialization is used in two quite distinct areas of distributed
data processing:
 Interprocess communication
When a client calls a function or subroutine on another machine in the
network (a server), that call is a remote procedure call (RPC).
 Persistent storage
Serialized data is also written to persistent storage. Serialization and
deserialization of data help maintain and manage, for effective use, the
resources and data available in a data warehouse or any other database.
Hadoop's own Writable serialization is more compact than Java's built-in
serialization, but Writable is specific to the Java language. (A short
Writable sketch follows.)
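As a concrete illustration of the Writable interface mentioned above, the
following minimal Java sketch (an example added here, not part of the notes)
serializes an IntWritable to a byte array and reads it back.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;

public class WritableExample {
  // Serialize a Writable into a byte array
  public static byte[] serialize(IntWritable value) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      value.write(out);              // Writable.write(DataOutput)
    }
    return bytes.toByteArray();
  }

  // Deserialize the bytes back into a Writable
  public static IntWritable deserialize(byte[] data) throws IOException {
    IntWritable value = new IntWritable();
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      value.readFields(in);          // Writable.readFields(DataInput)
    }
    return value;
  }

  public static void main(String[] args) throws IOException {
    byte[] data = serialize(new IntWritable(163));
    System.out.println("bytes: " + data.length);               // an IntWritable is 4 bytes
    System.out.println("value: " + deserialize(data).get());   // 163
  }
}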

What is Avro?
Apache Avro is a language-neutral data serialization system.
It was developed by Doug Cutting, the father of Hadoop.
Since Hadoop writable classes lack language portability, Avro
becomes quite helpful, as it deals with data formats that can
be processed by multiple languages. Avro is a preferred tool
to serialize data in Hadoop.
Avro has a schema-based system. A language-independent
schema is associated with its read and write operations.
Avro serializes data with its built-in schema into a compact
binary format, which can be deserialized by any application.
Avro uses JSON format to declare the data structures.
Presently, it supports languages such as Java, C, C++, C#,
Python, and Ruby.
Features of Avro

Listed below are some of the prominent features of Avro −


 Avro is a language-neutral data serialization system.
 It can be processed by many languages (currently C, C++,
C#, Java, Python, and Ruby).
 Avro creates a binary structured format that is
both compressible and splittable. Hence it can be
efficiently used as the input to Hadoop MapReduce jobs.
 Avro provides rich data structures. For example, you can
create a record that contains an array, an enumerated
type, and a sub record. These datatypes can be created
in any language, can be processed in Hadoop, and the
results can be fed to a third language.
 Avro schemas, defined in JSON, facilitate implementation
in the languages that already have JSON libraries.
 Avro creates a self-describing file called an Avro data
file, in which it stores data along with its schema in the
metadata section (as the sketch after this list shows).
 Avro is also used in Remote Procedure Calls (RPCs).
During RPC, client and server exchange schemas in the
connection handshake.
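The following minimal sketch is an illustrative example added to these notes: it
declares a hypothetical User record schema in JSON, writes one record to an Avro
data file, and reads it back without supplying the schema, since the file is
self-describing.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
  // Hypothetical record schema declared in JSON
  private static final String SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
    + "{\"name\":\"name\",\"type\":\"string\"},"
    + "{\"name\":\"age\",\"type\":\"int\"}]}";

  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "asha");
    user.put("age", 21);

    File file = new File("users.avro");

    // Write an Avro data file: the schema is embedded in the file's metadata
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, file);
      writer.append(user);
    }

    // Read it back without supplying the schema: the file is self-describing
    try (DataFileReader<GenericRecord> reader =
             new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
      while (reader.hasNext()) {
        GenericRecord r = reader.next();
        System.out.println(r.get("name") + " " + r.get("age"));
      }
    }
  }
}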
