
Hadoop Ecosystem

Unit II Chapter 1
Prof. Abhishek. N. Nazare
Contents
• Understanding Hadoop Ecosystem
• Hadoop Distributed File System: HDFS Architecture
• Concept of Blocks in HDFS Architecture
• NameNodes and DataNodes
• The Command-Line Interface
• Using HDFS Files
• Hadoop-Specific File System Types
• HDFS Commands
• The org.apache.hadoop.io package
• HDFS High availability: Features of HDFS.
Understanding Hadoop Ecosystem

• The Hadoop ecosystem can be defined as a comprehensive collection of tools and technologies that can be effectively implemented and deployed to provide Big Data solutions in a cost-effective manner.

• MapReduce and the Hadoop Distributed File System (HDFS) are two core components of the Hadoop ecosystem.

• Along with these two, the ecosystem provides a collection of various elements to support the complete development and deployment of Big Data solutions.

• The fig depicts the elements of the Hadoop ecosystem.

• All these elements enable users to process large datasets in real time and provide tools to support various types of Hadoop projects, schedule jobs, and manage cluster resources.

• The fig depicts how the various elements of Hadoop are involved at the various stages of processing data.

• MapReduce and HDFS provide the necessary services and basic structure to deal with the core requirements of Big Data solutions.

• Other services and tools of the ecosystem provide the environment and components required to build and manage purpose-driven Big Data applications.
Hadoop Distributed File System

Concepts related to HDFS:

1. Very large files – HDFS is a file system designed for storing very large files with streaming data access. "Very large" in this context means files in the range of GB, TB, or even PB in size.
2. Streaming data access – HDFS is built for batch processing. Priority is given to high throughput of data access rather than low latency of data access. A dataset is typically generated or copied from a source, and various analyses are then performed on that dataset over time.
3. Commodity hardware – Hadoop does not require large, exceptionally reliable hardware to run.
4. Low-latency data access – Applications that require access to data in milliseconds do not work well with HDFS. HDFS is therefore optimized for delivering a high volume of data at the expense of latency. HBase is currently the better choice for low-latency access.
5. Lots of small files – Since the NameNode holds the file system metadata in memory, the number of files in a file system is limited by the amount of memory on the server.

HDFS also makes data available to applications for parallel processing.


Features of HDFS:
• It is suitable for distributed storage and processing.
• Hadoop provides a command interface to interact with HDFS.
• The built-in servers of the NameNode and DataNode help users easily check the status of the cluster.
• Streaming access to file system data.
• HDFS provides file permissions and authentication (illustrated in the sketch below).
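As a brief illustration of the last two points, the following is a minimal sketch (not from the original slides) that lists the entries of an HDFS directory together with their permissions, owners, and sizes, using Hadoop's Java FileSystem API. The NameNode URI and the /user/hadoop path are placeholders.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListWithPermissions {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS (or the URI below) points at a running HDFS cluster.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

        // Ask the NameNode for the metadata of each entry under /user/hadoop.
        for (FileStatus status : fs.listStatus(new Path("/user/hadoop"))) {
            System.out.printf("%s %s %s %d %s%n",
                    status.getPermission(),   // e.g. rw-r--r--
                    status.getOwner(),
                    status.getGroup(),
                    status.getLen(),
                    status.getPath());
        }
        fs.close();
    }
}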
HDFS Architecture

Given below is the architecture of a Hadoop File System.

• HDFS follows the master-slave architecture and it has the following elements: a NameNode and a number of DataNodes.
• The NameNode is the master that manages the various DataNodes, as shown in the fig.
NameNode

The NameNode is the commodity hardware that contains the GNU/Linux operating system and the NameNode software. It is software that can be run on commodity hardware.

The system having the NameNode acts as the master server and it does the following tasks (a sketch of the corresponding client-side calls follows this list):

• Manages the file system namespace.
• Stores the metadata for all files and directories in the file system.
• Regulates clients' access to files.
• Executes file system operations such as renaming, closing, and opening files and directories.
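The namespace operations above are what a client triggers through Hadoop's FileSystem API; the NameNode records the resulting metadata changes. A minimal sketch (not part of the original slides; the NameNode URI and paths are placeholders):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOperations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; in practice it comes from fs.defaultFS.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

        // Each call below is a pure metadata operation handled by the NameNode.
        fs.mkdirs(new Path("/user/demo/reports"));                   // create a directory
        fs.rename(new Path("/user/demo/reports"),                    // rename it
                  new Path("/user/demo/archive"));
        boolean exists = fs.exists(new Path("/user/demo/archive"));  // namespace lookup
        System.out.println("Directory exists: " + exists);

        fs.close();
    }
}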
DataNode

The DataNode is commodity hardware having the GNU/Linux operating system and the DataNode software.

For every node (commodity hardware/system) in a cluster, there will be a DataNode.

These nodes manage the data storage of their system.

• DataNodes perform read-write operations on the file systems, as per client requests.
• They also perform operations such as block creation, deletion, and replication according to the instructions of the NameNode.
NN – NameNode DN - DataNode
HDFS Client
A code library that exports the HDFS interface.
 Read a file
– Ask the NameNode for a list of DataNodes that host replicas of the file's blocks
– Contact a DataNode directly and request the transfer
 Write a file
– Ask the NameNode to choose DataNodes to host replicas of the first block of the file
– Organize a pipeline and send the data
– Iterate for the subsequent blocks
 Delete a file and create/delete directories
 Various APIs
– Schedule tasks to where the data are located
– Set the replication factor (number of replicas)
HDFS Client (cont.)
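Continuing the previous slide, here is a minimal sketch (not from the original deck) of what the client library does from an application's point of view, using Hadoop's Java FileSystem API. The NameNode URI and file path are placeholders; the block locations returned by getFileBlockLocations are what the NameNode tells the client before it contacts DataNodes directly.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        Path file = new Path("/user/demo/sample.txt");

        // Write: the NameNode picks DataNodes for each block; the client streams
        // the bytes down a pipeline of those DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Ask the NameNode which DataNodes hold replicas of the file's blocks.
        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("block hosts: " + String.join(",", loc.getHosts()));
        }

        // Read: the client contacts one of those DataNodes directly for the data.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}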
Concepts of Blocks in HDFS Architecture

Block
- Generally, the user data is stored in the files of HDFS.
- A file in the file system is divided into one or more segments and/or stored in individual DataNodes.
- These file segments are called blocks.
- In other words, the minimum amount of data that HDFS can read or write is called a block.
- The default block size is 64 MB, but it can be increased as needed by changing the HDFS configuration (see the sketch below).
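The block size (and replication factor) can also be chosen per file when it is created. A minimal sketch, assuming the Hadoop 2.x property name dfs.blocksize and the standard FileSystem.create overload that accepts a replication factor and block size; the 128 MB value and paths are only examples (newer Hadoop releases default to 128 MB rather than 64 MB):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default, normally set in hdfs-site.xml (dfs.blocksize).
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

        // Per-file override: buffer size 4 KB, replication 3, block size 128 MB.
        try (FSDataOutputStream out = fs.create(
                new Path("/user/demo/big.dat"),
                true,                     // overwrite if it exists
                4096,                     // io buffer size
                (short) 3,                // replication factor
                128L * 1024 * 1024)) {    // block size in bytes
            out.writeBytes("data that will be split into 128 MB blocks");
        }
        fs.close();
    }
}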
Illustration of Hadoop Heartbeat Message
Block report: identifies block replicas
– Block ID, the generation stamp, and the length of each replica
– The first report is sent when the DataNode registers, and a report is then sent every hour

 Heartbeats: messages that indicate availability
– Default interval is three seconds
– A DataNode is considered "dead" if no heartbeat is received for 10 minutes
– Contain information used for space allocation and load balancing:
● Storage capacity
● Fraction of storage in use
● Number of data transfers currently in progress
– The NameNode replies with instructions to the DataNode
– Heartbeats are kept frequent, which supports scalability
To enable reliability, a number of mechanisms for failure management are needed; some of them are already used within HDFS, while others are still in the process of being implemented.

 Monitoring – The DataNodes and the NameNode communicate through continuous signals (heartbeats). If the signal is not heard by either of the two, the node is considered to have failed: it is no longer used and is replaced by a replica, with a corresponding change in the replication scheme (see the sketch after this list).
 Rebalancing – In this process, blocks are shifted from one location to another wherever free space is available. Better performance is achieved as the demand for data increases, as well as when the demand for replication increases in the face of frequent node failures.
 Metadata replication – The metadata files are prone to failure, hence replicas of the corresponding files are maintained on the same HDFS.
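As a small illustration of how the replication scheme can be adjusted, the following sketch (not from the original slides) changes the replication factor of an existing file through the Java API; the path and the factor of 4 are placeholders. HDFS then creates or removes replicas in the background to match the new target.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AdjustReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(
                URI.create("hdfs://localhost:9000"), new Configuration());
        Path file = new Path("/user/demo/sample.txt");

        // Request 4 replicas for this file; the NameNode schedules the extra
        // copies (or deletions) asynchronously.
        boolean accepted = fs.setReplication(file, (short) 4);
        System.out.println("Replication change accepted: " + accepted);

        // Current replication factor as recorded in the file's metadata.
        System.out.println("Now: " + fs.getFileStatus(file).getReplication());
        fs.close();
    }
}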
Hadoop-Specific File System Types
File System | URI Scheme | Java Implementation | Definition
Local | file | fs.LocalFileSystem | A file system for a locally connected disk with client-side checksums.
HDFS | hdfs | hdfs.DistributedFileSystem | Hadoop's distributed file system. HDFS is designed to work efficiently with MapReduce.
HFTP | hftp | hdfs.HftpFileSystem | A file system providing read-only access to HDFS (no connection with FTP).
HSFTP | hsftp | hdfs.HsftpFileSystem | A file system providing read-only access to HDFS over HTTPS.
HAR | har | fs.HarFileSystem | A file system layered on another file system for archiving files.
KFS (CloudStore) | kfs | fs.kfs.KosmosFileSystem | CloudStore is a distributed file system like HDFS or Google's GFS, written in C++.
FTP | ftp | fs.ftp.FTPFileSystem | A file system backed by an FTP server.
S3 (native) | s3n | fs.s3native.NativeS3FileSystem | A file system backed by Amazon S3.
S3 (block-based) | s3 | fs.s3.S3FileSystem | A file system backed by Amazon S3, which stores files in blocks (much like HDFS) to overcome S3's 5 GB file size limit.
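The URI scheme is what selects the implementation at run time: FileSystem.get() inspects the scheme and returns the matching class from the table above. A minimal sketch (the addresses are placeholders):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class SchemeLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // hdfs:// URI -> hdfs.DistributedFileSystem
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://localhost:9000/"), conf);
        System.out.println("hdfs scheme -> " + hdfs.getClass().getName());

        // file:// URI -> fs.LocalFileSystem (local disk with client-side checksums)
        FileSystem local = FileSystem.get(URI.create("file:///"), conf);
        System.out.println("file scheme -> " + local.getClass().getName());
    }
}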
HDFS Commands and their description
appendToFile
Description: Appends a single src, or multiple srcs, from the local file system to the destination file system. Also reads input from stdin and appends to the destination file system.
Syntax: hdfs dfs -appendToFile <localsrc> ... <dst>
Example: hadoop fs -appendToFile localfile /user/hadoop/hadoopfile
Exit code: Returns 0 on success and 1 on error.

cat
Description: Copies source paths to stdout.
Syntax: hadoop fs -cat URI [URI ...]
Example: hadoop fs -cat hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2
Exit code: Returns 0 on success and -1 on error.

checksum
Description: Returns the checksum information of a file.
Syntax: hadoop fs -checksum URI
Example: hadoop fs -checksum hdfs://nn1.example.com/file1

chgrp
Description: Changes the group association of files. The user must be the owner of the files, or else a super-user. Additional information is in the Permissions Guide. The -R option makes the change recursively through the directory structure.
Syntax: hadoop fs -chgrp [-R] GROUP URI [URI ...]

chmod
Description: Changes the permissions of files.
Syntax: hadoop fs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI ...]

chown
Description: Changes the owner of files.
Syntax: hadoop fs -chown [-R] [OWNER][:[GROUP]] URI [URI ...]

copyFromLocal
Description: Similar to the put command, except that the source is restricted to a local file reference.
Syntax: hadoop fs -copyFromLocal <localsrc> URI

copyToLocal
Description: Similar to the get command, except that the destination is restricted to a local file reference.
Syntax: hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>

count
Description: Counts the number of directories, files, and bytes under the paths that match the specified file pattern. The output columns with -count are: DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME. The -h option shows sizes in human-readable format. The -v option displays a header line.
Syntax: hadoop fs -count [-q] [-h] [-v] <paths>
Example: hdfs dfs -count -q -h -v hdfs://nn1.example.com/file1
Exit code: Returns 0 on success and -1 on error.

get
Description: Used for copying files to the local file system.
Syntax: hdfs dfs -get [-ignorecrc] [-crc] <src> <localdst>

mkdir
Description: Used to create a directory in an HDFS environment.
Syntax: hdfs dfs -mkdir [-p] <paths>

mv
Description: Used for moving a file from one directory to another within the HDFS file system.
Example: hadoop fs -mv /user/hadoop/sample1.txt /user/text/

rm
Description: Used for removing a file from the HDFS file system.
Syntax: hadoop fs -rm [-f] [-r|-R] [-skipTrash] URI [URI ...]
Example: hadoop fs -rm -r /user/test/sample.txt

-rm: Only files can be removed; directories cannot be deleted with this option alone.
-rm -r: Recursively removes directories and files.
-skipTrash: Bypasses the trash and immediately deletes the source.
-f: Does not report an error if the file does not exist.
-rR: Used to recursively delete directories (same as -r).
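Most of these shell commands have direct counterparts in the Java FileSystem API. A minimal sketch (not from the original slides; the URI and paths are placeholders) mirroring copyFromLocal, copyToLocal, and rm -r:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShellEquivalents {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(
                URI.create("hdfs://localhost:9000"), new Configuration());

        // copyFromLocal: local file -> HDFS
        fs.copyFromLocalFile(new Path("/tmp/localfile"),
                             new Path("/user/hadoop/hadoopfile"));

        // copyToLocal: HDFS -> local file system
        fs.copyToLocalFile(new Path("/user/hadoop/hadoopfile"),
                           new Path("/tmp/copyback"));

        // rm -r: recursive delete (true = recursive)
        fs.delete(new Path("/user/test"), true);

        fs.close();
    }
}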
Interfaces of the org.apache.hadoop.io Package and their Description

Interface | Description
RawComparator<T> | A Comparator that operates directly on byte representations of objects.
Stringifier<T> | The Stringifier interface offers two methods: one to convert an object to a string representation, and one to restore the object from its string representation.
Writable | A serializable object which implements a simple, efficient serialization protocol, based on DataInput and DataOutput.
WritableComparable<T> | A Writable which is also Comparable.
WritableFactory | A factory for a class of Writable.
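The Writable and WritableComparable contracts are small: write(DataOutput) and readFields(DataInput), plus compareTo when the type is used as a key. A minimal sketch of a custom key type (the class name and fields are hypothetical, not part of the Hadoop library):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Hypothetical example type: a (userId, timestamp) key.
public class EventKey implements WritableComparable<EventKey> {
    private long userId;
    private long timestamp;

    public EventKey() { }                       // required no-arg constructor

    public EventKey(long userId, long timestamp) {
        this.userId = userId;
        this.timestamp = timestamp;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(userId);                  // serialize fields in a fixed order
        out.writeLong(timestamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        userId = in.readLong();                 // deserialize in the same order
        timestamp = in.readLong();
    }

    @Override
    public int compareTo(EventKey other) {      // ordering used when this is a key
        int c = Long.compare(userId, other.userId);
        return (c != 0) ? c : Long.compare(timestamp, other.timestamp);
    }
}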


Classes of the org.apache.hadoop.io Package and their Description

Class | Description
AbstractMapWritable | Abstract base class for MapWritable and SortedMapWritable. Unlike org.apache.nutch.crawl.MapWritable, this class allows creation of MapWritable<Writable, MapWritable> so the CLASS_TO_ID and ID_TO_CLASS maps travel with the class instead of being static.
ArrayFile | A dense file-based mapping from integers to values.
ArrayPrimitiveWritable | This is a wrapper class.
ArrayWritable | A Writable for arrays containing instances of a class.
BinaryComparable | Interface supported by WritableComparable types supporting ordering/permutation by a representative set of bytes.
BloomMapFile | This class extends MapFile and provides very much the same functionality.
BooleanWritable | A WritableComparable for booleans.
BytesWritable | A byte sequence that is usable as a key or value.
ByteWritable | A WritableComparable for a single byte.
CompressedWritable | A base class for Writables which store themselves compressed and lazily inflate on field access.
DataOutputOutputStream | OutputStream implementation that wraps a DataOutput.
DefaultStringifier<T> | DefaultStringifier is the default implementation of the Stringifier interface, which stringifies objects using base64 encoding of their serialized form.
Classes of the org.apache.hadoop.io Package and their Description (contd.)

Class | Description
DoubleWritable | Writable for Double values.
ElasticByteBufferPool | This is a simple ByteBufferPool which just creates ByteBuffers as needed.
EnumSetWritable<E extends Enum<E>> | A Writable wrapper for EnumSet.
FloatWritable | A WritableComparable for floats.
GenericWritable | A wrapper for Writable instances.
IntWritable | A WritableComparable for ints.
IOUtils | A utility class for I/O related functionality.
LongWritable | A WritableComparable for longs.
MapFile | A file-based map from keys to values.
MapWritable | A Writable Map.
MD5Hash | A Writable for MD5 hash values.


Classes of the org.apache.hadoop.io Package and their Description (contd.)

Class | Description
NullWritable | Singleton Writable with no data.
ObjectWritable | A polymorphic Writable that writes an instance with its class name.
SequenceFile | SequenceFiles are flat files consisting of binary key/value pairs.
SetFile | A file-based set of keys.
ShortWritable | A WritableComparable for shorts.
SortedMapWritable | A Writable SortedMap.
Text | This class stores text using standard UTF-8 encoding.
TwoDArrayWritable | A Writable for 2D arrays containing a matrix of instances of a class.
VersionedWritable | A base class for Writables that provides version checking.
VIntWritable | A WritableComparable for integer values stored in variable-length format.
VLongWritable | A WritableComparable for longs in a variable-length format.
WritableComparator | A Comparator for WritableComparables.
WritableFactories | Factories for non-public writables.


Exception | Description
MultipleIOException | Encapsulates a list of IOExceptions into a single IOException.
VersionMismatchException | Thrown by VersionedWritable.readFields(DataInput) when the version of an object being read does not match the current implementation version as returned by VersionedWritable.getVersion().
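Many of these classes appear together in practice. The following is a minimal sketch (not from the original slides) that writes Text/IntWritable pairs into a SequenceFile and reads them back, assuming the classic SequenceFile.createWriter(FileSystem, Configuration, Path, keyClass, valueClass) overload; the path is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/demo.seq");

        // Write binary key/value pairs: Text keys, IntWritable values.
        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, path, Text.class, IntWritable.class);
        try {
            writer.append(new Text("alpha"), new IntWritable(1));
            writer.append(new Text("beta"), new IntWritable(2));
        } finally {
            writer.close();
        }

        // Read the pairs back in the order they were written.
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
            Text key = new Text();
            IntWritable value = new IntWritable();
            while (reader.next(key, value)) {
                System.out.println(key + " = " + value.get());
            }
        } finally {
            reader.close();
        }
    }
}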
