
Hadoop Ecosystem

Unit II Chapter 1
Prof. Abhishek. N. Nazare
Contents
• Understanding Hadoop Ecosystem
• Hadoop Distributed File System: HDFS Architecture
• Concept of Blocks in HDFS Architecture
• NameNodes and DataNodes
• The Command-Line Interface
• Using HDFS Files
• Hadoop-Specific File System Types
• HDFS Commands
• The org.apache.hadoop.io package
• HDFS High availability: Features of HDFS.
Understanding Hadoop Ecosystem

• The Hadoop ecosystem can be defined as a comprehensive collection of tools and technologies that can be effectively implemented and deployed to provide Big Data solutions in a cost-effective manner.

• MapReduce and the Hadoop Distributed File System (HDFS) are two core components of the Hadoop ecosystem.

• Along with these two, the ecosystem provides a collection of various elements to support the complete development and deployment of Big Data solutions.

• The fig depicts the elements of the Hadoop ecosystem.

• All these elements enable users to process large datasets in real time and provide tools to support various types of Hadoop projects, schedule jobs, and manage cluster resources.

• The fig depicts how the various elements of Hadoop are involved at the various stages of processing data.

• MapReduce and HDFS provide the necessary services and basic structure to deal with the core requirements of Big Data solutions.

• Other services and tools of the ecosystem provide the environment and components required to build and manage purpose-driven Big Data applications.
Hadoop Distributed File System

Concepts related to HDFS:

1. Very large files – HDFS is a file system designed for storing very large files with streaming data access. "Very large" in this context means files in the range of GB, TB, or even PB in size.
2. Streaming data access – HDFS is built for batch processing. Priority is given to high throughput of data access rather than low latency of data access. A dataset is typically generated or copied from a source, and various analyses are then performed on that dataset over time.
3. Commodity hardware – Hadoop does not require large, exceptionally reliable hardware to run.
4. Low-latency data access – Applications that require access to data in milliseconds do not work well with HDFS. HDFS is therefore optimized for delivering a high volume of data at the expense of latency. HBase is currently the better choice for low-latency access.
5. Lots of small files – Since the NameNode holds the file system metadata in memory, the number of files in a file system is limited by the amount of memory on the server.

HDFS also makes data available to applications for parallel processing.


Features of HDFS:
• It is suitable for distributed storage and processing.
• Hadoop provides a command interface to interact with HDFS.
• The built-in servers of the NameNode and DataNode help users easily check the status of the cluster.
• Streaming access to file system data.
• HDFS provides file permissions and authentication (illustrated in the sketch below).
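As a brief illustration of the last two points, the following is a minimal sketch (not from the original slides) that lists the entries of an HDFS directory together with their permissions, owners, and sizes, using Hadoop's Java FileSystem API. The NameNode URI and the /user/hadoop path are placeholders.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListWithPermissions {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS (or the URI below) points at a running HDFS cluster.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

        // Ask the NameNode for the metadata of each entry under /user/hadoop.
        for (FileStatus status : fs.listStatus(new Path("/user/hadoop"))) {
            System.out.printf("%s %s %s %d %s%n",
                    status.getPermission(),   // e.g. rw-r--r--
                    status.getOwner(),
                    status.getGroup(),
                    status.getLen(),
                    status.getPath());
        }
        fs.close();
    }
}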
HDFS Architecture

Given below is the architecture of a Hadoop File System.

• HDFS follows the master-slave architecture and it has the following elements: a NameNode and a number of DataNodes.
• The NameNode is the master that manages the various DataNodes, as shown in the fig.
NameNode

The NameNode is the commodity hardware that contains the GNU/Linux operating system and the NameNode software. It is software that can be run on commodity hardware.

The system having the NameNode acts as the master server and it does the following tasks (a sketch of the corresponding client-side calls follows this list):

• Manages the file system namespace.
• Stores the metadata for all files and directories in the file system.
• Regulates clients' access to files.
• Executes file system operations such as renaming, closing, and opening files and directories.
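The namespace operations above are what a client triggers through Hadoop's FileSystem API; the NameNode records the resulting metadata changes. A minimal sketch (not part of the original slides; the NameNode URI and paths are placeholders):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOperations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; in practice it comes from fs.defaultFS.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

        // Each call below is a pure metadata operation handled by the NameNode.
        fs.mkdirs(new Path("/user/demo/reports"));                   // create a directory
        fs.rename(new Path("/user/demo/reports"),                    // rename it
                  new Path("/user/demo/archive"));
        boolean exists = fs.exists(new Path("/user/demo/archive"));  // namespace lookup
        System.out.println("Directory exists: " + exists);

        fs.close();
    }
}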
DataNode

The DataNode is commodity hardware having the GNU/Linux operating system and the DataNode software.

For every node (commodity hardware/system) in a cluster, there will be a DataNode.

These nodes manage the data storage of their system.

• DataNodes perform read-write operations on the file systems, as per client requests.
• They also perform operations such as block creation, deletion, and replication according to the instructions of the NameNode.
NN – NameNode DN - DataNode
HDFS Client
A code library that exports the HDFS interface.
 Read a file
– Ask the NameNode for a list of DataNodes that host replicas of the file's blocks
– Contact a DataNode directly and request the transfer
 Write a file
– Ask the NameNode to choose DataNodes to host replicas of the first block of the file
– Organize a pipeline and send the data
– Iterate for the subsequent blocks
 Delete a file and create/delete directories
 Various APIs
– Schedule tasks to where the data are located
– Set the replication factor (number of replicas)
HDFS Client (cont.)
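Continuing the previous slide, here is a minimal sketch (not from the original deck) of what the client library does from an application's point of view, using Hadoop's Java FileSystem API. The NameNode URI and file path are placeholders; the block locations returned by getFileBlockLocations are what the NameNode tells the client before it contacts DataNodes directly.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        Path file = new Path("/user/demo/sample.txt");

        // Write: the NameNode picks DataNodes for each block; the client streams
        // the bytes down a pipeline of those DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Ask the NameNode which DataNodes hold replicas of the file's blocks.
        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("block hosts: " + String.join(",", loc.getHosts()));
        }

        // Read: the client contacts one of those DataNodes directly for the data.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}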
Concepts of Blocks in HDFS Architecture

Block
- Generally, the user data is stored in the files of HDFS.
- A file in the file system is divided into one or more segments and/or stored in individual DataNodes.
- These file segments are called blocks.
- In other words, the minimum amount of data that HDFS can read or write is called a block.
- The default block size is 64 MB, but it can be increased as needed by changing the HDFS configuration (see the sketch below).
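The block size (and replication factor) can also be chosen per file when it is created. A minimal sketch, assuming the Hadoop 2.x property name dfs.blocksize and the standard FileSystem.create overload that accepts a replication factor and block size; the 128 MB value and paths are only examples (newer Hadoop releases default to 128 MB rather than 64 MB):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default, normally set in hdfs-site.xml (dfs.blocksize).
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

        // Per-file override: buffer size 4 KB, replication 3, block size 128 MB.
        try (FSDataOutputStream out = fs.create(
                new Path("/user/demo/big.dat"),
                true,                     // overwrite if it exists
                4096,                     // io buffer size
                (short) 3,                // replication factor
                128L * 1024 * 1024)) {    // block size in bytes
            out.writeBytes("data that will be split into 128 MB blocks");
        }
        fs.close();
    }
}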
Illustration of Hadoop Heartbeat Message
Block report: identifies block replicas
– Block ID, the generation stamp, and the length of each replica
– The first report is sent when the DataNode registers, and a report is then sent every hour

 Heartbeats: messages that indicate availability
– Default interval is three seconds
– A DataNode is considered "dead" if no heartbeat is received for 10 minutes
– Contain information used for space allocation and load balancing:
● Storage capacity
● Fraction of storage in use
● Number of data transfers currently in progress
– The NameNode replies with instructions to the DataNode
– Heartbeats are kept frequent, which supports scalability
To enable reliability, a number of mechanisms for failure management are needed; some of them are already used within HDFS, while others are still in the process of being implemented.

 Monitoring – The DataNodes and the NameNode communicate through continuous signals (heartbeats). If the signal is not heard by either of the two, the node is considered to have failed: it is no longer used and is replaced by a replica, with a corresponding change in the replication scheme (see the sketch after this list).
 Rebalancing – In this process, blocks are shifted from one location to another wherever free space is available. Better performance is achieved as the demand for data increases, as well as when the demand for replication increases in the face of frequent node failures.
 Metadata replication – The metadata files are prone to failure, hence replicas of the corresponding files are maintained on the same HDFS.
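As a small illustration of how the replication scheme can be adjusted, the following sketch (not from the original slides) changes the replication factor of an existing file through the Java API; the path and the factor of 4 are placeholders. HDFS then creates or removes replicas in the background to match the new target.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AdjustReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(
                URI.create("hdfs://localhost:9000"), new Configuration());
        Path file = new Path("/user/demo/sample.txt");

        // Request 4 replicas for this file; the NameNode schedules the extra
        // copies (or deletions) asynchronously.
        boolean accepted = fs.setReplication(file, (short) 4);
        System.out.println("Replication change accepted: " + accepted);

        // Current replication factor as recorded in the file's metadata.
        System.out.println("Now: " + fs.getFileStatus(file).getReplication());
        fs.close();
    }
}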
Hadoop-Specific File System Types
File System | URI Scheme | Java Implementation | Definition
Local | file | fs.LocalFileSystem | A file system for a locally connected disk with client-side checksums.
HDFS | hdfs | hdfs.DistributedFileSystem | Hadoop's distributed file system. HDFS is designed to work efficiently with MapReduce.
HFTP | hftp | hdfs.HftpFileSystem | A file system providing read-only access to HDFS (no connection with FTP).
HSFTP | hsftp | hdfs.HsftpFileSystem | A file system providing read-only access to HDFS over HTTPS.
HAR | har | fs.HarFileSystem | A file system layered on another file system for archiving files.
KFS (CloudStore) | kfs | fs.kfs.KosmosFileSystem | CloudStore is a distributed file system like HDFS or Google's GFS, written in C++.
FTP | ftp | fs.ftp.FTPFileSystem | A file system backed by an FTP server.
S3 (native) | s3n | fs.s3native.NativeS3FileSystem | A file system backed by Amazon S3.
S3 (block-based) | s3 | fs.s3.S3FileSystem | A file system backed by Amazon S3, which stores files in blocks (much like HDFS) to overcome S3's 5 GB file size limit.
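The URI scheme is what selects the implementation at run time: FileSystem.get() inspects the scheme and returns the matching class from the table above. A minimal sketch (the addresses are placeholders):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class SchemeLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // hdfs:// URI -> hdfs.DistributedFileSystem
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://localhost:9000/"), conf);
        System.out.println("hdfs scheme -> " + hdfs.getClass().getName());

        // file:// URI -> fs.LocalFileSystem (local disk with client-side checksums)
        FileSystem local = FileSystem.get(URI.create("file:///"), conf);
        System.out.println("file scheme -> " + local.getClass().getName());
    }
}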
HDFS Commands and their description
appendToFile
Description: Appends a single src, or multiple srcs, from the local file system to the destination file system. Also reads input from stdin and appends to the destination file system.
Syntax: hdfs dfs -appendToFile <localsrc> ... <dst>
Example: hadoop fs -appendToFile localfile /user/hadoop/hadoopfile
Exit code: Returns 0 on success and 1 on error.

cat
Description: Copies source paths to stdout.
Syntax: hadoop fs -cat URI [URI ...]
Example: hadoop fs -cat hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2
Exit code: Returns 0 on success and -1 on error.

checksum
Description: Returns the checksum information of a file.
Syntax: hadoop fs -checksum URI
Example: hadoop fs -checksum hdfs://nn1.example.com/file1

chgrp
Description: Changes the group association of files. The user must be the owner of the files, or else a super-user. Additional information is in the Permissions Guide. The -R option makes the change recursively through the directory structure.
Syntax: hadoop fs -chgrp [-R] GROUP URI [URI ...]

chmod
Description: Changes the permissions of files.
Syntax: hadoop fs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI ...]

chown
Description: Changes the owner of files.
Syntax: hadoop fs -chown [-R] [OWNER][:[GROUP]] URI [URI ...]

copyFromLocal
Description: Similar to the put command, except that the source is restricted to a local file reference.
Syntax: hadoop fs -copyFromLocal <localsrc> URI

copyToLocal
Description: Similar to the get command, except that the destination is restricted to a local file reference.
Syntax: hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>

count
Description: Counts the number of directories, files, and bytes under the paths that match the specified file pattern. The output columns with -count are: DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME. The -h option shows sizes in human-readable format. The -v option displays a header line.
Syntax: hadoop fs -count [-q] [-h] [-v] <paths>
Example: hdfs dfs -count -q -h -v hdfs://nn1.example.com/file1
Exit code: Returns 0 on success and -1 on error.

get
Description: Used for copying files to the local file system.
Syntax: hdfs dfs -get [-ignorecrc] [-crc] <src> <localdst>

mkdir
Description: Used to create a directory in an HDFS environment.
Syntax: hdfs dfs -mkdir [-p] <paths>

mv
Description: Used for moving a file from one directory to another within the HDFS file system.
Example: hadoop fs -mv /user/hadoop/sample1.txt /user/text/

rm
Description: Used for removing a file from the HDFS file system.
Syntax: hadoop fs -rm [-f] [-r|-R] [-skipTrash] URI [URI ...]
Example: hadoop fs -rm -r /user/test/sample.txt

-rm: Only files can be removed; directories cannot be deleted with this option alone.
-rm -r: Recursively removes directories and files.
-skipTrash: Bypasses the trash and immediately deletes the source.
-f: Does not report an error if the file does not exist.
-rR: Used to recursively delete directories (same as -r).
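Most of these shell commands have direct counterparts in the Java FileSystem API. A minimal sketch (not from the original slides; the URI and paths are placeholders) mirroring copyFromLocal, copyToLocal, and rm -r:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShellEquivalents {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(
                URI.create("hdfs://localhost:9000"), new Configuration());

        // copyFromLocal: local file -> HDFS
        fs.copyFromLocalFile(new Path("/tmp/localfile"),
                             new Path("/user/hadoop/hadoopfile"));

        // copyToLocal: HDFS -> local file system
        fs.copyToLocalFile(new Path("/user/hadoop/hadoopfile"),
                           new Path("/tmp/copyback"));

        // rm -r: recursive delete (true = recursive)
        fs.delete(new Path("/user/test"), true);

        fs.close();
    }
}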
Interfaces of the org.apache.hadoop.io Package and their Description

Interface | Description
RawComparator<T> | A Comparator that operates directly on byte representations of objects.
Stringifier<T> | The Stringifier interface offers two methods: one to convert an object to a string representation, and one to restore the object from its string representation.
Writable | A serializable object which implements a simple, efficient serialization protocol, based on DataInput and DataOutput.
WritableComparable<T> | A Writable which is also Comparable.
WritableFactory | A factory for a class of Writable.
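The Writable and WritableComparable contracts are small: write(DataOutput) and readFields(DataInput), plus compareTo when the type is used as a key. A minimal sketch of a custom key type (the class name and fields are hypothetical, not part of the Hadoop library):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Hypothetical example type: a (userId, timestamp) key.
public class EventKey implements WritableComparable<EventKey> {
    private long userId;
    private long timestamp;

    public EventKey() { }                       // required no-arg constructor

    public EventKey(long userId, long timestamp) {
        this.userId = userId;
        this.timestamp = timestamp;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(userId);                  // serialize fields in a fixed order
        out.writeLong(timestamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        userId = in.readLong();                 // deserialize in the same order
        timestamp = in.readLong();
    }

    @Override
    public int compareTo(EventKey other) {      // ordering used when this is a key
        int c = Long.compare(userId, other.userId);
        return (c != 0) ? c : Long.compare(timestamp, other.timestamp);
    }
}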


Classes of the org.apache.hadoop.io Package and their Description

Class | Description
AbstractMapWritable | Abstract base class for MapWritable and SortedMapWritable. Unlike org.apache.nutch.crawl.MapWritable, this class allows creation of MapWritable<Writable, MapWritable> so the CLASS_TO_ID and ID_TO_CLASS maps travel with the class instead of being static.
ArrayFile | A dense file-based mapping from integers to values.
ArrayPrimitiveWritable | This is a wrapper class.
ArrayWritable | A Writable for arrays containing instances of a class.
BinaryComparable | Interface supported by WritableComparable types supporting ordering/permutation by a representative set of bytes.
BloomMapFile | This class extends MapFile and provides very much the same functionality.
BooleanWritable | A WritableComparable for booleans.
BytesWritable | A byte sequence that is usable as a key or value.
ByteWritable | A WritableComparable for a single byte.
CompressedWritable | A base class for Writables which store themselves compressed and lazily inflate on field access.
DataOutputOutputStream | OutputStream implementation that wraps a DataOutput.
DefaultStringifier<T> | DefaultStringifier is the default implementation of the Stringifier interface, which stringifies objects using base64 encoding of their serialized form.
Classes of the org.apache.hadoop.io Package and their Description (contd.)

Class | Description
DoubleWritable | Writable for Double values.
ElasticByteBufferPool | This is a simple ByteBufferPool which just creates ByteBuffers as needed.
EnumSetWritable<E extends Enum<E>> | A Writable wrapper for EnumSet.
FloatWritable | A WritableComparable for floats.
GenericWritable | A wrapper for Writable instances.
IntWritable | A WritableComparable for ints.
IOUtils | A utility class for I/O related functionality.
LongWritable | A WritableComparable for longs.
MapFile | A file-based map from keys to values.
MapWritable | A Writable Map.
MD5Hash | A Writable for MD5 hash values.


Classes of the org.apache.hadoop.io Package and their Description (contd.)

Class | Description
NullWritable | Singleton Writable with no data.
ObjectWritable | A polymorphic Writable that writes an instance with its class name.
SequenceFile | SequenceFiles are flat files consisting of binary key/value pairs.
SetFile | A file-based set of keys.
ShortWritable | A WritableComparable for shorts.
SortedMapWritable | A Writable SortedMap.
Text | This class stores text using standard UTF-8 encoding.
TwoDArrayWritable | A Writable for 2D arrays containing a matrix of instances of a class.
VersionedWritable | A base class for Writables that provides version checking.
VIntWritable | A WritableComparable for integer values stored in variable-length format.
VLongWritable | A WritableComparable for longs in a variable-length format.
WritableComparator | A Comparator for WritableComparables.
WritableFactories | Factories for non-public writables.


Exception | Description
MultipleIOException | Encapsulates a list of IOExceptions into a single IOException.
VersionMismatchException | Thrown by VersionedWritable.readFields(DataInput) when the version of an object being read does not match the current implementation version as returned by VersionedWritable.getVersion().
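Many of these classes appear together in practice. The following is a minimal sketch (not from the original slides) that writes Text/IntWritable pairs into a SequenceFile and reads them back, assuming the classic SequenceFile.createWriter(FileSystem, Configuration, Path, keyClass, valueClass) overload; the path is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/demo.seq");

        // Write binary key/value pairs: Text keys, IntWritable values.
        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, path, Text.class, IntWritable.class);
        try {
            writer.append(new Text("alpha"), new IntWritable(1));
            writer.append(new Text("beta"), new IntWritable(2));
        } finally {
            writer.close();
        }

        // Read the pairs back in the order they were written.
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
            Text key = new Text();
            IntWritable value = new IntWritable();
            while (reader.next(key, value)) {
                System.out.println(key + " = " + value.get());
            }
        } finally {
            reader.close();
        }
    }
}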
