Chapter 4_Hadoop Ecosystem
Unit II Chapter 1
Prof. Abhishek. N. Nazare
Contents
• Understanding Hadoop Ecosystem
• Hadoop Distributed File System: HDFS Architecture
• Concept of Blocks in HDFS Architecture
• NameNodes and DataNodes
• The Command-Line Interface
• Using HDFS Files
• Hadoop-Specific File System Types
• HDFS Commands
• The org.apache.hadoop.io package
• HDFS High availability: Features of HDFS.
Understanding Hadoop Ecosystem
1. Very large files – HDFS is a file system designed for storing very large files with streaming data access. "Very large" in this context means files that are gigabytes, terabytes, or even petabytes in size.
2. Streaming data access – HDFS is built for batch processing. Priority is given to high throughput of data access rather than low latency. A dataset is typically generated or copied from a source, and various analyses are then performed on that dataset over time.
3. Commodity hardware – Hadoop does not require expensive, highly reliable hardware to run; it is designed to work on clusters of commodity machines.
4. Low-latency data access – Applications that require access to data in milliseconds do not work well with HDFS. HDFS is optimized for delivering a high volume of data at the expense of latency; HBase is currently a better choice for low-latency access.
5. Lots of small files – Since the NameNode holds the file system metadata in memory, the number of files a file system can hold is limited by the amount of memory on the NameNode server.
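The small-files limitation can be put into rough numbers. A commonly cited rule of thumb (an approximation, not a guarantee; the real figure varies by Hadoop version) is that each file, directory, and block object costs on the order of 150 bytes of NameNode heap. A minimal sketch of the arithmetic:

```java
// Rough estimate of NameNode heap needed to track HDFS metadata.
// Assumes the commonly cited ~150 bytes per namespace object; the
// actual cost varies by Hadoop version and configuration.
public class NameNodeMemoryEstimate {
    public static final long BYTES_PER_OBJECT = 150; // rule-of-thumb assumption

    // Each file costs one file object plus one object per block.
    public static long estimateHeapBytes(long files, long blocksPerFile) {
        long objects = files + files * blocksPerFile;
        return objects * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        // 10 million single-block files -> roughly 3 GB of heap
        // consumed by metadata alone.
        long bytes = estimateHeapBytes(10_000_000L, 1);
        System.out.println(bytes / (1024 * 1024) + " MB");
    }
}
```

This is why millions of tiny files strain the NameNode far more than the same data stored in a few large files.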
The system hosting the NameNode acts as the master server. Writing a file proceeds as follows:
– The client asks the NameNode (NN) to choose DataNodes (DNs) to host replicas of the first block of the file.
– The client organizes a pipeline through those DataNodes and sends the data.
– The process is then repeated (iterated) for each subsequent block.
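The pipeline step above can be illustrated with a toy simulation: the client streams a block's packets to the first DataNode, which forwards each packet downstream, so every replica ends up with the same bytes. The packet strings and buffer-backed "DataNodes" here are illustrative stand-ins, not the real wire protocol:

```java
import java.util.ArrayList;
import java.util.List;

// Toy simulation of the HDFS write pipeline: packets flow from the
// client through a chain of DataNodes, each forwarding downstream.
public class WritePipelineSketch {
    // Each "DataNode" is modeled as a buffer collecting the packets it stores.
    public static List<List<String>> pipelineWrite(List<String> packets, int replicas) {
        List<List<String>> dataNodes = new ArrayList<>();
        for (int i = 0; i < replicas; i++) dataNodes.add(new ArrayList<>());
        for (String packet : packets) {
            // client -> DN0 -> DN1 -> DN2 ... each hop forwards the packet
            for (List<String> dn : dataNodes) dn.add(packet);
        }
        return dataNodes;
    }

    public static void main(String[] args) {
        List<List<String>> dns = pipelineWrite(List.of("pkt1", "pkt2"), 3);
        // The last replica in the chain holds the same packets as the first.
        System.out.println(dns.get(2));
    }
}
```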
Block
– User data in HDFS is stored in files, and each file is divided into fixed-size blocks that are stored across DataNodes.
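A file is split into fixed-size blocks (128 MB by default in recent Hadoop versions; older versions used 64 MB), and the last block may be smaller than the block size. A small sketch of the block arithmetic, assuming the 128 MB default:

```java
// How HDFS splits a file into fixed-size blocks. Assumes the default
// 128 MB block size of recent Hadoop versions; the value is
// configurable per file via dfs.blocksize.
public class BlockSplitSketch {
    public static final long BLOCK_SIZE = 128L * 1024 * 1024;

    public static long blockCount(long fileSize) {
        if (fileSize == 0) return 0;
        return (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
    }

    // The final block only occupies as much space as it needs.
    public static long lastBlockSize(long fileSize) {
        long rem = fileSize % BLOCK_SIZE;
        return rem == 0 ? (fileSize == 0 ? 0 : BLOCK_SIZE) : rem;
    }

    public static void main(String[] args) {
        long size = 300L * 1024 * 1024; // a 300 MB file
        System.out.println(blockCount(size));                      // 3 blocks
        System.out.println(lastBlockSize(size) / (1024 * 1024));   // 44 MB tail block
    }
}
```

Note that, unlike a disk file system, a file smaller than one block does not occupy a full block's worth of storage.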
cat – Copies source paths to stdout.
    Syntax: hadoop fs -cat URI [URI …]
    Exit code: returns 0 on success and -1 on error.
    Eg: hadoop fs -cat hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2

copyToLocal – Similar to the get command, except that the destination is restricted to a local file reference.
    Syntax: hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>

count – Counts the number of directories, files, and bytes under the paths that match the specified file pattern. The output columns with -count are: DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME. The -h option shows sizes in human-readable format; the -v option displays a header line.
    Syntax: hadoop fs -count [-q] [-h] [-v] <paths>
    Exit code: returns 0 on success and -1 on error.
    Eg: hdfs dfs -count -q -h -v hdfs://nn1.example.com/file1

get – Copies files to the local file system.
    Syntax: hdfs dfs -get [-ignorecrc] [-crc] <src> <localdst>

mkdir – Creates a directory in an HDFS environment.
    Syntax: hdfs dfs -mkdir [-p] <paths>

mv – Moves a file from one directory to another within the HDFS file system.
    Eg: hadoop fs -mv /user/hadoop/sample1.txt /user/text/

rm – Removes a file from the HDFS file system. Only files can be removed by plain -rm; directories cannot be deleted without the -r (or -R) option.
    Syntax: hadoop fs -rm [-f] [-r|-R] [-skipTrash] URI
    Eg: hadoop fs -rm -r /user/test/sample.txt
Interface Description
Stringifier<T> – Offers two methods: one to convert an object to a string representation, and one to restore the object from that string representation.
Writable – A serializable object which implements a simple, efficient serialization protocol, based on DataInput and DataOutput.
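The Writable contract is two methods: write(DataOutput) and readFields(DataInput). The following is a self-contained sketch of that protocol using only java.io, with no Hadoop on the classpath; PointWritable is a made-up example type, not part of the real package:

```java
import java.io.*;

// Minimal illustration of the Writable serialization contract using
// only java.io. The real interface is org.apache.hadoop.io.Writable;
// PointWritable is a hypothetical example type.
public class WritableSketch {
    // Same two-method contract as org.apache.hadoop.io.Writable.
    public interface Writable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    public static class PointWritable implements Writable {
        public int x, y;
        public void write(DataOutput out) throws IOException {
            out.writeInt(x);
            out.writeInt(y);
        }
        public void readFields(DataInput in) throws IOException {
            x = in.readInt();
            y = in.readInt();
        }
    }

    // Round-trip: serialize to bytes, then restore into a fresh object.
    public static PointWritable roundTrip(PointWritable p) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        p.write(new DataOutputStream(buf));
        PointWritable copy = new PointWritable();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        return copy;
    }

    public static void main(String[] args) throws IOException {
        PointWritable p = new PointWritable();
        p.x = 3; p.y = 7;
        PointWritable copy = roundTrip(p);
        System.out.println(copy.x + "," + copy.y); // 3,7
    }
}
```

The fixed field order in write and readFields is what makes the format compact: no field names or type tags are written, only the raw values.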
Class Description
AbstractMapWritable – Abstract base class for MapWritable and SortedMapWritable. Unlike org.apache.nutch.crawl.MapWritable, this class allows creation of MapWritable<Writable, MapWritable>, so the CLASS_TO_ID and ID_TO_CLASS maps travel with the class instead of being static.
ArrayFile – A dense file-based mapping from integers to values.
ArrayPrimitiveWritable – This is a wrapper class.
ArrayWritable – A Writable for arrays containing instances of a class.
BinaryComparable – Interface supported by WritableComparable types supporting ordering/permutation by a representative set of bytes.
BloomMapFile – This class extends MapFile and provides very much the same functionality.
BooleanWritable – A WritableComparable for booleans.
BytesWritable – A byte sequence that is usable as a key or value.
ByteWritable – A WritableComparable for a single byte.
CompressedWritable – A base class for Writables which store themselves compressed and lazily inflate on field access.
DataOutputOutputStream – OutputStream implementation that wraps a DataOutput.
DefaultStringifier<T> – The default implementation of the Stringifier interface, which stringifies objects using base64 encoding of the serialized version of the objects.
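The DefaultStringifier idea can be illustrated with the standard library: serialize a value to bytes, then base64-encode those bytes so the result is a printable string that can later be decoded and restored. Here the "serialization" is simply UTF-8 encoding of a string to keep the sketch self-contained; the real class first runs the configured Hadoop serialization:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Sketch of the DefaultStringifier approach: bytes -> base64 string
// -> bytes. UTF-8 encoding stands in for real Hadoop serialization
// so that the example has no external dependencies.
public class StringifierSketch {
    public static String toStringRep(String value) {
        byte[] serialized = value.getBytes(StandardCharsets.UTF_8);
        return Base64.getEncoder().encodeToString(serialized);
    }

    public static String fromStringRep(String rep) {
        byte[] serialized = Base64.getDecoder().decode(rep);
        return new String(serialized, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String rep = toStringRep("hello");
        System.out.println(rep);                // aGVsbG8=
        System.out.println(fromStringRep(rep)); // hello
    }
}
```

Base64 matters here because the serialized bytes may contain arbitrary values; encoding them guarantees a safe, printable string (e.g. for storing an object in a job configuration).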
Classes of the org.apache.hadoop.io Package and their Descriptions
Class Description
DoubleWritable – Writable for Double values.
ElasticByteBufferPool – This is a simple ByteBufferPool which just creates ByteBuffers as needed.
EnumSetWritable<E extends Enum<E>> – A Writable wrapper for EnumSet.
FloatWritable – A WritableComparable for floats.
ObjectWritable – A polymorphic Writable that writes an instance with its class name.
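The polymorphism in ObjectWritable comes from writing the class name ahead of the value, so a reader can reconstruct an instance without knowing its type in advance. A toy version handling only int and String payloads (the real class supports arbitrary Writables, primitives, and arrays):

```java
import java.io.*;

// Toy version of the ObjectWritable idea: the class name travels with
// the data, letting the reader dispatch on type at read time.
public class PolymorphicWriteSketch {
    public static byte[] writeObject(Object value) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeUTF(value.getClass().getName()); // type tag first
        if (value instanceof Integer) out.writeInt((Integer) value);
        else if (value instanceof String) out.writeUTF((String) value);
        else throw new IOException("unsupported type");
        return buf.toByteArray();
    }

    public static Object readObject(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        String className = in.readUTF();
        if (className.equals("java.lang.Integer")) return in.readInt();
        if (className.equals("java.lang.String")) return in.readUTF();
        throw new IOException("unsupported type: " + className);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readObject(writeObject(42)));     // 42
        System.out.println(readObject(writeObject("hdfs"))); // hdfs
    }
}
```

The trade-off is overhead: carrying the class name with every record costs space, which is why the monomorphic types above (IntWritable, Text, etc.) are preferred when the type is fixed.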