UNIT – II
HDFS:
Hadoop comes with a distributed file system called HDFS. In HDFS, data is
distributed over several machines and replicated to ensure durability in the
face of failures and high availability to parallel applications.
It is cost effective as it uses commodity hardware. It is built around the
concepts of blocks, the name node, and data nodes.
HDFS Concepts
BLOCKS
NAME NODE
DATA NODE
SECONDARY NAME NODE
Blocks: A block is the minimum amount of data that HDFS can read or write.
HDFS blocks are 128 MB by default, and this is configurable. Files in HDFS are
broken into block-sized chunks, which are stored as independent units. Unlike
a regular file system, if a file in HDFS is smaller than the block size, it does
not occupy a full block's worth of storage; for example, a 5 MB file stored in
HDFS with a 128 MB block size takes only 5 MB of space. The HDFS block size
is large in order to minimize the cost of seeks.
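To make this concrete, here is a minimal sketch (not from the original notes) that reads the configured default block size through the Hadoop Java API, assuming a cluster configuration is available on the classpath:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeCheck {
        public static void main(String[] args) throws Exception {
            // dfs.blocksize controls the default block size
            // (128 MB unless overridden in hdfs-site.xml)
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            long blockSize = fs.getDefaultBlockSize(new Path("/"));
            System.out.println("Default block size: " + blockSize + " bytes");
        }
    }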
Name Node: HDFS works in a master-worker pattern where the name node
acts as the master. The name node is the controller and manager of HDFS, as
it knows the status and the metadata of all the files in HDFS; the metadata
includes file permissions, names, and the location of each block. The
metadata is small, so it is stored in the memory of the name node, allowing
faster access to it. Moreover, the HDFS cluster is accessed by multiple clients
concurrently, so all this information is handled by a single machine. File
system operations such as opening, closing, and renaming are executed by
the name node.
Data Node: Data nodes store and retrieve blocks when they are told to, by a
client or the name node. They report back to the name node periodically with
the list of blocks that they are storing. The data nodes, being commodity
hardware, also do the work of block creation, deletion, and replication as
directed by the name node.
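The name node's block metadata can be queried from a client, which illustrates the master-worker split described above. A minimal sketch, assuming a hypothetical file /data/sample.txt already exists in HDFS:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/data/sample.txt"); // hypothetical example path
            FileStatus status = fs.getFileStatus(file);
            // The name node answers this query from its in-memory metadata;
            // the returned hosts are the data nodes holding each block replica.
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation b : blocks) {
                System.out.println("offset " + b.getOffset()
                        + " -> hosts: " + String.join(",", b.getHosts()));
            }
        }
    }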
HDFS Storage
One of Hadoop’s key features is its implementation of a distributed file system.
A distributed file system offers a logical file system that is implemented over a
cluster of machines rather than on a single machine. Additional details on
HDFS are available on the Apache Hadoop website. Hadoop offers access to
HDFS through a command line interface.
Create new file - Click the New button in the HDFS tab, enter the name of
the file, and click the Create button in the prompt box that opens to
create a new HDFS file.
Upload files/folder - You can upload files or folders from your local file
system to HDFS by clicking the Upload button in the HDFS tab. In the
prompt, select file or folder, browse to the item, and click Upload.
Set Permission - Click the Permission option in the HDFS tab; a popup
window will be displayed. In the popup window, change the mode of
permission and click the Change button to set the permission for files in
HDFS.
Copy - Select the file or folder from the HDFS directory and click the Copy
button in the HDFS tab; in the prompt displayed, enter or browse for the
target directory and click the Copy button to copy the selected HDFS item.
Move - Select the file or folder from the HDFS directory and click the Move
button in the HDFS tab; in the prompt displayed, enter or browse for the
target directory and click the Move button to move the selected HDFS item.
Rename - Select the file or folder from the HDFS directory, click the
Rename button in the HDFS tab, enter the new name, and click Apply to
rename it in HDFS.
Delete - Select the file or folder from the HDFS directory and click Delete
to perform the delete operation.
Drag and Drop - Select a file or folder from the HDFS directory and
drag it to another folder in the HDFS directory or to Windows Explorer,
and vice versa.
HDFS File viewer - You can view the content of HDFS files directly
by double-clicking the file.
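The same create, upload, permission, copy, move, rename, and delete operations can also be performed programmatically through Hadoop's FileSystem API. A minimal sketch, with hypothetical paths chosen for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class HdfsFileOps {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path dir = new Path("/user/demo");            // hypothetical directory
            Path file = new Path("/user/demo/notes.txt"); // hypothetical file

            fs.mkdirs(dir);                                      // create a directory
            fs.create(file).close();                             // create an empty file
            fs.copyFromLocalFile(new Path("local.txt"), dir);    // upload a local file
            fs.setPermission(file, new FsPermission((short) 0644)); // set permission
            fs.rename(file, new Path("/user/demo/renamed.txt"));    // rename / move
            fs.delete(new Path("/user/demo/renamed.txt"), false);   // delete
        }
    }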
Data flow in MapReduce
Input reader
The input reader reads the incoming data and splits it into data blocks
of the appropriate size (64 MB to 128 MB). Each data block is
associated with a Map function.
Once the input reader has read the data, it generates the
corresponding key-value pairs. The input files reside in HDFS.
Map function
The map function processes the incoming key-value pairs and
generates the corresponding output key-value pairs. The map input
and output types may differ from each other.
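For example, the map function of the classic word-count job emits a (word, 1) pair per token; note how the input types (LongWritable offset, Text line) differ from the output types (Text, IntWritable). A minimal sketch:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Emit (token, 1) for every whitespace-separated token in the line.
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }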
Partition function
The partition function assigns the output of each Map function to the
appropriate reducer. It is given the key and value, and returns the
index of the reducer that should receive them.
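Hadoop's default partitioner hashes the key modulo the number of reducers; a sketch of an equivalent custom partitioner for the word-count types above:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Mirrors the default HashPartitioner: the reducer index is the
    // key's hash (made non-negative) modulo the number of reduce tasks.
    public class WordPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }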
Shuffling and Sorting
The data is shuffled between and within nodes so that it moves out
of the map phase and is ready for processing by the reduce function.
Sometimes, the shuffling of data can take considerable time.
Reduce function
The Reduce function is called once for each unique key. These keys
are presented in sorted order. The Reduce function iterates over the
values associated with each key and generates the corresponding
output.
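Continuing the word-count sketch, the reducer below is called once per unique word, with the shuffled values already grouped under the sorted key:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts,
                Context context) throws IOException, InterruptedException {
            // Sum the 1s emitted by the mappers for this word.
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }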
Output writer
Once the data has flowed through all of the above phases, the Output
writer executes. The role of the Output writer is to write the Reduce
output to stable storage.
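A driver class ties the phases together: the input format plays the input-reader role, and the output format plays the output-writer role. A minimal sketch using the hypothetical word-count classes above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountJob.class);
            job.setMapperClass(WordCountMapper.class);       // map phase
            job.setPartitionerClass(WordPartitioner.class);  // partition phase
            job.setReducerClass(WordCountReducer.class);     // reduce phase
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // Input reader: the default TextInputFormat splits HDFS input into
            // (offset, line) records; the output writer commits results to HDFS.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }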
Data Ingestion
Hadoop data ingestion is the beginning of your data pipeline
in a data lake. It means taking data from various siloed
databases and files and putting it into Hadoop.
Sqoop
Apache Sqoop (a portmanteau of "SQL-to-Hadoop") is an
open source tool that allows users to extract data from a
structured data store into Hadoop for further processing.
This processing can be done with MapReduce programs or
other higher-level tools such as Hive, Pig, or Spark.
Sqoop can automatically create Hive tables from data
imported from an RDBMS (Relational Database
Management System) table.
Sqoop can also be used to send data from Hadoop to a
relational database, which is useful for sending results
processed in Hadoop to an operational transaction
processing system.
Sqoop Export
The export tool exports a set of files from HDFS back to an RDBMS. The files given
as input to Sqoop contain records, which become rows in the target table. They are
read and parsed into a set of records using a user-specified delimiter.
gzip:
gzip is naturally supported by Hadoop. gzip is based on the DEFLATE algorithm,
which is a combination of LZ77 and Huffman Coding.
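As a sketch of how Hadoop exposes gzip, the snippet below compresses a local file through Hadoop's codec API; input.txt is a hypothetical file name:

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionOutputStream;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.util.ReflectionUtils;

    public class GzipCompress {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            CompressionCodec codec =
                    ReflectionUtils.newInstance(GzipCodec.class, conf);
            try (FileInputStream in = new FileInputStream("input.txt");
                 CompressionOutputStream out = codec.createOutputStream(
                         new FileOutputStream("input.txt.gz"))) {
                // DEFLATE (LZ77 + Huffman coding) happens inside the codec stream.
                IOUtils.copyBytes(in, out, 4096, false);
                out.finish(); // flush the gzip trailer
            }
        }
    }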
bzip2:
bzip2 is a freely available, patent-free, high-quality data compressor. It
typically compresses files to within 10% to 15% of the best available techniques (the
PPM family of statistical compressors), whilst being around twice as fast at
compression and six times faster at decompression.
LZO:
The LZO compression format is composed of many smaller (~256K) blocks of
compressed data, allowing jobs to be split along block boundaries. Moreover, it was
designed with speed in mind: it decompresses about twice as fast as gzip, meaning
it’s fast enough to keep up with hard drive read speeds. It doesn’t compress quite as
well as gzip — expect files that are on the order of 50% larger than their gzipped
version. But that is still 20-50% of the size of the files without any compression at all,
which means that IO-bound jobs complete the map phase about four times faster.
Snappy:
Snappy is a compression/decompression library. It does not aim for maximum
compression, or compatibility with any other compression library; instead, it aims for
very high speeds and reasonable compression. For instance, compared to the
fastest mode of zlib, Snappy is an order of magnitude faster for most inputs, but the
resulting compressed files are anywhere from 20% to 100% bigger. On a single core
of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or
more and decompresses at about 500 MB/sec or more. Snappy is widely used inside
Google, in everything from BigTable and MapReduce to its internal RPC systems.
Some tradeoffs:
All compression algorithms exhibit a space/time trade-off: faster compression and
decompression speeds usually come at the expense of smaller space savings. The
tools listed above typically give some control over this trade-off at compression
time by offering nine different options: -1 means optimize for speed and -9 means
optimize for space.
The different tools have very different compression characteristics. Gzip is a general
purpose compressor, and sits in the middle of the space/time trade-off. Bzip2
compresses more effectively than gzip, but is slower. Bzip2’s decompression speed
is faster than its compression speed, but it is still slower than the other formats. LZO
and Snappy, on the other hand, both optimize for speed and are around an order of
magnitude faster than gzip, but compress less effectively. Snappy is also
significantly faster than LZO for decompression.
What is Avro?
Apache Avro is a language-neutral data serialization system.
It was developed by Doug Cutting, the father of Hadoop.
Since Hadoop writable classes lack language portability, Avro
becomes quite helpful, as it deals with data formats that can
be processed by multiple languages. Avro is a preferred tool
to serialize data in Hadoop.
Avro has a schema-based system. A language-independent
schema is associated with its read and write operations. Avro
serializes the data together with its schema into a compact
binary format, which can be deserialized by any application.
Avro uses JSON format to declare the data structures.
Presently, it supports languages such as Java, C, C++, C#,
Python, and Ruby.
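To illustrate the JSON schema declaration, here is a minimal sketch in Java using Avro's generic API; the User record and its fields are hypothetical:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;

    public class AvroSchemaDemo {
        // Hypothetical schema: a "User" record with a string name and an int age.
        private static final String SCHEMA_JSON =
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}";

        public static void main(String[] args) {
            Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
            GenericRecord user = new GenericData.Record(schema);
            user.put("name", "Ada");
            user.put("age", 36);
            System.out.println(user); // {"name": "Ada", "age": 36}
        }
    }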
Features of Avro