UNIT 3 HDFS, Hadoop Environment Part 2


BIG DATA (KCS 061)

UNIT 3: HDFS (Hadoop Distributed File System), Hadoop Environment

Command Line Interface:

The Hadoop FS command line is a simple way to access and interface with HDFS.

HDFS can be manipulated through a Java API or through a command-line interface. All
commands for manipulating HDFS through Hadoop's command-line interface begin with
"hadoop", a space, and "fs"; this is the file system shell. The command name follows as an
argument to "hadoop fs", and each command name starts with a dash.

For example, the "ls" command for listing a directory is a common UNIX command and is
preceded with a dash. As on UNIX systems, ls can take a path as an argument. In this
example, the path is the current directory, represented by a single dot.
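
For instance, the listing just described would be issued as shown here; the single dot stands
for the current working directory (by default, the user's home directory in HDFS):

    hadoop fs -ls .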

As we saw for the "ls" command, the file system shell commands can take paths as
arguments. These paths can be expressed in the form of uniform resource identifiers or URIs.

The URI format consists of a scheme, an authority, and a path. Multiple schemes are
supported: the local file system has a scheme of "file", and HDFS has a scheme of "hdfs".

For example, let us say you wish to copy a file called "myfile.txt" from your local filesystem
to an HDFS file system on the localhost. You can do this by issuing the command shown
below. The cp command takes a URI for the source and a URI for the destination. The scheme
and the authority do not always need to be specified; instead, you may rely on their default
values, which are set in a file named core-site.xml in the conf directory of your Hadoop
installation.
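
A sketch of what that copy command looks like, with an illustrative local path and hostname,
is shown here; the first form spells out both URIs in full, while the second relies on the
default scheme and authority:

    hadoop fs -cp file:///home/user/myfile.txt hdfs://localhost/user/hadoop/myfile.txt
    hadoop fs -cp file:///home/user/myfile.txt /user/hadoop/myfile.txt

The default filesystem that makes the shorter form possible is configured in core-site.xml,
for example (in older Hadoop releases the property is named fs.default.name rather than
fs.defaultFS):

    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost/</value>
      </property>
    </configuration>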

Hadoop File System Interfaces:

Hadoop is an open-source software framework written in Java, along with some shell
scripting and C code, for performing computation over very large data sets. Hadoop is used
for batch/offline processing over a network of many machines forming a physical cluster.
The framework provides both distributed storage and distributed processing over the same
cluster. It is designed to run on cheaper systems, commonly known as commodity hardware,
where each machine contributes its local storage and computation power. Hadoop can work
with a variety of file systems, and HDFS is just one concrete implementation among them.
The Java abstract class org.apache.hadoop.fs.FileSystem represents a file system in Hadoop.
The main filesystems are listed below, each with its URI scheme and Java implementation
(all under org.apache.hadoop), followed by a description.

 Local (scheme: file; implementation: fs.LocalFileSystem): The Hadoop Local filesystem is
used for a locally connected disk with client-side checksumming. The local filesystem uses
RawLocalFileSystem with no checksums.

 HDFS (scheme: hdfs; implementation: hdfs.DistributedFileSystem): HDFS stands for Hadoop
Distributed File System, and it is designed to work efficiently with MapReduce.

 HFTP (scheme: hftp; implementation: hdfs.HftpFileSystem): The HFTP filesystem provides
read-only access to HDFS over HTTP; it has no connection with FTP. It is commonly used
with distcp to share data between HDFS clusters running different versions.

 HSFTP (scheme: hsftp; implementation: hdfs.HsftpFileSystem): The HSFTP filesystem
provides read-only access to HDFS over HTTPS. It also has no connection with FTP.

 HAR (scheme: har; implementation: fs.HarFileSystem): The HAR filesystem is mainly used to
reduce the memory usage of the NameNode by archiving files in HDFS. It is layered on
another filesystem for archiving purposes.

 KFS (CloudStore) (scheme: kfs; implementation: fs.kfs.KosmosFileSystem): CloudStore, or
KFS (Kosmos File System), is a filesystem written in C++. It is very similar to distributed
file systems such as HDFS and GFS (Google File System).

 FTP (scheme: ftp; implementation: fs.ftp.FTPFileSystem): A filesystem backed by an FTP
server.

 S3 (native) (scheme: s3n; implementation: fs.s3native.NativeS3FileSystem): A filesystem
backed by Amazon S3.

 S3 (block-based) (scheme: s3; implementation: fs.s3.S3FileSystem): A filesystem backed by
Amazon S3 that stores files in blocks (similar to HDFS) to overcome S3's 5 GB file size
limit.

Hadoop provides numerous interfaces to its various filesystems, and it generally uses the URI
scheme to pick the correct filesystem instance to communicate with. You can use any of these
filesystems with MapReduce when processing very large datasets, but distributed file systems
with data locality, such as HDFS and KFS (KosmosFileSystem), are preferable.
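
To illustrate how the URI scheme selects the filesystem instance, here is a minimal Java
sketch using the abstract org.apache.hadoop.fs.FileSystem class; the hostname and directory
are illustrative:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListHdfsDirectory {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // The "hdfs" scheme in this URI selects hdfs.DistributedFileSystem;
            // a "file" scheme would select the local filesystem instead.
            FileSystem fs = FileSystem.get(URI.create("hdfs://localhost/"), conf);
            for (FileStatus status : fs.listStatus(new Path("/user/hadoop"))) {
                System.out.println(status.getPath());
            }
        }
    }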

Data flow & data ingest with Flume and Sqoop:

Apache Hadoop is synonymous with big data for its cost-effectiveness and its attribute of
scalability for processing petabytes of data.

Data analysis using Hadoop is just half the battle won; getting data into the Hadoop cluster
plays a critical role in any big data deployment.

Data ingestion is important in any big data project because the volume of data is generally in
petabytes or exabytes. Sqoop and Flume are the two tools in Hadoop that are used to gather
data from different sources and load it into HDFS.

Sqoop is mostly used to extract structured data from databases like Teradata, Oracle, etc.,
while Flume is used to ingest data stored in various other systems and deals mostly with
unstructured data.
Big data systems are popular for processing huge amounts of unstructured data from multiple
data sources. The complexity of the big data system increases with each data source.

Most business domains have different data types, such as marketing data, genes in healthcare,
audio and video systems, telecom CDRs, and social media. All of these have diverse data
sources, and data from these sources is produced continuously and on a large scale.

The challenge is to leverage the available resources and manage the consistency of the data.
Data ingestion in Hadoop is complex because processing is done in batch, stream, or real
time, which increases the management overhead and the complexity of the data.

Some of the common challenges with data ingestion in Hadoop are parallel processing, data
quality, machine data on a higher scale of several gigabytes per minute, multiple source
ingestion, real-time ingestion and scalability.

Apache Sqoop and Apache Flume are two popular open-source ETL tools for Hadoop that help
organizations overcome the challenges encountered in data ingestion.

The major difference between Sqoop and Flume is that Sqoop is used for loading data from
relational databases into HDFS while Flume is used to capture a stream of moving data.

Apache Sqoop vs. Apache Flume:

 Apache Sqoop is designed to work with any relational database system (RDBMS) that has
basic JDBC connectivity; it can also import data from NoSQL databases like MongoDB and
Cassandra, and it allows data transfer to Apache Hive or HDFS. Apache Flume works well
with streaming data sources that are generated continuously in Hadoop environments, such as
log files.

 Apache Sqoop's loading is not driven by events, whereas Apache Flume's data loading is
completely event-driven.

 Apache Sqoop is an ideal fit when the data is available in Teradata, Oracle, MySQL,
PostgreSQL, or any other JDBC-compatible database. Apache Flume is the best choice for
moving bulk streaming data from sources like JMS or spooling directories.

 In Apache Sqoop, HDFS is the destination for imported data. In Apache Flume, data flows
to HDFS through channels.

 Apache Sqoop has a connector-based architecture: the connectors know how to connect to
the various data sources and fetch the data accordingly. Apache Flume has an agent-based
architecture: the code written in Flume is known as an agent, which is responsible for
fetching the data.

 Apache Sqoop connectors are designed specifically to work with structured data sources and
to fetch data from them alone. Apache Flume is specifically designed to fetch streaming data
such as tweets from Twitter or log files from web or application servers.

 Apache Sqoop is specifically used for parallel data transfers and data imports, as it copies
data quickly. Apache Flume is specifically used for collecting and aggregating data because
of its distributed, reliable nature and its highly available backup routes.
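
To make the contrast concrete, below is a sketch of a typical Sqoop import command followed
by a minimal Flume agent configuration; the database host, credentials, table name,
directories, and agent/component names are all illustrative:

    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username analyst -P \
      --table orders \
      --target-dir /user/hadoop/orders

    # flume-agent.properties: spool a local directory of log files into HDFS
    a1.sources  = r1
    a1.channels = c1
    a1.sinks    = k1
    a1.sources.r1.type     = spooldir
    a1.sources.r1.spoolDir = /var/log/incoming
    a1.channels.c1.type    = memory
    a1.sinks.k1.type       = hdfs
    a1.sinks.k1.hdfs.path  = hdfs://localhost/flume/events
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel    = c1

The Sqoop command pulls a whole table in parallel into HDFS, while the Flume agent
continuously moves events from its source, through the channel, to the HDFS sink.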

Hadoop archives:

A Hadoop archive (HAR) is a facility that packs many small files into a single, more compact
HDFS file to avoid wasting NameNode memory. The NameNode stores the metadata information
for the HDFS data, so if a 1 GB file is broken into 1000 small pieces, the NameNode has to
store metadata about all 1000 small files. In that manner, NameNode memory is wasted on
storing and managing a lot of metadata.

A HAR is created from a collection of files, and the archiving tool runs a MapReduce job
that processes the input files in parallel to create the archive file.
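
From the command line, creating and then inspecting an archive looks roughly like this
(archive name and paths are illustrative); the archived files are addressed through the har
scheme described earlier:

    hadoop archive -archiveName files.har -p /user/hadoop/input /user/hadoop/archives
    hadoop fs -ls har:///user/hadoop/archives/files.har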

Hadoop I/O: compression, serialization


Compressing data means reducing the size of the data by removing redundant or irrelevant
information.
Using data compression in the Hadoop framework is usually a tradeoff between I/O and speed
of computation. When compression is enabled, it reduces I/O and network usage. Compression
happens when MapReduce reads the data or when it writes it out. When a MapReduce job is
run against compressed data, CPU utilization generally increases, as the data must be
decompressed before the files can be processed by the map and reduce tasks.
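
As one common example of this tradeoff, compression of a job's output can be switched on
through the standard MapReduce API; the sketch below uses the gzip codec, and the job name
is illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CompressedOutputConfig {
        public static Job configure() throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "compressed-output-example");
            // Compress the job's output files with the gzip codec;
            // mapper, reducer, and input/output paths would be set as usual.
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
            return job;
        }
    }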

Serializing data means converting the data into a format that can be transmitted or stored.
Serialization is the process of translating data structures or object state into a binary or
textual form, either to transport the data over a network or to store it on some persistent
storage. Once the data has been transported over the network or retrieved from persistent
storage, it needs to be deserialized again. Serialization is also termed marshalling, and
deserialization is termed unmarshalling.
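
Hadoop's native serialization for MapReduce keys and values is based on the Writable
interface; the sketch below serializes an IntWritable into an in-memory buffer and
deserializes it again, mirroring the marshalling/unmarshalling described above:

    import org.apache.hadoop.io.DataInputBuffer;
    import org.apache.hadoop.io.DataOutputBuffer;
    import org.apache.hadoop.io.IntWritable;

    public class WritableRoundTrip {
        public static void main(String[] args) throws Exception {
            IntWritable original = new IntWritable(163);

            // Serialize (marshal) the value into an in-memory buffer.
            DataOutputBuffer out = new DataOutputBuffer();
            original.write(out);

            // Deserialize (unmarshal) it back into a fresh object.
            DataInputBuffer in = new DataInputBuffer();
            in.reset(out.getData(), out.getLength());
            IntWritable copy = new IntWritable();
            copy.readFields(in);

            System.out.println(copy.get());   // prints 163
        }
    }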

Both compression and serialization can improve the performance, scalability, and cost-efficiency
of Hadoop applications.
However, not all compression and serialization methods are suitable for Hadoop. We need to
consider factors such as the type of data, the level of compression, the compression ratio, the
decompression speed, the compatibility with Hadoop tools, and the impact on data quality.

Avro and file-based data structures


Apache Avro is a language-neutral data serialization system. It was developed by Doug Cutting,
the father of Hadoop. Since Hadoop Writable classes lack language portability, Avro becomes
quite helpful, as it deals with data formats that can be processed by multiple languages. Avro
is a preferred tool for serializing data in Hadoop.

Avro has a schema-based system. A language-independent schema is associated with its read
and write operations. Avro serializes the data which has a built-in schema. Avro serializes the
data into a compact binary format, which can be deserialized by any application.

Avro uses JSON format to declare the data structures. Presently, it supports languages such as
Java, C, C++, C#, Python, and Ruby.

Avro depends heavily on its schema. Data can be written and read without generating code in
advance. Avro serializes fast, and the resulting serialized data is smaller in size. The schema
is stored along with the Avro data in a file for any further processing.

In RPC, the client and the server exchange schemas during the connection. This exchange helps
in reconciling same-named fields, missing fields, extra fields, and so on.

Avro schemas are defined in JSON, which simplifies implementation in languages that already
have JSON libraries.
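
For illustration, a minimal Avro schema for a hypothetical "User" record, declared in JSON,
could look like the following (the record name and fields are made up for this example):

    {
      "type": "record",
      "name": "User",
      "namespace": "example.avro",
      "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["int", "null"]}
      ]
    }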

Features of Avro

Listed below are some of the prominent features of Avro:

 Avro is a language-neutral data serialization system.
 It can be processed by many languages (currently C, C++, C#, Java, Python, and Ruby).
 Avro creates a binary, structured format that is both compressible and splittable. Hence it
can be used efficiently as the input to Hadoop MapReduce jobs.
 Avro provides rich data structures. For example, you can create a record that contains
an array, an enumerated type, and a sub record. These datatypes can be created in any
language, can be processed in Hadoop, and the results can be fed to a third language.
 Avro schemas, defined in JSON, facilitate implementation in the languages that already
have JSON libraries.
 Avro creates a self-describing file named Avro Data File, in which it stores data along
with its schema in the metadata section (a sketch follows this list).
 Avro is also used in Remote Procedure Calls (RPCs). During RPC, client and server
exchange schemas in the connection handshake.
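
As a brief sketch of the self-describing Avro Data File mentioned above, the following Java
example writes one record using the Avro generic API and the hypothetical "User" schema shown
earlier (file names and field values are illustrative):

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.DatumWriter;

    public class WriteAvroDataFile {
        public static void main(String[] args) throws Exception {
            // The illustrative "User" schema, declared inline as JSON.
            String schemaJson =
                "{\"type\":\"record\",\"name\":\"User\",\"namespace\":\"example.avro\","
              + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"},"
              + "{\"name\":\"favorite_number\",\"type\":[\"int\",\"null\"]}]}";
            Schema schema = new Schema.Parser().parse(schemaJson);

            GenericRecord user = new GenericData.Record(schema);
            user.put("name", "Alice");
            user.put("favorite_number", 7);

            // The schema is embedded in the file's metadata section,
            // which is what makes the data file self-describing.
            DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
            try (DataFileWriter<GenericRecord> writer = new DataFileWriter<>(datumWriter)) {
                writer.create(schema, new File("users.avro"));
                writer.append(user);
            }
        }
    }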
