UNIT 3 HDFS, Hadoop Environment Part 2
The Hadoop FS command line is a simple way to access and interface with HDFS.
HDFS can be manipulated through a Java API or through a command-line interface. All
commands for manipulating HDFS through Hadoop's command-line interface begin with
"hadoop", a space, and "fs"; this is the file system shell. It is followed by the command
name as an argument to "hadoop fs", and these command names start with a dash.
For example, the "ls" command for listing a directory is a common UNIX command and is
preceded by a dash. As on UNIX systems, ls can take a path as an argument; in this
example, the path is the current directory, represented by a single dot.
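As a minimal illustration, that listing command is issued as:

  hadoop fs -ls .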
As we saw for the "ls" command, the file system shell commands can take paths as
arguments. These paths can be expressed in the form of uniform resource identifiers or URIs.
The URI format consists of a scheme, an authority, and a path. Multiple schemes are
supported: the local file system has the scheme "file", and HDFS has the scheme "hdfs".
For example, let us say you wish to copy a file called "myfile.txt" from your local filesystem
to an HDFS file system on the localhost. You can do this by issuing the command shown.
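A sketch of that command, assuming myfile.txt sits in the hypothetical local directory
/home/me and a hypothetical HDFS target directory of /user/me, is:

  hadoop fs -cp file:///home/me/myfile.txt hdfs://localhost/user/me/myfile.txt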
The cp command takes a URI for the source and a URI for the destination. The scheme and
the authority do not always need to be specified. Instead you may rely on their default values.
These defaults can be overridden by specifying them in a file named core-site.xml in the conf
directory of your Hadoop installation.
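As a hedged sketch of such a default, the fs.defaultFS property (fs.default.name in older
releases) names the default filesystem; the single-node address below is an assumption:

  <configuration>
    <property>
      <name>fs.defaultFS</name>
      <!-- hypothetical single-node NameNode address -->
      <value>hdfs://localhost:9000</value>
    </property>
  </configuration>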
Hadoop is an open-source software framework, written in Java along with some shell
scripting and C code, for performing computation over very large data. Hadoop is utilized for
batch/offline processing over a network of many machines forming a physical cluster. The
framework is designed to provide both distributed storage and processing over the same
cluster. It is designed to work on cheaper systems, commonly known as commodity
hardware, where each system offers its local storage and computation power. Hadoop can
work with various file systems, and HDFS is just one concrete implementation among them.
The Java abstract class org.apache.hadoop.fs.FileSystem represents a file system in Hadoop.
Two representative entries of Hadoop's filesystem table:

  Filesystem | URI scheme | Java implementation (all under org.apache.hadoop) | Description
  Local      | file       | fs.LocalFileSystem         | A filesystem for a locally connected disk with client-side checksums
  HDFS       | hdfs       | hdfs.DistributedFileSystem | Hadoop's distributed filesystem
Hadoop provides many interfaces to its various filesystems, and it generally uses the URI
scheme to pick the correct filesystem instance to communicate with. You can use any of these
filesystems for working with MapReduce when processing very large datasets, but distributed
file systems with data-locality features, such as HDFS and KFS (KosmosFileSystem), are
preferable.
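For example, the same shell command can address different filesystems purely through the
URI scheme (the localhost authority below is an assumption):

  hadoop fs -ls file:///tmp
  hadoop fs -ls hdfs://localhost/tmp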
Apache Hadoop is synonymous with big data because of its cost-effectiveness and its
scalability for processing petabytes of data.
Data analysis using Hadoop is only half the battle; getting data into the Hadoop cluster plays
a critical role in any big data deployment.
Data ingestion is important in any big data project because the volume of data is generally in
petabytes or exabytes. Sqoop and Flume are the two tools in the Hadoop ecosystem that are
used to gather data from different sources and load it into HDFS.
Sqoop is mostly used to extract structured data from databases like Teradata, Oracle, etc.,
while Flume is used to source data stored in various places, such as log files, and deals
mostly with unstructured data.
Big data systems are popular for processing huge amounts of unstructured data from multiple
data sources. The complexity of the big data system increases with each data source.
Most business domains deal with different data types, such as marketing data, genes in
healthcare, audio and video systems, telecom CDRs, and social media. All of these have
diverse data sources, and data from these sources is produced continuously and on a large
scale.
The challenge is to leverage the resources available and manage the consistency of the data.
Data ingestion is complex in Hadoop because processing may be done in batch, stream, or
real time, which increases the management overhead and the complexity of the data.
Some of the common challenges with data ingestion in Hadoop are parallel processing, data
quality, machine data arriving at a scale of several gigabytes per minute, multiple-source
ingestion, real-time ingestion, and scalability.
Apache Sqoop and Apache Flume are two popular open-source ETL tools for Hadoop that
help organizations overcome the challenges encountered in data ingestion.
The major difference between Sqoop and Flume is that Sqoop is used for loading data from
relational databases into HDFS while Flume is used to capture a stream of moving data.
  Apache Sqoop | Apache Flume
  Connector-based architecture: the connectors know how to connect to the various data sources and fetch data accordingly. | Agent-based architecture: code written in Flume is known as an agent, which is responsible for fetching the data.
  Connectors are designed specifically to work with structured data sources and to fetch data from them alone. | Specifically designed to fetch streaming data, such as tweets from Twitter or log files from web servers or application servers.
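As a minimal sketch of Sqoop's side of this split (the JDBC URL, credentials, table name,
and target directory are hypothetical placeholders), loading a relational table into HDFS
looks like:

  sqoop import \
    --connect jdbc:mysql://dbhost/sales \
    --username dbuser --password dbpass \
    --table customers \
    --target-dir /user/me/customers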
Hadoop archives:
A Hadoop archive (HAR) is a facility that packs small files into compact HDFS blocks to
avoid wasting NameNode memory. The NameNode stores the metadata of all HDFS data, so
if a 1 GB file is broken into 1,000 small pieces, the NameNode has to store metadata about
all 1,000 small files. In that manner, NameNode memory is wasted on storing and managing
a lot of metadata.
A HAR is created from a collection of files, and the archiving tool runs a MapReduce job
that processes the input files in parallel to create the archive file.
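As a sketch (the archive name and paths are hypothetical), the archiving tool is invoked as
shown below, and the resulting archive can then be listed through the har scheme:

  hadoop archive -archiveName files.har -p /user/me/input /user/me/output
  hadoop fs -ls har:///user/me/output/files.har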
Serializing data means converting the data into a format that can be transmitted or stored.
Serialization is the process of translating data structures or object state into binary or textual
form to transport the data over a network or to store it on some persistent storage. Once the
data is transported over the network or retrieved from persistent storage, it needs to be
deserialized again. Serialization is also termed marshalling, and deserialization unmarshalling.
Both compression and serialization can improve the performance, scalability, and cost-efficiency
of Hadoop applications.
However, not all compression and serialization methods are suitable for Hadoop. We need to
consider factors such as the type of data, the level of compression, the compression ratio, the
decompression speed, the compatibility with Hadoop tools, and the impact on data quality.
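As one hedged example of weighing these factors, intermediate map output is often
compressed with a fast codec; the Snappy choice below is an assumption, not a universal
recommendation:

  <!-- mapred-site.xml: compress map output with a fast codec (assumed choice) -->
  <property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>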
Avro has a schema-based system: a language-independent schema is associated with its read
and write operations. Avro serializes data together with its built-in schema into a compact
binary format, which can be deserialized by any application.
Avro uses JSON format to declare the data structures. Presently, it supports languages such as
Java, C, C++, C#, Python, and Ruby.
Avro depends heavily on its schema. Because the schema is stored along with the Avro data
in a file, the data can later be read and processed with no prior knowledge of the schema.
Avro serializes quickly, and the resulting serialized data is smaller in size.
In RPC, the client and the server exchange schemas during the connection handshake. This
exchange helps resolve differences between the two schemas, such as same-named fields,
missing fields, and extra fields.
Avro schemas are defined in JSON, which simplifies implementation in languages that
already have JSON libraries.
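As a minimal sketch, a hypothetical record schema declared in JSON might look like:

  {
    "type": "record",
    "name": "Employee",
    "namespace": "example.avro",
    "fields": [
      {"name": "name", "type": "string"},
      {"name": "id", "type": "int"}
    ]
  }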
Features of Avro