Essential Hadoop Tools: Module - 2 Session - 2


Module – 2

Session – 2

ESSENTIAL HADOOP TOOLS


Sqoop Architecture:
All existing database management systems are designed with the SQL standard in mind. However, each DBMS differs in its SQL dialect to some extent, and these differences pose challenges when transferring data across systems. Sqoop connectors are the components that overcome these challenges: data transfer between Hadoop and an external storage system is made possible by Sqoop's connectors.
Sqoop has connectors for working with a range of popular relational databases, including MySQL,
PostgreSQL, Oracle, SQL Server, and DB2. Each of these connectors knows how to interact with its
associated DBMS. There is also a generic JDBC connector for connecting to any database that supports
Java's JDBC protocol. In addition, Sqoop provides optimized MySQL and PostgreSQL connectors that use database-specific APIs to perform bulk transfers efficiently.
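
As a hedged illustration of how a connector comes into play, the sketch below imports a single table using the MySQL JDBC connect string; the host, database, table, and credentials are hypothetical placeholders.

# Hypothetical example: import the "orders" table from a MySQL database into HDFS.
# Sqoop selects the MySQL connector based on the jdbc:mysql:// connect string.
sqoop import \
  --connect jdbc:mysql://dbhost.example.com/salesdb \
  --username sqoop_user -P \
  --table orders \
  --target-dir /user/hadoop/orders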

Figure: Sqoop Architecture


In addition, Sqoop has various third-party connectors for data stores, ranging from enterprise data warehouses (including Netezza, Teradata, and Oracle) to NoSQL stores (such as Couchbase). However, these connectors do not come with the Sqoop bundle; they need to be downloaded separately and can be added easily to an existing Sqoop installation.
Why do we need Sqoop?
Analytical processing using Hadoop requires loading huge amounts of data from diverse sources into Hadoop clusters. This process of bulk data loading into Hadoop from heterogeneous sources, and then processing it, comes with a certain set of challenges. Maintaining data consistency and ensuring efficient utilization of resources are some of the factors to consider before selecting the right approach for data loading.
The common large-object types handled by Sqoop are BLOB and CLOB. If an object is less than 16 MB, it is stored inline with the rest of the data. Larger objects are temporarily stored in a subdirectory with the name _lob and are then materialized in memory for further processing. If the LOB limit is set to zero (0), large objects are always placed in external storage.
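
As a hedged sketch of this behaviour, the import below uses the --inline-lob-limit argument of Sqoop version 1; the connection details and table are hypothetical.

# Hypothetical example: keep LOBs up to 16 MB (16777216 bytes) inline with the record data;
# anything larger goes to external storage (a limit of 0 would externalize all large objects).
sqoop import \
  --connect jdbc:mysql://dbhost.example.com/docsdb \
  --username sqoop_user -P \
  --table documents \
  --target-dir /user/hadoop/documents \
  --inline-lob-limit 16777216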
Sqoop can import and export data from a database table based on a WHERE clause. The syntax uses the --where option of the import command, as shown in the example below.

Example:
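
The following is a minimal sketch, assuming a hypothetical MySQL database and an employees table; the --where option restricts the import to rows matching the given condition.

# Hypothetical example: import only the rows of "employees" that satisfy the WHERE condition.
sqoop import \
  --connect jdbc:mysql://dbhost.example.com/hrdb \
  --username sqoop_user -P \
  --table employees \
  --where "dept = 'sales' AND hire_date >= '2020-01-01'" \
  --target-dir /user/hadoop/employees_sales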

Sqoop supports importing data into the following services (an example import into Hive is sketched after the list):


 HDFS
 Hive
 HBase
 HCatalog
 Accumulo
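
As a hedged illustration of importing into one of these services, the sketch below imports a table directly into Hive using --hive-import; the database, table, and credentials are hypothetical.

# Hypothetical example: import a table and load it straight into a Hive table (Sqoop version 1).
sqoop import \
  --connect jdbc:mysql://dbhost.example.com/salesdb \
  --username sqoop_user -P \
  --table customers \
  --hive-import \
  --hive-table customers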
Sqoop needs a connector in order to connect to the different relational databases. Almost all database vendors make a JDBC driver available that is specific to their database, and Sqoop needs that JDBC driver for further interaction. A JDBC driver alone is not enough, however: Sqoop requires both the JDBC driver and a connector to connect to a database.
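
A hedged sketch of using the generic JDBC connector with an explicitly named driver class, for a database without a dedicated connector; the driver class, URL, and credentials below are placeholders for whatever the vendor ships.

# Hypothetical example: list tables through the generic JDBC connector by naming the driver class.
sqoop list-tables \
  --connect "jdbc:exampledb://dbhost.example.com:1521/inventory" \
  --driver com.example.jdbc.ExampleDriver \
  --username sqoop_user -P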
Sqoop command to control the number of mappers
We can control the number of mappers by passing the --num-mappers parameter to the sqoop command. The --num-mappers argument controls the number of map tasks, which is the degree of parallelism used. Start with a small number of map tasks and increase it gradually, because choosing too high a number of mappers may degrade performance on the database side.
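
A minimal sketch of this parallelism control, assuming a hypothetical table with a numeric order_id column suitable for splitting the work.

# Hypothetical example: run the import with 4 parallel map tasks, splitting rows on order_id.
sqoop import \
  --connect jdbc:mysql://dbhost.example.com/salesdb \
  --username sqoop_user -P \
  --table orders \
  --split-by order_id \
  --num-mappers 4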
Sqoop metastore
The Sqoop metastore is a tool for hosting a shared metadata repository. Multiple, possibly remote, users can define and execute saved jobs that are stored in the metastore. End users are configured to connect to the metastore in sqoop-site.xml or with the --meta-connect argument.
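
A hedged sketch of defining a saved job in a shared metastore and executing it later; the metastore host, port, and job name are hypothetical, and the HSQLDB-style URL follows the usual sqoop-metastore layout.

# Hypothetical example: create a saved job in the shared metastore, then run it from any client.
sqoop job \
  --meta-connect jdbc:hsqldb:hsql://metastore-host.example.com:16000/sqoop \
  --create daily_orders_import \
  -- import \
  --connect jdbc:mysql://dbhost.example.com/salesdb \
  --username sqoop_user -P \
  --table orders \
  --target-dir /user/hadoop/orders

sqoop job \
  --meta-connect jdbc:hsqldb:hsql://metastore-host.example.com:16000/sqoop \
  --exec daily_orders_import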

The purpose and usage of Sqoop-merge is:


This tool combines two datasets: entries from a newer dataset overwrite entries of an older dataset, preserving only the newest version of the records between the two datasets.
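
A hedged sketch of such a merge; all paths, the merge key, and the record class are hypothetical, and --jar-file/--class-name refer to the record class generated by an earlier import or codegen run.

# Hypothetical example: overlay a fresh incremental import onto an older one, keeping the newest row per "id".
sqoop merge \
  --new-data /user/hadoop/orders_new \
  --onto /user/hadoop/orders_old \
  --target-dir /user/hadoop/orders_merged \
  --merge-key id \
  --jar-file /tmp/sqoop-gen/orders.jar \
  --class-name orders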

Major Issues:

1. Data load using Scripts: The traditional approach of using scripts to load data is not suitable for
bulk data load into Hadoop; this approach is inefficient and very time-consuming.
2. Direct access to external data via MapReduce applications: Providing map-reduce applications with direct access to data residing in external systems (without loading it into Hadoop) complicates these applications, so this approach is not feasible.
3. In addition to having the ability to work with enormous data, Hadoop can work with data in several
different forms. So, to load such heterogeneous data into Hadoop, different tools have been
developed. Sqoop and Flume are two such data loading tools.
Next, we look at the changes between Sqoop versions and then at acquiring data streams with Apache Flume.
Apache Sqoop Version Changes:

Sqoop Version 1 uses specialized connectors to access external systems. These connectors are often
optimized for various RDBMSs or for systems that do not support JDBC. Connectors are plug-in
components based on Sqoop’s extension framework and can be added to any existing Sqoop installation.
Once a connector is installed, Sqoop can use it to efficiently transfer data between Hadoop and the external
store supported by the connector. By default, Sqoop version 1 includes connectors for popular databases
such as MySQL, PostgreSQL, Oracle, SQL Server, and DB2. It also supports direct transfer from an RDBMS to HBase or Hive.

In contrast, to streamline the Sqoop input methods, Sqoop version 2 no longer supports specialized
connectors or direct import into HBase or Hive. All imports and exports are done through the JDBC
interface. Table 7.2 summarizes the changes from version 1 to version 2. Due to these changes, any new
development should be done with Sqoop version 2.
Table 7.2: Sqoop version 1 vs. Sqoop version 2

Feature: Connectors for all major RDBMSs
  Sqoop Version 1: Supported.
  Sqoop Version 2: Not supported. Use the generic JDBC connector.

Feature: Kerberos security integration
  Sqoop Version 1: Supported.
  Sqoop Version 2: Not supported.

Feature: Data transfer from RDBMS to Hive or HBase
  Sqoop Version 1: Supported.
  Sqoop Version 2: Not supported. First import data from the RDBMS into HDFS, then load the data into Hive or HBase manually.

Feature: Data transfer from Hive or HBase to RDBMS
  Sqoop Version 1: Not supported. First export data from Hive or HBase into HDFS, then use Sqoop for export.
  Sqoop Version 2: Not supported. First export data from Hive or HBase into HDFS, then use Sqoop for export.
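
As a hedged sketch of the version 2 workaround summarized above, the commands below first land the data in HDFS and then load it into Hive manually; the paths, database, and table names are hypothetical, and the Hive table is assumed to exist with a matching schema.

# Hypothetical two-step workflow: stage the data in HDFS with Sqoop, then load it into Hive by hand.
sqoop import \
  --connect jdbc:mysql://dbhost.example.com/salesdb \
  --username sqoop_user -P \
  --table customers \
  --target-dir /user/hadoop/staging/customers

hive -e "LOAD DATA INPATH '/user/hadoop/staging/customers' INTO TABLE customers;"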

Using Apache Flume to Acquire Data Streams:

Apache Flume is an independent agent designed to collect, transport, and store data into HDFS. Often data
transport involves a number of Flume agents that may traverse a series of machines and locations. Flume is
often used for log files, social media-generated data, email messages, and just about any continuous data
source.
Apache Flume is a data ingestion mechanism responsible for collecting and transporting huge amounts of data, such as events and log files, from several sources to one central data store. It is designed to copy log data or streaming data from various web servers to HDFS.
Apache Flume supports several sources as follows:
 ‘Tail’: Data is piped from local files and written into HDFS via Flume, somewhat like the Unix ‘tail’ command.
 System logs
 Apache log4j: enables Java applications to write events to files in HDFS via Flume

Features of Flume:
Before going further, let’s look at the features of Flume:
 Log data from different web servers is ingested by Flume into HDFS and HBase very efficiently.
Along with that, huge volumes of event data from social networking sites can also be retrieved.
 Data can be retrieved from multiple servers immediately into Hadoop by using Flume.
 A wide range of source and destination types is supported by Flume.
 Based on streaming data flows, Flume has a flexible design. This design stands out to be robust
and fault-tolerant with different recovery mechanisms.
 Data is carried between sources and sinks by Apache Flume which can either be event-driven or
can be scheduled.
Flume Architecture:
Refer to the image below for a better understanding of the Flume architecture. In the Flume architecture, there are data generators that generate data. The generated data is collected by Flume agents. The data collector is another agent that collects data from the various agents; the aggregated data is then pushed to a centralized store, i.e., HDFS.

Figure: Flume agent with source, channel, and sink


A Flume agent must have all three of these components defined. A Flume agent can have several sources,
channels, and sinks. Sources can write to multiple channels, but a sink can take data from only a single
channel. Data written to a channel remain in the channel until a sink removes the data. By default, the data in
a channel are kept in memory but may be optionally stored on disk to prevent data loss in the event of
network failure.
Let’s now talk about each of the components present in the Flume architecture; a minimal agent configuration is sketched after this list:
 Flume Events
The basic unit of data transported inside Flume is called an Event. Generally, it contains a payload of bytes (a byte array) that can be transported from the source to the destination, accompanied by optional headers.
 Flume Agents
In Apache Flume, an independent daemon process (a JVM) is what we call an agent. It receives events from clients or other agents and forwards them to its next destination, which is a sink or another agent. Note that a Flume deployment can have more than one agent.
o Flume Source: A Flume source receives data from the data generators and transfers it to one or more channels as Flume events. Apache Flume supports various types of sources, and each source receives events from a specified data generator.
o Flume Channel: A Flume channel is a transient store that receives events from the source and buffers them until they are consumed by sinks. It acts as a bridge between the sources and the sinks in Flume. Channels can work with any number of sources and sinks, and they are fully transactional.
o Flume Sink: To store data into centralized stores like HBase and HDFS, we use the Flume sink component. It consumes events from the channels and delivers them to the destination. The sink's destination might be another agent or the central store.
 Flume Clients
The components that generate events and then send them to one or more agents are what we call Flume clients.
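
To tie these components together, here is a minimal sketch of an agent configuration and the command that starts it, assuming a hypothetical agent named a1 that tails a local log file (the 'tail'-style source mentioned earlier) into HDFS; the file paths, HDFS URL, and agent name are placeholders.

# Hypothetical example: a one-source/one-channel/one-sink agent configuration, then start the agent.
cat > /tmp/tail-agent.conf <<'EOF'
# Name the source, channel, and sink for agent "a1"
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: tail a local log file (exec source)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/example/app.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory until the sink drains them
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: deliver events to HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode.example.com:8020/flume/app-logs
EOF

# Start the agent with this configuration file
flume-ng agent --name a1 --conf /etc/flume/conf --conf-file /tmp/tail-agent.conf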
Now that we have seen the Flume architecture in depth, let's look at the advantages of Flume as well.

Advantages of Flume:
Here are the advantages of using Flume:
 Using Apache Flume we can store the data into any of the centralized stores (HBase, HDFS).
 When the rate of incoming data exceeds the rate at which data can be written to the destination,
Flume acts as a mediator between data producers and the centralized stores and provides a steady
flow of data between them.
 Flume provides the feature of contextual routing.
 The transactions in Flume are channel-based where two transactions (one sender and one receiver)
are maintained for each message. It guarantees reliable message delivery.
 Flume is reliable, fault-tolerant, scalable, manageable, and customizable.
Now we come to the end of this session on Flume. We learned about Apache Flume in depth and saw the architecture of Flume.
