Essential Hadoop Tools: Module - 2 Session - 2
Major Issues:
1. Data load using scripts: The traditional approach of using scripts to load data is not suitable for
bulk data loads into Hadoop; it is inefficient and very time-consuming.
2. Direct access to external data via MapReduce applications: Giving MapReduce applications direct
access to data residing in external systems (without loading it into Hadoop) complicates those
applications, so this approach is not feasible.
3. In addition to handling enormous volumes of data, Hadoop can work with data in several different
forms. To load such heterogeneous data into Hadoop, different tools have been developed; Sqoop and
Flume are two such data-loading tools (a Sqoop import sketch follows this list).
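As an illustration of how Sqoop handles the bulk loads that scripts handle poorly, here is a minimal
sketch of a parallel table import. The JDBC URL, database, credentials, table name, and target
directory are hypothetical placeholders, not values from this course:

    # Bulk-import an entire RDBMS table into HDFS using 4 parallel map tasks.
    # Connection string, table, and paths are illustrative only.
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/shopdb \
      --username dbuser -P \
      --table orders \
      --target-dir /user/hadoop/orders \
      --num-mappers 4

Sqoop splits the table across the map tasks (by primary key, by default), which is what makes the
load fast compared with a single-threaded script.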
Next in this Sqoop tutorial, we will learn about the differences between Sqoop, Flume, and HDFS.
Apache Sqoop Version Changes:
Sqoop Version 1 uses specialized connectors to access external systems. These connectors are often
optimized for various RDBMSs or for systems that do not support JDBC. Connectors are plug-in
components based on Sqoop’s extension framework and can be added to any existing Sqoop installation.
Once a connector is installed, Sqoop can use it to efficiently transfer data between Hadoop and the external
store supported by the connector. By default, Sqoop version 1 includes connectors for popular databases
such as MySQL, PostgreSQL, Oracle, SQL Server, and DB2. It also supports direct transfer from an
RDBMS into HBase or Hive.
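For example, a Sqoop version 1 import can load a table straight into Hive with the --hive-import
option. This is a hedged sketch; the host, database, credentials, and table names are assumptions:

    # Sqoop 1: import an RDBMS table directly into a Hive table.
    # Host, database, credentials, and table names are placeholders.
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/shopdb \
      --username dbuser -P \
      --table customers \
      --hive-import \
      --hive-table customers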
In contrast, to streamline the Sqoop input methods, Sqoop version 2 no longer supports specialized
connectors or direct import into HBase or Hive. All imports and exports are done through the JDBC
interface. Table 7.2 summarizes the changes from version 1 to version 2. Due to these changes, any new
development should be done with Sqoop version 2.
Table 7.2: Changes from Sqoop version 1 to Sqoop version 2

Feature: Connectors for all major RDBMSs
    Sqoop Version 1: Supported.
    Sqoop Version 2: Not supported; use the generic JDBC connector.

Feature: Kerberos security integration
    Sqoop Version 1: Supported.
    Sqoop Version 2: Not supported.

Feature: Data transfer from RDBMS to Hive or HBase
    Sqoop Version 1: Supported.
    Sqoop Version 2: Not supported. First import the data from the RDBMS into HDFS, then load it
    into Hive or HBase manually.

Feature: Data transfer from Hive or HBase to RDBMS
    Sqoop Version 1: Not supported. First export the data from Hive or HBase into HDFS, then use
    Sqoop for export.
    Sqoop Version 2: Not supported. First export the data from Hive or HBase into HDFS, then use
    Sqoop for export.
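To make the two-step import workflow from the table concrete, here is a hedged sketch of a Sqoop
version 2 style import into HDFS followed by a manual load into Hive. The paths and table names are
assumptions, and the Hive table is assumed to already exist with a matching schema:

    # Step 1: import the table into HDFS only (no direct Hive import).
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/shopdb \
      --username dbuser -P \
      --table orders \
      --target-dir /user/hadoop/staging/orders

    # Step 2: load the imported files into Hive manually.
    # Assumes a Hive table 'orders' already exists with a matching schema.
    hive -e "LOAD DATA INPATH '/user/hadoop/staging/orders' INTO TABLE orders;"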
Apache Flume is an independent agent designed to collect, transport, and store data into HDFS. Often,
data transport involves a number of Flume agents that may traverse a series of machines and locations.
Flume is commonly used for log files, social-media-generated data, email messages, and just about any
continuous data source.
In other words, Flume is a data-ingestion mechanism responsible for collecting and transporting huge
amounts of streaming data, such as events and log files, from many different sources (for example, web
servers) to one centralized data store such as HDFS.
Apache Flume supports several sources, including the following:
'tail': Data is piped from local files and written into HDFS via Flume, in a manner similar to the
Unix command 'tail' (see the configuration sketch after this list).
System logs.
Apache log4j: Enables Java applications to write events to files in HDFS via Flume.
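As a hedged illustration of the 'tail'-style source, the following minimal Flume agent configuration
(a Java-properties file, e.g. flume.conf) tails a local log file into HDFS. The agent name 'a1', the
log-file path, and the HDFS directory are assumptions:

    # Name this agent's components.
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Exec source: pipe data from a local file, tail-style.
    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /var/log/app/app.log
    a1.sources.r1.channels = c1

    # Memory channel: buffers events between source and sink.
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 10000

    # HDFS sink: write the events into HDFS as plain text.
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/logs
    a1.sinks.k1.hdfs.fileType = DataStream
    a1.sinks.k1.channel = c1

The agent would then be started with the standard launcher, for example:
flume-ng agent --conf conf --conf-file flume.conf --name a1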
Features of Flume:
Before going further, let's look at the features of Flume:
Flume ingests log data from many different web servers into HDFS and HBase very efficiently. Along
with that, it can retrieve huge volumes of event data from social-networking sites.
Using Flume, data can be pulled from multiple servers into Hadoop immediately.
Flume supports a large number of source and destination types.
Flume has a flexible design based on streaming data flows; this design is robust and fault-tolerant,
with several recovery mechanisms.
Apache Flume carries data between sources and sinks, and this gathering can be either event-driven
or scheduled.
Flume Architecture:
In the Flume architecture, data generators produce data, and this data is collected by the Flume
agents running on (or near) them. A data collector is another agent that aggregates the data gathered
from these agents. The aggregated data is then pushed to a centralized store, i.e., HDFS. A sketch of
this agent-to-collector topology follows.
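As a hedged sketch of the agent-to-collector topology, the configuration fragments below use Flume's
Avro sink and Avro source to forward events from an edge agent to a collector that writes to HDFS.
The agent names, host names, port, and paths are assumptions:

    # --- Edge agent: tails a local log and forwards events over Avro ---
    edge.sources = r1
    edge.channels = c1
    edge.sinks = avroOut
    edge.sources.r1.type = exec
    edge.sources.r1.command = tail -F /var/log/web/access.log
    edge.sources.r1.channels = c1
    edge.channels.c1.type = memory
    edge.sinks.avroOut.type = avro
    edge.sinks.avroOut.hostname = collector-host
    edge.sinks.avroOut.port = 4141
    edge.sinks.avroOut.channel = c1

    # --- Collector agent: receives Avro events and writes them to HDFS ---
    coll.sources = avroIn
    coll.channels = c1
    coll.sinks = k1
    coll.sources.avroIn.type = avro
    coll.sources.avroIn.bind = 0.0.0.0
    coll.sources.avroIn.port = 4141
    coll.sources.avroIn.channels = c1
    coll.channels.c1.type = memory
    coll.sinks.k1.type = hdfs
    coll.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/web
    coll.sinks.k1.channel = c1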
Advantages of Flume:
Here are the advantages of using Flume:
Using Apache Flume, we can store data in any centralized store (HBase, HDFS).
When the rate of incoming data exceeds the rate at which data can be written to the destination,
Flume acts as a mediator between the data producers and the centralized store, providing a steady
flow of data between them.
Flume provides the feature of contextual routing (see the channel-selector sketch after this list).
Transactions in Flume are channel-based: two transactions (one for the sender and one for the
receiver) are maintained for each message. This guarantees reliable message delivery.
Flume is reliable, fault-tolerant, scalable, manageable, and customizable.
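Contextual routing is typically configured with a multiplexing channel selector, which sends each
event to a channel chosen by a header value. The sketch below is a hedged example; the header name
('datacenter'), its values, and the channel names are assumptions:

    # Route events to different channels based on the 'datacenter' header.
    a1.sources = r1
    a1.channels = c1 c2
    a1.sources.r1.channels = c1 c2
    a1.sources.r1.selector.type = multiplexing
    a1.sources.r1.selector.header = datacenter
    a1.sources.r1.selector.mapping.east = c1
    a1.sources.r1.selector.mapping.west = c2
    a1.sources.r1.selector.default = c1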
This brings us to the end of this tutorial on Flume. We learned about Apache Flume in depth and
looked at its architecture.