
On the one hand, you may have a Hadoop cluster used for processing and storing large amounts of data; on the other hand, you have an application that produces a large amount of data, or a legacy system that stores its data in a relational database. How do you connect these two? That is exactly where Flume and Sqoop come in.

Generally, the data comes from two kinds of sources: either an application that produces data on a regular basis, or a traditional relational database management system (RDBMS) such as Oracle DB or SQL Server. In both cases we have sources that contain data and a destination, which is a Hadoop ecosystem data store.

The question now is: how do we get our data from these sources into Hadoop?
After this introduction you will of course say Flume and Sqoop, but let us explain how this process is done, or rather, what the steps would be in the absence of these tools.

Normally, every Hadoop ecosystem technology exposes Java APIs (application programming interfaces), and you can use these APIs directly to write data to, for example, HDFS, HBase or Cassandra.
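
For example, a minimal sketch of writing a file straight through the HDFS Java API could look like the following (the namenode URI and the target path are assumptions for illustration, not values from this document):

    import java.net.URI;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Minimal sketch: write a small file to HDFS through the FileSystem API.
    // The namenode URI and the target path below are placeholders.
    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
            try (FSDataOutputStream out = fs.create(new Path("/data/events/sample.txt"))) {
                out.write("one event at a time\n".getBytes(StandardCharsets.UTF_8));
            }
            fs.close();
        }
    }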

However, there are a few reasons why this can be a problem, depending on whether you are transferring data from an application or bulk transferring data from an RDBMS.

Let us start with the application: suppose that a number of events produce data for this application, and this data needs to be stored as the events occur. This is called streaming data. To handle it directly through the APIs:

HDFS files have to be large to take advantage of HDFS's distributed architecture, which means buffering the data in memory or in an intermediate file before writing it to HDFS;

and you should not lose any data even if there is a crash, so you need a guarantee that no data will be lost.

All of these difficulties and problems are taken care of by Flume.
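
To make this concrete, a minimal Flume agent configuration could look like the sketch below (the agent name a1, the netcat source on port 44444 and the HDFS path are illustrative assumptions): the channel buffers incoming events, a file channel keeps them on disk so a crash does not lose data, and the HDFS sink rolls them into large files.

    # Components of agent a1 (names and values are illustrative)
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Source: receive events over a TCP port (netcat source, for demonstration)
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444

    # Channel: a durable file channel buffers events so they survive a crash
    a1.channels.c1.type = file

    # Sink: write events to HDFS, rolling files by size so they stay large
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = /flume/events
    a1.sinks.k1.hdfs.fileType = DataStream
    a1.sinks.k1.hdfs.rollSize = 134217728
    a1.sinks.k1.hdfs.rollCount = 0
    a1.sinks.k1.hdfs.rollInterval = 0

    # Wire the source and the sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1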

There are also a few problems with using an RDBMS and integrating directly with a Java API.
Let us say that you have a legacy system that uses an RDBMS and you want to port its data to HDFS.

Fortunately, thanks to Sqoop, we do not need to think about which option to choose.
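
As an illustration, a typical Sqoop import from a relational database into HDFS could look like the command below (the JDBC URL, credentials, table name and target directory are placeholders, not values from this document):

    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username dbuser \
      --password-file /user/dbuser/.password \
      --table customers \
      --target-dir /data/customers \
      --num-mappers 4

Sqoop turns an import like this into parallel map tasks that read the table over JDBC and write the rows into HDFS.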

Their roles in the Hadoop ecosystem are quite similar, but the use cases for each are slightly different.

The first difference is:


In computer science, a data buffer (or just buffer) is a region of physical memory storage used to temporarily store data while it is being moved from one place to another.
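
As a small illustration of this idea in Java (the file name and the number of events are arbitrary assumptions), a BufferedOutputStream collects many small writes in memory and flushes them to disk in larger chunks:

    import java.io.BufferedOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    // Minimal buffering sketch: small writes accumulate in an 8 KB in-memory
    // buffer and reach the file in larger chunks, not one disk write per event.
    public class BufferExample {
        public static void main(String[] args) throws IOException {
            try (BufferedOutputStream out =
                     new BufferedOutputStream(new FileOutputStream("events.log"), 8192)) {
                for (int i = 0; i < 10_000; i++) {
                    out.write(("event " + i + "\n").getBytes(StandardCharsets.UTF_8));
                }
            } // closing the stream flushes any remaining buffered bytes
        }
    }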
