storing large amounts of data. On the other hand, you have an application that produces a large amount of data, or a legacy system that stores its data in a relational database.
How do you connect the two? That is exactly where Flume and Sqoop come in.
Generally, the data comes from two kinds of sources: either an application that produces data on a regular basis, or a traditional relational database management system (RDBMS) such as Oracle DB or SQL Server. In both cases you have sources that contain data and a destination that is a Hadoop ecosystem data store.
The question now is: how do we get the data from these sources into Hadoop?
After the introduction you will of course answer Flume and Sqoop, but let us first explain how this process is done, or rather, what the steps would be in the absence of these tools.
Normally, every Hadoop ecosystem technology exposes Java APIs (application programming interfaces), and you can use these APIs directly to write data to, for example, HDFS, HBase, or Cassandra.
But there are a few reasons why that may be a problem, depending on whether you are streaming data from an application or bulk-transferring data from an RDBMS.
Let us start with the application case. Suppose a number of events produce data for this application, and that data needs to be stored as the events occur; this is called streaming data. To handle it:
First, HDFS files have to be large to take advantage of HDFS's distributed architecture, which means buffering the data in memory or in an intermediate file before writing it to HDFS.
Second, you should not lose any data even if there is a crash, so you need a guarantee that no data will be lost.
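To make the buffering idea concrete, here is a minimal Python sketch (the class name and the flush threshold are my own illustration, not part of any Hadoop or Flume API): events accumulate in memory and are only flushed to the destination file once a batch is large enough, instead of producing one tiny write per event. This is roughly the kind of batching a tool like Flume automates for you.

```python
import os
import tempfile

class BufferedEventWriter:
    """Accumulate events in memory and flush them in large batches,
    instead of performing one tiny write per event."""

    def __init__(self, path, flush_threshold=1000):
        self.path = path
        self.flush_threshold = flush_threshold
        self.buffer = []

    def write(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # Append the whole batch in one write; a real pipeline would
        # target HDFS and also persist the buffer to survive a crash.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write("\n".join(self.buffer) + "\n")
        self.buffer = []

# Usage: 2500 events become three large appends, not 2500 writes.
path = os.path.join(tempfile.mkdtemp(), "events.log")
writer = BufferedEventWriter(path, flush_threshold=1000)
for i in range(2500):
    writer.write(f"event-{i}")
writer.flush()  # flush the 500 events still sitting in memory
```

Note that the sketch only shows the batching; the crash-safety requirement above is exactly what it does not solve, since events in the in-memory buffer would be lost if the process died before a flush.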
There are also a few problems with using an RDBMS and integrating directly through a Java API. Say you have a legacy system that uses an RDBMS, and you want to port its data to HDFS.
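Without a tool, you would have to hand-write that port yourself. Here is a rough Python sketch of the idea, using an in-memory SQLite table as a stand-in for the legacy RDBMS (the `customers` table and its columns are invented for illustration): read the table in fixed-size batches and write each batch out as a delimited "part" file, which is essentially the shape of output a Sqoop import produces in HDFS.

```python
import os
import sqlite3
import tempfile

# Stand-in for the legacy RDBMS: an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(i, f"customer-{i}") for i in range(10)],
)

out_dir = tempfile.mkdtemp()
batch_size = 4
part = 0

# Export the table in batches; each batch becomes one delimited file,
# similar to the part files a Sqoop import writes into HDFS.
cursor = conn.execute("SELECT id, name FROM customers ORDER BY id")
while True:
    rows = cursor.fetchmany(batch_size)
    if not rows:
        break
    part_path = os.path.join(out_dir, f"part-{part:05d}")
    with open(part_path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(",".join(str(col) for col in row) + "\n")
    part += 1
```

Even this toy version hints at the real difficulties: splitting the table into chunks, mapping SQL types to text, and doing it all in parallel without overloading the source database, which is the work Sqoop takes off your hands.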
Fortunately, we do not need to worry about any of these options, thanks to Flume and Sqoop.
Their roles in the Hadoop ecosystem are quite similar, but the use cases for each will be slightly different.