How Sqoop Works: Sqoop "SQL To Hadoop and Hadoop To SQL"
The traditional application management system, that is, the interaction of applications with relational databases using an RDBMS, is one of the sources that generate Big Data. Such Big Data, generated by RDBMS, is stored in Relational Database Servers in the relational database structure.
When Big Data storages and analyzers such as MapReduce, Hive, HBase, Cassandra, Pig, etc. of the Hadoop ecosystem came into the picture, they required a tool to interact with relational database servers for importing and exporting the Big Data residing in them. Here, Sqoop occupies a place in the Hadoop ecosystem to provide feasible interaction between relational database servers and Hadoop’s HDFS.
Sqoop − “SQL to Hadoop and Hadoop to SQL”
Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL and Oracle to Hadoop HDFS, and to export data from the Hadoop file system back to relational databases. It is provided by the Apache Software Foundation.
Sqoop Import
The import tool imports individual tables from RDBMS to HDFS. Each row in a table is
treated as a record in HDFS. All records are stored as text data in text files or as binary
data in Avro and Sequence files.
Sqoop Export
The export tool exports a set of files from HDFS back to an RDBMS. The files given as input to Sqoop contain records, which are called rows in the table. These are read and parsed into a set of records and delimited with a user-specified delimiter.
This chapter describes how to import data from a MySQL database to Hadoop HDFS. The ‘import tool’ imports individual tables from RDBMS to HDFS. Each row in a table is treated as a record in HDFS. All records are stored as text data in text files or as binary data in Avro and Sequence files.
Syntax
The following syntax is used to import data into HDFS.
$ sqoop import (generic-args) (import-args)
Example
Let us take an example of three tables named emp, emp_add, and emp_contact, which are in a database called userdb on a MySQL database server.
The three tables and their data are as follows.
emp:
id     name     deg       salary   dept
1201   gopal    manager   50,000   TP
1203   khalil   php dev   30,000   AC
emp_add:
id     hno    street   city
1201   288A   vgiri    jublee
emp_contact:
id     phno      email
1201   2356742   gopal@tp.com
1203   8887776   khalil@ac.com
1205   1231231   kranthi@tp.com
Importing a Table
The Sqoop ‘import’ tool imports table data from an RDBMS table into the Hadoop file system as a text file or a binary file.
The following command is used to import the emp table from MySQL database server to
HDFS.
$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp --m 1
If it is executed successfully, then you get the following output.
14/12/22 15:24:54 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5
14/12/22 15:24:56 INFO manager.MySQLManager: Preparing to use a MySQL
streaming resultset.
14/12/22 15:24:56 INFO tool.CodeGenTool: Beginning code generation
14/12/22 15:24:58 INFO manager.SqlManager: Executing SQL statement:
SELECT t.* FROM `emp` AS t LIMIT 1
14/12/22 15:24:58 INFO manager.SqlManager: Executing SQL statement:
SELECT t.* FROM `emp` AS t LIMIT 1
14/12/22 15:24:58 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is
/usr/local/hadoop
14/12/22 15:25:11 INFO orm.CompilationManager: Writing jar file:
/tmp/sqoop-hadoop/compile/cebe706d23ebb1fd99c1f063ad51ebd7/emp.jar
-----------------------------------------------------
-----------------------------------------------------
14/12/22 15:25:40 INFO mapreduce.Job: The url to track the job:
http://localhost:8088/proxy/application_1419242001831_0001/
14/12/22 15:26:45 INFO mapreduce.Job: Job job_1419242001831_0001
running in uber mode : false
14/12/22 15:26:45 INFO mapreduce.Job: map 0% reduce 0%
14/12/22 15:28:08 INFO mapreduce.Job: map 100% reduce 0%
14/12/22 15:28:16 INFO mapreduce.Job: Job job_1419242001831_0001
completed successfully
-----------------------------------------------------
-----------------------------------------------------
14/12/22 15:28:17 INFO mapreduce.ImportJobBase: Transferred 145 bytes
in 177.5849 seconds
(0.8165 bytes/sec)
14/12/22 15:28:17 INFO mapreduce.ImportJobBase: Retrieved 5 records.
We can specify the target directory while importing table data into HDFS using the Sqoop import tool. The syntax for the option is --target-dir <new or existing directory in HDFS>. The following command imports the emp_add table into the /queryresult directory.
$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp_add \
--m 1 \
--target-dir /queryresult
The following command is used to verify the imported data in the /queryresult directory from the emp_add table.
$ $HADOOP_HOME/bin/hadoop fs -cat /queryresult/part-m-*
It will show you the emp_add table data with comma (,) separated fields.
1201, 288A, vgiri, jublee
1202, 108I, aoc, sec-bad
1203, 144Z, pgutta, hyd
1204, 78B, oldcity, sec-bad
1205, 720C, hitech, sec-bad
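The part files produced by an import are plain delimited text, so any downstream tool can read them back into records. The following Python sketch is illustrative only (it is not part of Sqoop); it parses lines shaped like the emp_add output above:

```python
# Parse Sqoop's default comma-delimited text output into records.
# The sample lines mirror the emp_add output shown above.
lines = [
    "1201, 288A, vgiri, jublee",
    "1202, 108I, aoc, sec-bad",
]

def parse_record(line):
    # Split on the field delimiter and strip surrounding whitespace.
    return [field.strip() for field in line.split(",")]

records = [parse_record(line) for line in lines]
print(records[0])  # ['1201', '288A', 'vgiri', 'jublee']
```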
We can import a subset of a table using the ‘where’ clause in Sqoop import tool. It
executes the corresponding SQL query in the respective database server and stores the
result in a target directory in HDFS.
The syntax for where clause is as follows.
--where <condition>
The following command is used to import a subset of the emp_add table data. The subset query retrieves the employee id and address of employees who live in Secunderabad city.
$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp_add \
--m 1 \
--where "city ='sec-bad'" \
--target-dir /wherequery
The following command is used to verify the imported data in the /wherequery directory from the emp_add table.
$ $HADOOP_HOME/bin/hadoop fs -cat /wherequery/part-m-*
Incremental Import
Incremental import is a technique that imports only the newly added rows in a table. It is
required to add ‘incremental’, ‘check-column’, and ‘last-value’ options to perform the
incremental import.
The following syntax is used for the incremental option in Sqoop import command.
--incremental <mode>
--check-column <column name>
--last-value <last check column value>
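Conceptually, append mode imports only rows whose check-column value exceeds the saved last-value. The following Python sketch is illustrative only (Sqoop performs this selection with generated SQL in the database):

```python
# Simulate --incremental append --check-column id --last-value 1205.
rows = [
    {"id": 1201, "name": "gopal"},
    {"id": 1205, "name": "kranthi"},
    {"id": 1206, "name": "satish p"},  # newly added row
]
last_value = 1205

# Only rows whose check-column value is strictly greater than
# the stored last-value qualify for the incremental import.
new_rows = [row for row in rows if row["id"] > last_value]
print(new_rows)  # [{'id': 1206, 'name': 'satish p'}]
```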
Let us assume the newly added data into emp table is as follows −
1206, satish p, grp des, 20000, GR
The following command is used to perform the incremental import in the emp table.
$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp \
--m 1 \
--incremental append \
--check-column id \
--last-value 1205
The following command is used to verify the imported data from the emp table in the HDFS emp/ directory.
$ $HADOOP_HOME/bin/hadoop fs -cat /emp/part-m-*
To see only the modified or newly added rows from the emp table, inspect the newly created part file in the same directory.
Syntax
The following syntax is used to import all tables.
$ sqoop import-all-tables (generic-args) (import-args)
Example
Let us take an example of importing all tables from the userdb database. The list of tables
that the database userdb contains is as follows.
+--------------------+
| Tables |
+--------------------+
| emp |
| emp_add |
| emp_contact |
+--------------------+
The following command is used to import all the tables from the userdb database.
$ sqoop import-all-tables \
--connect jdbc:mysql://localhost/userdb \
--username root
Note − If you are using import-all-tables, it is mandatory that every table in that database has a primary key field.
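The primary key matters because, when more than one mapper is used, Sqoop splits each table's import into ranges over that key (a --split-by column can substitute for it). The following Python function is a hypothetical, simplified sketch of such range splitting, not Sqoop's actual code:

```python
# Sketch of splitting an import into contiguous key ranges,
# one range per mapper, using the key's min and max values.
def split_ranges(min_id, max_id, num_mappers):
    size = (max_id - min_id + 1) / num_mappers
    ranges = []
    for i in range(num_mappers):
        lo = min_id + round(i * size)
        hi = min_id + round((i + 1) * size) - 1
        ranges.append((lo, hi))
    return ranges

print(split_ranges(1201, 1206, 2))  # [(1201, 1203), (1204, 1206)]
```

Each mapper would then run its own bounded query (e.g. WHERE id >= 1201 AND id <= 1203), which is why a splittable column is needed.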
The following command is used to verify all the table data imported from the userdb database into HDFS.
$ $HADOOP_HOME/bin/hadoop fs -ls
It will show you the list of table names in userdb database as directories.
Syntax
The following syntax is used to export data from HDFS.
$ sqoop export (generic-args) (export-args)
Example
Let us take an example of employee data in a file in HDFS. The employee data is available in the emp_data file in the ‘emp/’ directory in HDFS. The emp_data file is as follows.
1201, gopal, manager, 50000, TP
1202, manisha, preader, 50000, TP
1203, kalil, php dev, 30000, AC
1204, prasanth, php dev, 30000, AC
1205, kranthi, admin, 20000, TP
1206, satish p, grp des, 20000, GR
It is mandatory that the table to be exported is created manually and is present in the database into which it has to be exported.
The following query is used to create the table ‘employee’ in mysql command line.
$ mysql
mysql> USE db;
mysql> CREATE TABLE employee (
id INT NOT NULL PRIMARY KEY,
name VARCHAR(20),
deg VARCHAR(20),
salary INT,
dept VARCHAR(10));
The following command is used to export the table data (which is in the emp_data file on HDFS) to the employee table in the db database of the MySQL database server.
$ sqoop export \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee \
--export-dir /emp/emp_data
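Under the hood, the export tool parses each delimited record and generates INSERT statements against the target table. The following Python function is a hypothetical, simplified sketch of that translation (the real tool batches statements and submits them over JDBC):

```python
# Turn one delimited record into an SQL INSERT statement.
def record_to_insert(table, columns, line, delimiter=","):
    values = [field.strip() for field in line.split(delimiter)]
    cols = ", ".join(columns)
    vals = ", ".join(repr(v) for v in values)
    return f"INSERT INTO {table} ({cols}) VALUES ({vals});"

# A record shaped like the emp_data file shown above.
line = "1201, gopal, manager, 50000, TP"
sql = record_to_insert("employee", ["id", "name", "deg", "salary", "dept"], line)
print(sql)
# INSERT INTO employee (id, name, deg, salary, dept) VALUES ('1201', 'gopal', 'manager', '50000', 'TP');
```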
The following command is used to verify the table in the mysql command line.
mysql> SELECT * FROM employee;
This chapter describes how to create and maintain Sqoop jobs. A Sqoop job creates and saves the import and export commands. It specifies parameters to identify and recall the saved job. This re-calling or re-executing is used in incremental import, which can import the updated rows from an RDBMS table to HDFS.
Syntax
$ sqoop job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]
Here we are creating a job with the name myjob, which can import the table data from an RDBMS table to HDFS. The following command creates a job that imports data from the employee table in the db database to HDFS.
$ sqoop job --create myjob \
-- import \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee --m 1
The ‘--list’ argument is used to verify the saved jobs. The following command is used to verify the list of saved Sqoop jobs.
$ sqoop job --list
‘--show’ argument is used to inspect or verify particular jobs and their details. The
following command and sample output is used to verify a job called myjob.
$ sqoop job --show myjob
It shows the tools and their options, which are used in myjob.
Job: myjob
Tool: import
Options:
----------------------------
direct.import = true
codegen.input.delimiters.record = 0
hdfs.append.dir = false
db.table = employee
...
incremental.last.value = 1206
...
The ‘--exec’ option is used to execute a saved job. The following command is used to execute a saved job called myjob.
$ sqoop job --exec myjob
This chapter describes how to list out the databases using Sqoop. Sqoop list-databases
tool parses and executes the ‘SHOW DATABASES’ query against the database server.
Thereafter, it lists out the present databases on the server.
Syntax
$ sqoop list-databases (generic-args) (list-databases-args)
Sample Query
The following command is used to list all the databases in the MySQL database server.
$ sqoop list-databases \
--connect jdbc:mysql://localhost/ \
--username root
If the command executes successfully, then it will display the list of databases in your
MySQL database server as follows.
...
13/05/31 16:45:58 INFO manager.MySQLManager: Preparing to use a MySQL
streaming resultset.
mysql
test
userdb
db
Q - What is the default value used by Sqoop when it encounters a missing value while importing from a CSV file?
A - NULL
B - null
C - space character
D - No values
Answer : B
Explanation
Unlike databases, there are no NULL values in CSV files. Missing values are handled by Sqoop by substituting the string ‘null’.
Q - Can Sqoop use the TRUNCATE option in the database while clearing data from a table?
A - Yes
B - No
Answer : A
Explanation
If available through the database driver, Sqoop can clear the data quickly using the TRUNCATE option.
Q - If the table to which data is being exported has more columns than the data present in the HDFS file, then
B - The load can be done only for the relevant columns present in the HDFS file
Answer : B
Explanation
The load can still be done by specifying the --columns parameter to populate a subset of the table columns.
Q - While importing data into Hadoop using Sqoop, the SQL SELECT clause is used. Similarly, while exporting data from Hadoop, the SQL clause used is
A - APPEND
B - MERGE
C - UPDATE
D - INSERT
Answer : D
Explanation
The INSERT statements are generated by Sqoop to insert data into the relational tables.
Q - Which driver is used by Sqoop to connect to relational databases?
A - RDBMS driver
B - JDBC Driver
C - IDBC Driver
D - SQL Driver
Answer : B
Explanation
The JDBC driver is a Java program which has traditionally provided database connectivity for Java applications.
Answer : D
Explanation
As Sqoop calls the stored procedure using parallel jobs, a heavy load is induced on the database.
Q - The data type mapping between the database column and the Sqoop column can be overridden by using the parameter
A - --override-column-type
B - --map-column-type
C - --override-column-java
D - --map-column-java
Answer : D
Explanation
As Sqoop uses Java data types internally, the mapping of the data types has to be overridden with the --map-column-java parameter when needed.
A - Update the primary key field present in the Hadoop data to be exported
B - Update the primary key field in the table to which data is already exported
Answer : D
Explanation
The --update-key parameter uses the primary key of the table to update the entire record in the relational table.
Q - The parameter which can be used in place of the --table parameter to insert data into a table is
A - --call
B - --insert-into
C - --populate
D - --load-into
Answer : A
Explanation
The --call parameter will call a database stored procedure, which in turn can insert data into the table.
Q - The comparison of row counts between the source system and the target database while loading data using Sqoop is done using the parameter
A - --validate
B - --rowcount
C - --row(count)
D - --allrows
Answer : A
Explanation
The --validate parameter is used to show the result of the row comparison between source and target.
D - store the parameters and their values in a file to be used by various Sqoop commands.
Answer : D
Explanation
The command-line options (the names and values of the parameters) that do not change from time to time can be saved into a file and used again and again. This is called an options file.
Q - What happens if the Sqoop-generated export query is not accepted by the database?
D - Sqoop automatically modifies the query to succeed after receiving the failure response from the database.
Answer : A
Explanation
Q - While inserting data into a relational system from Hadoop using Sqoop, the various table constraints present in the relational table must be
A - Disabled temporarily
C - Renamed
D - Not violated
Answer : D
Explanation
We must verify that the data being exported does not violate the constraints of the table.
Q - Can the upsert feature of Sqoop delete some data from the exported table?
A - Yes
B - No
C - Depends on database
Answer : B
Explanation
The upsert feature can only update existing rows or insert new ones; it never deletes existing data.
Q - The insert query used to insert exported data into tables is generated by
C - JDBC driver
Answer : A
Explanation
The insert query is generated by the Sqoop command itself; the JDBC driver merely executes it.
A - the error margin between the source and target is within a range
D - the number of rows rejected by the target database while loading the data
Answer : A
Explanation
The ValidationThreshold determines if the error margin between the source and target is acceptable. The default implementation is AbsoluteValidationThreshold, which ensures the row counts from source and target are the same.
Q - What are the two different incremental modes of importing data into Sqoop?
Answer : D
Explanation
The --incremental parameter is used to fetch only the new data (data that does not yet exist in HDFS). In ‘append’ mode the check column is examined for new values; the ‘lastmodified’ mode uses a last-updated-date column in the existing table to identify new or changed rows.
Q - For text-based columns, the parameter used for substituting null values is
A - --input-null-string
B - --input-null-non-string
C - --input-null-text
D - --input-null-varchar
Answer : A
Explanation
The --input-null-string parameter is used to substitute null values for text-based columns.
Q - The free-form query import feature in Sqoop allows importing data from
Answer : C
Explanation
With the free-form query feature we can write an SQL query involving a join between two tables and pass it with the --query parameter while importing. It is used in place of the --table parameter.
Q - The Sqoop export/import jobs can be stored and used again and again by using
A - sqoop-jobname
B - sqoop-save-job
C - sqoop-all-jobs
D - sqoop-job
Answer : D
Explanation
Running a Sqoop job using the sqoop-job statement saves the job into the metastore, from which it can be recalled and executed again.
What is YARN
YARN (Yet Another Resource Negotiator) takes Hadoop beyond running only Java MapReduce programs and lets other applications such as HBase and Spark work on the cluster. Different YARN applications can co-exist on the same cluster, so MapReduce, HBase, and Spark can all run at the same time, bringing great benefits for manageability and cluster utilization.
Components Of YARN
•Client: For submitting MapReduce jobs.
Benefits of YARN
•Scalability: MapReduce 1 hits a scalability bottleneck at 4,000 nodes and 40,000 tasks, but YARN is designed for 10,000 nodes and 100,000 tasks.