
The traditional application management system, that is, the interaction of applications with relational databases through an RDBMS, is one of the sources that generate Big Data. Such Big Data is stored in relational database servers in the relational database structure.
When Big Data stores and analyzers of the Hadoop ecosystem such as MapReduce, Hive, HBase, Cassandra, and Pig came into the picture, they required a tool to interact with relational database servers for importing and exporting the Big Data residing in them. Sqoop occupies this place in the Hadoop ecosystem, providing interaction between relational database servers and Hadoop's HDFS.
Sqoop − “SQL to Hadoop and Hadoop to SQL”
Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL and Oracle into Hadoop HDFS, and to export data from the Hadoop file system back to relational databases. It is provided by the Apache Software Foundation.

How Does Sqoop Work?

Sqoop's workflow consists of two operations, import and export, described below.

Sqoop Import

The import tool imports individual tables from RDBMS to HDFS. Each row in a table is
treated as a record in HDFS. All records are stored as text data in text files or as binary
data in Avro and Sequence files.
Sqoop Export

The export tool exports a set of files from HDFS back to an RDBMS. The files given as input to Sqoop contain records, which are called rows in the table. These files are read and parsed into a set of records and delimited with a user-specified delimiter.

This chapter describes how to import data from a MySQL database to Hadoop HDFS. The import tool imports individual tables from an RDBMS to HDFS. Each row in a table is treated as a record in HDFS. All records are stored as text data in text files or as binary data in Avro and Sequence files.

Syntax

The following syntax is used to import data into HDFS.

$ sqoop import (generic-args) (import-args)


$ sqoop-import (generic-args) (import-args)

Example

Let us take an example of three tables named emp, emp_add, and emp_contact, which are in a database called userdb on a MySQL database server.
The three tables and their data are as follows.

emp:

id     name       deg            salary   dept
1201   gopal      manager        50,000   TP
1202   manisha    Proof reader   50,000   TP
1203   khalil     php dev        30,000   AC
1204   prasanth   php dev        30,000   AC
1205   kranthi    admin          20,000   TP

emp_add:

id     hno    street     city
1201   288A   vgiri      jublee
1202   108I   aoc        sec-bad
1203   144Z   pgutta     hyd
1204   78B    old city   sec-bad
1205   720X   hitec      sec-bad

emp_contact:

id     phno      email
1201   2356742   gopal@tp.com
1202   1661663   manisha@tp.com
1203   8887776   khalil@ac.com
1204   9988774   prasanth@ac.com
1205   1231231   kranthi@tp.com
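If you want to recreate this setup, the userdb database and the emp table can be defined from the MySQL command line roughly as follows (a minimal sketch; the column sizes are assumptions, and emp_add and emp_contact follow the same pattern):

mysql> CREATE DATABASE userdb;
mysql> USE userdb;
mysql> CREATE TABLE emp (
    id INT NOT NULL PRIMARY KEY,
    name VARCHAR(20),
    deg VARCHAR(20),
    salary INT,
    dept VARCHAR(10));
mysql> INSERT INTO emp VALUES (1201, 'gopal', 'manager', 50000, 'TP');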

Importing a Table

The Sqoop ‘import’ tool is used to import table data from an RDBMS table into the Hadoop file system as a text file or a binary file.
The following command is used to import the emp table from the MySQL database server to HDFS.

$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp --m 1
If it is executed successfully, then you get the following output.
14/12/22 15:24:54 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5
14/12/22 15:24:56 INFO manager.MySQLManager: Preparing to use a MySQL
streaming resultset.
14/12/22 15:24:56 INFO tool.CodeGenTool: Beginning code generation
14/12/22 15:24:58 INFO manager.SqlManager: Executing SQL statement:
SELECT t.* FROM `emp` AS t LIMIT 1
14/12/22 15:24:58 INFO manager.SqlManager: Executing SQL statement:
SELECT t.* FROM `emp` AS t LIMIT 1
14/12/22 15:24:58 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is
/usr/local/hadoop
14/12/22 15:25:11 INFO orm.CompilationManager: Writing jar file:
/tmp/sqoop-hadoop/compile/cebe706d23ebb1fd99c1f063ad51ebd7/emp.jar
-----------------------------------------------------
-----------------------------------------------------
14/12/22 15:25:40 INFO mapreduce.Job: The url to track the job:
http://localhost:8088/proxy/application_1419242001831_0001/
14/12/22 15:26:45 INFO mapreduce.Job: Job job_1419242001831_0001
running in uber mode :
false
14/12/22 15:26:45 INFO mapreduce.Job: map 0% reduce 0%
14/12/22 15:28:08 INFO mapreduce.Job: map 100% reduce 0%
14/12/22 15:28:16 INFO mapreduce.Job: Job job_1419242001831_0001
completed successfully
-----------------------------------------------------
-----------------------------------------------------
14/12/22 15:28:17 INFO mapreduce.ImportJobBase: Transferred 145 bytes
in 177.5849 seconds
(0.8165 bytes/sec)
14/12/22 15:28:17 INFO mapreduce.ImportJobBase: Retrieved 5 records.

To verify the imported data in HDFS, use the following command.

$ $HADOOP_HOME/bin/hadoop fs -cat /emp/part-m-*


It shows you the emp table data with comma (,) separated fields.
1201, gopal, manager, 50000, TP
1202, manisha, preader, 50000, TP
1203, kalil, php dev, 30000, AC
1204, prasanth, php dev, 30000, AC
1205, kranthi, admin, 20000, TP
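The comma is simply Sqoop's default text delimiter; if a different separator is needed, it can be overridden with the output formatting arguments, for example (a sketch based on the earlier import command):

$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp \
--fields-terminated-by '\t' \
--m 1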

Importing into Target Directory

We can specify the target directory while importing table data into HDFS using the Sqoop
import tool.
The following is the syntax to specify the target directory as an option to the Sqoop import command.

--target-dir <new or existing directory in HDFS>


The following command is used to import emp_add table data into ‘/queryresult’ directory.

$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp_add \
--m 1 \
--target-dir /queryresult
The following command is used to verify the imported data in the /queryresult directory from the emp_add table.
$ $HADOOP_HOME/bin/hadoop fs -cat /queryresult/part-m-*
It will show you the emp_add table data with comma (,) separated fields.
1201, 288A, vgiri, jublee
1202, 108I, aoc, sec-bad
1203, 144Z, pgutta, hyd
1204, 78B, oldcity, sec-bad
1205, 720C, hitech, sec-bad

Import Subset of Table Data

We can import a subset of a table using the ‘where’ clause in the Sqoop import tool. It executes the corresponding SQL query on the respective database server and stores the result in a target directory in HDFS.
The syntax for the where clause is as follows.

--where <condition>
The following command is used to import a subset of the emp_add table data. The subset query retrieves the employee id and address of employees who live in Secunderabad (sec-bad) city.

$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp_add \
--m 1 \
--where "city ='sec-bad'" \
--target-dir /wherequery
The following command is used to verify the imported data in /wherequery directory from
the emp_add table.

$ $HADOOP_HOME/bin/hadoop fs -cat /wherequery/part-m-*


It will show you the emp_add table data with comma (,) separated fields.
1202, 108I, aoc, sec-bad
1204, 78B, oldcity, sec-bad
1205, 720C, hitech, sec-bad

Incremental Import

Incremental import is a technique that imports only the newly added rows in a table. It is
required to add ‘incremental’, ‘check-column’, and ‘last-value’ options to perform the
incremental import.
The following syntax is used for the incremental option in Sqoop import command.

--incremental <mode>
--check-column <column name>
--last-value <last check column value>
Let us assume that the newly added data in the emp table is as follows −
1206, satish p, grp des, 20000, GR

The following command is used to perform the incremental import in the emp table.

$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp \
--m 1 \
--incremental append \
--check-column id \
--last-value 1205
The following command is used to verify the data imported from the emp table into the HDFS emp/ directory.

$ $HADOOP_HOME/bin/hadoop fs -cat /emp/part-m-*


It shows you the emp table data with comma (,) separated fields.
1201, gopal, manager, 50000, TP
1202, manisha, preader, 50000, TP
1203, kalil, php dev, 30000, AC
1204, prasanth, php dev, 30000, AC
1205, kranthi, admin, 20000, TP
1206, satish p, grp des, 20000, GR

The following command is used to see the modified or newly added rows from
the emp table.

$ $HADOOP_HOME/bin/hadoop fs -cat /emp/part-m-*1


It shows you the newly added rows to the emp table with comma (,) separated fields.
1206, satish p, grp des, 20000, GR
This chapter describes how to import all the tables from the RDBMS database server to HDFS. Each table's data is stored in a separate directory, and the directory name is the same as the table name.

Syntax

The following syntax is used to import all tables.

$ sqoop import-all-tables (generic-args) (import-args)


$ sqoop-import-all-tables (generic-args) (import-args)

Example

Let us take an example of importing all tables from the userdb database. The list of tables
that the database userdb contains is as follows.
+--------------------+
| Tables |
+--------------------+
| emp |
| emp_add |
| emp_contact |
+--------------------+

The following command is used to import all the tables from the userdb database.

$ sqoop import-all-tables \
--connect jdbc:mysql://localhost/userdb \
--username root
Note − If you are using import-all-tables, every table in that database must have a primary key field.
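If some tables should be skipped during a bulk import (for example, a table without a primary key), import-all-tables accepts an --exclude-tables option; a sketch:

$ sqoop import-all-tables \
--connect jdbc:mysql://localhost/userdb \
--username root \
--exclude-tables emp_contact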
The following command is used to verify all the table data from the userdb database in HDFS.

$ $HADOOP_HOME/bin/hadoop fs -ls
It will show you the list of table names in the userdb database as directories.

Output

drwxr-xr-x - hadoop supergroup 0 2014-12-22 22:50 _sqoop


drwxr-xr-x - hadoop supergroup 0 2014-12-23 01:46 emp
drwxr-xr-x - hadoop supergroup 0 2014-12-23 01:50 emp_add
drwxr-xr-x - hadoop supergroup 0 2014-12-23 01:52 emp_contact
This chapter describes how to export data back from HDFS to the RDBMS database. The target table must exist in the target database. The files given as input to Sqoop contain records, which are called rows in the table. These are read and parsed into a set of records and delimited with a user-specified delimiter.
The default operation is to insert all the records from the input files into the database table using the INSERT statement. In update mode, Sqoop generates UPDATE statements that replace the existing records in the database.

Syntax

The following is the syntax for the export command.

$ sqoop export (generic-args) (export-args)


$ sqoop-export (generic-args) (export-args)

Example

Let us take an example of employee data in a file in HDFS. The employee data is available in the emp_data file in the ‘emp/’ directory in HDFS. The emp_data file is as follows.
1201, gopal, manager, 50000, TP
1202, manisha, preader, 50000, TP
1203, kalil, php dev, 30000, AC
1204, prasanth, php dev, 30000, AC
1205, kranthi, admin, 20000, TP
1206, satish p, grp des, 20000, GR

It is mandatory that the table to be populated by the export is created manually and is present in the database into which the data has to be exported.
The following query is used to create the table ‘employee’ from the MySQL command line.

$ mysql
mysql> USE db;
mysql> CREATE TABLE employee (
id INT NOT NULL PRIMARY KEY,
name VARCHAR(20),
deg VARCHAR(20),
salary INT,
dept VARCHAR(10));

The following command is used to export the table data (which is in the emp_data file on HDFS) to the employee table in the db database of the MySQL database server.
$ sqoop export \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee \
--export-dir /emp/emp_data
The following command is used to verify the table from the MySQL command line.

mysql> SELECT * FROM employee;


If the given data is stored successfully, then you can find the following table of given
employee data.
+------+----------+-------------+--------+------+
| Id   | Name     | Designation | Salary | Dept |
+------+----------+-------------+--------+------+
| 1201 | gopal    | manager     | 50000  | TP   |
| 1202 | manisha  | preader     | 50000  | TP   |
| 1203 | kalil    | php dev     | 30000  | AC   |
| 1204 | prasanth | php dev     | 30000  | AC   |
| 1205 | kranthi  | admin       | 20000  | TP   |
| 1206 | satish p | grp des     | 20000  | GR   |
+------+----------+-------------+--------+------+

This chapter describes how to create and maintain Sqoop jobs. A Sqoop job creates and saves the import and export commands. It specifies parameters to identify and recall the saved job. This recalling or re-execution is used in incremental imports, which can import the updated rows from an RDBMS table to HDFS.

Syntax

The following is the syntax for creating a Sqoop job.

$ sqoop job (generic-args) (job-args)


[-- [subtool-name] (subtool-args)]

$ sqoop-job (generic-args) (job-args)


[-- [subtool-name] (subtool-args)]

Create Job (--create)

Here we are creating a job with the name myjob, which can import the table data from an RDBMS table to HDFS. The following command is used to create a job that imports data from the employee table in the db database to HDFS.

$ sqoop job --create myjob \


-- import \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee --m 1

Verify Job (--list)

‘--list’ argument is used to verify the saved jobs. The following command is used to verify
the list of saved Sqoop jobs.

$ sqoop job --list


It shows the list of saved jobs.
Available jobs:
myjob

Inspect Job (--show)

‘--show’ argument is used to inspect or verify particular jobs and their details. The following command and sample output are used to verify a job called myjob.
$ sqoop job --show myjob
It shows the tools and their options, which are used in myjob.
Job: myjob
Tool: import
Options:
----------------------------
direct.import = true
codegen.input.delimiters.record = 0
hdfs.append.dir = false
db.table = employee
...
incremental.last.value = 1206
...

Execute Job (--exec)

‘--exec’ option is used to execute a saved job. The following command is used to execute
a saved job called myjob.

$ sqoop job --exec myjob


It shows you the following output.
10/08/19 13:08:45 INFO tool.CodeGenTool: Beginning code generation
...

This chapter describes how to list the databases using Sqoop. The Sqoop list-databases tool parses and executes the ‘SHOW DATABASES’ query against the database server and then lists the databases present on the server.
Syntax

The following syntax is used for Sqoop list-databases command.

$ sqoop list-databases (generic-args) (list-databases-args)


$ sqoop-list-databases (generic-args) (list-databases-args)

Sample Query

The following command is used to list all the databases in the MySQL database server.

$ sqoop list-databases \
--connect jdbc:mysql://localhost/ \
--username root
If the command executes successfully, then it will display the list of databases in your
MySQL database server as follows.
...
13/05/31 16:45:58 INFO manager.MySQLManager: Preparing to use a MySQL
streaming resultset.

mysql
test
userdb
db
Q - What is the default value used by sqoop when it encounters a missing value
while importing from a CSV file?

A - NULL

B - null

C - space character

D - No values

Answer : B

Explanation

Unlike databases, there are no NULL values in CSV files. Missing values are handled by sqoop using the string null.

Q - Can sqoop use the TRUNCATE option in database while clearing data from a
table?

A - Yes

B - No

C - Depends on the database


D - Depends on the Hadoop configuration

Answer : C

Explanation

If available through the database driver, sqoop can clear the data quickly using the TRUNCATE option.

Q - If the table to which data is being exported has more columns than the data
present in the hdfs file then

A - The load definitely fails

B - The load can be done only for the relevant columns present in HDFS file

C - The load will populate values into the wrong columns

D - The load does not start

Answer : B

Explanation

The load can still be done by specifying the --columns parameter to populate a subset of columns in the relational table.
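As a sketch of what that looks like, reusing the employee export from the earlier chapter and assuming only a subset of columns is present in the HDFS files:

$ sqoop export \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee \
--columns "id,name,salary" \
--export-dir /emp/emp_data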

Q - While importing data into Hadoop using sqoop the SQL SELECT clause is used.
Similarly, while exporting data from Hadoop the SQL clause used is

A - APPEND

B - MERGE

C - UPDATE

D - INSERT

Answer : D

Explanation

The INSERT statements are generated by sqoop to insert data into the relational tables.
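For the employee table used in the earlier export chapter, the generated statements would look roughly like the following (illustrative only; the exact SQL depends on the table and connector):

INSERT INTO employee (id, name, deg, salary, dept)
VALUES (1201, 'gopal', 'manager', 50000, 'TP');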

Q - Which of the following is used by sqoop to establish a connection with enterprise data warehouses?

A - RDBMS driver

B - JDBC Driver

C - IDBC Driver

D - SQL Driver

Answer : B

Explanation

The JDBC driver is a Java program that has traditionally provided database connectivity to a variety of databases.
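When a specific JDBC driver class has to be named explicitly, Sqoop's generic --driver argument can be used; a sketch (the MySQL driver class is shown only as an illustration):

$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--driver com.mysql.jdbc.Driver \
--username root \
--table emp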

Q - The disadvantage of using a stored procedure to load data is

A - Data gets loaded faster

B - Parallel loads in the database table

C - The stored procedure cannot load multiple tables at a time

D - It induces heavy load on database

Answer : D

Explanation

As sqoop calls the stored procedure using parallel jobs, a heavy load is induced on the database.

Q - The data type mapping between the database column and sqoop column can be
overridden by using the parameter

A - --override-column-type

B - --map-column-type

C - --override-column-java

D - --map-column-java

Answer : D

Explanation

As sqoop uses Java data types internally, the mapping of the data types has to be done to Java data types.
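A sketch of overriding a single column's mapping during an import, using the emp table from the earlier chapters (mapping id to a Java String purely as an illustration):

$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp \
--map-column-java id=String \
--m 1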

Q - The --update-key parameter is used to

A - Update the primary key field present in the Hadoop data to be exported

B - Update the primary key field in the table to which data is already exported

C - Update the database connectivity parameters like username, password etc

D - Update the already exported rows based on a primary key field

Answer : D

Explanation

The --update-key parameter uses the given key column to update the entire record in the relational table.
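A sketch of an export in update mode against the employee table from the earlier export chapter, assuming id is the key column:

$ sqoop export \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee \
--update-key id \
--export-dir /emp/emp_data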

Q - The parameter which can be used in place of the --table parameter to insert data into a
table is

A - --call

B - --insert-into

C - --populate

D - --load-into

Answer : A

Explanation

The --call parameter will call a database stored procedure, which in turn can insert data into the table.
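A sketch of an export that hands each record to a stored procedure instead of a table; the procedure name insert_employee is hypothetical:

$ sqoop export \
--connect jdbc:mysql://localhost/db \
--username root \
--call insert_employee \
--export-dir /emp/emp_data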

Q - The comparison of row counts between the source system and the target
database while loading the data using sqoop is done using the parameter

A - --validate

B - --Rowcount

C - -row(count)

D - -allrows
Answer : A

Explanation

The --validate parameter is used to show the result of the row comparison between source and target.
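A sketch of enabling validation on the earlier import of the emp table:

$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp \
--validate \
--m 1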

Q - The --options-file parameter is used to

A - save the import log

B - specify the name of the data files to be created after import

C - store all the sqoop variables

D - store the parameters and their values in a file to be used by various sqoop commands.

Answer : D

Explanation

The command line options (the names and values of the parameters) that do not change from time to time can be saved into a file and used again and again. This is called an options file.
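As a sketch, the connection parameters used throughout these chapters could be kept in an options file (the file name import.txt is an assumption) and referenced on the command line; note that each option and each value go on their own line:

$ cat import.txt
import
--connect
jdbc:mysql://localhost/userdb
--username
root

$ sqoop --options-file import.txt --table emp --m 1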

Q - What happens if the sqoop generated export query is not accepted by the
database?

A - The export fails

B - The export succeeds partially.

C - The export does not start

D - Sqoop automatically modifies the query to succeed after receiving the failure response from the database.

Answer : A

Explanation

The export fails if the query is not accepted by the database.


Q - While inserting data into a relational system from Hadoop using sqoop, the
various table constraints present in the relational table must be

A - Disabled temporarily

B - Dropped and re-created

C - Renamed

D - Not violated

Answer : D

Explanation

We must verify that the data being exported does not violate the table constraints.

Q - Can the upsert feature of sqoop delete some data from the exported table?

A - Yes

B - No
C - Depends on database

D - Yes With some additional sqoop parameters

Answer : B

Explanation

Sqoop will never delete data as part of an upsert statement.

Q - The insert query used to insert exported data into tables is generated by

A - Sqoop command and processed as such

B - Sqoop command and modified suitably by the JDBC driver

C - JDBC driver

D - Database specific driver

Answer : A

Explanation

The insert query is generated only by the sqoop command and processed as such, without any further modification by any other driver.

Q - In both import and export scenarios, the role of ValidationThreshold is to determine if

A - the error margin between the source and target is within a range

B - the Sqoop command can handle the entire number of rows


C - the number of rows rejected by sqoop while reading the data

D - the number of rows rejected by the target database while loading the data

Answer : A

Explanation

The ValidationThreshold determines whether the error margin between the source and target is acceptable: Absolute, Percentage Tolerant, etc. The default implementation is AbsoluteValidationThreshold, which ensures the row counts from source and target are the same.
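A sketch of spelling out the default threshold class explicitly on the earlier emp import (the class name is the documented default implementation):

$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp \
--validate \
--validation-threshold org.apache.sqoop.validation.AbsoluteValidationThreshold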

Q - What are the two different incremental modes of importing data into sqoop?

A - merge and add

B - append and modified

C - merge and lastmodified

D - append and lastmodified

Answer : D

Explanation

The --incremental parameter is used to fetch only the new data (data which does not already exist in Hadoop). It is done as an append if there are columns specified to be checked for new data. It can also use the lastmodified mode, which uses the last_updated_date column from the existing table to identify the new rows.
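A sketch of the lastmodified mode, assuming the source table has a timestamp column named last_updated_date (the column name is taken from the explanation above and is not part of the earlier emp example; the target directory name is arbitrary):

$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp \
--incremental lastmodified \
--check-column last_updated_date \
--last-value "2014-12-22 00:00:00" \
--target-dir /emp_updates \
--m 1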

Q - For text-based columns, the parameter used for substituting null values is

A - --input-null-string

B - --input-null-non-string

C - --input-null-text

D - --input-null-varchar

Answer : A

Explanation

The --input-null-string parameter is used to substitute null values for text-based columns.
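A sketch of an export that maps the literal \N marker in the HDFS files back to SQL NULL (the \N convention is an assumption about how the files were written):

$ sqoop export \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee \
--export-dir /emp/emp_data \
--input-null-string '\\N' \
--input-null-non-string '\\N'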

Q - The free-form query import feature in sqoop allows importing data from

A - non-relational sources

B - a relational source without using a connector

C - a relational source using a SQL query

D - a relational source using custom Java classes

Answer : C

Explanation

With the free-form query feature we can write a SQL query, for example one involving a join between two tables, and pass it with the --query parameter while importing. It is used in place of the --table parameter.
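A sketch of a free-form import joining the emp and emp_add tables from the earlier example. Sqoop requires the $CONDITIONS token in the WHERE clause, and a target directory must be given (the directory name below is arbitrary):

$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--query 'SELECT e.id, e.name, a.city FROM emp e JOIN emp_add a ON (e.id = a.id) WHERE $CONDITIONS' \
--target-dir /joinresult \
--m 1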

Q - The sqoop export/import jobs can be stored and used again and again by using

A - sqoop- jobname

B - sqoop-save-job

C - sqoop-all-jobs

D - sqoop-job

Answer : D

Explanation

Running a sqoop job using the sqoop-job statement saves the job into the metastore, from which it can be retrieved later and used again and again.

Example −

$ sqoop-job --create jobname -- import \
--connect jdbc:mysql://example.com/db \
--table mytable
What is YARN
YARN (Yet Another Resource Negotiator) takes Hadoop beyond Java MapReduce programming and lets other applications such as HBase and Spark work on the cluster. Different YARN applications can co-exist on the same cluster, so MapReduce, HBase, and Spark can all run at the same time, bringing great benefits for manageability and cluster utilization.

Components Of YARN
•Client: For submitting MapReduce jobs.

•Resource Manager: Manages the use of resources across the cluster.

•Node Manager: Launches and monitors the compute containers on machines in the cluster.

•MapReduce Application Master: Coordinates the tasks running the MapReduce job. The application master and the MapReduce tasks run in containers that are scheduled by the resource manager and managed by the node managers.
JobTracker and TaskTracker were used in previous versions of Hadoop, and were responsible for handling resources and tracking progress. Hadoop 2.0 has the ResourceManager and NodeManager to overcome the shortfalls of JobTracker and TaskTracker.
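As a quick way to see these components on a running cluster, the YARN command-line client can list the registered node managers and the applications currently known to the resource manager (a sketch; output depends on the cluster):

$ yarn node -list
$ yarn application -list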

Benefits of YARN
•Scalability: MapReduce 1 hits a scalability bottleneck at 4,000 nodes and 40,000 tasks, but YARN is designed for 10,000 nodes and 100,000 tasks.

•Utilization: The Node Manager manages a pool of resources, rather than a fixed number of designated slots, thus increasing utilization.

•Multitenancy: Different versions of MapReduce can run on YARN, which makes the process of upgrading MapReduce more manageable.
