
The traditional application management system, that is, the interaction of applications with relational databases through an RDBMS, is one of the sources that generate Big Data. Such Big Data is stored in relational database servers in the relational database structure.
When Big Data stores and analyzers of the Hadoop ecosystem such as MapReduce, Hive, HBase, Cassandra, and Pig came into the picture, they required a tool to interact with relational database servers for importing and exporting the Big Data residing in them. Sqoop occupies this place in the Hadoop ecosystem, providing interaction between relational database servers and Hadoop's HDFS.
Sqoop − “SQL to Hadoop and Hadoop to SQL”
Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL and Oracle into Hadoop HDFS, and to export data from the Hadoop file system back to relational databases. It is provided by the Apache Software Foundation.

How Does Sqoop Work?

Sqoop's workflow consists of two operations, import and export, described below.

Sqoop Import

The import tool imports individual tables from RDBMS to HDFS. Each row in a table is
treated as a record in HDFS. All records are stored as text data in text files or as binary
data in Avro and Sequence files.
Sqoop Export

The export tool exports a set of files from HDFS back to an RDBMS. The files given as input to Sqoop contain records, which are called rows in the table. These files are read and parsed into a set of records and delimited with a user-specified delimiter.

This chapter describes how to import data from a MySQL database to Hadoop HDFS. The import tool imports individual tables from an RDBMS to HDFS. Each row in a table is treated as a record in HDFS. All records are stored as text data in text files or as binary data in Avro and Sequence files.

Syntax

The following syntax is used to import data into HDFS.

$ sqoop import (generic-args) (import-args)


$ sqoop-import (generic-args) (import-args)

Example

Let us take an example of three tables named emp, emp_add, and emp_contact, which are in a database called userdb on a MySQL database server.
The three tables and their data are as follows.

emp:

id     name       deg            salary   dept
1201   gopal      manager        50,000   TP
1202   manisha    Proof reader   50,000   TP
1203   khalil     php dev        30,000   AC
1204   prasanth   php dev        30,000   AC
1205   kranthi    admin          20,000   TP

emp_add:

id     hno    street     city
1201   288A   vgiri      jublee
1202   108I   aoc        sec-bad
1203   144Z   pgutta     hyd
1204   78B    old city   sec-bad
1205   720X   hitec      sec-bad

emp_contact:

id     phno      email
1201   2356742   gopal@tp.com
1202   1661663   manisha@tp.com
1203   8887776   khalil@ac.com
1204   9988774   prasanth@ac.com
1205   1231231   kranthi@tp.com
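If you want to recreate this setup, the userdb database and the emp table can be defined from the MySQL command line roughly as follows (a minimal sketch; the column sizes are assumptions, and emp_add and emp_contact follow the same pattern):

mysql> CREATE DATABASE userdb;
mysql> USE userdb;
mysql> CREATE TABLE emp (
    id INT NOT NULL PRIMARY KEY,
    name VARCHAR(20),
    deg VARCHAR(20),
    salary INT,
    dept VARCHAR(10));
mysql> INSERT INTO emp VALUES (1201, 'gopal', 'manager', 50000, 'TP');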

Importing a Table

The Sqoop ‘import’ tool is used to import table data from an RDBMS table into the Hadoop file system as a text file or a binary file.
The following command is used to import the emp table from the MySQL database server to HDFS.

$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp --m 1
If it is executed successfully, then you get the following output.
14/12/22 15:24:54 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5
14/12/22 15:24:56 INFO manager.MySQLManager: Preparing to use a MySQL
streaming resultset.
14/12/22 15:24:56 INFO tool.CodeGenTool: Beginning code generation
14/12/22 15:24:58 INFO manager.SqlManager: Executing SQL statement:
SELECT t.* FROM `emp` AS t LIMIT 1
14/12/22 15:24:58 INFO manager.SqlManager: Executing SQL statement:
SELECT t.* FROM `emp` AS t LIMIT 1
14/12/22 15:24:58 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is
/usr/local/hadoop
14/12/22 15:25:11 INFO orm.CompilationManager: Writing jar file:
/tmp/sqoop-hadoop/compile/cebe706d23ebb1fd99c1f063ad51ebd7/emp.jar
-----------------------------------------------------
-----------------------------------------------------
14/12/22 15:25:40 INFO mapreduce.Job: The url to track the job:
http://localhost:8088/proxy/application_1419242001831_0001/
14/12/22 15:26:45 INFO mapreduce.Job: Job job_1419242001831_0001
running in uber mode :
false
14/12/22 15:26:45 INFO mapreduce.Job: map 0% reduce 0%
14/12/22 15:28:08 INFO mapreduce.Job: map 100% reduce 0%
14/12/22 15:28:16 INFO mapreduce.Job: Job job_1419242001831_0001
completed successfully
-----------------------------------------------------
-----------------------------------------------------
14/12/22 15:28:17 INFO mapreduce.ImportJobBase: Transferred 145 bytes
in 177.5849 seconds
(0.8165 bytes/sec)
14/12/22 15:28:17 INFO mapreduce.ImportJobBase: Retrieved 5 records.

To verify the imported data in HDFS, use the following command.

$ $HADOOP_HOME/bin/hadoop fs -cat /emp/part-m-*


It shows you the emp table data with comma (,) separated fields.
1201, gopal, manager, 50000, TP
1202, manisha, preader, 50000, TP
1203, kalil, php dev, 30000, AC
1204, prasanth, php dev, 30000, AC
1205, kranthi, admin, 20000, TP
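The comma is simply Sqoop's default text delimiter; if a different separator is needed, it can be overridden with the output formatting arguments, for example (a sketch based on the earlier import command):

$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp \
--fields-terminated-by '\t' \
--m 1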

Importing into Target Directory

We can specify the target directory while importing table data into HDFS using the Sqoop
import tool.
The following is the syntax to specify the target directory as an option to the Sqoop import command.

--target-dir <new or existing directory in HDFS>


The following command is used to import emp_add table data into ‘/queryresult’ directory.

$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp_add \
--m 1 \
--target-dir /queryresult
The following command is used to verify the imported data in the /queryresult directory from the emp_add table.
$ $HADOOP_HOME/bin/hadoop fs -cat /queryresult/part-m-*
It will show you the emp_add table data with comma (,) separated fields.
1201, 288A, vgiri, jublee
1202, 108I, aoc, sec-bad
1203, 144Z, pgutta, hyd
1204, 78B, oldcity, sec-bad
1205, 720C, hitech, sec-bad

Import Subset of Table Data

We can import a subset of a table using the ‘where’ clause in the Sqoop import tool. It executes the corresponding SQL query on the respective database server and stores the result in a target directory in HDFS.
The syntax for the where clause is as follows.

--where <condition>
The following command is used to import a subset of the emp_add table data. The subset query retrieves the employee id and address of employees who live in Secunderabad (sec-bad) city.

$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp_add \
--m 1 \
--where "city ='sec-bad'" \
--target-dir /wherequery
The following command is used to verify the imported data in /wherequery directory from
the emp_add table.

$ $HADOOP_HOME/bin/hadoop fs -cat /wherequery/part-m-*


It will show you the emp_add table data with comma (,) separated fields.
1202, 108I, aoc, sec-bad
1204, 78B, oldcity, sec-bad
1205, 720C, hitech, sec-bad

Incremental Import

Incremental import is a technique that imports only the newly added rows in a table. It is
required to add ‘incremental’, ‘check-column’, and ‘last-value’ options to perform the
incremental import.
The following syntax is used for the incremental option in Sqoop import command.

--incremental <mode>
--check-column <column name>
--last-value <last check column value>
Let us assume that the newly added data in the emp table is as follows −
1206, satish p, grp des, 20000, GR

The following command is used to perform the incremental import in the emp table.

$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp \
--m 1 \
--incremental append \
--check-column id \
--last-value 1205
The following command is used to verify the data imported from the emp table into the HDFS emp/ directory.

$ $HADOOP_HOME/bin/hadoop fs -cat /emp/part-m-*


It shows you the emp table data with comma (,) separated fields.
1201, gopal, manager, 50000, TP
1202, manisha, preader, 50000, TP
1203, kalil, php dev, 30000, AC
1204, prasanth, php dev, 30000, AC
1205, kranthi, admin, 20000, TP
1206, satish p, grp des, 20000, GR

The following command is used to see the modified or newly added rows from
the emp table.

$ $HADOOP_HOME/bin/hadoop fs -cat /emp/part-m-*1


It shows you the newly added rows to the emp table with comma (,) separated fields.
1206, satish p, grp des, 20000, GR
This chapter describes how to import all the tables from the RDBMS database server to HDFS. Each table's data is stored in a separate directory, and the directory name is the same as the table name.

Syntax

The following syntax is used to import all tables.

$ sqoop import-all-tables (generic-args) (import-args)


$ sqoop-import-all-tables (generic-args) (import-args)

Example

Let us take an example of importing all tables from the userdb database. The list of tables
that the database userdb contains is as follows.
+--------------------+
| Tables |
+--------------------+
| emp |
| emp_add |
| emp_contact |
+--------------------+

The following command is used to import all the tables from the userdb database.

$ sqoop import-all-tables \
--connect jdbc:mysql://localhost/userdb \
--username root
Note − If you are using import-all-tables, every table in that database must have a primary key field.
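If some tables should be skipped during a bulk import (for example, a table without a primary key), import-all-tables accepts an --exclude-tables option; a sketch:

$ sqoop import-all-tables \
--connect jdbc:mysql://localhost/userdb \
--username root \
--exclude-tables emp_contact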
The following command is used to verify all the table data from the userdb database in HDFS.

$ $HADOOP_HOME/bin/hadoop fs -ls
It will show you the list of table names in the userdb database as directories.

Output

drwxr-xr-x - hadoop supergroup 0 2014-12-22 22:50 _sqoop


drwxr-xr-x - hadoop supergroup 0 2014-12-23 01:46 emp
drwxr-xr-x - hadoop supergroup 0 2014-12-23 01:50 emp_add
drwxr-xr-x - hadoop supergroup 0 2014-12-23 01:52 emp_contact
This chapter describes how to export data back from HDFS to the RDBMS database. The target table must exist in the target database. The files given as input to Sqoop contain records, which are called rows in the table. These are read and parsed into a set of records and delimited with a user-specified delimiter.
The default operation is to insert all the records from the input files into the database table using the INSERT statement. In update mode, Sqoop generates UPDATE statements that replace the existing records in the database.

Syntax

The following is the syntax for the export command.

$ sqoop export (generic-args) (export-args)


$ sqoop-export (generic-args) (export-args)

Example

Let us take an example of employee data in a file in HDFS. The employee data is available in the emp_data file in the ‘emp/’ directory in HDFS. The emp_data file is as follows.
1201, gopal, manager, 50000, TP
1202, manisha, preader, 50000, TP
1203, kalil, php dev, 30000, AC
1204, prasanth, php dev, 30000, AC
1205, kranthi, admin, 20000, TP
1206, satish p, grp des, 20000, GR

It is mandatory that the table to be populated by the export is created manually and is present in the database into which the data has to be exported.
The following query is used to create the table ‘employee’ from the MySQL command line.

$ mysql
mysql> USE db;
mysql> CREATE TABLE employee (
id INT NOT NULL PRIMARY KEY,
name VARCHAR(20),
deg VARCHAR(20),
salary INT,
dept VARCHAR(10));

The following command is used to export the table data (which is in the emp_data file on HDFS) to the employee table in the db database of the MySQL database server.
$ sqoop export \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee \
--export-dir /emp/emp_data
The following command is used to verify the table from the MySQL command line.

mysql> SELECT * FROM employee;


If the given data is stored successfully, then you can find the following table of given
employee data.
+------+----------+-------------+--------+------+
| Id   | Name     | Designation | Salary | Dept |
+------+----------+-------------+--------+------+
| 1201 | gopal    | manager     | 50000  | TP   |
| 1202 | manisha  | preader     | 50000  | TP   |
| 1203 | kalil    | php dev     | 30000  | AC   |
| 1204 | prasanth | php dev     | 30000  | AC   |
| 1205 | kranthi  | admin       | 20000  | TP   |
| 1206 | satish p | grp des     | 20000  | GR   |
+------+----------+-------------+--------+------+

This chapter describes how to create and maintain Sqoop jobs. A Sqoop job creates and saves the import and export commands. It specifies parameters to identify and recall the saved job. This recalling or re-execution is used in incremental imports, which can import the updated rows from an RDBMS table to HDFS.

Syntax

The following is the syntax for creating a Sqoop job.

$ sqoop job (generic-args) (job-args)


[-- [subtool-name] (subtool-args)]

$ sqoop-job (generic-args) (job-args)


[-- [subtool-name] (subtool-args)]

Create Job (--create)

Here we are creating a job with the name myjob, which can import the table data from an RDBMS table to HDFS. The following command is used to create a job that imports data from the employee table in the db database to HDFS.

$ sqoop job --create myjob \


-- import \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee --m 1

Verify Job (--list)

‘--list’ argument is used to verify the saved jobs. The following command is used to verify
the list of saved Sqoop jobs.

$ sqoop job --list


It shows the list of saved jobs.
Available jobs:
myjob

Inspect Job (--show)

‘--show’ argument is used to inspect or verify particular jobs and their details. The following command and sample output are used to verify a job called myjob.
$ sqoop job --show myjob
It shows the tools and their options, which are used in myjob.
Job: myjob
Tool: import
Options:
----------------------------
direct.import = true
codegen.input.delimiters.record = 0
hdfs.append.dir = false
db.table = employee
...
incremental.last.value = 1206
...

Execute Job (--exec)

‘--exec’ option is used to execute a saved job. The following command is used to execute
a saved job called myjob.

$ sqoop job --exec myjob


It shows you the following output.
10/08/19 13:08:45 INFO tool.CodeGenTool: Beginning code generation
...

This chapter describes how to list the databases using Sqoop. The Sqoop list-databases tool parses and executes the ‘SHOW DATABASES’ query against the database server and then lists the databases present on the server.
Syntax

The following syntax is used for Sqoop list-databases command.

$ sqoop list-databases (generic-args) (list-databases-args)


$ sqoop-list-databases (generic-args) (list-databases-args)

Sample Query

The following command is used to list all the databases in the MySQL database server.

$ sqoop list-databases \
--connect jdbc:mysql://localhost/ \
--username root
If the command executes successfully, then it will display the list of databases in your
MySQL database server as follows.
...
13/05/31 16:45:58 INFO manager.MySQLManager: Preparing to use a MySQL
streaming resultset.

mysql
test
userdb
db
Q - What is the default value used by sqoop when it encounters a missing value
while importing from a CSV file?

A - NULL

B - null

C - space character

D - No values

Answer : B

Explanation

Unlike databases, there are no NULL values in CSV files. Missing values are handled by sqoop using the string null.

Q - Can sqoop use the TRUNCATE option in database while clearing data from a
table?

A - Yes

B - No

C - Depends on the database


D - Depends on the Hadoop configuration

Answer : C

Explanation

If available through the database driver, sqoop can clear the data quickly using the TRUNCATE option.

Q - If the table to which data is being exported has more columns than the data
present in the hdfs file then

A - The load definitely fails

B - The load can be done only for the relevant columns present in HDFS file

C - The load will populate values into the wrong columns

D - The load does not start

Answer : B

Explanation

The load can still be done by specifying the --columns parameter to populate a subset of columns in the relational table.
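As a sketch of what that looks like, reusing the employee export from the earlier chapter and assuming only a subset of columns is present in the HDFS files:

$ sqoop export \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee \
--columns "id,name,salary" \
--export-dir /emp/emp_data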

Q - While importing data into Hadoop using sqoop the SQL SELECT clause is used.
Similarly, while exporting data from Hadoop the SQL clause used is

A - APPEND

B - MERGE

C - UPDATE

D - INSERT

Answer : D

Explanation

The INSERT statements are generated by sqoop to insert data into the relational tables.
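For the employee table used in the earlier export chapter, the generated statements would look roughly like the following (illustrative only; the exact SQL depends on the table and connector):

INSERT INTO employee (id, name, deg, salary, dept)
VALUES (1201, 'gopal', 'manager', 50000, 'TP');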

Q - Which of the following is used by sqoop to establish a connection with enterprise data warehouses?

A - RDBMS driver

B - JDBC Driver

C - IDBC Driver

D - SQL Driver

Answer : B

Explanation

The JDBC driver is a Java program that has traditionally provided database connectivity to a variety of databases.
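When a specific JDBC driver class has to be named explicitly, Sqoop's generic --driver argument can be used; a sketch (the MySQL driver class is shown only as an illustration):

$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--driver com.mysql.jdbc.Driver \
--username root \
--table emp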

Q - The disadvantage of using a stored procedure to load data is

A - Data gets loaded faster

B - Parallel loads in the database table

C - The stored procedure cannot load multiple tables at a time

D - It induces heavy load on database

Answer : D

Explanation

As sqoop calls the stored procedure using parallel jobs, a heavy load is induced on the database.

Q - The data type mapping between the database column and sqoop column can be
overridden by using the parameter

A - --override-column-type

B - --map-column-type

C - --override-column-java

D - --map-column-java

Answer : D

Explanation

As sqoop uses Java data types internally, the mapping of the data types has to be done to Java data types.
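A sketch of overriding a single column's mapping during an import, using the emp table from the earlier chapters (mapping id to a Java String purely as an illustration):

$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp \
--map-column-java id=String \
--m 1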

Q - The --update-key parameter is used to

A - Update the primary key field present in the Hadoop data to be exported

B - Update the primary key field in the table to which data is already exported

C - Update the database connectivity parameters like username, password etc

D - Update the already exported rows based on a primary key field

Answer : D

Explanation

The --update-key parameter uses the given key column to update the entire record in the relational table.
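A sketch of an export in update mode against the employee table from the earlier export chapter, assuming id is the key column:

$ sqoop export \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee \
--update-key id \
--export-dir /emp/emp_data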

Q - The parameter which can be used in place of the --table parameter to insert data into a
table is

A - --call

B - --insert-into

C - --populate

D - --load-into

Answer : A

Explanation

The --call parameter will call a database stored procedure, which in turn can insert data into the table.
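A sketch of an export that hands each record to a stored procedure instead of a table; the procedure name insert_employee is hypothetical:

$ sqoop export \
--connect jdbc:mysql://localhost/db \
--username root \
--call insert_employee \
--export-dir /emp/emp_data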

Q - The comparison of row counts between the source system and the target
database while loading the data using sqoop is done using the parameter

A - --validate

B - --Rowcount

C - -row(count)

D - -allrows
Answer : A

Explanation

The --validate parameter is used to show the result of the row comparison between source and target.
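A sketch of enabling validation on the earlier import of the emp table:

$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp \
--validate \
--m 1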

Q - The --options-file parameter is used to

A - save the import log

B - specify the name of the data files to be created after import

C - store all the sqoop variables

D - store the parameters and their values in a file to be used by various sqoop commands.

Answer : D

Explanation

The command line options (the names and values of the parameters) that do not change from time to time can be saved into a file and used again and again. This is called an options file.
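As a sketch, the connection parameters used throughout these chapters could be kept in an options file (the file name import.txt is an assumption) and referenced on the command line; note that each option and each value go on their own line:

$ cat import.txt
import
--connect
jdbc:mysql://localhost/userdb
--username
root

$ sqoop --options-file import.txt --table emp --m 1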

Q - What happens if the sqoop generated export query is not accepted by the
database?

A - The export fails

B - The export succeeds partially.

C - The export does not start

D - Sqoop automatically modifies the query to succeed after receiving the failure response from the database.

Answer : A

Explanation

The export fails if the query is not accepted by the database.


Q - While inserting data into a relational system from Hadoop using sqoop, the
various table constraints present in the relational table must be

A - Disabled temporarily

B - Dropped and re-created

C - Renamed

D - Not violated

Answer : D

Explanation

We must verify that the data being exported does not violate the table constraints.

Q - Can the upsert feature of sqoop delete some data from the exported table?

A - Yes

B - No
C - Depends on database

D - Yes With some additional sqoop parameters

Answer : B

Explanation

Sqoop will never delete data as part of an upsert statement.

Q - The insert query used to insert exported data into tables is generated by

A - Sqoop command and processed as such

B - Sqoop command and modified suitably by the JDBC driver

C - JDBC driver

D - Database specific driver

Answer : A

Explanation

The insert query is generated only by the sqoop command and processed as such, without any further modification by any other driver.

Q - In both import and export scenarios, the role of ValidationThreshold is to determine if

A - the error margin between the source and target is within a range

B - the Sqoop command can handle the entire number of rows


C - the number of rows rejected by sqoop while reading the data

D - the number of rows rejected by the target database while loading the data

Answer : A

Explanation

The ValidationThreshold determines whether the error margin between the source and target is acceptable: Absolute, Percentage Tolerant, etc. The default implementation is AbsoluteValidationThreshold, which ensures the row counts from source and target are the same.
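A sketch of spelling out the default threshold class explicitly on the earlier emp import (the class name is the documented default implementation):

$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp \
--validate \
--validation-threshold org.apache.sqoop.validation.AbsoluteValidationThreshold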

Q - What are the two different incremental modes of importing data into sqoop?

A - merge and add

B - append and modified

C - merge and lastmodified

D - append and lastmodified

Answer : D

Explanation

The --incremental parameter is used to fetch only the new data (data which does not already exist in Hadoop). It is done as an append if there are columns specified to be checked for new data. It can also use the lastmodified mode, which uses the last_updated_date column from the existing table to identify the new rows.
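A sketch of the lastmodified mode, assuming the source table has a timestamp column named last_updated_date (the column name is taken from the explanation above and is not part of the earlier emp example; the target directory name is arbitrary):

$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp \
--incremental lastmodified \
--check-column last_updated_date \
--last-value "2014-12-22 00:00:00" \
--target-dir /emp_updates \
--m 1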

Q - For text-based columns, the parameter used for substituting null values is

A - --input-null-string

B - --input-null-non-string

C - --input-null-text

D - --input-null-varchar

Answer : A

Explanation

The --input-null-string parameter is used to substitute null values for text-based columns.
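A sketch of an export that maps the literal \N marker in the HDFS files back to SQL NULL (the \N convention is an assumption about how the files were written):

$ sqoop export \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee \
--export-dir /emp/emp_data \
--input-null-string '\\N' \
--input-null-non-string '\\N'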

Q - The free-form query import feature in sqoop allows importing data from

A - non-relational sources

B - a relational source without using a connector

C - a relational source using a SQL query

D - a relational source using custom Java classes

Answer : C

Explanation

With the free-form query feature we can write a SQL query, for example one involving a join between two tables, and pass it with the --query parameter while importing. It is used in place of the --table parameter.
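A sketch of a free-form import joining the emp and emp_add tables from the earlier example. Sqoop requires the $CONDITIONS token in the WHERE clause, and a target directory must be given (the directory name below is arbitrary):

$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--query 'SELECT e.id, e.name, a.city FROM emp e JOIN emp_add a ON (e.id = a.id) WHERE $CONDITIONS' \
--target-dir /joinresult \
--m 1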

Q - The sqoop export/import jobs can be stored and used again and again by using

A - sqoop- jobname

B - sqoop-save-job

C - sqoop-all-jobs

D - sqoop-job

Answer : D

Explanation

Running a sqoop job using the sqoop-job statement saves the job into the metastore, from which it can be retrieved later and used again and again.

Example −

$ sqoop-job --create jobname -- import \
--connect jdbc:mysql://example.com/db \
--table mytable
What is YARN
YARN (Yet Another Resource Negotiator) takes Hadoop beyond Java MapReduce programming and lets other applications such as HBase and Spark work on the cluster. Different YARN applications can co-exist on the same cluster, so MapReduce, HBase, and Spark can all run at the same time, bringing great benefits for manageability and cluster utilization.

Components Of YARN
•Client: For submitting MapReduce jobs.

•Resource Manager: Manages the use of resources across the cluster.

•Node Manager: Launches and monitors the compute containers on machines in the cluster.

•MapReduce Application Master: Coordinates the tasks running the MapReduce job. The application master and the MapReduce tasks run in containers that are scheduled by the resource manager and managed by the node managers.
JobTracker and TaskTracker were used in previous versions of Hadoop, and were responsible for handling resources and tracking progress. Hadoop 2.0 has the ResourceManager and NodeManager to overcome the shortfalls of JobTracker and TaskTracker.
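As a quick way to see these components on a running cluster, the YARN command-line client can list the registered node managers and the applications currently known to the resource manager (a sketch; output depends on the cluster):

$ yarn node -list
$ yarn application -list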

Benefits of YARN
•Scalability: MapReduce 1 hits a scalability bottleneck at 4,000 nodes and 40,000 tasks, but YARN is designed for 10,000 nodes and 100,000 tasks.

•Utilization: The Node Manager manages a pool of resources, rather than a fixed number of designated slots, thus increasing utilization.

•Multitenancy: Different versions of MapReduce can run on YARN, which makes the process of upgrading MapReduce more manageable.
