
ITD351 Big Data Management Essentials

Lab 2 Import Data from MySQL Using Sqoop

Requirement:
1. VMWare Player 15 or better. Or Oracle VM VirtualBox Manager version 6.1 or
better.
2. Cloudera CDH Virtual Machine
3. PC or laptop with at least 16 GB RAM, multicore CPU, 40 GB hard disk.

Resources:
1. https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html
2. https://www.w3schools.com/sql/

Objective:

1. You will use Sqoop to look at the table layout in MySQL.


2. You will also use Sqoop to import the table from MySQL to HDFS.

Tasks:

A. Launch VMWare Player (or Oracle VM VirtualBox Manager) and import the
relevant Hadoop VM

B. Explore MySQL database


For SQL commands, you may refer to https://www.w3schools.com/sql/
1. First explore the databases in MySQL. Log in to mysql:

$mysql -u root -p
Enter password:
mysql>show databases;
mysql>use retail_db;
mysql>show tables;
mysql>select * from customers limit 10;
mysql>select * from customers order by customer_id
desc limit 5;



The first field in each output row is the customer_id.
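If you also want to check a table's column layout from within MySQL before moving on to Sqoop (an optional extra step), you can describe it at the same prompt:

mysql>describe customers;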

C. Import mysql database into HDFS using Sqoop


You may refer to
https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_literal_sqoop_eval_literal for
the sqoop commands below.

1. Open a new terminal window if necessary.


2. Run the sqoop help command to familiarize yourself with the options in Sqoop:
$ sqoop help
(Note: You may ignore any warning messages)
3. List the tables in the retail_db database. (Replace dbhost with quickstart,
database_name with retail_db, dbuser with root, and pw with cloudera)
$ sqoop list-tables \
--connect jdbc:mysql://dbhost/database_name \
--username dbuser --password pw

[Note: Although we use the hostname 'quickstart' (the VM's own hostname) in the
connection string here, in real life the RDBMS you import data from will almost
certainly be running on some machine that is not part of your cluster (i.e. it will be
some server managed by your DBA).]

The eval tool allows users to quickly run simple SQL queries against a database;
results are printed to the console. This lets users preview their import queries
to ensure they import the data they expect. The eval tool is provided for evaluation
purposes only: you can use it to verify the database connection from within Sqoop
or to test simple queries. It is not meant to be used in production workflows.
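For example, a quick check along these lines (optional; it reuses the quickstart host and cloudera credentials used elsewhere in this lab) prints a few rows of the customers table to the console:

$ sqoop eval \
--connect jdbc:mysql://quickstart/retail_db \
--username root --password cloudera \
--query "SELECT * FROM customers LIMIT 3"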

4. Run the sqoop import command to see its options:


$ sqoop import --help
5. Use Sqoop to import just the customers table in the retail_db database and save
it in HDFS under /retail_db. (As before, replace dbhost with quickstart and
database_name with retail_db; also replace tablename with customers,
hadoop_dir_db_name with retail_db, and hadoop_dir_tablename with customers).
$ sqoop import \
--connect jdbc:mysql://dbhost/database_name \
--username root --password cloudera \
--table tablename \
--target-dir /hadoop_dir_db_name/hadoop_dir_tablename \
--null-non-string '\\N' \
--fields-terminated-by ','

To specify a password-file, use --password-file ${user.home}/.password
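A minimal sketch of preparing such a file (an assumption that ~/.password is the path you will pass to --password-file; echo -n avoids a trailing newline that could otherwise be picked up as part of the password):

$ echo -n "cloudera" > ~/.password
$ chmod 400 ~/.password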

[Note: The --null-non-string option tells Sqoop to represent null values as \N, which
makes the imported data compatible with Hive and Impala. The --fields-terminated-by ','
option creates comma-delimited output because we explicitly specify that; this is also the
default format. It is possible to import the data in other file formats besides delimited
text files (--as-textfile), such as binary files containing serialized records like Avro
(--as-avrodatafile), Parquet (--as-parquetfile) and SequenceFile (--as-sequencefile).]

Imports are performed using Hadoop MapReduce jobs. While the Sqoop job is running,
notice on the terminal screen the Hadoop MapReduce job created to execute the import.

By default, Sqoop will use four mappers and will split the work between them by taking the
minimum and maximum values of the primary key column and dividing the range equally
among the mappers. You can split on a different column using the --split-by option. You
can influence performance by changing the number of tasks with the -m option (e.g. to
double the mappers from 4 to 8, use -m 8).

Increasing the number of tasks might improve import speed. However, each task adds
load to your database server.
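For instance, a variant of the earlier import along these lines (a sketch only; the /retail_db/customers_8m target directory name is made up for illustration, since Sqoop will not import into a directory that already exists) would split on customer_id and use eight mappers:

$ sqoop import \
--connect jdbc:mysql://quickstart/retail_db \
--username root --password cloudera \
--table customers \
--target-dir /retail_db/customers_8m \
--split-by customer_id \
-m 8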

D. View the imported data


1. Sqoop imports the contents of the specified tables to HDFS. You can use the
hdfs command line to view the files and their contents. To list the contents of
the customers directory:
$ hdfs dfs -ls /retail_db/customers

(Note: Output of Hadoop processing jobs is saved as one or more numbered “partition”
files.)
2. Use the HDFS tail command to view the last part of the file for each of the
MapReduce partition files, e.g.:

$ hdfs dfs -tail /retail_db/customers/part-m-00000


$ hdfs dfs -tail /retail_db/customers/part-m-00001
$ hdfs dfs -tail /retail_db/customers/part-m-00002
$ hdfs dfs -tail /retail_db/customers/part-m-00003

The first field in each output row is the customer ID. Take note of the highest
customer ID because you will use it in the next step.
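If you prefer not to scan the tail output by eye, one way to pull out the largest customer ID (a convenience sketch, assuming the comma-delimited layout produced above with customer_id as the first field) is:

$ hdfs dfs -cat /retail_db/customers/part-m-0000* | cut -d, -f1 | sort -n | tail -1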

E. Import incremental updates to customers


As new customers are added to retail_db, the customer data in HDFS must be kept up to
date. You can use Sqoop to append these new records.

1. Add a new customer record to the retail_db MySQL database.


mysql>insert into customers
    (customer_id, customer_fname, customer_lname,
     customer_email, customer_password, customer_street,
     customer_city, customer_state, customer_zipcode)
    values ('12436', 'Bill', 'Gates', 'XXXXXXXX', 'XXXXXXXX',
     '1 Microsoft Road', 'Seattle', 'Washington', '12345');



(Note: You can also write and run a Python script to add the new customer record to
MySQL.)
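To confirm that the new row is now in MySQL before importing it (an optional check, mirroring the query you ran in section B), you can run:

mysql>select * from customers order by customer_id desc limit 1;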

2. Perform an incremental import to append the newly added customer to the
customers directory. Use Sqoop in incremental append mode, checking the
customer_id column against the largest customer_id noted earlier (the last value):

$ sqoop import \
--connect jdbc:mysql://quickstart/retail_db \
--username root --password cloudera \
--incremental append \
--null-non-string '\\N' \
--table customers \
--target-dir /retail_db/customers \
--check-column customer_id \
--last-value <largest_customer_ID>

(Note: replace <largest_customer_ID> with the largest customer_id noted earlier, i.e. 12435)

3. List the contents of the customers directory to verify the Sqoop import:
$ hdfs dfs -ls /retail_db/customers

4. You should see one new file. Use Hadoop’s cat command to view the entire
contents of the file.
$ hdfs dfs -cat /retail_db/customers/part-m-00004

F. Import a free-form query


1. You can also import the results of a query, rather than a single table. Supply a
complete SQL query using the --query option. Import the query results into the
HDFS directory /retail_db/customer_status.
$ sqoop import \
--connect jdbc:mysql://quickstart/retail_db \
--username root --password cloudera \
--target-dir /retail_db/customer_status \
--split-by customer_id \
--query 'SELECT customer_id, customer_fname, customer_lname, order_status FROM customers JOIN orders ON (customer_id = order_customer_id) WHERE $CONDITIONS'

The WHERE $CONDITIONS clause needs to be included explicitly, typed as shown. This is
where Sqoop itself will insert its range conditions for each task.

The above sqoop import gets 'customer_id', 'customer_fname' and 'customer_lname'
from the 'customers' table and 'order_status' from the 'orders' table after a join. The
JOIN combines the two tables, customers and orders, on the condition
customer_id = order_customer_id. 'customer_id' is the PRIMARY KEY of the 'customers'
table, and 'order_customer_id' in the 'orders' table is a FOREIGN KEY referencing that
primary key, so the two values must match for a row to appear in the result. The
retail_db tables and their relationships are shown in the diagram below.
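Once the import finishes, you can inspect the result in HDFS (an optional check following the same pattern as section D; the part-file name assumes the default mapper numbering):

$ hdfs dfs -ls /retail_db/customer_status
$ hdfs dfs -tail /retail_db/customer_status/part-m-00000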

G. Importing an entire database with Sqoop


1. To import the entire database with Sqoop, you use the 'import-all-tables'
tool. (As before, replace dbhost with quickstart, dbuser with root, and pw with cloudera.)
$ sqoop import-all-tables \
--connect jdbc:mysql://dbhost/retail_db \
--username dbuser --password pw \
--warehouse-dir /warehouse

There will be a subdirectory of /warehouse in HDFS for each table in the source
database. Note that we are showing the --warehouse-dir option, but there is also
a --target-dir option which is sometimes used when importing a single table.
The difference is that --warehouse-dir specifies the parent directory for the
import, while --target-dir specifies the directory itself. That is, if you import a
table 'customers' using a --warehouse-dir value of '/warehouse' then Sqoop
will create '/warehouse/customers' and import the data there. If you had used
--target-dir with a value of '/warehouse' then no subdirectory would be created (i.e.
your data would be directly inside the '/warehouse' directory).
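After the import completes, a quick listing (assuming the /warehouse path used above) shows one subdirectory per table:

$ hdfs dfs -ls /warehouse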

H. Exporting data from HDFS to RDBMS with Sqoop


While Sqoop is most often used to import data into Hadoop, it can also export data to
a relational database.



For example, you might process data in Hadoop as a nightly job and then push the results
(typically much smaller than the input data set; it certainly should not be terabytes of
data!) back to an RDBMS where they can be accessed by other systems that don't
interact with Hadoop.

The export command assumes that the data is already in the default (comma-
delimited) format and that the columns in the files being exported are in the same
order in which they appear in the table (the table must exist before the export). It
is possible to export data in other formats (such as Avro) back to the relational
database.
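Because the target table must exist before the export, you would first create it in MySQL. The sketch below is only an illustration: the customer_status table name matches the export command that follows, but the column types are assumptions chosen to fit the imported query results, not something specified in this lab.

$ mysql -u root -p retail_db -e "
CREATE TABLE customer_status (
  customer_id    INT,
  customer_fname VARCHAR(45),
  customer_lname VARCHAR(45),
  order_status   VARCHAR(50)
);"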

To export data from HDFS to RDBMS:


$ sqoop export \
--connect jdbc:mysql://dbhost/database_name \
--username dbuser --password pw \
--export-dir /retail_db/customer_status \
--update-mode allowinsert \
--table customer_status

[Note: The --update-mode parameter shown here allows "upsert" mode, in
which records in the target table will be updated if they already exist or inserted if
they do not. The other allowed argument for this option is updateonly, which
only updates existing records (i.e. it will not add any new records).]
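After the export completes, you could confirm that rows arrived in MySQL (an optional check; the query below simply counts rows in the customer_status table sketched earlier):

$ mysql -u root -p retail_db -e "SELECT COUNT(*) FROM customer_status;"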

To shut down the VM, click Player -> Power -> Shutdown Guest.

Concluding Remarks
Sqoop2 has a client-server design that improves the security, accessibility and resource
manageability of Sqoop. The client only needs to connect to the Sqoop server, as database
connections are configured on the server by a system administrator. The Sqoop server is
accessible via CLI, REST API, and Web UI. End users no longer need to possess database
credentials. Furthermore, Sqoop2 has a centralized audit trail feature for added security.

I. Summary
1. In this lab, you have learned how to import a SQL database into HDFS using
Sqoop.
2. Sqoop exchanges data between a database and the Hadoop cluster.
3. Tables are imported using MapReduce jobs.
4. Sqoop provides many options to control imports.

Nearly every database server in common production use supports JDBC, so Sqoop is a nearly
universal tool. Using JDBC can be slow, however, since it is a generic approach. Direct mode
uses database-specific utilities (like mysqldump) to do faster imports, but not all databases
are supported. Even where a database is supported with direct mode, not all Sqoop features
are supported. For example, you cannot import a table in Avro or SequenceFile format when
using direct mode, because mysqldump does not support those formats.

This is the end of the lab
