Lab 2: Import Data from MySQL with Sqoop
Requirement:
1. VMware Player 15 or later, or Oracle VM VirtualBox Manager version 6.1 or
later.
2. Cloudera CDH Virtual Machine
3. PC or laptop with at least 16 GB RAM, multicore CPU, 40 GB hard disk.
Resources:
1. https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html
2. https://www.w3schools.com/sql/
Objective:
Use Apache Sqoop to move data between a MySQL database and HDFS, including a full table import, an incremental (append) import, and an export back to the database.
Tasks:
A. Launch VMware Player (or Oracle VM VirtualBox Manager), import the
relevant Hadoop VM, and start it. Then open a terminal in the VM and log in to
MySQL to explore the retail_db source database:
$ mysql -u root -p
Enter password:
mysql> show databases;
mysql> use retail_db;
mysql> show tables;
mysql> select * from customers limit 10;
mysql> select * from customers order by customer_id desc limit 5;
[Note: Although we use the hostname 'localhost' for the connection string, in
real life the RDBMS you import data from will almost certainly be running on
some machine that is not part of your cluster (i.e. it will be some server managed
by your DBA).]
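In that case only the connection string changes. As an illustration (the hostname, port, and account below are placeholders, not part of this lab), you could point Sqoop at a remote server like this:
# -P prompts for the database password instead of putting it on the command line
$ sqoop list-tables \
--connect jdbc:mysql://dbhost.example.com:3306/retail_db \
--username retail_user -P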
The eval tool allows users to quickly run simple SQL queries against a database;
results are printed to the console. This lets users preview their import queries
to ensure they return the data they expect. The eval tool is provided for evaluation
purposes only: you can use it to verify the database connection from within Sqoop
or to test simple queries, but it is not intended for production workflows.
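For example, a quick preview like the following (reusing the quickstart connection string from the import commands in this lab) prints the first few customer rows to the console without importing anything:
# Preview rows with sqoop eval; nothing is written to HDFS
$ sqoop eval \
--connect jdbc:mysql://quickstart/retail_db \
--username root --password cloudera \
--query "SELECT * FROM customers LIMIT 10"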
[Note: The --null-non-string option tells Sqoop to represent null values as \N, which
makes the imported data compatible with Hive and Impala. The --fields-terminated-by ','
option creates comma-delimited output because we explicitly request it; this is also the default
format. It is possible to store the imported data in file formats other than delimited text
(for example, with the --as-avrodatafile or --as-sequencefile options).]
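For reference, here is a sketch of an import that combines these options (the connection details match the other commands in this lab; the target directory is only an example):
$ sqoop import \
--connect jdbc:mysql://quickstart/retail_db \
--username root --password cloudera \
--table customers \
--target-dir /retail_db/customers \
--null-non-string '\\N' \
--fields-terminated-by ','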
Imports are performed using Hadoop MapReduce jobs. While the Sqoop job is running,
notice on the terminal screen the Hadoop MapReduce job created to execute it.
By default, Sqoop will use four mappers and will split the work between them by taking the
minimum and maximum values of the primary key column and dividing the range equally
among the mappers. You can split on a different column using the --split-by option, and you
can influence performance by changing the number of tasks with the -m option (e.g.
to double the mappers from 4 to 8, use -m 8).
Increasing the number of tasks might improve import speed. However, each task adds
load to your database server.
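As a sketch of these options together (the split column and target directory here are illustrative, not required by the lab steps), you might run:
$ sqoop import \
--connect jdbc:mysql://quickstart/retail_db \
--username root --password cloudera \
--table customers \
--target-dir /retail_db/customers_parallel \
--split-by customer_id \
-m 8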
(Note: Output of Hadoop processing jobs is saved as one or more numbered “partition”
files.)
2. Use the HDFS tail command to view the last part of the file for each of the
MapReduce partition files, e.g.:
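(The partition file names below assume the /retail_db/customers target directory used elsewhere in this lab; your file names may differ.)
$ hdfs dfs -tail /retail_db/customers/part-m-00000
$ hdfs dfs -tail /retail_db/customers/part-m-00001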
The first four digits in the output are the customer ID. Take note of the highest
customer ID because you will use it in the next step.
2. Perform an incremental import and append the newly added customers to the
customers directory. Use Sqoop to import only the rows whose customer_id is
greater than the last value (the largest customer_id you noted above):
$ sqoop import \
--connect jdbc:mysql://quickstart/retail_db \
--username root --password cloudera \
--incremental append \
--null-non-string '\\N' \
--table customers \
--target-dir /retail_db/customers \
--check-column customer_id \
--last-value <largest_customer_ID>
3. List the contents of the customers directory to verify the Sqoop import:
$ hdfs dfs -ls /retail_db/customers
4. You should see one new file. Use Hadoop’s cat command to view the entire
contents of the file.
$ hdfs dfs -cat /retail_db/customers/part-m-00004
There will be a subdirectory of /retail_db in HDFS for each table in the target
database. Note that we are showing the --warehouse-dir option, but there is also
a --target-dir option which is sometimes used when importing a single table.
The difference is that --warehouse-dir specifies the parent directory for the
import, while --target-dir specifies the directory itself. That is, if you import a
table 'customers' using a --warehouse-dir value of '/warehouse' then Sqoop
will create '/warehouse/customers' and import the data there. If you had used
--target-dir with a value of '/warehouse' then no subdirectory is created (i.e.
your data will be directly inside the '/warehouse' directory).
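As an illustrative sketch (the /warehouse directory is only an example), importing every table of retail_db under a single parent directory could look like:
$ sqoop import-all-tables \
--connect jdbc:mysql://quickstart/retail_db \
--username root --password cloudera \
--warehouse-dir /warehouse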
The export command assumes that the data is already in the default (comma-
delimited) format and that the columns in the files being exported are in the same
order as they appear in the table (the table must exist before the export). It
is possible to export data in other formats (such as Avro) back to the relational
database.
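A sketch of such an export (the target table customers_export is hypothetical and must already exist in MySQL with matching columns):
$ sqoop export \
--connect jdbc:mysql://quickstart/retail_db \
--username root --password cloudera \
--table customers_export \
--export-dir /retail_db/customers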
Concluding Remarks
Sqoop2 has a client-server design that improves the security, accessibility, and resource
manageability of Sqoop. The client only needs to connect to the Sqoop server, because database
connections are configured on the server by a system administrator. The Sqoop server is
accessible via a CLI, a REST API, and a Web UI, so end users no longer need to possess database
credentials. Furthermore, Sqoop2 provides a centralized audit trail for added
security.
I. Summary
1. In this lab, you have learned how to import a SQL database into HDFS using
Sqoop.
2. Sqoop exchanges data between a database and the Hadoop cluster.
3. Tables are imported using MapReduce jobs.
4. Sqoop provides many options to control imports.
Nearly every database server in common production use supports JDBC, so Sqoop is a nearly
universal tool. Using JDBC can be slow, however, since it is a generic approach. Direct mode
uses database-specific utilities (like mysqldump) to do faster imports, but not all databases
are supported. Even where a database is supported with direct mode, not all Sqoop features
are available.
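Where the MySQL connector supports it, direct mode is requested by adding the --direct flag; for example (the target directory is illustrative):
$ sqoop import \
--connect jdbc:mysql://quickstart/retail_db \
--username root --password cloudera \
--table customers \
--target-dir /retail_db/customers_direct \
--direct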