MYSQL TO SNOWFLAKE MIGRATION GUIDE

MIGRATION STRATEGIES AND BEST PRACTICES

What’s inside:

1. Why Migrate?
2. Strategy: Thinking About Your Migration
3. Migrating Your Existing MySQL Database
4. Migrate Using Traditional Backup and PUT/COPY Operations
   1. Extract Data from MySQL
   2. Data Types and Formatting
   3. Stage Data Files
   4. Copy Staged Files to Snowflake Table
   5. Incremental Data Load
   6. Incremental Extract from MySQL
   7. Update Snowflake Table
5. Migrating Your Queries and Workloads


Why Migrate?
MySQL has played a role in relational databases and data warehousing for many years. With the introduction of engineered systems such as MySQL Cluster and the MySQL 8.0 database, the tight integration of storage and compute enabled faster processing of larger amounts of data on on-premises infrastructure. However, as the volume, velocity and variety of data have changed, the cloud has expanded what’s possible with modern data analytics. For example, by separating compute from storage, Snowflake has developed a modern cloud data warehouse that automatically and instantly scales in a way not possible with MySQL, whether the current MySQL system is on-premises or hosted in the cloud. Snowflake accomplishes this with its multi-cluster, shared data architecture.

YOUR MOTIVATION TO MIGRATE


Some of the key reasons customers migrate off of MySQL:
1. Legacy platform is inadequate. Traditional technology fails to meet the needs of
today’s business users, such as unlimited concurrency and performance.
2. Cloud offers a no-management solution. Moving from on-premises to the cloud means
moving away from traditional IT delivery models to on-demand, as-a-service models
with minimal management or intervention.
3. Cost is affordable and predictable. Snowflake allows for true pay-as-you-go cloud
scalability without the need for complex reconfiguration as your data or workloads
grow.

WHY SNOWFLAKE?
Snowflake’s innovations break down the technology and architecture barriers that
organizations still experience with other data warehouse vendors. Only Snowflake has
achieved all six of the defining qualities of a data warehouse built for the cloud:
➔ ZERO MANAGEMENT
Snowflake reduces complexity with built-in performance, so there’s no infrastructure
to tweak, no knobs to turn and no tuning required.

➔ ALL YOUR DATA

Create a single source of truth to easily store, integrate and extract critical insight
from petabytes of structured and semi-structured data (JSON, Avro, ORC, Parquet
or XML).

➔ ALL YOUR USERS


Provide access to an architecturally unlimited number of concurrent users and
applications without eroding performance.

➔ PAY ONLY FOR WHAT YOU USE


Snowflake’s built-for-the-cloud solution scales storage separately from compute, up
and down, transparently and automatically.

➔ DATA SHARING
Snowflake extends the data warehouse to the Data Sharehouse™, with direct,
governed and secure data sharing in real time, so enterprises can easily forge
one-to-one, one-to-many and many-to-many data sharing relationships.

➔ COMPLETE SQL DATABASE


Snowflake supports ANSI SQL and thus works with the tools millions of business users
already know how to use today.

Strategy: Thinking About Your Migration

WHAT SHOULD YOU CONSIDER?


There are several things to contemplate when choosing your migration path. It’s usually
desirable to pilot the migration on a subset of the data and processes. Organizations often
prefer to migrate in stages, reducing risk and showing value sooner. However, you must
balance this against the need to maintain program momentum and minimize the period of
dual-running. In addition, your approach may be constrained by the interrelationships within
the data, such as data marts that rely on references to data populated via a separate
process in another schema.

Questions to ask about your workloads and data


1. What workloads and processes can you migrate with minimal effort?
2. Which processes have issues today and would benefit from re-engineering?
3. What workloads are outdated and require a complete overhaul?
4. What new workloads would you like to add that would be easier to deploy in Snowflake?

Approach to Migration

The decision whether to move data and processes in one bulk operation or to deploy a staged
approach depends on several factors:
1. Nature of your current data analytics platform
2. The types and number of data sources
3. Time to move the legacy system to Snowflake

LIFT AND SHIFT MIGRATION


Lift & shift is a common option for moving data from one RDBMS to another, irrespective of where they reside.

In our case, we will take a dump of the tables/databases and copy it across the internet into a pre-deployed target Snowflake account. Although this lift and shift can be done manually, the process can and should be automated with ETL tools.

BENEFITS -
● Migrate fast to the new system
● Reduced risk compared to replatforming and refactoring
● Lower initial cost compared to replatforming and refactoring
● Thanks to the multiple cloud-native and partner tools available, the process can be highly
automated with limited or no downtime.

RISKS -

● Inefficient and expensive cloud consumption.
● Lack of understanding of the cloud, leading to inefficient work or data leakage from
incorrect operation.
● Poor cost and workload estimation due to a lack of cloud skills or understanding of the
application data.

Migrating your existing MySQL Database

To successfully migrate your enterprise database to Snowflake, develop and follow a logical
plan that includes the steps presented in this section.
1. MOVING YOUR DATA MODEL
   1. Using a data modeling tool (MySQL Workbench/ERwin)
   2. Using existing DDL scripts
   3. Creating new DDL scripts using mysqldump:
      mysqldump --no-data -u someuser -papples mydatabase > db_name_ddl.sql
2. MOVING YOUR EXISTING DATA SET
   1. Moving data using an ETL tool (Fivetran, Stitch, etc.)
   2. Moving data using traditional backup utilities (mysqldump) and setting up CDC

MOVING YOUR DATA MODEL


As a starting point for your migration, you need to move your database objects, including
databases, tables, views and sequences, from MySQL to Snowflake. In addition, you may
want to include all of your user account names, roles and object grants.
At a minimum, the user who owns the MySQL Database must be created on the target
Snowflake system before migrating data. Your choice of which objects to move depends on
the scope of your initial migration.
After deciding which objects to move, choose a method for moving your data model from
MySQL to Snowflake. The following sections outline three different methods.

Using a data modeling tool


If your database design is stored in a data modeling tool such as MySQL Workbench or
ERwin, you can generate the DDL needed to rebuild your existing database objects. The
majority of your MySQL DDL will execute in Snowflake without change. Keep in mind that
Snowflake is self-tuning and has a unique architecture.
Note - You won’t need to generate code for any indexes, partitions or storage clauses that
you may have needed in a MySQL database.

You need only basic DDL, such as CREATE TABLE, CREATE VIEW and CREATE
SEQUENCE. Once you have these scripts, you can log into your Snowflake account to
execute them through the UI or the command line tool SnowSQL.
If you have a data modeling tool, but the model is not current, we recommend you reverse
engineer the current design into your tool, then follow the approach outlined above.
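For illustration, here is a minimal sketch of the kind of basic DDL that typically executes in Snowflake as-is; the table, view and sequence names below are invented for this example and are not part of this guide.

-- Basic DDL with no indexes, partitions or storage clauses,
-- executed through the Snowflake UI or SnowSQL.
CREATE SEQUENCE order_id_seq START = 1 INCREMENT = 1;

CREATE TABLE orders (
    order_id    INTEGER DEFAULT order_id_seq.NEXTVAL,
    customer_id INTEGER,
    order_date  DATE,
    amount      NUMBER(10,2)
);

CREATE VIEW recent_orders AS
    SELECT * FROM orders
    WHERE order_date > DATEADD(day, -30, CURRENT_DATE);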

Using existing DDL scripts


If you don’t have a data modeling tool, you can begin with the most recent version of your
existing DDL scripts (in a version control system). Edit these scripts to remove code for
extraneous features and options not needed in Snowflake, such as indexes, tablespace
assignments and other storage or distribution-related clauses. Depending on the data types
you used in MySQL, you may also need to do a search-and-replace in the scripts to change
some of the data types to Snowflake optimized types. For a list of these data types, see
Appendix B.
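As an illustration (the table and column names here are invented, not taken from this guide), a MySQL table definition and its edited Snowflake version might look like this:

-- Original MySQL DDL (illustrative):
CREATE TABLE customers (
    customer_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    full_name   VARCHAR(255) NOT NULL,
    notes       MEDIUMTEXT,
    created_at  DATETIME,
    PRIMARY KEY (customer_id),
    KEY idx_customers_name (full_name)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

-- Edited for Snowflake: the secondary index, ENGINE and CHARSET clauses are removed,
-- AUTO_INCREMENT becomes AUTOINCREMENT, and MEDIUMTEXT maps to STRING.
CREATE TABLE customers (
    customer_id INTEGER AUTOINCREMENT,
    full_name   VARCHAR(255) NOT NULL,
    notes       STRING,
    created_at  DATETIME,
    PRIMARY KEY (customer_id)
);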

Creating new DDL scripts


If you don’t have a data modeling tool or current DDL scripts for your data warehouse, you
will need to extract the metadata needed from the MySQL Information_schema to generate
these scripts. This task is somewhat simplified for Snowflake since you won’t need to extract
metadata for indexes and storage clauses.
As mentioned above, depending on the data types in your MySQL design, you may also
need to change some of the data types to Snowflake-optimized types. You will likely need to
write a SQL extract script to build the DDL scripts. Rather than do a search and replace after
the script is generated, you can code these data type conversions directly into the metadata
extract script. The benefit is that you will have automated the extract process so you can
execute the move iteratively. Plus, you will save time editing the script after the fact.
Additionally, coding the conversions into the script is less error-prone than any manual
clean-up process, especially if you are migrating hundreds or even thousands of tables.
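As a sketch of such an extract script (illustrative and not exhaustive; you would extend the CASE expression with your own type mappings), the following query against the MySQL information_schema builds a basic CREATE TABLE statement per table:

-- Run against MySQL; for wide tables you may need to raise group_concat_max_len.
SELECT CONCAT(
         'CREATE TABLE ', table_name, ' (\n',
         GROUP_CONCAT(
           CONCAT('    ', column_name, ' ',
             CASE
               WHEN data_type IN ('tinytext', 'text', 'mediumtext', 'longtext') THEN 'STRING'
               WHEN data_type = 'mediumint' THEN 'INTEGER'
               ELSE UPPER(column_type)
             END)
           ORDER BY ordinal_position SEPARATOR ',\n'),
         '\n);') AS ddl
FROM information_schema.columns
WHERE table_schema = 'mydatabase'
GROUP BY table_name;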

MOVING YOUR EXISTING DATA SET

After building your objects in Snowflake, move the historical data loaded in your MySQL
system over to Snowflake.
Moving Data using an ETL tool (Fivetran, Stitch, Alooma, etc.)

You can use a third-party migration tool (see Appendix A), an ETL tool or a manual process.
When choosing an option, consider how much data you have to move. For example, to
move 10s or 100s of terabytes up to a few petabytes of data, a practical approach is to
extract the data to files and move it via a service such as AWS Snowball or Azure Data Box.
If you have to move 100s of petabytes or even exabytes of data, AWS Snowmobile or Azure
Data Box are available options.

Moving Data using Traditional Backup utilities (mysqldump) and Setting Up CDC

If you choose to move your data manually, you will need to extract the data for each table to
one or more delimited flat files in text format. Use one of the many methods available for
MySQL, such as mysqldump or mydumper, to export the data in the desired format.
Then upload these files to a stage, either an internal stage (using the PUT command) or an
external stage such as an Amazon S3 bucket. We recommend these files be between 100MB
and 1GB to take advantage of Snowflake’s parallel bulk loading.
After you have extracted the data and moved it to the stage, you can begin loading the data
into your table in Snowflake using the COPY command. You can find more details about the
COPY command in the Snowflake online documentation.
PROCEDURE TO MIGRATE THE DATABASE USING TRADITIONAL BACKUP AND
PUT/COPY OPERATIONS, STEP BY STEP

The high-level steps for a MySQL to Snowflake migration are:

1. Extract Data from MySQL
2. Data Types and Formatting
3. Stage Data Files
4. Copy Staged Files to Snowflake Table

1. Extract Data from MySQL

Broadly, there are two methods for extracting data from MySQL: using the command line tool
mysqldump, or running a SQL query with the MySQL client and saving the output to files.

Extracting data with mysqldump:

mysqldump is a client utility included with a standard MySQL installation. Its main use is to
create a logical backup of a database or table. It can be used to extract one table as shown
below:

mysqldump -u <username> -h <host_name> -p database_name my_table > my_table_out.sql

Here, the output file my_table_out.sql will be in the form of INSERT statements like

INSERT INTO table_name (column1, column2, column3, ...)

VALUES (value1, value2, value3, ...);

To convert this format into a CSV file, you have to write a small script or use an open-source
library. You can refer to the official MySQL documentation for more information.

If mysqldump runs on the same machine as the MySQL server, there is a simpler option to get
CSV files directly: the -T option makes the server write one data file per table to the given
directory. Use the command below to get CSV files:

mysqldump -u [username] -p -t -T /path/to/directory [database_name] --fields-terminated-by=,

Extract Data Using SQL Query

SQL commands can be executed using the MySQL client utility, with the output redirected to a file:

mysql -B -u user database -h mysql_host -e "select * from my_table;" > my_table_data_raw.txt

The output can be transformed using text editing utilities like sed or awk to clean and format
data.

Example:

mysql -B -u user database -h mysql_host -e "select * from my_table;" | sed "s/'/\'/;s/\t/\",\"/g;s/^/\"/;s/$/\"/;s/\n//g" > my_table_final_data.csv

2. Data Types and Formatting

Other than business-specific transformations, note the following when replicating data from
MySQL to Snowflake:

● Snowflake supports a number of character sets, including UTF-8 and UTF-16. See the
Snowflake documentation for the full list.
● Snowflake supports UNIQUE, PRIMARY KEY, FOREIGN KEY and NOT NULL
constraints, unlike many other cloud analytical solutions.
● Snowflake has a rich set of data types. The following table lists Snowflake data
types and the corresponding MySQL types.

MySQL Data Type      Snowflake Data Type
TINYINT              TINYINT
SMALLINT             SMALLINT
MEDIUMINT            INTEGER
INT                  INTEGER
BIGINT               BIGINT
DECIMAL              DECIMAL
FLOAT                FLOAT, FLOAT4, FLOAT8
DOUBLE               DOUBLE, DOUBLE PRECISION, REAL
BIT                  BOOLEAN
CHAR                 CHAR
VARCHAR              VARCHAR
BINARY               BINARY
VARBINARY            VARBINARY
TINYTEXT             STRING, TEXT
TEXT                 STRING, TEXT
MEDIUMTEXT           STRING, TEXT
LONGTEXT             STRING, TEXT
ENUM                 No ENUM type. Use any type that can represent the ENUM values.
SET                  No SET type. Use any type that can represent the SET values.
DATE                 DATE
TIME                 TIME
DATETIME             DATETIME
TIMESTAMP            TIMESTAMP

● Snowflake accepts most date/time formats, and the format can be explicitly specified
while loading data into a table using the File Format option, as shown in the sketch below
and discussed in more detail later. For the complete list of supported formats, see the
Snowflake documentation.
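For example, a named file format can pin down the delimiter, header handling and date/timestamp formats for the incoming MySQL extracts. The sketch below uses the my_csv_format name referenced in a later COPY example; the format strings are assumptions you should adjust to match your extract files.

CREATE OR REPLACE FILE FORMAT my_csv_format
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  SKIP_HEADER = 1
  DATE_FORMAT = 'YYYY-MM-DD'
  TIMESTAMP_FORMAT = 'YYYY-MM-DD HH24:MI:SS';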

3. Stage Data Files

To load MySQL data into a Snowflake table, the data files first need to be uploaded to an
intermediate location called a stage. Snowflake supports internal and external stages.

Internal Stage

Each user and table is automatically allocated an internal stage for staging data files. You
can also create named internal stages.

● The user stage is referenced using ‘@~’.
● The name of a table stage is the same as the table name; it is referenced using ‘@%’
followed by the table name (see the sketch below).
● User and table stages can’t be altered or dropped.
● User and table stages do not support setting file format options.
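As a brief sketch (the file path and table name are illustrative), staging a file to the user stage and to a table stage looks like this:

-- Stage a file to the current user's stage, under a subfolder:
PUT file:///tmp/mysql/data/mysql_data.csv @~/mysql/;

-- Stage the same file to the stage of a table named mysql_table:
PUT file:///tmp/mysql/data/mysql_data.csv @%mysql_table;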

Internal named stages are explicitly created by the user with the corresponding SQL
statements. They provide a greater degree of flexibility when loading data: you can assign a
file format and other options to a named stage, which makes data loading easier.

While working with Snowflake you will need to run many DDL and DML statements, as well
as data-loading-specific commands like those shown below. SnowSQL is a handy CLI client
that can be used to run these commands and is available for Linux, macOS and Windows.

Example:

create or replace stage my_mysql_stage
copy_options = (on_error='skip_file')
file_format = (type = 'CSV' field_delimiter = '|' skip_header = 1);

The PUT command is used to stage data files to an internal stage. The syntax of the command is:

PUT file://path/to/file/filename internal_stage_name

Example:

Upload a file named mysql_data.csv in the /tmp/mysql/data directory to an internal stage
named mysql_stage:

PUT file:///tmp/mysql/data/mysql_data.csv @mysql_stage;

There are many useful options, such as setting the degree of parallelism for the upload and
automatic compression of the data files.
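For example (a sketch; the option values are illustrative):

-- Upload with 8 parallel threads and explicit automatic compression
-- (AUTO_COMPRESS is TRUE by default and is shown here only for clarity):
PUT file:///tmp/mysql/data/mysql_data.csv @mysql_stage
    PARALLEL = 8
    AUTO_COMPRESS = TRUE;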

External Stage

Currently, Snowflake supports Amazon S3 and Microsoft Azure as external staging locations.
You can create an external stage in one of those locations and load data into a Snowflake
table. To create an external stage on S3, you have to provide IAM credentials and, if the data
is encrypted, encryption keys, as shown in the example below:

create or replace stage mysql_ext_stage url='s3://snoflake/load/files/'
credentials= (aws_key_id='111a222b3c' aws_secret_key='abcd4x5y6z')
encryption= (master_key = 'eSxX0jzYfIamtnBKOEOwq80Au6NDwOaO8=');

Data can be uploaded to the external stage using the respective cloud vendor's interfaces. For
S3, you can upload using the web console, an SDK or third-party tools.

4. Copy Staged Files to Snowflake Table

The COPY INTO command loads the contents of the staged file(s) into a Snowflake table.
This command needs compute resources, in the form of a virtual warehouse, to run.

Example:
To load from a named internal stage:

COPY INTO mysql_table

FROM @mysql_stage;

To load from the external stage (only one file is specified):

COPY INTO mycsvtable

FROM @mysql_ext_stage/tutorials/dataloading/contacts1.csv;

You can even copy directly from an external location:

COPY INTO mysql_table
FROM s3://mybucket/data/files
credentials= (aws_key_id='$AWS_ACCESS_KEY_ID' aws_secret_key='$AWS_SECRET_ACCESS_KEY')
encryption= (master_key = 'eSxX0jzYfIamtnBKOEOwq80Au6NbSgPH5r4BDDwOaO8=')
file_format = (format_name = my_csv_format);

Files can be specified using patterns:

COPY INTO mytable
FROM @mysql_stage
file_format = (type = 'CSV')
pattern='.*/.*/.*[.]csv[.]gz';

Some common format options supported in the COPY command for CSV files are the
following:

● COMPRESSION – Compression algorithm of the data files to be loaded.
● RECORD_DELIMITER – Character that separates records (lines) in the input CSV file.
● FIELD_DELIMITER – Character that separates fields in the input file.
● SKIP_HEADER – Number of header lines to skip.
● DATE_FORMAT – String that specifies the date format.
● TIME_FORMAT – String that specifies the time format.

For the full list of available options, see the Snowflake documentation.
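Putting a few of these options together in one statement (a sketch; the values are illustrative):

-- Load gzip-compressed CSV extracts, skipping the header row and
-- declaring the date format used in the MySQL dump files.
COPY INTO mysql_table
FROM @mysql_stage
FILE_FORMAT = (TYPE = 'CSV'
               COMPRESSION = 'GZIP'
               FIELD_DELIMITER = ','
               SKIP_HEADER = 1
               DATE_FORMAT = 'YYYY-MM-DD');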

Incremental Data Load

After the initial full data load into the target table, changed data is typically extracted from the
source and migrated to the target table at regular intervals. For small tables, a full data dump
can sometimes be used even for recurring migration, but for larger tables you have to take a
delta approach.

Incremental Extract from MySQL

To get only the records modified after a particular point in time, run SQL with the proper
predicates against the table and write the output to a file. mysqldump is not useful here, as it
always extracts the full data set.

Example: Extracting records based on the last_updated_timestamp column and formatting
the data using the sed command.

mysql -B -u user database -h mysql_host -e "select * from my_table where last_updated_timestamp < now() and last_updated_timestamp > '#max_updated_ts_in_last_run#'" | sed "s/'/\'/;s/\t/\",\"/g;s/^/\"/;s/$/\"/;s/\n//g" > my_table_data.csv

Any records deleted physically will be missing here and will not be reflected in the target.

Update Snowflake Table

Snowflake supports row-level updates, which makes delta data migration much easier. The
basic idea is to load the incrementally extracted data into an intermediate table and then
modify the records in the final table based on the data in the intermediate table.
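For instance, the intermediate table and its load might look like the sketch below, which assumes the internal stage created earlier and an illustrative file name:

-- Create an intermediate table with the same structure as the final table:
CREATE OR REPLACE TABLE intermed_table LIKE final_table;

-- Stage the incremental extract (PUT gzip-compresses the file by default)
-- and load it into the intermediate table:
PUT file:///tmp/mysql/data/my_table_data.csv @mysql_stage;

COPY INTO intermed_table
FROM @mysql_stage/my_table_data.csv.gz
FILE_FORMAT = (TYPE = 'CSV');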

There are three methods to modify the final table once the data is loaded into the
intermediate table.

● Update the existing rows in the final table and insert new rows from the
intermediate table which are not in the final table.

UPDATE final_table t
SET value = s.value
FROM intermed_table s
WHERE t.id = s.id;

INSERT INTO final_table (id, value)
SELECT id, value
FROM intermed_table
WHERE id NOT IN (SELECT id FROM final_table);

● Delete all rows from the final table which are present in the intermediate
table. Then insert all rows from the intermediate table to the final table.

DELETE FROM final_table
WHERE id IN (SELECT id FROM intermed_table);

INSERT INTO final_table (id, value)
SELECT id, value
FROM intermed_table;

● MERGE statement – Insert and update can be done with a single MERGE
statement, which applies the changes in the intermediate table to the final table.

MERGE INTO final_table t1 USING intermed_table t2 ON t1.id = t2.id
WHEN MATCHED THEN UPDATE SET value = t2.value
WHEN NOT MATCHED THEN INSERT (id, value) VALUES (t2.id, t2.value);

MIGRATING YOUR QUERIES AND WORKLOADS

Data query migration

Since Snowflake uses ANSI-compliant SQL, most of your existing queries will execute on
Snowflake without requiring change. However, MySQL has a number of MySQL-specific
extensions and constructs that you need to watch out for, such as backtick-quoted identifiers
and functions like GROUP_CONCAT. See Appendix C for details and suggested translations.
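As one illustration (a sketch; the table and column names are invented), MySQL's GROUP_CONCAT typically translates to LISTAGG in Snowflake:

-- MySQL:
SELECT customer_id,
       GROUP_CONCAT(product_name ORDER BY product_name SEPARATOR ', ') AS products
FROM orders
GROUP BY customer_id;

-- Snowflake equivalent:
SELECT customer_id,
       LISTAGG(product_name, ', ') WITHIN GROUP (ORDER BY product_name) AS products
FROM orders
GROUP BY customer_id;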
Another common change relates to formatting of date constants used for comparisons in
predicates. For example:
In MySQL it might look like this:

where my_date_datatype > '2017-01-01';

Or

where date_format(my_date_datatype, '%Y-%m-%d') > '2017-01-01';

Or

where my_date_datatype > str_to_date('2017-01-01', '%Y-%m-%d');

In Snowflake it looks like this:

where my_date_datatype > cast('2017-01-01' as date)

Alternatively, in Snowflake you can also use this form:

where my_date_datatype > '2017-01-01'::date

Migrating BI tools
Many of your queries and reports are likely to use an existing business intelligence (BI) tool.
Therefore, you’ll need to account for migrating those connections from MySQL to Snowflake.
You’ll also have to test those queries and reports to be sure you’re getting the expected
results.
This should not be difficult since Snowflake supports standard ODBC and JDBC
connectivity, which most modern BI tools use. Many of the mainstream tools have native
connectors to Snowflake. Check our website to see if your tools are part of our ecosystem.
Don’t worry if your tool of choice is not listed. You should be able to establish a connection
using either ODBC or JDBC. If you have questions about a specific tool, your Snowflake
contact will be happy to help.
