
Motivation

Problem:
• SQL databases are convenient, easy, and efficient to query.
• However, ingesting massive amounts of data becomes challenging.
• There are many common errors when ingesting data that result in poor insertion rates.
• We are going to cover all of these errors.
• With high performance distributed databases there are even more challenges in reaching the
maximum insertion rate.

Solution:
• Use a number of best practices to avoid the common pitfalls.
• We will use an iterative approach in which we identify each error, what is wrong, and how to
fix it, until we reach an optimal insertion approach.
Scenario 1
• A LeanXcale cluster on an m5.2xlarge instance (8 vCPU, 32 GB), accessed through JDBC.
• Dataset: LendingClub data from 2007 to 2018, available on Kaggle: https://www.kaggle.com/wordsforthewise/lending-club
• Code: https://gitlab.com/leanxcale_public/highdataingestion
Ingesting Data: First Naïve Approach
Test 1: Loader Running in your Laptop
• You can find the code of this step in: https://gitlab.com/leanxcale_public/highdataingestion/-/blob/master/src/test/java/com/leanxcale/example/jdbcLowLevel/Test1_NaiveApproach.java
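The repository file above is the reference implementation. As a rough, hedged illustration of the pattern being criticized, the sketch below shows what such a naïve loader typically looks like (the JDBC URL, credentials, and the loan table with its columns are hypothetical placeholders, not the repository's actual code): every row opens its own connection and runs a plain SQL string.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.List;

public class NaiveLoader {
    // Hypothetical JDBC URL of a remote (cloud) database and placeholder credentials.
    private static final String URL = "jdbc:leanxcale://remote-db.example.com:1522/loans";

    public static void load(List<String[]> rows) throws Exception {
        for (String[] row : rows) {
            // A new connection is opened and closed for every single row: very expensive.
            try (Connection conn = DriverManager.getConnection(URL, "user", "password");
                 Statement stmt = conn.createStatement()) {
                // Plain, unprepared SQL built by string concatenation.
                stmt.executeUpdate("INSERT INTO loan (id, amount) VALUES ("
                        + row[0] + ", " + row[1] + ")");
            }
        }
    }
}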
Ingesting Data: First Naïve Approach
Test 1: Loader Running in your Laptop
• This first naïve approach has achieved:

[Chart: Insertion Rate (rows/s) vs. Test Number for test 1 (Naïve approach); y-axis scale up to 13 rows/s]
Ingesting Data: Common Errors
Test 1 Error: Loader Not Collocated with the Destination DB
• You run the loader on your laptop in one location and load the data into a cloud DB that is far away, with high latency (say 100 ms to go and 100 ms to come back).
• Why is this a problem?
• Ingesting a row will take 100 ms to reach the cloud; the processing there will be very fast, say 0.1 ms.
• Then, the reply will take another 100 ms to reach your laptop.
• How long did it take to insert a single row?
• 200.1 ms.
• How many rows can you insert per second?
• 1000 ms / 200.1 ms ≈ 5 rows per second
• even if the DB takes 0 ms to insert the row!!!
Ingesting Data: Best Practices
Test 2: Collocating Loader with the Destination DB
• You run the loader in the same cloud as your cloud DB.
• Latency will be quite low, say 0.1 ms.
• Ingesting a row now will take 0.1 ms to reach the DB.
• The processing there will be, say, 0.1 ms.
• Then, the reply will take another 0.1 ms to reach your loader.
• How long did it take to insert a single row?
• 0.3 ms.
• How many rows can you insert per second?
• 1000 ms / 0.3 ms ≈ 3,333 rows per second!!!
Scenario 2
• A client server running the tests, connected through JDBC to a LeanXcale cluster on an m5.2xlarge instance (8 vCPU, 32 GB).
• Dataset: LendingClub data from 2007 to 2018, available on Kaggle: https://www.kaggle.com/wordsforthewise/lending-club
• Code: https://gitlab.com/leanxcale_public/highdataingestion
Ingesting Data
Test 2: Collocating Loader with the Destination DB
• You can find the code of this step in: https://gitlab.com/leanxcale_public/highdataingestion/-/blob/master/src/test/java/com/leanxcale/example/jdbcLowLevel/Test2_Collocated.java
Ingesting Data
Test 2: Collocating Loader with the Destination DB
• By collocating the loader, we achieve:

[Chart: Insertion Rate (rows/s) vs. Test Number for tests 1-2 (Naïve approach, Collocating Loader with DB); y-axis scale up to 25 rows/s]
Ingesting Data: Common Errors
Test 2 Error: Connection per Insert
• You are opening a connection per insert.
• Why is this a problem?
• Establishing a connection is expensive.
• You are paying that high cost for every insert.
Ingesting Data
Test 3: Permanent Connection
• You can find the code of this step in: https://gitlab.com/leanxcale_public/highdataingestion/-/blob/master/src/test/java/com/leanxcale/example/jdbcLowLevel/Test3_PermanentConnection.java
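As a hedged sketch of this step (same hypothetical URL, credentials, and loan table as before, not the repository's actual code), the connection is now opened once, outside the loop, and reused for all inserts:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.List;

public class PermanentConnectionLoader {
    public static void load(List<String[]> rows) throws Exception {
        // The connection (and statement) is established once and reused for every insert.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:leanxcale://db-host:1522/loans", "user", "password");
             Statement stmt = conn.createStatement()) {
            for (String[] row : rows) {
                // Still an unprepared SQL string: this is the next problem to fix.
                stmt.executeUpdate("INSERT INTO loan (id, amount) VALUES ("
                        + row[0] + ", " + row[1] + ")");
            }
        }
    }
}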
Ingesting Data
Test 3: Permanent Connection
• By keeping a permanent connection, we achieve:

[Chart: Insertion Rate (rows/s) vs. Test Number for tests 1-3 (up to Permanent Connection); y-axis scale up to 100 rows/s]
Ingesting Data: Common Errors
Test 3 Error: Unprepared SQL Statements
• You just execute plain (unprepared) SQL statements.
• Why is this a problem?
• When you execute an SQL statement (in this case, an insert), the server has to do some processing:
• SQL compilation: the string with the SQL is compiled.
• SQL optimization: the compiled SQL (typically an abstract syntax tree) goes through a series of transformations into a query plan (a tree of algebraic query operators), which is then optimized (rewritten into a form that produces the same result but is more efficient to compute).
• All these processes are extremely expensive compared to the cost of executing the insert itself.
Ingesting Data: Best Practices
Test 4: Prepared SQL Statements
• First you prepare the SQL statement.
• Then, you execute it.
• When you prepare the SQL statement, it gets compiled and optimized as before (SQL compilation and SQL optimization), but it is assigned an identifier that is returned to the JDBC driver.
• When you execute a prepared statement, you simply send the prepared statement id; the server uses it to recover the saved optimized query plan and simply executes it, saving the expensive SQL compilation and query optimization steps (see the sketch below).
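A minimal sketch of this idea with standard JDBC (hypothetical URL, credentials, and loan table): the statement is prepared once with ? placeholders, and each execution only binds the values.

import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class PreparedStatementLoader {
    public static void load(List<String[]> rows) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:leanxcale://db-host:1522/loans", "user", "password");
             // Compiled and optimized once; the server keeps the query plan under an id.
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO loan (id, amount) VALUES (?, ?)")) {
            for (String[] row : rows) {
                // Each execution only binds values and references the prepared statement id.
                ps.setLong(1, Long.parseLong(row[0]));
                ps.setBigDecimal(2, new BigDecimal(row[1]));
                ps.executeUpdate();
            }
        }
    }
}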
Ingesting Data
Test 4: Prepared Statements
• You can find the code of this step in: https://gitlab.com/leanxcale_public/highdataingestion/-/blob/master/src/test/java/com/leanxcale/example/jdbcLowLevel/Test4_PreparedStatements.java
Ingesting Data
Test 4: Prepared statements
• By preparing the SQL statements, we achieve:

[Chart: Insertion Rate (rows/s) vs. Test Number for tests 1-4 (up to Prepared Statements); y-axis scale up to 800 rows/s]
Ingesting Data: Common Errors
Test 4 Error: Inserting Rows One by One
• You insert one row at a time.
• Why is this a problem?
• The cost of inserting a row has two components: one constant and one variable.
• The constant cost is what it would cost to insert an empty tuple.
• You still have to create a message with its header, serialize it, send it through the network, receive it, deserialize it, execute it, and commit it.
• All of this cost is paid for each individual row.
• Cost per row = (constant cost + variable cost) / 1, i.e., every row pays the full constant cost.
Ingesting Data: Best Practices
Test 5: Batching
• By batching many inserts together in a single transaction and message, the constant cost is amortized across many rows, making it negligible.
• In the end, you only pay the variable cost per row.
• If the batch is 1,000 rows, the constant cost per row is reduced by a factor of 1,000, since it is amortized among all of them: cost per row ≈ constant cost / 1,000 + variable cost (see the sketch below).
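A hedged sketch using standard JDBC batching (batch size, URL, credentials, and loan table are illustrative assumptions): rows are accumulated with addBatch() and sent in groups, committing once per batch rather than once per row.

import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class BatchingLoader {
    private static final int BATCH_SIZE = 1000; // illustrative batch size

    public static void load(List<String[]> rows) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:leanxcale://db-host:1522/loans", "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO loan (id, amount) VALUES (?, ?)")) {
            conn.setAutoCommit(false); // one transaction per batch, not per row
            int count = 0;
            for (String[] row : rows) {
                ps.setLong(1, Long.parseLong(row[0]));
                ps.setBigDecimal(2, new BigDecimal(row[1]));
                ps.addBatch();
                if (++count % BATCH_SIZE == 0) {
                    ps.executeBatch(); // the constant cost is amortized over the whole batch
                    conn.commit();
                }
            }
            ps.executeBatch(); // flush the last, possibly partial batch
            conn.commit();
        }
    }
}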
Ingesting Data
Test 5: Batching
• You can find the code of this step in: https://gitlab.com/leanxcale_public/highdataingestion/-/blob/master/src/test/java/com/leanxcale/example/jdbcLowLevel/Test5_Batching.java
Ingesting Data
Test 5: Batching
• By batching inserts, we achieve:

[Chart: Insertion Rate (rows/s) vs. Test Number for tests 1-5 (up to Batching); y-axis scale up to 30,000 rows/s]
Ingesting Data: Common Errors
Test 5 Error: Single Threaded/Machine Loader
• Your loader has a single thread that inserts rows.
• Why is this a problem?
• The throughput of a single thread is determined by the latency of each individual operation, say 10 ms.
• If you have a powerful database, you will never take advantage of it, because the single-threaded loader can do at most 1/latency = 1000 ms / 10 ms = 100 transactions per second, even if your distributed database can do 100 million transactions per second.
Ingesting Data: Best Practices
Test 6: Multi-threaded Loader
• By using multiple threads, you can saturate your distributed DB and reach its maximum throughput.
• But be careful: a single machine may not be enough to saturate the DB (use its full capacity), and you might need multiple machines/nodes, each running a multithreaded loader.
• How can you know whether you will need more than one machine/node? When the CPU usage on the loader machine reaches nearly 100% and the DB is still not saturated.
• Basically, the load you can ingest with n threads is n times that of a single thread, and with m machines/nodes it is m*n times that of a single thread (see the sketch below).
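A hedged sketch of a multi-threaded loader (thread count, URL, credentials, and loan table are illustrative assumptions): the input is split into chunks, and each worker thread uses its own connection, prepared statement, and batches.

import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class MultiThreadedLoader {
    // Illustrative thread count; tune it until the DB, not the loader CPU, is the bottleneck.
    private static final int THREADS = 16;

    public static void load(List<String[]> rows) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        int chunk = (rows.size() + THREADS - 1) / THREADS;
        for (int t = 0; t < THREADS; t++) {
            List<String[]> slice = rows.subList(
                    Math.min(t * chunk, rows.size()),
                    Math.min((t + 1) * chunk, rows.size()));
            pool.submit(() -> {
                // Each thread owns its connection; JDBC connections are not shared across threads.
                try (Connection conn = DriverManager.getConnection(
                        "jdbc:leanxcale://db-host:1522/loans", "user", "password");
                     PreparedStatement ps = conn.prepareStatement(
                             "INSERT INTO loan (id, amount) VALUES (?, ?)")) {
                    conn.setAutoCommit(false);
                    for (String[] row : slice) {
                        ps.setLong(1, Long.parseLong(row[0]));
                        ps.setBigDecimal(2, new BigDecimal(row[1]));
                        ps.addBatch();
                    }
                    ps.executeBatch(); // batching is kept inside each thread
                    conn.commit();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}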
Ingesting Data
Test 6: Multi-threaded Loader
• You can find the code of this step in: https://gitlab.com/leanxcale_public/highdataingestion/-/blob/master/src/test/java/com/leanxcale/example/jdbcLowLevel/Test6_Multithreader.java
Ingesting Data
Test 6: Multi-threaded Loader
• By using a multi-threaded loader, we achieve:

[Chart: Insertion Rate (rows/s) vs. Test Number for tests 1-6 (up to Multi-threaded Loader); y-axis scale up to 140,000 rows/s]
Ingesting Data: Common Errors
Test 6 Error: Insertion Degrades with Table Size
• As the table grows, inserting rows gets slower and slower.
• Why?
• This is because data is stored in a B+ tree and rows are inserted into its leaf nodes.
• As more data is inserted, the tree gets bigger, with more levels, and the cache becomes more and more ineffective.
• Thus, each insertion costs more and more IOs to reach the leaf node where the row must be inserted.
Ingesting Data: Best Practices
Test 7: LeanXcale Auto-Split (bidimensional partitioning)
• LeanXcale is able to partition data not only on the primary key, but also on other dimensions such as time.
• Typically, it is historical data that grows the most, and historical data always has a timestamp associated with each row.
• So, by enabling auto-split on the timestamp column, each table fragment can be kept small enough to ingest data very efficiently.
• In this way, the ingestion speed remains constant instead of degrading over time.
Ingesting Data
Test 7: LeanXcale Auto-Split
• You can find the code of this step in: https://gitlab.com/leanxcale_public/highdataingestion/-/blob/master/src/test/java/com/leanxcale/example/jdbcLowLevel/Test7_Autosplit.java
Ingesting Data
Test 7: LeanXcale Auto-Split
• By using auto-split, we achieve:

[Chart: Insertion Rate (rows/s) vs. Test Number for tests 1-7 (up to Auto-split); y-axis scale up to 140,000 rows/s]
Ingesting Data: Common Errors
Test 7 Error: Single DB Server Bottleneck
• At some point your ingestion is optimal and saturates the DB server.
• Why is this a problem?
• You cannot ingest more data with a single server.
Scenario 3
• A client server running the tests, connected through JDBC to a LeanXcale cluster of 6 m5.2xlarge instances (6 * 8 vCPU = 48 vCPU, 6 * 32 GB = 192 GB).
Ingesting Data: Best Practices
Test 8: Horizontal Scalability
• By using multiple instances of the DB server, you can scale horizontally and multiply the insertion throughput by n.
Ingesting Data
Test 8: Horizontal Scalability
• You can find the code of this step in: https://gitlab.com/leanxcale_public/highdataingestion/-/blob/master/src/test/java/com/leanxcale/example/jdbcLowLevel/Test8_HorizontalScalability.java
Ingesting Data
Test 8: Horizontal Scalability
• By scaling horizontally, we achieve:

[Chart: Insertion Rate (rows/s) vs. Test Number for tests 1-8 (up to Horizontal Scalability); y-axis scale up to 600,000 rows/s]
Ingesting Data: Common Errors
Test 8 Error: Cost of Queries with Distribution by Hashing
• Hashing is a convenient way to distribute the load.
• Why is this a problem?
• Because it imposes a strong tradeoff: queries become slower, since they have to be sent to all servers.
Ingesting Data: Best Practices
Test 9: Primary Key Distribution
• By using the primary key to distribute the data, queries on the primary key become efficient, because they target a single server.
Ingesting Data
Test 9: Primary Key Distribution
• You can find the code of this step in: https://gitlab.com/leanxcale_public/highdataingestion/-/blob/master/src/test/java/com/leanxcale/example/jdbcLowLevel/Test9_PrimaryKeyDistribution.java
Ingesting Data
Test 9: Primary Key Distribution
• By using the primary key to distribute the data, we achieve:

[Chart: Insertion Rate (rows/s) vs. Test Number for tests 1-9 (up to Primary Key Distribution); y-axis scale up to 600,000 rows/s]
Ingesting Data: Common Errors
Test 9 Error: Cost of JDBC
• JDBC is standard.
• Why is this a problem?
• JDBC is not optimal for ingesting data.
Scenario 4
• A client server running the tests, connected through the KiVi direct interface to a LeanXcale cluster of 6 m5.2xlarge instances (6 * 8 vCPU = 48 vCPU, 6 * 32 GB = 192 GB).
Ingesting Data: Best Practices
Test 10: Native KiVi Interface
• By using the native KiVi interface, the overhead of JDBC can be avoided, making ingestion more efficient.
Ingesting Data
Test 10: Native KiVi Interface
• You can find the code of this step in: https://gitlab.com/leanxcale_public/highdataingestion/-/blob/master/src/test/java/com/leanxcale/example/jdbcLowLevel/Test10_NativeKiviApi.java
Ingesting Data
Test 10: Native KiVi interface
• By using the native KiVi interface, we achieve:

[Chart: Insertion Rate (rows/s) vs. Test Number for tests 1-10 (up to the KiVi API); y-axis scale up to 2,500,000 rows/s]
What have we learnt today?
To sum up, the performance of SQL databases depends on the quality of the application's code. This is especially critical when dealing with high-performance distributed databases such as LeanXcale: since these databases are more capable, every detail matters to get the best result.
The proposed steps were:

• Test 1: Loader Running in your Laptop
• Test 2: Collocating Loader with the Destination DB
• Test 3: Permanent Connection
• Test 4: Prepared SQL Statements
• Test 5: Batching
• Test 6: Multi-threaded Loader
• Test 7: LeanXcale Auto-Split (bidimensional partitioning)
• Test 8: Horizontal Scalability
• Test 9: Primary Key Distribution
• Test 10: Native KiVi Interface
Links to the course materials

Gitlab containing this code:


https://gitlab.com/leanxcale_public/highdataingestion

You can request a trial to test this code at https://www.leanxcale.com/trial


The trial LeanXcale server runs in the AWS Virginia data center.
Want to join the LeanXcale family?

We are looking for these profiles:

• Java Back-End Engineer


• React and PHP Front-End Engineer
• Senior C Software Engineer
• Java Delivery Engineer

For more information, access all open positions at


https://www.leanxcale.com/careers-list
info@leanxcale.com

www.LeanXcale.com
@LeanXcale
