Common Errors and Best Practices for Bulk Inserts into SQL Databases
Problem:
• SQL databases are comfortable, easy, and efficient to query.
• However, ingesting massive amounts of data is challenging.
• There are many common errors when ingesting data that result in poor insertion rates.
• We are going to cover all of these errors.
• With high-performance distributed databases there are even more challenges in reaching the maximum insertion rate.
Solution:
• Apply a number of best practices to avoid the common pitfalls.
• We will use an iterative approach: identify each error, explain what is wrong and how to fix it, and repeat until we reach an optimal insertion approach.
Scenario 1
• LeanXcale cluster on one m5.2xlarge instance (8 vCPU, 32 GB RAM), accessed over JDBC.
• Based on LendingClub data from 2007 to 2018, available on Kaggle: https://www.kaggle.com/wordsforthewise/lending-club
• Code: https://gitlab.com/leanxcale_public/highdataingestion
Ingesting Data: First Naïve Approach
Test 1: Loader Running on Your Laptop
• You can find the code for this step at: https://gitlab.com/leanxcale_public/highdataingestion/-/blob/master/src/test/java/com/leanxcale/example/jdbcLowLevel/Test1_NaiveApproach.java
Ingesting Data: First Naïve Approach
Test 1: Loader Running on Your Laptop
• This first naïve approach achieved an insertion rate of 13 rows/s.
[Chart: Insertion Rate (rows/s) vs. Test Number; Test 1 (Naïve approach): 13 rows/s]
Ingesting Data: Common Errors
Test 1 Error: Loader Not Collocated with the Destination DB
• You run the loader on your laptop, in one location, and load data into a cloud DB that is far away, with high network latency (say 100 ms to go and 100 ms to come back).
• Why is this a problem? Every insert pays the full network round trip, so latency, not the database, dominates the insertion rate.
Ingesting Data
Test 2: Collocating the Loader with the Destination DB
• You can find the code for this step at: https://gitlab.com/leanxcale_public/highdataingestion/-/blob/master/src/test/java/com/leanxcale/example/jdbcLowLevel/Test2_Collocated.java
Ingesting Data
Test 2: Collocating the Loader with the Destination DB
• By collocating the loader, we achieve an insertion rate of 25 rows/s.
[Chart: Insertion Rate (rows/s) vs. Test Number; 1. Naïve approach: 13; 2. Collocating loader with DB: 25]
Ingesting Data: Common Errors
Test 2 Error: Connection per Insert
• You are opening a new connection for every insert.
• Why is this a problem? Establishing a connection involves a TCP handshake and authentication, and that cost is paid again for every single row.
[Chart: Insertion Rate (rows/s) vs. Test Number for tests 1–3: Naïve approach, Collocating loader with DB, Permanent connection; y-axis up to 100 rows/s]
Ingesting Data: Common Errors
Test 3 Error: Unprepared SQL Statements
• You just execute plain SQL statements, sending the full SQL text each time.
• Why is this a problem? Each statement must be parsed, analyzed, and planned on every execution; a prepared statement is compiled once and then re-executed with new parameters.
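A minimal JDBC sketch of this fix (the `loans` table, its two columns, and the `buildInsertSql` helper are illustrative assumptions, not taken from the repo). The connection is opened once and the statement is prepared once, so each row only pays for binding parameters and executing:

```java
import java.sql.*;

public class PreparedLoader {
    // Builds a parameterized INSERT with one '?' placeholder per column,
    // so the statement is parsed and planned once, then re-executed.
    static String buildInsertSql(String table, int columns) {
        StringBuilder sb = new StringBuilder("INSERT INTO " + table + " VALUES (");
        for (int i = 0; i < columns; i++) {
            sb.append(i == 0 ? "?" : ", ?");
        }
        return sb.append(")").toString();
    }

    static void load(Connection conn, String[][] rows) throws SQLException {
        // One permanent connection, one prepared statement, many executions.
        try (PreparedStatement ps = conn.prepareStatement(buildInsertSql("loans", 2))) {
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.executeUpdate();
            }
        }
    }
}
```

Note that this also fixes the previous error: the same connection is reused for the whole load instead of being opened per insert.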
[Chart: Insertion Rate (rows/s) vs. Test Number for tests 1–4: Naïve approach, Collocating loader with DB, Permanent connection, Prepared statements; y-axis up to 800 rows/s]
Ingesting Data: Common Errors
Test 4 Error: Inserting Rows One by One
• You insert one row at a time.
• Why is this a problem?
• The cost of inserting a row has two components: one constant (C) and one variable (V).
• The constant cost is what it would cost to insert an empty tuple: you still have to create a message with its header, serialize it, send it through the network, receive it, deserialize it, execute it, and commit it.
• All of this cost is paid for each individual row, so the cost per row is (C + V) / 1.
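This cost model can be made concrete with a small arithmetic sketch (the cost values below are made-up numbers for illustration only):

```java
public class InsertCostModel {
    // Inserting rows one by one: each row pays the full constant cost C
    // (message header, serialization, network, commit) plus its variable
    // cost V, i.e. (C + V) / 1.
    static double costPerRowOneByOne(double c, double v) {
        return c + v;
    }

    // With a batch of n rows, the constant cost is paid once and shared:
    // (C + n*V) / n = C/n + V.
    static double costPerRowBatched(double c, double v, int n) {
        return c / n + v;
    }

    public static void main(String[] args) {
        double c = 1000.0; // hypothetical constant cost (arbitrary units)
        double v = 10.0;   // hypothetical variable cost per row
        System.out.println(costPerRowOneByOne(c, v));      // full cost per row
        System.out.println(costPerRowBatched(c, v, 1000)); // amortized cost per row
    }
}
```

With these illustrative numbers, batching 1000 rows drops the per-row cost from 1010 to 11 units, which is essentially the variable cost alone.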
Ingesting Data: Best Practices
Test 5: Batching
• By batching many inserts into a single transaction and message, the constant cost is amortized across many rows and becomes negligible: you pay roughly only the variable cost per row.
• If the batch holds 1000 rows, the constant cost per tuple is reduced 1000-fold, since it is shared among all of them: (C + 1000·V)/1000 = C/1000 + V.
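A sketch of a JDBC batching loop (the table name, column count, batch size, and the `batchCount` helper are illustrative assumptions, not from the repo). All rows of a batch travel in one message and commit in one transaction:

```java
import java.sql.*;
import java.util.List;

public class BatchLoader {
    static final int BATCH_SIZE = 1000;

    // Number of executeBatch() calls the loop below performs for a given
    // row count, including the final partial batch.
    static int batchCount(int rows, int batchSize) {
        return (rows + batchSize - 1) / batchSize;
    }

    static void load(Connection conn, List<Object[]> rows) throws SQLException {
        conn.setAutoCommit(false); // commit per batch, not per row
        try (PreparedStatement ps =
                 conn.prepareStatement("INSERT INTO loans VALUES (?, ?)")) {
            int inBatch = 0;
            for (Object[] row : rows) {
                ps.setObject(1, row[0]);
                ps.setObject(2, row[1]);
                ps.addBatch();                 // buffer the row client-side
                if (++inBatch == BATCH_SIZE) { // flush a full batch
                    ps.executeBatch();
                    conn.commit();
                    inBatch = 0;
                }
            }
            if (inBatch > 0) {                 // flush the final partial batch
                ps.executeBatch();
                conn.commit();
            }
        }
    }
}
```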
Ingesting Data
Test 5: Batching
• You can find the code for this step at: https://gitlab.com/leanxcale_public/highdataingestion/-/blob/master/src/test/java/com/leanxcale/example/jdbcLowLevel/Test5_Batching.java
Ingesting Data
Test 5: Batching
• By batching inserts, we achieve:
[Chart: Insertion Rate (rows/s) vs. Test Number for tests 1–5: Naïve approach, Collocating loader with DB, Permanent connection, Prepared statements, Batching; y-axis up to 30000 rows/s]
Ingesting Data: Common Errors
Test 5 Error: Single-Threaded/Single-Machine Loader
• Your loader has a single thread that inserts rows.
• Why is this a problem?
• The throughput of a single thread is bounded by the latency of each individual operation, say 10 ms.
• If you have a powerful database, you will never take advantage of it: a single-threaded loader can do at most 1/latency = 1000 ms / 10 ms = 100 transactions per second, even if your distributed database can do 100 million transactions per second.
Ingesting Data: Best Practices
Test 6: Multi-threaded Loader
• By using multiple threads, you can saturate your distributed DB and reach its maximum throughput.
• But be careful: a single machine may not be enough to saturate the DB (use its maximum capacity), and you might need multiple machines/nodes, each with a multi-threaded loader.
• How can you know that you need more than one machine/node? When the CPU usage on the loader machine reaches near 100% and the DB is still not saturated.
• Basically, the load you can ingest with n threads is n times that of a single thread, and with m machines/nodes it is m·n times that of a single thread.
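The idea can be sketched with a toy multi-threaded loader; to keep the sketch runnable without a database, the real per-row JDBC work is replaced by a counter increment (the thread and row counts are arbitrary):

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;

public class MultiThreadedLoader {
    // Upper bound on single-thread throughput given per-operation latency:
    // 1 second / latency. With 10 ms per insert, at most 100 inserts/s.
    static long maxOpsPerSecond(long latencyMs) {
        return 1000 / latencyMs;
    }

    // n worker threads, each loading its own share of the rows in parallel,
    // so total throughput scales roughly with the thread count.
    static long loadInParallel(int threads, int rowsPerThread) throws InterruptedException {
        AtomicLong inserted = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < rowsPerThread; i++) {
                    inserted.incrementAndGet(); // stand-in for the real insert
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return inserted.get();
    }
}
```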
Ingesting Data
Test 6: Multi-threaded Loader
• You can find the code for this step at: https://gitlab.com/leanxcale_public/highdataingestion/-/blob/master/src/test/java/com/leanxcale/example/jdbcLowLevel/Test6_Multithreader.java
Ingesting Data
Test 6: Multi-threaded Loader
• By using a multi-threaded loader, we achieve:
[Chart: Insertion Rate (rows/s) vs. Test Number for tests 1–6; y-axis up to 140000 rows/s]
Ingesting Data: Common Errors
Test 6 Error: Insertion Degrades with Table Size
• As the table grows, inserting rows gets slower and slower.
• Why?
• Data is stored in a B+ tree, and rows are inserted at the leaves of the tree.
• As more data is inserted, the tree gets bigger, with more levels, and the cache becomes more and more ineffective.
• Thus, each insertion costs more and more I/Os to reach the leaf node where the row must be inserted.
Ingesting Data: Best Practices
Test 7: LeanXcale Auto-Split (Bidimensional Partitioning)
• LeanXcale can partition data not only by primary key, but also along other dimensions such as time.
• Typically, it is historical data that can grow very large, and historical data always has a timestamp associated with each row.
• By enabling auto-split on the timestamp column, each table fragment can be kept small enough to ingest data very efficiently.
• In this way, the ingestion speed remains constant instead of degrading as the table grows.
Ingesting Data
Test 7: LeanXcale Auto-Split
• You can find the code for this step at: https://gitlab.com/leanxcale_public/highdataingestion/-/blob/master/src/test/java/com/leanxcale/example/jdbcLowLevel/Test7_Autosplit.java
Ingesting Data
Test 7: LeanXcale Auto-Split
• By using auto-split, we achieve:
[Chart: Insertion Rate (rows/s) vs. Test Number for tests 1–7; y-axis up to 140000 rows/s]
Ingesting Data: Common Errors
Test 7 Error: Single DB Server Bottleneck
• At some point your ingestion is optimal and saturates the DB server.
• Why is this a problem?
• You cannot ingest any more data with a single server.
Scenario 3
• Six m5.2xlarge instances (6 × 8 vCPU = 48 vCPU, 6 × 32 GB = 192 GB RAM), accessed over JDBC.
Ingesting Data: Best Practices
Test 8: Horizontal Scalability
• By using n instances of the DB server, you can scale horizontally and multiply the insertion throughput by n.
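One way to sketch this on the loader side (the JDBC URLs below are hypothetical placeholders, not real endpoints from the document): assign each loader thread to a DB server instance round-robin, so every instance receives a similar share of the insert load.

```java
public class EndpointAssignment {
    // Round-robin: thread t connects to endpoint t mod n, spreading the
    // loader threads evenly across the n DB server instances.
    static String endpointFor(int threadId, String[] endpoints) {
        return endpoints[threadId % endpoints.length];
    }

    public static void main(String[] args) {
        // Hypothetical JDBC URLs, one per DB server instance.
        String[] endpoints = {
            "jdbc:db://server1/db",
            "jdbc:db://server2/db",
            "jdbc:db://server3/db"
        };
        for (int t = 0; t < 6; t++) {
            System.out.println("thread " + t + " -> " + endpointFor(t, endpoints));
        }
    }
}
```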
Ingesting Data
Test 8: Horizontal Scalability
• You can find the code for this step at: https://gitlab.com/leanxcale_public/highdataingestion/-/blob/master/src/test/java/com/leanxcale/example/jdbcLowLevel/Test8_HorizontalScalability.java
Ingesting Data
Test 8: Horizontal Scalability
• By scaling horizontally, we achieve:
[Chart: Insertion Rate (rows/s) vs. Test Number for tests 1–8; y-axis up to 600000 rows/s]
Ingesting Data: Common Errors
Test 8 Error: Cost of Queries with Distribution by Hashing
• Hashing is a convenient way to distribute the load.
• Why is this a problem?
• It has a severe tradeoff: queries become slower because they have to be sent to all servers.
Ingesting Data: Best Practices
Test 9: Primary Key Distribution
• By using the primary key to distribute the data, queries by primary key become efficient, because each one targets a single server.
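A toy sketch contrasting the two distribution schemes (server counts and key ranges are made up for illustration): with hashing, consecutive keys scatter across servers, so a range query must visit them all; with range (primary-key) distribution, each server owns a contiguous key range and a primary-key lookup or scan hits exactly one.

```java
public class KeyRouting {
    // Hash distribution: even load, but consecutive keys land on different
    // servers, so a range query over the key must be sent to every server.
    static int hashRoute(long key, int servers) {
        return ((Long.hashCode(key) % servers) + servers) % servers;
    }

    // Primary-key (range) distribution: server i owns keys below
    // upperBounds[i]; a lookup or range scan by key targets one server.
    static int rangeRoute(long key, long[] upperBounds) {
        for (int i = 0; i < upperBounds.length; i++) {
            if (key < upperBounds[i]) return i;
        }
        return upperBounds.length; // last server owns the tail range
    }
}
```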
Ingesting Data
Test 9: Primary Key Distribution
• You can find the code for this step at: https://gitlab.com/leanxcale_public/highdataingestion/-/blob/master/src/test/java/com/leanxcale/example/jdbcLowLevel/Test9_PrimaryKeyDistribution.java
Ingesting Data
Test 9: Primary Key Distribution
• By using the primary key to distribute the data, we achieve:
[Chart: Insertion Rate (rows/s) vs. Test Number for tests 1–9; y-axis up to 600000 rows/s]
Ingesting Data: Common Errors
Test 9 Error: Cost of JDBC
• JDBC is a standard interface.
• Why is this a problem?
• JDBC is not optimal for ingesting data: its generic protocol and abstraction layers add overhead to every operation.
Scenario 4
• Six m5.2xlarge instances (6 × 8 vCPU = 48 vCPU, 6 × 32 GB = 192 GB RAM), accessed through the KiVi direct interface.
Ingesting Data: Best Practices
Test 10: Native KiVi Interface
• By using the native KiVi interface, the overhead of JDBC is avoided, making ingestion more efficient.
Ingesting Data
Test 10: Native KiVi Interface
• You can find the code for this step at: https://gitlab.com/leanxcale_public/highdataingestion/-/blob/master/src/test/java/com/leanxcale/example/jdbcLowLevel/Test10_NativeKiviApi.java
Ingesting Data
Test 10: Native KiVi Interface
• By using the native KiVi interface, we achieve:
[Chart: Insertion Rate (rows/s) vs. Test Number for tests 1–10; y-axis up to 2500000 rows/s]
What have we learnt today?
To sum up, the performance of an SQL database depends heavily on the quality of the application's code. This is critical when dealing with high-performance distributed databases such as LeanXcale: since these databases are more capable, every detail matters to get the best result.
The proposed steps were:
1. Collocate the loader with the destination DB.
2. Keep a permanent connection instead of a connection per insert.
3. Use prepared statements.
4. Batch the inserts.
5. Use a multi-threaded (and, if needed, multi-machine) loader.
6. Use auto-split to keep ingestion constant as tables grow.
7. Scale the DB horizontally.
8. Distribute data by primary key rather than by hashing.
9. Use the native KiVi interface.
www.LeanXcale.com
@LeanXcale