
Motivation

Problem:
• SQL databases are convenient, easy, and efficient to query.
• However, ingesting massive amounts of data becomes challenging.
• There are many common errors when ingesting data that result in poor insertion rates.
• We are going to cover all of these errors.
• With high performance distributed databases there are even more challenges in reaching the
maximum insertion rate.

Solution:
• Use a number of best practices to avoid the common pitfalls.
• We will use an iterative approach in which we identify each error, what is wrong, and how to
fix it, until we reach an optimal insertion approach.
Scenario 1
• A LeanXcale cluster on an m5.2xlarge instance (8 vCPU, 32 GB), accessed through JDBC.
• Dataset: LendingClub data from 2007 to 2018, available on Kaggle: https://www.kaggle.com/wordsforthewise/lending-club
• Code: https://gitlab.com/leanxcale_public/highdataingestion
Ingesting Data: First Naïve Approach
Test 1: Loader Running in your Laptop
• You can find the code of this step in: https://gitlab.com/leanxcale_public/highdataingestion/-/blob/master/src/test/java/com/leanxcale/example/jdbcLowLevel/Test1_NaiveApproach.java
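The repository file above is the reference implementation. As a rough, hedged illustration of the pattern being criticized, the sketch below shows what such a naïve loader typically looks like (the JDBC URL, credentials, and the loan table with its columns are hypothetical placeholders, not the repository's actual code): every row opens its own connection and runs a plain SQL string.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.List;

public class NaiveLoader {
    // Hypothetical JDBC URL of a remote (cloud) database and placeholder credentials.
    private static final String URL = "jdbc:leanxcale://remote-db.example.com:1522/loans";

    public static void load(List<String[]> rows) throws Exception {
        for (String[] row : rows) {
            // A new connection is opened and closed for every single row: very expensive.
            try (Connection conn = DriverManager.getConnection(URL, "user", "password");
                 Statement stmt = conn.createStatement()) {
                // Plain, unprepared SQL built by string concatenation.
                stmt.executeUpdate("INSERT INTO loan (id, amount) VALUES ("
                        + row[0] + ", " + row[1] + ")");
            }
        }
    }
}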
Ingesting Data: First Naïve Approach
Test 1: Loader Running in your Laptop
• This first naïve approach has achieved:

[Chart: Insertion Rate (rows/s) vs. Test Number for test 1 (Naïve approach); y-axis scale up to 13 rows/s]
Ingesting Data: Common Errors
Test 1 Error: Loader Not Collocated with the Destination DB
• You run the loader on your laptop in one location and load the data into a cloud DB that is far away, with high latency (say 100 ms to go and 100 ms to come back).
• Why is this a problem?
• Ingesting a row will take 100 ms to reach the cloud; the processing there will be very fast, say 0.1 ms.
• Then, the reply will take another 100 ms to reach your laptop.
• How long did it take to insert a single row?
• 200.1 ms.
• How many rows can you insert per second?
• 1000 ms / 200.1 ms ≈ 5 rows per second
• even if the DB takes 0 ms to insert the row!!!
Ingesting Data: Best Practices
Test 2: Collocating Loader with the Destination DB
• You run the loader in the same cloud as your cloud DB.
• Latency will be quite low, say 0.1 ms.
• Ingesting a row now will take 0.1 ms to reach the DB.
• The processing there will be, say, 0.1 ms.
• Then, the reply will take another 0.1 ms to reach your loader.
• How long did it take to insert a single row?
• 0.3 ms.
• How many rows can you insert per second?
• 1000 ms / 0.3 ms ≈ 3,333 rows per second!!!
Scenario 2
• A client server running the tests, connected through JDBC to a LeanXcale cluster on an m5.2xlarge instance (8 vCPU, 32 GB).
• Dataset: LendingClub data from 2007 to 2018, available on Kaggle: https://www.kaggle.com/wordsforthewise/lending-club
• Code: https://gitlab.com/leanxcale_public/highdataingestion
Ingesting Data
Test 2: Collocating Loader with the Destination DB
• You can find the code of this step in: https://gitlab.com/leanxcale_public/highdataingestion/-/blob/master/src/test/java/com/leanxcale/example/jdbcLowLevel/Test2_Collocated.java
Ingesting Data
Test 2: Collocating Loader with the Destination DB
• By collocating the loader, we achieve:

[Chart: Insertion Rate (rows/s) vs. Test Number for tests 1-2 (Naïve approach, Collocating Loader with DB); y-axis scale up to 25 rows/s]
Ingesting Data: Common Errors
Test 2 Error: Connection per Insert
• You are opening a connection per insert.
• Why is this a problem?
• Establishing a connection is expensive.
• You are paying that high cost for every insert.
Ingesting Data
Test 3: Permanent Connection
• You can find the code of this step in: https://gitlab.com/leanxcale_public/highdataingestion/-/blob/master/src/test/java/com/leanxcale/example/jdbcLowLevel/Test3_PermanentConnection.java
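As a hedged sketch of this step (same hypothetical URL, credentials, and loan table as before, not the repository's actual code), the connection is now opened once, outside the loop, and reused for all inserts:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.List;

public class PermanentConnectionLoader {
    public static void load(List<String[]> rows) throws Exception {
        // The connection (and statement) is established once and reused for every insert.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:leanxcale://db-host:1522/loans", "user", "password");
             Statement stmt = conn.createStatement()) {
            for (String[] row : rows) {
                // Still an unprepared SQL string: this is the next problem to fix.
                stmt.executeUpdate("INSERT INTO loan (id, amount) VALUES ("
                        + row[0] + ", " + row[1] + ")");
            }
        }
    }
}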
Ingesting Data
Test 3: Permanent Connection
• By keeping a permanent connection, we achieve:

[Chart: Insertion Rate (rows/s) vs. Test Number for tests 1-3 (up to Permanent Connection); y-axis scale up to 100 rows/s]
Ingesting Data: Common Errors
Test 3 Error: Unprepared SQL Statements
• You just execute plain (unprepared) SQL statements.
• Why is this a problem?
• When you execute an SQL statement (in this case, an insert), the server has to do some processing:
• SQL compilation: the string with the SQL is compiled.
• SQL optimization: the compiled SQL (typically an abstract syntax tree) goes through a series of transformations into a query plan (a tree of algebraic query operators), which is then optimized (rewritten into a form that produces the same result but is more efficient to compute).
• All these processes are extremely expensive compared to the cost of executing the insert itself.
Ingesting Data: Best Practices
Test 4: Prepared SQL Statements
• First you prepare the SQL statement.
• Then, you execute it.
• When you prepare the SQL statement, it gets compiled and optimized as before (SQL compilation and SQL optimization), but it is assigned an identifier that is returned to the JDBC driver.
• When you execute a prepared statement, you simply send the prepared statement id; the server uses it to recover the saved optimized query plan and simply executes it, saving the expensive SQL compilation and query optimization steps (see the sketch below).
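A minimal sketch of this idea with standard JDBC (hypothetical URL, credentials, and loan table): the statement is prepared once with ? placeholders, and each execution only binds the values.

import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class PreparedStatementLoader {
    public static void load(List<String[]> rows) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:leanxcale://db-host:1522/loans", "user", "password");
             // Compiled and optimized once; the server keeps the query plan under an id.
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO loan (id, amount) VALUES (?, ?)")) {
            for (String[] row : rows) {
                // Each execution only binds values and references the prepared statement id.
                ps.setLong(1, Long.parseLong(row[0]));
                ps.setBigDecimal(2, new BigDecimal(row[1]));
                ps.executeUpdate();
            }
        }
    }
}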
Ingesting Data
Test 4: Prepared Statements
• You can find the code of this step in: https://gitlab.com/leanxcale_public/highdataingestion/-/blob/master/src/test/java/com/leanxcale/example/jdbcLowLevel/Test4_PreparedStatements.java
Ingesting Data
Test 4: Prepared statements
• By preparing the SQL statements, we achieve:

[Chart: Insertion Rate (rows/s) vs. Test Number for tests 1-4 (up to Prepared Statements); y-axis scale up to 800 rows/s]
Ingesting Data: Common Errors
Test 4 Error: Inserting Rows One by One
• You insert one row at a time.
• Why is this a problem?
• The cost of inserting a row has two components: one constant and one variable.
• The constant cost is what it would cost to insert an empty tuple.
• You still have to create a message with its header, serialize it, send it through the network, receive it, deserialize it, execute it, and commit it.
• All of this cost is paid for each individual row.
• Cost per row = (constant cost + variable cost) / 1, i.e., every row pays the full constant cost.
Ingesting Data: Best Practices
Test 5: Batching
• By batching many inserts together in a single transaction and message, the constant cost is amortized across many rows, making it negligible.
• In the end, you only pay the variable cost per row.
• If the batch is 1,000 rows, the constant cost per row is reduced by a factor of 1,000, since it is amortized among all of them: cost per row ≈ constant cost / 1,000 + variable cost (see the sketch below).
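A hedged sketch using standard JDBC batching (batch size, URL, credentials, and loan table are illustrative assumptions): rows are accumulated with addBatch() and sent in groups, committing once per batch rather than once per row.

import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class BatchingLoader {
    private static final int BATCH_SIZE = 1000; // illustrative batch size

    public static void load(List<String[]> rows) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:leanxcale://db-host:1522/loans", "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO loan (id, amount) VALUES (?, ?)")) {
            conn.setAutoCommit(false); // one transaction per batch, not per row
            int count = 0;
            for (String[] row : rows) {
                ps.setLong(1, Long.parseLong(row[0]));
                ps.setBigDecimal(2, new BigDecimal(row[1]));
                ps.addBatch();
                if (++count % BATCH_SIZE == 0) {
                    ps.executeBatch(); // the constant cost is amortized over the whole batch
                    conn.commit();
                }
            }
            ps.executeBatch(); // flush the last, possibly partial batch
            conn.commit();
        }
    }
}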
Ingesting Data
Test 5: Batching
• You can find the code of this step in: https://gitlab.com/leanxcale_public/highdataingestion/-/blob/master/src/test/java/com/leanxcale/example/jdbcLowLevel/Test5_Batching.java
Ingesting Data
Test 5: Batching
• By batching inserts, we achieve:

[Chart: Insertion Rate (rows/s) vs. Test Number for tests 1-5 (up to Batching); y-axis scale up to 30,000 rows/s]
Ingesting Data: Common Errors
Test 5 Error: Single Threaded/Machine Loader
• Your loader has a single thread that inserts rows.
• Why is this a problem?
• The throughput of a single thread is determined by the latency of each individual operation, say 10 ms.
• If you have a powerful database, you will never take advantage of it, because the single-threaded loader can do at most 1/latency = 1000 ms / 10 ms = 100 transactions per second, even if your distributed database can do 100 million transactions per second.
Ingesting Data: Best Practices
Test 6: Multi-threaded Loader
• By using multiple threads, you can saturate your distributed DB and reach its maximum throughput.
• But be careful: a single machine may not be enough to saturate the DB (use its full capacity), and you might need multiple machines/nodes, each running a multithreaded loader.
• How can you know whether you will need more than one machine/node? When the CPU usage on the loader machine reaches nearly 100% and the DB is still not saturated.
• Basically, the load you can ingest with n threads is n times that of a single thread, and with m machines/nodes it is m*n times that of a single thread (see the sketch below).
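A hedged sketch of a multi-threaded loader (thread count, URL, credentials, and loan table are illustrative assumptions): the input is split into chunks, and each worker thread uses its own connection, prepared statement, and batches.

import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class MultiThreadedLoader {
    // Illustrative thread count; tune it until the DB, not the loader CPU, is the bottleneck.
    private static final int THREADS = 16;

    public static void load(List<String[]> rows) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        int chunk = (rows.size() + THREADS - 1) / THREADS;
        for (int t = 0; t < THREADS; t++) {
            List<String[]> slice = rows.subList(
                    Math.min(t * chunk, rows.size()),
                    Math.min((t + 1) * chunk, rows.size()));
            pool.submit(() -> {
                // Each thread owns its connection; JDBC connections are not shared across threads.
                try (Connection conn = DriverManager.getConnection(
                        "jdbc:leanxcale://db-host:1522/loans", "user", "password");
                     PreparedStatement ps = conn.prepareStatement(
                             "INSERT INTO loan (id, amount) VALUES (?, ?)")) {
                    conn.setAutoCommit(false);
                    for (String[] row : slice) {
                        ps.setLong(1, Long.parseLong(row[0]));
                        ps.setBigDecimal(2, new BigDecimal(row[1]));
                        ps.addBatch();
                    }
                    ps.executeBatch(); // batching is kept inside each thread
                    conn.commit();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}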
Ingesting Data
Test 6: Multi-threaded Loader
• You can find the code of this step in: https://gitlab.com/leanxcale_public/highdataingestion/-/blob/master/src/test/java/com/leanxcale/example/jdbcLowLevel/Test6_Multithreader.java
Ingesting Data
Test 6: Multi-threaded Loader
• By using a multi-threaded loader, we achieve:

[Chart: Insertion Rate (rows/s) vs. Test Number for tests 1-6 (up to Multi-threaded Loader); y-axis scale up to 140,000 rows/s]
Ingesting Data: Common Errors
Test 6 Error: Insertion Degrades with Table Size
• As the table grows, inserting rows gets slower and slower.
• Why?
• This is because data is stored in a B+ tree and rows are inserted into its leaf nodes.
• As more data is inserted, the tree gets bigger, with more levels, and the cache becomes more and more ineffective.
• Thus, each insertion costs more and more IOs to reach the leaf node where the row must be inserted.
Ingesting Data: Best Practices
Test 7: LeanXcale Auto-Split (bidimensional partitioning)
• LeanXcale is able to partition data not only on the primary key, but also on other dimensions such as time.
• Typically, it is historical data that grows the most, and historical data always has a timestamp associated with each row.
• So, by enabling auto-split on the timestamp column, each table fragment can be kept small enough to ingest data very efficiently.
• In this way, the ingestion speed remains constant instead of degrading over time.
Ingesting Data
Test 7: LeanXcale Auto-Split
• You can find the code of this step in: https://gitlab.com/leanxcale_public/highdataingestion/-/blob/master/src/test/java/com/leanxcale/example/jdbcLowLevel/Test7_Autosplit.java
Ingesting Data
Test 7: LeanXcale Auto-Split
• By using auto-split, we achieve:

[Chart: Insertion Rate (rows/s) vs. Test Number for tests 1-7 (up to Auto-split); y-axis scale up to 140,000 rows/s]
Ingesting Data: Common Errors
Test 7 Error: Single DB Server Bottleneck
• At some point your ingestion is optimal and saturates the DB server.
• Why is this a problem?
• You cannot ingest more data with a single server.
Scenario 3
• A client server running the tests, connected through JDBC to a LeanXcale cluster of 6 m5.2xlarge instances (6 * 8 vCPU = 48 vCPU, 6 * 32 GB = 192 GB).
Ingesting Data: Best Practices
Test 8: Horizontal Scalability
• By using multiple instances of the DB server, you can scale horizontally and multiply the insertion throughput by n.
Ingesting Data
Test 8: Horizontal Scalability
• You can find the code of this step in: https://gitlab.com/leanxcale_public/highdataingestion/-/blob/master/src/test/java/com/leanxcale/example/jdbcLowLevel/Test8_HorizontalScalability.java
Ingesting Data
Test 8: Horizontal Scalability
• By scaling horizontally, we achieve:

[Chart: Insertion Rate (rows/s) vs. Test Number for tests 1-8 (up to Horizontal Scalability); y-axis scale up to 600,000 rows/s]
Ingesting Data: Common Errors
Test 8 Error: Cost of Queries with Distribution by Hashing
• Hashing is a convenient way to distribute the load.
• Why is this a problem?
• Because it imposes a strong tradeoff: queries become slower, since they have to be sent to all servers.
Ingesting Data: Best Practices
Test 9: Primary Key Distribution
• By using the primary key to distribute the data, queries on the primary key become efficient, because they target a single server.
Ingesting Data
Test 9: Primary Key Distribution
• You can find the code of this step in: https://gitlab.com/leanxcale_public/highdataingestion/-/blob/master/src/test/java/com/leanxcale/example/jdbcLowLevel/Test9_PrimaryKeyDistribution.java
Ingesting Data
Test 9: Primary Key Distribution
• By using the primary key to distribute the data, we achieve:

[Chart: Insertion Rate (rows/s) vs. Test Number for tests 1-9 (up to Primary Key Distribution); y-axis scale up to 600,000 rows/s]
Ingesting Data: Common Errors
Test 9 Error: Cost of JDBC
• JDBC is standard.
• Why is this a problem?
• JDBC is not optimal for ingesting data.
Scenario 4
• A client server running the tests, connected through the KiVi direct interface to a LeanXcale cluster of 6 m5.2xlarge instances (6 * 8 vCPU = 48 vCPU, 6 * 32 GB = 192 GB).
Ingesting Data: Best Practices
Test 10: Native KiVi Interface
• By using the native KiVi interface, the overhead of JDBC can be avoided, making ingestion more efficient.
Ingesting Data
Test 10: Native KiVi Interface
• You can find the code of this step in: https://gitlab.com/leanxcale_public/highdataingestion/-/blob/master/src/test/java/com/leanxcale/example/jdbcLowLevel/Test10_NativeKiviApi.java
Ingesting Data
Test 10: Native KiVi interface
• By using the native KiVi interface, we achieve:

[Chart: Insertion Rate (rows/s) vs. Test Number for tests 1-10 (up to the KiVi API); y-axis scale up to 2,500,000 rows/s]
What have we learnt today?
To sum up, the performance of SQL databases depends on the quality of the application's code. This is especially critical when dealing with high-performance distributed databases such as LeanXcale: since these databases are more capable, every detail matters to get the best result.
The proposed steps were:

• Test 1: Loader Running in your Laptop
• Test 2: Collocating Loader with the Destination DB
• Test 3: Permanent Connection
• Test 4: Prepared SQL Statements
• Test 5: Batching
• Test 6: Multi-threaded Loader
• Test 7: LeanXcale Auto-Split (bidimensional partitioning)
• Test 8: Horizontal Scalability
• Test 9: Primary Key Distribution
• Test 10: Native KiVi Interface
Links to the course materials

Gitlab containing this code:


https://gitlab.com/leanxcale_public/highdataingestion

You can request a trial to test this code at https://www.leanxcale.com/trial


The trial LeanXcale server runs in the AWS Virginia data center.
Want to join the LeanXcale family?

We are looking for these profiles:

• Java Back-End Engineer


• React and PHP Front-End Engineer
• Senior C Software Engineer
• Java Delivery Engineer

For more information, access all open positions at


https://www.leanxcale.com/careers-list
info@leanxcale.com

www.LeanXcale.com
@LeanXcale
