
DuckDB and Its Benchmarks

DuckDB is an open-source OLAP database designed for analytical data management.


In an in-process database, the engine resides within the application, so data can be transferred within the same memory address space. This
eliminates the need to copy large amounts of data over sockets, resulting in improved performance.
DuckDB contains a columnar-vectorized query execution engine: queries are still interpreted, but a large batch of values (a
"vector") is processed in one operation.
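As a minimal sketch of what in-process means in practice, the snippet below uses DuckDB's Python API to query a pandas DataFrame that lives in the application's own memory; the table and column names are illustrative.

import duckdb
import pandas as pd

# The engine runs inside this Python process, so it can scan the
# DataFrame directly from application memory; nothing is copied over
# a socket to a separate database server.
df = pd.DataFrame({"x": range(1_000_000)})

con = duckdb.connect()      # transient in-memory, in-process database
con.register("events", df)  # expose the DataFrame as a table named "events"
print(con.execute("SELECT avg(x) FROM events").fetchone())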

“If your data fits in memory there is no advantage to putting it in a database: it will only be slower and more frustrating.”

What are the aspects that make databases slow and frustrating?
When conducting data analysis, moving large volumes of data from an application into a database, or extracting it from a database into
an analysis environment like R or Python, can be painfully slow.

Comparison: DuckDB vs. Traditional DBMSs

Columnar Storage
DuckDB: uses a columnar storage format that reduces I/O and improves compression, enhancing analytical query performance.
Traditional DBMSs: systems like PostgreSQL and MySQL often use row-based storage, which can be less efficient for analytical queries that involve aggregations and scans.

In-Memory Processing
DuckDB: primarily operates in memory, minimizing disk I/O and improving query performance.
Traditional DBMSs: systems like PostgreSQL use a combination of in-memory and on-disk storage, but may frequently need to access data from disk, causing longer query execution times.

Analytical Performance
DuckDB: optimized for analytical workloads and offers exceptional query performance, especially for complex analytical queries.
Traditional DBMSs: systems like MySQL and PostgreSQL are designed for a broader range of use cases and may not offer the same level of performance for analytical workloads.

Vectorized Query Processing
DuckDB: employs vectorized query processing, executing each operation on a batch of values at once, which makes it highly efficient for analytical workloads.
Traditional DBMSs: many systems like MySQL and PostgreSQL use row-based processing. For an analytical query such as calculating the average age of all customers, row-based processing scans the entire table row by row, which can be inefficient.

Embedded Database
DuckDB: often used as an embedded database, meaning it can be included within applications and doesn't require a separate server or process to run. This makes it suitable for applications where lightweight, embedded databases are needed.
Traditional DBMSs: a system like MySQL typically runs as a separate server process that communicates with the application over a network or through local connections. This client-server architecture often demands more resources and setup, making it less suitable for lightweight, embedded use cases where the database is tightly coupled with the application.
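As a minimal sketch of the embedded model (the file and table names here are illustrative), opening a database is just a library call, with no server to install or start:

import duckdb

# No separate server process: the engine is linked into the application.
# Connecting to a file path creates or opens a persistent database file;
# omitting the path gives a transient in-memory database instead.
con = duckdb.connect("analytics.duckdb")
con.execute("CREATE TABLE IF NOT EXISTS metrics(ts TIMESTAMP, value DOUBLE)")
con.execute("INSERT INTO metrics VALUES (now(), 1.5)")
print(con.execute("SELECT count(*) FROM metrics").fetchone())
con.close()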
Overview of DuckDB Internals
Parser
The parser converts a query string into the following tokens:

SQLStatement

QueryNode

TableRef

ParsedExpression

The parser only transforms a query string into this set of tokens; names and types are not yet resolved at this stage (that happens later, when the statements are bound).


1. ParsedExpression

The ParsedExpression represents an expression within a SQL statement. This can be e.g. a reference to a column, an addition operator
or a constant value.

ParsedExpressions do not have types.

2. TableRef

The TableRef represents any table source. This can be a reference to a base table, but it can also be a join, a table-producing function or a
subquery.

3. QueryNode

The QueryNode represents either (1) a SELECT statement, or (2) a set operation (i.e. UNION , INTERSECT or DIFFERENCE ).

4. SQLStatement

The SQLStatement represents a complete SQL statement (e.g. a SELECT or INSERT) and is the root of the parse tree; a single query string can contain several statements.

Logical Planner

The logical planner creates LogicalOperator nodes (see https://github.com/duckdb/duckdb/blob/main/src/include/duckdb/planner/logical_operator.hpp) from the bound
statements. In this phase, the actual logical query tree is created.

Optimizer:
After the logical planner has created the logical query tree, the optimizers are run over that query tree to create an optimized query plan.

The following query optimizers are run:

1) Expression Rewriter:

This optimizer simplifies expressions and performs constant folding to reduce the complexity of the query.

Example: Suppose you have a query with an expression like SELECT 2 + 3 * 4 , the expression rewriter can simplify it to SELECT 14 .
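One way to observe this, sketched below with DuckDB's Python API (the exact plan text varies between DuckDB versions), is to ask EXPLAIN for the optimized logical plan:

import duckdb

con = duckdb.connect()
# Show the optimized logical plan instead of the default physical plan.
con.execute("PRAGMA explain_output='optimized_only'")
# After constant folding, the projection contains the literal 14 rather
# than the expression 2 + 3 * 4.
print(con.execute("EXPLAIN SELECT 2 + 3 * 4").fetchall()[0][1])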

2) Filter Pushdown:

The filter pushdown optimizer moves filters as close to the data source as possible and prunes unnecessary branches early in the query plan.

Example: In SELECT * FROM Customers c JOIN Orders o ON c.id = o.customer_id WHERE c.country = 'NL', the filter on country can be applied while scanning Customers, before the join, shrinking the intermediate result.

3) Join Order Optimizer:

This optimizer reorders join operations to minimize the overall cost of the query plan.

Example: Imagine a query that joins three tables: Customers, Orders, and Products. The join order optimizer may choose to start with the
smallest table (Products) and join it with Orders first before joining with Customers, reducing the intermediate result size.

4) Common Subexpressions:

Common subexpressions optimizer identifies and extracts repeated subexpressions in the query plan to avoid redundant computations.

Example: If a query involves multiple calculations of the same value, like SELECT (x + y) * (x + y), the optimizer can recognize that (x + y) is computed twice, evaluate it once, and reuse the result to avoid redundant calculations.

5) IN Clause Rewriter:

This optimizer rewrites large static IN clauses to more efficient join operations.

Example: If you have a query with a large IN clause like

SELECT * FROM Products WHERE ProductID IN (1, 2, 3, ..., 1000) ,

the optimizer may rewrite it as a join operation, making it more efficient:

SELECT * FROM Products p INNER JOIN (VALUES (1), (2), (3), ..., (1000)) AS vals(id) ON p.ProductID = vals.id .

Physical Plan Generator

The physical plan generator converts the resulting logical operator tree into a PhysicalOperator tree.
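As a quick sketch (the table name is illustrative and the plan text is version-dependent), the default EXPLAIN output shows the physical operator tree this phase produces:

import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE t AS SELECT range AS i FROM range(1000)")
# The default EXPLAIN output is the physical plan, e.g. an aggregate
# operator sitting on top of a sequential scan of t.
print(con.execute("EXPLAIN SELECT sum(i) FROM t").fetchall()[0][1])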

Execution

In the execution phase, the physical operators are executed to produce the query result. The execution model is a vectorized volcano
model, where DataChunks are pulled from the root node of the physical operator tree. Each PhysicalOperator defines how it produces its
result: a PhysicalTableScan node pulls chunks from the base tables on disk, whereas a PhysicalHashJoin performs a hash join
between the outputs obtained from its child nodes.
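From the Python client this pull-based model is visible when fetching results in batches, as in the sketch below (the batch size is illustrative; DuckDB's internal vector size defaults to 2048 values):

import duckdb

con = duckdb.connect()
res = con.execute("SELECT i, i * 2 AS doubled FROM range(10000) t(i)")

# Pull the result in batches, mirroring the vectorized volcano model in
# which DataChunks are pulled from the root of the operator tree.
total = 0
while True:
    batch = res.fetchmany(2048)
    if not batch:
        break
    total += len(batch)
print(total)  # 10000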

Benchmarking:

1. Use the TPC-DS data generator to generate 1 GB and 10 GB datasets.
2. Generate the queries and the SQL schema from the TPC-DS data generator.
3. Load the generated data into a DuckDB database with the SQL schema.
4. Run a couple of TPC-DS queries on the loaded database.
5. Automate the query running process with a script and benchmark DuckDB.

Procedure:

1. Generated TPC-DS data using the dsdgen application:

./dsdgen -SCALE 1 -DIR /home/bharat/Documents/tpcds_1gb


./dsdgen -SCALE 10 -DIR /home/bharat/Documents/tpcds_10gb

2. Generated the queries and the SQL schema from the TPC-DS data generator.

3. Loaded the generated data into DuckDB with the SQL schema using a Python script.

4. Ran a couple of TPC-DS queries on the loaded database.

5. Automated the query running process using a Python script and wrote the benchmark results to a CSV file; a sketch of such a script follows.
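A minimal sketch of such a load-and-benchmark script, assuming dsdgen's pipe-delimited .dat output (one file per table), a schema file, and a queries/ directory; the paths and file names are illustrative, not the exact ones used:

import csv
import glob
import time
import duckdb

con = duckdb.connect("tpcds_1gb.duckdb")
con.execute(open("tpcds_schema.sql").read())  # schema generated by the TPC-DS kit

# dsdgen emits one pipe-delimited .dat file per table; depending on the
# kit version, the trailing delimiter may need extra handling.
for path in glob.glob("/home/bharat/Documents/tpcds_1gb/*.dat"):
    table = path.split("/")[-1].removesuffix(".dat")
    con.execute(f"COPY {table} FROM '{path}' (DELIMITER '|')")

# Time each query and write the result to a CSV file.
with open("benchmark_results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["query", "seconds"])
    for qfile in sorted(glob.glob("queries/*.sql")):
        start = time.perf_counter()
        con.execute(open(qfile).read()).fetchall()
        writer.writerow([qfile, round(time.perf_counter() - start, 3)])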

Results:

Results for DuckDB benchmarks on TPC-DS 1 GB and 10 GB datasets:


Time taken by the executor for the 1 GB dataset: 2.766 s
Time taken by the executor for the 10 GB dataset: 21.182 s

See: Execution_Results_Duckdb_TPCDS_1GB.xlsx, Execution_results_DuckDB_TPCDS_10GB.xlsx
