DuckDB Benchmarking
“If your data fits in memory there is no advantage to putting it in a database: it will only be slower and more frustrating.”
What are the aspects that make databases slow and frustrating?
When conducting data analysis, moving large volumes of data from an application into a database, or extracting it from a database into an analysis environment like R or Python, can be painfully slow.
Analytical Performance
DuckDB is optimized for analytical workloads and offers exceptional query performance, especially for complex analytical queries. Traditional DBMSs, like MySQL and PostgreSQL, are designed for a broader range of use cases and may not offer the same level of performance for analytical workloads.
Vectorized Query Processing
DuckDB employs vectorized query processing, which allows for parallel execution of operations on data, making it highly efficient for analytical workloads. Many traditional DBMSs, like MySQL and PostgreSQL, use row-based processing; when performing analytical queries, such as calculating the average age of all customers, row-based processing requires scanning the entire table, row by row, which can be inefficient.
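To make the row-based vs. columnar contrast concrete, here is a purely illustrative Python sketch (the table and field names are made up; this is not DuckDB's engine) of computing an average age both ways:

```python
# Purely illustrative sketch: computing AVG(age) row-at-a-time
# versus over a contiguous column.
rows = [{"id": i, "name": f"c{i}", "age": 20 + i % 50} for i in range(1000)]

def avg_age_rows(rows):
    # Row-based: every full row is touched just to read one field.
    total = 0
    for row in rows:
        total += row["age"]
    return total / len(rows)

# Columnar layout: the age values sit together in one flat vector.
ages = [row["age"] for row in rows]

def avg_age_column(ages):
    # The aggregate becomes a tight loop over a single column.
    return sum(ages) / len(ages)

assert avg_age_rows(rows) == avg_age_column(ages)
```

In a real columnar engine the per-column loop is also cache-friendly and easy to vectorize, which is where much of the analytical speedup comes from.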
SQLStatement
QueryNode
TableRef
ParsedExpression
1. ParsedExpression
The ParsedExpression represents an expression within a SQL statement. This can be e.g. a reference to a column, an addition operator or a constant value.
2. TableRef
The TableRef represents any table source. This can be a reference to a base table, but it can also be a join, a table-producing function or a
subquery.
3. QueryNode
The QueryNode represents either (1) a SELECT statement, or (2) a set operation (i.e. UNION , INTERSECT or DIFFERENCE ).
4. SQLStatement
The SQLStatement is the root of the parse tree and represents a single complete SQL statement, such as a SELECT or INSERT statement.
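How these four classes could relate can be sketched with hypothetical Python dataclasses (the names mirror the DuckDB classes, but the fields are illustrative, not DuckDB's actual C++ definitions):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ParsedExpression:          # e.g. column ref, operator, constant
    kind: str                    # "column" | "constant" | "operator"
    value: object = None
    children: List["ParsedExpression"] = field(default_factory=list)

@dataclass
class TableRef:                  # base table, join, function or subquery
    kind: str                    # "base_table" | "join" | "subquery" | ...
    name: Optional[str] = None

@dataclass
class QueryNode:                 # a SELECT node or a set operation
    select_list: List[ParsedExpression]
    from_table: TableRef

@dataclass
class SQLStatement:              # root: one complete parsed statement
    node: QueryNode

# Roughly: SELECT age + 1 FROM customers
stmt = SQLStatement(QueryNode(
    select_list=[ParsedExpression("operator", "+", [
        ParsedExpression("column", "age"),
        ParsedExpression("constant", 1)])],
    from_table=TableRef("base_table", "customers")))
```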
Logical Planner:
The logical planner converts the parsed statement into a tree of logical operators, the logical query plan.
Optimizer:
After the logical planner has created the logical query tree, the optimizers are run over that query tree to create an optimized query plan.
1) Expression Rewriter:
This optimizer simplifies expressions and performs constant folding to reduce the complexity of the query.
Example: Suppose you have a query with an expression like SELECT 2 + 3 * 4 , the expression rewriter can simplify it to SELECT 14 .
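A toy constant folder can be written over Python's ast module to mimic this rewrite; only literal arithmetic is handled, and the code is a sketch of the idea rather than DuckDB's implementation:

```python
import ast
import operator

# Toy constant folder, mimicking the rewrite of SELECT 2 + 3 * 4
# into SELECT 14. Only literal arithmetic expressions are handled.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def fold(expr: str):
    def walk(node):
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant):
            return node.value
        raise ValueError("not a constant expression")
    return walk(ast.parse(expr, mode="eval").body)

print(fold("2 + 3 * 4"))  # → 14
```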
2) Filter Pushdown:
Filter pushdown optimizer moves filters as close to the data source as possible and prunes unnecessary branches early in the query plan.
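A minimal sketch of the effect, using made-up in-memory tables: applying a filter at the scan, before the join, shrinks the intermediate result by an order of magnitude:

```python
# Sketch of why pushing a filter below a join helps. Tables, sizes
# and the "region" predicate are all made up for illustration.
orders    = [{"id": i, "cust": i % 100} for i in range(1000)]
customers = [{"cust": c, "region": "EU" if c < 10 else "US"} for c in range(100)]

# Plan A: join first, filter on region afterwards.
joined = [(o, c) for o in orders for c in customers if o["cust"] == c["cust"]]
plan_a = [p for p in joined if p[1]["region"] == "EU"]

# Plan B: push the filter down to the customers scan, then join.
eu = [c for c in customers if c["region"] == "EU"]     # 10 rows instead of 100
plan_b = [(o, c) for o in orders for c in eu if o["cust"] == c["cust"]]

print(len(joined), len(plan_b))  # intermediate result: 1000 rows vs 100 rows
```

Both plans produce the same answer; the pushed-down plan simply materializes far fewer intermediate rows.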
3) Join Order Optimizer:
This optimizer reorders join operations to minimize the overall cost of the query plan.
Example: Imagine a query that joins three tables: Customers, Orders, and Products. The join order optimizer may choose to start with the
smallest table (Products) and join it with Orders first before joining with Customers, reducing the intermediate result size.
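A greedy sketch of that idea, with made-up cardinalities and join edges for the three tables above (real optimizers also use selectivity estimates, not just table sizes):

```python
# Toy join-order heuristic: start from the smallest table and grow the
# join along available join edges. Cardinalities are made up.
tables = {"Customers": 1_000_000, "Orders": 5_000_000, "Products": 10_000}
edges = {("Products", "Orders"), ("Orders", "Customers")}

def connected(a, b):
    return (a, b) in edges or (b, a) in edges

def greedy_join_order(tables):
    remaining = dict(tables)
    order = [min(remaining, key=remaining.get)]   # smallest table first
    remaining.pop(order[0])
    while remaining:
        # Among tables joinable to the current prefix, pick the smallest.
        candidates = [t for t in remaining if any(connected(t, p) for p in order)]
        nxt = min(candidates, key=remaining.get)
        order.append(nxt)
        remaining.pop(nxt)
    return order

print(greedy_join_order(tables))  # → ['Products', 'Orders', 'Customers']
```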
4) Common Subexpressions:
This optimizer identifies and extracts repeated subexpressions in the query plan to avoid redundant computations.
Example: If a query involves multiple calculations of the same value, like SELECT (x + y) * (x + y) , the optimizer can recognize that (x
+ y) is computed twice and store it in a temporary variable to avoid redundant calculations.
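The effect can be illustrated by counting how often the shared subexpression is evaluated (a pure-Python sketch, not the actual optimizer):

```python
# Toy illustration of common subexpression elimination for
# SELECT (x + y) * (x + y): count evaluations of the subexpression.
calls = 0

def add(x, y):
    global calls
    calls += 1
    return x + y

# Naive evaluation: (x + y) is computed twice.
calls = 0
naive = add(3, 4) * add(3, 4)
naive_calls = calls

# With CSE: x + y is stored in a temporary and reused.
calls = 0
t = add(3, 4)
cse = t * t
cse_calls = calls

print(naive, naive_calls, cse_calls)  # → 49 2 1
```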
5) IN Clause Rewriter:
This optimizer rewrites large static IN clauses to more efficient join operations.
Example: a filter like ProductID IN (1, 2, 3, ..., 1000) can be rewritten as SELECT * FROM Products p INNER JOIN (VALUES (1), (2), (3), ..., (1000)) AS vals(id) ON p.ProductID = vals.id .
Physical Plan Generator:
The physical plan generator converts the resulting logical operator tree into a PhysicalOperator tree.
Execution
In the execution phase, the physical operators are executed to produce the query result. The execution model is a vectorized volcano model, where DataChunks are pulled from the root node of the physical operator tree. Each PhysicalOperator itself defines how it produces its result. A PhysicalTableScan node will pull the chunk from the base tables on disk, whereas a PhysicalHashJoin will perform a hash join between the output obtained from its child nodes.
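A stripped-down pull-based pipeline in Python illustrates the model; the class names echo DuckDB's physical operators, but the implementation is purely illustrative:

```python
# Minimal pull-based ("vectorized volcano") pipeline sketch: each operator
# hands back a chunk of rows (here: a plain list) when its parent pulls.
CHUNK_SIZE = 4

class PhysicalTableScan:
    def __init__(self, rows):
        self.rows, self.pos = rows, 0
    def get_chunk(self):
        chunk = self.rows[self.pos:self.pos + CHUNK_SIZE]
        self.pos += CHUNK_SIZE
        return chunk or None          # None signals exhaustion

class PhysicalFilter:
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def get_chunk(self):
        # Keep pulling from the child until a chunk survives the filter.
        while (chunk := self.child.get_chunk()) is not None:
            kept = [r for r in chunk if self.predicate(r)]
            if kept:
                return kept
        return None

# Pull chunks from the root until the source is exhausted.
root = PhysicalFilter(PhysicalTableScan(list(range(10))), lambda r: r % 2 == 0)
result = []
while (chunk := root.get_chunk()) is not None:
    result.extend(chunk)
print(result)  # → [0, 2, 4, 6, 8]
```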
Benchmarking:
1. Generate the data using the tpcds data generator.
2. Generate the queries for the data and sql schema from the tpcds data generator.
3. Load the generated data into the DuckDB database with the sql schema.
4. Run a couple of tpcds queries on the loaded database.
5. Using a script, automate the query running process and benchmark DuckDB.
Procedure:
1. Generated the data using the tpcds data generator.
2. Generated the queries for the data and sql schema from the tpcds data generator.
3. Loaded the generated data into DuckDB with the sql schema using a python script.
4. Ran a couple of tpcds queries on the loaded database.
5. Automated the query running process using a python script and loaded the benchmarks into a csv file.
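The driver script can be sketched as follows. The actual runs used DuckDB and the generated TPC-DS queries; here Python's built-in sqlite3 and a toy table stand in so the sketch is self-contained (swap in duckdb.connect() and the real queries for the actual benchmark):

```python
import csv, io, sqlite3, time

# Sketch of the benchmark driver: time each query and write rows to CSV.
# sqlite3 and a made-up table stand in for DuckDB and the TPC-DS data.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (id INTEGER, price REAL)")
con.executemany("INSERT INTO items VALUES (?, ?)",
                [(i, i * 0.5) for i in range(1000)])

queries = {
    "q1": "SELECT COUNT(*) FROM items",
    "q2": "SELECT AVG(price) FROM items WHERE id % 2 = 0",
}

out = io.StringIO()                      # a real script would open a file
writer = csv.writer(out)
writer.writerow(["query", "seconds"])
for name, sql in queries.items():
    start = time.perf_counter()
    con.execute(sql).fetchall()          # run the query, materialize result
    writer.writerow([name, f"{time.perf_counter() - start:.6f}"])

print(out.getvalue())
```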
Results:
Execution_Results_Duckdb_TPCDS_1GB.xlsx Execution_results_DuckDB_TPCDS_10GB.xlsx