CrateDB Vs PostgreSQL - Benchmark
Overview
CrateDB is a distributed SQL database that specializes in Industrial IoT use cases, i.e. applications
involving huge data volumes and multiple data types coming in from different sources. It provides
real-time responses while ingesting data from up to thousands of different sensors, allowing users
to visualize, monitor, and control complex operations in real time.
Built on top of a NoSQL foundation, CrateDB makes querying and analyzing machine data
easily accessible. To show how its performance compares to PostgreSQL, a benchmark of
both databases was carried out, simulating an application that collects time-series data
from multiple customer sites and stores it in a multi-tenant database. The data set
contained 314,496,000 records.
The results can be summarized as follows: CrateDB's query response time was up to 22x faster
than PostgreSQL's, and CrateDB ran on hardware that cost 30% less.
How these results were obtained is presented in detail in the following pages, including
information about the benchmark use case and the schema and cluster setup. An appendix is
included, discussing how to partition data in CrateDB and PostgreSQL.
This benchmark simulates a common scenario when dealing with large amounts of numerical
sensor data: aggregations. Aggregations, like averaging and totaling, are everyday queries, useful
in real-time analysis of sensor data.
Schema setup
To hold the data, tables were defined in both CrateDB and PostgreSQL.
CrateDB schema
CREATE ANALYZER "tree" (
TOKENIZER tree WITH (
type = 'path_hierarchy',
delimiter = ':'
)
);
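The CrateDB table definition itself is not reproduced in this document; a plausible sketch, mirroring the PostgreSQL schema below, with the taxonomy column full-text indexed using the "tree" analyzer and the table partitioned by week natively:

```sql
-- Sketch only: column names mirror the PostgreSQL schema shown below.
-- taxonomy is full-text indexed with the "tree" analyzer defined above;
-- PARTITIONED BY gives CrateDB native weekly partitioning.
CREATE TABLE IF NOT EXISTS t1 (
    "uuid" STRING,
    "ts" TIMESTAMP,
    "tenant_id" INTEGER,
    "sensor_id" STRING,
    "sensor_type" STRING,
    "v1" INTEGER,
    "v2" INTEGER,
    "v3" FLOAT,
    "v4" FLOAT,
    "v5" BOOLEAN,
    "week_generated" TIMESTAMP,
    "taxonomy" STRING INDEX USING FULLTEXT WITH (analyzer = 'tree')
) PARTITIONED BY ("week_generated");
```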
PostgreSQL schema
CREATE TABLE IF NOT EXISTS t1 (
"uuid" VARCHAR,
"ts" TIMESTAMP,
"tenant_id" INT4,
"sensor_id" VARCHAR,
"sensor_type" VARCHAR,
"v1" INT4,
"v2" INT4,
"v3" FLOAT4,
"v4" FLOAT4,
"v5" BOOL,
"week_generated" TIMESTAMP,
"taxonomy" ltree
);
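Note that the ltree column type is provided by PostgreSQL's ltree contrib module, which must be enabled once per database before the table above can be created:

```sql
-- Enable hierarchical-label support (required for the taxonomy column)
CREATE EXTENSION IF NOT EXISTS ltree;
```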
Differences
There are two main differences between the schemas: table partitioning and text searches.
CrateDB partitions tables natively via a PARTITIONED BY clause and handles text search
through analyzers, whereas PostgreSQL 9.6 relies on inheritance-based partitioning (see the
appendix) and the ltree extension for hierarchical text matching.
Queries
The benchmark simulated multiple users querying the 314-million-row database concurrently.
Three different SQL queries were used. The query parameters were randomly generated to
ensure the queries produced different responses, as they would in real life. Except for the
full-text query, the SQL syntax was the same for both CrateDB and PostgreSQL.
Query #1
Data from a single tenant, aggregated over a single week:
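The exact query text was lost from this copy of the document; a hedged sketch of what a single-tenant, single-week aggregation might look like against the schema above (the aggregate functions chosen are illustrative, and ? marks the randomized parameters):

```sql
-- Sketch: aggregate one tenant's sensor values over one week.
SELECT tenant_id,
       AVG(v3) AS avg_v3,
       SUM(v1) AS sum_v1,
       COUNT(*) AS row_count
FROM t1
WHERE tenant_id = ?
  AND week_generated = ?
GROUP BY tenant_id;
```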
Query #2
Data aggregated across multiple weeks, grouped and ordered by tenant:
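Again the original query text is missing; a hedged sketch of a multi-week aggregation grouped and ordered by tenant (aggregates and placeholders are illustrative):

```sql
-- Sketch: aggregate across a randomized range of weeks per tenant.
SELECT tenant_id,
       AVG(v3) AS avg_v3,
       COUNT(*) AS row_count
FROM t1
WHERE week_generated BETWEEN ? AND ?
GROUP BY tenant_id
ORDER BY tenant_id;
```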
Query #3
Group by with full text:
PostgreSQL:
SELECT sensor_type, COUNT(*) as sensor_count
FROM t1
WHERE taxonomy <@ ?::ltree AND tenant_id = ?
GROUP BY sensor_type;
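The CrateDB counterpart is not shown in this copy. Given the "tree" analyzer defined earlier, hierarchical matching on the taxonomy column would plausibly go through CrateDB's full-text MATCH predicate instead of ltree's <@ operator; a sketch:

```sql
-- Sketch of the CrateDB form: MATCH queries the full-text index
-- built on taxonomy with the path_hierarchy ("tree") analyzer.
SELECT sensor_type, COUNT(*) AS sensor_count
FROM t1
WHERE MATCH(taxonomy, ?) AND tenant_id = ?
GROUP BY sensor_type;
```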
Cluster setup
We attempted to keep the hardware assigned to CrateDB and PostgreSQL as similar as
possible. This was tricky, however, because CrateDB is a distributed system and
PostgreSQL is not.
- CrateDB v. 1.1.4 was set up with a cluster of 3 equally configured nodes, each one with:
○ 8 CPU cores (Intel Xeon CPU D-1541 @ 2.10GHz)
○ 16GB RAM
○ 32GB SSD
- PostgreSQL v. 9.6 was set up on RHEL with 1 master and 1 replica node, each one with:
○ 16 CPU cores (Intel Xeon CPU E5-2650 v4 @ 2.20GHz)
○ 24GB RAM
○ 256GB SSD
We ran both on the same internal hardware, with no cloud services involved. The total cost of
the PostgreSQL hardware was $9,516, compared with $6,102 for the CrateDB hardware. (Pricing
source here)
Results
A JMeter test harness simulated 30 concurrent users, each connecting to the database and
executing the three test queries in random sequence and with random query parameters. Each
database executed over 4,000 queries during their respective benchmark runs.
For the group-by-with-full-text query, CrateDB averaged 4,863 against PostgreSQL's 108,966,
making CrateDB roughly 22x faster.
Appendix: partitioning data in PostgreSQL
First, a master table must be created, from which all the partitions will inherit:
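In PostgreSQL 9.6 this is done with table inheritance (declarative partitioning arrived later, in version 10). A sketch of the master table, whose column list matches the t1 schema shown earlier:

```sql
-- Master table: holds no rows itself, only defines the shared schema.
-- Column list abbreviated; it matches the t1 definition shown earlier.
CREATE TABLE IF NOT EXISTS t1 (
    "uuid" VARCHAR,
    "ts" TIMESTAMP,
    "tenant_id" INT4,
    "week_generated" TIMESTAMP,
    "taxonomy" ltree
    -- ... remaining sensor columns as in the schema above
);
```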
Then, the required partition tables are created. The scenario discussed in this white paper involved
a year's worth of data partitioned by week, so 52 tables are needed. These tables inherit the
schema of the previously defined master table.
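A sketch of one such weekly partition (the table name and week boundaries are illustrative); the CHECK constraint is what lets the planner exclude partitions that cannot match a query:

```sql
-- One of the 52 weekly partitions; INHERITS copies the master's columns.
CREATE TABLE t1_2017w01 (
    CHECK (week_generated >= DATE '2017-01-02'
       AND week_generated <  DATE '2017-01-09')
) INHERITS (t1);
```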
An index on each of these tables can then be created, for the column being used to partition.
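For the partition above, that would look like (index name is illustrative):

```sql
-- Index the partitioning column on each child table
CREATE INDEX t1_2017w01_week_idx ON t1_2017w01 (week_generated);
```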
The partitioning logic is then defined in a function, applying it to the master table.
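A sketch of such a routing trigger, shortened to two weeks; the real function needs one branch per partition, which is why the full schema runs to 300+ lines:

```sql
-- Routing trigger: rows inserted into the master table are redirected
-- to the matching weekly partition.
CREATE OR REPLACE FUNCTION t1_insert_trigger()
RETURNS TRIGGER AS $$
BEGIN
    IF NEW.week_generated >= DATE '2017-01-02' AND
       NEW.week_generated <  DATE '2017-01-09' THEN
        INSERT INTO t1_2017w01 VALUES (NEW.*);
    ELSIF NEW.week_generated >= DATE '2017-01-09' AND
          NEW.week_generated <  DATE '2017-01-16' THEN
        INSERT INTO t1_2017w02 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'week_generated out of range';
    END IF;
    RETURN NULL;  -- row was routed; do not also insert into the master
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER t1_insert
    BEFORE INSERT ON t1
    FOR EACH ROW EXECUTE PROCEDURE t1_insert_trigger();
```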
Unfortunately, this leads to a very large schema (300+ lines in this case). It might be
manageable if all partitions could be defined up front, but because dates are non-repeating,
it is impossible to pre-define every partition unless a definite end date is available.