Available online at www.sciencedirect.com
ScienceDirect
Procedia Computer Science 156 (2019) 19–28
www.elsevier.com/locate/procedia

8th International Young Scientist Conference on Computational Science
Evaluation of modern tools and techniques for storing time-series data

Alexey Struckov∗, Semen Yufa, Alexander A. Visheratin, Denis Nasonov

ITMO University, Saint Petersburg, Russia

Abstract

Time series data, together with its analysis and applications, has recently become increasingly important in different areas and domains. Many fields of science and industry rely on storing and processing large amounts of time series – economics and finance, medicine, the Internet of Things, environmental protection, hardware monitoring, and many others. This work presents a theoretical and experimental approach to choosing an appropriate instrument.
© 2019 The Authors. Published by Elsevier Ltd.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the scientific committee of the 8th International Young Scientist Conference on Computational Science.

Keywords: time series; distributed storage; database; cloud services

1. Introduction

Time series data plays a very important role in the modern world. Many fields of science and industry rely on storing and processing large amounts of time series – economics and finance [1], medicine [2], the Internet of Things [6], environmental protection [7], hardware monitoring [5], and many others. It is worth mentioning that time series applications greatly vary in functionality and operating scales. For example, DevOps monitoring tools operate on the data for several previous weeks and need aggregation functionality from the storage [11], whereas meteorological processing utilizes tens of years of data and needs to extract only raw data, since most of the time the processing involves complex and custom-built models [10].

Growing demand for various functional capabilities and processing speed has led to a sharp rise of specialized time series databases (TSDB) [12] and the development of custom TSDB solutions by large companies, e.g. Gorilla [14] by Facebook and Atlas by Netflix. Researchers develop novel approaches for optimizing storage structure [3], internal layout [15], and compression mechanisms [17, 20] to boost the development and adoption of time series processing. By now there exist dozens of time series databases along with classic relational databases and modern in-memory columnar storages. Each of these solutions provides different functionality in terms of available requests, aggregations, and

∗ Corresponding author.
E-mail address: as5423.ru@gmail.com

1877-0509 © 2019 The Authors. Published by Elsevier Ltd.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the scientific committee of the 8th International Young Scientist Conference on Computational Science.
10.1016/j.procs.2019.08.125

supported data types, and yield trade-offs between consistency, availability and partition tolerance [8], reliability and
availability, insertion and querying performance, etc. There are solutions for big enterprise systems, which can handle
a large number of simultaneous connections but are hard to maintain and have high hardware requirements. When
one wants to employ a time series storage for a particular problem, it is very hard to choose the solution that fits best.
In this paper, we perform a thorough theoretical and experimental investigation of modern databases that are used
for storing time series. Section 2 provides a review of three main types of databases used for time series nowadays and
describes four solutions selected for theoretical and experimental evaluation. In Section 3 we formulate 10 criteria,
which can help to understand database capability and theoretical suitability of the database for the specific project,
and compare selected databases using these criteria. Section 4 is devoted to experimental benchmarking of selected
databases based on three real use cases – DevOps, Internet of Things, and meteorological analysis. The results allow
us to formulate a set of recommendations for selecting a database depending on planned data volumes and request rates.

2. Related work

Today there is a large variety of databases with different models and purposes. It would be too time-consuming
to perform an experimental evaluation of all of them. Because of that we consider the most popular and prominent
members from each of the major classes – relational, columnar and specialized time series databases – and choose a
few most suitable of them for further evaluation.
The oldest and most traditional class is relational databases; it is still the most widely used class as
of May 2019 [13]. It is based on a mathematical relational model and stores data in tables by rows. Microsoft SQL
Server is one of the most popular relational databases. It represents all the advantages of relational databases, such
as the ability to store data with any sort of complex structure and compliance with ACID properties to allow reliable
transactions. But it also has serious disadvantages that affect working with time series data. Like most relational
databases, it has problems with handling large monolithic tables. Microsoft SQL Server is also a commercial
product, so scaling is not free. There are also no special mechanisms for storing time series data in the database.
Oracle database is the main competitor of Microsoft SQL Server. It has similar pros and cons, but it has specific data
types for time series and there are examples of usage of Oracle for time series storage and processing [4]. MySQL
is one of the most popular open-source relational databases. Being a good choice for small or medium applications,
it is widely used by web servers of such size, but it has problems with scaling and cannot handle big amounts of data
with suitable speed. Nevertheless, there are some attempts to use MySQL for time series data [19]. PostgreSQL is
another popular open-source relational database. It suits better than MySQL for working with big amounts of data and
even competes with the mentioned commercial products, although it has less functionality. PostgreSQL has an extension
named TimescaleDB [18], created specifically for working with time series. TimescaleDB introduces hypertables,
which are designed considering features of time series data and allow higher time series ingestion rates and greater
performance for some queries.
Columnar databases are a much newer class. Such databases store data in columns instead of rows; they are intended
to work with big amounts of data under a high load of analytical queries. Columnar databases are fast with aggregation
queries and with queries that touch only a few columns of a table with many of them. Compression and scalability
are also strong sides of columnar databases. The weak sides are searching for or updating a few rows in a large table;
also, columnar databases usually don't support transactions. One of the most popular columnar databases is HBase. It is
developed as part of the Hadoop project and works only on an HDFS cluster, so its application area is strictly limited.
It is open-source and possesses the advantages and disadvantages of the columnar class. For time series it is used in
conjunction with OpenTSDB, which is reviewed later. Another open-source columnar database is MonetDB. As all columnar
databases, it is oriented towards analytical processing of big amounts of data, but it also supports features more
characteristic of relational databases, like ACID transactions. It can also be used for storing time series [16]. Clickhouse
is also a popular representative of columnar databases. It has all the benefits of the columnar architecture and is capable
of working with big amounts of data. Importantly, Clickhouse is widely used for storing time series. For example, it has
an application [9] for integration with the Graphite visualization tool for time series metrics.
Specialized time series databases are the newest of all classes. Such databases cannot store any other kind
of data, but they allow working with time series more effectively. Prometheus is not only a time series database, it is
also a monitoring system. It collects time series data from various sources and stores it. The storage architecture assumes
ingestion of only new data with timestamps from the recent few hours, which is a common case for monitoring
systems. OpenTSDB is another popular open-source time series database. It uses the previously reviewed HBase as its data
storage layer, so it can be used in the Hadoop ecosystem, but nowhere else. The accent of the OpenTSDB architecture is
scalability. Currently, the most popular TSDB is InfluxDB. It is highly optimized for working with time series data
and has good scalability. It is also open source and has integrations with different monitoring and visualization tools.
For further evaluation we selected the following solutions: TimescaleDB, because it is the only relational database
that is greatly tailored for working with time series; Clickhouse, since it is widely used for time series processing;
InfluxDB, the most popular time series database; and OpenTSDB, which can work in the Hadoop ecosystem, a
common case for time series data.

3. Theoretical comparison

3.1. Criteria description

For theoretical comparison, we looked for criteria that reflect important aspects of working with time series, such
as the ability to work efficiently with big amounts of data, efficient usage of memory space, and the convenience of
working with a database. Each of the following criteria represents one of these aspects.

• Scalability. This criterion characterizes the ability of parallel query execution in the database. The main types
of scalability are vertical and horizontal. Vertical scalability is an ability to process multiple requests on a sin-
gle node using several CPU cores. It involves the implementation of parallelism through the usage of multiple
threads, coroutines or processes on systems with shared memory. Horizontal scalability involves the use of
distributed computing technologies (including synchronization and consensus) and the use of a cluster of com-
puters, where different nodes provide storage or coordination functions. Nowadays horizontal scalability is a
more effective type of scalability.
• Reliability. This criterion characterizes the ability to provide uninterrupted interaction with the database in
case of failure of an individual node or network partitioning for horizontally scalable systems. Fault tolerance is
provided mainly by replication of various components, including data replication. A critical place in horizontally
scalable systems is usually the coordination component. If it is possible to keep an idle replica of this component
and quickly switch to it in case of a failure, then such a database can be considered highly reliable. In some
databases, higher reliability is also achieved with a Write-Ahead Log (WAL), which helps to restore data after a
node failure. Another reliability mechanism is transactions that satisfy the ACID (Atomicity, Consistency, Isolation,
Durability) properties, which are very important in some spheres like bank transfers.
• Supported query types. This criterion characterizes the functionality of the database, provided to the user mainly
for reading and analyzing data. Depending on the implemented data model, types of queries can vary signifi-
cantly from the most simple and common such as CRUD (create, read, update, delete) or sampling by individual
values (primary key) or scan queries, to specialized queries, such as sampling by time and a set of geolocations.
Specialized queries usually indicate the presence of certain optimizations, which allows one to judge the internal
structure of the database and its optimal use cases.
• Indexing. This criterion characterizes the speed of access to data through an index, a special data structure
containing pointers to sections in memory for fast access. Often the index is built on the primary key
(in case of time series, the timestamp plays this role). In addition, indexing can be done using several fields, e.g.
timestamp and tag. It is important in cases where there are several dimensions, in which points (combinations
of fields in records) possess the concept of proximity and are often chosen jointly in certain queries.
• Storage. This criterion describes the storage layer of the database. It can use its own storage implementation or
depend on an external storage system, which can impose limitations, such as the inability to use the database
without this system, but can also bring benefits.
• Data compression. This criterion describes the capabilities of the database to reduce the amount of stored data.
Data compression algorithms can be general or specific for some data type. Data compression can be applied
based on timestamp (old data is compressed and sent to the permanent storage) or based on the statistics of
references to them.

• Database interaction interface. This criterion indicates types of application interfaces available to work with the
database, like, for example, the HTTP interface, which is not only an important factor for convenience but also
can have an influence on the speed of ingestion of big amounts of data.
• Query language. This criterion indicates what query language is supported by the database. The important
difference here is whether the database supports some dialect of SQL or not, which can sometimes have a big
influence on database choice.
• Internal monitoring. This criterion indicates the ability of the database to save statistics about its own operation,
which can be useful in case of a database failure, performance problems, or other operational issues.
• Administration instruments. This criterion characterizes the convenience of working with the database for a
database administrator by indicating the presence of administration instruments for a variety of purposes.
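The timestamp-based compression mentioned in the data compression criterion can be illustrated with a small sketch. Gorilla-style delta-of-delta encoding exploits the regularity of time series timestamps: metrics arriving at fixed intervals encode almost entirely as zeros. This is a simplified illustration of the general idea, not any particular database's implementation.

```python
def dod_encode(timestamps):
    """Encode timestamps as first value + delta-of-deltas.
    Regularly spaced timestamps produce mostly zeros, which compress well."""
    if not timestamps:
        return []
    out = [timestamps[0]]
    prev, prev_delta = timestamps[0], 0
    for t in timestamps[1:]:
        delta = t - prev
        out.append(delta - prev_delta)  # 0 for a perfectly regular interval
        prev, prev_delta = t, delta
    return out

def dod_decode(encoded):
    """Invert dod_encode, recovering the original timestamps."""
    if not encoded:
        return []
    out = [encoded[0]]
    delta = 0
    for dod in encoded[1:]:
        delta += dod
        out.append(out[-1] + delta)
    return out
```

For a 10-second collection interval with one late sample, `dod_encode([0, 10, 20, 30, 41])` yields `[0, 10, 0, 0, 1]` — a stream dominated by small values that a bit-level encoder can pack tightly.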

3.2. Database comparison

TimescaleDB is an extension of the popular open-source relational database PostgreSQL for working with time series.
TimescaleDB mainly solves a big problem of relational databases – low ingestion rates for big amounts of data. For
this purpose it introduces the hypertable – an entity which has the same interface for the user as a simple table, but
internally is an abstraction consisting of many tables called chunks. Each chunk corresponds to a specific time
interval and is implemented as a standard PostgreSQL table. This mechanism avoids slow disk operations by keeping
in memory only the chunk of the latest period. It is only effective when the ingested data has the latest timestamps,
but that is the most common scenario for time series data.
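The chunking idea behind hypertables can be sketched in a few lines. The one-week chunk interval and the class names below are illustrative assumptions, not TimescaleDB's actual implementation; the point is only that routing rows by time interval keeps each insert confined to the newest chunk.

```python
from collections import defaultdict

CHUNK_SECONDS = 7 * 24 * 3600  # hypothetical chunk interval of one week

def chunk_key(ts):
    """Map a timestamp to the start boundary of its chunk's time interval."""
    return ts - ts % CHUNK_SECONDS

class Hypertable:
    """Toy model: one logical table backed by per-interval chunks."""
    def __init__(self):
        self.chunks = defaultdict(list)  # chunk start -> list of (ts, value)

    def insert(self, ts, value):
        # Appends with recent timestamps always hit the newest chunk,
        # so older chunks can stay on disk untouched.
        self.chunks[chunk_key(ts)].append((ts, value))
```

Two points from the same week land in one chunk; a point from the next week opens a second one.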
Besides good ingestion rates, TimescaleDB inherits from PostgreSQL all advantages of mature relational databases
with decades of development. It has high reliability and security, supports all the same features of SQL language as
PostgreSQL, and works with the whole variety of tools created for PostgreSQL. Concerning scalability, the open-source
version is currently only scalable within a single node.
Clickhouse is a popular open-source columnar database for online analytical processing. It is designed for efficient
work in the following conditions: most of the queries are read queries; data is ingested in the database by large batches,
not by single rows; data cannot be modified; read queries involve a big number of rows, but only a few columns; tables
usually contain a lot of columns. The standard workload for Clickhouse is hundreds of queries per second per server.
For simple queries expected latency reaches about 50 ms. Transactions are not supported and there is no compliance
with ACID properties.
Among the Clickhouse features, good scalability should be noted, both vertical and horizontal. It can scale both
across multiple cores and across multiple servers. Clickhouse also supports multi-master replication of data, which
makes it reliable. Another feature is a query language that is an SQL dialect and supports most SQL constructs like
GROUP BY, ORDER BY, subqueries in FROM, IN, and JOIN clauses, and scalar subqueries, although some language
parts like dependent subqueries and window functions are not supported.
OpenTSDB is an open-source, scalable, distributed time series database written in Java and built on top of HBase.
OpenTSDB is not a standalone database, so it relies on HBase or compatible storage as its data storage layer.
OpenTSDB Time Series Daemons work as query engines without sharing state between instances.
Data is stored as a set of tuples containing a UNIX timestamp, a value, the name of a metric, and pairs of tag keys and
tag values. Tags are used for filtering data by features like hostname or application name. For each metric, tag key and
tag value OpenTSDB produces a unique numeric ID. Each ID is encoded by 3 bytes. A series of IDs and a timestamp
build a rowkey for a value. This way of indexing data helps to perform a more efficient search by combinations of
metrics, tags, and timestamps. To query data from the database, a client has to set a metric name and a time range.
Data can also be filtered and aggregated by tags. OpenTSDB also supports a downsampling mechanism to reduce the
number of returned data points. The rate conversion function calculates the rate of change in values over time; this
transforms counters into lines with spikes showing when activity occurred.
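The rowkey scheme described above can be sketched as follows. This is a simplified illustration: the toy in-memory UID registry and the plain 4-byte timestamp are assumptions for the example (the real OpenTSDB schema additionally normalizes timestamps to row boundaries), but it shows how fixed-width 3-byte IDs make keys with a common metric share a common prefix, enabling efficient range scans.

```python
import struct

_uids = {}

def uid(name):
    """Assign a 3-byte numeric ID per metric/tag string, as the paper describes."""
    if name not in _uids:
        _uids[name] = len(_uids) + 1
    return _uids[name].to_bytes(3, "big")

def rowkey(metric, timestamp, tags):
    """Simplified rowkey: metric UID + 4-byte timestamp + sorted tagk/tagv UID pairs."""
    key = uid(metric) + struct.pack(">I", timestamp)
    for k, v in sorted(tags.items()):
        key += uid(k) + uid(v)
    return key
```

Keys for the same metric and timestamp differ only in their tag suffix, so a scan over one metric's time range touches a contiguous block of rows.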
To reduce load and increase performance OpenTSDB provides rollups and pre-aggregated tables. A rollup is defined
in OpenTSDB as a single time series aggregated over time; it may also be called a time-based aggregation.
Rollups help to solve the problem of analyzing data within large time spans. While rollups help with large time span
queries, a user can still run into query performance issues on small ranges if the metric has high cardinality (i.e. the
unique number of time series for the given metric). If users often fetch the group-by of large sets like this, then
it makes sense to store the aggregate and query that instead, fetching much less data.
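A time-based rollup of the kind described can be sketched in a few lines; averaging into fixed windows is one illustrative choice of aggregation function.

```python
from collections import defaultdict

def rollup(points, window):
    """Aggregate raw (timestamp, value) points into per-window averages.
    The result is one point per window — the 'single time series
    aggregated over time' that OpenTSDB calls a rollup."""
    buckets = defaultdict(list)
    for ts, v in points:
        buckets[ts - ts % window].append(v)
    return {start: sum(vs) / len(vs) for start, vs in sorted(buckets.items())}
```

Querying the rollup instead of the raw series fetches one value per window, regardless of how many raw points the window contained.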
InfluxDB is an open-source time series data storage written in Go, built to handle tasks with large amounts of time
series such as DevOps monitoring, IoT sensors, and real-time data analysis.
InfluxDB uses its own query language based on SQL syntax. All data is stored in measurements as data points.
A data point consists of a timestamp, named fields, and named tags for indexing and grouping. An index is built from
series, which are basically combinations of measurement name, timestamps, and tags. Data points can be separated by
retention policies, which define the lifetime of data and have their own index and sharding period. This approach helps
to store multi-frequency data measurements with the same names but different sets of series and to clean out unnecessary
data. It is also possible to automatically downsample frequent data with continuous queries in order to save space and
reduce query time.
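A data point combining these elements is typically written to InfluxDB in its line protocol (measurement, comma-separated tags, fields, and a nanosecond timestamp). The helper below is a minimal sketch that omits the escaping and type-suffix rules real client libraries apply.

```python
def line_protocol(measurement, tags, fields, timestamp_ns):
    """Render one data point as an InfluxDB line protocol string:
    measurement,tag=... field=... timestamp (tags sorted for consistency)."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {timestamp_ns}"
```

The tag set identifies the series; the fields carry the actual values, so two points with identical measurement and tags but different timestamps belong to the same series.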
The community version of InfluxDB cannot be scaled or replicated without external utilities; these features are
supported only by the enterprise version of the storage. However, the storage architecture is based on shards, which
are represented as groups of files, and those files can be moved or copied.
It is worth mentioning that InfluxDB is very easy to embed into most technology stacks. It provides its own
REST API and also supports the protocols of OpenTSDB, Prometheus, and Graphite, so it can be used as a replacement
without rewriting an application or a monitoring agent.
Table 1 shows comparison of selected databases using chosen criteria.

Table 1: Databases comparison using chosen criteria

| Criterion | TimescaleDB | Clickhouse | InfluxDB | OpenTSDB |
|---|---|---|---|---|
| Scalability | vertical | vertical, horizontal | vertical | vertical, horizontal |
| Reliability | WAL, ACID, replication | replication, sequential consistency | - | replication, immediate consistency |
| Query types | CRUD, search, aggregations, all SQL features | CRUD, search, aggregations, except no single row update or delete | CRUD, search, aggregations, some of SQL features | CRUD, aggregation, no search by value |
| Indexing | B+tree, hash, custom indices | one index per table on any set of fields | merge tree by time, columnar index by tags | index by metric names and tags |
| Storage | FS storage of PostgreSQL | own FS storage | own FS storage | HDFS storage of HBase |
| Data compression | TOAST | columnar | type dependent | columnar |
| Database interface | HTTP, own protocol | HTTP, own protocol | HTTP | Telnet, HTTP |
| Query language | SQL | SQL | SQL-like | own languages |
| Internal monitoring | yes | yes | yes | yes |
| Administration instruments | big variety of tools and libraries for PostgreSQL | big variety of own tools and libraries | some tools and libraries for main use-cases, for most popular languages | some tools and libraries for main use-cases, for most popular languages |

From the table we see that TimescaleDB, being an extension of a mature relational database with many years of
evolution, is the most convenient choice, but Clickhouse and InfluxDB are also quite proficient: they support some
dialect of SQL, provide HTTP interfaces, and have enough tools for convenient work, so for them usability would rarely
be a critical factor. TimescaleDB also has an architecture developed for general purposes, unlike the other databases,
which are designed specifically for storing time series. For example, supporting ACID transactions is quite useless in
many scenarios of working with time series, and at the same time it negatively affects overall performance.

4. Experimental evaluation

4.1. Benchmark description

The benchmark used in our experiments consists of a set of utilities designed to perform four stages of the testing. In
the benchmark, we aim to pre-generate all the data and queries to reduce on-the-fly parsing and generation as much
as possible and thus reduce benchmark overhead during test runs.
Components of the benchmark include: Data generator – produces data points for a specific time range, frequency
and number of metric sources. Generated data is structurally similar to real data, e.g. generated metrics have
the same order of magnitude and rate of change as real metrics. This is important because storage architecture and
compression algorithms heavily depend on the data characteristics.
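A generator of this kind can be sketched as below. The bounded random walk, the value range, and the function name are illustrative assumptions, not the paper's actual generator; the sketch only shows how to produce points whose magnitude and rate of change resemble a real monitoring metric.

```python
import random

def generate_points(n_sources, start_ts, step_s, n_points, seed=42):
    """Emit synthetic metric points as (source, timestamp, value) tuples.
    Values follow a bounded random walk so that magnitude and step size
    resemble a real metric (e.g. CPU load in percent)."""
    rng = random.Random(seed)  # fixed seed for reproducible datasets
    points = []
    for src in range(n_sources):
        value = rng.uniform(20.0, 80.0)
        for i in range(n_points):
            value = min(100.0, max(0.0, value + rng.uniform(-2.0, 2.0)))
            points.append((src, start_ts + i * step_s, round(value, 2)))
    return points
```

Pre-generating the dataset once and replaying it against every database keeps the ingestion comparison fair.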
Data loader – uploads the data to databases. This component simulates uploading from different metric
sources by sending data from different threads simultaneously. Data is loaded in fixed-size batches.
Query generator – produces a fixed number of queries for databases under investigation and for specified use
cases, time range and amount of metric sources.
Query runner – runs queries generated by the previous component, distributing them between workers to simulate
querying by a specific number of clients.
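The query runner's worker distribution can be sketched with a thread pool; the function name and the latency-collection detail are illustrative assumptions about how such a component might be structured.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def run_queries(queries, n_clients, execute):
    """Distribute pre-generated queries across n_clients worker threads,
    simulating concurrent client connections; returns per-query
    latencies in seconds."""
    def timed(q):
        t0 = time.perf_counter()
        execute(q)  # execute is the database-specific query call
        return time.perf_counter() - t0

    with ThreadPoolExecutor(max_workers=n_clients) as pool:
        return list(pool.map(timed, queries))
```

Because the queries are pre-generated, the measured latencies reflect the database, not query construction overhead.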
Instances of databases were run on a hardware setup with the following characteristics:

• OS: Ubuntu 18.10
• CPU: Intel Core i7-8700K CPU @ 3.70GHz
• RAM: 64 GB
• Local storage: SSD 500 GB

Test cases. In order to evaluate the selected databases in realistic scenarios, we created three test cases and three
datasets accordingly. The first test case – DevOps – simulates a production environment, where 100 hosts send their
monitoring characteristics to the database every 10 seconds. The dataset for this case covers a time span of two weeks.
The second test case – Internet of Things – simulates the operation of a more modest environment of 10 hosts (e.g.
a smart home or a small facility) that send their data every 10 seconds. The time interval of this dataset is much longer
– one year. This allows checking how databases sustain query execution over long intervals. The third test case –
meteorological analysis – was designed to reproduce the behavior of a meteorological sensor, like a weather station,
that reports a set of metrics once an hour. This case is very different from the previous two and was designed to
analyze the performance of databases when working with a long series of very low-frequency data.

4.2. Data ingestion

In this experiment, we measured how fast the selected solutions can upload data in various scenarios and how
well they can compress the input data into their internal representations. During the DevOps experiment, it was
found that both InfluxDB and OpenTSDB could not complete the ingestion. InfluxDB utilized all available RAM
and crashed with an out-of-memory exception. OpenTSDB, on the other hand, started processing and hung after 5
minutes of the ingestion execution. It can be seen that Clickhouse in this test performs uploading almost twice as fast
as TimescaleDB, which is quite impressive. In order to analyze the data size in the internal format for InfluxDB and
OpenTSDB, we uploaded their datasets using a reduced number of workers. It is clear that InfluxDB provides much
better compression than the other solutions (at least 12 times), which is due to its time-series-tailored design.
In the Internet of Things scenario, OpenTSDB failed to ingest the generated dataset. We believe that this is related
to the high hardware requirements of OpenTSDB and the need to use a distributed cluster for such data volume.
The other databases had no problems with the ingestion, and Clickhouse again executed much faster than the other two

Table 2: Data ingestion time (seconds)

| Database | DevOps | Internet of Things | Meteorological analysis |
|---|---|---|---|
| Clickhouse | 180 | 667.1 | 32.9 |
| TimescaleDB | 330 | 1643.8 | 68.3 |
| InfluxDB | - | 795.2 | 31.1 |
| OpenTSDB | - | - | 85.2 |

databases. But the difference in sizes of internal representations between InfluxDB and the others is dramatic: 2.5 GB
vs 65 GB for TimescaleDB.
For the meteorological analysis test case, all solutions managed to perform the ingestion and demonstrated comparable
performance. InfluxDB was the best both in terms of execution speed and data compression.

Table 3: Data size inside database (GB)

| Database | DevOps | Internet of Things | Meteorological analysis |
|---|---|---|---|
| Clickhouse | 43 | 28 | 0.5 |
| TimescaleDB | 25 | 65 | 0.6 |
| InfluxDB | 0.9 | 2.5 | 0.1 |
| OpenTSDB | 11 | - | 3.5 |

As we can see from these experiments, Clickhouse works great when a client needs to upload a lot of data. It remains
stable no matter how many connections upload the data, which makes it a solid solution in setups with many metric
sources. TimescaleDB is able to handle many connections too, but not as easily as Clickhouse. InfluxDB is fast and
easy to use with small amounts of data and connections, and, as Table 3 shows, it is the best in data compression. This
means InfluxDB fits well for small monitoring setups, but with more frequent data and a large number of connections
it uses too much CPU and RAM, becomes less stable, and loses performance.

4.3. Search

Some use cases require finding time ranges matching specific conditions without receiving the data points themselves,
for example to process each range sequentially and reduce the load on the database. For this case, pre-aggregated tables
were created that contain minimum and maximum values for each hour interval and each metric source. As presented
in Table 4, TimescaleDB is faster than the rest of the databases regardless of the number of simultaneous connections.
This is because, while specialized TSDBs are mostly designed for search based on tags, PostgreSQL is optimized for
search even in non-indexed columns. But on wider time ranges TimescaleDB becomes less effective.
OpenTSDB does not have the functionality of filtering by value so it could not be compared.
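The pre-aggregated search described above can be sketched in a few lines. This is a minimal Python illustration, not the benchmark code: it assumes raw points arrive as (epoch-seconds, value) pairs and rolls them into hourly (min, max) buckets, so that range conditions can be checked against the small aggregate table instead of the raw data.

```python
from collections import defaultdict


def hourly_aggregates(points):
    """Roll raw (epoch_seconds, value) points up into per-hour (min, max) pairs."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts // 3600].append(value)
    return {hour: (min(vals), max(vals)) for hour, vals in buckets.items()}


def find_ranges(aggregates, threshold):
    """Return the hour buckets whose maximum value exceeds the threshold."""
    return sorted(hour for hour, (_, mx) in aggregates.items() if mx > threshold)


# Three points across two hours: only hour 0 ever exceeds 90.0
points = [(0, 10.0), (1800, 95.0), (3700, 20.0)]
assert find_ranges(hourly_aggregates(points), 90.0) == [0]
```

A real deployment would keep the aggregate table inside the database and refresh it on ingestion; the sketch only shows why such a table makes range search cheap.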

4.4. Data extraction

Depending on the use case, a database has to perform different types of queries with different frequency, so these query types need to be tested separately.
The most common cases for continuous monitoring require grouping by time and tag. To test how the databases behave on different time ranges, the following query types were generated: group by time over 1-day, 3-day, and 5-day ranges (1, 3, and 5 months for the larger interval) and group by time and metric source. The most popular way to work with metrics is to display them as a plot; to remain representative, graph data often has to be downsampled by time interval and tags.
To check the last state of a metric, a client needs to receive the last point of a time series. Such queries are popular with automatic alerting systems and show how a database engine walks through existing time series. For example, OpenTSDB has to aggregate all time series within a metric, which results in low performance, while databases that use sharding or chunking mechanisms can load the last piece of data and find the point with the maximum timestamp in it.
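The last-point lookup over chunked storage can be illustrated with a small sketch. This is a hypothetical Python model, assuming the data is already partitioned into time-ordered chunks of (timestamp, value) points, as TimescaleDB chunks or InfluxDB shards would be; only the newest chunk needs to be scanned.

```python
def last_point(chunks):
    """Find the most recent point by scanning only the newest chunk.

    chunks: list of chunks ordered oldest to newest; each chunk is a
    list of (timestamp, value) points. Returns None for empty storage.
    """
    if not chunks:
        return None
    newest = chunks[-1]
    return max(newest, key=lambda point: point[0])


chunks = [
    [(0, 1.0), (10, 2.0)],   # oldest chunk
    [(20, 3.0), (30, 4.0)],  # newest chunk
]
assert last_point(chunks) == (30, 4.0)
```

An engine without such partitioning has to walk every series under the metric, which is consistent with the high "Last point" timings OpenTSDB shows in Tables 5 and 6.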

Table 4: Data search time (ms)

Connections   Clickhouse   TimescaleDB   InfluxDB

DevOps
1             5.97         1.25          5.02
10            7.82         1.16          2.45
50            11.1         1.16          3.07

Internet of Things
1             49.09        37.28         138.9
10            305.65       60.71         68
50            91.92        676.11        55.97

Meteorological analysis
1             792.65       555.87        607.26

To find and effectively react to a problem, it is often necessary to filter out unneeded data by value, for example to find hosts with high CPU load. Filtering by value is not very common for specialized time series databases, because they are often designed to work with aggregated data rather than with raw data points.
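As an illustration of filtering by value, the following minimal Python sketch (not tied to any of the benchmarked databases) scans raw rows shaped as (host, timestamp, cpu_usage) and reports the hosts that ever exceeded a load threshold; the row shape and field names are assumptions for the example.

```python
def hosts_over_threshold(rows, threshold):
    """Filter raw points by value: hosts whose CPU usage ever exceeded threshold.

    rows: iterable of (host_tag, timestamp, cpu_usage) tuples.
    Returns a sorted list of distinct host tags.
    """
    return sorted({host for host, _, usage in rows if usage > threshold})


rows = [
    ("web-1", 0, 45.0),
    ("web-2", 0, 97.5),
    ("web-1", 60, 99.1),
]
assert hosts_over_threshold(rows, 90.0) == ["web-1", "web-2"]
```

In a database this predicate runs over every raw point rather than over aggregates, which is why the "Filter by value" rows in Tables 5 and 6 are the slowest queries for all engines.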
Table 5 presents benchmark results for the dataset containing short-time-range data from a large number of metric sources. It shows that TimescaleDB is a solid solution for use cases where many clients process many metrics at once.

Table 5: Query timings of DevOps (ms)

Query type                 Clickhouse   TimescaleDB   InfluxDB   OpenTSDB

1 connection
Group by time (1 day)      14.19        1.36          15.62      21.43
Group by time (3 days)     22.43        6.52          27.3       45.93
Group by time (5 days)     33.69        11.51         48.04      54.77
Group by time and tag      36.5         14.46         62.49      35.5
Filter by value            906.84       115.17        620.09     -
Last point                 83.43        2.04          58.98      1630.44

10 connections
Group by time (1 day)      12.45        1.34          13.18      10.66
Group by time (3 days)     43.46        9.27          43.76      36.05
Group by time (5 days)     100.91       8.37          75.11      55.68
Group by time and tag      32.2         13.26         126.57     262.73
Filter by value            423.09       754.14        1321.55    -
Last point                 269.18       2.15          100.54     7697.15

50 connections
Group by time (1 day)      22.24        1.42          18.44      63.22
Group by time (3 days)     21.25        8.33          72.9       230.55
Group by time (5 days)     35.08        7.98          230.84     399.16
Group by time and tag      83.63        22.91         587.89     1437.05
Filter by value            2062.45      2489.76       3685.72    -
Last point                 1487.2       2.15          457.91     38802.39

Table 6 presents benchmark results for the dataset containing middle-time-range data from a small number of metric sources. InfluxDB and Clickhouse proved to be good solutions for small monitoring setups such as the Internet of Things.

Table 6: Query timings of Internet of Things (ms)

Query type                 Clickhouse   TimescaleDB   InfluxDB   OpenTSDB

1 connection
Group by time (1 day)      11.94        31.15         16.64      28.33
Group by time (3 days)     19.31        53.44         24.13      119.96
Group by time (5 days)     36.62        79.23         44.08      240.71
Group by time and tag      9.75         27.05         6.44       68.75
Filter by value            135.79       113.63        85.08      -
Last point                 88.89        1.55          98.38      2315.49

10 connections
Group by time (1 day)      10.2         31.83         11.74      56.05
Group by time (3 days)     34.68        254.54        38.96      236.02
Group by time (5 days)     78.69        327.91        73.95      443.48
Group by time and tag      41.84        15.43         12.82      123.93
Filter by value            935.26       844.68        129.78     -
Last point                 805.8        2.07          202.22     19350.49

50 connections
Group by time (1 day)      24.61        232.71        24.25      159.75
Group by time (3 days)     238.02       544.58        98.46      956.44
Group by time (5 days)     294.42       995.85        257.69     1740.44
Group by time and tag      256.45       25.45         69.95      498.76
Filter by value            2593.72      2723.56       355.43     -
Last point                 4207.84      2.07          943.22     97805.07

Table 7 presents benchmark results for the dataset containing a large time range of meteorological data. Although different solutions achieved the best execution time for individual queries, TimescaleDB overall demonstrated prominent results for all types of queries.

Table 7: Query timings of Meteorological analysis (ms)

Query type                 Clickhouse   TimescaleDB   InfluxDB   OpenTSDB

1 connection
Group by time (1 month)    9.59         8.89          92.87      21.55
Group by time (3 months)   22.19        22.56         252.12     18.21
Group by time (5 months)   20.93        32.48         422.73     45.4
Group by time and tag      10.6         1.43          1.97       1.36
Filter by value            26.79        1.34          1.27       -
Last point                 39.34        11.29         59.43      809.63

5. Conclusion

In this paper, we selected four databases with different architectures and conducted a theoretical and experimental evaluation of their suitability for different cases of working with time series data. The evaluation showed that the choice of database should differ depending on the use case and on which aspects matter most. The most important conclusion of this work is therefore that, to choose the most suitable database, one should run the evaluation on their own data, workloads, and types of queries.
To choose the right instrument, it is important to define a testing method. The data should be the same for each candidate and as close as possible to the real data, because test results based on different data cannot be compared. Operations should

correspond to the use cases for which the instruments are chosen, because different use cases require different sets of mechanisms inside the instrument. The tester should also understand how much wider the use case could become in the future.
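The testing method outlined above can be expressed as a small timing harness. This is a sketch in Python rather than the actual benchmark tool used in the paper; it only shows the principle of running the same query workload repeatedly against each candidate and reporting a median latency in milliseconds, so that the numbers are comparable.

```python
import statistics
import time


def benchmark(query_fn, repeats=5):
    """Time a query function over several repeats; return the median in ms.

    The same data and the same query mix must be used for every database
    under test, otherwise the resulting timings cannot be compared.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        query_fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


# Example with a stand-in "query" (a real run would call the database client)
elapsed_ms = benchmark(lambda: sum(range(10000)))
assert elapsed_ms >= 0.0
```

The median is used here instead of the mean to reduce the influence of warm-up and outlier runs; any summary statistic works as long as it is the same for all candidates.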
As for the reviewed databases, each has its own purpose. InfluxDB, thanks to its great compression mechanism and storage architecture, fits well for small monitoring systems; it is very easy to deploy, integrate, and operate. Clickhouse showed itself to be a stable, solid enterprise solution for systems with high write rates; it is more complicated to integrate into an existing system, but its stability is worth it. TimescaleDB is a good choice when the database handles more queries than writes. It has all the benefits of PostgreSQL plus additional functionality oriented toward time series processing and storage, and as a bonus it can be installed on existing PostgreSQL instances. OpenTSDB is essentially an extension for HBase systems, which means it can use the benefits of all HBase-compatible solutions, such as Google BigTable or HBase itself.
Acknowledgements. This work was financially supported by the Ministry of Education and Science of the Russian Federation, Agreement #14.575.21.0165 (26/09/2017). Unique identification RFMEFI57517X0165.

