Prometheus 2.0 - New Storage Layer Dramatically Increases Monitoring Scalability For Kubernetes and Other Distributed Systems - CoreOS
CoreOS Blog
Prometheus is a monitoring system and time series database expressly designed for the highly
distributed, automated, and scalable modern cluster architectures orchestrated by systems like
Kubernetes. Prometheus has an operational model and a query language tailored for distributed,
dynamic environments.
However, these systems make application loads and node
membership increasingly dynamic, and that dynamism can strain the monitoring system itself. As Prometheus has been adopted
throughout the Kubernetes ecosystem, even powering the monitoring services on Tectonic, CoreOS’s
enterprise Kubernetes platform, performance demands on the Prometheus storage subsystem have
compounded. Prometheus 2.0’s new storage layer is designed to accommodate these demands into
the future.
Problem space
Time series databases, like that underpinning Prometheus, face some basic challenges. A time series
system collects data points over time, linked together into a series. Each data point is a numeric value
associated with a timestamp. A series is defined by a metric name and labeled dimensions, and serves
to partition raw metrics into fine-grained measurements.
https://coreos.com/blog/prometheus-2.0-storage-layer-optimization (retrieved 6/7/2019)
For example, we can measure the total number of received requests, partitioned by request path and method:

    requests_total{path="/status", method="GET"}
    requests_total{path="/status", method="POST"}
    requests_total{path="/", method="GET"}

Prometheus allows us to query those series by their metric name and by specific labels. Querying for
requests_total{path="/status"} will return the first two series in the above example. We can
constrain the number of data points we want to receive for those series to an arbitrary time range, for
example the last two hours, or the last 30 seconds.
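As a sketch of this selection model, a label-matcher query over a toy in-memory store might look like the following. All series data here is made up for illustration; real Prometheus storage is far more sophisticated:

```python
# Toy in-memory store: (labels, samples) pairs; timestamps and values are invented.
series_store = [
    ({"__name__": "requests_total", "path": "/status", "method": "GET"},
     [(10, 1.0), (20, 2.0), (30, 3.0)]),
    ({"__name__": "requests_total", "path": "/status", "method": "POST"},
     [(10, 4.0), (20, 5.0)]),
    ({"__name__": "requests_total", "path": "/", "method": "GET"},
     [(10, 6.0), (20, 7.0)]),
]

def query(matchers, start, end):
    """Select series whose labels satisfy every equality matcher and
    return only the samples whose timestamps fall inside [start, end]."""
    result = []
    for labels, samples in series_store:
        if all(labels.get(k) == v for k, v in matchers.items()):
            result.append((labels, [(t, v) for t, v in samples if start <= t <= end]))
    return result

# Matches the first two series; the sample at t=30 is outside the range.
matches = query({"__name__": "requests_total", "path": "/status"}, 10, 20)
```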
Any storage layer supporting collection of metrics series at the rate and volumes seen in cloud
environments thus faces two key design challenges.
series
  ^
  │ . . . . . . . . . . . . . . . . . . . . . .   request_total{path="/status",method="GET"}
  │ . . . . . . . . . . . . . . . . . . . . . .   request_total{path="/",method="POST"}
  │ . . . . . . .
  │ . . . . . . . . . . . . . . . . . . . ...
  │ . . . . . . . . . . . . . . . . . . . . .
  │ . . . . . . . . . . . . . . . . . . . . .   errors_total{path="/status",method="POST"}
  │ . . . . . . . . . . . . . . . . .   errors_total{path="/health",method="GET"}
  │ . . . . . . . . . . . . . .
  │ . . . . . . . . . . . . . . . . . . . ...
  │ . . . . . . . . . . . . . . . . . . . .
  v
    <-------------------- time --------------------->
Prometheus periodically collects new data points for all series, which means it must perform vertical
writes at the right end of the time axis. When querying, however, we may want to access rectangles of
arbitrary area anywhere across the plane.
This means time series read access patterns are vastly different from write patterns. Any useful storage
layer for such a system must deliver high performance for both cases.
Series churn
Series are associated with deployment units, such as a Kubernetes pod. When a pod starts, new series
are added into our data pool. If a pod shuts down, its series stops receiving new samples, but the
existing data for the pod remains available. A new series begins for the pod automatically spun up to
replace the terminated pod. Auto-scaling and rolling updates for continuous deployment cause this
instance churn to happen orders of magnitude more often than in conventional environments. While
Prometheus may usually be collecting data points for a roughly fixed number of active series, the total
number of series in the database grows linearly over time.
Because of this, the series plane in dynamic environments like Kubernetes clusters typically looks
more like the following:
series
^
│ . . . . . .
│ . . . . . .
│ . . . . . .
│ . . . . . . .
│ . . . . . . .
│ . . . . . . .
│ . . . . . .
│ . . . . . .
│ . . . . .
│ . . . . .
│ . . . . .
v
<-------------------- time --------------------->
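A minimal simulation of this churn pattern (all numbers hypothetical) shows why the total series count grows linearly even though the number of actively collected series stays roughly flat:

```python
def simulate_churn(pods=100, restarts_per_step=10, series_per_pod=5, steps=50):
    """At each step, some pods are replaced: their old series stop receiving
    samples but remain in the database, and fresh series are created."""
    active = pods * series_per_pod   # roughly constant over time
    total = pods * series_per_pod    # grows with every restart
    history = []
    for _ in range(steps):
        total += restarts_per_step * series_per_pod
        history.append((active, total))
    return history

# Active series stay flat while total stored series grow linearly.
hist = simulate_churn()
```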
To be able to find queried series efficiently, we need an index. But the index that works well for five
million series may falter when dealing with 200 million or more.
We designed a new storage layer that aims to address these shortcomings to make it even easier to
run Prometheus in environments like Kubernetes, and to prepare Prometheus for the proliferating
workloads of the future.
Sample compression
The sample compression feature of the existing storage layer played a major role in Prometheus’s
early success. A single raw data point occupies 16 bytes of storage. When Prometheus collects a few
hundred thousand data points per second, this can quickly fill a hard drive.
However, samples within the same series tend to be very similar, and we can exploit this fact to apply
efficient compression to samples. Batch compressing chunks of many samples of a series, in memory,
squeezes each data point down to an average 1.37 bytes of storage.
This compression scheme works so well that we retained it in the design of the new version 2 storage
layer. You can read more on the specifics of the compression algorithm Prometheus storage employs
in Facebook’s “Gorilla” paper (http://www.vldb.org/pvldb/vol8/p1816-teller.pdf).
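The real Gorilla scheme works at the bit level (delta-of-delta varint encoding for timestamps, XOR encoding for values). The sketch below illustrates only the timestamp idea, in simplified form: for a regularly scraped series, the second-order deltas are mostly zero or tiny, which is what makes the stream so compressible. It assumes at least three samples:

```python
def encode_timestamps(timestamps):
    """Delta-of-delta encoding: store the first timestamp, the first delta,
    and the sequence of second-order deltas (mostly zero for regular scrapes)."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], deltas[0], dods

def decode_timestamps(first, first_delta, dods):
    """Reverse the encoding by re-accumulating deltas, then timestamps."""
    deltas = [first_delta]
    for d in dods:
        deltas.append(deltas[-1] + d)
    ts = [first]
    for d in deltas:
        ts.append(ts[-1] + d)
    return ts

# Scrapes roughly every 15s with slight jitter: delta-of-deltas are tiny.
ts = [1000, 1015, 1030, 1046, 1061, 1076]
first, d0, dods = encode_timestamps(ts)
```

Small integers like these compress to a handful of bits each, which is how the average cost per data point drops to around 1.37 bytes.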
Time sharding
We had the profound realization that rectangular query patterns are served well by a rectangular
storage layout. Therefore, we have the new storage layer divide storage into blocks, each of which
holds all the series from a range of time. Each block acts as a standalone database.
 t0            t1             t2             t3             now
 ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐
 │           │  │           │  │           │  │           │               ┌────────────┐
 │           │  │           │  │           │  │  mutable  │ <─── write ── │ Prometheus │
 │           │  │           │  │           │  │           │               └────────────┘
 └───────────┘  └───────────┘  └───────────┘  └───────────┘                      ^
       └──────────────┴───────┬──────┴──────────────┘                            │
                              │                                                query
                              │                                                  │
                            merge ────────────────────────────────────────────────┘
This allows a query to examine only the subset of blocks within the requested time range, and trivially
addresses the problem of series churn. If the database only considers a fraction of the total plane,
query execution time naturally decreases.
This layout also makes it trivially easy to delete old data, which was an expensive procedure in the old
storage layer. Once a block’s time range completely falls behind the configured retention boundary, it
can be dropped entirely.
                      |
 ┌────────────┐  ┌────┼─────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐
 │ 1          │  │ 2  │     │  │ 3         │  │ 4         │  │ 5         │  . . .
 └────────────┘  └────┼─────┘  └───────────┘  └───────────┘  └───────────┘
                      |
                      |
             retention boundary
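Both properties of the block layout can be sketched in a few lines. The block structure and time ranges below are invented for illustration; the actual Prometheus block format carries much more state:

```python
from dataclasses import dataclass

@dataclass
class Block:
    min_time: int  # inclusive start of the block's time range
    max_time: int  # inclusive end of the block's time range

def blocks_for_query(blocks, start, end):
    """A query only has to open blocks whose time range overlaps [start, end]."""
    return [b for b in blocks if b.max_time >= start and b.min_time <= end]

def enforce_retention(blocks, now, retention):
    """Drop whole blocks that fall entirely behind the retention boundary."""
    boundary = now - retention
    return [b for b in blocks if b.max_time >= boundary]

blocks = [Block(0, 99), Block(100, 199), Block(200, 299), Block(300, 399)]
```

Deletion is cheap precisely because it never rewrites data: an expired block is simply removed as a unit.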
The Index
While reducing the queried data set is very effective, it makes improving the overall index increasingly
crucial, since we expect series churn to only intensify.
We want to query series by their metric names and labels. Those are completely arbitrary, user-
configured, and vary by application and by use case. A column index like those in common SQL
databases cannot be used. Instead, the new Prometheus storage layer borrows the inverted index
concept (https://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-
index-1.html) used in full-text search engines. The inverted index method can efficiently retrieve
documents by matching any of the words inside of them. The new storage layer treats each series
descriptor as a tiny document. The series name and each label pair are then words in that document.
1. __name__="requests_total"
2. path="/status"
3. method="GET"
4. instance="10.0.0.1:80"
For each unique label word across the series, the index now maintains a sorted list of IDs of the series
in which it occurs. Those lists can be efficiently merged and intersected to quickly find the right time
series for complex queries. For example, the selector
requests_total{path!="/status",instance=~".*:80"} retrieves all series for the requests_total
metric of instances listening on port 80, excluding those for the /status path.
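A set-based sketch of this idea follows. The series IDs and labels are made up, and real Prometheus keeps sorted posting lists that are merged and intersected lazily rather than Python sets, but the mechanics of resolving the example selector are the same:

```python
import re
from collections import defaultdict

# Hypothetical series set; IDs and labels are invented for illustration.
series_labels = {
    1: {"__name__": "requests_total", "path": "/status", "instance": "10.0.0.1:80"},
    2: {"__name__": "requests_total", "path": "/",       "instance": "10.0.0.2:80"},
    3: {"__name__": "requests_total", "path": "/status", "instance": "10.0.0.3:9090"},
    4: {"__name__": "errors_total",   "path": "/status", "instance": "10.0.0.1:80"},
}

# Inverted index: each (label, value) "word" maps to the IDs of series containing it.
index = defaultdict(set)
for sid, labels in series_labels.items():
    for pair in labels.items():
        index[pair].add(sid)

def select(name, not_eq=(), regex=()):
    """Start from the metric name's posting list, subtract negated matchers,
    then filter the survivors with regex matchers."""
    ids = set(index[("__name__", name)])
    for pair in not_eq:                      # label != value
        ids -= index[pair]
    for key, pattern in regex:               # label =~ pattern
        ids = {sid for sid in ids if re.fullmatch(pattern, series_labels[sid][key])}
    return sorted(ids)

# requests_total{path!="/status",instance=~".*:80"} resolves to series 2 only.
```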
Benchmarks
Design and development are just one part of the equation. To reliably measure whether our efforts are
paying off, we need to verify our theory using benchmarks. What could be better than simulating
those series-churning Kubernetes workloads that initially motivated the new storage design? We
wrote a benchmarking suite that spins up Kubernetes clusters and runs a microservice workload on
them, with pods rapidly starting and stopping. Prometheus servers with different versions and
configurations can monitor this workload, which allows for direct A/B testing of Prometheus
performance.
But collecting data is just one part of what a time series database storage layer does. We also wanted
to compare how the same queries perform on the old and new storage layers, respectively. To
benchmark this case, Prometheus servers are continuously hit with queries that reflect typical use
cases.
Overall, the benchmark simulates a workload far below the potential maximum sample throughput,
but a lot more demanding in terms of querying and series churn than most real-world setups today.
This may show larger performance improvements than typical production workloads, but it ensures
the new storage layer can handle the next order of scale.
So what are the results? Memory consumption has historically been a pain point of Prometheus
deployments. The new storage layer, queried as well as un-queried, shows significant memory savings
as well as more stable, predictable allocation. This is especially useful in container environments
where admins want to apply reasonable resource limitations.
Storage system CPU consumption also decreases in Prometheus 2.0 servers. Servers show increased
usage when being queried, which mostly stems from the computations in the query engine itself,
rather than the storage mechanism. This is a potential target for future optimization.
While we have not discussed it so far, we also optimized the disk-writing behavior of the storage layer. This is
particularly useful for Kubernetes environments, where network volumes with
limited I/O bandwidth are common. Overall, the bytes written to disk per second decreased by a factor of about 80.
As a surprising side effect, the new file layout and index structure also significantly decrease the
on-disk storage size, on top of the gains from sample compression.
Finally, what about the queries? The benchmarking environment causes significant series churn on
top of the expensive queries we are running. The new storage layer aims to handle the linear increase
of time series in storage without an equivalent increase in query latency. As shown in the graph below,
Prometheus 1.5.2 exhibits this linear slowdown, whereas Prometheus 2.0 enters a stable state on
startup with minimal spikes.
There's a long list of people who contributed to this work through development, testing, and technical
expertise. We deeply thank everyone involved in this critical next step for the Prometheus project.