Prometheus 2.0 - New Storage Layer Dramatically Increases Monitoring Scalability For Kubernetes and Other Distributed Systems - CoreOS
CoreOS Blog
Prometheus is a monitoring system and time series database expressly designed for the highly
distributed, automated, and scalable modern cluster architectures orchestrated by systems like
Kubernetes. Prometheus has an operational model and a query language tailored for distributed,
dynamic environments.
However, these systems make application loads and node
membership increasingly dynamic, and that dynamism can strain the monitoring system itself. As Prometheus has been adopted
throughout the Kubernetes ecosystem, even powering the monitoring services on Tectonic, CoreOS’s
enterprise Kubernetes platform, performance demands on the Prometheus storage subsystem have
compounded. Prometheus 2.0’s new storage layer is designed to accommodate these demands into
the future.
Problem space
Time series databases, like that underpinning Prometheus, face some basic challenges. A time series
system collects data points over time, linked together into a series. Each data point is a numeric value
associated with a timestamp. A series is defined by a metric name and labeled dimensions, and serves
to partition raw metrics into fine-grained measurements.
https://coreos.com/blog/prometheus-2.0-storage-layer-optimization (retrieved 6/7/2019)
For example, we can measure the total number of received requests, partitioned by request path and method:

    requests_total{path="/status", method="GET"}
    requests_total{path="/status", method="POST"}
    requests_total{path="/", method="GET"}

Prometheus allows us to query those series by their metric name and by specific labels. Querying for
requests_total{path="/status"} will return the first two series in the above example. We can
constrain the number of data points we want to receive for those series to an arbitrary time range, for
example the last two hours, or the last 30 seconds.
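As a sketch of this selection model, a label-matcher query over a toy in-memory store might look like the following. All series data here is made up for illustration; real Prometheus storage is far more sophisticated:

```python
# Toy in-memory store: (labels, samples) pairs; timestamps and values are invented.
series_store = [
    ({"__name__": "requests_total", "path": "/status", "method": "GET"},
     [(10, 1.0), (20, 2.0), (30, 3.0)]),
    ({"__name__": "requests_total", "path": "/status", "method": "POST"},
     [(10, 4.0), (20, 5.0)]),
    ({"__name__": "requests_total", "path": "/", "method": "GET"},
     [(10, 6.0), (20, 7.0)]),
]

def query(matchers, start, end):
    """Select series whose labels satisfy every equality matcher and
    return only the samples whose timestamps fall inside [start, end]."""
    result = []
    for labels, samples in series_store:
        if all(labels.get(k) == v for k, v in matchers.items()):
            result.append((labels, [(t, v) for t, v in samples if start <= t <= end]))
    return result

# Matches the first two series; the sample at t=30 is outside the range.
matches = query({"__name__": "requests_total", "path": "/status"}, 10, 20)
```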
Any storage layer supporting collection of metrics series at the rate and volumes seen in cloud
environments thus faces two key design challenges.
series
  ^
  │ . . . . . . . . . . . . . . . . . . . . . .   request_total{path="/status",method="GET"}
  │ . . . . . . . . . . . . . . . . . . . . . .   request_total{path="/",method="POST"}
  │ . . . . . . .
  │ . . . . . . . . . . . . . . . . . . . ...
  │ . . . . . . . . . . . . . . . . . . . . .
  │ . . . . . . . . . . . . . . . . . . . . .   errors_total{path="/status",method="POST"}
  │ . . . . . . . . . . . . . . . . .   errors_total{path="/health",method="GET"}
  │ . . . . . . . . . . . . . .
  │ . . . . . . . . . . . . . . . . . . . ...
  │ . . . . . . . . . . . . . . . . . . . .
  v
    <-------------------- time --------------------->
Prometheus periodically collects new data points for all series, which means it must perform vertical
writes at the right end of the time axis. When querying, however, we may want to access rectangles of
arbitrary area anywhere across the plane.
This means time series read access patterns are vastly different from write patterns. Any useful storage
layer for such a system must deliver high performance for both cases.
Series churn
Series are associated with deployment units, such as a Kubernetes pod. When a pod starts, new series
are added into our data pool. If a pod shuts down, its series stops receiving new samples, but the
existing data for the pod remains available. A new series begins for the pod automatically spun up to
replace the terminated pod. Auto-scaling and rolling updates for continuous deployment cause this
instance churn to happen orders of magnitude more often than in conventional environments. While
Prometheus may usually be collecting data points for a roughly fixed number of active series, the total
number of series in the database grows linearly over time.
Because of this, the series plane in dynamic environments like Kubernetes clusters typically looks
more like the following:
series
^
│ . . . . . .
│ . . . . . .
│ . . . . . .
│ . . . . . . .
│ . . . . . . .
│ . . . . . . .
│ . . . . . .
│ . . . . . .
│ . . . . .
│ . . . . .
│ . . . . .
v
<-------------------- time --------------------->
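A minimal simulation of this churn pattern (all numbers hypothetical) shows why the total series count grows linearly even though the number of actively collected series stays roughly flat:

```python
def simulate_churn(pods=100, restarts_per_step=10, series_per_pod=5, steps=50):
    """At each step, some pods are replaced: their old series stop receiving
    samples but remain in the database, and fresh series are created."""
    active = pods * series_per_pod   # roughly constant over time
    total = pods * series_per_pod    # grows with every restart
    history = []
    for _ in range(steps):
        total += restarts_per_step * series_per_pod
        history.append((active, total))
    return history

# Active series stay flat while total stored series grow linearly.
hist = simulate_churn()
```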
To be able to find queried series efficiently, we need an index. But the index that works well for five
million series may falter when dealing with 200 million or more.
We designed a new storage layer that aims to address these shortcomings to make it even easier to
run Prometheus in environments like Kubernetes, and to prepare Prometheus for the proliferating
workloads of the future.
Sample compression
The sample compression feature of the existing storage layer played a major role in Prometheus’s
early success. A single raw data point occupies 16 bytes of storage. When Prometheus collects a few
hundred thousand data points per second, this can quickly fill a hard drive.
However, samples within the same series tend to be very similar, and we can exploit this fact to apply
efficient compression to samples. Batch compressing chunks of many samples of a series, in memory,
squeezes each data point down to an average 1.37 bytes of storage.
This compression scheme works so well that we retained it in the design of the new version 2 storage
layer. You can read more on the specifics of the compression algorithm Prometheus storage employs
in Facebook’s “Gorilla” paper (http://www.vldb.org/pvldb/vol8/p1816-teller.pdf).
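The real Gorilla scheme works at the bit level (delta-of-delta varint encoding for timestamps, XOR encoding for values). The sketch below illustrates only the timestamp idea, in simplified form: for a regularly scraped series, the second-order deltas are mostly zero or tiny, which is what makes the stream so compressible. It assumes at least three samples:

```python
def encode_timestamps(timestamps):
    """Delta-of-delta encoding: store the first timestamp, the first delta,
    and the sequence of second-order deltas (mostly zero for regular scrapes)."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], deltas[0], dods

def decode_timestamps(first, first_delta, dods):
    """Reverse the encoding by re-accumulating deltas, then timestamps."""
    deltas = [first_delta]
    for d in dods:
        deltas.append(deltas[-1] + d)
    ts = [first]
    for d in deltas:
        ts.append(ts[-1] + d)
    return ts

# Scrapes roughly every 15s with slight jitter: delta-of-deltas are tiny.
ts = [1000, 1015, 1030, 1046, 1061, 1076]
first, d0, dods = encode_timestamps(ts)
```

Small integers like these compress to a handful of bits each, which is how the average cost per data point drops to around 1.37 bytes.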
Time sharding
We had the profound realization that rectangular query patterns are served well by a rectangular
storage layout. Therefore, we have the new storage layer divide storage into blocks, each of which
holds all the series from a range of time. Each block acts as a standalone database.
 t0            t1             t2             t3             now
 ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐
 │           │  │           │  │           │  │           │               ┌────────────┐
 │           │  │           │  │           │  │  mutable  │ <─── write ── │ Prometheus │
 │           │  │           │  │           │  │           │               └────────────┘
 └───────────┘  └───────────┘  └───────────┘  └───────────┘                      ^
       └──────────────┴───────┬──────┴──────────────┘                            │
                              │                                                query
                              │                                                  │
                            merge ────────────────────────────────────────────────┘
This allows a query to examine only the subset of blocks within the requested time range, and trivially
addresses the problem of series churn. If the database only considers a fraction of the total plane,
query execution time naturally decreases.
This layout also makes it trivially easy to delete old data, which was an expensive procedure in the old
storage layer. Once a block’s time range completely falls behind the configured retention boundary, it
can be dropped entirely.
                      |
 ┌────────────┐  ┌────┼─────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐
 │ 1          │  │ 2  │     │  │ 3         │  │ 4         │  │ 5         │  . . .
 └────────────┘  └────┼─────┘  └───────────┘  └───────────┘  └───────────┘
                      |
                      |
             retention boundary
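Both properties of the block layout can be sketched in a few lines. The block structure and time ranges below are invented for illustration; the actual Prometheus block format carries much more state:

```python
from dataclasses import dataclass

@dataclass
class Block:
    min_time: int  # inclusive start of the block's time range
    max_time: int  # inclusive end of the block's time range

def blocks_for_query(blocks, start, end):
    """A query only has to open blocks whose time range overlaps [start, end]."""
    return [b for b in blocks if b.max_time >= start and b.min_time <= end]

def enforce_retention(blocks, now, retention):
    """Drop whole blocks that fall entirely behind the retention boundary."""
    boundary = now - retention
    return [b for b in blocks if b.max_time >= boundary]

blocks = [Block(0, 99), Block(100, 199), Block(200, 299), Block(300, 399)]
```

Deletion is cheap precisely because it never rewrites data: an expired block is simply removed as a unit.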
The Index
While reducing the queried data set is very effective, it makes improving the overall index increasingly
crucial, since we expect series churn to only intensify.
We want to query series by their metric names and labels. Those are completely arbitrary, user-
configured, and vary by application and by use case. A column index like those in common SQL
databases cannot be used. Instead, the new Prometheus storage layer borrows the inverted index
concept (https://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-
index-1.html) used in full-text search engines. The inverted index method can efficiently retrieve
documents by matching any of the words inside of them. The new storage layer treats each series
descriptor as a tiny document. The series name and each label pair are then words in that document.
1. __name__="requests_total"
2. path="/status"
3. method="GET"
4. instance="10.0.0.1:80"
For each unique label word across the series, the index now maintains a sorted list of IDs of the series
in which it occurs. Those lists can be efficiently merged and intersected to quickly find the right time
series for complex queries. For example, the selector
requests_total{path!="/status",instance=~".*:80"} retrieves all series for the requests_total
metric of instances listening on port 80, excluding those for the /status path.
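A set-based sketch of this idea follows. The series IDs and labels are made up, and real Prometheus keeps sorted posting lists that are merged and intersected lazily rather than Python sets, but the mechanics of resolving the example selector are the same:

```python
import re
from collections import defaultdict

# Hypothetical series set; IDs and labels are invented for illustration.
series_labels = {
    1: {"__name__": "requests_total", "path": "/status", "instance": "10.0.0.1:80"},
    2: {"__name__": "requests_total", "path": "/",       "instance": "10.0.0.2:80"},
    3: {"__name__": "requests_total", "path": "/status", "instance": "10.0.0.3:9090"},
    4: {"__name__": "errors_total",   "path": "/status", "instance": "10.0.0.1:80"},
}

# Inverted index: each (label, value) "word" maps to the IDs of series containing it.
index = defaultdict(set)
for sid, labels in series_labels.items():
    for pair in labels.items():
        index[pair].add(sid)

def select(name, not_eq=(), regex=()):
    """Start from the metric name's posting list, subtract negated matchers,
    then filter the survivors with regex matchers."""
    ids = set(index[("__name__", name)])
    for pair in not_eq:                      # label != value
        ids -= index[pair]
    for key, pattern in regex:               # label =~ pattern
        ids = {sid for sid in ids if re.fullmatch(pattern, series_labels[sid][key])}
    return sorted(ids)

# requests_total{path!="/status",instance=~".*:80"} resolves to series 2 only.
```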
Benchmarks
Design and development are just one part of the equation. To reliably measure whether our efforts are
paying off, we need to verify our theory using benchmarks. What could be better than simulating
those series-churning Kubernetes workloads that initially motivated the new storage design? We
wrote a benchmarking suite that spins up Kubernetes clusters and runs a microservice workload on
them, with pods rapidly starting and stopping. Prometheus servers with different versions and
configurations can monitor this workload, which allows for direct A/B testing of Prometheus
performance.
But collecting data is just one part of what a time series database storage layer does. We also wanted
to compare how the same queries perform on the old and new storage layers, respectively. To
benchmark this case, Prometheus servers are continuously hit with queries that reflect typical use
cases.
Overall, the benchmark simulates a workload far below the potential maximum sample throughput,
but a lot more demanding in terms of querying and series churn than most real-world setups today.
This may show larger performance improvements than typical production workloads, but it ensures
the new storage layer can handle the next order of scale.
So what are the results? Memory consumption has historically been a pain point of Prometheus
deployments. The new storage layer, queried as well as un-queried, shows significant memory savings
as well as more stable, predictable allocation. This is especially useful in container environments
where admins want to apply reasonable resource limitations.
Storage system CPU consumption also decreases in Prometheus 2.0 servers. Servers show increased
usage when being queried, which mostly stems from the computations in the query engine itself,
rather than the storage mechanism. This is a potential target for future optimization.
While we have not discussed it so far, we also optimized the disk-writing behavior of the storage layer. This is
particularly useful for Kubernetes environments, where network volumes with
limited I/O bandwidth are common. Overall, the bytes written to disk per second decreased by a factor of about 80.
As a surprising side effect, the new file layout and index structure also significantly decrease the
on-disk storage size, on top of the gains from sample compression.
Finally, what about the queries? The benchmarking environment causes significant series churn on
top of the expensive queries we are running. The new storage layer aims to handle the linear increase
of time series in storage without an equivalent increase in query latency. As shown in the graph below,
Prometheus 1.5.2 exhibits this linear slowdown, whereas Prometheus 2.0 enters a stable state on
startup with minimal spikes.
There's a long list of people who contributed to this work through development, testing, and technical
expertise. We deeply thank everyone involved in this critical next step for the Prometheus project.