Kubernetes Monitoring
com/danielfm/prometheus-for-developers
Production metrics with Prometheus
Prometheus is an open source monitoring system and time-series database (TSDB).
Prometheus Server
The Prometheus Server is the component responsible for periodically collecting and
storing metrics from various targets (e.g. the services you want to collect metrics
from).
Prometheus also provides a basic Web UI for running queries on the stored
data, as well as integrations with popular visualization tools, such as Grafana.
Push vs Pull
Metrics Endpoint
By default, Prometheus pulls (scrapes) metrics from the /metrics endpoint of each target.
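As a rough illustration of what a target serves at that endpoint, the sketch below renders one counter in the Prometheus text exposition format and serves it over HTTP using only the Python standard library. The metric name app_requests_total, the default port 8000, and the helpers render_metrics and serve are assumptions made for this example; a real service would normally rely on an official client library instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter; a real app would increment it per request.
REQUEST_COUNT = {"value": 0}


def render_metrics(count):
    """Render one counter in the Prometheus text exposition format."""
    return (
        "# HELP app_requests_total Total requests handled (example metric).\n"
        "# TYPE app_requests_total counter\n"
        f"app_requests_total {count}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNT["value"]).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrapes on the given port (arbitrary default)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() starts the endpoint; Prometheus would then be configured to scrape that host and port as a target.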
Prometheus provides a facility for defining alerting rules that, when
triggered, notify Alertmanager, the component that takes care of
deduplicating, grouping, and routing alerts to the correct receiver integration.
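To make the three steps concrete, here is a toy model of deduplicating, grouping, and routing. It is not how Alertmanager is implemented or configured; the alert dictionaries, the severity-based routing table, and the function names are all assumptions chosen for illustration.

```python
from collections import defaultdict


def deduplicate(alerts):
    """Drop alerts whose full label set has already been seen."""
    seen = set()
    unique = []
    for alert in alerts:
        key = frozenset(alert["labels"].items())
        if key not in seen:
            seen.add(key)
            unique.append(alert)
    return unique


def group_by(alerts, label):
    """Group alerts by the value of one label (e.g. alertname)."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert["labels"].get(label, "")].append(alert)
    return dict(groups)


def route(group_alerts, routes, default):
    """Pick a receiver name from the first alert's severity label."""
    severity = group_alerts[0]["labels"].get("severity")
    return routes.get(severity, default)


# Hypothetical sample alerts; the second one is an exact duplicate of the first.
alerts = [
    {"labels": {"alertname": "HighLatency", "severity": "critical", "instance": "a"}},
    {"labels": {"alertname": "HighLatency", "severity": "critical", "instance": "a"}},
    {"labels": {"alertname": "HighLatency", "severity": "critical", "instance": "b"}},
    {"labels": {"alertname": "DiskFull", "severity": "warning", "instance": "a"}},
]
groups = group_by(deduplicate(alerts), "alertname")
receivers = {
    name: route(members, {"critical": "pagerduty"}, default="slack")
    for name, members in groups.items()
}
```

In this sketch, one notification per group goes to the chosen receiver rather than one per alert, which is the practical benefit of grouping.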
Configuring Alertmanager to send alerts to PagerDuty or Slack
Instrumenting Your Applications
Measuring Request Durations
We can measure request durations with percentiles or averages.
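The difference between the two is easiest to see on raw samples. The sketch below computes an average and nearest-rank percentiles over a made-up list of request durations in seconds; the sample values and the percentile helper are assumptions for illustration. Prometheus itself estimates quantiles from histogram buckets rather than from raw samples, so this only shows the idea.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical durations in seconds; one slow outlier at 2.5s.
durations = [0.05, 0.06, 0.07, 0.08, 0.09, 0.10, 0.11, 0.12, 0.13, 2.50]

average = statistics.mean(durations)   # pulled up by the outlier
p50 = percentile(durations, 50)        # typical request
p99 = percentile(durations, 99)        # worst-case tail
```

Here the single slow request drags the average well above what most requests experience, while the median stays representative and the 99th percentile exposes the outlier.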
Measuring Throughput
If you are using a histogram to measure request duration, you can
use the <basename>_count timeseries to measure throughput without having to
introduce another metric.
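The _count series is a cumulative counter, so throughput is its rate of increase over time. In PromQL this is typically written with rate() over the _count series; the sketch below shows the same arithmetic on two (timestamp, count) samples. The function name and the sample values are assumptions, and the reset handling is a simplification of what Prometheus does.

```python
def per_second_rate(first, last):
    """Average per-second increase between two (timestamp, count) samples."""
    (t0, c0), (t1, c1) = first, last
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    if c1 < c0:
        # A cumulative counter only decreases when the process restarts;
        # treat the new value as the increase since the reset.
        increase = c1
    else:
        increase = c1 - c0
    return increase / (t1 - t0)


# 600 requests observed over 60 seconds.
throughput = per_second_rate((0, 1000), (60, 1600))
```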
Measuring Memory/CPU Usage
Measuring SLOs and Error Budgets
SLOs, or Service Level Objectives, are one of the main tools
employed by Site Reliability Engineers (SREs) for making data-driven decisions
about reliability.
SLOs are based on SLIs, or Service Level Indicators, which are
the key metrics that define how well (or how poorly) a given service is operating.
Common SLIs would be the number of failed requests, the number of
requests slower than some threshold, etc.
Availability
The proportion of successful requests; any HTTP status
other than 500-599 is considered successful
Latency
The proportion of requests with duration less than or equal
to 100ms
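Both SLIs are simple proportions, which a short sketch can make explicit. The request records below are invented (status code, duration in seconds) pairs and the function names are assumptions; in practice these ratios would be computed from metrics rather than from raw request logs.

```python
def availability_sli(requests):
    """Proportion of requests whose HTTP status is outside 500-599."""
    if not requests:
        raise ValueError("no requests in the window")
    ok = sum(1 for status, _ in requests if not 500 <= status <= 599)
    return ok / len(requests)


def latency_sli(requests, threshold_seconds=0.1):
    """Proportion of requests with duration <= threshold (100ms by default)."""
    if not requests:
        raise ValueError("no requests in the window")
    fast = sum(1 for _, duration in requests if duration <= threshold_seconds)
    return fast / len(requests)


# Hypothetical (status, duration_seconds) records.
requests = [(200, 0.05), (200, 0.25), (404, 0.03), (503, 0.40)]

availability = availability_sli(requests)  # 404 counts as successful here
latency = latency_sli(requests)
```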
The difference between 100% and the SLO is what we call the Error
Budget.
For example, the error budget for a 95% SLO is 5%:
if the application receives 1,000 requests during the SLO window,
up to 50 of them can fail and we'll still meet our SLO.
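The arithmetic can be checked with a few lines of code. This sketch uses exact fractions to avoid floating-point rounding; the function names are assumptions, and the 20 failed requests in the last line are an invented figure.

```python
from fractions import Fraction


def error_budget(slo_percent):
    """Error budget as an exact fraction of requests (95 -> 1/20)."""
    return 1 - Fraction(str(slo_percent)) / 100


def allowed_failures(slo_percent, total_requests):
    """Whole number of requests that may fail without missing the SLO."""
    return int(error_budget(slo_percent) * total_requests)


def budget_remaining(slo_percent, total_requests, failed_requests):
    """Failures still affordable in this window (negative means SLO missed)."""
    return allowed_failures(slo_percent, total_requests) - failed_requests


budget = allowed_failures(95, 1000)        # the 95% / 1,000 requests case
remaining = budget_remaining(95, 1000, 20)  # after 20 hypothetical failures
```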
Monitoring Applications Without a Metrics Endpoint
Prometheus needs all applications to expose a /metrics HTTP endpoint
for it to scrape metrics.
To monitor something like a MySQL instance, which does not provide a Prometheus
metrics endpoint, we use exporters.
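Conceptually, an exporter reads the system's native statistics and re-exposes them in the Prometheus text format. The sketch below shows only that translation step on hard-coded rows; the mysql_example_ prefix, the to_exposition helper, and the sample values are assumptions, and it does not reflect how any real MySQL exporter names or types its metrics.

```python
def to_exposition(status_rows, prefix="mysql_example_"):
    """Translate (name, value) status rows into Prometheus text format."""
    lines = []
    for name, value in status_rows:
        try:
            number = float(value)
        except ValueError:
            continue  # skip non-numeric status values
        metric = prefix + name.lower()
        lines.append(f"# TYPE {metric} untyped")
        lines.append(f"{metric} {number}")
    return "".join(line + "\n" for line in lines)


# Hypothetical rows, shaped like the output of a status query.
rows = [("Threads_connected", "12"), ("Uptime", "3600"), ("Ssl_cipher", "")]
exposition = to_exposition(rows)
```

A real exporter would run next to the database, refresh these values on every scrape, and serve the result at its own /metrics endpoint.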
https://www.replex.io/blog/kubernetes-in-production-the-ultimate-guide-to-monitoring-resource-metrics-with-grafana
Setting up Grafana
Grafana is installed as part of the Prometheus operator Helm chart.
Install the Prometheus operator:
helm install --name prom-operator stable/prometheus-operator --namespace monitoring
This will install the Prometheus operator in the namespace monitoring.
You can see the Grafana instance running in this namespace using:
kubectl --namespace monitoring get pods