12 Immutable Rules For O11y
Introduction
Speed defines success in today's digital economy. With customers expecting flawless digital experiences and competition hovering just a click away, companies turn to cloud-native technologies like microservices, containers and Kubernetes to accelerate innovation, build applications faster and improve performance. However, moving to cloud-native technologies and distributed architectures introduces new challenges around the speed, scale and complexity of data, challenges that traditional monitoring solutions simply weren't designed to handle.

This is where observability comes in.

Organizations must deliver high quality code and differentiated user experiences, fast. Observability lets DevOps and SRE teams understand and explain unexpected behavior, so they can effectively and proactively manage the performance of distributed microservices running on ephemeral infrastructure. The right observability strategy and solution translates into more reliability, better customer experience and higher team productivity. The more observable a system is, the quicker we can understand why it's acting up and fix it, which is critical for meeting service-level indicators (SLIs) and objectives (SLOs) and, ultimately, accelerating business results.

Here, we provide 12 immutable rules for observability for ongoing success, no matter the complexity of your environment.
The only way to troubleshoot a needle-in-a-haystack unknown failure condition and optimize an application's behavior is to instrument and collect all the data about your environment at full fidelity and without sampling. It's the only way to guarantee no visibility gaps: you have the data you need, when you need it.

Service-oriented distributed architectures create more complex cross-service-boundary interactions, dependencies and error propagation that result in highly unpredictable systems with a long tail of infrequent but severe issues. Traditional observability solutions can fall short for monitoring microservices-based applications because they use head-based probabilistic sampling, which randomly samples traces and often misses the ones that you care about (unique transactions, anomalies, outliers, etc.).

When assessing observability solutions, look for those that do not sample and that retain all your traces (you can decide which traces you want to keep), and that populate dashboards, service maps and trace navigations with meaningful information that will actually help you monitor and troubleshoot your application.
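To see why head-based sampling tends to miss the traces that matter, consider this minimal simulation. The error and sample rates are arbitrary illustrative values, not drawn from any real system:

```python
import random

random.seed(42)

TOTAL_TRACES = 100_000
ERROR_RATE = 0.0005      # 0.05% of traces fail: the long tail of severe issues
SAMPLE_RATE = 0.01       # typical head-based sampling: keep 1% of traces

# Head-based sampling decides whether to keep a trace at its start,
# before anyone knows whether that trace will contain an error.
total_errors = 0
kept_errors = 0
for _ in range(TOTAL_TRACES):
    is_error = random.random() < ERROR_RATE
    sampled = random.random() < SAMPLE_RATE   # decision is blind to the outcome
    total_errors += is_error
    kept_errors += is_error and sampled

print(f"error traces produced: {total_errors}")
print(f"error traces retained: {kept_errors}")
# With roughly 50 error traces and a 1% sample rate, you typically retain
# zero or one of them, so the failures you most need to debug are missing.
```

Tail-based sampling (or no sampling at all) avoids this blindness because the keep/drop decision can be made after the trace's outcome is known.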
Different use cases with varying degrees of criticality require different resolutions. The resolution at which you collect data from your monolithic application will most likely not be sufficient as you start to collect data from more dynamic microservices running on ephemeral containers and serverless functions. For instance, if you're measuring the performance of a mature monolithic application running on over-provisioned virtual machines (VMs) with a relatively consistent number of users, it may be sufficient to have relatively coarse visibility (minute-level resolution) into your infrastructure. On the other hand, if you have microservices running on short-lived, Kubernetes-orchestrated containers that spin up and down automatically in minutes, or serverless functions that instantiate for only seconds, you'll need much finer granularity (one-second resolution) to effectively monitor the performance of your application and infrastructure.

As you begin to adopt microservices, it's a good idea to err on the side of higher resolution rather than lower, since re-architecting an application or building a net new application natively in the cloud can often involve trial and error.

In other words, you need observability that operates on the same time scale as your software-defined, ephemeral infrastructure.
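A small numeric sketch shows why minute-level resolution hides short-lived behavior. The CPU figures below are made up for illustration:

```python
# Illustrative only: a 10-second CPU burst from a short-lived container,
# observed at one-second vs. one-minute resolution.

# One-second samples over a minute: idle at 5%, except a 10s burst at 95%.
one_second = [5.0] * 60
for t in range(20, 30):          # the container's burst window
    one_second[t] = 95.0

# Minute-level resolution stores only the average over the whole minute.
one_minute = sum(one_second) / len(one_second)

print(f"peak at 1s resolution:  {max(one_second):.0f}%")   # 95%
print(f"value at 1m resolution: {one_minute:.0f}%")        # 20%
# The minute-level rollup reports 20% CPU; the saturation event is invisible.
```

A container that lives for less than the collection interval can come and go without ever appearing in your charts, which is why the collection interval should be shorter than the lifetime of the things you monitor.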
Don't make teams repeat forensic steps: context and unified metrics, traces and logs are essential to a troubleshooting workflow.
Observability is not just for operations; it should start during development.

Shifting left, which means starting DevOps processes like testing earlier in the pipeline, has become a popular strategy for teams working to find and fix issues sooner. Shifting observability left means bringing synthetic monitoring, analysis of real-user transactions, log analytics and metrics tracking into the pipeline, so teams can understand the state of their code from development through deployment. This understanding offers the depth that teams need to gain visibility into the state of each release across the development life cycle.
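One way to shift observability left is to run a synthetic check as a CI gate. The sketch below assumes a hypothetical health endpoint and latency budget; the helper names are our own, not from any particular tool:

```python
import time
from urllib.request import urlopen

def evaluate_probe(status: int, latency_s: float, max_latency_s: float = 1.0) -> bool:
    """A synthetic probe passes only if the endpoint returned 200 within budget."""
    return status == 200 and latency_s <= max_latency_s

def synthetic_check(url: str, timeout_s: float = 5.0) -> tuple[int, float]:
    """Probe an endpoint the way a user's browser would, measuring latency."""
    start = time.monotonic()
    with urlopen(url, timeout=timeout_s) as resp:
        status = resp.status
    return status, time.monotonic() - start

# In CI (URL is a placeholder for your own service's staging health endpoint):
# status, latency = synthetic_check("https://staging.example.com/health")
# assert evaluate_probe(status, latency), "synthetic check failed: block the release"
```

Running the same probe against staging before deployment and against production afterwards gives you a like-for-like view of each release.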
Reliability is critical when it comes to delivering high-performing applications and flawless customer experiences, but you can't do reliability without observability. Without observability, how do you know where to invest time and resources? If you're not measuring availability, how do you know your reliability? If you're not measuring performance, how do you know how well you're really doing? These measurements need to go from development all the way through to production. In the data age, you need to know what's going on at every stage of delivery.

Observability provides a window beyond CPU utilization and basic metrics into every layer of the stack, as well as outputs like user experience, SLx performance and other key metrics tailored to your business needs. In cloud-native environments, small upticks in a service can spiral into increased latency, even for a specific customer.

It's important to understand the KPIs by which your business is measured and how the teams within your organization will consume the data. You'll be able to:

• Anticipate what dimensions your monitoring data needs to have.

• Correlate data throughout your stack, from the underlying infrastructure to your applications and microservices.

• Correlate data across your entire digital business.

As a simple example, consider an application running on AWS that has users around the world. In order to get a complete picture of the end-user experience in each geographical region, it will be important to be able to slice and dice users, microservices and infrastructure based on AWS region or availability zone. Knowing ahead of time what kind of metadata is required will help you set up your visualizations right the first time.
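The value of planning dimensions up front can be sketched in a few lines. The service name, regions and latency numbers below are invented for illustration:

```python
from collections import defaultdict
from statistics import mean

# Each measurement carries dimensions decided up front: service, AWS region, AZ.
measurements = [
    {"service": "checkout", "region": "us-east-1", "az": "us-east-1a", "latency_ms": 120},
    {"service": "checkout", "region": "us-east-1", "az": "us-east-1b", "latency_ms": 135},
    {"service": "checkout", "region": "eu-west-1", "az": "eu-west-1a", "latency_ms": 480},
    {"service": "checkout", "region": "eu-west-1", "az": "eu-west-1b", "latency_ms": 510},
]

# Because region was recorded as a dimension, slicing by it is trivial.
by_region = defaultdict(list)
for m in measurements:
    by_region[m["region"]].append(m["latency_ms"])

for region, values in sorted(by_region.items()):
    print(f"{region}: avg latency {mean(values):.0f} ms")
# eu-west-1 averages 495 ms vs ~128 ms in us-east-1: a regional regression
# that the global average (~311 ms) would have hidden.
```

If the region tag had not been attached at collection time, no amount of dashboard work could recover this breakdown afterwards.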
While incidents may be inevitable, a strong observability solution can mitigate downtime and, perhaps, prevent it entirely, saving businesses money and improving the quality of life for on-call engineers. But most organizations haven't realized that they control their preparation for recovery. To respond to and resolve issues quickly (especially in a high-velocity deployment environment), you'll need tools that facilitate efficient collaboration and speedy notification. Observability solutions should include automated incident response capabilities that engage the right expert on the right issue at the right time, all to significantly reduce downtime.

Other best practices to consider as you aim for observability:

• Evolve from tribal knowledge and hero-based incident resolution to standardized runbooks and knowledge bases. Providing all on-call engineers easy access to detailed context and suggestions about what has resolved similar issues in the past is a critical part of information sharing, collaboration and reducing MTTR.

• Allow seamless access to third-party tools (e.g., error tracking) via web links so that your on-call engineers can use them. This way, engineers can seamlessly carry the context of the incident to other systems and continue troubleshooting without missing a beat.
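Carrying context across tools often comes down to building deep links that embed the incident's identifiers. A minimal sketch, assuming hypothetical URL patterns (substitute the link formats of your own tracing, logging and error-tracking tools):

```python
from urllib.parse import urlencode

def build_incident_links(trace_id: str, service: str, start_ms: int, end_ms: int) -> dict:
    """Build deep links that carry incident context into other tools.

    All hostnames and URL patterns here are placeholders for illustration.
    """
    window = {"from": start_ms, "to": end_ms}
    return {
        "trace": f"https://tracing.example.com/trace/{trace_id}",
        "logs": "https://logs.example.com/search?"
                + urlencode({"query": f'trace_id:"{trace_id}"', **window}),
        "errors": "https://errors.example.com/issues?"
                  + urlencode({"service": service, **window}),
    }

# Attach these links to the page/alert notification so the on-call engineer
# lands in each tool already scoped to the right trace, service and window.
links = build_incident_links("4bf92f3577b34da6", "checkout", 1615000000000, 1615000900000)
for name, url in links.items():
    print(f"{name}: {url}")
```

Because every link encodes the same trace ID and time window, the engineer never has to repeat the forensic step of re-finding the incident in each tool.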
Splunk, Splunk>, Data-to-Everything, D2E and Turn Data Into Doing are trademarks and registered trademarks of Splunk Inc. in the United States and
other countries. All other brand names, product names or trademarks belong to their respective owners. © 2021 Splunk Inc. All rights reserved.
21-17608-SPLK-12-Immutable-Rules-for-O11y-108