
Observability Demystified
Cory Watson
Technical Evangelist | Splunk
© 2019 SPLUNK INC.

Forward-Looking Statements
During the course of this presentation, we may make forward-looking statements regarding future events or
the expected performance of the company. We caution you that such statements reflect our current
expectations and estimates based on factors currently known to us and that actual events or results could
differ materially. For important factors that may cause actual results to differ from those contained in our
forward-looking statements, please review our filings with the SEC.

The forward-looking statements made in this presentation are being made as of the time and date of its live
presentation. If reviewed after its live presentation, this presentation may not contain current or accurate
information. We do not assume any obligation to update any forward-looking statements we may make. In
addition, any information about our roadmap outlines our general product direction and is subject to change
at any time without notice. It is for informational purposes only and shall not be incorporated into any contract
or other commitment. Splunk undertakes no obligation either to develop the features or functionality
described or to include any such feature or functionality in a future release.
Splunk, Splunk>, Listen to Your Data, The Engine for Machine Data, Splunk Cloud, Splunk Light and SPL are trademarks and registered trademarks of Splunk Inc. in the
United States and other countries. All other brand names, product names, or trademarks belong to their respective owners. © 2019 Splunk Inc. All rights reserved.

Cory Watson
Why listen to me?
▶ Remote, from Nashville, TN
▶ Technical Director, Office of the CTO @ SignalFx
▶ Principal Engineer, Observability Lead @ Stripe
▶ SRE, Engineering Manager, Observability @ Twitter
▶ 7 years in observability, > 20 years experience

But first, Complexity


Why do we observe?

Making Things
With the best of intentions!

▶ You make a simple thing

▶ Dependencies get added

▶ Customers ask for more

▶ Scale and resiliency are needed

Credit: Ava Sol



Complexity Happens
That escalated quickly

▶ Diverse technologies

▶ Interaction of protections

▶ More humans involved

Credit: Linh Ha

A Good Paper On Complexity


Seriously, go read it.

How Complex Systems Fail


Richard I. Cook, MD

“Complex systems are intrinsically hazardous…”
“Human practitioners are the adaptable element of complex systems”
“Failure free operations require experience with failure”

Key Takeaways
1. Complexity is inevitable if you are succeeding
2. Failure is inevitable
3. People are safety generators
4. People learn with practice

Now, Observability
What is observability?

“Observability helps you understand what your systems are doing and why.”
For when things get weird.

Monitoring vs Observability
Monitoring is a subset

Monitoring
Knowns

Known Unknowns

Unknown Unknowns

Observability

The “Three Pillars” of Observability


Most common components

Logs Metrics Traces



No pillars, only inputs


Season to taste

Events
Maybe logs? Definitely actions.

Tools
What you’ll actually use

▶ Monitoring
▶ Dashboards
▶ Rules, Runbooks and Processes
▶ Other activities

Key Takeaways
1. Visibility into systems
2. Logs, events, metrics, traces
3. Tools make the data accessible
4. Design and investment are needed
5. Like 9s or the speed of light, can only be approached

Digression, Risk
Everything is terrifying

Systems Change
Complexity is added, etc
▶ Growth
▶ New operators
▶ New customers
▶ Improvements
▶ etc

“All practitioner actions are gambles”
Richard I. Cook, MD

100% Success Is Not The Goal


Shoot your shot

▶ Failure is intrinsic

▶ Complexity added, actions taken

▶ Some systems warrant risk

▶ Process and practice help

Credit: George Sultan



Key Takeaways
1. It’s all risky
2. Reward comes from risk
3. Since we will fail, we must mitigate
4. System investment requires thinking of risk and reward

Ok, Deployment
How do we do this?

Common Advice
Applicable to all the things
▶ Use what you have
▶ Leverage common frameworks, libraries, middleware
▶ Publish conventions and guidance
▶ Monitor usage and control output
▶ Offer help

Logs
Low effort, high volume

▶ Consider as a product
▶ Structure, organize
▶ Add levels and criticality
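
One way to read "structure, organize, add levels": emit each log line as a JSON object with an explicit level. The sketch below uses only Python's standard `logging` module. The `JsonFormatter` class, the `build_logger` helper, and the `fields` key are hypothetical names chosen for illustration, not something the talk prescribes.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Structured context passed as extra={"fields": {...}} is merged in.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines at INFO and above."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)  # stream=None means stderr
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

Usage: `build_logger("checkout").warning("payment retry", extra={"fields": {"customer_id": "c-42"}})` writes a single JSON line that downstream tools can filter by level or field.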

Credit: Khari Hayden



Metrics
Medium effort, medium volume

▶ Relate to logging
▶ Use common patterns like RED and USE
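
RED stands for Rate, Errors, Duration (per request-serving endpoint). USE stands for Utilization, Saturation, Errors (per resource). A minimal in-process sketch of the RED side follows, assuming one recorder per endpoint and a nearest-rank 95th percentile. The `RedMetrics` class and its field names are illustrative only; a real deployment would normally use its metrics library instead.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RedMetrics:
    """Rate, Errors, Duration for a single endpoint over one window."""

    durations_ms: List[float] = field(default_factory=list)
    errors: int = 0

    def observe(self, duration_ms: float, ok: bool) -> None:
        """Record one finished request."""
        self.durations_ms.append(duration_ms)
        if not ok:
            self.errors += 1

    def summary(self, window_seconds: float) -> dict:
        """Summarize the window as rate, error ratio, and p95 duration."""
        count = len(self.durations_ms)
        p95: Optional[float] = None
        if count:
            ordered = sorted(self.durations_ms)
            # Nearest-rank percentile: 1-based rank ceil(0.95 * count),
            # computed with integers to avoid float rounding.
            rank = (95 * count + 99) // 100
            p95 = ordered[rank - 1]
        return {
            "rate_per_s": count / window_seconds,
            "error_ratio": self.errors / count if count else 0.0,
            "p95_ms": p95,
        }
```

Keeping the same three numbers for every service is what makes dashboards and alerts comparable across teams.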

Credit: Marina Hinic



Tracing
High effort, maximum awareness

▶ Enrich with cool data (SQL queries, customers, etc)
▶ Consider sampling, help with determinism
▶ Leverage for budgets (time and resources)
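
One reading of "sampling, help with determinism": decide whether to keep a trace from a hash of its trace ID, so every service that sees the same ID makes the same keep-or-drop decision. The sketch below assumes string trace IDs and a ratio-style sample rate. `should_sample` is a hypothetical helper, not the API of any tracing library.

```python
import hashlib


def should_sample(trace_id: str, sample_rate: float) -> bool:
    """Deterministically keep roughly `sample_rate` of all trace IDs."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```

Because the decision depends only on the ID, a trace is either kept end to end or dropped end to end, which avoids partial traces.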

Control Rate Events


High effort, immense value

▶ High correlation
▶ CI/CD Pipeline
▶ Feature flags
▶ Internal company tools
▶ ???
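
A rough sketch of why such events correlate so well: if deploys and flag flips are recorded with a timestamp, "what changed just before this alert?" becomes a simple lookup. `ChangeEvent`, `events_near`, and the 15-minute default window are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class ChangeEvent:
    source: str    # e.g. "ci/cd" or "feature-flag"
    summary: str   # human-readable description of the change
    at: datetime   # when the change took effect


def events_near(
    events: Iterable[ChangeEvent],
    moment: datetime,
    window: timedelta = timedelta(minutes=15),
) -> List[ChangeEvent]:
    """Return events from the `window` leading up to `moment`, newest first."""
    hits = [e for e in events if timedelta(0) <= moment - e.at <= window]
    return sorted(hits, key=lambda e: e.at, reverse=True)
```

Overlaying the same events on charts gives responders the most recent changes next to the symptom.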

Congrats, you’ve got data!


Now to use it.
▶ This is designed, got IA people?
▶ Consider convention over configuration
▶ Offer “products” to your teams
▶ Incentivize use

Visualize
Descend as needed
Levels: High Level Indicators, More Specific Assets, Ad Hoc Investigation

▶ Can’t show them everything!
▶ Start with RED, USE, etc
▶ Lead user to next steps
▶ Jump to ad-hoc when needed

Dashboard Design
More than just boxes

Chart Design
Every chart is a story

Deep Dive / Ad Hoc


Making everything available is impossible

“Change introduces new forms of failure.”
Richard I. Cook, MD

Monitor Symptoms
Causes Change
Stages: Flag Unsafe Situations, Support Humans, Learn and Improve

▶ (Again) Start with RED, USE, etc
▶ Deal with actionable problems
▶ Give aid to human, support adaptability (RTO, SLA, comms)
▶ Learn and improve

Key Takeaways
1. Rollout & tools need design
2. Use one or more of the inputs
3. Think about your resulting products
4. Incentivize the “paved roads”

Wait, Learning
Some guidance on that

Teach
Get the basics

▶ Rarely a skill we hire for or test


▶ Add to your curriculum
▶ Gets snippets in brains
▶ Not a substitute for practice

Practice
Humans learn by doing

▶ Use often
▶ Gamedays, chaos, experiments
▶ Hypotheses
▶ After-action, etc

Credit: Mídia

Failure
Humans do it too

▶ Things will go wrong


▶ Blameless is good
▶ Much to learn

“Catastrophe requires multiple failures


— single point failures are not
enough.”
Richard I. Cook, MD

Learning
Diverse Studies

▶ Diverse cases
▶ Successes and failures
▶ Halo and horn effects
▶ Never stop

Key Takeaways
1. We can teach the basics
2. Growth requires practice and reward
3. Organizations can learn too
4. Examine successes and failures

Wrap Up
Some guidance on that

I believe in you!
This is an investment in people.

Value You Can Expect


Measure and monitor these

▶ MTTD: Detect problems faster
▶ MTTR: Remedy problems faster
▶ Impact: Decrease impact of problems
▶ Quality: Ship faster, higher quality
▶ Happier customers and employees
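
To measure the first two, incident timestamps are enough. The sketch below assumes MTTD is the mean time from start to detection and MTTR is the mean time from detection to resolution. Definitions vary between organizations, and the `Incident` record is hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class Incident:
    started: datetime    # when the problem began
    detected: datetime   # when monitoring or a human noticed it
    resolved: datetime   # when the impact ended


def mttd(incidents: Sequence[Incident]) -> timedelta:
    """Mean time to detect: average of (detected - started)."""
    return timedelta(
        seconds=mean((i.detected - i.started).total_seconds() for i in incidents)
    )


def mttr(incidents: Sequence[Incident]) -> timedelta:
    """Mean time to remedy: average of (resolved - detected)."""
    return timedelta(
        seconds=mean((i.resolved - i.detected).total_seconds() for i in incidents)
    )
```

Both functions raise `statistics.StatisticsError` on an empty list, which is a reasonable signal that there is nothing to report yet.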

Key Takeaways
1. Observability is a trait of your systems, for when things get weird.
2. Standards, processes, and incentives all help.
3. Focus on valuable, risky areas first.
4. Needs to be designed.
5. Context is key, practice is needed
6. Learn and repeat

Thank You!
Go to the .conf19 mobile app to rate this session
