

Table of Contents

HIGHLIGHTS AND INTRODUCTION

Welcome Letter
Lindsay Smith, Senior Publications Manager at DZone

About DZone Publications

DZONE RESEARCH

Key Research Findings: An Analysis of Results From DZone's 2022 Performance and Observability Survey
Sarah Davis, Guest Writer & Former Publications Editor at DZone

FROM THE COMMUNITY

Performance Engineering Powered by Machine Learning
Joana Carvalho, Performance Engineer at Postman

A Primer on Distributed Systems Observability
Boris Zaikin, Software & Cloud Architect at Nordcloud GmbH

A Deep Dive Into Distributed Tracing
Yitaek Hwang, Software Engineer at NYDIG

Building an Open-Source Observability Toolchain
Sudip Sengupta, Principal Architect & Technical Writer at Javelynn

Creating an SRE Practice: Why and How
Greg Leffler, Observability Practitioner & Director at Splunk

Learning From Failure With Blameless Postmortem Culture: How to Conduct an Effective Incident Retrospective
Alireza Chegini, DevOps Architect at Smartwyre

ADDITIONAL RESOURCES

Diving Deeper Into Performance and Site Reliability

Solutions Directory



Welcome Letter
By Lindsay Smith, Senior Publications Manager at DZone

A state-of-the-art, high-powered jet couldn't make it 10 feet off the ground without an adequate amount of fuel, assessment of landing gear, and everything else found on the FAA's preflight checklist. If one small piece of the larger aircraft engine is missing or broken, the entire plane cannot fly. Or worse, there are serious consequences should the proper procedure and testing not occur.

The same goes for software — the most sophisticated, cutting-edge application cannot run without the proper maintenance and monitoring processes in place. Application performance and site reliability have become the wings to testing, deployment, and maintenance of our software.

Because the performance and monitoring landscape continues to evolve, so must our approach and technology. In this report, we wanted to assess the topic of application performance but take it one level deeper; we wanted to further explore the world of site reliability and observability systems.

As tools and technologies guide us to identifying root causes and specific problems in our code like CPU thrashing and legacy code, we can better pilot and build resilient software. Enter observability: Observability means analyzing performance metrics and data to better understand your system's health. Observability is all about the why.

When a plane is not functioning properly, there are very specific steps in place to remove the jet from flight, further evaluate and assess the aircraft, and determine why it did not fly properly.

Observability is like the conglomerate of lights and metrics displayed on a pilot's dashboard — it relies on telemetry to gather and collect data automatically so that application performance is upheld in real time.

While this all sounds great, actual implementation proves more complicated, so we studied what this looks like across organizations by surveying the DZone audience.

Our findings were quite remarkable compared to last year's survey of our performance experts — a lot can change in a year.

We provided assessments on application performance, site reliability, and observability for distributed systems in our contributor insights, covering everything from patterns for building distributed systems to developing your own SRE program. As always, DZone partners with an exceptional group of expert contributors who really put these topics and findings into perspective.

So, as you prepare your application for takeoff, make your preflight checks, and ensure cabins are clear, be sure you assess your dashboard and check telemetry data to ensure a smooth flight, and finally, let the following pages guide you towards your dream performance destination.

Enjoy the flight,
Lindsay Smith



ABOUT

DZone Publications
Meet the DZone Publications team! Publishing DZone Refcards and Trend Reports year-round, this team can often be found reviewing and editing contributor pieces, working with authors and sponsors, and coordinating with designers. Part of their everyday includes collaborating across DZone's Production team to deliver high-quality content to the DZone community.

DZone Mission Statement

At DZone, we foster a collaborative environment that empowers developers and tech professionals to share knowledge, build skills, and solve problems through content, code, and community. We thoughtfully — and with intention — challenge the status quo and value diverse perspectives so that, as one, we can inspire positive change through technology.

Meet the Team

Caitlin Candelmo, Director, Content Products at DZone
@CCandelmo on DZone | @caitlincandelmo on LinkedIn

Caitlin works with her team to develop and execute a vision for DZone's content strategy as it pertains to DZone Publications, Content, and Community. For Publications, Caitlin oversees the creation and publication of all DZone Trend Reports and Refcards. She helps with topic selection and outline creation to ensure that the publications released are highly curated and appeal to our developer audience. Outside of DZone, Caitlin enjoys running, DIYing, living near the beach, and exploring new restaurants near her home.

Lauren Forbes, Content Strategy Manager at DZone
@laurenf on DZone | @laurenforbes26 on LinkedIn

Lauren identifies and implements areas of improvement when it comes to authorship, article quality, content coverage, and sponsored content on DZone.com. She also oversees our team of contract editors, which includes recruiting, training, managing, and fostering an efficient and collaborative work environment. When not working, Lauren enjoys playing with her cats, Stella and Louie, reading, and playing video games.

Melissa Habit, Senior Publications Manager at DZone
@dzone_melissah on DZone | @melissahabit on LinkedIn

Melissa leads the publication lifecycles of Trend Reports and Refcards — from overseeing workflows, research, and design to collaborating with authors on content creation and reviews. Focused on overall Publications operations and branding, she works cross-functionally to help foster an engaging learning experience for DZone readers. At home, Melissa passes the days reading, knitting, and adoring her cats, Bean and Whitney.

Lucy Marcum, Publications Coordinator at DZone
@LucyMarcum on DZone | @lucy-marcum on LinkedIn

Lucy manages the Trend Report author experience, from sourcing new contributors to editing their articles for publication, and she creates different Trend Report components such as the Diving Deeper and the Solutions Directory. In addition, she assists with the author sourcing for and editing of Refcards. Outside of work, Lucy spends her time reading, writing, running, and trying to keep her cat, Olive, out of trouble.

Lindsay Smith, Senior Publications Manager at DZone
@DZone_LindsayS on DZone | @lindsaynicolesmith on LinkedIn

Lindsay oversees the Publication lifecycles end to end, delivering impactful content to DZone's global developer audience. Assessing Publications strategies across Trend Report and Refcard topics, contributor content, and sponsored materials — she works with both DZone authors and Sponsors. In her free time, Lindsay enjoys reading, biking, and walking her dog, Scout.



ORIGINAL RESEARCH

Key Research Findings


An Analysis of Results from DZone's 2022 Performance
and Observability Survey

By Sarah Davis, Guest Writer & Former Publications Editor at DZone

In September and October of 2022, DZone surveyed software developers, architects, and other IT professionals in order to
understand how applications are being designed, released, and tuned for performance. We also sought to explore trends from
data gathered in previous Trend Report surveys, most notably for our 2021 report on application performance management.
This year, we added new questions to our survey in order to gain more of an understanding around monitoring and
observability in system and web application performance.

Major research targets were:


1. Understanding the root causes of common performance issues and the emerging use of artificial intelligence (AI)

2. The monitoring and observability landscape beyond measuring performance

3. Measuring performance and the evolution of self-healing techniques

Methods: We created a survey and distributed it to a global audience of software professionals. Question formats included
multiple choice, free response, and ranking. Survey links were distributed via email to an opt-in subscriber list, popups on
DZone.com, the DZone Core Slack workspace, and various DZone social media channels. The survey was open from September
13–October 3, 2022 and recorded 292 full and partial responses.

In this report, we review some of our key research findings. Many secondary findings of interest are not included here.

Research Target One: Understanding the Root Causes of Common Performance Issues
and Their Solutions
Motivations:
1. Discovering the causes and correlations around the most prevalent system and web application performance
issues and general developer attitudes toward common problems

2. Understanding how AI helps developers and site reliability engineers (SREs) discover and understand web
performance degradation

REASONS FOR WEB PERFORMANCE ISSUES AND GENERAL SURROUNDING ATTITUDES


Solving performance issues is the end goal for any performance architect and SRE when problems arise — but without
understanding why the issues occur, they're bound to occur again, bringing frustration to developers, clients, and end users
alike. However, for a significant portion of developers, that doesn't always happen. In fact, 36.9% of respondents said that at least
50% of the time, they solve performance and site reliability problems without identifying the root cause to their satisfaction.

A major factor in having the ability to understand the root cause of performance issues is having the right tools in place to do
so (or tools to begin with). Only around two in five (39.4%) of respondents said their organization has implemented observability
and can quickly establish root cause and understand business impact. Without the organization supporting a DevOps team's
efforts to optimize performance with the tools and technologies to help the team discover why issues occur, many developers
are left frustrated with the lack of answers they're able to get — and the issues are that much more likely to happen again.

Of course, most developers do have an idea about what the underlying causes of their web performance issues are to some
extent. To discover common causes and correlations around the most prevalent web performance issues, we asked the
following question:

How often have you encountered the following root causes of web performance degradation?



Results:

Figure 1

FREQUENCY OF WEB PERFORMANCE DEGRADATION ROOT CAUSES

Often Sometimes Rarely Never

CPU thrashing 26.4% 56.2% 15.5% 1.9%

Database reorganization 21.6% 34.0% 39.8% 4.6%

Deadlocks or thread starvation 14.1% 44.9% 37.9% 3.1%

Excessive algorithmic complexity due to bad code  23.0% 44.4% 28.2% 4.4%

Garbage collection 17.4% 48.2% 30.8% 3.6%

Geographic location lag   17.7% 33.1% 34.3% 15.0%

High CPU load 19.8% 50.6% 26.8% 2.7%

I/O bottleneck 20.2% 41.6% 35.0% 3.1%

Load balancing lag 15.3% 38.8% 39.2% 6.7%

Log rotation batch 18.0% 34.4% 39.6% 8.0%

Memory exhausted (paging) 14.7% 44.6% 36.7% 4.0%

Misuse of language features 16.7% 38.9% 33.7% 10.7%

Network backup  12.4% 43.8% 34.3% 9.6%

Network bottleneck  18.7% 45.1% 29.7% 6.5%

Selective/rolling deployment lag  10.8% 38.6% 41.4% 9.2%

Slow disk I/O due to bad code or configuration 20.6% 33.5% 41.5% 4.4%

Too many disk I/O operations due to bad code or configuration 17.4% 46.2% 32.4% 4.0%

Observations:
1. CPU thrashing is the most common reason for declines in web performance, with 26.4% of respondents saying they've
encountered this as a root cause of web performance degradation. In 2021, however, only 15.7% said CPU thrashing
is often the root cause. Last year, respondents said that they encountered high CPU load, memory exhaustion, I/O or
network bottlenecks, or excessive disk I/O operations or algorithmic complexity due to bad code all more often than CPU
thrashing. Last year, 35.3% of respondents said CPU thrashing is rarely the cause of their web performance degradation;
this year, that number was just 15.5%.

Why did that number jump so sharply? Some developers may be misusing the resources at their disposal, although
through no fault of their own. CPU thrashing can indicate a misuse of resources since the computer's memory becomes
overwhelmed. As users and developers move forward into a world where online performance has spiked in importance
due to the growth of digital, it will only become more crucial for servers to be able to handle a high level of complexity.

2. Bad code is another common root cause of web performance degradation. 67.4% of respondents said that excessive
algorithmic complexity due to bad code is often or sometimes the root cause of web performance degradation,
compared to just 61.2% last year.

As algorithms become more complicated, ensuring that your code is accurate and optimized will be critical to
maintaining optimal web performance. Last year, just 12.2% of developers said that slow disk I/O due to bad code
or configuration was often the root cause of their web performance degradation. This year, that number has
grown to 20.6%.



3. As seen in Table 1 below, we also asked developers about the factors they often blame for poor performance. The top
factor that respondents blamed poor performance on is database misconfiguration, followed by bad code that others
wrote, insufficient memory, and network issues. Developers are more likely to blame performance issues on bad code
that others wrote compared to bad code that they wrote themselves.

We asked:

In relation to the software you have worked on, rank the following factors of poor performance from those you blame most (top) to blame least (bottom).

Results are noted in Table 1.

Table 1

HIGH-LEVEL CAUSES OF POOR SOFTWARE PERFORMANCE

Factor Overall Rank Score n=

Database misconfiguration 1 1,193 159

Bad code that others wrote 2 1,090 148

Insufficient memory 3 1,055 148

Network issues 4 983 148

Slow CPU 5 938 150

Bad code that I wrote 6 913 140

Slow disk read/write 7 806 148

Slow GPU 8 688 132

Slow I/O 9 662 138

Other 10 231 113

AI'S ROLE IN DETECTING AND SOLVING PERFORMANCE ISSUES

Data is the most logical and accurate crux that businesses have to lean on, and AI can be a tremendous support in making the analysis of its data more efficient — in addition to AI's ability to uncover issues and insights that teams may not have been able to glean without it, at least not without spending a significant number of hours doing so.

As the capabilities of artificial intelligence continue to expand, entirely new ways of understanding and tracking performance are being introduced.


For developers who work on web and application performance, AI can be a great boon in not only preventing issues from
occurring, but also accurately understanding why they occur in the first place.

As we'll explore in a later section, different organizations operate at varied levels of observational maturity as far as being
able to understand why performance issues occur. However, most organizations that leverage AI to assist their performance
optimization efforts aren't proactively using the capabilities. In fact, only 3.5% of developers said their organization leverages
AIOps to proactively prevent issues.

We wanted to understand what the organizations that do utilize AI for performance are using it for. Specifically, we wanted to
know how artificial intelligence is being leveraged for highly accurate monitoring and observability purposes. So we asked:

In what ways is your organization adopting AI for monitoring and observability? Select all that apply.

Results:

Figure 2

WHY ORGANIZATIONS ADOPT AI FOR MONITORING AND OBSERVABILITY

(Bar chart. Response options: anomaly detection, causality determination, correlation and contextualization, historical analysis, performance analysis, and "my organization doesn't use AI for monitoring and observability.")



Observations:
1. While only 3.5% of developers use AIOps to proactively prevent issues, 84.1% said that their organization has adopted
AI for monitoring and observability. Adding AI to your monitoring and observability toolbox can help carry your
organization light years forward in terms of optimal web performance, and keep you steps ahead of any issues that
may occur. In fact, more than one in four (26.4%) of respondents said that they use AI-based anomaly detection for
monitoring and observability.

2. It's one thing to have data to analyze — it's another to know how to understand it. Most developers use AI after the fact
as a tool to analyze issues and determine why they occurred. That being said, it comes as no surprise that correlation and
contextualization make up the top way that developers use AI for monitoring and observability (37.2%), followed closely by
causality determination (36.8%).

3. Historical analysis, which can be an incredibly valuable tool to proactively work against performance issues (saving
organizations a lot of time, headaches, and money) comes not far behind. 34.5% said that their organization leverages AI
for historical analysis in relation to monitoring and observability, followed by performance analysis (32.2%).

Research Target Two: Beyond Measuring Performance — The Monitoring and Observability Landscape
Motivations:
1. Understanding how monitoring and the growing importance of issue observability are transforming the current state of
monitoring and observability

2. Exploring the tools and performance patterns that software professionals successfully implement for monitoring and
observability

3. Comprehending the most common solutions that developers and SREs implement for web and application performance

THE CURRENT STATE OF MONITORING AND OBSERVABILITY


Monitoring involves collecting and displaying data in order to spot issues, while observability takes things a step further by
analyzing that data to understand your system's health and why issues are happening. To gain clarity on where developers, in
general, are currently in terms of adopting observability practices, we asked:

How would you categorize your organization's state of observability maturity? We have implemented...

Results:

Table 2

LEVELS OF OBSERVABILITY MATURITY IN ORGANIZATIONS

We have implemented... % n=

Monitoring, and we know whether monitored components are working 25.1% 65

Observability, and we know why a component is working 30.9% 80

Observability, and we can quickly establish root cause and understand business impact 39.4% 102

Observability, and we use AIOps capabilities to proactively prevent occurrences of issues 3.5% 9

Other - write in 1.2% 3

Observations:
1. 73.8% of respondents said that they have implemented observability, while 25.1% are still in the monitoring stage.
Observability maturity will be crucial to organizational growth as the prevalence of performance issues continues to rise.

In another question, we asked developers whether their organization currently utilizes any observability tools. While
nearly three in four respondents said that they have implemented observability, just 51% said that their organization
does use observability tools. According to our research, 32% of organizations are currently in the observability tool
consideration stage.



2. According to another question we asked in the survey, a majority (51.8%) of respondents said that at least 50% of the
time, they solve performance and site reliability problems later than they would prefer to. Observability tools can save
developers — and, thus, entire organizations — lots of valuable time.

Opposite of a world where developers must spend hours and hours diving into a performance issue to identify the
root cause and understand why a component isn't working, implementing an observability tool can help teams
quickly tackle these tasks to prevent future occurrences instead of continuing to experience the same issue without
a known cause.

3. As far as the specific tools that organizations use, of the respondents who plan to or currently leverage observability
tools, the majority (81.7%) use as-a-Service tools as opposed to self-managed. Most run observability tools as a service
in a private cloud (31.0%), followed by on-premises (26.8%) and in a public cloud (23.9%). We see similar trends among
respondents whose tools are self-managed, with the majority of these users running their tool on-premises (11.7%)
followed by private and public clouds, each at 3.3%.

TOOLS AND PERFORMANCE PATTERNS SUCCESSFULLY IMPLEMENTED FOR MONITORING AND OBSERVABILITY
Logs, metrics, and traces are the three pillars of observability. Observability tools use algorithms to understand the relationships
between all parts of an organization's IT infrastructure, providing insight into their health and detecting any abnormalities that
may arise. But being able to leverage tools effectively to improve your web and app performance means picking the correct
tool for your needs.

In order to understand what functionality is most important to developers in choosing an observability tool, we asked:

When it comes to selecting observability tools, rank the following capabilities from most important (top) to least
important (bottom).

Results:

Table 3

IMPORTANCE OF OBSERVABILITY TOOL CAPABILITIES

Capability Overall Rank Score n=

Integration with existing tools and/or stack 1 748 141

Maturity of features 2 692 139

Complexity of use 3 675 140

Open source 4 646 132

Cost 5 570 136

Ability to purchase commercial support 6 464 128

Other 7 118 95

Observations:
1. When it comes to observability tool adoption, it's crucial for developers that the tool can integrate with existing tools
and/or their current tech stack, which ranked as the number one capability overall.

2. According to our research, the tool's cost is the second-to-least important consideration, revealing that developers are
willing to pay a price to get the right tool. Other top capabilities include maturity of features and complexity of use.

Good things usually don't come easily, though. Developers face their fair share of challenges in adopting observability practices.
And in order to further understand the full scope of these challenges in relation to tool adoption, we also asked:

In your opinion, what are your organization's greatest challenges in adopting observability projects, tools, and practices?
Select all that apply.



Results:

Figure 3

TOP OBSERVABILITY ADOPTION CHALLENGES

(Bar chart. Challenges shown include complexity, lack of knowledge/skills, lack of leadership, limited resources, no clear strategy, organizational/team silos, teams using multiple tools, cost, and other write-in responses.)

Observations:
1. 43.4% of respondents named a lack of knowledge and skills as the greatest challenge for their organization in adopting
observability projects, tools, and practices. 39.1% said that they have limited resources to implement observability, and
34.9% said that a lack of leadership is a challenge.

2. 22.1% reported that teams using multiple tools is a challenge their organization faces when implementing observability.
Cross-departmental alignment — and even cross-dev team alignment — remains a key sticking point for many
organizations. This can be an issue not just for gaining a comprehensive understanding of a system's state but also for
organizational efficiency and avoiding double-dipping into budgets. More cross-team alignment about the tools being
used makes things easier for everyone. That 34.9% of respondents who said that lack of leadership is a challenge in their
organization's adoption of observability practices could play a part here.

3. According to another survey question that we asked, only 3.9% of respondents said that their organization doesn't
conduct observability work. The building and testing stages are the most common parts of the software development
lifecycle (SDLC) where developers conduct observability work (52.7% and 54.3%, respectively). At all other stages of the
SDLC, respondents reported conducting observability work during planning (30.6%), design (42.6%), deployment (35.3%),
and maintenance (26.7%).

For developers, observability can provide a valuable sense of relief. 45.5% of respondents said that observability helps them detect
hard-to-catch problems, 38.7% said it enables more reliable development, and 36% said it builds confidence. The benefits aren't
just emotional — nearly half (49.4%) reported that observability increases automation, 37.5% said it leads to more innovation, and
21.3% said it reduces cost. These benefits strongly outweigh the downsides of lackluster organizational alignment.

A community that understands the field of web performance and observability can be a great resource for developers. A Center
of Excellence (CoE) for performance and reliability can provide shared resources and highly valuable, exclusive insights from
developers who are subject matter experts in the field. 60.7% of respondents say their organization has a Center of Excellence
for performance and reliability.

THE MOST COMMON PERFORMANCE PATTERN SOLUTIONS


Looking at the most common performance patterns that developers and SREs implement can provide valuable insight into
whether your organization is aligned with, or even ahead of the curve compared to, the rest of the industry. On top of that,
however, exploring how trends with the most common solutions have changed over the past year can reveal interesting
insights into the state of the overall web performance space.



To gain those insights, we asked:

Which of the following software performance patterns have you implemented?

Definitions:

• Express Train – for some tasks, create an alternate path that does only the minimal required work (e.g., for data
fetches requiring maximum performance, create multiple DAOs — some enriched, some impoverished)

• Hard Sequence – enforce sequential completion of high-priority tasks, even if multiple threads are available (e.g., chain
Ajax calls enabling optional interactions only after minimal page load, even if later calls do not physically depend on
earlier returns)

• Batching – chunk similar tasks together to avoid spin-up/spin-down overhead (e.g., create a service that monitors a queue of vectorizable tasks and groups them into one vectorized process once some threshold count is reached); see the sketch after these definitions

• Interface Matching – for tasks commonly accessed through a particular interface, create a coarse-grained object that
combines anything required for tasks defined by the interface (e.g., for an e-commerce cart, create a CartDecorated
object that handles all cart-related calculations and initializes with all data required for these calculations)

• Copy-Merge – for tasks using the same physical resource, distribute multiple physical copies of the resource and
reconcile in a separate process if needed (e.g., database sharding)

• Calendar Control – when workload timing can be predicted, block predicted work times from schedulers so that
simultaneous demand does not exceed available resources
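
To make the batching definition above concrete, here is a minimal Python sketch (an illustration, not something from the survey; the queue threshold, the `process_batch` callable, and the `vectorized_insert` helper in the usage comment are all hypothetical):

```python
import queue
import threading

class Batcher:
    """Accumulates submitted items and flushes them as one batch once
    `threshold` items are queued or `max_wait` seconds pass, whichever
    comes first."""
    def __init__(self, process_batch, threshold=100, max_wait=0.5):
        self.process_batch = process_batch   # callable that takes a list of items
        self.threshold = threshold
        self.max_wait = max_wait
        self.items = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, item):
        self.items.put(item)

    def _run(self):
        while True:
            batch = [self.items.get()]          # block until the first item arrives
            try:
                while len(batch) < self.threshold:
                    batch.append(self.items.get(timeout=self.max_wait))
            except queue.Empty:
                pass                             # timed out: flush what we have
            self.process_batch(batch)            # one vectorized/bulk call per batch

# Usage (hypothetical): group individual writes into one bulk insert.
# batcher = Batcher(process_batch=vectorized_insert, threshold=500)
# batcher.submit(record)
```

The same shape works whether the batch is a vectorized computation, a bulk database write, or a grouped network call.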

Results, 2022:

Figure 4

IMPLEMENTATION OF SOFTWARE PERFORMANCE PATTERNS

Express Train: Often 23.2%, Sometimes 54.8%, Rarely 17.1%, Never 4.9% (n=263)

Hard Sequence: Often 22.8%, Sometimes 37.3%, Rarely 34.2%, Never 5.7% (n=263)

Batching: Often 17.9%, Sometimes 44.5%, Rarely 35.7%, Never 1.9% (n=263)

Interface Matching: Often 20.2%, Sometimes 43.7%, Rarely 30.4%, Never 5.7% (n=263)

Copy-Merge: Often 16.5%, Sometimes 47.7%, Rarely 26.9%, Never 8.8% (n=260)

Calendar Control: Often 20.2%, Sometimes 40.7%, Rarely 30.6%, Never 8.5% (n=258)

Results, comparing 2022 vs. 2021 shares of respondents who have implemented each performance pattern:

Table 4

IMPLEMENTATION OF ALL PERFORMANCE PATTERNS

Pattern 2022 2021 % Change

Express Train 95.1% 86.8% +8.3%

Hard Sequence 94.3% 83.9% +10.4%

Batching 98.1% 93.2% +4.9%

Interface Matching 94.3% 89.3% +5.0%

Copy-Merge 91.1% 82.9% +8.2%

Calendar Control 91.5% 83.4% +8.1%



Observations:
1. Developers' implementation of all pattern types increased in 2022. With the influx in online users brought on by the
COVID-19 pandemic, site reliability and speed have only increased in importance. Even batching, which nearly doubled in
the number of developers saying they implement it only rarely, increased in overall use by around 5%. Still, that was the
lowest increase among all performance patterns listed in the survey.

2. Batching is still the most common software performance pattern, with 98.1% of respondents saying they implement it.
However, just 62.4% of respondents said they implement batching often or sometimes compared to 78% saying they
implement express train often or sometimes. In fact, in 2022, 35.7% of respondents said they rarely implement batching
compared to express train at 17.1%. Last year, 75.1% of respondents said they implement batching often or sometimes, with
just 18.1% saying they did so rarely.

The nearly twofold increase in respondents saying they rarely implement batching may be because batching is one
of the broadest and least complex software performance patterns. It is similar to basic principles like data locality and
is one of the first patterns that most developers learn. Express train, on the other hand, involves creating an alternate
path that does only the minimal required work to achieve maximum performance.

3. The performance pattern that saw the largest increase in overall usage is hard sequence, which enforces the sequential
completion of high-priority tasks, even if multiple threads are available. The significant increase in this could be due
to hard sequence being a more efficient option. Hard sequence allows SREs to choose to force a process to happen
sequentially rather than submit tasks concurrently and let them complete in whatever order they finish on the server.

Research Target Three: Performance Measurement and Self-Healing


Motivations:
1. Understanding how developers and SREs implement self-healing for automatic recovery from application failure

2. Learning how developers and SREs measure and manage performance through service-level objectives (SLOs)

APPROACHES TO SELF-HEALING
Self-healing is important in modern software development for three primary reasons:

1. Even simple programs written in modern high-level languages can quickly explode in complexity beyond the
human ability to understand execution (or else formal verification would be trivial for most programs, which is
far from the case). This means that reliably doing computational work requires robustness against unimagined
execution sequences.

2. Core multiplication, branch prediction (and related machine-level optimizations), and horizontal worker scaling — often
over the internet — have borne the brunt of the modern extension of Moore's Law. This means that communication
channels have high cardinality, grow rapidly, may not be deterministic, and are likely to encounter unpredictable
partitioning. A lot of low-level work is handled invisibly to the programmer, which means that modern programs must
be able to respond increasingly well to situations that were not considered the application's own formal design.

3. Mobile and IoT devices are increasingly dominant platforms — and they are constantly losing radio connections to the
internet. As demands for high bandwidth require higher-frequency radio signals, network partitioning worsens and
becomes harder to hand-wave away.

We wanted to know what techniques software professionals are using for automatic recovery from application failure, so
we asked:

Which of the following approaches to self-healing have you implemented? Select all that apply.

Results:



Figure 5

IMPLEMENTATION OF SELF-HEALING TECHNIQUES

(Bar chart. Techniques listed: backoff, chaos engineer, circuit breaker, client throttling, compensating transactions, critical resource isolation, database snapshotting, failover, graceful degradation, leader election, long-running transaction checkpoints, queue-based load leveling, reserved healing layer, retry, and none.)

Observations:
1. In 2021, the top self-healing approach implemented by developers was retry, coming in at 60%. However, this year, that
number has shrunk tremendously to just 12.3%. Instead, the number one self-healing approach that developers have
implemented is client throttling (35.4%), followed closely by database snapshotting (35.0%).

2. Additionally, last year failover was named as a primary self-healing approach by 53.8% of developers, and circuit breaker was named by 46.5%. The main takeaway is that this year, results are much more spread out among approaches. Circuit breaker, client throttling, critical resource isolation, database snapshotting, and failover were all ranked between 30% and 35.4% and make up the top five responses.

As expectations grow for developers' ability to create programs that never fail (even without the full context of what
the program is supposed to do), methods for self-healing are all only growing in importance and prevalence, and it's
clear that developers must be open to trying new approaches to self-healing.
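
To ground two of the most commonly implemented approaches above, here is a minimal, illustrative Python sketch of retry with exponential backoff wrapped around a simple circuit breaker. It is a simplified assumption of how these patterns are typically coded, and `fetch_inventory` in the usage comment is a hypothetical downstream call:

```python
import random
import time

class CircuitOpenError(Exception):
    """Raised when the circuit breaker refuses to call the downstream service."""

class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, then fails fast
    until `reset_after` seconds have passed (half-open trial afterwards)."""
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit is open; failing fast")
            self.opened_at = None              # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0                      # a success closes the circuit again
        return result

def retry_with_backoff(fn, attempts=5, base_delay=0.2):
    """Retry `fn` with exponential backoff plus jitter between attempts."""
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise                              # don't hammer a circuit that is open
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))

# Usage (hypothetical downstream call):
# breaker = CircuitBreaker()
# retry_with_backoff(lambda: breaker.call(fetch_inventory, "sku-123"))
```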

PERFORMANCE MEASUREMENT AND SLOs


Implementing the right tools to prevent and detect performance issues is only one piece of the puzzle. To understand the
extent of your efforts, knowing the best way to gauge performance for your organization is crucial. In order to understand how
SREs measure performance, we asked:

Which of the following metrics does your organization use to measure performance? Select all that apply.

Results:



Figure 6

METRICS USED TO MEASURE PERFORMANCE

(Bar chart. Metrics listed: any garbage collection metric, Apdex score, average database query response time, average server response time, concurrent users, CPU usage, error rate, I/O response time, I/O access rate, longest running processes, maximum heap size/heap size distribution, number of autoscaled application instances, number of database queries, request queue size, request rate, time to first byte, uptime, user satisfaction, and other write-in responses.)

Observation: CPU usage is the most common measurement method, with 43.1% of respondents saying their organization
measures performance this way, followed by average server response time at 40.1%. 37.8% said they measure performance
based on average database query response time, and 36.3% said they do so based on the number of concurrent users. Just
14.2% of respondents said that their organization measures performance based on user satisfaction, and 24.7% based on error
rate. Monitoring is a critical way that developers identify issues, namely infrastructure monitoring, log monitoring, and end-
user monitoring.

Service level objectives are agreements that are made, for example, by a DevOps team to a business customer about specific
uptimes or response times that are expected to be maintained. SLOs can cause strife for many developers because they can
be vague, unrealistic, or difficult to measure. SLOs can be helpful, however, when they are tied directly to the business goal. To
learn how developers are defining success, we first asked:

Does your organization have any service-level objectives (SLOs)?

Results:



Figure 7

PREVALENCE OF ORGANIZATIONS USING SLOs

Yes: 33.1% | No: 57.3% | I don't know: 9.6%

Observation: The number of respondents who reported that their organization does not use SLOs has grown significantly over the past year. In fact, 38.7% of respondents to our 2021 survey said that their organization does not have SLOs, compared to 57.3% this year. Also significant to note is that last year, 31.5% of respondents said that they didn't know whether their organization had SLOs. Last year, results were nearly split evenly between respondents saying their organization does use (29.8%), their organization doesn't use (38.7%), or they don't know if their organization uses SLOs (31.5%). This indicated that perhaps SLOs didn't contribute to these respondents' decision-making.

However, this year, many of the respondents who previously indicated that they didn't know if their organization implemented
SLOs moved into the "does not use" bucket. This year, only 9.6% of respondents said that they don't know if their organization
has SLOs. This suggests that more developers may be aware of these performance tracking metrics, but perhaps the metrics
haven't proved to be as fruitful as they could be and, thus, aren't worth spending time on.

To understand exactly how success is measured according to organizations' SLOs, we then asked a free-response question:
"What SLOs does your organization have?" For many respondents, mean time to resolution (MTTR) was one of the top tracked
objectives. This is also the primary metric that organizations look at in order to define success. SLOs can often be project-
specific (as many of our respondents also noted), but it's refreshing to see the same metric remaining highly important across both clients and organizations.
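
As a worked example of the metric respondents cited most, MTTR is simply total repair time divided by the number of incidents. A small Python sketch with hypothetical incident timestamps:

```python
from datetime import datetime

# (detected, resolved) timestamps for three hypothetical incidents
incidents = [
    (datetime(2022, 9, 1, 10, 0), datetime(2022, 9, 1, 10, 45)),
    (datetime(2022, 9, 8, 14, 30), datetime(2022, 9, 8, 16, 0)),
    (datetime(2022, 9, 20, 3, 15), datetime(2022, 9, 20, 3, 40)),
]

repair_minutes = [(resolved - detected).total_seconds() / 60
                  for detected, resolved in incidents]

# MTTR = total repair time / number of incidents
mttr = sum(repair_minutes) / len(incidents)
print(f"MTTR: {mttr:.1f} minutes")   # (45 + 90 + 25) / 3 = 53.3 minutes
```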

Further, we wanted to understand not only the metrics that are important to customers but also the metrics that are
important to the organization providing the service. So we asked:

What metrics does your organization use to define success? This can be at the individual or team level. Select all that apply.

Results are noted in Table 5.

Table 5

METRICS USED TO MEASURE SUCCESS

Metric % n=

Frequency of deploys 34.5% 89

Number of incidents 40.3% 104

Mean time to resolve 64.0% 165

Mean time between failure 40.7% 105

Revenue 20.2% 52

We don't use any metrics 5.4% 14

Observation: MTTR is by far the number one metric that organizations use to define success, with 64% of respondents saying their organization measures it. From the end user's perspective, the most important factor aside from completely avoiding performance issues is to resolve them quickly. Mean time between failures falls second (40.7%), nearly tied with number of incidents (40.3%). In other words, failure frequency in general is a top metric that organizations track to determine success.

Frequency of deploys is still a top tracked metric, with 34.5% of respondents saying their organization uses this metric to define success. Organizations should be careful not to rely too heavily on this metric, as quantity shouldn't be favored over quality.



Future Research
Application performance is a vast topic that touches all aspects of software development and operations, requires expertise in
the most mathematically and physically oriented aspects of computer science and engineering, and often elicits philosophical
and ethical debates during technical decision-making processes — all while significantly impacting end users.

We have begun to address this topic in our research but should note that this survey included questions that we did not have
room to analyze here, including:

• The factors that developers most often blame for poor performance
• Self-managed vs. as-a-Service vs. cloud tool management
• How observability fits into the different stages of the software development lifecycle

We intend to analyze this data in future publications. Please contact publications@dzone.com if you would like to discuss any
of our findings or supplementary data.

Sarah Davis, Guest Writer & Former Publications Editor at DZone


@sarahdzone on DZone | @sarahcdavis on LinkedIn | @SarahDavis816 on Twitter

Sarah is a writer, researcher, and marketer who regularly analyzes survey data to provide valuable, easily
digestible insights to curious readers. She currently works in B2B content marketing but previously
worked for DZone for three years as an editor and publications manager. Outside of work, Sarah enjoys
being around nature, reading books, going to yoga classes, and spending time with her cat, Charlie.



CONTRIBUTOR INSIGHTS

Performance Engineering
Powered by Machine Learning
By Joana Carvalho, Performance Engineer at Postman

Software testing is straightforward — every input => known output. However, historically, a great deal of testing has been
guesswork. We create user journeys, estimate load and think time, run tests, and compare the current result with the baseline.
If we don't spot regressions, the build gets a thumbs up, and we move on. If there is a regression, back it goes. Most times,
we already know the output even though it needs to be better defined — less ambiguous with clear boundaries of where a
regression falls. Here is where machine learning (ML) systems and predictive analytics enter: to end ambiguity.

After tests finish, performance engineers do more than look at the result averages and means; they will look at percentages. For example, 10 percent of the slowest requests are caused by a system bug that creates a condition that always impacts speed.

We could manually correlate the properties available in the data; nevertheless, ML will link data properties quicker than you probably would. After determining the conditions that caused the 10 percent of bad requests, performance engineers can build test scenarios to reproduce the behavior. Running the test before and after the fix will assert that it's corrected.

Figure 1: Overall confidence in performance metrics (Source: Data from TechBeacon)

Performance With Machine Learning and Data Science

Machine learning helps software development evolve, making technology sturdier and better to meet users' needs in different domains and industries. We can expose
cause-effect patterns by feeding data from the pipeline and environments into deep learning algorithms. Predictive analytics
algorithms, paired with performance engineering methodologies, allow more efficient and faster throughput, offering insight
into how end users will use the software in the wild and helping you reduce the probability of defects reaching production. By
identifying issues and their causes early, you can make course corrections early in the development lifecycle and prevent an
impact on production. You can draw on predictive analytics to improve your application performance in the following ways.

Identify root causes. You can focus on other areas needing attention using machine learning techniques to identify root
causes for availability or performance problems. Predictive analytics can then analyze each cluster's various features, providing
insights into the changes we need to make to reach the ideal performance and avoid bottlenecks.

Monitor application health. Performing real-time application monitoring using machine-learning techniques allows
organizations to catch and respond to degradation promptly. Most applications rely on multiple services to get the complete
application's status; predictive analytics models will correlate and analyze the data when the application is healthy to identify
whether incoming data is an outlier.
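
One way to sketch this is to train an outlier detector on metrics collected while the application was known to be healthy, then score incoming samples against it. The example below uses scikit-learn's IsolationForest with invented latency, error-rate, and CPU features; it is an assumption about one possible setup, not a prescription:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic "healthy" baseline: latency (ms), error rate (%), CPU (%)
rng = np.random.default_rng(7)
healthy = np.column_stack([
    rng.normal(120, 15, 1000),
    rng.normal(0.5, 0.2, 1000),
    rng.normal(40, 8, 1000),
])

model = IsolationForest(contamination=0.01, random_state=7).fit(healthy)

# Score incoming samples: 1 means "looks like the healthy baseline",
# -1 flags an outlier worth investigating as possible degradation.
incoming = np.array([[125, 0.6, 42],     # near the baseline
                     [480, 7.5, 93]])    # far from anything seen while healthy
print(model.predict(incoming))           # e.g., [ 1 -1]
```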

Predict user load. We have relied on peak user traffic to size our infrastructure for the number of users accessing the
application in the future. This approach has limitations as it does not consider changes or other unknown factors. Predictive
analytics can help indicate the user load and better prepare to handle it, helping teams plan their infrastructure requirements
and capacity utilization.



Predict outages before it's too late. Predicting application downtime or outages before they happen helps to take preventive
action. The predictive analytics model will follow the previous outage breadcrumbs and continue monitoring for similar
circumstances to predict future failures.

Stop looking at thresholds and start analyzing data. Observability and monitoring generate large amounts of data that can
take up to several hundred megabytes a week. Even with modern analytic tools, you must know what you're looking for in
advance. This leads to teams not looking directly at the data but instead setting thresholds as triggers for action. Even mature
teams look for exceptions rather than diving into their data. To mitigate this, we integrate models with the available data
sources. The models will then sift through the data and calculate the thresholds over time. Using this technique, where models
are fed and aggregate historical data, provides thresholds based on seasonality rather than set by humans. Algorithm-set
thresholds trigger fewer alerts; however, these are far more actionable and valuable.
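
A minimal sketch of this idea, assuming hour-of-week seasonality: learn a per-bucket limit (mean plus a few standard deviations) from history and alert only when a reading exceeds the limit for its own bucket. The function names and the three-sigma choice are illustrative assumptions:

```python
from collections import defaultdict
from statistics import mean, stdev

def seasonal_thresholds(samples, sigmas=3.0):
    """samples: iterable of (timestamp: datetime, value: float).
    Returns {(weekday, hour): upper_limit} learned from historical data."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[(ts.weekday(), ts.hour)].append(value)
    return {
        key: mean(values) + sigmas * stdev(values)
        for key, values in buckets.items()
        if len(values) > 1          # stdev needs at least two points
    }

def is_alert(thresholds, ts, value):
    """Fire only when a reading exceeds the learned limit for its bucket."""
    limit = thresholds.get((ts.weekday(), ts.hour))
    return limit is not None and value > limit
```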

Analyze and correlate across datasets. Your data is mostly time series, making it easier to look at a single variable over time.
Many trends come from the interactions of multiple measures. For example, response time may drop only when various
transactions are made simultaneously with the same target. For a human, that's almost impossible, but properly trained
algorithms will spot these correlations.
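
As a small illustration with made-up numbers, a plain correlation across two series already hints at this kind of interaction; real models extend the idea to many measures at once:

```python
import numpy as np

# Hypothetical per-minute measurements pulled from two different datasets
response_time_ms = np.array([110, 115, 140, 300, 120, 118, 290, 112])
concurrent_writes = np.array([12, 14, 18, 95, 15, 13, 90, 11])

# Pearson correlation between the two series; a value near 1.0 suggests
# the latency spikes line up with bursts of simultaneous writes.
corr = np.corrcoef(response_time_ms, concurrent_writes)[0, 1]
print(f"correlation: {corr:.2f}")
```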

THE IMPORTANCE OF DATA IN PREDICTIVE ANALYTICS


"Big data" often refers to datasets that are, well, big, come in at a fast pace, and are highly variable in content. Their analysis
requires specialized methods so that we can extract patterns and information from them. Recently, improvements in storage,
processors, parallelization of processes, and algorithm design enabled the processing of large quantities of data in a reasonable
time, allowing wider use of these methods. And to get meaningful results, you must ensure the data is consistent.

For example, each project must use the same ranking system, so if one project uses 1 as critical and another uses 5 — like when
people use DEFCON 5 when they mean DEFCON 1 — the values must be normalized before processing. Predictive algorithms
are composed of the algorithm and the data it's fed, and software development generates immense amounts of data that,
until recently, sat idle, waiting to be deleted. However, predictive analytics algorithms can process those files, for patterns we
can't detect, to ask and answer questions based on that data, such as:
• Are we wasting time testing scenarios that aren't used?
• How do performance improvements correlate with user happiness?
• How long will it take to fix a specific defect?

These questions and their answers are what predictive analytics is used for — to better understand what is likely to happen.

THE ALGORITHMS
The other main component in predictive analysis is the algorithm; you'll want to select or implement it carefully. Starting
simple is vital as models tend to grow in complexity, becoming more sensitive to changes in the input data and distorting
predictions. They can solve two categories of problems: classification and regression (see Figure 2).
• Classification is used to forecast the result of a set by classifying it into categories starting by trying to infer labels from
the input data like "down" or "up."
• Regression is used to forecast the result of a set when the output variable is a set of real values. It will process input
data to predict, for example, the amount of memory used, the lines of code written by a developer, etc. The most used
prediction models are neural networks, decision trees, and linear and logistic regression.

Figure 2: Classification vs. regression



NEURAL NETWORKS
Neural networks learn by example and solve problems using historical and present data to forecast future values. Their
architecture allows them to identify intricate relations lurking in the data in a way that replicates how our brain detects
patterns. They contain many layers that accept data, compute predictions, and provide output as a single prediction.

DECISION TREES

A decision tree is an analytics method that presents the results in a series of if/then choices to forecast specific options' potential risks and benefits. It can solve all classification problems and answer complex issues.

Figure 3: Decision tree example

As shown in Figure 3, decision trees resemble an upside-down tree produced by algorithms identifying various ways of splitting data into branch-like segments that illustrate a future decision and help to identify the decision path.

One branch in the tree might be users who abandoned the cart if it took more than three seconds to load. Below that one, another branch might indicate whether they identify as female. A "yes" answer would raise the risk as analytics show that females are more prone to impulse buys, and the delay creates a pause for pondering.
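
A short sketch of what such a tree can look like in code, using scikit-learn and made-up data that loosely mirrors the cart-abandonment example above (feature names and values are hypothetical):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Features per visit: [page_load_seconds, identifies_as_female (0/1)]
# Label: 1 = abandoned the cart, 0 = completed checkout
X = [[1.2, 0], [2.8, 1], [3.5, 0], [4.1, 1], [0.9, 1], [5.0, 0], [3.2, 1], [2.1, 0]]
y = [0, 0, 1, 1, 0, 1, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the learned if/then branches, mirroring the structure in Figure 3.
print(export_text(tree, feature_names=["load_seconds", "female"]))
print(tree.predict([[3.4, 1]]))   # predicted class for one visitor (1 = likely to abandon)
```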

LINEAR AND LOGISTIC REGRESSION


Regression is one of the most popular statistical methods. It is crucial when estimating numerical values, such as how many resources per service we will need to add during Black Friday. Many regression algorithms are designed to estimate the relationship among variables, finding key patterns in big and mixed datasets and how they relate. It ranges from simple linear regression models, which calculate a straight-line function that fits the data, to logistic regression, which calculates a curve (Figure 4).

Table 1

OVERVIEW OF LINEAR AND LOGISTIC REGRESSION

Linear regression:
• Used to define a value on a continuous range, such as the risk of user traffic peaks in the following months.
• It's expressed as y = a + bx, where x is an input set used to determine the output y. Coefficients a and b are used to quantify the relation between x and y, where a is the intercept and b is the slope of the line.
• The goal is to fit a line nearest to most points, reducing the distance or error between y and the line.

Logistic regression:
• It's a statistical method where the parameters are predicted based on older sets. It best suits binary classification: datasets where y = 0 or 1, where 1 represents the default class. Its name derives from its transformation function being a logistic function.
• It's expressed by the logistic function, p(x) = 1 / (1 + e^(–(β0 + β1x))), where β0 is the intercept and β1 is the rate. It uses training data to calculate the coefficients, minimizing the error between the predicted and actual outcomes.
• It forms an S-shaped curve where a threshold is applied to transform the probability into a binary classification.
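
The two formulas above map directly onto off-the-shelf models. Below is a minimal scikit-learn sketch with invented capacity and latency data: a linear fit of y = a + bx to estimate resources for a traffic peak, and a logistic fit to estimate the probability that a given latency breaches an objective. Both datasets and the breach framing are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Linear regression (y = a + bx): estimate servers needed from expected load.
load = np.array([[100], [200], [400], [800], [1600]])    # requests/sec
instances = np.array([2, 3, 5, 9, 17])                    # servers required
lin = LinearRegression().fit(load, instances)
print(lin.intercept_, lin.coef_)       # a (intercept) and b (slope)
print(lin.predict([[1000]]))           # capacity estimate for a Black Friday-sized load

# Logistic regression (p(x) = 1 / (1 + e^-(b0 + b1*x))): probability of breaching a latency objective.
latency = np.array([[80], [120], [200], [350], [500], [900]])   # ms
breached = np.array([0, 0, 0, 1, 1, 1])
log = LogisticRegression().fit(latency, breached)
print(log.predict_proba([[400]])[0, 1])   # probability that a 400 ms sample breaches
```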



Figure 4: Linear regression vs. logistic regression

These are supervised learning methods, as the algorithm solves for a specific property. Unsupervised learning is used when you
don't have a particular outcome in mind but want to identify possible patterns or trends. In this case, the model will analyze as
many combinations of features as possible to find correlations from which humans can act.

Figure 5: Supervised vs. unsupervised learning

Shifting Left in Performance Engineering


Using the previous algorithms to gauge consumer sentiment on products and applications makes performance engineering
more consumer centric. After all the information is collected, it must be stored and analyzed through appropriate tools
and algorithms. This data can include error logs, test cases, test results, production incidents, application log files, project
documentation, event logs, tracing, and more. We can then apply these algorithms to the data to get various insights to:

• Analyze defects in environments


• Estimate the impact on customer experience
• Identify issue patterns
• Create more accurate test scenarios, and much more



This technique supports the shift-left approach in quality, allowing you to predict how long it will take to do performance
testing, how many defects you are likely to identify, and how many defects might make it to production, achieving better
coverage from performance tests and creating realistic user journeys. Issues such as usability, compatibility, performance, and
security are prevented and corrected without impacting users.

Here are some examples of information that will improve quality:

• Type of defect
• In what phase was the defect identified
• What the root cause of the defect is
• Whether the defect is reproducible

Once you understand this, you can make changes and create tests to prevent similar issues sooner.

Conclusion
Software engineers have made hundreds and thousands of assumptions since the dawn of programming. But digital users are
now more aware and have a lower tolerance for bugs and failures. Businesses are also competing to deliver a more engaging
and flawless user experience through tailored services and complex software that is becoming more difficult to test.

Today, everything needs to work seamlessly and support all popular browsers, mobile devices, and apps. A crash of even a few
minutes can cause a loss of thousands or millions of dollars. To prevent issues, teams must incorporate observability solutions
and user experience throughout the software lifecycle. Managing the quality and performance of complex systems requires
more than simply executing test cases and running load tests. Trends help you tell if a situation is under control, getting better,
or worsening — and how fast it improves or worsens. Machine learning techniques can help predict performance problems,
allowing teams to course correct. To quote Benjamin Franklin, "An ounce of prevention is worth a pound of cure."

Joana Carvalho, Performance Engineer at Postman


@radra on DZone | @jc-performance on LinkedIn | @radra on Twitter

Joana has been a performance engineer for the last 11 years. She analyzed root causes from user
interaction to bare metal, performance tuning, and new technology evaluation. Her goal is to create
solutions to empower the development teams to own performance investigation, visualization, and
reporting so that they can, in a self-sufficient manner, own the quality of their services. At Postman, she mainly implements
performance profiling, evaluation, analysis, and tuning.



PARTNER OPINION

Four Reasons Observability and APM Are Different
By Martin Mao, CEO & Co-Founder at Chronosphere

Observability is trending, and for good reason. Modern observability solutions — focused on outcomes — drive innovation,
exceptional experience, and ultimately, competitive edge. Because engineers now need to focus more on making the systems
they build easier to observe, traditional monitoring software vendors (plus the technology industry analysts and influencers
advising them) have rushed to offer their takes on observability. It's not hard to understand why creating confusion is in their
interests. Don't be misled.

Four Reasons Observability Is Different Than Monitoring for Cloud Native


1. DATA VOLUME
With application performance monitoring (APM) tools, data collection (particularly high-cardinality data collection and
analysis) is limited. Containers and VMs produce the same volume of telemetry data. Scaling from thousands of VMs to millions
of containers delivers an order of magnitude increase in observability data to collect and analyze. Observability solutions are
optimized to handle data at scale.

2. EPHEMERALITY
Tools from APM vendors weren't designed for dynamic environments. Yet containers are dynamic. There are so many of them
and containers may only live for a few minutes while VMs may exist for many months. Observability solutions maximize the
value of data in dynamic environments by providing flexibility and control of data for both short- and long-term use cases.

3. INTERDEPENDENCE
APM tools are good at handling potential, anticipated issues. Observability solutions are too, but they do so much more.
Relationships between apps and infrastructure are predictable for organizations running only monolithic apps and VMs.
Contrast those with relationships between microservices and containers in the cloud era that are much more fluid and
complex. With cloud environments, data cardinality is higher as well, making it much more challenging for teams to make
associations between applications, infrastructure, and business metrics. Observability connects the dots.

4. DATA FORMATS
Observability solutions ensure freedom of choice. APM tools lock users in because their proprietary agents only ingest and store
data in formats that the vendors decide, and managing those silos inhibits collaboration while increasing costs.
With observability solutions, organizations get the compatibility with open-source standards and the data ownership they want
and need. Teams can also share and access data across domains to collaborate better, which leads to faster detection and issue
resolution.

The Bottom Line: Observability Is the New Operational Paradigm


APM tools are traditional technology applied to the cloud: they alert organizations that there is a problem. Observability is
modern technology built for cloud native: it delivers detailed data in context for fast remediation. Even though APM vendors
are adding support for the three pillars of observability — logs, traces, and metrics — don't be fooled. True cloud-native
observability focuses on business outcomes: knowing, triaging, and understanding issues so that teams improve mean time to
remediate (MTTR) and mean time to detect (MTTD) while achieving KPIs such as a better customer experience.

Download the full ebook here.



CONTRIBUTOR INSIGHTS

A Primer on Distributed
Systems Observability
By Boris Zaikin, Software & Cloud Architect at Nordcloud GmbH

In the past few years, the complexity of systems architectures has drastically increased, especially in distributed, microservices-
based architectures. It is extremely hard and, in most cases, inefficient to debug by watching logs, particularly when we have
hundreds or even thousands of microservices or modules. In this article, I will describe what observability and monitoring
systems are, the patterns of a good observability platform, and what an observability subsystem may look like.

Observability vs. Monitoring


Before we jump directly to the point, let's describe what observability is, what components it includes, and how it differs from
monitoring. Observability allows us to have a clear overview of what happens in the system without knowing the details or
domain model. Moreover, observability lets us efficiently provide information about:

• The overall system, separate service failures, and outages


• The behavior of the general system and services
• The overall security and alerts

We now know what functions an observability system should cover. Below is the information that should be gathered to
properly design an observability and monitoring platform.

• Metrics – Data collection allows us to understand the application and infrastructure states — for example, latency and the
usage of CPU, memory, and storage.

• Distributed traces – Allows us to investigate the event or issue flow from one service to another.
• Logs – This is a message with a timestamp that contains information about application- or service-level errors, exceptions,
and information.

• Alerting – When an outage occurs or something goes wrong with one or several services, alerts notify these problems via
emails, SMS, chats, or calls to operators. This allows for quick action to fix the issue.

• Availability – Ensures that all services are up and running. The monitoring platform sends probe messages to a service
or component (e.g., to an HTTP API endpoint) to check whether it responds. If not, the observability system generates an
alert (see the bullet point on alerting above).

Also, some observability and monitoring platforms may include user experience monitoring, such as heat maps and user
action recording.

Observability and monitoring follow the same principles and patterns and rely primarily on toolsets, so in my opinion, the
differentiation between the two is made for marketing purposes. There is no clear definition of how observability differs from
monitoring; all definitions are different and high-level.

Observability Patterns
Complex, microservices-based systems come with established recommendations and patterns that allow us to build reliable
systems without reinventing the wheel. Observability systems also have some essential patterns; the following sections discuss
five of the most important ones.

LOG AGGREGATION PATTERN


In distributed systems, logging can be difficult. Each microservice can produce a lot of logs, and it can be a nightmare to find
and analyze errors or other log messages for each microservice. The log aggregation pattern helps here: it introduces a central
log aggregation service that acts as central log storage. The service also provides options to label, index, categorize, search, and
analyze all logs. Examples of log aggregation platforms include Grafana Loki, Splunk, Fluentd, and the ELK stack.

Figure 1: Log aggregation pattern
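
A prerequisite for useful log aggregation is that each service emits logs in a structured, labelable form. The sketch below is a minimal, hypothetical Python example: it renders every log record as one JSON object per line with a service label, which an aggregator such as Loki, Fluentd, or the ELK stack can then index and search. The service name and field names are illustrative assumptions, not a required schema of any particular platform.

import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON line so the log aggregator
    can label and index fields instead of parsing free-form text."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "service": "thermostat-service",  # hypothetical label used for routing and searching
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)  # stdout is typically scraped by the log agent
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("thermostat-service").info("target temperature updated")

Writing to stdout keeps the service unaware of the aggregation backend; the collection agent decides where the logs ultimately go.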

HEALTH CHECK PATTERN


Imagine you have multiple services or
microservices, and you need to know their
current state. Of course, you can go to the log
aggregation service and check the logs. But services
may not produce logs while they are still starting,
and logging may be unavailable when a service fails.

In all these instances, you need to implement the health check pattern. You just need to create a health (or ping) endpoint in
your service and point your monitoring or log aggregation system at it to collect the health status of each service. You can also
set up notifications or alerting for when a service is unavailable — this saves a lot of time in recognizing which service failed to
start or went down.

Figure 2: Health check pattern
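
As a concrete illustration, a health endpoint can be as small as the following sketch, which uses only the Python standard library. The path, port, and response body are assumptions for this example; in practice you would match whatever convention your monitoring system expects (Kubernetes liveness probes, for instance, commonly hit /healthz).

from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The monitoring system probes this endpoint; HTTP 200 means "up and running."
        if self.path == "/healthz":
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(b'{"status": "ok"}')
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()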

DISTRIBUTED TRACING PATTERN

Figure 3: Distributed tracing pattern


Imagine this scenario: you have multiple components, modules,
and libraries in one or several microservices. You need to check the
whole history of a component's execution, or you send a request to one
microservice and need to follow its execution from one service or
component to another.

To do this, you need to have some distributed system that will collect
and analyze all tracing data. Some open-source services allow you to
do so, such as Jaeger, OpenTelemetry, and OpenCensus. Check out
the Istio documentation for an example that demonstrates distributed
tracing in action.
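
To make the pattern concrete, the hypothetical Python sketch below uses the OpenTelemetry SDK to create a parent span and a child span for a downstream call. The console exporter keeps the example self-contained; in a real deployment you would swap in a Jaeger or OTLP exporter, and the service and span names here are purely illustrative.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# The console exporter keeps the sketch runnable on its own; replace it with a
# Jaeger or OTLP exporter to ship spans to a tracing backend.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer("billing-service")  # hypothetical service name

with tracer.start_as_current_span("process-invoice") as span:
    span.set_attribute("invoice.id", "INV-1001")  # illustrative attribute
    with tracer.start_as_current_span("call-tax-service"):
        pass  # each downstream call becomes a child span within the same trace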

APPLICATION METRICS PATTERN


Having distributed logging and tracing is essential; however, without
application metrics, your observability system will not be complete. You
may need to collect infra- and application-level metrics, such as:

• CPU
• Memory
• Disk usage
• Service request/response times
• Latency



Collecting these metrics will not only help you understand what infrastructure size you need but will also help you save money
on cloud providers. It also helps you to quickly mitigate outages caused by a lack of CPU or memory resources.

Below is an example of a service with a proxy agent. The proxy agent aggregates and sends telemetry data to the
observability platform.

Figure 4: Application metrics pattern
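
The sketch below shows one way an application could expose such metrics itself using the Python prometheus_client library, rather than relying solely on a proxy agent. The metric names, label, and port are assumptions for illustration; the pattern is simply a counter for request volume and a histogram for latency, scraped from a /metrics endpoint.

import random
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total requests handled", ["route"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency", ["route"])

def handle_request(route: str) -> None:
    REQUESTS.labels(route=route).inc()
    with LATENCY.labels(route=route).time():   # records the elapsed time as an observation
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work

if __name__ == "__main__":
    start_http_server(9100)  # Prometheus scrapes http://<host>:9100/metrics
    while True:
        handle_request("/temperature")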

SERVICE MESH FOR OBSERVABILITY

Figure 5: Service mesh as observability
A service mesh not only provides a central
management control plane for microservices
architecture but also provides a single
observability subsystem.

Instead of installing a separate tool for


gathering metrics, distributed traces, and
logs, we can just use one. For example, Azure
provides an integrated service mesh add-on
that can be set up in a minute.

There is also an option to use Istio service


mesh, which contains all features required for
a proper observability subsystem. Moreover,
it can gather metrics, logs, and traces for the
control plane.

For example, when we set up Grafana, Loki, or other tools, we also need to enable observability for them, as they may
themselves fail at runtime or during deployment; therefore, we need to be able to troubleshoot them as well.

Observability Architecture for Microservices


As an example of the observability architecture, I'm going to use a smart heating system. Smart heating is an essential part of
each home (or even smart home) that allows owners to:

• Manually manage heating in the apartment with an application.


• Automatically adjust heating depending on time and the temperature outside and inside.

In addition, the system can do the following actions to help the owner:

• Turn on/off the heating when people are about to arrive at the apartment.
• Notify, alert, or just ask if something requires human attention or if something is wrong.




Figure 6: Microservices architecture with an observability subsystem

In Figure 6, you can see an architecture that is based on the microservices pattern, since it serves best and represents all
system components. It contains a main subsystem and an observability subsystem. Each microservice is based on Azure
Functions and deployed to an Azure Kubernetes Service cluster. We deploy functions to Kubernetes using the KEDA framework.
KEDA is an open-source, Kubernetes-based event-driven autoscaler that allows us to automatically deploy and scale our
microservice functions. KEDA also provides the tools to wrap functions into Docker containers. We can deploy microservice
functions directly, without KEDA and Kubernetes, if we don't have a massive load and don't need the scaling options. The
architecture contains the following components, which represent the main subsystem:

• Azure Functions operating as microservices
• Azure Service Bus (or Azure IoT Hub) as a central messaging bus that microservices use to communicate
• Azure API Apps providing an API for mobile/desktop applications

The essential part here is an observability subsystem. A variety of components and tools represent it. I've described all
components in Table 1 below:

Table 1

COMPONENTS OF THE OBSERVABILITY SYSTEM

Prometheus – An open-source framework to collect and store metrics and other telemetry as time-series data; it also provides
alerting logic. A Prometheus proxy or sidecar integrates with each microservice to collect its telemetry.

Grafana Loki – An open-source distributed log aggregation service based on a labeling algorithm: rather than indexing the
contents of logs, it assigns labels to each log domain, subsystem, or category.

Jaeger – An open-source framework for distributed tracing in microservices-based systems that also provides search and
data visualization options. Some of the high-level use cases of Jaeger include:
1. Performance and latency optimization
2. Distributed transaction monitoring
3. Service dependency analysis
4. Distributed context propagation
5. Root cause analysis

Grafana (Azure Managed Grafana) – An open-source data visualization and analytics system that allows the collection of traces,
logs, and other telemetry data from different sources. We use Grafana as the primary UI "control plane" to build and visualize
data dashboards fed by the Prometheus, Grafana Loki, and Jaeger sources.



Let's summarize our observability architecture: metrics collection and alerting are covered by Prometheus, log aggregation by
Grafana Loki, and distributed tracing by Jaeger. All of these components report to Grafana, which provides UI data dashboards,
analytics, and alerts.

We can also use the OpenTelemetry (OTel) framework. OTel is an open-source framework created, developed, and supported
under the Cloud Native Computing Foundation (CNCF). The idea is to provide a standardized, vendor-neutral observability
language specification, API, and tooling intended to collect, transform, and export telemetry data. Our architecture is based
on the Azure cloud, and we can enable OpenTelemetry for our infrastructure and application components. Below you can see
how our architecture can change with OpenTelemetry.

Figure 7: Smart heating with an observability subsystem and OpenTelemetry

It is also worth mentioning that we do not necessarily need to add OTel, as it may add additional complexity to the system. In
the figure above, you can see that we need to forward all logs from Prometheus to OTel. Also, we can use Jaeger as a backend
service for OTel. Grafana Loki and Grafana will get data from OTel.
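
As a rough sketch of what enabling OTel looks like from an application's point of view, the Python snippet below configures a tracer provider that exports spans over OTLP to a collector, which would then fan the data out to Jaeger, Grafana Loki, and Grafana as described above. The service name and collector endpoint are placeholders for this architecture, not fixed values.

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Placeholder service name and collector address; adjust both to your deployment.
provider = TracerProvider(resource=Resource.create({"service.name": "heating-scheduler"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

# Any span created after this point is batched and shipped to the collector.
with trace.get_tracer(__name__).start_as_current_span("adjust-temperature"):
    pass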

Conclusion
In this article, we demystified the terms observability and monitoring, and we walked through an example of a microservices
architecture with an observability subsystem that can be used not only with Azure but also with other cloud providers. We also
defined the main difference between monitoring and observability, and we walked through essential monitoring and
observability patterns and toolsets. Developers and architects should understand that an observability/monitoring platform is a
tooling or technical solution that allows teams to actively debug their systems.

Boris Zaikin, Software & Cloud Architect at Nordcloud GmbH


@borisza on DZone | @boris-zaikin on LinkedIn | boriszaikin.com

I'm a certified senior software and cloud architect who has solid experience designing and developing
complex solutions based on the Azure, Google, and AWS clouds. I have expertise in building distributed
systems and frameworks based on Kubernetes and Azure Service Fabric. My areas of interest include
enterprise cloud solutions, edge computing, high load applications, multitenant distributed systems, and IoT solutions.



PARTNER CASE STUDY

Case Study: Nationale-Nederlanden Bank
Driving Business Performance With Observability in Financial Services

COMPANY: Nationale-Nederlanden Bank
COMPANY SIZE: 10,000+ employees
INDUSTRY: Banking/Financial Services
PRODUCTS USED: StackState, Splunk, SolarWinds, Prometheus, AWS
PRIMARY OUTCOME: By implementing observability, NN Bank sped up the identification of root cause, decreased MTTR, and —
most importantly — increased customer satisfaction.

Challenge
Banks rely on high availability to provide customers 24x7 access to their funds. With an uptime of 97.57 percent and a 4-6 hour
mean time to repair (MTTR) per incident, Nationale-Nederlanden Bank (NN Bank) was off its SLO — negatively impacting
customer experience and customer NPS.

The problem was it took too long to identify the root cause of outages. NN Bank had 20+ IT teams, using several monitoring
solutions. Data from these systems was forwarded to a central data lake which did not correlate the data, nor show how all the
systems and their components were interrelated. When an issue occurred, root cause analysis (RCA) was challenging.

Given NN Bank's dynamic, hybrid environment, it was clear that the IT teams needed a way to quickly integrate massive
amounts of data from siloed systems, correlate it, and get a unified view of the overall IT environment.

"If you have ever had to do a root cause analysis, you know it’s not so easy. Management often tells you it should be and asks,
‘Why are you taking such a long time?’ But it’s really difficult."
— Scrum Master, Platform Services, NN Bank

Solution
StackState auto-discovered the bank's IT environment, generating a visual topology and mapping dependencies between
components. StackState also tracks changes over time, in real time. The benefits of this functionality were substantive. For NN
Bank, auto-discovery of the IT environment was key. Teams were able to see the full IT stack, which enabled them to determine
RCA more quickly.

By implementing the StackState platform, NN Bank was able to pinpoint where events were occurring and instantly visualize
upstream and downstream dependencies. As a result, the bank achieved targeted insights to help focus remediation efforts.
The bank's IT teams eliminated time-consuming discussions around what was happening, where, what caused it, and the
systems impacted. Instead, everyone had a single view into where a problem was occurring and how it impacted connected
systems.

Results
The team's efforts to implement observability made a significant difference in NN Bank's business results. StackState helped
the bank increase uptime and reduce MTTR, resulting in customer NPS jumping almost 100 percent. With automation of the
root cause analysis process and implementation of predictive monitoring, NN Bank experienced:

• An increase in availability from 97.5 to 99.8 percent
• MTTR that went from 4-6 hours to less than one hour
• A dramatic boost in customer satisfaction, as measured by NPS scores

CREATED IN PARTNERSHIP WITH StackState

© 2022 StackState



CONTRIBUTOR INSIGHTS

A Deep Dive Into


Distributed Tracing
By Yitaek Hwang, Software Engineer at NYDIG

Distributed tracing, as the name suggests, is a method of tracking requests as they flow through distributed applications. Along
with logs and metrics, distributed tracing makes up the three pillars of observability. While all three signals are important for
determining the health of the overall system, distributed tracing has seen significant growth and adoption in recent years.

That's because traces are a powerful diagnostic tool for painting a picture of how requests propagate between services and
uncovering issues along service boundaries. As the number of microservices grows, the complexity of observing the entire
lifespan of requests inevitably increases as well. Logs and metrics can certainly help with debugging issues stemming from a
single service, but distributed tracing ties together contextual information from all the services and surfaces the underlying issue.

Instrumenting for observability is an ongoing challenge for any enterprise as the software landscape continues to evolve.
Fortunately, distributed tracing provides the visibility companies need to operate in a growing microservice ecosystem. In
this article, we'll dive deep into the components of distributed traces, reasons to use distributed tracing, considerations for
implementing it, as well as an overview of the popular tools in the market today.

Components of Distributed Tracing


Distributed tracing breaks down into the following components:

• Spans – smallest unit of work captured in observing a request (e.g., API call, database query)
• Traces – a collection of one or more spans
• Tags – metadata associated with a span (e.g., userId, resourceName)

To illustrate, let's walk through a distributed tracing scenario for a system with a front end, simple web server, and a database.
Tracing begins when a request is initiated (e.g., clicking a button, submitting a form, etc.). This creates a new trace with a
unique ID and the top-level span. As the request propagates to a new service, a child span is created. In our example, this would
happen as the request hits the web server and when a query to the database is made. At each step, various metadata is also
logged and tied to the span as well as the top-level trace.

Once all the work is complete for the corresponding request, all of the spans are aggregated with associated tags to assemble
the trace. This provides a view of the system, following the lifecycle of a request. This aggregated data is usually presented as a
flame graph with nested spans over time.

Figure 1: Flame graph of traces



Visualizing traces this way helps to reveal performance bottlenecks (i.e., longest span in the trace) as well as map out each
interaction with the microservices.
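
The sketch below mimics that front end, web server, and database flow in a single Python process, using OpenTelemetry context propagation to show how the child span stays attached to the same trace ID. The function names stand in for separate services, and the headers dictionary stands in for the HTTP headers that would normally carry the traceparent value between them.

from opentelemetry import trace
from opentelemetry.propagate import inject, extract
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer("demo")

def front_end() -> None:
    # Top-level span: created when the user clicks a button or submits a form.
    with tracer.start_as_current_span("checkout"):
        headers = {}
        inject(headers)       # writes the W3C traceparent header for the outgoing call
        web_server(headers)   # stands in for an HTTP request to the web server

def web_server(headers: dict) -> None:
    # Child span: joins the same trace because the propagated context is extracted.
    ctx = extract(headers)
    with tracer.start_as_current_span("query-database", context=ctx) as span:
        span.set_attribute("db.statement", "SELECT ...")  # illustrative tag/metadata

front_end()

Printed to the console, both spans share one trace ID, which is exactly what lets a backend assemble them into the flame graph described above.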

Why Use Distributed Tracing


For legacy applications largely running in a monolithic manner, logs and metrics were often sufficient for observability.
Detailed logging provides a point-in-time snapshot of that service, leaving a record of all the code execution. Metrics gather
statistical information about the system and expose the general health of that service. For monolithic applications, combining
the two provided the necessary visibility.

However, in a microservices world, problems can occur not just inside a single application (which logs and metrics can reveal),
but also at the boundaries of those services. To respond to an incident or to debug a performance degradation, it's important to
understand how requests flow from one service to another.

With that in mind, the benefits of distributed tracing include:

• Visualizing service relationships – By inspecting the spans within a trace from the flame graph, developers can map out
all the service calls and their request flow. This helps to paint a global picture of the system, providing contextual data to
identify bottlenecks or ramifications from design changes.

• Pinpointing issues faster – When the engineer on-call is paged from an incident, traces can quickly surface the issue
and lead to reduced mean time to detect (MTTD) and repair (MTTR). This is a big win for the developer experience while
maintaining SLA commitments.

• Isolating specific requests – Since traces document the entire lifecycle of a request, this information can be used to
isolate specific actions such as user behavior or business logic to investigate.

Despite these benefits, adoption numbers for distributed tracing pale in comparison to logging and metrics, as distributed
tracing comes with its fair share of challenges. First off, distributed tracing is only useful for the components that it touches.
Some tracing tools or frameworks don't support automatic instrumentation for certain languages or components (especially on
the front end). This results in missing data and added work to piece together the details. Also, depending on the application,
tracing can generate a significant amount of data. Dealing with that scale and surfacing the important signals can be a challenge.

Considerations for Implementation


To maximize the benefits from distributed tracing, several factors must be considered:

• Automatic instrumentation – Most modern tracing tools support automatic injection of tracing capabilities without
significant modifications to the underlying codebase. Some languages or frameworks may not be fully supported, but
where possible, opt for automated tooling instead of spending valuable developer time on manual instrumentation.

• Scalable data capture – To deal with massive amounts of tracing data, some tools opt to downsample, which may result
in missing or unrepresentative data. Choose tools that can handle the volume and intelligently surface important signals
(see the sampling sketch after this list).

• Integrations – Traces are one part of the observability stack. Traces will be more useful if they can be easily tied to existing
logs or metrics for a comprehensive overview. The goal should be to leverage the power of tracing, alongside other
signals, to get to actionable insights and proactive solutions rather than collect data for retroactive analysis only.
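
As a hedge against trace volume, most SDKs support head-based sampling out of the box. The snippet below is a minimal illustration using the OpenTelemetry Python SDK: it keeps roughly 10 percent of traces, and the parent-based wrapper makes child spans follow their parent's decision so that sampled traces remain complete. The ratio is an arbitrary example, and smarter tail-based sampling would typically live in a collector rather than in the application.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep ~10% of traces; children inherit the parent's sampling decision, so a
# trace is either captured end to end or not at all.
sampler = ParentBased(TraceIdRatioBased(0.1))
trace.set_tracer_provider(TracerProvider(sampler=sampler))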

Popular Tools
The original infrastructure for supporting internet-scale distributed tracing can be attributed to Dapper, Google's internal tool
announced in 2010. Since then, there's been a proliferation of both open-source and enterprise-grade SaaS tools in the market.

OPEN-SOURCE TOOLS
The open-source ecosystem for distributed tracing is fairly mature with a lot of the projects backed by large tech companies.
Each tool listed below supports most programming languages and flexible deployment options:

• Zipkin – one of the oldest and most popular tools, open-sourced by Twitter


• Jaeger – a Cloud Native Computing Foundation (CNCF) project donated by Uber that builds on ideas from Dapper and
Zipkin

• OpenTelemetry – an industry-leading observability framework developed by the CNCF that aims to standardize how to
instrument and export telemetry data, including traces



COMMERCIAL TOOLS
If enterprise-grade support is required, commercial tools from hyperscalers and observability platforms are also readily available.
The benefit of choosing a commercial tool would be easier integrations with existing tooling and infrastructure. For example,
hyperscalers such as AWS and Google provide their own flavor of tracing solutions such as AWS X-Ray and Cloud Trace.

Conclusion
Distributed tracing, when implemented properly with logs and metrics, can provide tremendous value in surfacing how
requests move in a complex, microservices-based system. Traces uncover performance bottlenecks and errors as requests
bounce from one service to another, mapping out a global view of the application. As the number of services grows alongside
the complexity that follows with it, a good distributed tracing system will become a necessity for any organization looking to
upgrade their observability platform.

While implementing tracing requires some planning, with a growing number of robust open-source and commercial tools
available, organizations can now easily adopt tracing without a significant engineering overhaul. Invest in a good distributed
tracing infrastructure to reduce MTTD/MTTR and improve the developer experience at your organization.

Yitaek Hwang, Software Engineer at NYDIG


@yitaek on DZone | @yitaekhwang on LinkedIn | yitaekhwang.com

Yitaek Hwang is a software engineer at NYDIG working with blockchain technology. He often writes about
cloud, DevOps/SRE, and crypto topics.



A Simple 1-Step Solution for OpenTelemetry

TelemetryHub is a full-stack observability tool that provides reliable transparency into your distributed systems without a
complex deployment process.

Simple – Unify all traces, metrics, and logs into a single pane of glass to easily derive insights from complex telemetry data.

Efficient – Manage your data costs and get a streamlined view of your systems by only ingesting the signal data you need.

Affordable – Take advantage of our free trial period and unlimited seats. A tool for your entire team without paying extra.

Get a Free Trial


CONTRIBUTOR INSIGHTS

Building an Open-Source
Observability Toolchain
By Sudip Sengupta, Principal Architect & Technical Writer at Javelynn

Open-source software (OSS) has had a profound impact on modern application delivery. It has transformed how we think
about collaboration, lowered the cost to maintain IT stacks, and spurred the creation of some of the most popular software
applications and platforms used today.

The observability landscape is no different. In fact, one could argue that open-source observability tools have been even more
transformative within the world of monitoring and debugging distributed systems. By making powerful tools available to
everyone — and allowing anyone to contribute to their core construct — open-source observability tools allow organizations
of all sizes to benefit from powerful capabilities for detecting error-prone patterns and offering insights into a system's
internal state.

In this article, we will discuss the benefits of building an open-source toolchain for the observability of distributed systems,
strategies to build an open-source observability framework, best practices in administering comprehensive observability, and
popular open-source tools.

Embracing an Open-Source Observability Framework


In spite of their individual benefits, observability tools have limited scope and are mostly focused on monitoring only one of the
key pillars of observability. Adopting multiple tools also discourages the concept of a single source of truth for comprehensive
observability. As an alternative to using individual tools for observing indicators, an observability platform helps with contextual
analysis by enriching the data collected by monitoring, logging, and tracing tools. A single observability platform also spans the
full scope of an organization's distributed systems to give you a comprehensive view of the cluster state.

Figure 1: An observability framework in distributed systems



Observability frameworks are typically categorized into:

• Centralized frameworks – These are typically designed for large enterprises that consume a lot of resources and need
to monitor numerous distributed systems at once. As such frameworks are supported by a lot of hardware and software,
they are expensive to set up and maintain.

• Decentralized frameworks – These frameworks are preferred for use cases that do not immediately require as much
equipment or training and that require a lower up-front investment in software licenses. As decentralized
frameworks aid collaboration and allow enterprises to customize source code to meet specific needs, they are
a popular choice when building an entire tech stack from scratch.

BENEFITS OF USING OPEN-SOURCE OBSERVABILITY TOOLS


Observability in itself is built around the open-source concept that relies on the decentralized access of key indicators for
collaborative action and performance enhancement. Building an open-source framework for observability relies on fewer
dependencies than centralized, proprietary software solutions.

Many organizations use OSS because it’s free and easy to use, but there’s more to it than that. Open-source tools also offer
several advantages over proprietary solutions to monitor how your applications are performing. Beyond monitoring application
health, open-source observability tools enable developers to retrofit the system for ease of use, availability, and security. Using
OSS tools for observability offers numerous other benefits, including:

• Easy extensibility for seamless integration with most modern stacks
• Vendor agnosticism, which helps observe multi- and hybrid-cloud setups
• Easy customization to support various use cases
• Enhanced visibility and alerting by factoring in custom anomalies
• Accelerated development workflows through pre-built plugins and code modules
• Low operating investment by saving on license costs
• Community contributions for enhancements and support

STRATEGIES TO BUILD AN OPEN-SOURCE OBSERVABILITY FRAMEWORK


In order to get the most out of an open-source observability framework, it is important to embrace the principles of openness
and collaboration. For comprehensive observability, it is also important to factor in crucial considerations when building an
observability framework with open-source tools.

Some recommended strategies for building an open-source observability framework include proactive anomaly detection,
time-based event correlation, shifting left on security, and adopting the right tools.

PROACTIVE ANOMALY DETECTION


Figure 2: Key pillars of observability
An optimally designed observability framework helps
predict the onset of potential anomalies without being
caught off-guard. It is important to be able to identify
the root cause and fix the problem before it impacts the
cluster performance or availability.

A distributed system's observability strategy should be


built upon the four golden signals: latency, saturation,
errors, and traffic. These signals are key representations
of the core aspects of a cluster state’s services that
collectively offer a contextual summary of its functioning
and performance issues.

Although the high-level information produced by these


signals might not be granular on its own, when combined
with other data of the key pillars, such as event logs,
metrics, or traces, it's easier to pinpoint the source of a
problem (see Figure 2).



TIME-BASED EVENT CORRELATION
Event logs offer rich insights to identify anomalies within distributed systems. Use open-source tools that help capture
occurrences, such as when an application process was executed successfully or a major system failure occurred. Contextual
analysis of such occurrences helps developers quickly identify faulty components or interactions between endpoints that
need attention.

Logs should also combine timestamps and sequential records of all cluster events. This is important because time-series data
helps correlate events by pinpointing when something occurred, as well as the specific events preceding the incident.

SHIFT LEFT FOR SECURITY


Open-source tools are often considered vulnerable to common attack patterns. As a recommended strategy, open-source
tools should be vetted for inherent flaws and potential configuration conflicts they may introduce to an existing stack. The
tools should also support the building of an observability framework that complements a shift-left approach for security, which
eliminates the need for reactive debugging of security flaws in production environments.

Beyond identifying the root cause of issues, the toolchain should enrich endpoint-level event data through continuous
collection and aggregation of performance metrics. This data offers actionable insights to make distributed systems self-
healing, thereby eliminating manual overheads to detect and mitigate security and performance flaws.

ADOPTING THE RIGHT TOOLS


Observing distributed systems extensively relies on a log and metrics store, a query engine, and a visualization platform. There
are different observability platforms that focus on measuring these indicators individually. Though they work independently,
several of them work together extremely well, creating comprehensive observability setups tailored to an organization’s
business objectives.

Along with considerations for the observability components, consider what it means to observe a system by factoring in
scalability. For instance, observing a multi-cloud, geographically distributed setup would require niche platforms when
compared to monitoring a monolithic, single-cluster workload.

Administering Observability for Performance and Site Reliability


By providing in-depth insights of software processes and resources, observability allows site reliability engineers (SREs) to
assure optimum performance and health of an application. However, the challenges of observing the state and behavior of
a distributed system are often more complex than assumed. While it is important to inspect the key indicators, it is equally
important to adopt the right practices and efficient tools that support observability to collectively identify what is happening
within a system, including its state, behavior, and interactions with other components.

Some recommended best practices to enable effective observability in distributed systems include:

• Enforce the use of service-level agreements (SLAs) in defining performance indicators


• Use deployment markers for distributed tracing
• Set up alerts only for critical events
• Centralize and aggregate observability data for context analysis
• Implement dynamic sampling for optimum resource usage and efficient pattern sampling

POPULAR OPEN-SOURCE TOOLS FOR OBSERVABILITY OF DISTRIBUTED SYSTEMS


There are several open-source observability tools that can provide insight into system performance, identify and diagnose
problems, and help you plan capacity upgrades. While each tool comes with its own strengths and weaknesses, there are a
few that stand out above the rest. It is also a common approach to use them together to solve different complexities. The table
below outlines some popular open-source observability tools, their core features, benefits, and drawbacks to help understand
how they differ from each other.




Table 1

LogStash
Best known for: Log collection and aggregation
Benefits:
• Native ELK stack integration for comprehensive observability
• Offers filter plugins for the correlation, measurement, and simulation of events in real time
• Supports multiple input types
• Persistent queues continue to collect data when nodes fail
Drawbacks: Lacks content routing capabilities

Fluentd
Best known for: Collection, processing, and exporting of logs
Benefits:
• Supports both active-active and active-passive node configurations for availability and scalability
• Inbuilt tagging and dynamic routing capabilities
• Offers numerous plugins to support data ingestion from multiple sources
• Supports seamless export of processed logs to different third-party solutions
• Easy to install and use
Drawbacks: Adds an intermediate layer between log sources and destinations, eventually slowing down the observability pipeline

Prometheus with Grafana
Best known for: Monitoring and alerting
Benefits:
• PromQL query language offers flexibility and scalability in fetching and analyzing metric data
• Combines high-end metric collection and feature-rich visualizations
• Deep integration with cloud-native projects enables holistic observability of distributed DevOps workflows
Drawbacks: Lacks long-term metric data storage for historical and contextual analysis

OpenTelemetry
Best known for: Observability instrumentation
Benefits:
• Requires no performance overhead for generation and management of observability data
• Enables developers to switch to new back-end analysis tools by using relevant exporters
• Requires minimal changes to the codebase
• Efficient utilization of agents and libraries for auto-instrumentation of programming languages and frameworks
Drawbacks: Does not provide a visualization layer

Summary
Observability is a multi-faceted undertaking that involves distributed, cross-functional teams to own different responsibilities
before they can trust the information presented through key indicators. Despite the challenges, observability is essential for
understanding the behavior of distributed systems. With the right open-source tools and practices, organizations can build an
open-source observability framework that ensures systems are fault tolerant, secure, and compliant. Open-source tools help
design a comprehensive platform that is flexible and customizable to an organization’s business objectives while benefiting
from the collective knowledge of the community.

Sudip Sengupta, Principal Architect & Technical Writer at Javelynn


@ssengupta3 on DZone | @ssengupta3 on LinkedIn | www.javelynn.com

Sudip Sengupta is a TOGAF Certified Solutions Architect with more than 17 years of experience working for
global majors such as CSC, Hewlett Packard Enterprise, and DXC Technology. Sudip now works as a full-
time tech writer, focusing on Cloud, DevOps, SaaS, and cybersecurity. When not writing or reading, he’s
likely on the squash court or playing chess.



CONTRIBUTOR INSIGHTS

Creating an SRE Practice:


Why and How
By Greg Leffler, Observability Practitioner & Director at Splunk

Site reliability engineering (SRE) is the state of the art for ensuring services are reliable and perform well. SRE practices power
some of the most successful websites in the world. In this article, I'll discuss who site reliability engineers (SREs) are, what they
do, key philosophies shared by successful SRE teams, and how to start migrating your operations teams to the SRE model.

Who Are SREs?


SREs operate some of the busiest and most complex systems in the world. There are many definitions of an SRE, but a good
working definition is a "superhuman" hybrid engineer: someone who is both a skilled software engineer and a skilled operations
engineer. Each of these roles alone is difficult to hire, train, and retain — and finding people who are good enough at both to excel
as SREs is even harder. In addition to engineering responsibilities, SREs also require a high level of trust, a keen eye for software
quality, the ability to handle pressure, and a little bit of thrill-seeking (in order to handle being on call, of course).

THE VARIANCE IN SRES


There are many different job descriptions used when hiring SREs. The prototypical example of SRE hiring is Google,
which has SRE roles in two different job families: operations-focused SREs and "software engineer" SREs. The interview process
and career mobility for these two roles are very different despite both roles having the SRE title and similar responsibilities on
the job.

In reality, most people are not equally skilled at operations work and software engineering work. Acknowledging that different
people have different interests within the job family is likely the best way to build a happy team. Offering a mix of roles and job
descriptions is a good idea to attract a diverse mix of SRE talent to your team.

What Do SREs Do?


As seen in Figure 1, the SRE's work consists of five tasks, often done cyclically, but also in parallel for several component services.

Figure 1: SRE responsibility cycle

Depending on the size and maturity of the company, the roles of SRE
vary, but at most companies they are responsible for these elements:
architecture, deployment, operations, firefighting, and fixing.

ARCHITECT SERVICES
SREs understand how services actually operate in production, so they
are responsible for helping design and architect scalable and reliable
services. These decisions are generally sorted into design-related and
capacity-related decisions.

DESIGN CONSIDERATIONS
This aspect focuses on reviewing the design of new services and involves
answering questions like:

• Is a new service written in a way that works with our other services?
• Is it scalable?
• Can it run in multiple environments at the same time?
• How does it store data/state, and how is that synchronized across other environments/regions?



• What are its dependencies, and what services depend on it?
• How will we monitor and observe what this service does and how it performs?

CAPACITY CONSIDERATIONS
In addition to the overall architecture, SREs are tasked with figuring out cost and capacity requirements. To determine these
requirements, questions like these are asked:

• Can this service handle our current volume of users?


• What about 10x more users? 100x more users?
• How much is this going to cost us per request handled?
• Is there a way that we can deploy this service more densely?
• What resource is bottlenecking this service once deployed?

OPERATE SERVICES
Once the service has been designed, it must be deployed to production, and changes must be reviewed to ensure that those
changes meet architecture goals and service-level objectives.

DEPLOY SOFTWARE
This part of the job is less important in larger organizations that have adopted a mature CI/CD practice, but many organizations
are not yet there. SREs in these organizations are often responsible for the actual process of getting binaries into production,
performing a canary deployment or A/B test, routing traffic appropriately, warming up caches, etc. At organizations without
CI/CD, SREs will generally also write scripts or other automation to assist in this deployment process.

REVIEW CODE
SREs are often involved in the code review process for performance-critical sections of production applications as well as for
writing code to help automate parts of their role to remove toil (more on toil below). This code must be reviewed by other
SREs before it is adopted across the team. Additionally, when troubleshooting an on-call issue, a good SRE can identify faulty
application code as part of the escalation flow or even fix it themselves.

FIREFIGHT
While not glamorous, firefighting is a signature part of the role of an SRE. SREs are an early escalation target when issues are
identified by an observability or monitoring system, and SREs are generally responsible for answering calls about service issues
24/7. Answering one of these calls is a combination of thrilling and terrifying: thrilling because your adrenaline starts to kick in
and you are "saving the day" — terrifying because every second that the problem isn't fixed, your customers are unhappy. SREs
answering on-call pages must identify the problem, find the problem in a very complicated system, and then fix the problem
either on their own or by engaging a software engineer.

Figure 2: The on-call workflow

For each on-call incident, SREs must identify that an issue exists using metrics, find the service causing the issue using traces,
then identify the cause of the issue using logs.

FIX, DEBRIEF, AND EVALUATE INCIDENTS


As the on-call incidents described above are stressful, SREs have a strong interest in making sure that incidents do not
repeat. This is done through post-incident reviews (sometimes called "postmortems"). During these sessions, all stakeholders
for the service meet and figure out what went wrong, why the service failed, and how to make sure that exact failure never
happens again.



Not listed above, but sometimes an SRE's responsibility is building and maintaining platforms and tooling for developers. These
include source code repositories, CI/CD systems, code review platforms, and other developer productivity systems. In smaller
organizations, it is more likely that SREs will build and maintain these systems, but as organizations grow, these tasks generally
grow in scale to where it makes sense to have a separate (e.g., "developer productivity") team handle them.

SRE Philosophies
One of the most common questions asked is how SREs differ from other operations roles. This is best illustrated through SRE
philosophies, the most prevalent of which are listed below. While any operations role will likely embrace at least some of these,
only SREs embrace them all.

• "Just say no" to toil


– Toil is the enemy of SREs and is described by Google as "tedious, repetitive tasks associated with running a production
environment." You eliminate toil by automating processes so that the manual work goes away.

– One philosophy around toil held by many SREs is to try to "automate yourself out of a job" (though there will always
be new services to work on, so you never quite get there).

• Cattle, not pets


– In line with reducing toil and increasing automation, an important philosophy for SREs is to treat servers,
environments, and other infrastructure as disposable. Small organizations tend to take the opposite approach —
treating each element of the application as something precious, even naming it. This doesn't scale in the long run.

– A good SRE will work to have the application's deployment fully automated so that infrastructure and code are
stored in the same repositories and deploy at the same time, meaning that if the entire existing infrastructure was
blown away, the application could be brought back up easily.

• Uptime above all


– Customer-facing downtime is not acceptable. The storied "five nines" of uptime (less than six minutes down per
year) should be a baseline expectation for SREs, not a maximum. Services must have redundancy, security, and other
defenses so that customer requests are always handled.

• Errors will happen


– Error budgeting and the use of service-level indicators are the secret sauce behind delivering exceptional customer-
facing uptime. By accepting some unreliability in services and architecting their dependent services to work around
it, customer experience can be maintained (the basic arithmetic is sketched just after this list).

• Incidents must be responded to


– An incident happening once is sometimes unavoidable. The same incident happening twice is beyond the pale. A
thorough, blameless post-incident review process is essential to the goal of steadily increasing reliability and
performance over time.
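
The arithmetic behind "five nines" and error budgets is simple enough to sketch in a few lines of Python. The window lengths and the consumed-downtime figure below are hypothetical; the point is only that an availability target translates directly into a concrete budget of allowable unreliability.

def allowed_downtime_minutes(slo: float, window_days: float) -> float:
    """Minutes of downtime permitted by an availability SLO over a window."""
    return (1 - slo) * window_days * 24 * 60

# "Five nines" over a year is roughly 5.3 minutes of downtime.
print(round(allowed_downtime_minutes(0.99999, 365), 1))

# A 99.9% SLO over a 30-day window allows ~43.2 minutes of unreliability.
budget = allowed_downtime_minutes(0.999, 30)
consumed = 12.0  # hypothetical minutes of downtime so far this window
print(f"Error budget remaining: {budget - consumed:.1f} minutes")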

How to Migrate an Ops Team to SRE


Moving from a traditional operations role to an SRE practice is challenging and often seems overwhelming. Small steps add up
to big impact. Adopting SRE philosophies, advancing the skill set of your team, and acknowledging that mistakes will occur are
three things that can be done to start this process.

ADOPT SRE PHILOSOPHIES


The most important first step is to adopt the SRE philosophies mentioned in the previous section. The one that will likely have
the fastest payoff is to strive to eliminate toil. CI/CD can do this very well, so it is a good starting point. If you don't have a robust
monitoring or observability system, that should also be a priority so that firefighting for your team is easier.

START SMALL: UPLEVEL EXPECTATIONS AND SKILLS


You can't boil the ocean. Everyone will not magically become SREs overnight. What you can do is provide resources to your
team (some are listed at the end of this article) and set clear expectations and a clear roadmap to how you will go from your
current state to your desired state.

A good way to start this process is to consider migrating your legacy monitoring to observability. For most organizations, this
involves instrumenting their applications to emit metrics, traces, and logs to a centralized system that can use AI to identify root



causes and pinpoint issues faster. The recommended approach to instrument applications is using OpenTelemetry, a CNCF-
supported open-source project that ensures you retain ownership of your data and that your team learns transferable skills.

ACKNOWLEDGE THERE WILL BE MISTAKES


Downtime will likely increase as you start to adopt these processes, and that must be OK. Use of SRE principles described in
this article will ultimately reduce downtime in the long run as more processes are automated and as people learn new skills. In
addition to mistakes, accepting some amount of unreliability from each of your services is also critical to a healthy SRE practice
in the long run. If the services are all built around this, and your observability is on-point, your application can remain running
and serving customers without the unrealistic demands that come with 100 percent uptime for everything.

Conclusion
SRE, traditionally, merges application developers with operations engineers to create a hybrid superhuman role that can do
anything. SREs are difficult to hire and retain, so it's important to embrace as much of the SRE philosophy as possible. By
starting small with one app or part of your infrastructure, you can ease the pain associated with changing how you develop
and deploy your application. The benefits gained by adopting these modern practices have real business value and will enable
you to be successful for years to come.

Resources:

• Site Reliability Engineering, Google

• "How to Run a Blameless Postmortem," Atlassian

• Implementing Service Level Objectives, Alex Hidalgo

Greg Leffler, Observability Practitioner & Director at Splunk


@gleffler on DZone | @gleffler on LinkedIn

Greg Leffler heads the Observability Practitioner team at Splunk and is on a mission to spread the
good word of observability to the world. Greg's career has taken him from the NOC to SRE, from SRE
to management, with side stops in security and editorial functions. Greg has experience as a systems
administrator at eBay Ads, and as an SRE and SRE Senior Manager at LinkedIn.



CONTRIBUTOR INSIGHTS

Learning From Failure With


Blameless Postmortem Culture
How to Conduct an Effective Incident Retrospective

By Alireza Chegini, DevOps Architect at Smartwyre

Site reliability engineering aims to keep servers and services running with zero downtime. However, outages and incidents are
inevitable, especially when dealing with a complex system that constantly gets new updates. Every company has a relatively
similar process to manage incidents, mitigate risks, and analyze root causes. This can be considered an opportunity to identify
issues and prevent them from happening, but not every company is successful at making it a constructive process.

In this article, I will discuss the advantages of the blameless postmortem process and how it can drive a culture change in a
company — a culture of learning and improvement rather than blame.

An SRE's Role in Postmortem


A postmortem is a process in which a site reliability engineer (SRE) records an incident in detail. This information includes
the incident description, the impact of the incident on the system, and the actions taken to mitigate the issue. SREs are the
engineers responsible for taking care of incidents, which is why they are the ones who compile most of the postmortem
information into a report that not only addresses the root cause but also suggests possible actions to prevent the same
incident from occurring again. For SREs, therefore, the postmortem process is an opportunity to enhance the system.

THE STANDARD POSTMORTEM MEETING STRUCTURE


A postmortem meeting is usually arranged days after a team handles an incident. Let's look at the typical format for this meeting:

• Keep to a small group. Only related people from various roles and responsibilities are invited to this meeting. The group
stays small to ensure that the meeting will be short and productive.

• Start with facts. One important thing about this meeting is that there is no time for guessing. Instead, facts are shared
with the team to help people understand the issue and perhaps identify the root cause.

• Listen to stories. After highlighting the facts, there might be some extra discussion from team members who were
either involved in the incident process or might have some knowledge about that particular issue.

• Find out the reasons. Most of the time, the root cause is found before this meeting, but in cases where the root cause is
still unknown, there will be a discussion to plan for further investigations, perhaps involving a third party to help. However,
the incident might occur again since the root cause is not found yet, so extra measures will be taken to prepare for
possible incidents.

• Create action points. Depending on the outcome of the discussion, the actions will vary. If the root cause is known,
actions will be taken to avoid this incident. Otherwise, further investigations will be planned and assigned to a team to
find the root cause.

Why You Should Have a Blameless Postmortem


Traditionally, the postmortem process was about who made a mistake, and if there was a meeting, the manager would
use it as an opportunity to warn individuals about the consequences of their mistakes. Such an attitude eliminates
opportunities to learn from mistakes, and the focus shifts from facts to who was behind the failure.

Sometimes a postmortem meeting turns into another retro in which team members start arguing with each other or discussing
issues that are not in the scope of the incident, resulting in people pointing at each other rather than discussing the root cause.
This damages team morale, and such unproductive meetings lead to more failures in the future.



IT practitioners have learned that failures are inevitable, but it is possible to learn from mistakes to improve how we work and
how we design systems. That’s why the focus has turned to design and processes instead of people. Today, most
companies are trying to move away from a conservative approach and create an environment where people can learn from
failures rather than assign blame.

That's why it is essential to have a blameless postmortem meeting to ensure people feel comfortable sharing their opinions
and to focus on improving the process. Now the question is, what does a blameless postmortem look like? Here is my recipe for
arranging a productive blameless postmortem process.

HOW TO CONDUCT A BLAMELESS POSTMORTEM PROCESS


Suppose an incident occurred in your company, and your team handled it. Let's look at the steps you need to take for the
postmortem process.

Figure 1: Blameless postmortem process

PREPARE BEFORE THE MEETING


Here you collect as much information as possible about the incident. Find the involved people and any third parties and add
their names to the report. You could also collect any notes from engineers who have supported this issue or made comments
on the subject in different channels.

SCHEDULE A MEETING WITH A SMALL GROUP


This means arranging a meeting, adding the involved people, and perhaps including stakeholders like the project manager,
delivery manager, or whoever should be informed or consulted for this particular issue. Make sure to keep the group small to
increase the meeting's productivity.

HIGHLIGHT WHAT WENT RIGHT


Now that you are in the meeting, the best thing to do is to start with a brief introduction to ensure everyone knows the
incident's story. Although this meeting is about failures, you need to highlight positive parts if there are any. Positives could be
good communication between team members, quick responses from engineers, etc.

FOCUS ON THE INCIDENT FACTS


To have a clear picture of what happened, you don’t want to guess or tell a story. Instead, focus on the precise information you
have. That’s why it is recommended to draw attention to facts, such as the order of events and how the incident was mitigated
at the end.

HEAR STORIES FROM RELATED PEOPLE


There might be other versions of the incident's story. Set aside time for people who have comments or opinions about it to
speak. It is essential to keep the discussion productive and focused on the incident.

DIG DEEPER INTO THE ACTUAL ROOT CAUSE


After discussing all ideas and considering the facts, you can discuss the possible root cause. In many cases, the root cause
might have been found before this meeting, but you can still discuss it here.

DEFINE SOLUTIONS
If the root cause is known, you can plan with the team to implement a solution to prevent this incident from happening again.
If it is not known, it would be best to spend more time on the investigation to find the root cause and take extra measures or
workarounds to prepare for possible similar incidents.

DOCUMENT THE MEETING


One good practice is to document the meeting and share it with the rest of the company so that everyone is aware and
other teams can perhaps learn from the experience.
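
To make this concrete, here is a minimal sketch, in Python, of what such a shared postmortem record might look like. The field names, structure, and Markdown rendering are illustrative assumptions rather than a prescribed template, but they capture the facts, positives, root cause, and action items discussed above.

from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class ActionItem:
    description: str
    owner: str   # a team or role responsible for follow-up, not a person to blame
    due: date


@dataclass
class Postmortem:
    """A minimal blameless postmortem record: facts, impact, and follow-ups."""
    title: str
    incident_date: date
    summary: str              # what happened, in one or two sentences
    impact: str               # user-facing or business impact
    timeline: List[str]       # ordered facts: detection, mitigation, resolution
    what_went_well: List[str]
    root_cause: str           # "still under investigation" is a valid value
    action_items: List[ActionItem] = field(default_factory=list)

    def to_markdown(self) -> str:
        """Render a shareable summary for the wider organization."""
        lines = [
            f"# Postmortem: {self.title} ({self.incident_date})",
            f"**Summary:** {self.summary}",
            f"**Impact:** {self.impact}",
            "## Timeline",
            *[f"- {event}" for event in self.timeline],
            "## What went well",
            *[f"- {item}" for item in self.what_went_well],
            f"## Root cause\n{self.root_cause}",
            "## Action items",
            *[f"- {a.description} (owner: {a.owner}, due: {a.due})" for a in self.action_items],
        ]
        return "\n".join(lines)

Keeping the record structured like this makes it easier to publish a consistent summary to the rest of the company and to track the action items to completion.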



Best Practices From Google
In modern companies today, the blameless postmortem is a culture that involves more activities than the traditional
postmortem process. SREs at Google have done a great job implementing this culture by ensuring that the postmortem
process is not just a single event. Let's review some best practices from Google that can complement your current
postmortem process:

• No postmortem is left unreviewed. Arranging regular review sessions helps to look into outstanding postmortems,
close the discussions, collect ideas, and define actions. As a result, all postmortems are taken seriously and processed.

• Introduce a postmortem culture. A collaborative approach with teams helps introduce the postmortem culture to an
organization more easily and quickly by providing various programs, including:

– Postmortem of the month: This event motivates teams to conduct a better postmortem process. Every month, the
best and most well-written postmortem is shared with the rest of the organization.

– Postmortem reading clubs: Regular sessions are held to review past postmortems. Engineers can see what
other teams faced in previous postmortems and learn from the lessons.

• Ask for feedback on postmortem effectiveness. From time to time, teams are surveyed to share their experiences
and feedback about the postmortem process. This helps evaluate the postmortem culture and increase its
effectiveness.

If you are interested in learning more about Google's postmortem culture, check out Chapter 15 of Google's book, Site
Reliability Engineering.

Conclusion
Site reliability engineers play an essential role in ensuring that systems are reliable, and maintaining that reliability is a
continuous job. While developers are thinking of new features, SREs are thinking of a better and smoother process for
releasing those features. Incidents are part of the software development lifecycle, but modern SRE teams define processes
that help turn those incidents into opportunities to improve their systems. SREs know the importance of blameless
postmortem meetings, where failures are accepted as part of development. That's why they focus on reliability.

The future of incident management will involve more automation and perhaps artificial intelligence, where a system can
fix most issues itself. For now, SREs are using blameless postmortems to improve uptime, productivity, and the quality of
team relationships.

Alireza Chegini, DevOps Architect at Smartwyre


@allirreza on DZone | @alirezachegini on LinkedIn | @worldofalireza on Twitter | Alireza Chegini on YouTube

Alireza is a software engineer with more than 22 years of experience in software development. He started
his career as a software developer, and in recent years, he transitioned into DevOps practices. Currently,
he is helping companies and organizations move away from traditional development workflows and
embrace a DevOps culture. Additionally, Alireza coaches organizations as an Azure Specialist in their migration journey to
the public cloud.



ADDITIONAL RESOURCES

Diving Deeper Into Performance and Site Reliability

MULTIMEDIA

ITOps, DevOps, AIOps - All Things Ops
Host Elias Voelker interviews senior IT executives and thought leaders in this podcast that covers, as its name suggests, all
things ops. From timely topics such as how ITOps and AIOps can recession-proof the cost of IT to discussions around uptime
in the context of site reliability engineering, this podcast will help your day-to-day IT infrastructure operation and management.

OpenObservability Talks
As its name suggests, this podcast serves to amplify the conversation on open-source technologies and advance observability
efforts for DevOps. Listen to industry leaders and contributors to projects like OpenTelemetry and Jaeger discuss their use
cases, best practices, and vision for the space.

TestGuild Performance Testing and Site Reliability Podcast
Since 2019, TestGuild has brought us 100 and counting episodes that cover a wide range of performance-related topics. Tune
in with Joe Colantonio to learn about chaos engineering, test automation, API load testing, monitoring, site reliability, and
(much) more.

k6
Grafana Labs' YouTube channel for k6 is the perfect rabbit hole to lose yourself in. Filled with educational videos covering
observability, performance testing, open-source tools, and more, you can build your knowledge base for application
performance and reliability.

DevOps Pulse 2022: Challenges to the Growing Advancement of Observability
DZone's webinar with Logz.io takes a deep dive into the data and key takeaways from the DevOps Pulse Report. The
discussion spans topics such as the increased observability tool sprawl that is driving up complexity, how distributed tracing
remains nascent, and how rising costs and data volumes are hindering observability strategies.

What Do IT and Engineering Leaders Need to Know About Observability
In this DZone webinar, sit down with Observe Founder and VP of engineering Jacob Leverich and Redmonk co-founder
James Governor as they discuss observability, how it increases organizations' troubleshooting competency, and the impact
of containerization and serverless platforms on observability.

REFCARDS

Getting Started With Prometheus
Prometheus has become the de facto standard for the monitoring and alerting of distributed systems and architecture. In
this Refcard, we explore the core components of the Prometheus architecture and key concepts, and then focus on getting
up and running with Prometheus, including configuration and both collecting and working with data.

Observability Maturity Model: Essentials for Greater IT Reliability
Modern systems and applications are increasingly more dynamic, distributed, and modular in nature. To support their
systems' availability and performance, ITOps and SRE teams need advanced monitoring capabilities. This Refcard reviews
the distinct levels of observability maturity, key functionality at each stage, and next steps organizations should take to
enhance their monitoring practices.

Continuous Delivery Patterns and Anti-Patterns
This Refcard explains detailed patterns and anti-patterns for core areas of continuous delivery, including the delivery and
deployment phases, rollbacks, pipeline observability and monitoring, documentation, as well as communication across
teams and within the organization.

TREND REPORTS

Application Performance Management
DZone's 2021 APM Trend Report dives deeper into the management of application performance in distributed systems,
including observability, intelligent monitoring, and rapid, automated remediation. It also provides an overview of how to
choose an APM tool provider, common practices for self-healing, and how to manage pain points that distributed
cloud-based architectures cause.

DevOps: CI/CD and Application Release Orchestration
In DZone's 2022 DevOps Trend Report, we provide insight into how CI/CD has revolutionized automated testing, offer advice
on why an SRE is important to CI/CD, explore the differences between managed and self-hosted CI/CD, and more. Our goal
is to offer guidance to our global audience of DevOps engineers, automation architects, and all those in between on how to
best adopt DevOps practices to help scale the productivity of their teams.



ADDITIONAL RESOURCES

Solutions Directory
This directory contains performance and site reliability tools to assist with management, monitoring,
observability, testing, and tracing. It provides pricing data and product category information gathered
from vendor websites and project pages. Solutions are selected for inclusion based on several impartial
criteria, including solution maturity, technical innovativeness, relevance, and data availability.

DZONE'S 2022 PERFORMANCE AND SITE RELIABILITY SOLUTIONS DIRECTORY

2022 PARTNERS

Company | Product | Purpose | Availability | Website
Chronosphere | Chronosphere Observability Platform | Cloud-native observability | By request | chronosphere.io/platform
Scout APM | TelemetryHub | Full-stack observability | Trial period | telemetryhub.com/products/telemetryhub
StackState | StackState Observability Platform | Observability | Trial period | stackstate.com/platform/overview

Company | Product | Purpose | Availability | Website
Amazon Web Services | AWS X-Ray | Distributed tracing | Free tier | aws.amazon.com/xray
Anodot | Anodot | Autonomous monitoring and anomaly detection | By request | anodot.com
Apica | Apica Platform | Digital performance monitoring and load testing | By request | apica.io
AppDynamics | AppDynamics | Observability | Trial period | appdynamics.com
Auvik | Auvik | Network monitoring and management | Trial period | auvik.com
BigPanda | BigPanda | AIOps | By request | bigpanda.io
BMC Software | BMC Helix Operations Management with AIOps | Observability and AIOps | Trial period | bmc.com/it-solutions/bmc-helix-operations-management
Broadcom | AppNeta | Network performance monitoring | By request | broadcom.com/products/software/aiops-observability/appneta
Broadcom | DX Application Performance Management | Application performance management, AIOps, and observability | By request | broadcom.com/products/software/aiops-observability/application-performance-management
Catchpoint | Catchpoint | Observability | Trial period | catchpoint.com
Circonus | Circonus Platform | System and telemetry monitoring | By request | circonus.com
CloudFlare | CDN | Content delivery network | Free tier | cloudflare.com/cdn
ContainIQ | ContainIQ | Kubernetes monitoring | Trial period | containiq.com
Coralogix | Coralogix | Full-stack observability | Trial period | coralogix.com
Datadog | Datadog | Infrastructure and application monitoring | Trial period | datadoghq.com
Dotcom-Monitor | Dotcom-Monitor Platform | Website monitoring and performance testing | Trial period | dotcom-monitor.com
Dynatrace | Dynatrace | Software intelligence monitoring | Trial period | dynatrace.com
eG Innovations | eG Enterprise | IT performance monitoring | Trial period | eginnovations.com
Elasticsearch | LogStash | Log management | Free | elastic.co/logstash
Elasticsearch | Elastic Cloud | Observability, search, and security | Trial period | elastic.co/cloud
F5 | F5 NGINX Plus | Load balancer, reverse proxy, web server, and API gateway | | nginx.com/products/nginx
F5 | F5 NGINX Ingress Controller | Kubernetes network traffic management | Trial period | nginx.com/products/nginx-ingress-controller
F5 | BIG-IP | Application performance management | | f5.com/products/big-ip-services
F5 | F5 Distributed Cloud Services | Networking and application management | Free tier | f5.com/cloud
Fluentd | Fluentd | Log management | Open source | fluentd.org
Fortra | VCM | Performance optimization | Trial period | fortra.com/products/it-performance-optimization-software
Fortra | Performance Navigator | Performance monitoring | Free tier | fortra.com/products/capacity-planning-and-performance-analysis-software
FusionReactor | FusionReactor APM | Application performance management | Trial period | fusion-reactor.com
Google Cloud | Cloud Trace | Distributed tracing | Trial period | cloud.google.com/trace
Google Cloud | Network Intelligence Center | Network observability, monitoring, and troubleshooting | Trial period | cloud.google.com/network-intelligence-center
Grafana Labs | Grafana Cloud | Observability | Free tier | grafana.com/products/cloud
Grafana Labs | Grafana Enterprise Stack | Observability stack | By request | grafana.com/products/enterprise
Grafana Labs | Grafana Loki | Log aggregation | Free tier | grafana.com/oss/loki
Honeycomb | Honeycomb | Full-stack observability | Free tier | honeycomb.io
IBM | IBM Instana Observability | Observability | Trial period | ibm.com/products/instana
IBM | IBM Turbonomic | Application resource management | Sandbox | ibm.com/products/turbonomic
IBM | IBM Cloud Pak for Watson AIOps | AIOps | By request | ibm.com/products/cloud-pak-for-watson-aiops
Idera | SQL Diagnostic Manager for SQL Server | SQL Server performance monitoring | Trial period | idera.com/products/sql-diagnostic-manager
Idera | SQL Diagnostic Manager for MySQL | MySQL and MariaDB performance monitoring | Trial period | idera.com/productssolutions/sql-diagnostic-manager-for-mysql
INETCO | INETCO Insight | Performance monitoring | By request | inetco.com/products-and-services/inetco-insight-for-payment-transaction-monitoring
ITRS Group | ITRS Geneos | On-prem and cloud monitoring | By request | itrsgroup.com/products/geneos
ITRS Group | ITRS Opsview Infrastructure Monitoring | Infrastructure monitoring | By request | itrsgroup.com/products/infrastructure-monitoring
ITRS Group | ITRS Trade Analytics | Trade infrastructure monitoring | By request | itrsgroup.com/products/trade-analytics
Jaeger | Jaeger | Distributed tracing | Open source | jaegertracing.io
JenniferSoft | JENNIFER | Application performance management | Trial period | jennifersoft.com
Lightrun | Lightrun | Observability and debugging | Free tier | lightrun.com
Lightstep | Lightstep | Observability | Free tier | lightstep.com
LiveAction | LiveNX | Network monitoring and management | Trial period | liveaction.com/products/livenx-network-monitoring-software
LogicMonitor | LogicMonitor | Observability | Trial period | logicmonitor.com
Logz.io | Logz.io | Observability | Trial period | logz.io
Lumigo | Lumigo Platform | Observability and debugging | Free tier | lumigo.io
Micro Focus | Operations Bridge | AIOps | By request | microfocus.com/en-us/products/operations-bridge
Micro Focus | Network Node Manager i | Network performance monitoring | By request | microfocus.com/en-us/products/network-node-manager-i-network-management-software
Micro Focus | Network Operations Management | Network management | By request | microfocus.com/en-us/products/network-operations-management-suite
Microsoft | System Center | Infrastructure monitoring | Trial period | microsoft.com/en-us/system-center
Microsoft Azure | Azure Monitor | Observability | Free tier | azure.microsoft.com/en-us/products/monitor
Microsoft Azure | Network Watcher | Network performance monitoring | Free tier | azure.microsoft.com/en-us/products/network-watcher
Moogsoft | Moogsoft | AIOps and observability | By request | moogsoft.com
Nagios | Nagios Core | Infrastructure monitoring | Free | nagios.com/products/nagios-core
Nastel | Navigator X | Middleware monitoring | By request | nastel.com
Nastel | CyBench | API and code benchmarks | Free | cybench.io
Netreo | Netreo Platform | Infrastructure monitoring and AIOps | By request | netreo.com
NETSCOUT | nGeniusPULSE | Business service monitoring | By request | netscout.com/product/ngeniuspulse
NETSCOUT | nGeniusONE Solution for Enterprise | APM and network monitoring | By request | netscout.com/product/ngeniusone-platform
New Relic | New Relic | Observability | Free tier | newrelic.com
OpenTelemetry | OpenTelemetry | Observability framework | Open source | opentelemetry.io
Power Admin | Server Monitor | Network monitoring | By request | poweradmin.com/products/server-monitoring
Prometheus | Prometheus | Monitoring system | Open source | prometheus.io
Riverbed | Alluvio IQ | SaaS-delivered observability service | By request | riverbed.com/products/unified-observability/alluvio-iq
Riverbed | Alluvio NPM | Network performance management | By request | riverbed.com/products/network-performance-management
Riverbed | Alluvio NetProfiler | Network traffic monitoring | Trial period | riverbed.com/products/npm/netprofiler.html
ScienceLogic | ScienceLogic SL1 | AIOps | By request | sciencelogic.com
Sentry | Sentry | Application monitoring | Free tier | sentry.io
Smartbear | Bugsnag | App error monitoring and observability | Trial period | bugsnag.com
Smartbear | LoadNinja | Performance testing | Trial period | loadninja.com
SolarWinds | AppOptics | Full-stack performance monitoring | Trial period | solarwinds.com/appoptics
SolarWinds | Loggly | Log management | Trial period | solarwinds.com/loggly
SolarWinds | Pingdom | Performance monitoring | Trial period | solarwinds.com/pingdom
SolarWinds | Network Performance Monitor | Network performance monitoring | Trial period | solarwinds.com/network-performance-monitor
SolarWinds | SolarWinds Observability | Observability | Trial period | solarwinds.com/solarwinds-observability
SpeedCurve | SpeedCurve | Website performance monitoring | Trial period | speedcurve.com
Splunk | Splunk APM | Observability | Trial period | splunk.com/en_us/products/apm-application-performance-monitoring.html
Splunk | Splunk Infrastructure Monitoring | Infrastructure performance monitoring | Trial period | splunk.com/en_us/products/infrastructure-monitoring.html
Splunk | Splunk IT Service Intelligence | AIOps | By request | splunk.com/en_us/products/it-service-intelligence.html
Stackify | Retrace | Performance management and observability | Trial period | stackify.com/retrace
Sumo Logic | Sumo Logic | Cloud-native monitoring and observability | Trial period | sumologic.com
Sysdig | Sysdig Monitor | Kubernetes and cloud monitoring | Trial period | sysdig.com/products/monitor
ThousandEyes | End User Monitoring | Network performance monitoring | Trial period | thousandeyes.com/product/end-user-monitoring
Unravel | Unravel | Observability | Free tier | unraveldata.com
Virtana | Virtana Multi-Cloud Insights Platform | Performance management | Free tier | virtana.com/products/multi-cloud-management
VMware | VMware Aria Operations for Applications | Unified observability and monitoring | Trial period | tanzu.vmware.com/aria-operations-for-applications
xMatters | xMatters Platform | Service reliability | Free tier | xmatters.com
Zabbix | Zabbix | Network and application monitoring | Open source | zabbix.com
Zenoss | Zenoss Cloud | AIOps, full-stack monitoring, and observability | By request | zenoss.com
Zipkin | Zipkin | Distributed tracing | Free tier | zipkin.io
Zoho | ManageEngine OpManager | Network performance monitoring | By request | manageengine.com/network-monitoring
Zoho | ManageEngine Applications Manager | Application performance monitoring | By request | manageengine.com/products/applications_manager
Zoho | Site24x7 | End-to-end performance monitoring | Trial period | site24x7.com