Professional Documents
Culture Documents
House Dzone Refcard 369 Observability Distributed
House Dzone Refcard 369 Observability Distributed
Observability for
• What Is Observability?
− Pillars of Observability
Distributed Systems
an Entire Organization
− Graphs and Alerts
• Conclusion
− Additional Resources
Engineering is changing day by day at an increasing pace with the systems used to improve the way resources are consumed to
appearance of more advanced systems that adapt to the complexity attain the best performance possible. Achieving this is certainly
of the processes required for data transformation. Distributed not easy and requires coordination between development and
systems play a part in this evolution by offering a high degree of operations teams for proper management and consumption of
flexibility and complexity at multiple levels, requiring an appropriate available resources.
observability platform to benefit from many of its advantages.
BENEFITS OF A DISTRIBUTED SYSTEM
This Refcard is intended for anyone interested in learning more about A distributed system can be composed of multiple resources,
the management of distributed systems, and more specifically, how physical servers, virtual machines, and containers; anything with
to observe this type of platform for proper operation. compute capabilities available on the network can be used to
create a cohesive unit consumed by the end user. This kind of
WHAT IS A DISTRIBUTED SYSTEM? architecture, as illustrated in Figure 1 (see next page), has four global
Businesses need to be agile to survive in today’s volatile, ever- objectives at a company level: competitiveness, agility, velocity, and
changing marketplace, which means their applications' underlying speed to market.
architectures must also be agile. The five key architectural features
to support agility, speed to market, and ultimately, competitive
advantage in today's market are:
1
REFCARD | OBSERVABILIT Y FOR DISTRIBUTED SYSTEMS
The competitiveness of an enterprise can be measured according to It is, therefore, crucial to have a reliable observability platform
the level of its product availability. Today, this means an architecture in order to quickly access all the data needed for diagnosis
must be capable of scaling to the extent needed to support the and optimization across all areas and at multiple levels of the
product, ultimately, in order to meet both client and end-user distributed system.
demand at all times.
WHAT IS OBSERVABILITY?
Speed to market is the ability to react quickly to change, relying The software development industry is evolving rapidly, from system
on agility to do it right. Agility is needed in system management as architectures to user expectations. High availability, reliability, and
well as in workflows that consume resources to accelerate change. visibility are prerequisites for any company to remain competitive
Agility is achieved by integrating flexibility into deployments, tests, in their respective markets. Relying on legacy systems that only
and maintenance, which works to improve team performance, thus aggregate measurements does not provide the type of visibility
allowing an increase in velocity. needed to quickly identify and correct anomalies.
The design of distributed systems is tailored to meet all these criteria. New methods to efficiently find and solve problems are required
These systems' high performance and fault tolerance ensures for limiting downtime and mitigating the risk of negative business
business continuity of service and minimal impact on the end user. impact. Poor user experiences, for example, can result in harm
to a company's reputation or loss of prospects — and even loss of
Figure 1: Distributed system advantages at a business level
existing customers.
Any distributed system has its disadvantages. By design, a distributed must be designed and implemented during the initial
system is spread across multiple resources, which increases the system build, coding, and testing.
2
REFCARD | OBSERVABILIT Y FOR DISTRIBUTED SYSTEMS
METRICS TR ACES
Metrics are numerical measurements collected by traditional A trace can be seen as a graphical representation of event logs.
monitoring systems to represent the system state. Coupled with Distributed tracing tracks and observes service requests flowing
tags, metrics can be easily grouped, searched, and graphically through distributed systems by collecting data as the requests go
represented to understand and predict a system’s behavior over from one service to another. Put simply, traces represent the lifecycle
time. The measures, by design, have many advantages: This type of of a request across a distributed system. They are used by multiple
data is suitable for storage, processing, compression, and retrieval, teams — for instance:
making it easier to compact, store, and query data with dashboards
• Developers can measure and optimize least performant
that reflect historical trends.
calls in the code.
Metrics remain the entry point to any monitoring platform based on • SREs can identify potential security breaches.
the collection of CPU, memory, disk, network measures, etc. And as
• DBAs can detect long-running queries that slow down
such, they no longer belong solely to operators — a metric can be
the user experience.
created by anyone.
By analyzing traces, it is possible to measure the overall health of the
For example, a developer may choose to expose an application-
system, point out bottlenecks, discover and resolve problems faster,
specific set of measures such as the number of treatments performed,
and prioritize high-value areas for optimization and improvement.
the time required to complete them, and the status of those
While metrics, logs, and traces serve their own purpose, they all
treatments. Their objective is to link these data to different levels
work together to help you better understand the performance and
(system and application) to define an application profile in pursuance
behavior of your distributed systems.
of identifying the necessary architecture for the distributed system
itself. This results in improved performance, increased reliability, HOW OBSERVABILITY SUPPORTS AN
and security system wide. ENTIRE ORGANIZATION
One of the main goals of an observability platform is for the whole
Metrics used by development teams to identify improvement points
organization to adopt it. The collection of data is, indeed, an
in the source code can also be used by operators to determine the
important element, but without representation or interpretation
architecture needed to support user demand and by the executive
of this data, observability solutions, unfortunately, can lose value
team to control and improve the adoption and use of the application
for a majority of the organization. However, there are two primary
by customers.
ways to represent the data an observability platform obtains, in
LOGS part, for the purposes of demonstrating value through graphs
Logs are immutable timestamped records of events that happened and to receive notifications in case of a malfunction, thanks to alerts.
over time. Logs provide another point of view, in addition to metrics,
GRAPHS
of what occurred in the application at any given moment. There are
Graphs are one of the simplest methods for representing the meaning
three log file formats:
of any data collected, allowing everyone in the company to get a
• Plaintext – Most common format with no specific structure global vision in time of the application's performance, status, and
• Structured – Most recent format used with a specific reliability. This can range from representation of the distributed
structure (e.g., JSON files) system's performance needed by operators and developers to the
graphical representation of how users are utilizing the application.
• Binary – Format used by multiple applications to improve
the performance of data management (e.g., MySQL binlog
ALERTS
replication, system journal log, Avro)
Based on the representation of observability data over time, and
Logs provide more granular visibility of the actions performed by therefore, of any application profile built by such data, it is possible
an application. These are extremely important to development and to identify deviant behaviors across a distributed system. The ability
operations teams, as they often include elements necessary for to identify performance issues enables configuration of alerts that,
debugging and optimizing application performance. The complexity when triggered, allow operators and developers to avoid unforeseen
of distributed systems often creates various interconnected failures events. Teams can anticipate adverse events and consequently, take
that can be difficult to identify, so diligent log management is a proactive approach to any necessary resolutions.
3
REFCARD | OBSERVABILIT Y FOR DISTRIBUTED SYSTEMS
The creation of an observability platform requires a certain Security Is the platform compatible with your security
level of team maturity, both regarding data collection and data policies? How does it encrypt the data? Can
representation/interpretation, to simplify the adoption and use everyone access sensitive data? Is the platform
easily auditable?
across an organization.
Distributed storage Can you distribute storage to increase high
Figure 2: Detailed observability concept availability and potentially better manage data
retention?
Cost management How much time and money can the company
invest in an observability platform and its
implementation/management?
4
REFCARD | OBSERVABILIT Y FOR DISTRIBUTED SYSTEMS
A simple method for this process is to: distributed tracing from front-end devices to back
ends like databases.
1. Deploy an agent on each system to collect necessary data.
This agent plays an important role in platform observability • Implement a service mesh – allows automatic
and, therefore, must also be monitored to ensure it remains deployment of proxies between endpoints to track
operational and does not consume too many resources as requests and collect traces.
5
REFCARD | OBSERVABILIT Y FOR DISTRIBUTED SYSTEMS
YP6zACqjIAw2-vwa/t/observability