The Definitive Guide to Application Performance Management
For several years now, Realtime has produced dozens and dozens of high-quality books
that just happen to be delivered in electronic format—at no cost to you, the reader. We’ve
made this unique publishing model work through the generous support and cooperation of
our sponsors, who agree to bear each book’s production expenses for the benefit of our
readers.
Although we’ve always offered our publications to you for free, don’t think for a moment
that quality is anything less than our top priority. My job is to make sure that our books are
as good as—and in most cases better than—any book that would cost you $40 or more.
I want to point out that our books are by no means paid advertisements or white papers.
We’re an independent publishing company, and an important aspect of my job is to make
sure that our authors are free to voice their expertise and opinions without reservation or
restriction. We maintain complete editorial control of our publications, and I’m proud that
we’ve produced so many quality books over the past years.
Don Jones
Copyright Statement
© 2010 Realtime Publishers. All rights reserved. This site contains materials that have
been created, developed, or commissioned by, and published with the permission of,
Realtime Publishers (the “Materials”) and this site and any such Materials are protected
by international copyright and trademark laws.
THE MATERIALS ARE PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND,
EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE,
TITLE AND NON-INFRINGEMENT. The Materials are subject to change without notice
and do not represent a commitment on the part of Realtime Publishers or its web site
sponsors. In no event shall Realtime Publishers or its web site sponsors be held liable for
technical or editorial errors or omissions contained in the Materials, including without
limitation, for any direct, indirect, incidental, special, exemplary or consequential
damages whatsoever resulting from the use of any information contained in the Materials.
The Materials (including but not limited to the text, images, audio, and/or video) may not
be copied, reproduced, republished, uploaded, posted, transmitted, or distributed in any
way, in whole or in part, except that one copy may be downloaded for your personal, non-
commercial use on a single computer. In connection with such use, you may not modify
or obscure any copyright or other proprietary notice.
The Materials may contain trademarks, service marks and logos that are the property of
third parties. You are not permitted to use these trademarks, service marks or logos
without prior written consent of such third parties.
Realtime Publishers and the Realtime Publishers logo are registered in the US Patent &
Trademark Office. All other product or service names are the property of their respective
owners.
If you have any questions about these terms, or if you would like information about
licensing materials from Realtime Publishers, please contact us via e-mail at
info@realtimepublishers.com.
Chapter 1
You’ve seen these words before. Perhaps you were buying a just-released book or video
through an online Web site when the message popped up in the middle of checkout. Maybe
you were trying to get tickets to that important sporting event or that one-night-only
concert. What about when the winter storm of the century hits your airport and thousands
of people scramble at once to find a new flight or a hotel room?
Each of these scenarios is strikingly similar to the others. An IT service struggles to keep up
with the load of its users, until that load finally overwhelms its capabilities. You, the end
consumer, are greeted with a pleasant message that effectively tells you…nothing. You
don’t know what happened. You don’t know the status of the problem or its resolution. You
don’t even know when that suggested “later” may be for you to try again. So, you—and
everyone else—find yourself hitting the Refresh button over and over again impatiently
waiting for a better response.
Or, in extreme situations, you stop doing business with that site entirely.
These scenarios are also remarkable in how often they’re seen by the end customers
of Web and other IT-based services. When they work, the IT services used by businesses
are fantastically efficient in servicing customers. Yet when they don’t, the result is the
online equivalent of a “Closed for Business” sign hanging on the front door.
You might have experienced situations like this in other places. Perhaps the problem isn’t
in an online e-commerce system. Maybe a similar outage of service happened within an
internal business application, the functionality of which is critical to getting your job done.
Maybe an underlying IT infrastructure component such as name resolution or the network
itself experiences a problem. That low-level issue then manifests itself in ways that are
seemingly unrelated to the actual problem.
The central problem in all of these situations is an inability to properly manage application
performance.
The problem is that the idea of a “service that is down” is often so much more than a simple
binary answer: on versus off, working versus not working. As you can see in Figure 1.1, IT
services are made up of many components that must work in concert. Servers require the
network for communication. Web servers get their information from application servers
and databases. Data and workflow integrations from legacy systems such as mainframes
must occur. These days, even data storage must be accessible over that same network.
For many years this binary view of the IT environment was sufficient for most businesses.
As long as services were available, users could complete their goals. However, as
businesses have grown more and more reliant on their IT services as a critical
function of business, this immature “is it on?” approach to managing services is no
longer acceptable.
Cross-Reference
Chapter 2 will explore this history in greater detail and discuss how the
maturity level of an IT organization bears heavily on how it goes about
preparing for and solving problems.
Organizations that want to take advantage of APM must put in place a workflow and
technology infrastructure (see Figure 1.2) that enables the monitoring of hardware,
software, business applications, and, most importantly, the end users’ experience. These
monitoring integrations must be exceptionally deep in the level of detail they elevate to the
attention of an administrator. They must watch for and analyze behaviors across a wide
swath of technology devices and applications, including networks, databases, servers,
applications, mainframes, and even the users themselves as they interact with the system.
Figure 1.2 shows an example of how such a system might look. There, you can see how the
major classes of an IT system—users, networks, servers, applications, and mainframes—
are centered under the umbrella of a unified monitoring system. That system gathers data
from each element into a centralized database. Also housed within that database is a logical
model of the underlying system itself, which is used to power visualizations, suggest
solutions to problems, and assist with the prioritization of responses.
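To make the idea of this logical model concrete, the following sketch wires the major classes from Figure 1.2 into a tiny dependency structure and rolls status up through it. The class design, the status names, and the worst-status rule are illustrative assumptions, not the schema of any particular APM product:

```python
from dataclasses import dataclass, field

@dataclass
class Element:
    """One monitored element in the service model.

    Field names are illustrative; real APM products define
    their own schemas for model elements.
    """
    name: str
    depends_on: list = field(default_factory=list)
    status: str = "green"  # green, yellow, or red

# The major classes from Figure 1.2, wired into a simple model.
network = Element("network")
servers = Element("servers", depends_on=[network])
applications = Element("applications", depends_on=[servers])
mainframe = Element("mainframe", depends_on=[network])
service = Element("business service", depends_on=[applications, mainframe])

SEVERITY = {"green": 0, "yellow": 1, "red": 2}

def rolled_up_status(element):
    """An element is only as healthy as its worst dependency."""
    worst = element.status
    for dep in element.depends_on:
        dep_status = rolled_up_status(dep)
        if SEVERITY[dep_status] > SEVERITY[worst]:
            worst = dep_status
    return worst

# A problem deep in the mainframe surfaces at the service level.
mainframe.status = "red"
print(rolled_up_status(service))  # -> red
```

A real service model would also attach monitoring rules and metadata to each element; the point here is only the dependency-driven roll-up.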
Figure 1.2: An APM solution leverages monitoring integrations and service model
logic to drive visualizations, prioritize problems, and suggest solutions.
With its monitoring integrations spread across the network, such a system can then assist
troubleshooting administrators with finding and resolving the problem’s root cause. In
situations in which multiple problems occur at once—not unheard of in IT environments—
an APM system can assist in the prioritization of problems. In short, an effective APM
system will drive administrators first to those problems that have the highest level of
impact on users.
So, with this in mind, why are you here? Why read this guide?
These ten chapters are designed to help you understand what APM is and how it helps IT
better align with the needs of its business. They will help you recognize the different types
of APM integrations, including the all-important end users’ perspective, and how the
different integration types tie into that overall picture of service quality. They will give you
an end-to-end understanding of APM in action, including the creation and use of interactive
dashboards for users, administrators, and executives. You’ll also come to understand how
APM’s information gathering brings data that can tie directly into business metrics through
Business Service Management (BSM).
In the next ten chapters, you’ll gain a detailed level of knowledge about the requirements
and the power of APM, centered around the following topics:
• Chapter 1: What Is APM? This first chapter will document the problem with
today’s traditional monitoring solutions and explain why APM is an effective
solution. It will introduce APM and explain where it fits into the environment.
• Chapter 2: How APM Aligns IT with the Business. APM provides great value, but
only when an organization’s culture supports what APM has to offer. IT
organizations must have a certain “maturity” if they are to get the highest levels of
value from APM. As those organizations mature, they will at the same time find
greater alignment between their goals and the goals of their business.
• Chapter 3: Understanding APM Monitoring. As you’ve already learned, APM’s
monitoring integrations hook into vastly different areas within your IT
infrastructure. This chapter will discuss those integrations, how they work, and how
your existing monitoring infrastructure can augment an APM solution.
• Chapter 4: Integrating APM into Your Infrastructure. Following on Chapter 3’s
functional descriptions is a further explanation of APM’s logical integrations into
infrastructure components, network analytics, and application analytics. To build its
holistic awareness of a business service, APM must peer into every facet of that
service, from clients to mainframes and everything in between. This chapter
discusses the nuts and bolts of how that will happen.
• Chapter 5: Understanding the End User’s Perspective. You’ll notice in Figure 1.2
that one critical component to be monitored is actually the end user. End User
Experience (EUE) monitoring adds the users’ perspective of the system, illuminating
when users see problems that other areas of monitoring can’t see.
• Chapter 6: APM’s Service-Centric Monitoring Approach. With each of the
necessary monitoring integrations in place, it is now possible to build a model of the
service itself. This model creates a logical diagram of the service and its
components, providing a map for monitors to display and update their information.
The service model is also the structure that drives what kinds of data administrators
will see within their assigned visualizations.
APM by Example
Now that you understand the path this guide will take, it is perhaps best to continue with a
short example. This example is designed to give you an early taste of APM and prepare you
for the chapters to come. Later chapters will continue with larger and more detailed
examples like this one to assist with your learning.
Consider a business service similar to the one shown in Figure 1.1. This system
provides some level of service for customers on the Internet. It is made up of six major
components: the network, network-attached storage (NAS), a firewall, and three servers
that communicate with each other to fulfill the display, logic, and data storage needs of the
system.
Figure 1.3: A problem deep within the mainframe can impact the operations of the
entire system.
However, consider the situation in which the problem lies deeper within the application
itself. In this example, the problem is not the loss of an entire server or device. Here, a much
deeper problem exists. Rather than a simple server loss, the response time between the
application server and the mainframe instead slows down. This occurs due to a problem
within the mainframe. The decrease in performance between these two components
eventually grows poor enough that it impacts the system’s ability to complete transactions
with the mainframe. As a result, the dependent upstream servers such as the application
server, database server, and Web server can no longer fulfill their missions.
Figure 1.3 shows a pictorial example of how this problem might manifest itself to the end
user. In the picture, you can see how the information exiting the mainframe does not make
its way to the application server in a timely fashion. Because of this slowdown in
performance, transaction timeout values and thresholds are eventually exceeded. This
causes the application server to no longer be able to serve the needs of the Web server,
which itself can no longer serve the user. In the end, the user experiences a loss of service
of the entire Web site, one that is difficult to trace back to the initial problem.
Figure 1.4: Stovepiped monitoring solutions don’t provide a holistic view of the
entire system.
Needed are solutions that integrate the monitoring information from each component.
Such a system, like the one previously shown in Figure 1.2, consolidates monitoring
integrations across multiple domains into a single, centralized database for
processing. In that database will be a model of the business service itself, creating the
necessary logic that enables the system to understand and report on the data it is
collecting.
Measuring Transactions
Yet this alone still isn’t enough. Truly measuring performance is more than just enabling
PerfMon counters on a Windows server or logging NetFlow statistics from a Cisco network
device. Even comparing one device’s set of information with another gives you only a
limited perspective on the environment as a whole. Counters like these give you
information about devices and interactions between devices. Nowhere in their data do they
provide information about the applications that are installed atop those devices. Their
aggregate data cannot illuminate the individual communications between, for example, two
servers that comprise a service. To collect this data, it is additionally necessary to look at
the individual transactions that occur between service components.
Figure 1.5: Transaction monitoring can watch the sequence of events between
system components and alert when problems occur.
Figure 1.5 shows an example of how transaction monitoring might have recognized that a
problem was occurring. The sequence diagram there shows the interactions between the
different components that make up our example service. There, the user attempts to
place an order on the Web site. The Web site then attempts to create a
shopping basket for the user. This action requires a verification of inventory levels prior to
completion, an action that isn’t completed in time. Consequently, the sequence ends with
the application server timing out its request and an overall failure in the system.
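The sequence in Figure 1.5 can be caricatured in a few lines of Python. The step names and per-step time budgets below are invented for illustration; the point is only that a transaction monitor compares each recorded step against an expected threshold:

```python
# Hypothetical per-step time budgets (in seconds) for the order
# transaction. Both the names and the limits are assumptions made
# for this sketch, not values from any real monitoring product.
STEP_BUDGETS = {
    "place_order": 2.0,
    "create_basket": 1.0,
    "verify_inventory": 3.0,
}

def first_slow_step(steps):
    """Return the first step that exceeded its budget, or None.

    `steps` is a list of (step_name, duration_seconds) tuples,
    as a transaction monitor might record them in sequence.
    """
    for name, duration in steps:
        if duration > STEP_BUDGETS.get(name, float("inf")):
            return name
    return None

# The failing sequence from Figure 1.5: inventory verification on
# the mainframe takes far too long, so the transaction fails there.
recorded = [("place_order", 0.4), ("create_basket", 0.2),
            ("verify_inventory", 7.5)]
print(first_slow_step(recorded))  # -> verify_inventory
```

In a real APM deployment the steps would be captured by instrumentation rather than listed by hand, but the comparison against a per-step budget is the essence of transaction monitoring.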
Chapter 3 will go into more detail about this idea of transaction monitoring, followed by an
in-depth discussion on its relation to the end users’ perspective in Chapter 5. But for now,
recognize that multiple and simultaneous mechanisms for monitoring business services are
required if you are to obtain that desired level of situational awareness.
Figure 1.6: The application life cycle’s impact to the business in both perfect and
poor implementations.
Figure 1.6 shows an example of how the life cycle for an application can be recognized.
Identifying the need, scoping the project, developing the application, designing its
architecture, and eventually implementing the solution are all required cost elements that
must occur to set the application in place. Once in production, “perfect” applications tend to
require comparatively little marginal cost period over period.
For a perfect implementation, the impact to the business is displayed by Figure 1.6’s green
line. There, the application begins with zero benefit to the organization and all cost. Prior to
its production deployment, obviously no one within the organization is making use of the
application. At the same time, organizational resources are consumed for those
aforementioned development activities. Once the application is brought into production, its
benefits begin to increasingly outweigh the marginal costs required to keep it running. If
that application is an internal tool for business employees, this benefit curve goes up as
employees find value in its features. For external applications, the benefit curve goes up as
potential customers begin using the application and creating income for the business.
However, few applications experience that “perfect” curve between development and
production, cost and benefit. The alternating red curve also seen in Figure 1.6 represents
the fits and starts that occur with applications that are poorly brought into production.
Perhaps the project wasn’t scoped properly and more users are found to need the
application than previously thought. Maybe the application creates an impact on the
network that causes outages or performance issues with other services. Sometimes success
itself becomes problematic. A newly-deployed application can be so successful that the
initial customer excitement over its release sends it down, crashing under the weight of its
own exuberant users.
All of these represent areas in which issues with application delivery are caused by
poor management of anticipated application performance. These issues occur both within
the application as well as external to it. Too often performance is not considered as a
primary requirement during application development, forcing performance management
to occur late in the development life cycle. This omission of performance management
early in application development adds cost to applications and extends their “cost” period.
An added goal of APM is to provide the data foundation where performance impacts from
the environment are understood at a macro scale. Here, monitoring integrations from all
across the network environment are consolidated to provide the data necessary to create
good designs with new applications right from the get-go.
Consider again the example from earlier in this chapter. In that example, some unknown
condition on the mainframe was eventually found to be the root cause for a multiple-server
failure. If fixing this problem requires a redevelopment of the core application, an outage
like this represents a cost to the business. It means the trendline for the application
switches from the “green” line to the “red” line, extending the time required to declare
success with the application’s deployment.
However, leveraging data gathered through an APM solution could have prevented such a
problem from occurring in the first place. Perhaps the mainframe was already serving
other customers and approaching the limit of its processing capabilities. Perhaps the piece
of code built for the mainframe was not optimized in its processing requests, causing the
mainframe to work particularly hard in processing every request. Any of these
performance-related issues could potentially have been tracked down before the failure
actually occurred.
How does the system identify what is useful and what isn’t? Part of your implementation
project will be to define just those parameters. This is done firstly through the creation of a
logical representation of your services and infrastructure called the service model. As an
extremely detailed picture, creating this logical representation can be one of the most
challenging parts of an APM implementation.
Figure 1.7: Mapping physical components into the service model’s logical
representation.
Installing an APM framework and its software results in what amounts to a blank canvas.
On that canvas, you will sketch out your environment’s architecture including all the
elements that make up the systems to be managed. The completed service model becomes
that logical representation of the components in your infrastructure. Figure 1.7 shows a
simplistic example of how this might work. Here, each of the application’s physical
components and geographical locations has been mapped to a dependency diagram. The
arrows in this dependency diagram show where elements within the model rely on other
elements for their processing. At the top level, the service itself requires information or
processing provided by the three servers in the environment. Each of those three servers
requires the support of both network and storage components. Also shown is the user’s
experience, divided among its multiple locations.
In this model, coloring is typically used to identify the health and status of each of its
individual elements. You’ll see that each of the dots representing an element is colored
green. This easy-to-understand heads-up display provides a way to identify whether that
component is functioning to desired levels. Obviously, when problems occur with a
component, its green color will shift to yellow or red, denoting a caution or warning
condition. Not shown in Figure 1.7 are the individual rules that are used in the background
to identify the “greenness” or “redness” of the element. For example, underneath the
network element may be custom-built rules that identify bandwidth or latency
conditions that are unacceptable or that may indicate a pre-failure condition. When any of
those rules is tripped by the monitor, the element’s color changes to signal the problem
condition.
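Those background rules amount to mapping an element’s current metrics onto a green, yellow, or red status. A minimal sketch, with invented threshold values for a hypothetical network element, might look like this:

```python
def element_status(metrics, rules):
    """Map an element's current metrics to green/yellow/red.

    `rules` maps a metric name to a (caution, warning) threshold
    pair; the worst breached threshold wins. All thresholds here
    are illustrative assumptions, not vendor defaults.
    """
    status = "green"
    for metric, (caution, warning) in rules.items():
        value = metrics.get(metric, 0)
        if value >= warning:
            return "red"
        if value >= caution:
            status = "yellow"
    return status

# Hypothetical rules for a "network" element: latency in
# milliseconds, utilization as a fraction of available bandwidth.
network_rules = {"latency_ms": (50, 200), "utilization": (0.7, 0.9)}

print(element_status({"latency_ms": 35, "utilization": 0.5}, network_rules))   # -> green
print(element_status({"latency_ms": 80, "utilization": 0.5}, network_rules))   # -> yellow
print(element_status({"latency_ms": 80, "utilization": 0.95}, network_rules))  # -> red
```

Production rule engines support far richer conditions (rates of change, time windows, compound expressions), but each ultimately reduces an element’s metrics to a status color in just this way.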
To obtain this data, multiple types of installed monitoring integrations are required. These
integrations provide the hooks into various components that make up the service like code
frameworks, databases, applications, and Web services. Installing and managing these
monitors is the next big step in an APM implementation. Depending on the architecture of
your applications and the components that make up the infrastructure, many monitors may
be required to gather the right amount of data. Table 1.1 lists a few common monitors and
integration points.
SMTP, DNS, OracleForms, Siebel, SAP, XML/SOAP, Tuxedo, Citrix, .NET, Oracle, SQL,
Sybase, Informix, DB2, SMB, DCOM, RPC, SSL
Table 1.1: A sample list of potential integration points for an APM solution.
This concept was first described in the book The Definitive Guide to Business Service
Management. There the term digestibility is used to help the reader understand how
different views make sense for different readers. This concept of digestible data relates to
presenting the kind of data that is interesting, useable, and useful to its targeted class of
individual.
Consider three types of individuals who may be interested in data that is generated by an
APM solution. The first individual may fill the role of systems administrator. That
individual may be interested in understanding large-scale conditions that occur across
system components. They may have an interest in network conditions and the status of
services that are in operation. When problems occur, they want to know specifically
where that problem occurs so that they can drive towards a fix.
The second class of individuals who stand to gain through an APM solution are the
developers of the system itself. A developer may be unconcerned with the day-to-day
operations of a system once that system is in full production. Knowing which elements of
the system are up versus down is not part of the job role of that system’s developer.
However, they do get involved when an administrator needs deep troubleshooting
assistance with a problem, or when issues with the system require code updates or fixes.
Application developers are likely to want to view more detailed information about the
individual transactions between service components. They might want to see which
transactions completed successfully versus which failed. Information about the performance
of page refreshes on an application’s Web site is of much greater use to its developer, as
that individual can find and fix the specific problem at the code level.
The third individual who might have interest is the end user themselves. End users of
systems, especially those large-scale systems that are likely to be monitored with an APM
solution, want to know when problems occur. This chapter started with a discussion of how
users don’t like seeing unhelpful messages like “This Web site is experiencing unexpectedly
high volume. Please try again later.” during an outage. An APM solution can also be
used to notify end users when problems occur. It can be used to help them
understand when performance conditions are lower than normal, and when users can
expect to see a return to full operations.
Figure 1.8 shows a collage of potential visualizations that can be of interest to each of these
classes of users. You’ll see there the high-level stoplight charts that describe when system
components are behaving at or below expectations. This view is handy for the
administrator to know when system components experience problems. Also there is a
detailed view of a set of individual transactions. This view provides the developer the
necessary information they need to trace issues with performance or functionality. The
third image shows an extremely high-level view of a global system, detailing areas where
that system may be experiencing problems. This view provides the right level of detail for
the end user, giving them the knowledge that problems exist.
Figure 1.8: Three examples of visualizations that are tuned to the needs of their user
class: administrator, developer, and end user.
All problems in an IT environment stem from some kind of change within the environment
itself. This is the case because computers are deterministic. Some action or change must
occur within the environment that drives the problem to occur. With the right monitoring
integrations in place, it is possible to characterize the performance of your applications
over time. It is then possible to use that characterization to track when an unacceptable
behavior occurs, and immediately point the finger to where the root cause lies.
Cross-Reference
Chapter 6 will discuss this root cause analysis process in greater detail.
Characterization of Problems
Returning once again to the example problem in this chapter, characterizing the
mainframe’s problem was possible because its nominal behaviors were encoded into the
APM solution’s service model. Those nominal behaviors, such as acceptable processor
utilization and acceptable delay in responding to inventory requests, provided a basis by
which its later unacceptable behaviors could be alerted on.
When you’ve gone through the exercises necessary to characterize the acceptable
behaviors of the elements in your IT environment, you provide that basis for alerting when
unacceptable ones exist. Leveraging this ability with the dependency linking that makes up
the service model provides a way to show how one component’s behavior impacts others.
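The characterization step can be sketched in a few lines. The fragment below derives a simple nominal ceiling from historical response times; the three-sigma band and the sample numbers are assumptions chosen for illustration, not a prescription from any APM product:

```python
import statistics

def nominal_ceiling(samples, sigmas=3):
    """Derive a nominal upper bound from historical response times:
    mean plus `sigmas` standard deviations, a common and deliberately
    simple baselining heuristic."""
    mean = statistics.mean(samples)
    stdev = statistics.pstdev(samples)
    return mean + sigmas * stdev

# Historical mainframe inventory-lookup times in seconds (invented).
history = [0.8, 0.9, 1.1, 1.0, 0.85, 0.95, 1.05]
ceiling = nominal_ceiling(history)

def is_acceptable(observed, ceiling):
    """True while an observation stays within the characterized band."""
    return observed <= ceiling

print(is_acceptable(1.1, ceiling))  # -> True  (within nominal behavior)
print(is_acceptable(6.0, ceiling))  # -> False (the degraded mainframe)
```

Once behaviors like these are encoded into the service model, any observation that falls outside the characterized band can raise an alert before users feel the full impact.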
A problem with this in many IT organizations today is simply not knowing how many users
are affected by each business application or component. APM’s structured approach to
defining the overall architecture provides a way to easily roll up the number of affected
users by component. This ultimately provides a way to prioritize which problems should be
resolved first and which can be left for later resolution.
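A minimal sketch of that roll-up, assuming a hand-built dependency map and invented user counts (neither comes from a real APM deployment), might look like this:

```python
# Each component lists the components that depend on it; user counts
# are attached to user-facing entry points. Topology and numbers are
# invented purely for illustration.
dependents = {
    "mainframe": ["app_server"],
    "app_server": ["web_server"],
    "database": ["web_server"],
    "web_server": ["storefront"],
}
users_of = {"storefront": 12000}

def affected_users(component):
    """Roll up how many users a component's failure would impact by
    walking the dependency graph toward user-facing services."""
    seen, stack, total = set(), [component], 0
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        total += users_of.get(node, 0)
        stack.extend(dependents.get(node, []))
    return total

print(affected_users("mainframe"))  # -> 12000, every storefront user
```

With counts like these in hand, a failed component affecting twelve thousand users is plainly a higher priority than one affecting a handful.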
Situational Awareness
The data that passes along your network’s wires is not something that can be directly
looked at with the naked eye. Using traditional network sniffing tools to watch this data is
also problematic due to the sheer quantity of data that flies by during any discernible
period of time. Thus, better approaches that look at data from the network’s perspective in
combination with each server or application’s perspective give administrators better
situational awareness of what’s going on in their networks. Combining this information
with the rich monitoring support through any of APM’s integrations means that the
business can know the status of its applications at all times.
Better Planning
Lastly is the critical need for future planning. Too often, IT organizations go through
planning exercises using a subjective approach, assigning augmentation dollars based on
gut feelings or one-time problems rather than historical behaviors. Taking the long-term
approach with APM data brings IT a mechanism for identifying where system elements
need augmentation or wholesale upgrades. The data provided by APM integrations enables
budgetary decisions to be made based on objective data.
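As a toy illustration of what objective, data-driven planning might look like, the sketch below fits a straight line to historical utilization samples and projects it forward. The numbers, the monthly cadence, and the plain linear fit are all illustrative assumptions; real capacity planning would draw on richer models and actual APM data:

```python
def project_utilization(history, periods_ahead):
    """Project utilization forward using an ordinary least-squares
    line over the period index -- a deliberately crude heuristic."""
    n = len(history)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(history) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history)) \
            / sum((x - mean_x) ** 2 for x in xs)
    # Extend the fitted line `periods_ahead` periods past the last sample.
    return mean_y + slope * (n - 1 + periods_ahead - mean_x)

# Monthly storage utilization as a fraction of capacity (invented).
history = [0.52, 0.55, 0.60, 0.63, 0.67, 0.71]
print(round(project_utilization(history, 6), 2))  # -> 0.94
```

A trend like this one, showing storage approaching saturation within six months, is exactly the kind of objective evidence that turns a gut-feeling budget request into a defensible one.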
However, before you can truly start your APM implementation, an important first step is
understanding your organization’s level of process maturity. That level of maturity drives
how you solve problems, how you react to situations, and what level of structure you have
in place. You’ll come to recognize that organizations that operate with a relatively low level
of process maturity suffer under the weight of overwork, waste, and a lack of automation.
Chapter 2 will discuss just those topics.
Chapter 2
“Well, most of the time,” responds the IT manager. “Sometimes it really is our fault. This is a
complicated system, one that we’ve been incrementally upgrading and expanding over the
years. I mean, by this version, some of that code has got to be spaghettied together so poorly
that it’ll be impossible to figure it out.”
The executive continues, “But isn’t there anything we can do to predict this sort of excessive
activity, to plan for it?”
“No. At least not with what we’ve got today. I’ll argue that the site is technically still up and
running. We’re not ‘technically’ down.”
Frustrated, the executive sits back in his chair and swings to look out the window as the IT
manager leaves his office. It is just this sort of conversation that irritates him about the
technology focus of this business. But what can he do about it?
---
Dan Bishop is COO of TicketsRus.com, a national retailer of tickets for concerts and sporting
events. TicketsRus.com is one of the largest online retailers of tickets, selling tens of thousands
each and every day for everything from the smallest backwater rock band to the hottest pro
basketball teams during their end-season games. Although TicketsRus.com sells a portion of
tickets through their phone operations facility, a massive phone bank located in Waco, Texas,
today the vast majority of tickets are sold through their online Web site, currently homed in
Denver, Colorado.
That Web site is the singular greatest piece of TicketsRus.com’s intellectual property, and a
primary reason behind their “convenience fees,” which serve as the profit for the business. The
TicketsRus.com Web site is a massive, custom-developed online service that tracks incoming
requests, sells tickets, and enables “gatekeeper” functions to prevent overload during periods
of high traffic. It even prints and mails the tickets once customers check out. As an almost
fully-automated system, it could be argued that TicketsRus.com’s Web site is TicketsRus.com.
---
You’ve experienced situations like this before. Dan’s problem with his company’s primary
mechanism for making money is a situation felt all across the IT landscape. In today’s
online e-commerce climate, more and more companies are leveraging the Internet as a
primary—if not singular—location for hosting their wares. With the Internet, inventory
and labor costs are dramatically reduced, as are the costs of the brick-and-mortar
storefronts that now no longer must be leased and maintained. With the automation
brought about by a computing infrastructure, far more productive work can be done with
far less manual intervention.
Yet moving one’s operations to an online facility incurs its own set of risks. In the case of
brick-and-mortar, the loss of a store due to a power outage, a massive snowstorm, or a run
on available products means that customers can still go elsewhere for their needs. In the
case of online, the Internet presence is singular. It must operate at all times, with an
acceptable degree of performance, and in such a way that it gives confidence to its
customers that they’re getting value out of the experience.
---
This is exactly TicketsRus.com’s problem today, and it’s not their fault. An extremely popular
artist has come out of retirement for a new tour, and that artist’s adoring fans have completely
overloaded the system looking for tickets. The simultaneous inrush of new business has
effectively shut down the site, turning what should have been a profit success into Dan’s
current operational nightmare.
And he’s not sure if his IT Manager really understands the gravity of the problem…
Figure 2.1: IT is seen as “the people who fix,” a common sight in many businesses.
When the business didn’t need IT, these groups of people usually found themselves
shuffled away to other parts of the building. Taking over closets and storage rooms behind
locked doors, this group awaited the next problem to be fixed.
Over time, this break/fix mentality grows deeply ingrained into both the members
of IT as well as the rest of the business who rely on them for services. When IT operates in a
break/fix mode, they usually find themselves reacting to problems. A critical server is down
today? Here come the IT “white knights”, riding in to work through the night and ultimately
save the day.
But at the same time, the break/fix mentality’s “hero effect” actually becomes a liability to
the business. IT organizations that see themselves as the heroes to be called when
problems occur probably aren’t spending the right amount of time preventing those
problems from occurring in the first place. If that critical server was actually reporting a
problem for weeks before it finally crashed, IT is no hero in getting it running again—
they’re actually the problem.
Why this disconnect between IT and the business? Other than a historical position inside
the company’s locked storage closets, what are the causes behind IT’s reactive mindset?
Differing responsibilities and mismatched priorities with the rest of the business, a lack of
common vocabulary, and a missing vision into the business’ dollars and cents are all
common factors.
Different Responsibilities
In the story at the beginning of this chapter, the business of TicketsRus.com is brokering
tickets between artists and sports teams and their end consumers. TicketsRus.com makes
its business by providing a convenient service to its customers, making it easy for them to
find and purchase the tickets they want for the events that interest them.
To this end, TicketsRus.com likely has a massive marketing department. The job of that
team is to make potential customers aware that their service exists. They probably have a
sales department who find new events, artists, and teams to sell on their Web site. Their
executive management team’s primary responsibility is to ensure that the company runs
optimally with good profit and expected return. Each of these groups has a primary mission
that aligns with creating and maintaining the flow of TicketsRus.com’s business.
In contrast, TicketsRus.com’s IT department has a different goal entirely, one quite
distinct in scope. That department is responsible for and
charged with maintaining the operations of the computer systems for the rest of the
company. That charge includes the massive online presence where the company makes
most of its profit. When TicketsRus.com makes a profit, the IT department continues to
keep the computers running. When TicketsRus.com doesn’t make a profit, the IT
department continues to keep the computers running.
However, there’s a problem when the metrics associated with what is considered “up and
running” are not well defined. Whereas the IT Manager sees the current situation as a
temporary hiccup in the otherwise smooth running of the online system, this individual
likely isn’t aware that this short hiccup could become the source of major revenue loss for
the business. Because he hasn’t planned for such a contingency, he truly isn’t aware of the
gravity of the situation.
Figure 2.2: At a high level, APM can measure when user load negatively impacts
overall system performance.
This isn’t necessarily to say that the IT Manager’s lack of planning is entirely his fault. When
the IT Manager hasn’t been handed the correct kinds of metrics for measuring
success, he won’t be looking in the correct places to find it. As you’ll find in this guide, one
of the tenets of Application Performance Management (APM) is to provide a mechanism for
defining just those metrics. Without a system in place that can look at system performance
as the sum of its parts (see Figure 2.2), it is difficult or impossible to accurately measure the
success of that system. APM and the solutions that enable it provide just those
measurements.
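To make the idea concrete, here is a minimal sketch of the kind of load-versus-performance check that Figure 2.2 depicts. The sample intervals and the 2,000 ms service-level limit are invented for illustration; a real APM solution would collect these measurements continuously.

```python
# Hypothetical sketch: flag intervals where user load coincides with
# degraded response time. Data and the SLA threshold are invented.

def degraded_intervals(samples, max_response_ms=2000):
    """Return the (users, response_ms) samples that breach the SLA."""
    return [s for s in samples if s[1] > max_response_ms]

# (concurrent_users, avg_response_ms) per five-minute interval
samples = [(120, 310), (450, 540), (900, 1400), (2200, 3900), (2500, 4800)]

breaches = degraded_intervals(samples)
print(breaches)  # the two high-load intervals exceed the 2000 ms SLA
```

With data like this in hand, an IT manager can show precisely when load began to negatively impact performance, rather than arguing about whether the site was “technically” up.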
No Common Vocabulary
IT also suffers from a highly technical vocabulary that isolates it from other
members of the business. The graph in Figure 2.2 makes sense when it is defined within
the scope of metrics that make sense to IT: % Processor Use, Transactions/Second, Java JRE
Method Timeout, and so on. Metrics such as these, however, are useless when attempting
to provide information to the non-technical members of the business. This breakdown in
communication further illustrates the chasm between IT and the business because business
leaders cannot relate their desired goals to IT in subjective terms that translate to objective
metrics.
• Individuals within network teams see metrics from the perspective of data crossing
the wire.
• Systems administrators are primarily interested in, and have the greatest vision
into, whole-server metrics.
• Developers need to peer into runtime environment metrics to see whether their
code is optimized for the environment.
A major problem with this stratification of IT personnel is that no one group can alone
comprehensively describe the behaviors across every component of a system. If a system
problem spans multiple domains, teams must work particularly hard towards finding a
resolution.
Figure 2.3: APM provides a type of Rosetta Stone, aligning each IT discipline’s focus
under a unified solution.
An APM solution assists with this language problem by providing what could be considered
a Rosetta Stone between each IT discipline, their individual focus, and their own
vocabulary. Although individual integrations within an APM solution are likely to be
managed by their responsible discipline—network integrations by network teams, code
optimization integrations by developers, and so on—the unified system provides a central
gathering point for all metrics. This centralization provides a single location where an
application can be measured across each of its IT disciplines at once. Such an analysis can
be further correlated across all disciplines as well.
The end result is that a fully-realized APM solution enables IT to operate as a single unit,
with system problems being quickly directed to the teams that have the greatest capability
to fix the problem.
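As a rough illustration of that central gathering point, the sketch below shows one way per-discipline metrics might be normalized into a single correlated record. The metric names and values are hypothetical, invented only to show the shape of the idea:

```python
# Hypothetical sketch of APM's "Rosetta Stone": each discipline reports
# metrics in its own vocabulary, and the central layer merges them into
# one record keyed by application and timestamp.

def unify(application, timestamp, network, server, runtime):
    """Merge per-discipline metrics into a single correlated record."""
    record = {"application": application, "timestamp": timestamp}
    record.update({f"network.{k}": v for k, v in network.items()})
    record.update({f"server.{k}": v for k, v in server.items()})
    record.update({f"runtime.{k}": v for k, v in runtime.items()})
    return record

row = unify(
    "ticket-store", "2010-05-31T20:05:00Z",
    network={"latency_ms": 38},
    server={"cpu_pct": 91},
    runtime={"method_time_ms": 420},
)
print(row["server.cpu_pct"])  # every discipline's view, one record
```

Because every discipline’s view lands in the same record, a single query can show whether a slow transaction correlates with the wire, the server, or the code.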
No Vision into Dollars and Cents
In the long run, this lack of financial information removes IT’s empowerment to solve
problems based on their budgetary impact. When IT is incentivized towards resolving
broken system components, they’ll fill their day with accomplishing just that task. Those
repair operations, however, might not be the best thing for the system over the long haul:
• Today’s band-aid repair actually clouds the troubleshooting process for tomorrow’s
outage.
• Today’s quick fix masks the much larger recognition that a wholesale system
upgrade is needed to keep up with the load.
Figure 2.4: APM and its integrations provide the raw data that feed Business Service
Management’s financial view of the system.
The relation of a service’s quality to the IT and business budget technically falls within the
purview of Business Service Management (BSM), a topic that will be discussed in Chapter 9.
However, there is a very important relation between BSM and APM in that BSM requires
the metrics gained from an APM solution to populate its business-centric view of the
system. You’ll find that although APM provides the technology metrics, its combination
with business financial logic is what powers BSM’s view of the world. Figure 2.4 shows an
example of the linkage between these key components.
Consider again the problem situation first explained in Chapter 1. There, a slow response
time between the mainframe and the application server eventually grows to impact the
system as a whole. By nature, these kinds of system events often occur over a period of
time, growing in scope—and delay—until a minor situation becomes a major problem.
Figure 2.5: An APM solution’s high-level client network server view can illustrate
where areas of delay may soon cause a problem.
Figure 2.5 shows an APM system’s view of aggregate transaction performance between the
application server and the mainframe. With this view and others in place, it is possible to
draw a trend line towards a future failure before the failure actually occurs. This capacity
for defining possible pre-failure states enables IT to resolve problems before they actually
happen and before users notice. As you’ll learn in the next section, this proactive approach
to operations is representative of an IT organization at a high level of maturity; one that
drives value back to the business rather than reacting to it.
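The trend-line idea can be sketched with a simple least-squares fit over recent samples. The response times and the 300 ms failure threshold below are invented for illustration; a real APM solution would fit trends across far richer data:

```python
# Hypothetical sketch: fit a straight line to recent response-time
# samples and estimate when the trend would cross a failure threshold.

def predict_breach(times, values, threshold):
    """Least-squares fit; return estimated time the trend hits threshold."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    slope = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values)) \
        / sum((t - mean_t) ** 2 for t in times)
    intercept = mean_v - slope * mean_t
    if slope <= 0:
        return None  # not trending toward failure
    return (threshold - intercept) / slope

# hours elapsed vs. mainframe round-trip time in ms, steadily worsening
hours = [0, 1, 2, 3, 4]
rtt_ms = [100, 120, 140, 160, 180]
print(predict_breach(hours, rtt_ms, threshold=300))  # -> 10.0 hours
```

Armed with an estimate like this, IT can schedule a fix during a maintenance window rather than firefighting after users notice.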
Understanding IT Maturity
For an organization to efficiently make use of the kinds of information that an APM solution
can provide, it must operate with a measure of process maturity. IT organizations that lack
configuration control over their infrastructure don’t have the basic capability to maintain
an environment baseline. Without a baseline for your applications, the quality of the
information you gather out of your monitoring solution will be poor at best and wrong at
worst.
But how does an IT organization know when they’ve got that right level of process in place
to best use such a solution? Or, alternatively, if an organization recognizes that they don’t
have the right level, how can an APM solution help them get there?
One way to evaluate and measure the “maturity” of IT is through a model that was
developed in 2007 as part of a Gartner analysis titled Introducing the Gartner IT
Infrastructure and Operations Maturity Model (2007, Scott, Pultz, Holub, Bittman, McGuckin).
This groundbreaking research note defined IT across a spectrum of capabilities, each
relating to the way in which IT actually goes about accomplishing its assigned tasks. An IT
culture with a higher level of process maturity will have the infrastructure frameworks in
place to make better use of technology solutions, solve problems faster, plan better for
expansions, and ultimately align better with the needs and wants of the business they
serve.
Process maturity within an organization is quite a bit more than simply having
the ability to solve problems. Within Gartner’s maturity model, the capacity of IT to solve—
and prevent—ever more complex problems is defined largely by its level of process
maturity.
An Example of Immaturity
It is perhaps best to explain this concept of immaturity through the use of an
example. Consider an organization that completely lacks any documentation
of its internal systems. Such an organization is also likely to lack formal
change control processes by which others are notified about changes to those
systems. In such an organization, a system can be configured and later
reconfigured at the whim of a single administrator. If an administrator or
developer finds a problem on the system, they resolve the problem as they
see it, notify no one, and continue about their day.
At first blush, the rate at which problems can be identified and resolved in such an
environment can seem extremely beneficial. Administrators or developers
who find issues can quickly resolve them as they see them, without the need
for complex and time-consuming paperwork, workflow, approvals, and
documentation. Such an organization can run exceptionally “lean and mean”
with their infrastructure, as the overhead associated with the process itself is
nonexistent.
As organizations move from one stage to the next, they will find more documentation of
processes with less replication of work, greater and more advanced levels of configuration
control, different incentives for determining what is considered success, greater maturity
in monitoring, and the implementation of toolsets that enable richer planning and more
effective budgeting. With Figure 2.6 in mind, let’s take a more detailed look at the phases,
how organizations behave, and what benefits they get from each.
Survival
In the Survival stage, IT organizations tend to focus on the use of native or freeware tools
for managing their infrastructure. They are constantly putting out fires within technology
they don’t understand. Monitoring and management elements are not in place, which
generally means that IT is notified about problems when users call to complain. IT
organizations in this phase tend to lack the basic understanding of the systems they
manage, let alone the deep understanding necessary to do well with an APM solution. Due
to the break/fix approach to problems, the rare APM implementation here often goes
unused, as no time exists to actually employ its capabilities.
Awareness
Gartner defines Level 1, Awareness as “Realization that infrastructure and operations are
critical to the business; beginning to take actions (in people/organization, process and
technologies) to gain operational control and visibility.” While the Survival stage is typified
by simply making it through from day to day, many organizations eventually develop the
cultural enlightenment that “there must be a better way”. This awareness is manifested
through a realization that their IT infrastructure and its operations are a function of the
business. They may further realize that their organization will need to take action to
formalize processes, standardize on technologies, and control the people and culture of IT
if they wish to mature.
This Awareness phase can in many ways be considered a bridge between
the fully chaotic activities of the Survival phase and the beginnings of structure in the
Committed phase. Here, processes and technologies still remain ad hoc; however, the
culture surrounding IT operations recognizes and begins to embrace the need for better
ways of accomplishing its daily tasks.
Committed
Gartner defines Level 2, Committed as “Moving to a managed environment, for example,
for day-to-day IT support processes and improved success in project management to
become more customer-centric and increase customer satisfaction.” At some point, IT
organizations and the processes that bind them eventually begin to grow the very basics of
structure. In the Committed stage, organizations begin to actually implement tools for
assisting them with the management workload. Problem resolution in this phase is still
accomplished through a break/fix mentality; however, the level of consideration for
environment-wide solutions begins to grow beyond zero.
In the Committed stage, simplistic problem management solutions such as work order
tracking systems may be incorporated, yet the specifics of their use are often not enforced
through an agreed-upon set of rules. Work order tracking systems here are used for
individual technician workflow, not necessarily for the tracking of configuration changes.
Monitoring may also be implemented in this stage, but it is limited to the core availability
of the device itself.
In this stage, APM solutions will not necessarily drive a direct benefit to application
performance, as performance is not yet valued over simple availability. Environments here
are still focused on managing the inflow of problems as they come in, and as such, don’t
have the time to actually focus on analytics and problem prevention. Smart organizations
can leverage the implementation of more basic APM integrations during this period as a
mechanism to quickly drive the organization to a higher level of maturity. Such an
implementation in this stage will require the corresponding process and workflow
necessary to turn APM data into a useful product.
Proactive
Gartner defines Level 3, Proactive as “Gaining efficiencies and service quality through
standardization, policy development, governance structures and implementation of
proactive, cross-departmental processes, such as change and release management.” Once
an IT organization’s culture makes the conscious decision to move away from firefighting
as a way of life, it can be considered on the path to the Proactive stage. It can be argued that
most IT organizations today exist somewhere between the Committed and the Proactive
stages, with varying levels of process and workflow in place.
A major determinant between these two stages is related to the number of individuals who
have successfully removed themselves from the direct resolution of problems. These
individuals’ time is freed up for looking at rational, automated, and environment-wide
solutions that prevent problems before they impact the user population. Here, the proper
levels of monitoring are likely in place to validate more than simple up/down availability.
Usage trends are monitored and analyzed, with thresholds for alerts in place to notify
responsible individuals. Automation tools are additionally used to enable repeatable
actions to occur on systems when conditions occur. Automated remediation capabilities
may be introduced in this stage as well. Found also in this stage are mature processes for
problem management as well as asset, change, and configuration control.
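A minimal sketch of this kind of threshold-driven alerting, mapped to repeatable remediation actions, might look like the following. The metric names, limits, and actions are all hypothetical, chosen only to illustrate the Proactive-stage pattern:

```python
# Hypothetical sketch of Proactive-stage alerting: compare each metric
# against an agreed threshold and return the remediation action to run.

THRESHOLDS = {"cpu_pct": 85, "queue_depth": 500, "disk_free_pct": 10}
ACTIONS = {"cpu_pct": "scale-out", "queue_depth": "throttle-intake",
           "disk_free_pct": "purge-logs"}

def evaluate(metrics):
    """Return (metric, action) pairs for every threshold breached."""
    alerts = []
    for name, value in metrics.items():
        limit = THRESHOLDS.get(name)
        if limit is None:
            continue  # unmonitored metric
        # disk free space breaches when it falls BELOW its threshold
        breached = value < limit if name == "disk_free_pct" else value > limit
        if breached:
            alerts.append((name, ACTIONS[name]))
    return alerts

print(evaluate({"cpu_pct": 92, "queue_depth": 120, "disk_free_pct": 40}))
# only CPU breaches, so only the scale-out action fires
```

The important point is not the code itself but that the thresholds and actions are agreed upon in advance, which is exactly the process maturity this stage demands.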
For organizations here, the implementation of an APM solution can arguably have the
greatest benefit to their business. Once fully in this stage, the IT organization understands
the wholesale changeover from the “break/fix” to the “keep it running” mentality. Lacking
in this stage are the real linkages between individual devices and components of the
greater system. As such, a system-wide view of applications and business services is still
lacking in maturity. Implementing APM here can quickly move an organization to the next
level of maturity.
Service-Aligned
Gartner defines Level 4, Service-Aligned as “Managing IT like a business; customer-
focused; proven, competitive and trusted IT service provider.” A major determinant
between the Proactive stage and the Service-Aligned stage is related to the organization’s
primary focus. When an organization continues its focus on individual technologies as
opposed to how those technologies integrate into a deliverable whole, that organization
remains in the Proactive stage. There, they are proactively resolving problems, but they are
still focusing on the problems and problem prevention. When that organization leaps
towards managing the services they deliver to the business in whole, they have successfully
arrived in the Service-Aligned stage. In this phase, you’ll often see IT delivering their own
customized services to the business with unique names and focuses rather than merely
referring to product names they acquire from vendors.
In this stage, an APM solution—or one that functionally resembles it—is likely in place.
Solutions like APM are necessary in order to gain the situational awareness IT needs to
best manage its environment as an overarching system. At the same time, IT finds itself
using that system with the goal of recurring improvement, looking for and resolving
non-optimized areas before users are impacted.
Business-Partnership
Gartner defines Level 5, Business Partnership as “Trusted partner to the business for
increasing the value and competitiveness of business processes, as well as the business as a
whole.” Once IT fully loses its identity as a separate function of the business, it can be
considered a partner with that business as opposed to merely servicing its interests. In the
Business-Partnership stage, IT metrics are business metrics, and vice versa. The role of IT
is as enabler for business processes, and as such, those processes are not considered
without IT as a primary stakeholder. IT is also a co-equal in business planning, as new
endeavors invariably include a technology component.
For an example of this, take a look at Figure 2.7. There you’ll see an example visualization
from an APM solution. The information in the figure displays the expected response time
for a specific Web service call, broken down among the amount of time consumed by the
client, network, and server components of the request.
Completing a request of this type will require an amount of time from each of these three
elements of the IT environment:
• A client will need to process the request internally, preparing it for transmission on
the network.
• The network will need to transfer the request from the client to the server, over any
number of hops, with each transfer and hop incurring an additional time cost.
• The server itself will need time to ingest, process, and prepare a response to the
request.
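The three-way split described above can be sketched from timestamps captured at each boundary of the request. The millisecond values below are invented, and client-side processing of the response is ignored for simplicity:

```python
# Hypothetical sketch of Figure 2.7's decomposition: given timestamps
# captured at each boundary, compute how much of the total response
# time the client, network, and server each consumed.

def decompose(t_client_start, t_sent, t_server_recv, t_server_done,
              t_client_recv):
    """Split total request time into client, network, and server parts."""
    client = t_sent - t_client_start                      # request prep
    network = (t_server_recv - t_sent) \
        + (t_client_recv - t_server_done)                 # both directions
    server = t_server_done - t_server_recv                # processing
    return {"client_ms": client, "network_ms": network, "server_ms": server}

parts = decompose(0, 40, 95, 295, 360)
print(parts)  # {'client_ms': 40, 'network_ms': 120, 'server_ms': 200}
```

In this invented example, the server consumes over half the total time, immediately pointing the investigation away from the network team.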
Intrinsic to this request are a number of variables that require an IT organization with a
high level of maturity if the information is to be of value. To gain the greatest amount of
value from this information, such an organization must have:
• A mature level of configuration control such that the exact configuration of the
environment is known.
• A mature level of process control such that the interfaces between client and server
components can be traced to known threads.
• A mature level of environment control such that additional environment behaviors,
such as network bandwidth and latency as well as network-consuming external
forces can be ruled out.
Secondly, and arguably more importantly, smart organizations can leverage an APM
solution itself to rapidly develop process maturity in an otherwise immature organization. By
reorganizing your IT operations around a data-driven approach with comprehensive
monitoring integrations, you will find that you quickly begin making IT decisions based on
their impact to your business’ applications. You will better plan for augmentations based
on actual data rather than the contrived anticipation of need. You will better budget your
available resources based on actual responses you get out of your existing systems.
In the end, leveraging an APM solution for your business services and applications will
make you a better IT organization.
• The ways IT looks at itself. In earlier stages of maturity, IT sees itself as a fully-
segregated entity from the business. In many cases, IT can see itself as a different
business entirely! Individuals in IT find themselves concerned with the daily
processing of the servers and the network, to the exclusion of the data that passes
through those systems. As IT matures, the natural culture of IT is to begin thinking
of itself as a partner of the business, and ultimately as the business itself.
• The ways IT looks at data & applications. Data and applications in the immature
IT organization are its bread and butter. These are the elements that make up the
infrastructure, and are worked on as individual and atomic elements. IT in earlier
stages will find itself leveraging manual activities and shunning automation out of
distrust for how it interacts with system components. Applications in early-stage IT
are most often those that can be purchased off the shelf, with customization often very
limited or non-existent. Later-stage IT organizations needn’t necessarily build their
own applications; however, they do see applications as solutions for solving
business processes as opposed to fitting the process around the available
application.
Consider again the situation in Figure 2.7. A complex performance issue in a business
application can occur across client, server, and network components in the environment.
The client can experience delay due to other processing or issues with the underlying client
infrastructure. The network can be oversubscribed with traffic, or client network
requirements can be greater than existing network components can handle. Servers can be
non-optimized in their processing or simply be overloaded.
Independent point solutions generally monitor only one of these three components at a
time, making the consolidation of data across separate systems with separate databases,
consoles, and formats extremely difficult. The graphic in Figure 2.7, which breaks
down such an issue by its impact in each problem domain, presents a way to quickly
identify the location of the problem.
Focusing further into that graphic presents the new picture that is Figure 2.8. This image
shows how such a graphic might be constructed through monitoring integrations installed
to network devices, servers, and even to the clients themselves. The result is a holistic
picture of transaction time itself, broken into its disparate elements. For more information,
drill-down capabilities intrinsic to the interface provide a way to discover more details
about each portion as necessary to resolve the situation.
Figure 2.8: Transaction timing occurs across client, network, and server components.
This alignment happens along a number of technology axes. Alignment enables IT to better
scope projects for greater success, de-emphasizing projects or technologies that enhance IT
but stand in the way of business workflow. Tying business impact to IT projects ensures that
those projects are visible to business leaders. Such visibility enables those leaders to be a
greater stakeholder in IT projects, further ensuring that their incorporation makes sense
for the future. Lastly, alignment provides a way to convert a reactive IT organization to a
proactive one.
It has been said that 70% of the average IT budget is earmarked for existing projects
(Source: Budgeting for Information Technology,
http://www.501cio.com/articles/200709_ITBudgeting.html), leaving only 30% for new
projects on an annual basis. For the new projects, roughly 60% fail to meet their original
goals or schedule (Source: Two Reasons Why IT Projects Continue to Fail,
http://advice.cio.com/remi/two_reasons_why_it_projects_continue_to_fail). Primary
reasons for failure include cost overruns, missing schedule goals, and end solutions that are
“riddled with defects and don’t accomplish the business goals for which they were
designed.”
A major source of the problem occurs when IT isn’t capable of scoping projects in a way
that makes sense for the rest of the business. This scoping problem can relate to:
At the same time, you’ll find that APM’s expanded situational awareness enables the smart
IT organization to become more business-aligned. Nowhere is this more pronounced than
in businesses that rely heavily or exclusively on their technology infrastructures. E-
commerce businesses are particularly impacted. This is the case because in businesses like
e-commerce, the technology is the business. As such, having that enhanced vision into your
business’ technology underpinnings means knowing your customers—and your
storefront—that much better.
Chapter 3 will continue this introduction with a more technical look at the underpinnings that
comprise APM’s monitoring integrations. You’ll learn the history of
monitoring as well as how it has evolved over time to become what it is today.
Chapter 3 will discuss how multiple levels and types of monitoring are necessary to gain
that holistic awareness you want out of an APM solution.
Chapter 3
Then it happens, always just as the rest of life is smooth-sailing and work is the last thing on
the mind: BEEP, BEEP.
“Uh-oh, there goes your vacation-ending device,” says a friend to John Brown, IT manager for
TicketsRus.com. John looks down to read the text now displayed on his pager, “I can see it in
your face. Something’s down, you probably don’t know what it is, and the only way to figure it
out is to set down that cold one and march right into work.”
John shakes his head at the pager, “This thing is killing me. Since we installed the new
monitoring system, I swear I’m getting pages like this every couple of days. Half the time it’s
nothing. The other half the time it’s something completely different than what shows up on
this stupid thing. You know, monitoring is great, but this kind of monitoring is taking my life
away from me.”
“Yep. Got to. If this is correct, the problem could be a big one, and you know what happens
when fans can’t get their tickets...”
John’s friend jokes, “We don’t want that to happen! I still remember that day when I and
everyone else couldn’t get tickets to the big game through your site. That problem was so bad,
it made the news!”
"Don't remind me," grumbles John, remembering that painful event. A bug in the code between TicketsRus.com's inventory and e-commerce subsystems decided to rear its ugly head just as tickets were released for the Finals. The bug, which for some reason only caused problems at high loads and for certain types of events, had been introduced earlier in the year with a software update. Because user loads had been light in the months that followed, it took literally days to track down the error. TicketsRus.com, this team's sole source for Finals tickets, was criticized by its suppliers and even the press. It nearly lost a major source of income as a result.
To rectify the problem, John's boss Dan mandated that a monitoring system be put into place. Since then, John has come to regret his selection: an inexpensive but limited solution that delivered alerts on server outages and not much else. The net result is that John's nights got a lot more sleepless, and an e-commerce system that "felt" fine before now alerted him and his teams on a near-constant basis.
John tells his friend as he heads for his car, "I've gotta run. Take care of the burgers for me. Who knows when I'll be back…"
---
This information is excellent for knowing when something isn't right with your IT infrastructure. It tells you that a problem exists. But it is limited to answering the question, "What happened?" Knowing that a particular server, service, or device appears down is one thing. Understanding exactly why it went down is quite another.
That's not to say that reactive monitoring isn't useful in an IT environment. In fact, nothing could be further from the truth. Consider the story that opened this chapter: Without some form of monitoring in place, John would never have known that something was amiss in his data center. Had that monitoring not been in place, a small problem like the Memorial Day incident could easily have turned into a big one. A simple outage could have gone unnoticed for minutes or hours while TicketsRus.com's customers were unable to purchase the products they needed at the time they needed them.
In effect, added success begets added due diligence. As in the case of the Finals mentioned earlier, as your suppliers and customers rely on you for greater things, they expect greater things as well. Application Performance Management (APM) and its advanced monitoring integrations enable you to provide that greater level of service.
To that end, IT has seen a similar evolution in the approaches used for monitoring its
infrastructure. IT’s early efforts towards understanding its systems’ “under the covers”
behaviors have evolved in many ways similar to Gartner’s depiction of organizational
maturity. Early attempts were exceptionally coarse in the data they provided, with each
new approach involving richer integrations at deeper levels within the system.
IT organizations that manage complex and customer-facing systems are held to a greater level of due diligence than those that manage a simple infrastructure. As such, the tools used to watch those systems must also meet a higher standard. As monitoring
technologies have evolved over time, new approaches have been developed that extend the
reach of monitoring, enhance data resolution, and enable rich visualizations to assist
administrative and troubleshooting teams. This chapter discusses how this evolution has
occurred and where monitoring is today. As you’ll find, APM aggregates the lessons learned
from each previous generation to create a unified system that leverages every approach
simultaneously.
Figure 3.1: Early network monitoring was singularly concerned with system and device availability.
This solution works well for low-criticality environments because it is elegantly simple. If I ping the server every 2 minutes, I'll know that the server has gone down no more than 2 minutes after the outage occurs. Implementing basic availability monitoring is a key step for organizations that want to move from Gartner's Survival phase ("little to no focus on IT infrastructure and operations") to its Committed phase ("moving to a managed environment, for example, for day-to-day IT support processes and improved success in project management to become more customer-centric and increase customer satisfaction").
But basic availability metrics can only go so far. A server experiencing a processor spike may be wholly incapable of processing useful data, yet responding to an ICMP "ping" request is handled at such a low level in the network stack that it requires virtually no processing power. Thus, even an unhealthy server can usually respond to a ping successfully. As such, basic availability metrics generally cannot identify when a server is not down but merely hung.
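The detection-delay bound described above ("no more than one polling interval after the outage") can be sketched in a few lines. This is a toy poller, not any product's implementation; the `probe` callable stands in for a real ICMP ping.

```python
import time
from typing import Callable, Optional

def detect_outage(probe: Callable[[], bool], interval_s: float,
                  max_polls: int) -> Optional[int]:
    """Poll `probe` every `interval_s` seconds.

    Returns the poll number at which the first failure was seen, or
    None if every poll succeeded. Worst-case detection delay is one
    full interval: an outage beginning just after a successful poll
    is not noticed until the next poll fires.
    """
    for poll in range(1, max_polls + 1):
        if not probe():
            return poll          # server missed a "ping"; raise the alarm
        time.sleep(interval_s)
    return None
```

Note that, per the caveat above, a truthful `probe` only proves reachability: a hung server can still answer.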
Figure 3.2: With SNMP, any SNMP-aware information can be gathered through a
request/response interaction.
With SNMP, a framework was established for requesting and receiving detailed information from networked devices. That framework operates on a request/response basis, with a network manager requesting information from an onboard SNMP agent through a network call (see Figure 3.2). The network manager identifies the information requested by its Management Information Base (MIB) Object Identifier (OID). The OID is a unique identifier for the specific piece of information stored by the agent. For example, in Figure 3.2, the network management server is attempting to GET the information located at OID cpsModuleModel.3562.3. Although the OID itself is globally unique across all devices, its contents can be almost anything:
• Network statistics
• Device configuration information
• Sensor information
• System or device performance metrics
It is the job of the network manager, as configured by an administrator, to determine which information is interesting and should be polled. That information is stored within the network management system's database for later review. The network management system can be configured to alert administrators when received information indicates inappropriate behavior. Similarly, clients can notify the network management system unilaterally through an SNMP trap when special conditions occur that require more immediate attention.
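The poll-and-trap pattern can be modeled in miniature. The following is a toy, in-process sketch of the manager/agent exchange, not a real SNMP stack (production code would use a library such as Net-SNMP or pysnmp). The sysUpTime and linkDown OIDs used below are standard; the classes and everything else are invented for illustration.

```python
class FakeSnmpAgent:
    """Toy stand-in for a device's SNMP agent: the MIB is modeled as a
    dict mapping OID strings to values."""
    def __init__(self, mib, trap_sink=None):
        self.mib = mib
        self.trap_sink = trap_sink  # manager callback for unsolicited traps

    def get(self, oid):
        # SNMP GET: request/response lookup of a single OID
        return self.mib.get(oid)

    def send_trap(self, oid, value):
        # SNMP trap: the agent notifies the manager without being polled
        if self.trap_sink:
            self.trap_sink(oid, value)

class Manager:
    """Toy network manager: polls agents and records traps."""
    def __init__(self):
        self.history = []   # polled samples stored for later review
        self.alerts = []    # traps received from agents

    def poll(self, agent, oids):
        for oid in oids:
            self.history.append((oid, agent.get(oid)))

    def on_trap(self, oid, value):
        self.alerts.append((oid, value))
```

The asymmetry the chapter describes is visible here: `poll` is driven entirely by the manager's schedule, while `send_trap` lets the device interrupt the manager immediately.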
There are numerous reasons for SNMP's limited scope and reach. In order for an SNMP-enabled network manager to gather information from any element on the network, that element must have SNMP awareness. Thus, every device, operating system (OS), application, and service must internally convert its own instrumentation data into the format that SNMP understands. Also problematic is the poll-based nature of SNMP: the network management server polls devices for their information on a regular basis. This can create a bottleneck as the number of monitored devices grows, and it limits the resolution of the data. Finally, until only recently, SNMP lacked key security features.
Figure 3.3: Agent-based solutions leverage on-board clients to gather and transfer
monitoring information to centralized monitoring servers.
As with SNMP solutions, agent-based solutions are in widespread use today. For servers,
services, and applications, these solutions enjoy benefits over and above SNMP-based
solutions because their information does not need translation into a format that is
understood by SNMP. For example, if an Oracle database decides to store performance
information one way, while a Siebel installation elects another method, both can be easily
encoded into the agent software. The agent can collect this information irrespective of
original vendor, source, or format, and translate it into a format that is useable by the
central monitoring solution.
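That translation step is essentially the adapter pattern: one normalizer per vendor format, one common record for the central server. The sample formats below are invented for the example and do not reflect Oracle's or Siebel's actual instrumentation.

```python
def normalize_oracle(sample):
    # Hypothetical Oracle-style sample: {"metric_name": ..., "value": ..., "ts": ...}
    return {"source": "oracle", "metric": sample["metric_name"],
            "value": float(sample["value"]), "timestamp": sample["ts"]}

def normalize_siebel(sample):
    # Hypothetical Siebel-style sample: a (timestamp, name, value) tuple
    ts, name, value = sample
    return {"source": "siebel", "metric": name,
            "value": float(value), "timestamp": ts}

def collect(samples_by_source, adapters):
    """Agent logic: run every vendor sample through its adapter so the
    central monitoring server sees one common record format."""
    records = []
    for source, samples in samples_by_source.items():
        records.extend(adapters[source](s) for s in samples)
    return records
```

Whatever shape each product emits, the central solution only ever sees the one normalized record.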
Agent-based solutions also enable a much greater resolution of data, allowing monitoring to scale with the needs of high-performance and high-criticality systems. The result is a high degree of data resolution for metrics on the individual system itself. Figure 3.4 shows an example of a graph created from such data, charting the health of a server and an installed database over time. As later chapters will discuss, visualizations like this roll up low-level metrics to provide a high-level understanding of a component's health and service quality.
However, the strengths of the agent-based approach are also the source of its primary weakness. A naturally system-centric solution, it looks only at information on the system itself, from that system's perspective. Relating instrumentation data across multiple systems wasn't natively possible with an agent-based approach unless that correlation was done at the central monitoring solution, which proved problematic. For many environments, the limitations of the agent-based approach grew more obvious as the number of hardware instances required to build an application or service grew. Most specifically, using only an agent-based, server-centric approach, it was impossible to visualize impacts coming from the rest of the environment.
Relating this situation to a business application scenario, let’s take another look at the
simplistic system discussed back in Chapter 1. In that system, shown again in Figure 3.5, a
number of elements integrate to provide a service for end users. However, in this second
scenario, a network-based tape backup device is also on the network. Due to a
misconfiguration by an administrator, a large backup has been initiated against the
mainframe during the workday.
Figure 3.5: The actions of an unrelated device can cause performance problems on the monitored system.
Large-scale tape backups are usually scheduled to occur during times of low processing requirements. One reason is that the backup process can require an incredible amount of network bandwidth if not properly tuned or segregated to alternative networks.
In this case, the entire customer-facing system is affected by a mistake made on a
completely unrelated device. Information gathered by agents on each of the devices cannot
easily show that a problem is occurring. Yet, the net result is a substantial reduction in
performance across the entire system.
It is for situations like this that an agentless approach is additionally necessary. Figure 3.6
shows an example report from an agentless monitoring solution. This report shows that the
primary consumer of network bandwidth is related to VPN traffic. The results from this
report can and should be cross-referenced with information from agent-based
visualizations to get a better situational awareness of network conditions.
Figure 3.6: An agentless monitoring approach reports on aggregate traffic across the
environment.
Although a mistake like this tape backup is unlikely to happen in a well-managed production network, other environment behaviors can and do have an impact on application performance. Perhaps an e-commerce system experiences a flood of requests, limiting the data it can successfully process in a particular period of time. Or the overuse of a separate and unrelated system on a shared network impacts the performance of a customer-facing application. Only through the simultaneous use of both agent-based and agentless monitoring can an IT organization gain a complete understanding of its environment's behaviors.
Indirect network integrations operate in a much different way. Rather than installing devices directly on the network, physically bridging connections between devices, indirect network integrations interface with specially-enabled network devices to gather their statistics directly. Common protocols such as NetFlow, J-Flow, sFlow, and IPFIX function across different network components to gather flow-based network traffic information from devices.
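Once flow records are exported, much of the collector's work is aggregation. A minimal sketch, assuming flow records already reduced to (source, destination, byte-count) tuples rather than any real NetFlow/IPFIX wire format:

```python
from collections import defaultdict

def top_talkers(flows, n=3):
    """Sum bytes by source address and return the n heaviest senders."""
    totals = defaultdict(int)
    for src, _dst, nbytes in flows:
        totals[src] += nbytes
    # Sort descending by total bytes sent
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

A report like Figure 3.6's "primary consumer of bandwidth" view is, at heart, this computation performed over many more fields.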
Figure 3.7: Network probes and on-device monitoring protocols such as NetFlow enable agentless monitoring across the entire network.
The key difference between the direct and indirect methods is whether an actual probe device must be physically installed between connections. Obviously, the installation and maintenance of physical probes adds an administrative burden. However, probes can be installed virtually anywhere, making them highly flexible in heterogeneous networks. Indirect integrations, by contrast, are easier to administer, but devices must natively support them, so not all areas of the network may be accessible. Figure 3.7 shows an example of both types.
Transaction-Based Monitoring
And yet even with these two types of monitoring integrations in place, mature
environments still found themselves lacking in the depth of visibility into applications.
Although agent-based monitoring provides information about individual systems and
agentless monitoring fills out the picture with network statistics, a much deeper level of
understanding is still necessary.
That "deeper level of understanding" arrives with a type of monitoring that digs past aggregate network statistics to peer into the individual transactions between elements of a system. By looking at the individual transactions that occur between system elements, it is possible to identify where code performance or inter-server communication is at fault.
To give you an idea of how transactions work, think about the last time you clicked a link for an image in your favorite browser. Clicking that link directed your browser to request the download of the image. In a few seconds or less, that image was rendered in your browser for you to view.
Figure 3.8: Multiple conversations (“transactions”) must occur for a single image to
be downloaded from server to browser.
But what goes on in the background when such a request is made? What kinds of
conversations are required between your local computer and the remote server for that
image to successfully make its way across the Internet to your laptop? In actuality, the
conversation between client and server can be amazingly complex. Figure 3.8 shows an
example of the communications that must occur for the image euromap.gif to be
successfully downloaded off the Internet. Requests, replies, and acknowledgements are all
required steps for what seems simple on the surface.
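Timing each step of such a conversation is what lets transaction monitoring assign blame. A minimal sketch: given ordered timestamps for each completed phase (the phase names here are illustrative, not from any product), compute how long each phase took.

```python
def phase_durations(events):
    """events: ordered list of (phase_name, timestamp_ms) pairs marking
    when each step of the transaction completed. Returns the duration
    of each phase, i.e. the gap since the previous event."""
    durations = {}
    prev_ts = events[0][1]
    for name, ts in events[1:]:
        durations[name] = ts - prev_ts
        prev_ts = ts
    return durations
```

A long `first_byte` gap points at the server; a long `download` gap points at the network, which is exactly the decomposition a transaction monitor performs for you.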
Figure 3.9: A Thread Analysis View provides a look at transaction details by time.
The information in a Thread Analysis View is gathered through the analysis of particular
types of traffic occurring between identified application components. In Figure 3.9, an
HTTP GET command is being analyzed along with backend SQL queries related to the
request. Identifying the source and destination of the transaction along with its payload
and description assists the troubleshooting administrator with deconstructing the
transaction into its disparate components. This process is similar in function to the analysis
of a network trace done with network management tools. However, unlike the network
trace’s focus on individual packets, here the analysis is elevated to the level of the
transaction.
Figure 3.10: A Conversation Map illuminates which components are talking with
whom.
Elevating the analysis even further creates a Conversation Map view. This high-level view
assists the administrator with a look at the components involved in a transaction as well as
characteristics about the communication. This information is useful for identifying which
participants might be the cause of a performance issue or other problem.
These graphics are obviously only a small portion of the visualizations that can be created through an effective APM solution. Chapter 7 will focus exclusively on visualizations like these, while Chapter 8 will walk through a specific troubleshooting scenario in an extended example.
Figure 3.11: Product-specific hooks provide deep insight into their behaviors.
Let's re-imagine the simplistic environment once again, this time attaching some well-known products to the otherwise generalized terms "Web Server," "Application Server," and so on. Figure 3.11 shows this environment with a Cisco firewall, an Apache Web server running custom Java code, an Oracle database, and Siebel middleware, all connecting back to a zSeries mainframe. Although the actual product names are unimportant for this discussion, the fact that shrink-wrapped products are components of this environment is.
APM's transaction-level monitoring enables you to peer into the individual conversations that occur between servers and applications in your environment. Yet today's enterprise applications also support pluggable mechanisms for gathering instrumentation data directly from the application itself. This instrumentation data can provide additional insight into the inner workings of the applications in your infrastructure.
Consider a situation in which a custom-built Web site is created atop an Apache Web engine and built in part using the Java language. In this case, determining the inner performance characteristics of the Web server and language might be best served by querying Apache's and Java's internal metrics frameworks directly. This internal information can be merged with transaction statistics to gain an understanding of where processing delay is occurring: at the client, on the server, or within the network.
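Merging internal and transaction-level timings can be as simple as subtraction. The sketch below assumes three inputs you might plausibly have (a client-observed total, the server's self-reported processing time, and a measured network round trip); the function name and the attribution model are invented for illustration.

```python
def attribute_delay(client_total_ms, server_internal_ms, network_rtt_ms):
    """Split a client-observed response time into server processing,
    network transit, and a residual attributed to the client side
    (rendering, local I/O, and so on)."""
    residual = client_total_ms - server_internal_ms - network_rtt_ms
    return {"server": server_internal_ms,
            "network": network_rtt_ms,
            "client": max(residual, 0)}
```

The point is not the arithmetic but the inputs: without the application's internal metrics, the "server" term is invisible and the whole delay looks like network time.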
It can be argued that End User Experience (EUE) monitoring encompasses some of today’s
most advanced monitoring technologies. It is, after all, one of the most recent of the
available enterprise monitoring approaches. EUE monitoring provides its value by
measuring the performance of the application from the perspective of its ultimate
customer—the end user.
EUE functions in three very different but very important ways. The first is client-based monitoring at the user's own location, accomplished by installing an agent at the end user's location or through the use of special probes. By co-locating an agent with the end user, that agent can monitor the known behaviors of your system and look for situations in which the performance delivered to the user has degraded past acceptable thresholds. With monitoring agents at the client itself, the individual transactions associated with end user behaviors can be mapped out and timed to validate that your end users are experiencing the right level of service.
A second way to measure the end user performance of your applications is through the use
of automated “robots.” Also located in areas where end users make use of your application,
these robots run a set of predefined scripts against your application. These scripts leverage
synthetic and actual transactions that are very similar to the types of actions a typical user
would perform against the system. For example, if users click through a Web site, attempt
to add items to a shopping cart, and ultimately check out, these types of actions should be
simulated by the automation robot.
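The robot's core loop is straightforward: run each scripted step, time it, and flag any step that exceeds its service threshold. In this sketch the step actions are no-op placeholders, not a real browser driver.

```python
import time

def run_script(steps, threshold_s):
    """steps: list of (name, action) pairs run in order, like a scripted
    browse / add-to-cart / checkout sequence. Returns a (name, elapsed
    seconds, too_slow) tuple for each step."""
    results = []
    for name, action in steps:
        start = time.perf_counter()
        action()                     # the synthetic user action
        elapsed = time.perf_counter() - start
        results.append((name, elapsed, elapsed > threshold_s))
    return results
```

Because the script and its expected timings are fixed, any drift in the per-step numbers from run to run is itself the signal that service quality has diminished.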
Figure 3.12: EUE integrations can be co-located with the users themselves or run
through robots for consistent automation.
Since these actions and their scripts are well-defined, the timing associated with their
processing is also a known quantity. As the robot runs the same scripts over and over, it is
possible to quickly determine when service quality diminishes for the end user. EUE is a
powerful component of APM monitoring, one that is arguably its greatest value proposition
for your business applications.
The third way to measure the end user experience involves physical probe devices that reside in-line with network endpoints. These special hardware units are particularly useful in situations where metrics cannot be gathered directly from other network components, whether due to ownership or other reasons. Chapter 4 will discuss this third method, and probes in general, in more detail.
Incorporating EUE monitoring into the enterprise WAN may require the installation of agents or robots across many or all of the individual sites within the WAN. By installing these components at multiple locations in the environment, the end users' perspective can be measured based on geographic location and any network behaviors experienced in that locale.
For example, consider a business application homed in Denver, Colorado but used by employees throughout the United States, EMEA, and Australia. That application might function with no problems for the users in the United States: the low latency of their connections might mean that even low-bandwidth links support the application's use with no issue to the end user.
However, EMEA and Australia users might see a different situation entirely. Due to the realities of physics, even the fastest connection between the United States and other continents adds a significant quantity of network latency. If your business application is latency intolerant but bandwidth insensitive, the users in those locations could be on the fastest connection possible and still see a low-quality experience with your application.
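To make that concrete with invented but plausible numbers (roughly 40 ms domestic vs. 200 ms trans-Pacific round trips, and a "chatty" operation needing 25 sequential round trips):

```python
def chatty_operation_time(round_trips, rtt_ms, payload_bytes, bandwidth_bps):
    """Total time = serialization time + one RTT per sequential round trip.
    For a latency-bound ("chatty") protocol the RTT term dominates."""
    transfer_s = payload_bytes * 8 / bandwidth_bps
    latency_s = round_trips * rtt_ms / 1000.0
    return transfer_s + latency_s

# Same 100 KB operation over the same 100 Mbps link, different locations:
us_user = chatty_operation_time(25, 40, 100_000, 100_000_000)   # ~1.0 s
au_user = chatty_operation_time(25, 200, 100_000, 100_000_000)  # ~5.0 s
```

The Australian user waits roughly five times longer despite identical bandwidth, which is exactly the latency-intolerant, bandwidth-insensitive behavior described above.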
In short, EUE’s ability to quantify the user’s experience is crucial to maintaining your
customer satisfaction.
With monitoring's potential in mind, the next topic is how it can be implemented in your existing network environment. Chapter 4 discusses the processes and practices associated with integrating APM into your existing application infrastructure. Tiering its monitoring across users, applications, databases, and mainframes requires multiple integrations, and Chapter 4 will assist you with understanding how to accomplish those tasks correctly.
Chapter 4
“Yep. That’s the plan. Point everything back to 192.168.0.55. That’s the server we’ve set up to
collect all this information.”
The network engineer sits back in his chair, patently annoyed with the request, “You realize
how much effort this will involve, connecting to every network device all across the network to
do this one change. When do you need this done?”
“How about…now…?” responds John Brown, IT manager for TicketsRus.com. Now it’s John’s
turn to get annoyed at how this conversation is going. He’ll admit that this is a big request,
but that’s why he pays his network engineer a salary, to make these sorts of things happen.
John thinks a bit as he sits across the desk from his network engineer. He’s been reading more
of late about Gartner’s concept of IT maturity, most especially since that last incident on
Memorial Day. In learning more about the differences between a reactive IT culture and one
that proactively solves problems, he’s come to realize how survival-oriented his organization
really is. That Memorial Day incident should have never happened…
“So tell me again about this ‘magical’ monitoring solution you’re dreaming up, John. You
know we’re already collecting MRTG statistics off all the network equipment. You look at the
same graphs I do. We know when we’re seeing problems with bandwidth or when our ISP isn’t
meeting their agreements,” continues the network engineer, obviously irritated about this
intrusion into his team’s traditional boundaries, “I hate to be blunt, but why should my team
care about this new technology?”
John fires back, “Because it’s this ‘new technology’ that’s going to save this company. You too
got called in for the Memorial Day incident. You remember how long it took us to track down
the solution.”
“True,” responds John, mentally noting this conversation for the engineer’s performance
appraisal coming up later this year, “but the extra 2 hours we spent sitting around the table
pointing fingers at each other didn’t get us up and running any faster. That’s what the APM
solution is for, finger-pointing prevention.”
That breaks the ice, giving them both a laugh at their situation. From the outside looking in, TicketsRus.com appears to be a single entity, providing a unified storefront for selling tickets to its customers. As a technology company, however, internal struggles between its teams are a constant reality.
John leaves the engineer’s office and reflects a bit on the conversation as well as all the other
similar conversations he’s had with server administrators, developers, and even fellow
members of management. This APM installation is more than just a technology insertion. Just
getting this solution installed has been a lesson in professional growth as much as clicking
Next, Next, Finish. John realizes that the sheer process of fitting his “magical” monitoring
solution into TicketsRus.com’s culture is in and of itself a maturing activity.
He just can’t wait to see what the thing will look like as a finished product.
That statement isn't written to scare any business away from a potential APM installation. Although a solution's installation will require the development of a project plan and coordination across multiple teams, the benefits gained go far toward assuring quality services to customers. Any APM solution requires the involvement of each of IT's traditional silos. Each technology domain (networks, servers, applications, clients, and mainframes) will have some involvement in the project. That involvement can span from installing APM agents on servers and clients, to configuring SNMP and/or NetFlow settings on network hardware, to integrating APM monitoring into off-the-shelf or homegrown applications.
In this chapter's story, the network engineer refers to John's APM solution as "magical," yet the situational awareness it produces couldn't be further from magic. Rather than a source of subjective mysticism, an APM solution enables a level of objective analysis heretofore unseen in traditional monitoring.
Figure 4.1: APM's integrations enable real-time and historical monitoring across a range of IT components, aggregating their data into a single location for analysis.
The realities of that objective data are best exemplified through APM’s mechanisms to
chart and plot its data. Figure 4.1 shows a sample of the types of simultaneous reports that
are possible when each component of an application infrastructure is consolidated beneath
an APM platform. In Figure 4.1, a set of statistics for a monitored application is provided
across a range of elements. Take a look at the varied ways in which that application’s
behaviors can be charted over the same period of time. Measuring performance over the
time period from 10:00 AM to 7:00 PM, these charts enable the reconstruction of that
application’s behaviors across each of its points of monitoring.
Chapter 8 will analyze graphs like this in dramatically more detail, running through a full use case associated with a problem's resolution. But, for now, let's take a look at some of the information that can be immediately gleaned from the types of visualizations seen in Figure 4.1.
• During and immediately prior to that same period—identified by the red vertical
bar in each graphic—you can quickly determine that the problem was preceded by a
spike in Web server processor use. This spike is found from the data in the upper-left
graph.
• During and immediately after the problem, a spike in Web server response time was
also experienced. This data can be distilled from the information presented in the
top-middle graph.
• Perhaps as a result of the problem, or as one of its root causes, simultaneous drops
were experienced in network link utilization (third row, middle), percentage of slow
transactions (second row, left), and accounting transactions (fourth row, left).
With this information in hand, one can theorize that a natural correlation between these
events has occurred. Some situation on the Web server appeared to cause the short spike in
the HTTP error rate. That spike in error rate caused a simultaneous drop in processing, one
that may have been noticeable by the application’s users. The net result is a behavioral
change on the part of the distributed application.
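That "these metrics moved together" reasoning can be made quantitative with a simple correlation coefficient. A from-scratch Pearson sketch over two invented metric series (the numbers are illustrative, not from Figure 4.1):

```python
def pearson(xs, ys):
    """Pearson correlation: +1 means the series rise and fall together,
    -1 means they move in opposition, 0 means no linear relationship."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Illustrative samples: CPU % and HTTP errors spike in the same interval
cpu_pct = [10, 12, 11, 90, 13, 11]
http_errors = [0, 1, 0, 40, 1, 0]
```

Correlation is evidence, not proof of causation, but a strong coefficient across independently gathered metrics is exactly what justifies the theory above.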
At this point, you should be thinking, “This analysis comes to a pretty obvious conclusion
about the problem’s culprit.” Yet that’s exactly the point that’s important to recognize: Your
initial assumption of a system problem is often not the problem itself but merely a symptom of
the problem.
This example started out by looking at a recognizable spike in the HTTP error rate. Yet
without a monitoring system in place to notify on this problem, your actual starting point
for a problem like this might instead be with the accounting group and a phone call to the
service desk. Perhaps that group noticed a short slowdown in their data processing.
Perhaps the network’s MRTG statistics showed a slight and unexplained dip in link
utilization. Maybe a user called in with a concern that the application “was working slow
today.”
In all these cases, quickly identifying the correlation between the root cause of a problem
and the down-level fallout from that problem is only possible using statistics across the
gamut of that application’s elements. To access this data, however, requires the
involvement of individuals across the IT organization. Implementing APM’s integrations
into infrastructure components, network devices, and applications requires concerted
effort. This chapter will discuss some ways in which those integrations are commonly
inserted into your existing technology infrastructure.
Consider for a moment the architectures required to build such systems today. Massive levels of redundancy are required, bringing higher availability but also adding higher complexity. Tracking down a system problem grows significantly more difficult when that problem could have occurred on any one of many servers in a cluster.
To give you an example of how a complex system like this might look, see Figure 4.2. This
example represents a large-scale system not unlike what you might expect out of an e-
commerce company like TicketsRus.com. This system contains multiple services and
interconnections, including some that are out of the direct control of local administrators.
In this system are a number of elements, each with a specific role to fill:
• In the first tier sits an externally-facing Web cluster that provides the front-end
servicing of business clients. That cluster handles the presentation load from
incoming clients, resting atop a second-tier Kerberos-based authentication system
for the processing of user logins and passwords. This cluster is the primary point of
entry for users to connect into the environment.
• Servicing the cluster is a set of second-tier systems. User data such as state and
shopping basket information is stored within a separate second-tier ERP system.
Inventory-processing functions have been offloaded onto a set of Java-based
application servers.
• Mainframe and order management systems make up the third tier. Here, product inventory information is stored, and the business logic associated with orders is processed. This includes functions such as inventory, management of user shopping baskets, suggestions for alternate and/or complementary products, and so on.
• A final system in this third tier, the 3rd Party Credit Card Proxy, manages the local
processing for credit card orders. The job of this proxy is to work with the external
credit card processing service, forwarding credit card information and receiving
approvals.
• In the fourth tier is the routing equipment necessary for external connections to
suppliers—used for real-time inventory updates and other supplier
communication—as well as the third-party credit card processing facility.
This system is obviously highly comprehensive in the services it provides for its users and
its business. It can display a list of inventory on a Web page. It can process user mouse
clicks and present alternate and complementary products to users based on their click
habits. It can aggregate user-desired products into shopping baskets, and process their
credit cards for the completion of purchases. It can even tie into supplier extranets for the
automated updating of product information in real time. Whether for a company like
TicketsRus.com or any company requiring a customer-facing e-commerce system, this
example architecture is designed to provide the necessary kinds of functionality.
This tiering of clients to Web servers, Web servers to application servers, and application
servers to databases is a common architecture in many of today’s complex systems.
Interconnecting these systems are networks, firewalls, and security devices that ensure the
secure connectivity of data. That data is stored onto centralized storage in multiple places.
It is these types of systems that make excellent starting points for an APM solution. As revenue drivers, such systems must remain highly available while at the same time providing their customers with an acceptable level of performance during their interactions.
This is critically important in the case of customer-facing solutions, because when your
system cannot perform to your customers’ demands, you may find yourself losing business.
All these monitors illuminate different behaviors associated with the greater system at
large, and all provide another set of data that fills out the picture you first saw in Figure
4.1’s charts and graphs. Now take a look at Figure 4.3, where some of these monitoring
integrations have been laid into place.
Using agents that are installed directly onto individual servers, it is possible to gather
metrics directly off those servers. Each server OS has its own mechanism for gathering and
reporting on server-specific performance characteristics: The Microsoft Windows OS uses
the Windows Management Instrumentation (WMI) service for gathering such information,
storing it in a special area of the Windows registry, and presenting it to external servers
through either external WMI queries or its WS-Management Web service. Event log data is
stored in proprietary logs and presented through similar interfaces. Linux and UNIX
servers leverage a combination of tools—vmstat, iostat, netstat, nfsstat, and others—for the
gathering and dissemination of data. Event log data can be gathered and distributed
through the Syslog daemon.
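Regardless of the OS-specific mechanism, the agent's job reduces to sampling metrics and shipping them to a central server. The following Python sketch is a hypothetical illustration of that loop—the class name, collector functions, and reporting structure are all assumptions for demonstration, not any particular APM product's API:

```python
import time
from typing import Callable, Dict, List

class MetricAgent:
    """A toy monitoring agent: samples named metrics and reports them centrally."""

    def __init__(self, collectors: Dict[str, Callable[[], float]]):
        self.collectors = collectors   # metric name -> sampling function
        self.reported: List[dict] = [] # stand-in for the central APM server

    def sample_once(self) -> None:
        # Gather every configured metric and "send" it upstream with a timestamp.
        now = time.time()
        for name, collect in self.collectors.items():
            self.reported.append({"metric": name, "value": collect(), "ts": now})

# A fake "% Processor Time" collector stands in here; a real agent would query
# WMI on Windows or parse vmstat/iostat output on Linux and UNIX.
agent = MetricAgent({"% Processor Time": lambda: 42.0})
agent.sample_once()
print(agent.reported[0]["metric"], agent.reported[0]["value"])
```

In a real deployment, the sampling function would be swapped per platform while the reporting loop stays the same—which is precisely why a single APM console can chart servers of different operating systems side by side.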
Figure 4.4: % Processor Time for a Web server, as gathered by an agent on the Web
server itself.
In all these cases, agents installed on each server are responsible for gathering the right kinds of data and reporting it back to central APM servers. The net result is the
creation of graphs similar to Figure 4.4, where % Processor Time for the Web server is
shown over a 24-hour period of time. In that graph, it should be apparent that the Web
server’s utilization grew dramatically beginning at 10:00 AM during the day of
measurement.
The process of actually installing agents to the correct servers in your environment is one
important facet of your APM installation and should be included as an action in your
project plans. Depending on the type of APM solution selected, that agent may require a
manual installation or may be automated through its console. Each agent added to a system consumes slightly more of that system's resources for management functions. To reduce the number of agents on a system, some APM solutions can leverage data from other monitoring solutions. Using this "monitor-of-monitors"
approach, the APM platform gathers its data instead from another in-place monitoring
solution. In the end, your selection of APM solution should consider its abilities to integrate
with the existing management platforms that may already be present in your environment.
For example, an environment’s Order Management System might run atop an Oracle
database. It is entirely possible that server-centric statistics such as % Processor Time
cannot discretely capture the internal behaviors of the Oracle database. Perhaps that
database is processing a large number of “bad” records that impede its ability to correctly
work with the good ones. Maybe the application’s individual queries are not correctly
optimized. In both of these situations, it is feasible that an Oracle-specific behavior doesn't directly manifest in server-centric metrics. Deeper integrations, built into the installed agent, are needed to query Oracle's native performance statistics for additional data.
Figure 4.5: Agent integrations into the Oracle database itself display more specific
data about the database’s behaviors.
Figure 4.5 shows a graph of information gathered from just that Oracle database. Here, the
same time period is shown as in Figure 4.4 but the gathered information instead shows the
percentage of “slow” transactions as defined by the administrator and experienced within
the Oracle database. As with Figure 4.4, the area of concern has been highlighted between
the red bars immediately after 10:00 AM.
In this second graph, it should be obvious that the increase in the rate of slow transactions is highly correlated with Figure 4.4's increase in processor utilization. The combination of
these two graphs illuminates a greater level of detail about how one application’s behavior
can have an impact on overall system performance. In this case, perhaps there is extra
processor use required to deal with slow transactions. Or, the individual transactions being
processed by the database are particularly complex or are not properly optimized for
performance.
When looking at an APM solution, look for one whose agents are enabled to support your
needed middleware applications and databases. By incorporating application- and
database-specific integrations directly into agents, it is possible to discover more detailed
information about the inner workings of these otherwise-opaque systems.
Although this capability is useful when the number of supporting applications and
databases is small, it grows substantially more useful as their count increases in the
environment. Figure 4.6 shows a rollup visualization that details the level of performance
across a set of applications.
The graphic here details the behaviors of ten different applications that may be involved in a particular infrastructure or business service. For each application, there is information about the number of users who use it, the total amount of data it consumes, its aggregate performance, and the number of users who would be affected should it experience a problem.
The percentages displayed in the column marked “Application Performance” relate to that
application’s instantaneous performance. To create these percentages, an application
administrator or predefined template must identify which performance metrics make
sense for that application’s measurement. Also needed are the threshold values for
identifying when an application is not performing to expected levels. Your APM solution
should provide the internal mechanisms for identifying these thresholds.
The net benefit of these calculations arrives during normal operations. When one or more
applications’ performance levels degrade past acceptable levels, administrators can use the
information in the “affected users” column to prioritize their resolution effort to those with
the highest impact on users.
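This prioritization step can be sketched in a few lines of Python. The application names, threshold, and column values below are hypothetical, chosen only to mirror the kind of rollup view described above:

```python
# Each entry mirrors a row from an "application performance" rollup view.
applications = [
    {"name": "Order Management", "performance_pct": 62, "affected_users": 1400},
    {"name": "Inventory Processing", "performance_pct": 58, "affected_users": 300},
    {"name": "Web Cluster", "performance_pct": 97, "affected_users": 5000},
]

ACCEPTABLE_PCT = 90  # assumed threshold below which an app counts as degraded

# Keep only degraded applications, then work the ones hurting the most users first.
degraded = [a for a in applications if a["performance_pct"] < ACCEPTABLE_PCT]
worklist = sorted(degraded, key=lambda a: a["affected_users"], reverse=True)

for app in worklist:
    print(app["name"], app["affected_users"])
```

Note that the healthy Web Cluster stays off the worklist even though it has the most users—the "affected users" column only matters once an application actually degrades.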
The actual gathering of these types of network statistics can be enabled through a number
of solutions, many of which are likely already available on your network hardware today.
Virtually all of today's business-class network hardware natively includes support for SNMP. Using it, a centralized network monitor can pull
statistics directly from networking equipment through SNMP polls.
Yet SNMP is only one solution. SNMP alone cannot provide the right kind of information associated with network traffic "flows," meaning the overarching conversations between elements on the network. Flow monitoring aggregates what would otherwise be seen only as individual packets: it provides a view of network traffic at a higher level than the individual packet, yet not quite at the level of inter-server transactions.
To provide this level of data, additional protocols such as Cisco's NetFlow have been developed that report on high-level "flows" rather than performing packet-level inspection. Although
still not looking at this data from the level of the server-to-server transaction, flow
information gives the troubleshooting administrator a better sense of their network’s high-
level traffic patterns.
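Conceptually, a flow is just the aggregation of all packets sharing the same 5-tuple of source address, destination address, ports, and protocol. A minimal sketch of that rollup—with hypothetical addresses and field names—might look like this:

```python
from collections import defaultdict

# Each packet record carries the classic 5-tuple plus its size in bytes.
packets = [
    {"src": "10.0.0.5", "dst": "10.0.1.9", "sport": 44123, "dport": 443, "proto": "TCP", "bytes": 1500},
    {"src": "10.0.0.5", "dst": "10.0.1.9", "sport": 44123, "dport": 443, "proto": "TCP", "bytes": 900},
    {"src": "10.0.0.7", "dst": "10.0.1.9", "sport": 51000, "dport": 443, "proto": "TCP", "bytes": 400},
]

# Aggregate packets into flows keyed by the 5-tuple, the way NetFlow-style
# exporters summarize a conversation rather than reporting every packet.
flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
for p in packets:
    key = (p["src"], p["dst"], p["sport"], p["dport"], p["proto"])
    flows[key]["packets"] += 1
    flows[key]["bytes"] += p["bytes"]

for key, stats in flows.items():
    print(key, stats)
```

Three packets collapse into two flows here, which is exactly the compression that makes flow data practical for spotting high-level traffic patterns.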
Figure 4.7: Another view of the system from the network’s perspective shows that
network performance is nominal during the period of measurement.
An example of this kind of data is displayed in Figure 4.7, which shows the network's behavior during a slightly offset period of time. In this graph, it is possible to quickly see that the network experienced a slight dip in performance between 2:00 AM and 5:00 AM. Performance returned to baseline by 10:00 AM, when the database
began experiencing problems. This information is critically useful for the troubleshooting
administrator, as it quickly shows that high-level network performance doesn’t appear to
have a direct impact on the system’s problem.
As with application monitoring, aggregate views of the network are also possible once
network integrations are laid into place. One such view is shown in Figure 4.8. Here,
aggregate network statistics across multiple sites are shown in a single view, with
associated trending arrows attached to each.
Figure 4.8: Aggregate network statistics across multiple sites are shown, with
trending arrows showing prior behaviors.
Graphs such as this are only possible when network statistics from across the environment
are gathered into a single location. By integrating each individual network device’s
statistics into an APM framework, an administrator can quickly pinpoint network errors
and transfer rates, further isolating a problem to particular network segments.
Installing Probes
Chapter 3 introduced the concept of probes and the efficacy of their use in network
monitoring. Your decision to use network probes will depend first on the capabilities of your networking equipment. Networking equipment that cannot support built-in integrations (rare these days) or network security policies that prohibit the passing of monitoring data are both situations where probes may be necessary.
Figure 4.9: A passive network probe is installed between an internal LAN and its
connection to an external WAN.
One situation where probes can provide special data not otherwise possible through in-
device integrations is in measuring traffic across network links that are not within local IT
control. For example, Figure 4.2 showed a connection between a third-party credit card
proxy and a router to the extranet that is shared with the credit card processing service.
Often, these kinds of connections are not within the direct control of the local IT organization, which makes installing on-device monitoring integrations problematic. In these cases, network probes can be physically installed on otherwise-uncontrollable connections to monitor their traffic for inconsistencies.
Figure 4.3 showed a particularly useful example of such an installation. Here, a probe can be installed with the intent of monitoring and alerting on Service Level Agreement (SLA) breaches between the business and the external processing network. This adds a level of due diligence during SLA breach negotiations as well as added protection against the impact of external actions such as outages or performance losses.
Measuring Transactions
Yet another perspective on system behaviors occurs when the monitoring solution is
pointed towards the connection between the Web site and its Inventory Processing System.
The view enabled through this integration combines otherwise packet-based data to look
at the individual transactions between these two servers. Although the packet-based data is
useful for recognizing the overall behaviors of the network in relation to that individual
connection, transaction measurements provide more details about the specific
“conversations” between servers and services on two different hardware components.
Take a look at Figure 4.10 for another chart that details this view of individual transaction
rates. Here, you can see that immediately prior to the time in question, a spike in
transactions began to occur between the two systems. Perhaps the timing of this
transaction spike gives some added details about the original processor utilization spike
from Figure 4.4. Perhaps a spike in user activity was the cause of the problem rather than an internal issue.
Figure 4.10: Elevating the view from individual network packets shows additional
information about the conversations between servers.
When thinking about this concept of transactions, it is important to consider the individual
conversations between the different elements on a system. Consider, for example, the types
of conversations that could potentially occur between a generic Web server and its
Inventory Processing System. Those conversations have both a source and destination IP
address, but they also operate over a known set of TCP or UDP ports. Within a single set of
ports, individual Web services at both ends can transfer multiple types of information, with
the ultimate end consumer of this information being different services on each server.
Your APM solution should include a set of network filters that works with both agent and
agentless monitoring to identify these conversations. Only by leveraging the right network
filters can a system gather meaning through the combination of individual packets of
information. Those filters must understand the protocols being used by both sides of the
communication as well as the well-formed data used in their conversations.
Continuing the example from Figure 4.10, let’s assume an administrator wishes to see the
actual “words” in the transactional communication between the Web cluster and the
Inventory Processing System. To do so, they must drill down even further. By correlating
the network probe’s “external” view of a transaction’s performance with the “internal”
application analytics perspective from a server-resident agent, it is feasible that a view
similar to Figure 4.11 can be created.
Figure 4.11 shows a truncated view of the details of that conversation. Here, the discussion
between the two servers is drawn out in detail. The Web server attempts to contact the
Inventory Processing System, with the processing system eventually responding. Statistics associated with delay timing are displayed in addition to the conversational details. The
result is a kind of time-oriented log of the conversation within and between the two
servers, showing exactly where areas of delay are found.
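To make the idea concrete, here is a hypothetical sketch of scanning such a time-ordered conversation log for the step consuming the most time. The step names and timings are invented for illustration:

```python
# A hypothetical time-ordered trace of one conversation; times in milliseconds.
trace = [
    {"step": "web -> inventory: request", "start": 0, "end": 4},
    {"step": "inventory: lookup", "start": 4, "end": 310},
    {"step": "inventory -> web: response", "start": 310, "end": 318},
]

# Find the step that consumed the most time -- the first place to look for lag.
slowest = max(trace, key=lambda s: s["end"] - s["start"])
print(slowest["step"], slowest["end"] - slowest["start"], "ms")
```

A log like this immediately separates network segments from server-internal segments, which is what lets the administrator rule the network in or out as the source of delay.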
A result of this level of detail is that conversational elements and areas of unacceptable
time lag can be identified within the individual code elements of each server’s Web
services. For example, if a particular callback from one server to another shows a high rate
of delay while the network itself can be eliminated as a source of lag, it becomes easy to flag the situation to developers. Ultimately, the end goal is to quickly identify the problem
and come to a resolution that can be patched into the system.
As a result, APM solutions include a number of mechanisms to roll up this massive quantity
of data into something that is useable by a human operator. This process for most APM
solutions is relatively automatic, yet requires definition by the IT organization that
manages it.
The concept of “service quality” is used to explain the overarching environment health. Its
concept is quite simple: Essentially, the “quality” of a service is a single metric—like a
stoplight—that tells you how well your system is performing. In effect, if you roll up every
system-centric counter, every application metric, every network behavior, and every
transaction characteristic into a single number, that number goes far in explaining the
highest-level quality of the service’s ability to meet the needs of its users.
This guide will talk about the concept of service quality in much greater detail over the next
chapters, but here it is important to recognize that implementing service quality metrics
also requires the involvement of the APM solution’s implementation team. Consider the
graphic shown in Figure 4.12. Here, a number of services in different locations are
displayed, all with a health of "Normal." This single stoplight chart quickly enables the IT organization to understand when a service is meeting demands and when it isn't. The
graph also shows the duration the service has operated in the “normal” state, as well as a
monthly trend. This single view provides a heads-up display for administrators.
Yet actually getting to a graph like this requires each of the monitoring integrations
explained to this point in this chapter. The numerical analysis that goes into identifying a
service’s “quality” requires inputs from network monitors, on-board agents, transactions,
and essentially each of the monitoring types provided by APM.
For example, if processor use on servers becomes unacceptable when it goes over 80% for
a 5-minute period of time, this behavior must be specifically ingested into the APM
platform. The same holds true for network behaviors: If network bandwidth utilization
over 85% is considered an unacceptable situation, it too must be configured into the
system. The aggregation of these thresholds ultimately combines to create the service quality metric shown in Figure 4.12.
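A toy illustration of how such thresholds might roll up into a single stoplight value follows. The metric names, the 80% and 85% limits (borrowed from the examples above), and the green/yellow/red mapping are all simplifying assumptions; real APM products apply far richer weighting and time windows:

```python
# Threshold rules: metric name -> limit above which the metric is unacceptable.
thresholds = {
    "cpu_pct": 80,       # processor use over 80% is unacceptable
    "net_util_pct": 85,  # network bandwidth utilization over 85% is unacceptable
}

def service_quality(samples: dict) -> str:
    """Roll many metric samples up into a single stoplight value."""
    breaches = sum(1 for name, limit in thresholds.items()
                   if samples.get(name, 0) > limit)
    if breaches == 0:
        return "green"
    return "yellow" if breaches == 1 else "red"

print(service_quality({"cpu_pct": 45, "net_util_pct": 60}))
print(service_quality({"cpu_pct": 92, "net_util_pct": 60}))
```

Even in this reduced form, the key property of the service quality metric survives: many heterogeneous inputs collapse into one number an operator can read at a glance.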
Although this process can seem extremely time-intensive, effective APM solutions speed
the process by incorporating industry-standard templates for common thresholds. The
inclusion of these templates gives administrators a starting point that can later be customized for the unique characteristics of their environment. More on this concept of service-centric monitoring will be covered in Chapter 6.
You may have noticed that there is one major omission in this chapter’s discussion on
implementing APM. End-User Experience (EUE) monitoring is one topic not found in this
chapter. Due to its relatively recent entrance into the market as well as its potential for wide-sweeping changes in the way services are monitored, this technology gets an entire chapter of its own. The next chapter will discuss EUE in detail, covering its technology
underpinnings, where it fits into the environment, and the types of data that can be
gathered as a result of its implementation.
80
Chapter 5
In any case, John finds himself staring blankly at a set of charts from his recently-implemented
APM solution, finding little in their meaning this morning. He looks through charts of
individual server performance. He flips through those relating to his networking behaviors
and finds nothing there of interest either. He even peeks into the transaction breakdowns
between his servers, ridiculously comprehensive in their level of captured data, but ultimately
too low-level for his management-oriented mind to comprehend. Those charts are for a
developer’s brain, not his.
In looking at this graphic, he finds that he cares little about what that chart actually
represents. “Some measurement of the ticketing system had this little bump during the
workday hours,” he thinks to himself, “that’s interesting, I guess.”
“These charts are giving us the information we want,” he thinks to himself, “They help my
admins find and fix problems. They help my network engineers track down bottlenecks. Heck,
they even found the piece of code that caused that big problem a few months ago. We’d have
never tracked that down without the transactional views. Yet something still nags me…"
John’s problem today has nothing to do with lack of monitoring. With his new APM solution in
place, quite the opposite is true. Fully implemented, John’s APM solution now gathers metrics
from servers and their individual applications. The network itself is represented, both from
the perspective of individual servers as well as the WAN as a whole. Yet in all of these metrics
he’s gathering, he’s missing the one piece that ultimately represents success: His users’
experience.
Just then the phone rings. It’s Dan Bishop, the COO and John’s direct superior. Things are never
good when Dan calls, “I’m hearing scattered reports that our wait time on the system has
spiked to over 20 minutes per purchase. What’s up?”
“Well, your numbers might not show it,” Dan continues, “but an old golfing buddy of mine just
called in and reported the same. Track it down and let me know. Oh, and put two tickets for
next week’s concert at the arena on hold for him, will you?”
Yet the discussion in Chapters 3 and 4 concluded at the very point where experience-based monitoring actually starts to get interesting. With the development of End-User Experience (EUE) monitoring, automated solutions for watching your business systems get their first look into the actual behaviors experienced by an application's users. Gathering metrics from the perspective of the users themselves brings a level of objective analysis to what has traditionally been a subjective problem. If you've ever dealt with the dreaded "the servers are slow today" phone call, you understand this problem.
What Is Perspective?
This guide has used the term “perspective” over and over in relation to the types of data
that can be provided by a particular monitoring integration. But what really is perspective,
and what does it mean to the monitoring environment?
Consider, for example, a set of fans watching a baseball game. If you and a friend are both
watching the game but sitting in different parts of the stadium, you’re sure to capture
different things in your view. Your friend who is sitting in the good seats down by the
batter is likely to pick up on more subtle non-verbal conversations between pitcher and
catcher. In contrast, your seats deep in the outfield are far more likely to see the big picture
of the game—the positioning of outfielders, the impact of wind speed on the ball, the
emotion and effects of the crowd on the players—than is possible through your friend’s
close-in view.
Relating this back to applications and performance, it is for this reason that multiple
perspectives are necessary. Their combination assists the business with truly
understanding application behaviors across the entire environment. An agent that is
installed to an individual server will report in great detail about that server’s gross
processing utilization. That same agent, however, is fully incapable of measuring the level
of communication between two completely separate servers elsewhere in the system.
This view is critically necessary because it is not possible—or, at the very least,
exceptionally difficult—to construct this experience using the data from other metrics.
Relating this back to the baseball example, no matter how much data you gather from your
seat in the outfield, it remains very unlikely that you’ll extrapolate from it what the pitcher
is likely to throw next.
For the needs of the business application, end user experience (EUE) enables
administrators, developers, and even management to understand how an application’s
users are faring. First and foremost, this data is critical for discovering how successful that
application is in servicing its customers. Applications whose users experience excessive
delays, drop off before accomplishing tasks, and don’t fulfill the use case model aren’t
meeting their users' needs. And those that don't meet user needs will ultimately result in failure for the business.
This line of thinking introduces a number of potential use cases where EUE monitoring can
benefit an application's quality of service. EUE monitoring works for evaluating the experience of the absolute end user as well as in other ways:
Figure 5.1: Multiple use cases exist for targeting EUE monitoring.
Figure 5.2: EUE can measure user behaviors across multiple connections in multiple
geographic locations.
How does this watching and reporting occur? In short, by creating a log of each user’s
activities. Consider for a moment how an Internet-facing application works. In this example,
the application’s user interface (UI) is Web-based, served through a front-end Web cluster.
For a user to work with that Web-based application, the cluster must generate and present
Web pages to the user. The user interacts with those Web pages by clicking in specified
locations, with each click resulting in some response returned back to the user.
A benefit of working with Web-based applications is that each click can be encapsulated
into its own transaction. When the user clicks on a Web page link, that click begins a long
chain of events. The Web server interacts with down-level services to gather necessary
data. Those down-level servers may then work with others even further down the
application’s stack. Eventually, through some combination of effort, the right data is
gathered. That data is then passed back to the front-end Web servers, which render new
content for the user.
By measuring which links the user clicks on, as well as the response time in receiving and
rendering resulting data back to the user, it is possible to identify the quantity of time
consumed by each step in the process. Later, this chapter will talk more about the spread of
time between the different system elements—client, network, and server—but for now,
recognize that EUE monitoring for end users works because the action of each user is
encapsulated into a Web transaction that can be measured.
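As a hypothetical sketch of that decomposition, consider one measured click broken into client, network, and server segments. The segment names and millisecond values here are assumptions chosen purely for illustration:

```python
# Hypothetical timing breakdown for one measured user click, in milliseconds.
click = {"client_render": 120, "network": 80, "server": 450}

# Total time the user waited, and each segment's share of that wait.
total = sum(click.values())
for segment, ms in click.items():
    print(f"{segment}: {ms} ms ({ms / total:.0%} of {total} ms)")
```

Even this trivial breakdown answers the question that matters most in EUE troubleshooting: of the time the user spent waiting, which tier consumed it?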
Conversely, users that connect to a US-based application from the Asia-Pacific or EMEA
regions must route their communication through transcontinental connections and over a
much longer distance. The quality of those connections as well as their length of travel can
impact the overall experience of the user. By measuring an application's performance from a series of different geographic locations, it is possible to recognize when network conditions are affecting that experience.
As with the earlier example, because measurements here are made at the Web server, all
inbound connections can be measured against each other. Determining the time required
for a full user action to be completed illuminates much about the quality of the connection,
and thus the user’s experience.
With Internet-based applications, these non-standard behaviors are more the norm than
the exception. Users are used to the “always-on” nature of the Internet, electing to work
with its applications as if they too were always on—logging in, logging out, stepping away
mid-transaction, moving on to another task, and so on. Because of these erratic behaviors, another, more predictable "end user" perspective is necessary. That perspective is provided
through internally- and externally-placed robots (see Figure 5.3).
Figure 5.3: Robot users can repeat the same action to derive a baseline of
performance and watch for deviations.
The primary job of a robot is to simulate the behavior of a standard user. Such a robot is
programmed with a series of actions, the completion of which is accomplished in a known
period of time. Repeatedly running through that set of actions creates a baseline response
profile for the application. The actions are known and are run over and over, so
administrators can then be alerted when performance deviates from the baseline.
For example, if a robot is preconfigured to click through a series of pages on the External
Web Cluster with the goal of finding and eventually purchasing an item, it is possible to
determine the average period of time required to complete the actions. When that period of
time deviates from the baseline at some point later on, it can be assumed that an issue or
problem is occurring somewhere in the application.
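The baseline-and-deviation logic can be sketched very simply. The response times below are invented, and the three-standard-deviation rule is one common, assumed choice for "deviates from baseline"; real tools offer configurable rules:

```python
import statistics

# Hypothetical response times (seconds) for a robot's scripted purchase flow.
baseline_runs = [12.1, 11.8, 12.4, 12.0, 11.9, 12.2]

mean = statistics.mean(baseline_runs)
stdev = statistics.stdev(baseline_runs)

def deviates(run_seconds: float, sigmas: float = 3.0) -> bool:
    """Flag a run that falls outside the baseline by more than `sigmas` stddevs."""
    return abs(run_seconds - mean) > sigmas * stdev

print(deviates(12.3))  # a run near the baseline
print(deviates(19.7))  # a run far outside the baseline
```

Because the robot's script never changes, any flagged run points to the application or its infrastructure rather than to user behavior—exactly the bellwether property described next.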
Although robots alone are not likely to assist in locating the problem, they can operate as a
bellwether for downstream problems. Identifying a change in the overall performance back
to the user often means that a problem or other issue should be reviewed using other
monitoring metrics.
Consider TicketsRus.com as an example: the primary mechanism for its ticket sales is its Internet-based application. Many individuals in the IT organization—
administrators, engineers, developers, and so on—are responsible for maintaining the
technology that powers that application. Yet a completely different force of individuals is
also necessary for ensuring that the right tickets are brokered for the right events and to
the right people. These accounting, sales, and management individuals require their own
interface into the application that has nothing to do with its ultimate rendering of Web
pages for customers.
Figure 5.4 shows how EUE can assist these individuals. Here, an Internal Accounting User
interacts with the Order Management System to ensure that the right tickets are always
available for purchase. The actor in this figure may be one person or an entire department
of individuals that are spread across a country or the globe. Targeting EUE monitoring in
this location gives the troubleshooting administrator another set of data to identify user-
visible behaviors on the system.
Figure 5.4: EUE can be incorporated into other parts of the application to monitor the
behaviors of internal users as well as external.
EUE monitoring at the level of the Order Management System can identify when that down-
level system experiences a loss of performance. It is possible to compare the information
gathered at this level with other information at the External Web Cluster to isolate
potential problems. Such EUE monitoring for internal users can occur at any level in the
application’s stack where performance is a concern. Each addition of monitoring at
different user endpoints provides yet another set of performance measurements that are
useful in measuring overall quality of service.
Whenever an outside party is contracted for those services, a binding agreement is usually
laid into place to define their quality. An Internet or Cloud Computing service provider
guarantees certain levels for bandwidth and latency as part of the price paid for their
services. A credit card processing facility guarantees a prescribed level of uptime. Suppliers
with direct application connections must meet minimum requirements.
Yet in many cases, the organization doing the monitoring of that service level is the
provider itself. For example, an Internet provider guarantees a particular service level but
does so based on the metrics that they themselves measure. You can imagine the conflict of
interest that occurs in this case when the business providing a service is also in the
business of measuring their success with that service.
In this case, another form of EUE integration becomes useful. Targeting EUE here provides
a secondary set of measurements when it becomes necessary to independently assess
external-party service levels. Figure 5.5 shows another example of an external service that
could be monitored by such an EUE integration. There, the third-party Credit Card Proxy
along with its Extranet Router is contracted to handle payment card services for the e-
commerce system. This is a common service that is contracted out to external parties
because of the complexities of payment card handling.
Figure 5.5: Validating service provider connections is another effective use of EUE
monitoring.
In this situation, payments for any services or items on the e-commerce Web site route
through that external system. Yet this external system often lies outside the direct control
of the local IT organization. Since the business derives all of its income through this
interface, it is considered mission critical to the business. As such, it is a best practice to
implement independent monitoring to verify its service quality.
This kind of monitoring can and likely should occur in-line with any external system that
participates in the application. The resulting metrics can be used in independently
identifying any violations to Service Level Agreements (SLAs) as well as in negotiating
chargebacks when vendors don’t meet their agreed obligations.
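The comparison itself can be simple once independently gathered metrics exist. The Python sketch below is illustrative only; all function names and threshold values are invented, and real SLA terms would come from the contract itself:

```python
# Illustrative SLA check: compare independently measured metrics
# against contracted thresholds. All names and numbers are examples.

def sla_violations(measured, sla):
    """Return the metrics whose measured values breach the SLA.

    `sla` maps metric name -> (limit, direction), where direction
    "min" means measured must stay at or above the limit (e.g. uptime)
    and "max" means it must stay at or below it (e.g. latency).
    """
    violations = {}
    for metric, (limit, direction) in sla.items():
        value = measured.get(metric)
        if value is None:
            continue
        if direction == "min" and value < limit:
            violations[metric] = value
        elif direction == "max" and value > limit:
            violations[metric] = value
    return violations

# Example: a provider guarantees 99.9% uptime and at most 200 ms latency.
sla = {"uptime_pct": (99.9, "min"), "latency_ms": (200, "max")}
measured = {"uptime_pct": 99.7, "latency_ms": 180}
print(sla_violations(measured, sla))  # only uptime is in breach
```

Records like these, gathered by your own monitoring rather than the provider's, are what give SLA violation claims and chargeback negotiations their teeth.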
In cases like this, the use of in-line probes is useful in gathering the right
information. Probes were first introduced back in Chapter 4 as one
mechanism for integrating APM monitoring into otherwise-unavailable
infrastructure components. There, Figure 4.9 (provided below) showed an
example of how a probe can be installed between a supplier extranet and an
internal LAN to monitor traffic.
Think for a moment about a typical Internet-based application such as the one being
discussed in this chapter. Multiple systems combine to enable the various functions of that
application. Yet there is one set of servers that interfaces directly with the users
themselves: the External Web Cluster. Every interaction between the end user and the
application must proxy in some way through that Web-based system. This centralization
means that every interaction with users can also be measured from that single location.
EUE leverages transaction monitoring between users and Web servers as a primary
mechanism for defining the users’ experience. Every time a user clicks on a Web page, the
time required to complete that transaction can be measured. The more clicks, the more
timing measurements. As users click through pages, an overall sense of that user’s
experience can be gathered by the system and compared with known baselines. These
timing measurements create a quantitative representation of the user’s overall experience
with the Web page, and can be used to validate the quality of service provided by the
application as a whole.
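The core measurement is straightforward. The following Python sketch records per-click response times and flags those slower than an established baseline; the class name and the explicit record() call are illustrative, since a real EUE monitor captures these timings passively at the web tier rather than through application code:

```python
from statistics import mean

class ClickTimer:
    """Record per-click response times and compare them with a baseline.

    A minimal sketch of the idea only; production EUE monitoring
    observes transactions on the wire rather than via explicit calls.
    """
    def __init__(self, baseline_seconds):
        self.baseline = baseline_seconds
        self.samples = []

    def record(self, elapsed_seconds):
        self.samples.append(elapsed_seconds)

    def slow_clicks(self):
        # Clicks that took longer than the established baseline.
        return [s for s in self.samples if s > self.baseline]

    def average(self):
        return mean(self.samples) if self.samples else 0.0

timer = ClickTimer(baseline_seconds=2.0)
for elapsed in (0.8, 1.4, 3.2, 0.9):   # simulated page-load times
    timer.record(elapsed)
print(len(timer.slow_clicks()))  # 1 click exceeded the baseline
```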
It is perhaps easiest to explain this through an example. Consider the typical series of steps a user might go through to browse an e-commerce Web site, identify an
item of interest, add that item to their basket, and then complete the transaction through a
check out and purchase. Each of these tasks can be quantified into a series of actions. Each
action starts with the Web server, but each action also requires the participation of other
services in the stack for its completion:
• Browse an e-commerce Web site. The External Web Cluster requests potential
items from the Java-based Inventory Processing System, which gathers those items
from the Inventory Mainframe. Resulting items are presented back to the External
Web Cluster, where they are rendered via a Web page or other interface.
• Identify an item of interest. This step requires the user to look through a series of
items, potentially clicking through them for more information. Here, the same
thread of communication between External Web Cluster, Inventory Processing System, and Inventory Mainframe is leveraged during each click. Further
assistance from the ERP system can be used in identifying additional or alternative
items of interest to the user based on the user’s shopping habits.
• Add that item to the basket. Creating a basket often requires an active user account, managed by the ERP system with its security handled by the Kerberos Authentication System. The actual process of moving a desired item to a basket can
also require temporarily adjusting its status on the Inventory Mainframe to ensure
that item remains available for the user while the user continues shopping.
Information about the successful addition of the item must be rendered back to the
user by the External Web Cluster.
• Complete the transaction through a check out and purchase. This final phase
leverages each of the aforementioned systems but adds the support of the Credit
Card Proxy System and Order Management System.
In all these conversations, the External Web Cluster remains the central locus for
transferring information back to the user. Every action is initiated through some click by
the user, and every transaction completes once the resulting information is rendered for
the user in the user’s browser. Thus, a monitor at the level of the External Web Cluster can
gather experiential data about user interactions as they occur. Further, as the monitor sits
in parallel with the user, any delay in receiving information from down-level systems is
recognized and logged.
A resulting visualization of this data might look similar to Figure 5.6. In this figure, a top-
level EUE monitor identifies the users who are currently connected into the system.
Information about the click patterns of each user is also represented at a high level by
showing the number of pages rendered, the number of slow pages, the time associated with
each page load, and the numbers of errors seen in producing those pages for the user.
Figure 5.6: User statistics help to identify when an entire application fails to meet
established thresholds for user performance.
Adding a bit of preprogrammed threshold math to the equation, each user is then given a metric associated with their overall application experience. In Figure 5.6, you can see that some users are experiencing a yellow condition, meaning that their effective performance is below the threshold for quality service. Although this information exists at a very high level, and as such doesn’t identify why performance is lower than expected, it does alert administrators that some users are experiencing degraded service.
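The threshold math behind such a stoplight view can be sketched as follows; the 2-second and 4-second boundaries are arbitrary examples, not defaults of any particular product:

```python
def user_status(avg_page_seconds, green_max=2.0, yellow_max=4.0):
    """Map a user's average page-load time to a stoplight condition.

    The thresholds are illustrative; a real APM product lets
    administrators tune them per application.
    """
    if avg_page_seconds <= green_max:
        return "green"
    if avg_page_seconds <= yellow_max:
        return "yellow"
    return "red"

# Roll up per-user page timings into the dashboard condition.
users = {"alice": [0.9, 1.2], "bob": [3.1, 3.5], "carol": [5.0, 6.2]}
conditions = {name: user_status(sum(t) / len(t)) for name, t in users.items()}
print(conditions)  # alice green, bob yellow, carol red
```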
An effective APM solution should enable administrators to drill down through high level
information like what is seen in Figure 5.6 towards more detailed statistics. Those statistics
may illuminate more information about why certain users are experiencing delays while
others are not. Perhaps one server in a cluster of servers further down in the application’s
stack is experiencing a problem. Maybe the items being requested by some users are not
being located quickly enough by inventory systems. Troubleshooting administrators can
drill through EUE information to server and network statistics, network analytics, or even
individual transaction measurements to find the root cause of the problem.
It is for these reasons that an effective APM solution will create numerous visualizations
out of collections of transaction information. These visualizations roll up the
communication behaviors between user and server, or server and server, into an easy-to-use graphical form. One visualization that is particularly useful in troubleshooting is the C-N-S Spread (see Figure 5.7).
The C-N-S Spread measures the amount of time required to complete a transaction between two elements. Broadly, it breaks that time down into the portions consumed by the Client, Network, and Server components. You can see in Figure 5.7 that these three components are broken down even further to include the network overhead components of Latency, Congestion, and the TCP Effect. This spread of information illuminates a number of interesting behaviors associated with the communication:
• Client. The quantity of time spent at the client. This can include the amount of time
required for the client to process and render incoming data for the user.
• Server. This relates to the amount of time a server is processing an inbound
request. This can include locating records in a database, processing the business
logic surrounding those records, or completing essentially any activity associated
with the request.
• Bandwidth time. This represents the Network link speed component of the Spread.
Here, the amount of time required to clock data onto the network is measured; the
faster the link speed, the faster the clocking rate.
• Latency. This represents the Network distance component of the Spread—the
amount of time required for requests and replies to traverse the network. The
greater the distance, the higher the latency.
• Congestion. Similar to Latency, congestion measures the delay associated with too
much data attempting to pass across the network. When congestion is high, data is
delayed or even discarded if the network is oversubscribed.
• TCP Effect. Any network communication also has a certain level of flow-control
overhead associated with reliably getting packets from one location to another. This
TCP Effect can be broken out separately as well to identify when TCP-based errors
or other issues are having an impact on communication.
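As a rough illustration of how these components combine, the sketch below models a single transaction's spread. The field names follow the list above; the sample values and the simple summing of components are an illustrative simplification:

```python
from dataclasses import dataclass

@dataclass
class CnsSpread:
    """One transaction's time, split into C-N-S components (seconds)."""
    client: float      # client-side processing and rendering
    server: float      # server-side request processing
    bandwidth: float   # time to clock data onto the network
    latency: float     # network-distance delay
    congestion: float  # queuing delay from an oversubscribed network
    tcp_effect: float  # flow-control and retransmission overhead

    def network(self):
        # The four network overhead components together form "N".
        return self.bandwidth + self.latency + self.congestion + self.tcp_effect

    def total(self):
        return self.client + self.server + self.network()

    def dominant(self):
        # Which of C, N, or S consumed the most time?
        parts = {"client": self.client, "server": self.server,
                 "network": self.network()}
        return max(parts, key=parts.get)

spread = CnsSpread(client=0.2, server=1.5, bandwidth=0.1,
                   latency=0.3, congestion=0.05, tcp_effect=0.05)
print(spread.dominant())  # the server consumes the most time
```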
Figure 5.8: Graphs like the C-N-S Spread leverage integrations across the suite of an
application’s components.
Consider the areas in which monitoring must be in place to create such a visualization.
Clients must be monitored to understand their behaviors. The network must be monitored
to watch for transaction traversal. The processing of requests on servers themselves must
additionally be watched. Even more complex is the logic involved with tracking this
information across the various components of an environment, and ultimately converting it
to a useable form. Only a mature APM solution with its integrations across the suite of
application components has the reach necessary to create such a visualization.
Coded into one updated method was a change that removed an optimization in inventory processing. Removing this optimization forced the server to slow its processing of inventory requests. Such a problem is commonplace, especially with home-grown code, and can be quite difficult to track down once implemented.
In this case, however, the application’s administrators were quickly able to determine that
a problem existed with the updated code. Looking at a visualization similar to Figure 5.6,
the application’s administrators immediately noticed that the metrics associated with user
experience were dropping from the green state to the yellow, and occasionally the red
state. This high-level monitor immediately indicated that users were “experiencing” the
problem. Although it is likely that no one had called in to complain—perhaps users
considered it a momentary hiccup rather than an endemic problem—administrators were
immediately aware that something was wrong.
With this information in hand, administrators could quickly pull up another visualization
similar to Figure 5.9 to trace the problem at a slightly lower-level perspective. There, they
identified that the time consumed by the Server component was far larger than its
established baseline.
Figure 5.9: Drilling into the server’s Java codebase enables developers to locate un-
optimized code.
Clicking further into the details, administrators began peering into the individual servers to
find areas of delay, eventually focusing their attention on the update to the Inventory Processing System. Recognizing that the problem was likely related to the update,
administrators enlisted the support of the development team. That team dug deeper to find
the visualization shown in Figure 5.9. There, you can see how a particular Thread and Main
Class is highlighted along with its rate of delay. This timing information related to specific
threads—and ultimately the methods being processed by those threads—enabled the
development team to quickly find and fix the offending code.
The rate at which a problem like this can be resolved is attributable directly to the depth of
monitoring provided by an APM solution. Without its “everything and everywhere”
approach to watching for environment behaviors, such a rapid resolution would not be
possible.
This can be a particular problem when services are available for a short period of time or
on a limited basis. The TicketsRUs.com story provides another illustration of this situation
in relation to its selling of concert and sporting event tickets. These types of items are
generally available starting at a particular date and time, with a finite number of tickets
available. While for many events this is not a problem as the level of ticket supply meets the
level of customer demand, there occasionally comes the time when demand far exceeds
supply: sporting event finals, major concerts, and so on.
The way in which these situations manifest in customer-facing systems is through a
widespread slowdown in application performance. Figure 5.10 shows an example set of
graphs that can explain such a situation. Here, an application’s Application Performance
Index (Apdex) is related to the rate of unique users attempting to use the application. You
can see here that the performance index falls dramatically with the inbound spike of users.
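Apdex itself is defined by a simple formula: responses at or under a target threshold T count as satisfied, those between T and 4T count half as tolerating, and the rest count as frustrated. A minimal sketch, where T = 0.5 seconds is an arbitrary example target:

```python
def apdex(response_times, t=0.5):
    """Compute the Apdex score for a list of response times (seconds).

    Standard Apdex: samples at or under threshold T are "satisfied",
    those between T and 4T are "tolerating" (half credit), and the
    rest are "frustrated" (no credit).
    """
    if not response_times:
        return 1.0
    satisfied = sum(1 for r in response_times if r <= t)
    tolerating = sum(1 for r in response_times if t < r <= 4 * t)
    return (satisfied + tolerating / 2) / len(response_times)

# A user influx drives response times up and the index down.
normal_load = [0.2, 0.4, 0.3, 0.6]       # mostly satisfied
spike_load = [1.8, 2.5, 0.4, 3.0, 2.2]   # mostly frustrated
print(apdex(normal_load), apdex(spike_load))
```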
In the case of a massive user influx, metrics such as these help identify when performance
problems have little or nothing to do with the application’s architecture itself. Consider the
types and rate of alerts that could be triggered by such an inrush of users. Processor
utilization on servers goes above thresholds. Network bandwidth becomes saturated,
causing latency to spike dramatically. Application analytics on down-level services begin
notifying that they cannot keep up with the load. Web pages are unable to refresh, causing
errors at the client level.
Lacking a holistic perspective on the environment, such a situation could cause an alert
storm to the pagers of unsuspecting administrators. Administrators might find themselves
struggling to bring meaning to such a situation, tracking down symptoms of a much greater
problem.
Applications that have the right kinds of APM monitoring in place might be able to
encapsulate overall application performance into an index such as the Apdex metrics noted
in Figure 5.10. Relating this to the rapid rise in incoming users provides a focus for
understanding why the alert storm is occurring. It also provides ways to recognize where
system bottlenecks can be later eliminated to reduce the effect of massive popularity the
next time demand overwhelms supply.
Lastly, there is the ability to notify the end users themselves when their activity causes a reduction in overall performance. Although such performance problems aren’t often the result of bad decisions made by the business, the business itself is usually the one that gets blamed. One very useful way to eliminate finger-pointing and remind users of the site’s overwhelming short-term popularity is to notify them of the problem. In this case, the system can automatically alert users that a high volume of requests is currently being received and that their requests may take longer than expected. In the end, an educated customer population is less likely to blame your business when the problem is related to their own usage.
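Such a notification can be as simple as a capacity check at the web tier. The sketch below is hypothetical; the trigger condition, the capacity figure, and the wording would all come from your own APM thresholds and site copy:

```python
def load_banner(active_users, capacity):
    """Return a user-facing notice when demand exceeds capacity.

    A hypothetical sketch: a real site would emit this from the web
    tier when APM metrics (Apdex, user counts) cross a threshold.
    """
    if active_users > capacity:
        return ("We are currently experiencing a very high volume of "
                "requests; your request may take longer than expected.")
    return None

print(load_banner(active_users=50_000, capacity=20_000))
```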
EUE further improves this recognition of success by enabling a greater vision into the
environment. That vision speeds root cause analysis, enabling visualizations that very
quickly drill down to problems. Because user behaviors are specifically tracked, even the
most challenging of code-oriented problems can be isolated for a quick fix.
Finally, EUE improves a business’s capability to refine its application infrastructure over time. Its data shows where bottlenecks require hardware expansion or software updates, while providing real-world justification on which to base short-term and long-term planning activities.
To this point, however, EUE is still but one piece of the larger model of service created by
an APM solution. That service model is the central whiteboard upon which an application’s components are laid out and connected. Through the process of creating and refining an
application’s service model, the linkages between components become well-defined. This
creates a web of dependencies upon which alerting and status information can be based.
Chapter 6 discusses how this model is created, and how the concepts surrounding the
service-centric monitoring approach enable a complete representation of an application’s
entire set of resources.
100
Chapter 6
In fact, there aren’t many problems that don’t get initially triaged by John’s Help desk team. Back in the old days, internal problems were often discovered when a user called them in to this phone line. It was this very Help desk that first found out about that major Web site problem during last year’s Finals. That day was painful in the extreme, and John was determined never to see one like it again.
“OK, everyone, settle down,” John says to the room. “We’re about to get started.”
John called the meeting today to introduce the Help desk staff to his new monitoring system. After months of work, John and his implementation team were finally ready to bring it into full production. Part of that rollout involved creating dashboards that were specifically designed for this team. These dynamic visualizations brought high-level information about each subsystem’s status right to the Help desk.
Although he’d admit this group isn’t his most technically adept, they were highly capable of
handling phone calls and waiting for green lights on a dashboard to turn red. In fact, that
simple capacity was one of their greatest benefits to the organization. With part of their job
being triage of any technical problem, they could very quickly identify if a perceived problem
was indeed “perceived” or actually real. By inserting this 24x7x365 group into the problem
resolution process right at its start, John managed to keep his top-level engineers on call
without burning them out. In short, it was the job of this team to identify why a light went red,
and subsequently alert the right people when necessary.
In John’s meeting today, he was ready to unveil his greatly improved set of lights.
“Everyone. We’re here today to unveil a project that’s going to improve how you identify
problems,” John begins, “Our APM monitoring solution, which you see here, gives us a high-
level heads-up display about each of our production systems. You can see in this graphic that
each component is given a stoplight showing green or red. It’s ultimately your job to identify
when something goes red, track it down, and alert the right personnel if necessary.
“Everything you see in this visualization is clickable. You’ll notice that if I click on any of these
links, I can drill down into even more detail about that item. If you want, you can even click
right down into the specific details about the problem.
“Now, I recognize that that level of detail is probably too much for most of us, but the idea
here is that everyone from Help desk to engineer to manager to code developer can access this
same set of visualizations. This means that you can very quickly and easily walk an engineer
or a developer through what you’ve learned once you engage them. Now, we’re all on the
same page, and we’re all looking at the same data!
“Any questions?” asks John of the crowd. One hand shoots up.
“Mr. Brown, this is great information,” offers John’s newest employee, one whom John feels has
real potential, “I understand that you’re collecting these metrics from essentially everywhere
these days. But how are you crunching this data into something that makes sense?”
Real potential, John thinks. He smiles as he looks back to the audience, “Ahhh, great question.
There’s the real magic with this system. You see, it’s all about the model…”
The new employee in this chapter’s episode of our ongoing TicketsRUs.com story asks a
critical question about just those kinds of calculations. This individual recognizes that
widespread monitoring integrations enable an IT infrastructure to view behaviors all
across the environment. They understand that you have to plug into each component if
you’re to gather a holistic set of data. But actually fitting those pieces together is something
that remains cloudy.
It is this process that requires attention at this point in our discussion. In reading through
the first five chapters of this guide, you’ve made yourself aware of where monitoring fits
into your environment. The next step is in creating meaning out of its raw data. As John
mentioned earlier and as you’ll discover shortly, the real magic in an APM solution comes
through the creation and use of its Service Model.
But first, think for a minute about the high-level stoplight charts that were unveiled to
TicketsRUs.com’s Help desk during that meeting. Perhaps one of those charts looked
similar to Figure 6.1. In this visualization, each individual system element is given three
fairly simple metrics:
• What is its current status? It is here where the top-level stoplight charts are
presented to the viewer. Systems that perform normally get a green light, while
those that are currently down see red. A yellow light can be shown for those
systems that aren’t technically down but are experiencing some failure or pre-
failure condition.
• Is it available? Whether a service is available to its customers is a binary value: if a system is non-functional for its users, it is unavailable. This second column adds to the first by highlighting which services are indeed not serving their customers.
• What is its medium-term trend of availability? Lastly, the third column expands
this visualization’s usefulness backwards in time. Here, the monthly trend of service
availability is provided with enough granularity that viewers can see if and where
problems occur over the medium term. Services that go down together may be
related. Those that go down more often may need extra support. The medium-term
trending of a service’s quality helps administrators identify where expansion or
augmentation may be necessary.
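These three columns reduce to very little logic. The sketch below shows one way such a dashboard row might be computed; the three-light mapping and the sample values are illustrative, not taken from any specific product:

```python
def stoplight(available, degraded):
    """Map a system's state onto the dashboard's three-light scheme."""
    if not available:
        return "red"
    return "yellow" if degraded else "green"

def monthly_availability(daily_up_flags):
    """Medium-term trend: fraction of days the service was available."""
    return sum(daily_up_flags) / len(daily_up_flags)

# Example row for one system element (values invented).
row = {
    "status": stoplight(available=True, degraded=True),   # pre-failure state
    "available": True,
    "trend": monthly_availability([True] * 28 + [False] * 2),
}
print(row)
```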
Ultimately, one of the primary goals of an APM implementation is to define in real time the
measurement of a service’s quality. Services that accomplish their mission with a high level
of quality are likely to have high levels of customer satisfaction, repeat customers, and net
profit for the business. Conversely, when services are of low quality, customers are likely to
go elsewhere with their business. If you’ve ever experienced a problem with a Web-based
service that simply didn’t function to your needs, you know how important this
measurement can be.
The idea of Service Quality is not new, nor is it as subjective as it outwardly appears once incorporated into an APM solution. When you think about a particular service, what are the
possible states that that service can operate in? It can be functional or non-functional, but
its level of actual functionality to its users can also lie across a spectrum, as Figure 6.2
shows. Each of these conditions represents a potential state in which that service can be operating:
A loss in a sub-system to a business service feeds into the total quality of that
service. A reduction in the performance of a system reduces its quality. And, most
importantly, an increase in response time for a customer-facing system reduces its
service quality.
Thus, it can be argued that any reduction in a service’s capacity to accomplish its stated
mission represents a reduction in that service’s quality. This idea should make sense to the
casual observer; lower service quality means a reduced user experience. Yet, this
explanation still hasn’t answered the fundamental question of, “How does one apply an
objective approach to defining quality?” The short answer is, “The right model, and a whole
lot of math.”
But before actually delving into a conversation of the Service Model, it is important to first
understand its components. Think about all the elements that can make up a business
service. There are various networking elements. Numerous servers process data over that
network. Installed to each server may be one or more applications that house the service’s
business logic. All these reside atop name services, file services, directory services, and
other infrastructure elements that provide core necessities to bind each component.
Take the concepts that surround each of these and abstract them to create an element on
that proverbial whiteboard. This guide’s External Web Cluster becomes a box on a piece of
paper marked “External Web Cluster.” The same happens with the Inventory Processing
System and the Intranet Router, and eventually every other component.
By encapsulating the idea of each service component, it is now possible to connect those
boxes and design the logical structure of the system. This step generally occurs after implementation: the implemented service’s architecture defines the model’s structure, not the other way around. Figure 6.3 shows a simple example of how this
might occur. There, the External Web Cluster relies on the Inventory Processing System for
some portion of its total processing. Both the External Web Cluster and the Inventory
Processing System rely on the Intranet Router for their networking support. As such, their
boxes are connected to denote the dependency.
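In code, such a model is just a dependency graph. The sketch below uses component names from this chapter, but the graph structure itself is a simplified guess at Figure 6.3:

```python
# A minimal Service Model: components as nodes, dependencies as edges.
dependencies = {
    "External Web Cluster": ["Inventory Processing System", "Intranet Router"],
    "Inventory Processing System": ["Intranet Router"],
    "Intranet Router": [],
}

def all_dependencies(component, graph):
    """Everything a component relies on, directly or transitively."""
    seen = set()
    stack = list(graph.get(component, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    return seen

print(all_dependencies("External Web Cluster", dependencies))
```

Once the graph exists, alerting and status logic can walk these edges rather than treating each component in isolation.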
Note
Your APM solution will be equipped with a “designer” tool that enables this
creation and connection of components directly within its management
interface. You’ll also find that best-in-breed APM solutions arrive with
support for the automatic discovery and creation of Service Models. This
automation can dramatically speed the process of creating your initial
Service Model’s structure over manual efforts alone.
With this in mind, let’s redraw Figure 6.3 and map a few of these potential points of
monitoring into the abstraction model. Figure 6.4 shows how some sample metrics can be
associated with the Inventory Processing System. Here, the Database Performance and
Transactions per Second statistics arrive from application analytics integrations plugged
directly into the installed database. Agent-based integrations are also used to gather whole
server metrics such as Memory Utilization and Processor Utilization.
(Figure 6.4 depicts the Inventory Processing System, linked to the Intranet Router, with a Component Health monitor grouping its Database Performance, Transactions per Second, Memory Utilization, and Processor Utilization metrics.)
Figure 6.4: Individual monitors for each element are mapped on top of each
abstraction.
You’ll also notice that the colors of each element are changed as well. At the moment Figure
6.4 is drawn, the Inventory Processing System’s box is colored red. This indicates that it is
experiencing a problem. Drilling down into that Inventory Processing System, one can
identify from its associated metrics that the server’s Processor Utilization has gone above
its acceptable level and has switched to red.
Each of the metrics assigned to the Inventory Processing System’s box is itself part of a hierarchy. The four assigned metrics fall under a fifth that represents the overall
Component Health. This illustrates the concept of rolling up individual metrics to those that
represent larger and less granular areas of the system. It enables the failure of a down-level
metric to quickly rise to the level of the entire system.
Drilling down in this model highlights the individual failure that is currently impacting the
system, but that specific problem is only one piece of data found in this illustration. As you
drill upwards from the individual metrics and back to the model as a whole, you’ll notice
that the individual boxes associated with each component are also active participants in the
model. Because the overall Component Health monitor associated with the Inventory
Processing System has changed to red, so does the representation of the Inventory
Processing System itself.
Going a step further, this model rolls individual failures up to the greater system through the linkages between components that rely on each other. In this example, the
External Web Cluster relies on the failed Inventory Processing System. Therefore, when the
Inventory Processing System experiences a problem, it is also a problem for the External
Web Cluster. The model as a whole is impacted by the singular problem associated with
Processor Utilization in the Inventory Processing System.
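Both behaviors, metrics rolling up into Component Health and a failure flowing up through dependency linkages, can be sketched with a simple worst-status rule. The severity ordering and the example values below are illustrative:

```python
SEVERITY = {"green": 0, "yellow": 1, "red": 2}

def worst(states):
    """The least healthy state in a collection wins."""
    return max(states, key=SEVERITY.get)

def component_health(metrics):
    """Roll individual metric states up into one Component Health."""
    return worst(metrics.values())

def effective_status(component, own, deps):
    """A component inherits the worst status among itself and
    everything it depends on, recursively."""
    states = [own[component]]
    for dep in deps.get(component, []):
        states.append(effective_status(dep, own, deps))
    return worst(states)

# Processor Utilization goes red inside the Inventory Processing System...
metrics = {"Database Performance": "green", "Transactions per Second": "green",
           "Memory Utilization": "green", "Processor Utilization": "red"}
own = {"External Web Cluster": "green",
       "Inventory Processing System": component_health(metrics),
       "Intranet Router": "green"}
deps = {"External Web Cluster": ["Inventory Processing System", "Intranet Router"],
        "Inventory Processing System": ["Intranet Router"]}
# ...and the dependent External Web Cluster turns red along with it.
print(effective_status("External Web Cluster", own, deps))  # "red"
```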
However, consider the situation in which the problem lies deeper within the
application itself. In this example, the problem is not the loss of an entire server or
device. Here, a much deeper problem exists: the response time between the application server and the mainframe slows down due to a problem within the mainframe. The decrease in
performance between these two components eventually grows poor enough that it
impacts the system’s ability to complete transactions with the mainframe. As a
result, the upstream reliant servers such as the application server, database server,
and Web server can no longer fulfill their missions.
During that explanation, it was suggested that a properly-implemented APM solution could
quickly identify the problem and lead troubleshooting administrators to the correct solution. That process is realized through the implementation of abstractions such as that
shown in Figure 6.4.
In the use case associated with Figure 6.4’s model, perhaps a Help desk employee notices
that the system has switched from a healthy to an unhealthy state. Perhaps the stoplight for
the system as a whole has changed from green to red or yellow. The Help desk employee
now knows that something is amiss within the system.
With this information in hand, that Help desk employee can then drill down within the APM
solution’s interface to find the component that is experiencing the problem. Once the
employee finds the offending component, they can drill down even further to discover
which aspect of that problem component is the source of the problem. In this case, that
source relates to a processor overuse condition within the Inventory Processing System.
This information greatly improves the ability to initially triage system issues. With this
information in hand, for example, the triaging Help desk employee knows that the server
team is a likely candidate for resolving the problem. The problem probably doesn’t relate to
a network issue. Its initial troubleshooting is likely not within the realm of the application’s
development team. Determining the right resources up front reduces the level of effort required to solve the problem while improving the speed to resolution.
Obviously, Figure 6.4 is only one part of that overall Service Model. A completed Service
Model for our example system is shown in Figure 6.5. Here, the previous three components
are shown in relation to the rest of the model. Also present is the earlier physical structure
for comparison. Arrows are also drawn to illustrate where dependencies might lie between
each individual component.
Note
In a fully-realized Service Model implementation, an additional element titled
“The E-Commerce System” would be created at the top-most level and
connected with each individual component as a dependency. This connection
ensures that the loss of any component has an impact on the functionality
(and, thus, Service Quality) of the entire system. This element is not present
in Figure 6.5 only for readability reasons, but is an important part of a full
Service Model implementation.
Figure 6.5: The complete Service Model for our example system, along with the
physical structure for comparison.
It is critical to understand that the Service Model is a construct that lives entirely within the
APM solution. Your APM solution will provide a whiteboard-like designing utility that
enables the creation and connection of individual elements. It is within this design utility
where you will input the specific data and metrics that are tagged to each element. In effect,
this design activity is the next step after installing monitoring integrations into the various
aspects of your business service.
Service Quality
With this, the entire conversation swings back to the initial question: "How does one
quantitatively represent Service Quality?" At this point, the business service has been
deconstructed into its disparate components. Each of those components has been laid out
into a logical structure, with dependencies highlighted using connectors. Individual
monitoring integrations have also been assigned into each component as they make sense
(for example, network monitors into network components, server monitors into server
components, and so on).
The final step in this process is the piecewise labeling of each component and its
monitoring with the behaviors that are considered acceptable. This process effectively
creates the mathematical model that the APM engine uses to define a service’s quality, and
is another task that is commonly done within the APM solution’s designer utility.
These metrics are part of a hierarchy, so a set of rollup values must be included that
identifies when a higher-level value crosses a threshold. In the case of Figure 6.6, the rollup
value for Component Health remains green until one of its dependent monitors crosses into
the Failed state. Although this is a simple example of such a service, it is easy to see how
added logic and more layers of hierarchy enable administrators to add substantial levels of
granularity into the system. For example, a rollup value can remain healthy as long as a
percentage of its dependent values remain healthy. Or, that rollup value can change its
state based on the moving average of individual dependent states. Enriching the data even
further, individual values can be given time limits, whereby a healthy state will not switch
to an unhealthy state unless the threshold behavior occurs over a predetermined period of
time.
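The rollup behaviors described above can be sketched in a few lines of Python. This is a minimal illustration, not drawn from any particular APM product; the state values, the 75% healthy fraction, and the moving-average cutoffs are all assumptions you would tune for your own model.

```python
from statistics import mean

# Hypothetical health states, ordered from best to worst.
GREEN, YELLOW, RED = 0, 1, 2

def rollup_by_percentage(child_states, healthy_fraction=0.75):
    """Rollup stays GREEN while at least `healthy_fraction` of
    dependent monitors are GREEN; otherwise it degrades."""
    healthy = sum(1 for s in child_states if s == GREEN)
    if healthy / len(child_states) >= healthy_fraction:
        return GREEN
    # Any RED child drags the rollup to RED; otherwise YELLOW.
    return RED if RED in child_states else YELLOW

def rollup_by_moving_average(recent_states, yellow_at=0.5, red_at=1.5):
    """Rollup state based on the moving average of a child's recent states."""
    avg = mean(recent_states)
    if avg >= red_at:
        return RED
    return YELLOW if avg >= yellow_at else GREEN
```

Either policy plugs into the same hierarchy; the percentage rule tolerates isolated failures, while the moving average damps momentary blips.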
It is the summation of all these individual threshold values that ultimately drives the
numerical determination of Service Quality. A business service operates with high quality
when its configured thresholds remain in the green. That same service operates with low
quality when certain values flip from green to red, and is no longer available when other
critical values become unhealthy. The levels of functionality between these states, as
introduced in Figure 6.2’s spectrum, become mathematical products of each calculation.
In effect, one of APM’s greatest strengths is in its capacity to mathematically calculate the
functionality of your service. Taking this approach one step further, IT organizations can
add data to each element that describes the number of potential users of that component.
Combining this user impact data with the level of Service Quality enables the system to
report on which and how many users are impacted by any particular problem.
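The quality calculation and user-impact reporting described here might be sketched as follows. The tuple shapes and equal weighting are illustrative assumptions, not any vendor's actual calculation engine.

```python
# Hypothetical inputs: each monitored component reports whether it is
# within its thresholds, plus a weight (or a potential-user count).

def service_quality(components):
    """components: list of (is_healthy, weight) tuples.
    Returns the weighted fraction of healthy components:
    1.0 = fully healthy, 0.0 = fully down."""
    total = sum(w for _, w in components)
    healthy = sum(w for ok, w in components if ok)
    return healthy / total

def impacted_users(components):
    """components: list of (is_healthy, potential_users) tuples.
    Sums the potential users of every unhealthy component."""
    return sum(users for ok, users in components if not ok)
```

For example, three components weighted 1, 1, and 2 with the second one failed yield a quality of 0.75, and a failed component serving 1,200 potential users reports 1,200 impacted users.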
It is for this reason that many APM solutions include a set of templates for
known server, network, and application types. If it is considered an industry
standard for database servers to maintain an acceptable processor utilization
of 90%, then having that as a starting value helps you bring your Service Model into
operation much more quickly.
You should also recognize that your Service Model is intended to be an
organic construct. Over time, your service will change and evolve, and with it
so will its Service Model. For example, you may find down the road that
processor utilization over 80% actually causes a down-level problem with
your system, or that another service component needs to be added into the
model to support a new business endeavor.
An effective APM solution’s Service Model designer will allow you to
reposition and reconfigure Service Model components and configurations at
any time to mirror your ever-changing infrastructure.
The process of actually creating that model and implementing your APM solution tends to
require formal project planning with the appropriate stakeholders if it is to be successful.
To assist, consider the following seven-step process as a common methodology in creating
a service model and ultimately implementing your APM infrastructure.
Step 1: Selection
The first step in any APM implementation project involves identifying its ultimate
outer boundary. Although an APM solution can monitor virtually everything in your IT
environment, it may not make sense to do so. Monitoring of environments that aren’t in
production can create alert storms as those environments go through development and
testing. Integrating APM into unrelated IT infrastructure elements can create islands of
monitoring that don’t mesh into your Service Model. And, you may find that some IT
components just don’t have a measurable impact on your business’ bottom line.
When thinking about where you should place your outer boundary of monitoring, consider
first those elements that you define as business services. A business service is one whose
operation can be quantified in terms of dollars and cents. By limiting your APM solution (at
least, in the beginning) to the revenue-impacting portions of your IT infrastructure, you
will have a much greater capability of getting your arms around its initial implementation.
Once in place, you may consider adding services as necessary.
Note
It is crucial in the selection process to recognize that common infrastructure
services such as name services, directory services, and the like can also be
critical components of a business service. When a business service relies on
these common infrastructure services for its support, those infrastructure
services automatically become a part of your Service Model's hierarchy. Any outage of an
infrastructure service can lead to an outage of the business service, which
incurs a revenue impact.
Be aware also that the implementation of an APM solution is more than just a technology
insertion. APM’s quantitative data has a tendency to change entire business processes.
Thus, gathering the right stakeholders in your organization—those who can impact
business process maturity—will ensure that your business ultimately recognizes the total
benefit from an APM implementation.
Step 2: Definition
Once the proper boundary has been selected and agreed upon for the initial APM rollout,
the next step involves formally defining each of the components that make up that
service. This process can be quite exhausting, as it requires a precise decomposition of each
element that makes up the service. A number of data points must then be defined for each
component that has been isolated. Consider the following data points as the minimal set
that are critical to creating that component’s abstraction:
• Component Name and Description. For each isolated component, a unique name
and description of that component is necessary for later tracking.
• Users. Understanding the type of users as well as the expected count of users that
will make use of each isolated component is useful for identifying the level of
affected users during an outage.
• Outage Impact. Valuating the revenue impact associated with a loss of this service
helps derive data for executive dashboards.
• Baseline and Desired Behaviors. Technical metrics and their desired threshold
levels are necessary. These metrics will link back to observed behaviors through an
APM solution’s monitoring integrations. It is likely that the definition of these
behaviors will consume the largest part of your definition activity.
• Dependencies. Lastly, identifying the dependency chain among components helps
to later draw the interconnections between elements in the Service Model.
Remember that dependencies needn’t necessarily be limited to directly required
services. Dotted-line dependencies are also important towards capturing the impact
associated with any loss of service.
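The minimal data-point set above could be captured in a simple record such as the following sketch. Every field name and value here is illustrative, standing in for whatever your definition spreadsheet actually contains.

```python
from dataclasses import dataclass, field

# An illustrative per-component record covering the five data points:
# name/description, users, outage impact, desired behaviors, dependencies.
@dataclass
class ComponentDefinition:
    name: str                    # Component Name
    description: str             # Component Description
    user_types: list             # kinds of users served
    expected_users: int          # expected user count
    outage_cost_per_hour: float  # revenue impact of an outage
    desired_thresholds: dict = field(default_factory=dict)  # metric -> limit
    dependencies: list = field(default_factory=list)        # component names

web_tier = ComponentDefinition(
    name="Inventory Web Server",
    description="Front-end web tier for the inventory system",
    user_types=["customer", "warehouse staff"],
    expected_users=1200,
    outage_cost_per_hour=15000.0,
    desired_thresholds={"cpu_percent": 80, "http_errors": 0},
    dependencies=["Inventory Database", "Extranet Router"],
)
```

A record like this maps one-to-one onto a spreadsheet row, which is why the next phase's translation into the Service Model designer can be largely mechanical.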
Data gathered during this phase is most often captured into an external spreadsheet or
database. That external document can be used by the project team in the next phase as the
Service Model itself is built. It is worth mentioning here that this definition phase also has
the benefit of documenting a system in ways that may not have been previously
accomplished. Thus, your level of process maturity naturally increases as a function of
completing this documentation activity.
Step 3: Modeling
Once finished with the formal definition of each component, your next step will be to use
this data to construct the model itself. This process translates the externally-captured
information in your spreadsheet into a form that works within your APM solution’s design
tool.
The process of actually creating the model should be relatively trivial if the right level of
effort was placed into the previous step. With the right level of detail in your definition
spreadsheet, you should already know which elements should be entered as well as their
dependencies and initial metric thresholds. The net result of this activity will be a
completed Service Model that is ready to accept monitoring data once integrations are in
place.
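Assuming the definition spreadsheet is exported as CSV with a semicolon-separated dependency column (an illustrative convention, not a product requirement), the translation into a dependency map might look like this sketch:

```python
import csv
import io

# Hypothetical CSV export of the definition spreadsheet: one row per
# component, dependencies separated by semicolons.
SPREADSHEET = """\
name,dependencies
E-Commerce Web,Inventory Database;Extranet Router
Inventory Database,
Extranet Router,
"""

def build_model(csv_text):
    """Turn the definition spreadsheet into a name -> dependencies map."""
    model = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        deps = [d for d in row["dependencies"].split(";") if d]
        model[row["name"]] = deps
    return model

model = build_model(SPREADSHEET)
```

In practice the designer tool performs this step through its interface; the point of the sketch is that a well-prepared spreadsheet makes the modeling phase a near-mechanical transcription.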
Step 4: Measurement
The fourth task in this process begins with the installation of monitors into your
environment. This process ties the Service Model’s empty framework into the actual data
that it will use in generating its calculations. The actual installation of monitors can take an
extended period of time, as the sheer magnitude of change control required is substantial:
implementing APM's wide swath of monitors means hooking into every part of your environment.
Properly managing this installation will require the support of change control as well as
configuration control stakeholders in your environment. It will also require the technical
support of each technology’s administrators, an activity that can involve personnel and
project management. Further, achieving the buy-in of each component’s administrative
stakeholder can require assistance from business management due to the politics of
systems administration.
Note
For this reason, be aware that this step can be disruptive to your business
service if not completed with care. The installation of agents, incorporation of
agentless monitoring, and initial rollout of EUE monitoring can impact the
normal production of the environment for an initial period until monitors are
correctly tuned.
Once implemented, your APM environment will need a period of steady state to allow those
incorporated monitors to gather their data and begin filling out the picture of service
behavior. This period can be relatively short or rather extended as you ensure your
monitors are gathering the right kinds and volumes of data.
Step 5: Data Analysis
During the data analysis phase, installed monitors should be evaluated for their
effectiveness in gathering the right data to fill out the Service Model. As with the previous
phase, this phase tends to require the involvement of each component’s administrative
stakeholders. Those stakeholders can assist in quantitatively identifying the behaviors that
should be captured as well as tuning the appropriate monitoring thresholds. Even with
APM solutions that leverage templates, those common settings must be correctly re-tuned
for each particular environment.
Step 6: Improvement
Through any data analysis phase, you will find errors in Service Model development or
inconsistencies in the data being gathered. The sixth step in this process needn’t
necessarily be an isolated step that occurs after step five. This improvement process of
retuning settings can occur in parallel with data analysis, iteratively identifying and
resolving areas of improvement in the overall implementation.
Step 7: Reporting
Lastly is the reporting phase. This final step in an APM installation concerns itself with the
development of visualizations for the rendering of data. This process in and of itself can
require its own project plan, as multiple types and classes of visualizations are likely to be
desired by the different users of the system:
• Triage. Help desk individuals are often the first line of defense in identifying and
isolating problems within IT systems. For this reason, visualizations that enable this
team to recognize that a problem is occurring and provide initial triage support are
necessary. These visualizations tend to be of low data resolution, providing just
enough information to help teams identify which resources to engage for resolution.
• Administrative and Developer. Administrators require their own set of
visualizations for understanding the behaviors of their systems under management.
This set commonly deals with low-level behaviors that can be managed by
administrators as well as visualizations that assist with deep troubleshooting.
Developers need their own capabilities as well; however, these tend to lean towards
detailed information about code operations and performance. Developer
visualizations are commonly used in finding and resolving areas of non-
optimization in custom code.
• Management. Technically-minded individuals needn’t be the only consumers of an
APM solution’s data. Those in management as well can leverage an APM solution for
identifying long-term trends in service usage, revenue impacts from both outages
and successes, and high-level project status.
• End User. An APM solution’s visualizations can also provide benefit for a service’s
end users. End users appreciate when they’re given useful information about the
functionality (or non-functionality) of the systems they’re working with. Users who
are given actionable information about system status can make their own decisions
about working with the system or delaying attempts until full functionality is
restored.
Cross Reference
A much larger discussion on the creation and use of APM visualizations will
occur in Chapter 7.
Keep in mind that an APM solution comes equipped with a set of tools that enable the rapid
troubleshooting of applications as problems occur. The Service Model is only one of those
tools. Leveraging specific tools for network management, server monitoring, and
transaction following will assist with identifying the behaviors that are potentially of
interest within the model. One example of the use of such tools relates to the server
performance information displayed in Figure 6.7.
Figure 6.7: Server discovery tools identify performance conditions in the existing
infrastructure.
The data gathered from a tool such as this one provides a glimpse into performance
conditions within the existing infrastructure. In this example, the processor utilization and
queue length are measured along with virtual memory usage and interwoven queue
percentage. Information gathered through current and historical analysis of components
helps to identify the actual thresholds being experienced on a particular piece of hardware.
You can see in Figure 6.7 that virtual memory tends to average around 15% utilization over
the measured period. This information becomes a starting point to develop an
understanding for the baseline behavior of a service component. By recognizing that this
server’s baseline utilization of virtual memory remains around 15%, it is possible to tune
its metric to a more precise level.
This example shows how the tuning process requires effort to both tune and de-tune
configured counters. Finding the correct setting for a metric that is experiencing too many
false positives is relatively easy: Simply dial down the metric’s threshold value until it no
longer turns red during a period of acceptable performance.
However, this is only one half of the necessary tuning activity for counters. The
other half involves watching for false negatives. In this case, metrics are de-tuned
too far and, as such, aren't providing value in alerting on system behaviors. In
these cases, it is often necessary to look back at current and historical performance to
identify when counters aren’t tuned enough.
This is particularly important for situations when a monitor may be rapidly shifting
between states due to a problem with a component or when its threshold is tuned to a level
between existing states on a system. Monitors in a rapid state change condition can be
exceptionally difficult to track when historical analysis capabilities are not present in the
APM solution. Your APM solution should include the ability to look into the short-term past
for identifying when a monitor’s state has changed. Situations where rapid state changes
are occurring with a particular monitor threshold should be resolved through retuning.
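The time-limit idea described earlier doubles as a damper for rapid state changes: a monitor only switches state after several consecutive samples agree. The sketch below is illustrative; the sample count and state names are assumptions.

```python
from collections import deque

class DampedMonitor:
    """A monitor that goes unhealthy only after `required_breaches`
    consecutive samples exceed the threshold, and returns to healthy
    only after the same number of consecutive clean samples."""

    def __init__(self, threshold, required_breaches=3):
        self.threshold = threshold
        self.recent = deque(maxlen=required_breaches)
        self.state = "healthy"

    def observe(self, value):
        self.recent.append(value > self.threshold)
        if len(self.recent) == self.recent.maxlen and all(self.recent):
            self.state = "unhealthy"
        elif not any(self.recent):
            self.state = "healthy"
        # Mixed recent samples leave the state unchanged, damping flaps.
        return self.state
```

A monitor flapping between 79% and 81% CPU against an 80% threshold would never accumulate three consecutive breaches, so its reported state stays stable.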
A business that expects use of its system by only users in the United States can assume that
the remaining period of the day corresponds to their application’s period of low use.
Outages during that period are likely to impact relatively few users compared with an
outage occurring during peak hours. Maintenance activities are also commonly scheduled
during these low-usage hours.
The timing calculations for applications across only a few time zones are relatively easy.
But consider the complexity that occurs when that same application is used by users
in the EMEA and Asia-Pacific areas as well. The multiple-hour differences in time zones
between the continental United States, EMEA, and Asia-Pacific regions mean that some
traditionally low-usage periods actually correspond to times when other regions are
beginning their workday.
Service Calendar information aids an APM solution by identifying when the high-usage and
low-usage patterns exist for the monitored application. Based on the source for expected
users, an APM’s Service Calendar can assist triage teams with identifying the true level of
impact to global users. It further assists administrators with finding the best time of day to
schedule outages for updates and fixes. When completing your Service Model, consider
including Service Calendar information into its calculations to ensure that a true level of
impact is recognized.
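A Service Calendar lookup along these lines might be sketched as follows. The regions, UTC offsets, working hours, and user counts are purely illustrative assumptions.

```python
from datetime import datetime, timezone

# Hypothetical calendar: expected active users by local hour per region.
EXPECTED_USERS = {
    "US":   {"offset": -5, "peak_hours": range(9, 18), "peak": 5000, "off": 200},
    "EMEA": {"offset": +1, "peak_hours": range(9, 18), "peak": 3000, "off": 100},
}

def expected_impact(when_utc):
    """Sum the users expected to be active across regions at a UTC time,
    giving triage teams a true global impact figure for an outage."""
    total = 0
    for region in EXPECTED_USERS.values():
        local_hour = (when_utc.hour + region["offset"]) % 24
        active = region["peak"] if local_hour in region["peak_hours"] else region["off"]
        total += active
    return total
```

Under these illustrative figures, an outage at 14:00 UTC lands inside both regions' workdays and affects far more expected users than one at 03:00 UTC, which is also why maintenance windows gravitate toward the latter.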
Yet the Service Model is not designed for direct use by APM's end consumers. That
Service Model exists in the background, to be used by APM's calculation
engine. For users, a set of visualizations or “dashboards” are commonly constructed that
provide the right level of information. Chapter 7 discusses how those dashboards are
created and used while presenting a run-down on common dashboard elements you’ll find
in an effective APM solution.
Chapter 7
One of those new tactics he now cradles in his hand as he orders a second drink. Through his
PDA’s Web browser, Dan is showing a fellow conference attendee some of the visualizations
from his new APM solution.
“So, here’s our internal Web site,” Dan explains, “On this site, I can take a look at the rate of
incoming orders. I can see which events are tracking to expectations and which ones might
need a little extra help in getting sold. Over here, I can track revenues on a per-month basis,
per-day, or even down to the individual hour.”
The other attendee is Lee Mitchell, CEO of Dan’s closest competitor and a long-time personal
friend. Lee leans in close to peer at the PDA’s not-entirely-tiny screen. On that screen are
metrics he’s used to seeing in his own systems. But something is different, and he can’t put his
finger on it.
Lee counters, “That’s great, Dan. But we’ve been able to pull these metrics for years. I’ve got
something fairly similar back at the office, although I’ll admit that pulling it up on the PDA
nets you extra ‘gee-whiz’ points. My guys have built something like this that pulls reports right
out of our accounting system to get me the same kinds of information.”
Dan smiles because this is exactly the road he’s been leading Lee down for the past 20
minutes. His old system was a lot like Lee’s in that he could pull metrics. But those metrics
always needed to be pulled. Once a report was generated, it was only a static representation
from a single system. With his APM solution, everything is real time and integrates not only
with the accounting system but also his entire IT infrastructure.
“A-ha!” proclaims Dan with a grin, “I know what you’re talking about, and that’s exactly
where we were about 12 months ago. Now, take a look at this…”
Dan clicks on a few places on the screen and switches the view to a geographic representation
of his multiple data centers. Each data center shows a stoplight chart—green, red, or
yellow—displaying each data center’s representation of overall health. Dan continues, “In this
view, I can see which of my data centers are meeting which parts of their SLAs, and which
aren’t servicing my customers.”
He clicks to drill the view into the metrics for his data center in Rochester, New York, which
curiously shows a yellow condition. Seeking more details, he discovers that his Rochester ISP is
experiencing a problem that impacts his bandwidth to the Internet.
“So, here you can see that we’ve got an issue in Rochester,” Dan continues. “Some Internet
device is probably having a problem, which means that fewer people are connecting through
that point of presence.”
Lee scratches his head, “I’m still not seeing the a-ha moment here, Dan. I’ve got this kind of
data as well. My crew would be looking at this in the NOC right now if we were having a
problem, which, come to think of it, we might in Rochester if you are too.”
Dan chuckles, “Here’s the a-ha. All of this data is being gathered from all of my systems,
crunched through some service quality as well as business logic, and presented to me all at
once. Want to see something really impressive?” Dan clicks a few more links on the page, “This
screen aggregates my revenue impact data with that system performance data. It tells me
exactly how many users are impacted by the Rochester situation, how much money I’m losing,
and where my data center manager needs to send teams to fix the problem. The entire
infrastructure is completely visible, right here through Web pages I can access on my PDA.
“It gets better. Even my developers use it to trace specific lines of code that aren’t working
correctly. Everyone from the techies to my aging brain gets the visualizations they need,” Dan
stops as the formerly-yellow light turns green, “Hey, looks like they’ve fixed the problem!”
Lee’s eyes widen as he realizes the complete vision such a system brings, “Alright, you win.
Drinks tonight are on me. Now, tell me more about this system.”
It is the word “useful” that is most important in the previous sentence. “Useful” in this
context means that the visualization is providing the right data to the right person. “Useful”
also means providing that data in a way that makes sense for and provides value to its
consumer.
The concept of digestibility was first introduced in this book’s companion, The Definitive
Guide to Business Service Management. In both guides, the digestibility of data relates to the
ways in which it can be usefully presented to various classes of users. For example, data
that may be valuable to a developer is not likely to have the same value for Dan the COO.
Dan’s role cares less about the failure of an individual network component than about
how that component impacts the system’s customers. Each person in the business has
a role to fill, and as such, different views of data are necessary.
Yet what’s interesting is that each of these different types of data must still be gathered to
be useful. Lee’s solution gathers data from only a single location, namely his accounting
system. As a result, his results can only be based on the quality of data available from that
system. If, for example, a network problem occurs in a data center, Lee’s accounting system
can’t factor the problem into its reports. As a result, Lee’s data isn’t fully representative of
the actual conditions “on the ground” across his entire customer services infrastructure.
You’ll find that this integration of business metrics into traditional monitoring represents a
key way in which APM impacts business decisions.
Cross Reference
Chapter 9 will discuss this linkage in further detail, focusing on how the topic
of Business Service Management (BSM) is impacted by the data gained
through an APM solution.
Even when dollars and cents calculations aren’t part of an APM Web page, visualizations
assist the business in other ways: They provide a mechanism for finding faults in the
environment. They enable traceability from the initial discovery of the fault down to its
actual root cause. They also enable an otherwise-impossible glimpse at the medium- and
long-term health of the system, displaying hard-number metrics that report on the quality
of service being delivered to customers.
In this chapter, you’ll step through a series of mock-up visualizations that illuminate these
situations and others. The goal here is to show you smart ways in which visualizations can
be generated out of the monitoring integrations we’ve laid into place in previous chapters.
Note
This chapter’s pictures show how Web-based dashboards and their
visualizations bring value to raw data. Being Web-based, these dashboards
are designed to show a large amount of data at a single glance. As such, they
can appear very small in print. This is done intentionally to illustrate how
much data can be consolidated into a traditional browser-based view.
For our purposes, the design of the visualization and the ways in which it
represents its data is more valuable than the actual data.
Figure 7.1: A top-level stoplight chart that alerts when systems violate established
metrics for service quality.
A visualization like this is useful for service desk employees as well as administrators
because it answers the top-level question “Are you functioning?” If all the lights are green
in this graphic, administrators and service desk employees can be assured that the system
is and has been functioning to expectations.
In contrast, when any of the visualization’s cells changes, it can be assumed that some
change has also occurred in the application. Both “when” and “where” questions are
answered at the same time, with the top representation showing the location and count of
affected users and the bottom showing how long the problem has occurred.
In this example, the second line associated with the online banking system began
experiencing a yellow condition at roughly 4:00a, which escalated to a red condition at
around 6:00a. Not all users are impacted, with those in EMEA experiencing a greater
number of impacted users than others.
That information comes through an APM solution’s drill-down visualizations. In Figure 7.2,
the visualization for the online banking system has been drilled down to view a few of the
different technology elements that enable its functionality. Here, servers, databases, the
network infrastructure, and software elements are all shown with a slightly-greater level of
granularity over the very simple graphic shown in Figure 7.1. The additional bits of
information provided here help the service desk identify that the problem is likely due to a
server fault, helping them identify which group of individuals may be best suited to resolve
the situation.
Yet this level of data is still not something that is useful for a troubleshooting
administrator. At this point, the presence and domain of the problem have been identified,
but its location within that domain remains unrecognized. In order to determine that
information, even deeper monitoring integrations are required.
Remember that an APM solution gathers its metrics from multiple sources. Those sources
can be the instrumentation within the applications themselves, they can come from various
network components and probe devices, or they can come from the actual server metrics
themselves.
Figure 7.3 shows how the information in Figure 7.2 can be expanded further to view the actual
server monitor that initially tripped the alarm. Here, red lights are seen for the online
banking system as originally seen in Figure 7.1, and ultimately that system’s infrastructure
elements. One of those elements is the Web server at 10.4.224.42. Drilling down into the
details of that element, it appears that the server is experiencing a CPU overuse condition.
With this information in hand, the service desk can now transfer ownership of the problem
to the correct set of administrators for its resolution.
Administrators
Top-level visualizations like the ones previously shown are useful for an IT environment’s
first responders. Once an alert associated with a problem system has been raised,
administrators can be notified to track that problem to its root cause. It is within this
step that APM most obviously optimizes the triaging process.
But triage and resolution are two different things. Recognizing that a CPU overuse
condition has occurred on a server does nothing to assist in bringing that issue to
resolution. That task lies with the business’ administrators, who must first identify its root
cause, a process that can also be assisted through APM visualizations.
For each system component in this graphic, multiple types of data are presented. The front-
end server’s count and rate of transactions are specified, along with the C-N-S Spread for
those transactions. Web server and application server transaction details are also aligned,
providing—like before—a single glimpse of system health across each element.
For this example, assume that metrics were laid into place prior to the fault. These metrics
quantified the acceptable and unacceptable behaviors across each monitored system
component. For example, the metric for HTTP server errors might have been configured to
alert when the count of errors grew beyond zero. As a result, a troubleshooting
administrator can quickly identify by the red-colored columns that a greater-than-
acceptable quantity of HTTP errors are occurring.
Further, the amount of C-N-S “Server time” spent by both front-end and Web servers is
greater than expected. The combination of these two pieces of information helps the
administrator further track down the possible CPU overuse condition.
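The alert rules described in this example might be expressed as a simple evaluation function. The metric names and the 500 ms server-time limit are assumptions for illustration, not values from any real dashboard.

```python
# Illustrative metric evaluation: any HTTP server errors at all turn
# that metric red (a count configured to alert above zero), while
# C-N-S server time gets a softer, tunable limit.

def evaluate_metrics(http_errors, server_time_ms, server_time_limit_ms=500):
    """Return the red/green state of each configured metric."""
    return {
        "http_errors": "red" if http_errors > 0 else "green",
        "server_time": "red" if server_time_ms > server_time_limit_ms else "green",
    }
```

With both metrics red at once, as in the narrative above, the combination points the troubleshooting administrator toward a server-side condition rather than a network one.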
Note
It is important to recognize that any APM solution is equipped with a
dashboard designer. This designer enables these visualizations to be
modified as necessary to suit the needs of the consumer. Thus, if your
business needs a view like the one in Figure 7.2 but with different data, it is
possible to create a slightly different view.
IT Management
Directing IT personnel in a cohesive manner is another activity that can become an ad hoc
exercise without the right data. Consider the situation when multiple problems occur at the
same time in an IT infrastructure, a situation that isn't terribly uncommon. When multiple
parts of a large and complex system experience problems at once, directing personnel to
the resolutions that have the greatest impact is exceptionally important.
You might, for example, send out a team of administrators to fix a database problem when
that team's time would be better spent fixing a simultaneous email server problem. As
systems grow in complexity, determining the right way to provision your human resources
can be as problematic as running the system itself.
An alternative way to handle the provisioning of resources to problems uses the same data-
driven approach as the previous example. Rather than making educated guesses on which
problems impact which users, APM can build this information out of the data it gathers
from your system components. Figure 7.5 shows how this information might look in a
resulting visualization.
The module shown here is one piece of a larger visualization, with the availability and
service quality information for each application not displayed. Figure 7.5 does, however,
show a list of applications that may or may not be in a degraded state. For each application
that is experiencing a problem, the count of affected and total users is displayed.
Graphics like this one quickly assist IT management with directing troubleshooting
resources to the applications with the largest impact on operations. Here, the online
banking system’s outage impacts more than 50% of its total users, making it a greater
priority than the email system (or any other outage) for resolution.
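A minimal sketch of this impact-based ranking, assuming per-application counts of affected and total users are available. The application names and numbers below are hypothetical:

```python
# Rank degraded applications by the share of their users currently affected.
def rank_by_impact(apps):
    """Sort applications so the outage with the largest user impact comes first."""
    return sorted(apps, key=lambda a: a["affected"] / a["total"], reverse=True)

outages = [
    {"app": "Email",          "affected": 120,  "total": 900},    # ~13% of users
    {"app": "Online Banking", "affected": 5500, "total": 10000},  # 55% of users
]
priority = rank_by_impact(outages)  # Online Banking ranks first
```

A real APM product would weigh more than raw user counts (revenue impact, SLA penalties), but the ordering principle is the same.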
Visualizations also help IT management with the planning and budgeting aspects of their
job. In this case, historical data can be used to create visualizations that document where IT
is spending the majority of its time. Figure 7.6 shows a Pareto chart created to document
the number of outages over a period of time for a set of business services. Pareto charts
highlight the most important issues among a set of potential issues. The bar chart for each
business service documents the number of issues for that service, while the line graph
shows the cumulative frequency of occurrence.
In this case, a historical Pareto chart gives IT management the data it needs to identify
where the majority of issues occur within an environment. In this example, Trading, Citrix,
and Credit Services represent the top three sources of issues seen in the environment over
the measured period of time. Because these three services are experiencing the highest
count of issues, they make excellent low-hanging fruit for expansion or re-architecture
activities.
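The data behind such a Pareto chart is straightforward to compute. A hedged sketch follows, with invented outage counts for a handful of services:

```python
# Sort services by outage count and accumulate the cumulative-frequency line.
def pareto_rows(outage_counts):
    """Return (service, count, cumulative %) rows, largest contributors first."""
    total = sum(outage_counts.values())
    rows, running = [], 0
    for service, count in sorted(outage_counts.items(), key=lambda kv: -kv[1]):
        running += count
        rows.append((service, count, round(100 * running / total, 1)))
    return rows

counts = {"Trading": 40, "Citrix": 25, "Credit Services": 20, "Email": 10, "Payroll": 5}
rows = pareto_rows(counts)
# In this sample, the top three services account for 85% of all outages.
```

The bars of the chart come from the per-service counts, while the line is the third column.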
Business Executives
Some situations can arise that are not technical in nature and as such are the purview of
business executives. Perhaps an Internet connection from a particular service provider
experiences a problem that is caused by the actions of the service provider itself. There is
no technical problem with the connection; it is merely not meeting its Service Level
Agreement (SLA) obligations.
Traditional monitoring solutions might overlook these types of situations due to their
heavy focus on technical metrics. Yet an APM solution’s widespread reach across systems,
networks, and even external connections can identify when executive-level support is
necessary for solving what ends up being a contractual problem. Figure 7.7 shows a sample
dashboard module that merges business contract logic with availability information to
alert when SLA conditions are in breach of contract.
Figure 7.7: SLA fulfillment that is measured with actual data from monitoring
instrumentation.
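The contract logic behind a module like this can be as simple as comparing measured availability against the contracted target. A hedged sketch, assuming a 99.9% monthly SLA and invented downtime figures:

```python
# Compare measured availability against a contracted SLA target.
def sla_breached(up_minutes, total_minutes, target_pct):
    """True when measured availability falls below the contracted target."""
    measured_pct = 100.0 * up_minutes / total_minutes
    return measured_pct < target_pct

# A 30-day month is 43,200 minutes; 90 minutes of provider downtime
# leaves roughly 99.79% availability, which misses a 99.9% SLA.
breach = sla_breached(43200 - 90, 43200, target_pct=99.9)  # True
```

Real SLA clauses often exclude scheduled maintenance windows from the downtime tally, which would be an extra subtraction before this comparison.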
Even during nominal activities, business executives struggle with the need for information
that they often have no capability to understand. In our chapter example, Dan the COO
might not understand what a network router is when it fails, but he absolutely needs to be
notified when that router failure causes an impact to his business operations.
It is this dissonance between the data that executives can get and the data that they need
that is a primary motivator for APM incorporation. APM solutions—most especially when
they are installed as part of a much larger BSM solution—enable the executive to view
information that they can digest and that they truly care about.
Figure 7.8 shows what could be the simplest visualization for a business executive,
presenting the instantaneous performance of the IT environment in a dial format. It shows
very simply an aggregate percentage of how well that business executive’s services are
meeting the needs of their customers. The availability of the overarching service itself (via
an internal perspective) as well as the availability of the service to its users (via an external
user perspective) are noted in these twin graphics.
Yet these two graphics only show performance as it occurs at the moment they are read. To
get an executive-level view of service history over a medium-term period of time, another
module, similar to that shown in Figure 7.9, is commonly used. This graphic extends the
instantaneous representation of availability over a configurable period of time. Notice how
the easy-to-read format draws the eyes to situations that most business executives want to
prevent.
Figure 7.9: Another module that displays service quality over a period of time.
Dial and bar chart modules like these are commonly used as components of much larger
dashboards. In Figure 7.10, they and a number of other high-level modules are
consolidated to create a single-glimpse view that is useful for the business executive. This
dashboard includes visualizations that show service and user availability alongside service
quality and impact metrics for the business system’s various components. Notice how each
module presents its information in a slightly different way.
Dashboards like these provide information that gives business executives the confidence
that their systems are meeting the needs of their customers. With information that is
updated in real time, the business executive can reduce their need for operational status
reports from each component owner or manager. The result is that executives can spend
more time on value-added activities while reducing the level of attention necessary for
daily operations.
Code Developers
With business executives requiring a very high-level view of the environment, their polar
opposite is your group of code developers. This group requires an extremely detailed view
of the individual functions being run within a system, broken down into detailed
explanations of inter-device conversations and transaction data. You can argue that the
data needs of this group go even further than those of systems administrators, because
code developers actually create and manage the code behind your business system.
To that end, businesses that consider an APM solution must pay careful attention to the
capabilities that such a system can provide for this group of individuals. For example,
traditional monitoring solutions tend to suffer from the "shrink-wrap support"
phenomenon. Here, a monitoring solution very openly offers support for many common
technology products and platforms, such as specific databases, middleware applications, or
network devices. But your business service is likely composed of as much custom code as
these off-the-shelf applications. Thus, the ability to drill into the specifics behind the
inter-application communication is as important as support for the applications themselves.
For example, consider our previous situation where a Web server was experiencing a CPU
problem. Knowing that the Web server was experiencing an increase in CPU utilization is
less valuable than recognizing the exact Web page or code method that hasn't been
processor-optimized. An effective APM solution should have the capability to peer directly
into database transactions to find such optimization opportunities and present them via
visualizations to your developers.
Such a deconstruction is shown in Figure 7.11. Here, a very simple table has been
generated that contains details about the performance of the front-end Web site. The
specifics here relate to a series of end user operations and their effective performance.
Transactions associated with login pages, search and view policies, search processes, and
logout operations are aligned with their effective rate of performance.
Like before, the acceptable transaction rates for each of these activities have been
preconfigured within the APM solution's logic. As a result, in this image, a developer can
quickly see that each of the measured activities is performing to desired expectations.
Charts, graphs, and tables are never enough with this group of data consumers, as their role
is to always look for areas in which to improve the application. As such, even when an
application is performing to desired specifications, there are always places in which
database queries can be further optimized, Web pages can be accelerated, and applications
can be given more power to accomplish their jobs. Code developers are also charged with
continuing this process even as changes are requested to their applications, whether those
changes be updates to Web sites or deep-level code updates to support new lines of
business.
One central issue with this dynamic rate of change in many business services is measuring
their effective performance over time. Today's slowdown in performance might be related
to a run on a new product, with hundreds or thousands of new customers coming in as new
business. It could also be related to a bug fix that was implemented only to discover that
the fix caused more damage than improvement to the system. To that end,
multi-view visualizations like that shown in Figure 7.12 provide yet another way in which
multiple APM monitoring integrations can be tied together in a time-oriented way to track
down performance issues.
In Figure 7.12, three different views of a business service are gathered on a single pane of
glass. The top view shows the rate or volume of pages that are being rendered over a 24-
hour period of time. During that same period of time, the load time of those pages is related
along with an overall representation of application performance.
By aggregating each of these views over an equivalent period of time, a code developer can
quickly identify where correlations occur between different system activities. In this
example, it is easy to see that between the hours of 8:00am and 12:00pm, and again
between 1:00pm and 4:00pm, there is a substantial spike in the volume of Web site pages being
requested by clients. This volume changes from nearly zero to over a hundred Web pages
being rendered per minute by the Web server. At the same time, however, the load time for
these Web pages remains relatively steady. The application performance index of that Web
server also remains consistent over the monitored period.
With these three graphs aligned, a code developer can quickly determine that there
appears to be no correlation between the volume of rendered pages and their effective load
time (at least for the volume of pages that was monitored). The developer can then be
assured that the volume of pages being rendered is not impacting CPU performance and, as
such, does not necessitate code optimization or a hardware expansion.
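The "no correlation" judgment the developer makes by eye can also be checked numerically. A hedged sketch using a plain Pearson coefficient follows; the sample series are invented to mimic the pattern described, with volume spikes during business hours and load times that stay roughly flat:

```python
# Pearson correlation between page volume and load time, no external libraries.
def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)

pages_per_min = [2, 5, 110, 120, 3, 105, 115, 4]          # spikes during business hours
load_time_ms  = [411, 401, 411, 401, 411, 401, 411, 401]  # small jitter, unrelated
r = pearson(pages_per_min, load_time_ms)  # near zero: load time tracks nothing
```

A coefficient near zero supports the conclusion that rendering volume is not driving load time; a value near 1.0 would point back at capacity or code-efficiency work.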
If, however, the developer does find that some issue with the actual code is causing the
problem, alternate visualizations can be brought to bear to break down the processing of
that Web page into its disparate elements. As was first discussed in Chapter 3, any Web
page is rendered as the sum of a large number of individual parts. Those parts can gather
their data from internal databases or file structures, or can rely on external sources for
data. As such, when one of those parts or external sources is not performing to the level
needed by the Web server, the result is a reduction in performance.
Breaking down those transactions can be accomplished through a chart that looks similar
to Figure 7.13. In this chart, the individual parts that make up a Web page—graphics, HTML
code, scripts, and so on—are deconstructed by filename. Each file is rendered as part of the
Web page in a particular order, with some components overlapping. With the quantity of
elapsed time shown on the bottom, it becomes very easy for a Web developer to see where
delays in page rendering are impacting the overall perception of Web server performance.
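The waterfall idea reduces to a simple computation over per-file start and end times. A hedged sketch, with hypothetical filenames and timings:

```python
# Rank the files that make up a page by elapsed load time (hypothetical timings).
def slowest_components(timings, top=3):
    """Given {filename: (start_s, end_s)}, return the longest-loading files first."""
    durations = [(name, end - start) for name, (start, end) in timings.items()]
    return sorted(durations, key=lambda d: -d[1])[:top]

page = {
    "index.html": (0.00, 0.12),
    "styles.css": (0.05, 0.18),
    "logo.png":   (0.06, 0.22),
    "partner.js": (0.10, 1.95),  # an external source dragging the whole page down
}
worst = slowest_components(page)  # partner.js dominates the render time
```

Because components can load in parallel, the slowest single file, rather than the sum of all files, usually determines the perceived page delay.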
Remember that each individual Web page is the sum of its parts. With each one requiring
different parts for its complete execution, some pages can perform well while others
experience unacceptable load times. The problem in the unmonitored environment is
tracking down the problem pages among those that are performing well. To the unaided
eye, tracking down one problem Web page can be an extremely time-consuming process.
An APM solution can leverage its end user experience monitoring to keep records of page
performance on all pages at the same time. Aligning those pages to an index of performance
creates a table similar to Figure 7.14. Here, dozens of pages or more can be ranked by their
performance against each other. Pages that experience the highest levels of performance
are given a rating of one, with all other pages given a decimal rating below that number.
Pages with the lowest ratings are experiencing the worst performance, and as such, require
the greatest amount of attention by developers.
Figure 7.14: Measuring the Application Performance Index across multiple Web
pages at once.
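The rating described here behaves much like the industry's Apdex formula, in which satisfied samples count fully and tolerating samples count half. A hedged sketch, with assumed thresholds and invented load-time samples:

```python
# Apdex-style score: satisfied samples count fully, tolerating samples half.
def page_index(load_times_s, satisfied=0.5, tolerating=2.0):
    """Score a page from 0 to 1; a rating of 1.0 means every sample was fast."""
    ok = sum(1 for t in load_times_s if t <= satisfied)
    meh = sum(1 for t in load_times_s if satisfied < t <= tolerating)
    return (ok + meh / 2) / len(load_times_s)

scores = {
    "login.aspx":  page_index([0.3, 0.4, 0.2, 0.5]),  # all fast
    "search.aspx": page_index([0.4, 1.1, 2.6, 3.0]),  # mixed results
}
worst_first = sorted(scores, key=scores.get)  # pages needing attention first
```

Sorting pages by this score reproduces the ranking table of Figure 7.14: pages with the lowest ratings rise to the top of the developers' work queue.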
End Users
Lastly are the end users themselves. This often-ignored class of users is your ultimate
consumer, yet they are routinely forgotten when problems occur. To use a real-world
metaphor, keeping your Internet customers informed about situations is just as important
as an airline notifying you when your flight will be late. An uninformed customer is an
unsatisfied customer, so keeping them aware of situations with your Internet-facing
systems is critical to keeping them coming back.
The problem in many organizations is relating information to end users in ways that are
digestible to them. Also problematic is relating the right amount of information: not so
little that users grow annoyed or distrustful, and not so much that proprietary information
is given away to competitors.
For these reasons, end user dashboards similar to Figure 7.15 can be some of the most
difficult ones to correctly configure. You’ll see that dashboards of this type often include the
lowest resolution of information, while at the same time presenting enough data to users so
that they know when problems occur. Typically, three types of data points are given to end
users when creating dashboards like this one: information about current outages, current
status of the infrastructure, and data about any upcoming or planned outages in the future.
With these three pieces of information in hand, end users are empowered to know when
problem situations are actively occurring and when they can expect to return to the
business service with its full capabilities.
It is that determination of “what to do next” that is the topic of this guide’s next chapter.
Taking a new approach to the APM topic, Chapter 8 will depart from the traditional
conversation to instead tell a story. That story continues the saga of Dan and John and the
rest of TicketsRus.com, but over the course of an entire chapter. Chapter 8 will give you the
opportunity to see an APM solution in action, showing how TicketsRus.com’s APM
implementation can be used to solve a major problem from start to finish. Told in the
narrative format as each chapter’s story, you’ll learn a bit more about the company. You
might also learn a bit more about your own business, and how it goes about solving similar
problems today with—or without—an APM solution’s objective analysis capabilities.
140
Chapter 8
However, it's worth saying again that APM is all about its pictures. The return provided by
an APM solution comes from the data it presents to the multiple classes of users in your
organization: administrators, developers, business executives, Service Desk employees, and
end users. As such, truly understanding the value in those visualizations merits a second look.
You’ve hopefully been enjoying the chapter story of TicketsRus.com, and how Dan and John
and the entire cast of players have evolved along with their monitoring capabilities. You’ve
seen how John’s job has gotten easier as the notifications he and his team receive grow
more useful. You’ve also seen how Dan gets the quality information he needs to make data-
driven decisions as a business executive.
But what you really haven't experienced to this point is a complete walkthrough of the
entire process. Such a walkthrough can tell the extended story of a potentially scary
problem along with how the APM solution set assists. That storytelling happens in this
chapter. Its goal is to add a dash of humanity to the 20,000-foot perspective that has been
at the forefront thus far in this guide.
The story you'll be reading is entirely fictional but is based on the types of problems and
war stories that you probably experience every day. Every IT organization in every
business has its share of technology issues; they’re a fundamental part of doing business
with technology. This chapter’s story is written to show how the presence of a fully-
implemented APM solution enables every actor to do a better job supporting the needs of
the company as a whole.
You’ll also quickly notice that you’ve already seen many of the images here in previous
chapters. Many are slightly adjusted views of those seen in Chapter 7, where this guide's
extended discussion of pictures and data-filled visualizations was most prevalent. Where
feasible, the images have been altered to fit the storyline and its characters. This reuse is
done intentionally, to bring a sense of continuity between the topics that have already been
discussed and the story that ensues.
When triaging administrators become aware of problems through user interactions, and
when troubleshooting administrators must track down solutions based on gut feelings
rather than hard data, the resulting process is anything but optimized. In immature
environments that don't use a data-driven approach to problem resolution, nine steps are
common: awareness, assessment, assignment, handoff, transaction-level triage,
infrastructure-level triage, characterization, handoff, and solution.
Awareness
The first step in the unmonitored environment is often its most painful. Actually becoming
aware that an IT problem exists is one of the hardest hurdles to overcome when IT
organizations lack process maturity and monitoring integrations. Without formalized
systems for watching and reporting on system behaviors, the first acknowledgement of a
problem often comes after the users have been affected.
Assessment
“There’s a problem with the database? We’ll look right into that.” This statement is a
common response in the unmanaged environment. In immature environments, the
assessment of the problem often starts with simply verifying that the user’s called-in
problem actually exists. With little or no instrumentation present, this process requires
walking through the stated problem as presented to the Service Desk. For those with
traditional, stove-piped monitoring solutions, this step can also involve verifying stated
behaviors using each solution’s individual console while attempting to manually aggregate
their information. In short, without monitoring integrations, the assessment process here
consumes time in actually determining whether the problem even exists.
Assignment
All but the most nascent IT organizations today use some form of work order tracking
system to transfer ownership of problems from one team to another. This process, even in
immature environments, can appear to be well solved, but in reality may be masking
processes that are not optimized.
Consider the quality of the data that often makes its way into work orders as they are
created. When a user calls in to announce a problem, the triaging Service Desk operator must
insert the correct amount of useful and pertinent data into the work order. If this doesn’t
occur, the assignment process fails and forces the next person in line to again assess the
problem. The result is lost time and effort.
Handoff
The handoff process itself also suffers in an immature environment, primarily in
establishing the issue's priority. In unmonitored environments, it is functionally
impossible to determine the number of affected users except by sheer guess. Often, the
decibel level of the caller determines the priority of the work order rather than its actual
impact on business operations. The result is that problems get worked on in the wrong
order, satisfying “louder” callers over others with greater needs.
Resource
This problem is, in fact, so endemic to immature organizations that it is a
common theme in this guide’s companion The Executive Guide to Improving
Your Business Through IT Portfolio Management. There, the problem of
“decibel management” as it relates to IT projects is discussed in greater
detail.
Transaction-Level Triage
At the point of handoff, it is common for the ownership of problems to be forked towards
either developer or infrastructure teams. Developer teams by nature look at problems in
relation to their codebase, seeking out areas where individual transactions might be non-
optimized or service components may be broken.
Infrastructure-Level Triage
Infrastructure administrators, in contrast, look at problems with a more big-picture
approach. These individuals leverage their infrastructure experience and cross-system
vision to analyze issues in relation to all the other components in the infrastructure.
Characterization
The two teams discussed in the previous sections approach problems along different paths.
This double-pronged approach is particularly suited for the types of customized business
services that are common in customer-facing applications. However, there exists a
shortcoming when the problem cannot be characterized by either team. In this case, the
problem is often bounced back and forth between each team or their sub-teams, such as
"networking" versus "security," and so on. As will be discussed shortly, complex
problems that bridge problem domains require the support of multiple teams. Lacking a
common vision, the only way to bring teams together is through a "war room" approach.
Handoff
Handing off the problem once again from troubleshooting to issue management for its
ultimate resolution is yet another step in the process. This step often requires coordination
between change management and configuration management teams as well as
communication between each component’s stakeholders.
Solution
The final step is actually implementing the identified solution, a long way from the
problem's initial awareness eight steps earlier. You can see how this multiple-step process ensures that no
issue goes untracked, but at the same time, it creates a burden of process overhead for the
solution. Particularly problematic are those times when issues are much larger than a
single team can handle alone—such as when a code issue impacts the infrastructure as a
whole or when a change to the infrastructure breaks some piece of code.
In situations where multiple teams are needed to solve the problem, the “war room”
approach becomes the tactic of choice in many organizations. By bringing representatives
from every team into a single room, each becomes responsible for tracking down artifacts
of the problem within their zone of control.
The problem with this war room approach is in its costliness. War rooms are necessary in
the unmonitored environment because data is not consolidated into a single location for
common consumption. Individual team members can’t simply look to a Web page to see
how others are doing with the problem. To actually characterize a problem that exists
across multiple domains, the only way to share metrics in an unmonitored environment is
through personal contact.
A fully-realized APM implementation is very much like the digital representation of that
war room, but without its actors. As has been stated many times before, an APM solution
gathers its metrics from every part of the IT environment, consolidating them into single-
glimpse views for consumption by all teams. The result is that environment behaviors
across every domain can be seen by everyone without the need for “circling the wagons”
and resorting to qualitative “gut feelings” for potential solutions.
Visibility
Behaviors that occur outside expected thresholds trigger alerts via high-level visualizations.
Through drill-down support, the perspective and data found in that high-level visualization
can be narrowed to one or more systems or subsystems that triggered the failure. Using
tools such as service quality metrics and hierarchical service health diagrams, triaging
administrators can be quickly advised as to initial steps in problem resolution.
Prioritization
Counts of affected users are surfaced within an APM solution's interface, enabling
triaging teams to identify the actual priority of one incident in relation to others that are
outstanding. As a result, incidents with higher numbers of affected users or greater impacts
on the business bottom line can be prioritized above those with lesser effect.
Improvement
Throughout the entire process, the APM solution continues to gather data about the
system. This occurs during both nominal and non-nominal operations. The resulting
data can then be later used by improvement teams to identify whether additional
hardware, code updates, or other assistance is needed to prevent the problem from
reoccurring. By monitoring the environment through the entire process, after-action
review teams can identify whether the resolution is truly a permanent fix or if further work
is needed.
It should be obvious how this six-step process is much more data-driven than the
earlier traditional approach. Here, every team remains notified about the status of the
problem and can provide input when necessary through the sharing of monitoring data.
When problems occur that cross traditional domain boundaries, those teams can work
together towards a common goal without the need for war rooms and their subsequent
finger-pointing.
With this in mind, let’s see how this new approach works for the ongoing story of
TicketsRus.com. In this continuation of the story, John finds himself starting what appears
to be a regular day at the office. He begins his day by completing the tuning of a
behavior threshold within his APM solution.
The problem that is about to ensue can and will be handled much differently than in
previous situations. This time, John and his teams have the data they need at their
fingertips to turn what could be a disaster into a relatively minor impact on their daily
operations. Rather than repeating the public horror of their previous “Labor Day Incident,”
John and his teams keep the problem from growing out of control and affecting their end
customers.
As he begins this relatively trivial task, he thinks back to an afternoon last week. That day, a
false positive in this particular monitor caused a minor stir at the Service Desk. During that
afternoon, the company had just released a new set of tickets for a fairly popular concert to
happen in a large local venue. The band for that concert had just returned from an
extended break, and their fans were eager to see them perform once again. The resulting
rush created an abnormally heavy load on the Web site that the company hadn’t
experienced in a while, and at least one incident that was related to this very counter.
After the incident, a few code changes were also made to the system. The same heavy load
was expected this week as the same band’s second series of concerts were to be made
available to fans.
Clicking through his APM solution's designer tool, John brings up the Service Model for
TicketsRus.com's external Ticket Sales application. There, he finds the hierarchical diagram
that he and his teams had spent so much time tuning over the past six months. Including
representations for every component in the system, from the External Web Cluster all the
way through the Order Management System and others, this designer tool provides the
workspace where they built the logical representation of the overall system.
Figure: The Ticket Sales Service Model, with elements including Component Health, the
Kerberos Auth. System, Database Performance, the Inventory Processing System,
Transactions per Second (600), Memory Utilization, ERP System Processor Utilization, the
Inventory Mainframe, and the Extranet Router.
John double-clicks on the Inventory Processing System element and brings up its
properties. There, he is shown the list of health monitors that are attached to this particular
system. Each monitor is configured with one or more settings that define what
TicketsRus.com considers healthy versus unhealthy behaviors. Browsing down through its
fairly comprehensive list, he finds the culprit in last week’s alert brouhaha.
“A-ha!” says John to nobody in particular, “It looks like that last batch of tuning updates
dialed down the Red threshold on Transactions per Second to a number that’s far too low.
Let me just set this back to 600, where it belongs.”
He clicks to view the properties of the Transactions per Second element and makes the
appropriate change. He thinks to himself, “APM is a great solution, but you absolutely have
to keep on top of your tuning. Making sure that every alarm is tuned properly seems to be a
never-ending activity in this tool.”
Thankfully, his chosen APM solution arrived with a set of preconfigured profiles that got
the monitoring team started. Those profiles gave the team a set of starting points for each
class of hardware, application, and service that were based on known best practices—for
example, 80% for Web server processor utilization, 600 for database transactions per
second, and so on.
He recalls the first few weeks after its initial installation, when a few of those "suggested"
threshold levels weren't really all that tuned for his environment. During that period, he
had shown up for work with an entire screen of health stats flashing Red for warnings that
really weren’t all that relevant to his environment. “There aren’t that many Web-based
ticket sales applications out there that service an entire continent, so we had our hands full
with customizing for a while,” he reminisces to himself.
It took a period of time to get those metrics tuned just so, but there were always areas for
improvement in the system. That’s why he spent a period of each day keeping tabs on how
well the solution was doing its job. Today’s adjustment for the Inventory Processing
System’s metrics is no different.
Browsing around through some of the other screens, he notices something out of place.
"That's funny," he exclaims. "Now why is the country map showing a yellow condition for
one of my sites? The…Rochester site…it seems. We're not doing any work there today."
John calls up the Service Desk to see what the matter is. With Dan halfway across the world
at some ticket seller’s conference—“Yet another junket,” thinks John—John is technically in
charge of operations. And a major issue is all he needs.
“Hey, Eloise, this is John. Tell me about the yellow condition I’m seeing in Rochester.”
“Hi John. Well, it was yellow. Look again. APM now says we’re having a problem with the
entire Ticket Sales application, all across the map. Whatever it was, it started in Rochester
and spread to the entire country map. We’re looking into it now.”
John’s heart skips a beat as he recalls the Finals Week incident from not too long ago, a low
point in his otherwise illustrious career. If this problem is for real, it’ll be the moment of
truth for his relatively new APM solution. Another repeat of that Finals Week situation, and
he could be looking for a new job.
“I’m coming up to the NOC to manage the incident. I’ll be there in a minute.”
John drops the phone into its cradle. Before he leaves, he calls up his highest-level Service
Quality visualization for all his monitored systems. The board shows Green everywhere,
with the exception of a fairly disturbing strip of Red across the line marked Ticket Sales.
With the Overall column displaying 2674, John recognizes that nearly 3,000 people
aren’t able to buy tickets right now. He dashes off to the Service Desk to coordinate his
teams.
The NOC is a flurry of activity. Though, John thinks, this incident’s “flurry” doesn’t seem as
“frantic” as in previous major incidents. Rather than the usual rush of administrators
scurrying in and out, seeking updates and where they can help, this time the hubbub is
more like a murmur.
He heads to Eloise’s desk for an update. “Whatever it is, it appeared to start in Rochester
and spread through other parts of the system. Right now, we’re showing Red conditions in
the NA-East and Canada-East zones. The other zones are showing Yellow, and I’m not
sure what to make of that,” she states.
“Was anyone doing any work on the system when it went down?” John asks. Although
doing work on the system outside its designated maintenance windows wasn’t supposed to
happen, sometimes his eager administrators tried to sneak in work when they shouldn’t.
“As far as I know,” says Eloise, “nobody’s even been in the data center this morning. The
network administrators have been in a meeting, and most of the systems administrators
have been at their desks prepping for the next Change Management meeting.”
“OK,” John asks of the room, “Will someone call down to the systems admins and see if
anyone’s been up to anything? And will you,” he points towards one of the Service Desk
employees at random, “run down to whatever conference room the network
administrators are in and see what they know?”
By this point, pretty much everyone in the IT organization had probably received some
kind of notification about the problem. Being the money-making system for the company,
almost everyone was on notice when a problem of this scope occurred. The fact that they
hadn’t called into the Service Desk yet means either that they’re looking at their own
visualizations—proving, he thinks, that the processes surrounding his APM solution are
working—or that they’re all huddled around the water cooler. He hopes for the former.
“Eloise, let’s drill down a bit into these alerts and see what’s going on,” he says, turning to
face the large screen in the center of the NOC’s alert wall. Recently installed, it now operates as a kind of
giant heads-up monitor for the entire Service Desk to see at once. That monitor has access
to all of his APM solution’s visualizations, from developer to administrator to even Dan’s
business executive views. It has already come in handy on a couple of occasions.
John continues, “Bring up the high-level network view. Are we still seeing those odd
behaviors from last week?”
Eloise takes control of the projected screen’s computer, clicking through visualizations to
find the one John wants. She brings up the high-level network screen he’s interested in
seeing. It discouragingly shows more than a few elements that aren’t in Green.
“Somebody help me here,” John asks no one in particular, “We were working on these
network metrics just last week, and they’re still in Yellow and even a few Red. Are these for
real, or are these problems that are still residual from our tuning activities of last week?”
Right then John’s network engineer pops into the room, “Ignore those colors, John. That’s
the meeting that me and my team were in just a few minutes ago when this whole thing
started. We’re still tuning our part of the system. You aren’t seeing these error conditions
roll up right now because we’re still trying to determine what thresholds make sense for
us.”
“Have these changed from where they were last week?” John asks, “The colors look to be in
the same place, but is today’s ‘real’ situation impacting your numbers in any way?”
The engineer squints at the rows of numbers on the screen. They show his network
performance metrics alongside upstream and downstream loss rates, round-trip time, and
total bytes. “Everything looks OK. The numbers are of the same magnitude as what we
were seeing before any of this happened. It looks like the problem isn’t us this time.”
The engineer smiles, “Old habits die hard, man. Let me get back with my team and keep
digging around over there. We’ll see what we can find.”
“Hey, John,” Eloise catches his attention. While John’s been chatting with his network
engineer, she’s been browsing through a series of other screens, “It looks like today’s
problem might be with the servers. Check this out.”
Eloise has drilled down past the top-level screens to view each individual technical system.
There, she’s looking at a new visualization that shows health statistics for the system’s
Servers as well as Database, Network, and Software Infrastructure. The Service Quality
indicator for Servers shows Yellow.
“Double-click Servers. Let’s see if we can’t figure out which one’s having a problem today.”
Eloise double-clicks the Servers link to bring forward a new screen. On this one, she’s
presented a list of all the servers that make up the Ticket Selling system. “We know we can
ignore the network metrics right now. Frontline and Changepoint both seem OK, as well as
most of the others. Salesforce is unavailable now, but we know about that one. It’s the Site
HTTP that doesn’t look happy. It’s showing less than 70% for its application performance.”
“Now, we’re getting somewhere,” says John. He looks at his watch. The time shows 8:29a.
Exactly 7 minutes have passed since the first APM light went from Green to Yellow, “Bring
up the Service Tree for our Servers.”
Eloise drills further into the visualization, bringing forward the Service Tree. For
TicketsRus.com, the Service Tree for Ticket Selling servers is broken down into
Infrastructure, Application, Database, and Mainframe servers. The indicator for
Infrastructure Servers glows Red.
She clicks the plus sign next to Infrastructure, and sees that the Web Servers seem to be
triggering the Red alert. Drilling further, she exposes that the problem is with the Web
server at 10.4.224.42, one of the servers in the Ticket Selling application’s External Web
Cluster. One more click reveals that the fault lies specifically with the server’s CPU utilization.
“So, here’s a source of the alert,” John tells Eloise, “We now know the predominant
symptom of the problem. Let’s get this info to the systems administrators. I’ll call them. You
draw up a work order with this info.”
---
Half a world away, Dan’s been sitting at the bar with his old friend and now competitor Lee
Mitchell. With his PDA viewing the same Web screens as everyone in his NOC, he’s aware of
the situation as well. Right now he’s showing Lee the country map view on his PDA as one icon
flashes from Green to Yellow.
“So, here you can see that we’ve got an issue in Rochester,” Dan continues in his story to Lee.
“Some Internet device is probably having a problem, which means that fewer people are
connecting through that point of presence.”
Dan clicks past the country map to bring forward his availability dials, the two indicators he
looks to most to identify how well his systems are working. While talking with his friend, he
notices that the dial for User Availability has dropped to 87.4% and continues to go down.
With a couple of drinks in him, he’s in too good a mood to worry too much. He knows that
John’s got him covered.
---
Back at TicketsRus.com’s corporate headquarters, John hopes that Dan isn’t watching this
situation as it unfolds. It’s always easier to talk with him about these situations after
they’ve happened rather than when they’re going on.
He calls down to the systems administrators, “You guys got the work order, yes?”
“Yep,” reports Eric, the administrator who answers the phone, “We’re actually way ahead of
you. We were together working on a few things for today’s Change Management meeting
when the alerts started to come across. So, we’ve been digging into the metrics to see what
we can find for the past few minutes.”
While the Service Desk has been focused on high-level quality alerts, the systems
administrators have been focusing their attention on the actual performance metrics of
their servers. Given the data they’re seeing, this group suspected pretty quickly that the
problem might lie somewhere within their realm of control.
Eric brings forward what he jokingly likes to call his Monster Dashboard for the Web server
that’s having a problem. He likes this dashboard because it aligns all sorts of metrics by
time. He knows when he looks at this dashboard that every metric shows the same period
in time, giving him and his team a way to correlate different behaviors that may occur
simultaneously or at least very close to each other.
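The value of that time alignment can be sketched briefly. The following Python fragment (all data hypothetical; the guide names no specific tooling) correlates two metric series sampled on the same timeline, the way a time-aligned dashboard lets an observer spot behaviors that move together:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical samples taken at the same minute marks.
cpu_pct = [35, 40, 55, 70, 85, 95]        # Web server CPU utilization, %
resp_ms = [120, 130, 180, 260, 420, 900]  # response time, milliseconds

r = pearson(cpu_pct, resp_ms)
print(f"correlation: {r:.2f}")            # strongly positive: the metrics move together
```

A high coefficient does not prove causation, but on a shared timeline it tells an administrator which charts are worth reading side by side.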
The Web server’s Monster Dashboard presents nine performance views at once: CPU
utilization for the Web server in the upper left, response time for that Web server in the
upper middle, and HTTP errors in the upper right. Right away, he sees some
correlations between these three measurements that give him a bit more information
about the problem.
He focuses on each of these three modules in turn. “Yep, we’ve definitely got a CPU problem
here. It actually seemed to start a couple of hours ago, but is only now getting to the point
where it’s a problem. See this spike,” he points out the trailing spike in CPU utilization to
another administrator, “there’s our problem.”
“Now, check this out. You see how the response time for the Web server got quite a bit
worse as this problem escalated. Over the past few hours, we’ve been slowly creeping up to
the point where this problem began to be noticeable by users. That’s what tripped the
alarm.
“Didn’t we drop that second set of tickets today for that band’s new tour, the one that’s
been on hiatus for, like, 4 years? I wonder how much of this has to do with an actual
problem and how much has to do with their screaming fans trying to get their tickets?
“Well, now, doesn’t this just beat all. Check out the HTTP Error rate. Looks like we had a
spike there, and then it just dropped to nothingness. That’s not a pattern that we usually
see when the load is super-high. HTTP Errors should be right around zero all the time,
unless there’s something going wrong. To me, that makes this whole situation feel like a
developer problem. Something’s wrong in that latest code drop perhaps…?”
Eric ponders this recommendation for a few seconds more and decides to make the call. In
the old days, he thinks, this is when the resolution to the problem might get worse than the
problem itself. Right around now is when John would have been frantically calling
everyone in for a big meeting. We’d all need to bring our suggestions, draw them out, and
see which ones made sense. Eric didn’t like these meetings very much. They often went far
into the night.
This has him thinking, “You know, I never really liked this new APM system when John was
talking about it. I was always worried that it’d make my job harder. But now that I’m seeing
it here today, I’ve got to give the guy a hand. It’s making this process a lot…well…easier.”
Eric picks up the phone to call the developers as he finishes logging his impressions into
the work order. Also different from the previous system, he can link his research notes
directly to items on his dashboard. Since everyone in IT can read almost every dashboard—
those with the financial info are locked away, but the troubleshooting dashboards are all
common access—he needs only jot down the few notes he’s made over the past minutes
and paste links to the correct dashboard URL right into the work order.
On the other end of Eric’s phone call is Rhonda, one of TicketsRus.com’s lead developers,
“Yep. We’re looking at this too. We got the notifications just after you guys did. It is looking
like a problem that’s on our side. We ran that last batch of code updates through regression
testing for days but we must have neglected to run some kind of test. C’est la vie.”
Rhonda has also been looking at the problem through her own set of eyes. She first caught
wind of the problem while grabbing another cup of coffee down the hall a few minutes ago.
She figured she’d look into the problem a bit even before she got word that the ball was
indeed in her court.
Rhonda finds her answer as she pulls up her Apdex by Page URL report. In looking through
it, she chuckles a bit about how the executives and the Service Desk keep thinking about
this Ticket Selling system in terms of “quality.”
“What a subjective way to think about a deterministic system. Quality, heh,” she mutters to
herself, “What does quality really measure?”
As soon as the words come out, she stops to think about what she’s just said, “Well, I guess,
come to think of it, their numerical value for Quality is a lot like my numerical value here
for Apdex. Apdex in this case measures how well a particular page is performing, with a
value of one meaning ‘perfectly’ and those less than one describing their level of
performance in comparison with each other as well as the established baseline of ‘perfect.’
OK. So, I’ll admit each set of numbers has meaning to each set of people. I’ll buy that.”
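Rhonda’s comparison holds up because Apdex is itself a published formula: with a target response-time threshold T, samples at or under T count as satisfied, samples at or under 4T count as tolerating, and the score is (satisfied + tolerating/2) divided by total samples. A minimal sketch, using hypothetical page-load samples:

```python
def apdex(samples_ms, t_ms):
    """Standard Apdex score for a list of response-time samples.

    Samples <= T are "satisfied", samples <= 4T are "tolerating",
    and anything slower is "frustrated". Result ranges 0.0 to 1.0.
    """
    satisfied = sum(1 for s in samples_ms if s <= t_ms)
    tolerating = sum(1 for s in samples_ms if t_ms < s <= 4 * t_ms)
    return (satisfied + tolerating / 2) / len(samples_ms)

page_loads_ms = [300, 450, 500, 900, 1200, 2500, 8500]  # hypothetical
score = apdex(page_loads_ms, t_ms=500)
print(f"Apdex: {score:.2f}")   # 1.0 would mean every sample met the target
```

With that 8,500 ms outlier and a couple of frustrated samples, the score lands well below 1.0, which is exactly what the report surfaces to her.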
Getting back to the issue at hand, Rhonda takes a look through her Apdex numbers and
immediately sees that a few are absolutely not where they need to be. Four in particular
give her cause for concern. She thinks to herself, “That batch file shouldn’t be performing
that poorly, but it’s the gateway.jsp code that’s really at fault here. Three procedures are
absolutely unacceptable in terms of performance.”
She continues to herself, “I wonder how much these three procedures are impacting the
overall loading of the page.”
Rhonda switches her view to call up a view of the average page load time by element.
There, she can see each of the pieces of code that goes into a page load, and how much time
each requires to complete executing its portion.
“Wow. Eight, eight-and-a-half seconds to load that silly .JSP file?” Rhonda admits to herself,
“That’s waaaay out of the ballpark in terms of performance. Here’s why we’re having
problems loading pages for a whole set of our users. This is one of the new code pieces we
updated last week. We’re under pretty heavy load right now. There’s probably some non-
optimized code in the file that didn’t get caught by regression testing.”
Rhonda feels she’s getting closer to a root cause of today’s problem. She thinks, “The big
question now is, ‘How does this load problem relate to the CPU problems that Eric was
seeing? Usually a slow .JSP load doesn’t necessarily equal a processor overuse scenario…”
To answer this question, she needs to see the communication between the Web server and
one of its downstream databases. Rhonda remembers that John’s APM solution now gives
her some very nice transaction-level metrics between servers. In the old days, if she
wanted to see bits and bytes as they cross the wire, she’d first need to call up the network
engineers and have them set up a sniffer in the environment. With TicketsRus.com’s new
Change Control rules, that sniffer could take a while to get approved.
Now, with the APM solution, the sniffers are already in place. Rhonda needs only to find
and pull up the right trace. She does just that, and discovers that the extended load time has
to do with a double-looping construct in the .JSP code.
“A-ha! There’s this slightly-more-than-seven-second lag in this subroutine. It looks like
we’re looping inside of a loop. Amateur mistake. Sorry, everyone,” she mutters to no one
in particular.
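The mistake Rhonda describes is a classic one. The sketch below illustrates it in Python rather than the story’s unseen .JSP code: checking every incoming order against a list inside a loop rescans that list each time, while building a set once makes each membership test constant-time.

```python
import time

tickets = list(range(3000))        # ticket IDs already sold (hypothetical data)
orders = list(range(0, 6000, 2))   # incoming order IDs to check

def slow_match(orders, tickets):
    # Nested looping: 'o in tickets' rescans the whole list for every order.
    return [o for o in orders if o in tickets]

def fast_match(orders, tickets):
    ticket_set = set(tickets)      # build the lookup structure once
    return [o for o in orders if o in ticket_set]

for fn in (slow_match, fast_match):
    start = time.perf_counter()
    result = fn(orders, tickets)
    print(f"{fn.__name__}: {len(result)} matches "
          f"in {time.perf_counter() - start:.4f}s")
```

Both functions return the same result; under heavy load, only the second scales, which is why the nested-loop version surfaces as a CPU spike rather than an outright error.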
Rhonda knows exactly the subroutine she needs to adjust to fix the problem. Rewriting a
half-dozen lines of code, she calls up John to approve fast-tracking the fix.
“It is,” she tells him sheepishly, “and I’m the one at fault. That last batch of code included
a…ahem…bug that’s causing a delay issue when we get under load. I’ve tracked down the
problem and have already coded the fix. All I need from you is the approval to push it out to
the External Web Cluster.”
John harrumphs. He doesn’t usually implement code fixes this quickly, preferring to run
them through testing and QA first, but he needs to get this system back online now,
“Alright. Patch the code.”
Rhonda inserts the patched code into her automated staging and deployment system,
thoughtfully runs a quick test on it in her testing environment first, then pushes it out. She
waits.
John waits also. He looks at his watch again. It’s 8:34a. Exactly 16 minutes have passed since
the system alerted first on this problem. Sixteen minutes in the old days and they might
have gotten everyone into the same room by now. Today, 16 minutes later and John finds
himself again staring back at his country map. Without warning, his Red and Yellow icons
begin flipping back to Green.
---
On the other side of the world, Dan has finished his second drink at the hotel bar, as well as his
story to Lee about his new APM solution.
“It gets better. Even my developers use it to trace down specific lines of code that aren’t
working correctly. Everyone from the techies to my aging brain gets the visualizations they
need,” Dan stops as the formerly-yellow light turns green, “Hey, looks like they’ve fixed the
problem!”
Lee’s eyes widen as he realizes the complete vision such a system brings, “Alright, you win.
Drinks tonight are on me. Now, tell me more about this system.”
Yet, there’s one last part of this story that hasn’t been told. This guide has repeatedly
referred to the idea that performance metrics can be aggregated with business data to get a
financial perspective on IT systems. This capability relates to the topic of Business Service
Management (BSM), which is the topic for Chapter 9. By incorporating BSM’s financial logic
into the technical logic seen through an APM solution, stakeholders like Dan are greeted
with real-world impacts to dollars and cents based on situations like the one told here.
Chapter 9 is next, and it concludes the story.
Chapter 9
The previous chapter’s story also showed how effectively resolving problems requires a data-driven
approach, one with a substantial amount of granular detail across multiple devices and
applications. Using this approach, it is possible to trace a system-wide performance
problem directly into its root cause. By integrating into databases, servers, network
components, and the end users’ experience itself, a fully-realized APM solution is uniquely
suited to gather and calculate metrics for entire business services as a whole.
Yet the topics in the previous chapter’s story were fundamentally focused on the
technologies themselves, along with the performance and availability metrics associated
with those technologies. Its resulting visualizations were heavily focused on the needs of
the technologist:
• Service desk employees were able to track the larger issue directly into its problem
domain.
• Network administrators were able to identify whether metrics for network
utilization were within acceptable parameters.
• Administrators were able to use health and performance metrics to identify
symptoms of the problem.
• Developers were able to ultimately identify the failing lines of code and quickly
implement a fix.
Missing, however, in the previous chapter’s story is another set of business-related metrics
that convert technology behavior into useable data for business leaders. This class of data
tells the tale of how a business service ultimately benefits—or takes away from—the
business’ bottom line. It also creates a standard by which the quality of that service’s
delivery can be measured. It is the gathering, calculation, and reporting on these business-
related metrics that comprise the methodology known as Business Service Management
(BSM).
This is a critical approach to how IT brings value to the business; however, it isn’t one that
is used by all organizations. Those without high levels of IT maturity are intrinsically
unable to attain alignment between IT and the business.
Chapter 2 talked at length about this problem of IT and business alignment. It discussed
how different IT organizations display different levels of organizational maturity, with
greater maturity bringing greater business value. Like APM, BSM is a methodology that
both requires IT maturity and develops it.
Note
To better understand the concepts in this chapter, you may consider turning
back to review those first introduced in Chapter 2. As successfully
implementing BSM requires a high level of IT maturity, understanding
exactly how that maturity is developed and measured is important.
BSM and APM are two methodologies that are naturally linked by their requirements for
data. The information gathered through an APM solution’s monitoring integrations directly
feeds into the requirements of a BSM calculations engine. Performance, availability, and
behavioral data of the overall business service and its components are all metrics that aid
in calculating that service’s overall return. These metrics also provide the kind of raw data
that helps identify how well a business system is meeting the needs of its customers.
Figure 9.1 shows a logical representation of where BSM links into APM. Here, APM begins
with the creation of monitoring integrations across the different elements that make up a
business service. Those monitoring integrations gather behavioral information about the
end users’ experience. They collect application and infrastructure metrics as well as other
customized metrics from technology components. APM’s data by itself is used primarily by
the IT organization for the problem resolution and service improvement processes
discussed to this point in this guide.
Figure 9.1: Where BSM links into APM. Monitoring Integrations feed Business Service Views, which in turn support Service-level Expectations and Service-level Reporting within Business Service Management.
The addition of BSM creates a new layer atop this APM infrastructure. Here, the business
itself becomes a critical component of the monitoring solution. Business processes and
service level expectations are encoded into a BSM solution, with the goal of creating
business service views that validate and report on how well the technology is meeting the
needs of the business.
Figure 9.2: The technology underpinnings of a business service feed into BSM’s
Service Model.
This positioning of the Ticketing System at the model’s top is significant. Not shown in
Figure 9.2 but important for BSM’s Service Model is how multiple business services can be
represented in parallel through this augmented model. For example, the same organization
may support business services for Ticket Brokering and/or Vendor Management. These
services, which might or might not be available to customers in different geographic
locations, are added to the topmost level of the model and linked into its geographic
elements. The resulting multiple-service representation gives business leaders a
single-glance view across all the services and locations that make up their business.
Different here is the type of data represented within this level of the model. Here,
business leaders are primarily interested in information that measures how well the
technology is meeting the needs of the business, which is commonly manifested at a high
level through a metric referred to as Service Quality.
Quality is a term that has been discussed previously in this guide, yet thus far with a strong
focus on the technology underpinnings to business services. In the BSM perspective, the
idea of quality is formalized: it is a quantitative, numerical representation of the success
or failure of a business service.
Now, at first blush, assigning some numerical value to an abstract concept such as “quality”
seems inappropriate for a data-driven solution. But think for a minute about what makes a
quality service—one that meets the needs of its end consumers. To borrow a line of
thinking from Chapter 6:
• When the service is fully functional and meeting the needs of its customers, is that
service of high quality?
• When the service is fully non-functional, meeting the needs of no one, is that service
of high quality?
• When the service is functional but with some non-functional actions, is that service
of high quality?
• When the service is functional but with low assurance that user actions are being
fully accomplished per user needs, is that service of high quality?
• When the service is functional in appearance but fully non-functional, is that service
of high quality?
It is easy to see how the first question represents a business service that is operating
with a high degree of quality: The service is operational, and users are interacting with it
and accomplishing their needs. It is equally easy to see how the scenario in the second
bullet operates with zero quality: The service is completely down, and no one is
accomplishing anything with it.
Yet complex systems operate in more than just a binary “on” versus “off” state. A system
can be functional or non-functional, but it can also operate in many different states in-
between. For example, built-in redundancy can mean that individual component failures
may not affect that service’s overall availability. Those same failures may only slightly
affect its performance. Reductions in performance can also render a functioning service to
a state of slowness that no user would want to interact with. Here, the service is
operational, but not with any level of performance that can be considered “successful.”
Quality can also suffer when the loss or degraded performance of down-level
components leaves the service only partially functional.
As you can see, the situation gets murkier when the state of a system exists in this area
between “on” and “off.” In each of the final three bullets, how well is this system operating?
How well is it fulfilling the needs of its users? As an example, in the third bullet’s scenario
the service is mostly functional with some non-functional actions. If some users can
accomplish their tasks, does this make its quality higher than in the fifth bullet’s scenario
where the service appears functional but really isn’t working at all?
The answers to these questions are non-trivial. As such, this extended discussion is
intended to show that a spectrum measurement of “quality” is necessary in
addition to the simplistic green versus red representation of an element’s state. A
visualization of just that spectrum is shown in Figure 9.3. There, you can see how the
quality of service for multiple business services is shown in a single image. Visualizations
such as this one enable business leaders to identify which services are meeting the needs of
their customers, yet without being bogged down in the minutiae of their technology
underpinnings.
Consider the situation where the quality of a business service is constantly changing
between acceptable and unacceptable states (see Figure 9.4). When today’s quality
measurement is reported at 77 and yesterday’s at 95, it is easy to recognize that the
overall system’s effectiveness has diminished between the two days.
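How such a single quality number might be derived is worth a brief sketch. Real BSM tools encode this through their own rules engines; the weighting scheme and component scores below are purely hypothetical:

```python
def service_quality(components):
    """components: {name: (score_0_to_100, weight)} -> weighted quality score."""
    total_weight = sum(w for _, w in components.values())
    return sum(score * w for score, w in components.values()) / total_weight

# Hypothetical component health scores and business weights.
yesterday = {"servers": (98, 3), "network": (95, 2), "database": (90, 2)}
today     = {"servers": (60, 3), "network": (95, 2), "database": (84, 2)}

print(f"yesterday: {service_quality(yesterday):.0f}")
print(f"today:     {service_quality(today):.0f}")
```

One degraded component (here, the servers) pulls the overall number down without zeroing it out, which is precisely the between-on-and-off behavior the spectrum measurement is meant to capture.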
The first step in developing this sense of quality is obviously in creating the abstraction of
the business service that is its Service Model. This process has been explained in detail
throughout this guide. By developing the Service Model, IT and the business define the
elements that make up the business service as well as their interconnections. Both the
elements as well as the interconnections are important because each fills out the picture of
the service in its own way.
Once that Service Model is fully realized, the next step requires the mapping of service
levels, Key Performance Indicators (KPIs), user impacts, and revenue impacts atop its
structure. In this process, the metrics that define success or failure within business
processes are used as thresholds. For example, if the business defines a particular rate of
completed transactions to mean that the service is acceptably fulfilling customer needs,
that metric should be added to the appropriate element. Or, if the user drop rate from a
Web front end remains within a particular parameter, adding this metric in the appropriate
place is another useful threshold. In the following sections, consider a few of the metrics
that you likely already have in your business today.
Service Levels
Service level metrics are often first defined by the IT organization or through Service Level
Agreements (SLAs) between the business and IT. In these agreements, the business
identifies that services and their components must be available during certain hours with
an identified minimum of downtime. Mean time between failures, failure rate, allowable
downtime, and expected performance are all common metrics that can be applied to
network, server, application, or other elements by the business.
Highly mature organizations are often capable of adding performance-based metrics into
their SLAs as well. With these types of service levels, specific thresholds for element
performance are known and documented. Lacking an APM solution, these types of
performance metrics are often exceptionally difficult to gather and report on. Their
monitoring may be accomplished through multiple, non-integrated solutions that are
separately controlled by individuals within each technology domain. With such point
solutions in place, the sharing of information between domains can be difficult or even
impossible. However, when APM monitoring is extended across the IT infrastructure, the
business gains the ability to gather these kinds of performance-based metrics across many
different types of technology elements at once.
Service levels with external service providers are another area in which APM provides a
great assist. By gathering availability and performance metrics against contracted service
providers, your organization gains its own set of data to be used during outages. This
information is also useful when contract disputes require chargebacks for SLA breach
events with service providers. Any business service that relies on external services for a
portion of its activities requires this kind of monitoring to truly gain a representation of
overall service success.
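The arithmetic behind these service level metrics is straightforward. This sketch (hypothetical outage data) derives an availability percentage and a mean time between failures from an outage log over a 30-day reporting window:

```python
MONTH_MIN = 30 * 24 * 60      # reporting window: 30 days, in minutes

outages_min = [12, 45, 8]     # duration of each outage this month (hypothetical)

downtime = sum(outages_min)
uptime = MONTH_MIN - downtime
availability = 100 * uptime / MONTH_MIN
mtbf_hours = uptime / len(outages_min) / 60   # mean operating time between failures

print(f"availability: {availability:.3f}%")
print(f"MTBF: {mtbf_hours:.1f} hours")
```

Sixty-five minutes of downtime in a month yields roughly 99.85% availability, which would already breach a 99.9% SLA; having the numbers on hand is what makes the chargeback conversation possible.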
KPIs
KPIs are often business-oriented metrics used to define the success level of an organization
or business process. These metrics are often used to quantify activities that are otherwise
subjective in nature. These can be activities such as leadership development, the level of
customer-business engagement, and overall customer satisfaction.
As a rule, KPIs should be designed as actionable metrics, with the value of the metric
driving some necessary action by the organization. KPIs should also be defined to provide
information on status, trends, or variance to business artifacts such as plans, forecasts, or
budgets. These two elements are critically important to KPIs that will eventually be
encoded into BSM, as they provide a basis for quantifying its data. In essence, you need
metrics whose value eventually drives some change to the environment if they are to be
useful in the context of APM. By leveraging metrics that have a known reaction line, your
BSM solution can further be used to provide necessary alerting when that action needs to
be taken.
Mapping KPIs to business artifacts in addition to technology components also enables the
later assignment of dollar values to incidents. As you’ll learn in a minute, a fully-realized
BSM solution can highlight where expensive problems need immediate resolution or when
system or user behaviors are impacting the business bottom line.
Accomplishing all of this requires some sort of design tool. An effective BSM solution will
provide the necessary logic to match incoming KPI data to defined thresholds and
management reaction lines. That logic can be incorporated through user-defined rules,
through relationships between elements, or through complex expressions. Complex
expressions may use if-then-else expressions or regular expressions with which to
construct the necessary thresholds.
Figure 9.5 shows an example design tool where a complex expression has been constructed
that validates availability metrics. In this example, both minimum and expected thresholds
are described. The combination of these two values quantifies the behaviors that are
considered appropriate and inappropriate for the assigned element.
Figure 9.5: A BSM solution’s design tool provides a location where expressions can be
constructed based on incoming data.
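The kind of if-then-else expression described above can be sketched in ordinary code. The specific thresholds and the `classify_availability` function are hypothetical; a real BSM design tool would express the same rule through its own expression builder:

```python
def classify_availability(value_pct, minimum=99.0, expected=99.9):
    """Mirrors a design-tool expression with both a minimum and an
    expected threshold; the 99.0/99.9 values are illustrative."""
    if value_pct >= expected:
        return "normal"      # meets the expected service level
    elif value_pct >= minimum:
        return "warning"     # acceptable, but below expectation
    else:
        return "critical"    # below the minimum; trigger alerting

print(classify_availability(99.95))  # normal
print(classify_availability(99.4))   # warning
print(classify_availability(97.0))   # critical
```

The "critical" branch is the reaction line: when a metric crosses it, the BSM solution knows a management action is required.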
User Impacts
Chapter 8’s story discussed a few examples of how the level of user impact drives the
targeting of troubleshooting resources. It argued that those problems with a greater user
impact should in most cases be prioritized over those with a smaller impact. Defining those
user impacts is another activity that occurs within a BSM solution.
Using the BSM software’s designer tool, it is possible to identify the impact associated with
each of the elements in the model. For example, when a particular network connection goes
down to a geographic site, the number of users in that site is known to be affected by the
problem. Or, when one of a pair of clustered transaction processing servers goes down, it
can be assumed that the total level of processing will be reduced by half.
Once the user impacts for individual elements are known and entered into the system, the
Service Model with its interconnections is then used to identify the flow-up and flow-down
impacts for each element. This is represented in the simplistic example shown in Figure
9.6. Based on the dependencies encoded through the model’s interconnections, individual
element impacts can be combined to understand how many users are affected by a problem
with any element. In this example, the Inventory Processing System is known to have 1700
users, while the External Web Cluster is known to have 8300 users. Here, the loss of the
Inventory Processing System can impact its 1700 users as well as a portion of those who
use the External Web Cluster.
Figure 9.6: A simplistic example of how user impacts can be assigned to Service
Model elements.
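The flow-up calculation in Figure 9.6's example might be sketched as follows, using the 1,700 and 8,300 user counts from the text. The 50% impact fraction applied to dependent elements is an illustrative assumption, not a value from any Service Model:

```python
# Direct user counts per Service Model element (figures from the example).
direct_users = {
    "Inventory Processing System": 1700,
    "External Web Cluster": 8300,
}

# dependents[x] lists elements that depend on x; here the Web cluster
# relies on inventory processing for a portion of its functionality.
dependents = {
    "Inventory Processing System": ["External Web Cluster"],
}

def affected_users(failed, impact_fraction=0.5):
    """Users hit by an outage of `failed`: all of its own users plus
    an assumed fraction of each dependent element's users."""
    total = direct_users.get(failed, 0)
    for dep in dependents.get(failed, []):
        total += int(direct_users.get(dep, 0) * impact_fraction)
    return total

print(affected_users("Inventory Processing System"))  # 1700 + 4150 = 5850
```

A real model would encode the dependency fractions per interconnection; the point is that once entered, impact rolls up through the graph automatically.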
Revenue Impacts
Knowing how many users are impacted tells one story of a problem. But knowing exactly
how service behaviors impact revenues is yet another. Information about revenue impacts
can be entered manually into a BSM's threshold logic. Or, more dynamically, it can be
gathered from business artifacts such as budgets, sales metrics, or other revenue data.
BSM is uniquely suited above all other monitoring solutions in that it includes the capacity
to aggregate traditional technology monitoring with financial data from these kinds of
sources. When sales or budgetary data is available in a format that the BSM solution can
work with, it becomes possible to relate technology and user impacts to hard dollar gains
or losses to the organization.
One example of this can be seen in Figure 9.7. In this example, technical information from a
site’s external Web metrics has been related to financial information as gathered from a
sales or revenue database. In this example, a historical trendline can be developed that
shows the relationship between unique visitors to a Web site and the level of daily revenue
that occurs as a function of those visitors.
Figure 9.7: Using a BSM visualization to relate user count to revenue statistics.
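A minimal sketch of such a trendline, using an ordinary least-squares fit; the visitor and revenue samples are invented for illustration:

```python
def fit_trendline(visitors, revenue):
    """Least-squares line relating daily unique visitors to daily
    revenue, as a BSM visualization might plot it."""
    n = len(visitors)
    mx = sum(visitors) / n
    my = sum(revenue) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(visitors, revenue))
    sxx = sum((x - mx) ** 2 for x in visitors)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical daily samples: unique visitors vs. revenue in dollars.
visitors = [1000, 2000, 3000, 4000]
revenue = [5000, 9000, 13000, 17000]
slope, intercept = fit_trendline(visitors, revenue)
print(slope, intercept)  # 4.0 1000.0 -- roughly $4 of revenue per visitor
```

Once the slope is known, a drop in unique visitors translates directly into an estimated dollar impact, which is exactly the relation the visualization conveys.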
This information becomes fantastically useful to the business leader because it provides a
real-time and historical look at the systems under their management. Yet it does so without
exposing the complex technology underpinnings that aren’t part of their job role. Such
visualizations relate the technology behaviors to business successes as a function of
revenue and/or sales. Business leaders with access to this kind of information have a much
greater capability to quickly shift focus, activities, and even entire lines of business as
needed based on quantitative information.
In some cases, a BSM solution needn’t be used at all with its technology monitoring
elements. A BSM solution provides a single-glimpse visualization of business activities, so it
becomes a location where purely financial information can be gathered for regular
consumption. Figure 9.8 shows an example of this, displaying monthly profit versus payout
information that is sourced directly from a financial database.
Figure 9.8: Information in BSM visualizations can be gathered from purely financial
sources as well.
Beyond the labor cost associated with collecting and creating the necessary reports, the
manual collection process masks another problem: data granularity. Think
about the situation where it takes a few days of labor to compile and report on business
metrics. The result is that metrics are always at least a few days old, and there is no
capacity to see in real-time how incremental system changes directly impact business
revenues. Lacking data granularity, you’re always making business decisions on old data.
BSM represents one solution to this problem. With a fully-realized BSM
implementation in place, business leaders are given access to real-time information about
technology as well as financial impacts. Because the data that drives their visualizations is
gathered constantly through its APM underpinnings, the business leader gains greater
visibility into customer and system behaviors. Leaders with this information can much
better reposition their business when conditions mandate changes to the business model.
One important metric that emerges once quality is measured is actually the
functional inverse of quality: the Cost of Poor Quality (CoPQ). When a business service is
not meeting the needs of its customers, it is not bringing in a maximum level of revenue.
With a BSM solution’s metrics in place, it grows possible to measure the lost revenue that is
incurred through poor quality service delivery. That lost revenue is directly and inversely
related to the level of quality in your system.
CoPQ is a term that is defined by Six Sigma as those costs which are generated as a result of
producing defective material. This cost includes the cost involved in fulfilling the gap between
the desired and actual product/service quality. It also includes the cost of lost opportunity due
to the loss of resources used in rectifying the defect.
More on Six Sigma and other frameworks in a minute, but for now, recognize that metrics
such as CoPQ become easy to measure once metrics like quality and revenue impacts are
well defined. BSM solutions have the potential to provide all of these numbers.
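One sketch of a CoPQ calculation along these lines treats CoPQ as the revenue gap between baseline and actual during degraded periods; the period data here is hypothetical:

```python
def cost_of_poor_quality(periods):
    """Sums lost revenue across degraded periods: for each period,
    the gap between baseline (expected) and actual revenue. Periods
    that meet or beat baseline contribute nothing."""
    return sum(max(0.0, expected - actual) for expected, actual in periods)

# (expected_revenue, actual_revenue) for three degraded hours
degraded = [(12000.0, 9000.0), (12000.0, 10500.0), (12000.0, 12500.0)]
print(cost_of_poor_quality(degraded))  # 4500.0 -- the last hour beat baseline
```

The baseline itself would come from the revenue trendlines discussed earlier, which is why CoPQ becomes easy to measure only after quality and revenue metrics are well defined.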
This situation grows even more problematic when global businesses sell their wares on the
Internet. Business on the Internet is commonly considered a 24×7×365 operation, with
services and businesses never really closing for operations. The expectation of never being
down presents a set of problems to the Internet-connected business. If Web services are
always up, when can maintenance and upgrades occur, and during which hours will
servicing affect the fewest users? Answering these questions requires the construction of a
business calendar.
Creating such a calendar is another non-trivial task. First and foremost, actual metrics of
user counts must be collated and averaged based on time and day. Those metrics must be
aggregated across the multiple geographic locations and time zones where the business
service is primarily located. Internet-based services with replicated infrastructures in other
parts of the world will see greater levels of inbound users at different parts of the day.
Figure 9.9 shows an example of how a simplistic business calendar can be constructed
across United States, EMEA, and Asia-Pac localities. Here, the areas shaded in red indicate
peak hours for that locality. Those in green represent non-peak hours. A BSM solution that
calculates a service's business calendar must combine those local calendars into a single
aggregated business schedule. That schedule can be used to answer the previously-posed
questions and to define the hours when servicing affects the fewest users.
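An aggregated business schedule like the one in Figure 9.9 might be computed as follows. The peak-hour ranges for each locality are invented placeholders; real values would come from the averaged user-count metrics described above:

```python
# Hypothetical peak hours (in UTC) for each locality.
peak_hours_utc = {
    "US": set(range(14, 23)),       # roughly 9 AM-6 PM Eastern
    "EMEA": set(range(8, 17)),      # roughly 9 AM-6 PM Central European
    "Asia-Pac": set(range(0, 9)),   # roughly 9 AM-6 PM in East Asia
}

def aggregated_calendar(peaks):
    """For each UTC hour, count how many localities are at peak;
    the aggregated business schedule is this per-hour tally."""
    return {h: sum(h in hours for hours in peaks.values()) for h in range(24)}

def best_maintenance_hours(peaks):
    """Hours where servicing affects the fewest localities."""
    cal = aggregated_calendar(peaks)
    least = min(cal.values())
    return sorted(h for h, n in cal.items() if n == least)

print(best_maintenance_hours(peak_hours_utc))  # [23] -- only 23:00 UTC is off-peak everywhere
```

With the ranges assumed here, the follow-the-sun overlap leaves only a single fully off-peak hour, which illustrates why Internet-facing services find maintenance windows so hard to schedule.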
Note
It is worth mentioning here that replicated service infrastructures across
multiple localities can impact how the business calendar is used. For
example, if a maintenance activity needs to occur on a local device, the local
business calendar should be consulted. If the device under maintenance is
used by the entire infrastructure, the aggregated calendar will determine its
maintenance time.
ITIL Integration
ITIL comprises a set of industry best practices that identify the necessary
activities common to an IT environment. ITIL is specifically composed of 5 stages
and 24 activities within those stages. These activities span the life cycle of service
strategy, design, transition, operations, and continual improvement.
A full discussion on ITIL can take an entire book (or in fact six books, which is what
comprises the entire library to date). For the purposes of ITIL’s linkages with BSM,
consider the activities in Figure 9.10. The 12 activities highlighted in red are those that
stand to gain a direct benefit from the quantitative data gathered and calculated by a
BSM/APM solution.
Figure 9.10: ITIL’s 5 stages and 24 activities. Those that are directly impacted by BSM
are highlighted in red.
Note
You can view more detailed information about the ways in which BSM
improves each of these activities by turning to Chapter 9 of The Definitive
Guide to Business Service Management.
Many of these activities have been discussed in this guide so far, although without
specifically calling them out by their ITIL nomenclature. One in particular of note is the
entire fifth stage of the ITIL service life cycle, Continual Service Improvement. In this stage,
services in operations are analyzed with an eye towards their capacity to meet their
original stated goals as well as the needs of their consumer.
BSM provides a substantial added value to this process through its identification and
quantification of service quality. This quantification enables improvement teams to
precisely identify gaps in service delivery, develop appropriate solutions, and
visibly see how well those solutions impact the overall quality of service delivery. In
essence, using BSM’s metrics, service improvement teams can measure the difference in as-
is and to-be levels of service quality, proving that their improvement activities have indeed
brought about improvement.
The Six Sigma Define, Measure, Analyze, Improve, and Control (DMAIC) process is
comprised of five phases: Defining the services or components that are critical to quality,
measuring their behaviors, analyzing those measurements with an eye towards finding
areas of gap, implementing and validating improvements, and finally, building the
controlling structures that ensure the improvement remains in place over time.
Figure 9.11 highlights each of these five phases as well as some of the common activities
that are accomplished during each phase. Important to recognize here is that each of the
activities noted in Figure 9.11 can actually be augmented through the data provided by a
BSM/APM solution. For example, sampling data can be gathered through APM monitoring
integrations. That data can be used to create a baseline of configuration and behaviors,
which is then continually measured through the same APM monitors.
Those behaviors can then be analyzed for poor quality and associated costs, creating failure
mode effect and Pareto charts to identify areas of highest impact. Ultimately, gaps can be
identified and improved upon, with the same BSM/APM data validating the positive impact
of the improvement. As in the process improvement example with ITIL, BSM/APM
quality data establishes the metric against which all improvement activities are ultimately
measured.
Most importantly, business processes are the backbone of a well-managed business. Their
efficient completion ensures that business activities are executed properly and ultimately
drive value—rather than cost—to the business. The historical problem, however, with
business processes has been their integration into IT technologies. Too often in immature
IT organizations, business processes are forced to function within the capabilities of the IT
infrastructure rather than the opposite. In the most egregious of examples, business
processes are simply not fulfilled by the technologies that are deployed by the IT
organization.
This chapter effectively concludes this guide’s discussion on APM. Its focus on the business
side of technology is fundamentally critical, as every IT organization is a function of its
business, and most businesses today can't function without their IT organizations. In
the end, the data-driven approach that such an organization gains through the
implementation of an APM solution gives a far greater situational awareness of the
technology environment than domain-focused point solutions.
Although this chapter concludes the discussion, it does not conclude the guide. The final
chapter, Chapter 10, arrives as a sort of primer on the topics discussed throughout this
entire guide. It summarizes the important points discussed in each chapter and is intended
to be the handout you can use to educate others about what you’ve learned in the other
nine chapters. Most importantly, Chapter 10 serves as a more concise explanation that you
can deliver to the decision makers in your organization should you determine that an APM
solution is a necessary addition to your IT environment.
180
Chapter 10
However, not everyone has the time or the interest in poring through a 200-page tome.
Digesting this guide’s 200-odd pages will consume more than an afternoon, making the
topic hard to approach for the busy executive or IT director. To remedy this situation, this
final chapter is published as a sort of “Cheat Sheet” for the other nine. Using excerpts from
each of the previous chapters, this “shortcut” guide summarizes the information you need
to know into a more bite-size format.
So, how is this chapter best utilized? Hand it out to your business leaders as a walkthrough
for APM’s business value. Pass it around your IT department to give them an idea of APM’s
technical underpinnings. Show it to your Service Desk employees as an example of the
future you want to implement. Then, for those who show particular interest, clue them in
on the other chapters for the full story. In the end, you’ll find that APM benefits everyone.
You just have to show them how.
The problem is that the idea of a “service that is down” is often so much more than a simple
binary answer: on versus off, working versus not working. As you can see in Figure 10.1, IT
services are made up of many components that must work in concert. Servers require the
network for communication. Web servers get their information from application servers
and databases. Data and workflow integrations from legacy systems such as mainframes
must occur. These days, even data storage must be accessible over that same network.
Organizations that want to take advantage of APM must lay in place a workflow and
technology infrastructure (see Figure 10.2) that enables the monitoring of hardware,
software, business applications, and, most importantly, the end users’ experience. These
monitoring integrations must be exceptionally deep in the level of detail they elevate to the
attention of an administrator. They must watch for and analyze behaviors across a wide
swath of technology devices and applications, including networks, databases, servers,
applications, mainframes, and even the users themselves as they interact with the system.
Figure 10.2 shows an example of how such a system might look. There, you can see how the
major classes of an IT system—users, networks, servers, applications, and mainframes—
are centered under the umbrella of a unified monitoring system. That system gathers data
from each element into a centralized database. Also housed within that database is a logical
model of the underlying system itself, which is used to power visualizations, suggest
solutions to problems, and assist with the prioritization of responses.
Figure 10.2: An APM solution leverages monitoring integrations and service model
logic to drive visualizations, prioritize problems, and suggest solutions.
With its monitoring integrations spread across the network, such a system can then assist
troubleshooting administrators with finding and resolving the problem’s root cause. In
situations in which multiple problems occur at once—not unheard of in IT environments—
an APM system can assist in the prioritization of problems. In short, an effective APM
system will drive administrators first to those problems that have the highest level of
impact on users.
But how does an IT organization know when they’ve got that right level of process in place
to best use such a solution? Or, alternatively, if an organization recognizes that they don’t
have the right level, how can an APM solution help them get there?
One way to evaluate and measure the “maturity” of IT is through a model that was
developed in 2007 as part of a Gartner analysis titled Introducing the Gartner IT
Infrastructure and Operations Maturity Model (2007, Scott, Pultz, Holub, Bittman, McGuckin).
This groundbreaking research note defined IT across a spectrum of capabilities, each
relating to the way in which IT actually goes about accomplishing its assigned tasks. An IT
culture with a higher level of process maturity will have the infrastructure frameworks in
place to make better use of technology solutions, solve problems faster, plan better for
expansions, and ultimately align better with the needs and wants of the business they
serve.
Process maturity within an organization encompasses quite a bit more than simply the
ability to solve problems. Within Gartner’s maturity model, the capacity of IT to solve—
and prevent—ever more complex problems was defined largely by its level of process
maturity.
Secondly, and arguably more importantly, smart organizations can leverage an APM
solution itself to rapidly develop process maturity in an otherwise immature organization. By
reorganizing your IT operations around a data-driven approach with comprehensive
monitoring integrations, you will find that you quickly begin making IT decisions based on
their impact to your business’ applications. You will better plan for augmentations based
on actual data rather than the contrived anticipation of need. You will better budget your
available resources based on actual responses you get out of your existing systems.
• The ways IT looks at itself. In earlier stages of maturity, IT sees itself as a fully-
segregated entity from the business. In many cases, IT can see itself as a different
business entirely! Individuals in IT find themselves concerned with the daily
processing of the servers and the network, to the exclusion of the data that passes
through those systems. As IT matures, the natural culture of IT is to begin thinking
of itself as a partner of the business, and ultimately as the business itself.
• The ways IT looks at data & applications. Data and applications in the immature
IT organization are its bread and butter. These are the elements that make up the
infrastructure, and are worked on as individual and atomic elements. IT in earlier
stages will find itself leveraging manual activities and shunning automation out of
distrust for how it interacts with system components. Applications in early-stage IT
are most often those that can be purchased off the shelf, with customization often very
limited or non-existent. Later-stage IT organizations needn’t necessarily build their
own applications; however, they do see applications as solutions for supporting
business processes as opposed to fitting the process around the available
application.
• The ways IT looks at the business. Immature IT organizations are incapable of
understanding how their activities impact the business as a whole. Lacking a holistic
view of their systems, they focus on availability as their primary measure of success.
Yet business applications require more than a ping response for them to be truly
available to users. More mature IT organizations find themselves implementing
tools to measure the end user’s experience. When that level of experience is better
understood, IT gains a greater insight into how their operations impact business
operations.
• The tools IT uses. The tools of IT also get more mature as the culture grows in
maturity. IT organizations with low levels of maturity are hesitant to incorporate
holistic solutions often because they can’t see themselves actually using or getting
benefit from those solutions. As such, immature IT organizations lean on point
solutions as stopgap resolutions for their problems. The result is that collections of
tools are brought to bear while unified toolsets are ignored. Mature IT organizations
have a better capability to understand the operational expense of an expanding
toolset, while being more capable—both technically and culturally—of leveraging
the information gained from unified solutions.
As the maturity of IT’s tools grows, so does the predictive capacity of those tools. It was
discussed in Chapter 1 that solution platforms such as those that fulfill APM’s goals extend
their monitoring integrations throughout the technology infrastructure of a business.
Because APM’s reach is so far into each of a business application’s components, it grows
more capable than point solutions for finding the real root cause behind problems or
reductions in performance.
To that end, IT has seen a similar evolution in the approaches used for monitoring its
infrastructure. IT’s early efforts towards understanding its systems’ “under the covers”
behaviors have evolved in many ways similar to Gartner’s depiction of organizational
maturity. Early attempts were exceptionally coarse in the data they provided, with each
new approach involving richer integrations at deeper levels within the system.
IT organizations that manage complex and customer-facing systems are under a greater
level of due diligence than those who manage a simple infrastructure. As such, the tools
used to watch those systems must also have a higher level of due diligence. As monitoring
technologies have evolved over time, new approaches have been developed that extend the
reach of monitoring, enhance data resolution, and enable rich visualizations to assist
administrative and troubleshooting teams.
That statement isn’t written to scare away any business from a potential APM installation.
Although a solution’s installation will require the development of a project plan and
coordination across multiple teams, the benefits gained are tremendous to assuring quality
services to customers. Any APM solution requires the involvement of each of IT’s
traditional silos. Each technology domain—networks, servers, applications, clients, and
mainframes—will have some involvement in the project. That involvement can span from
installing an APM’s agents to servers and clients to configuring SNMP and/or NetFlow
settings on network hardware to integrating APM monitoring into off-the-shelf or
homegrown applications. As a result, an APM solution enables a level of objective analysis
heretofore unseen in traditional monitoring.
The realities of that objective data are best exemplified through APM’s mechanisms to
chart and plot its data. Figure 10.3 shows a sample of the types of simultaneous reports
that are possible when each component of an application infrastructure is consolidated
beneath an APM platform. In Figure 10.3, a set of statistics for a monitored application is
provided across a range of elements.
Take a look at the varied ways in which that application’s behaviors can be charted over
the same period of time. Measuring performance over the time period from 10:00 AM to
7:00 PM, these charts enable the reconstruction of that application’s behaviors across each
of its points of monitoring.
Figure 10.3: APM’s integrations enable real-time and historical monitoring across a
range of IT components, aggregating their data into a single location for analysis.
With the data you see in Figure 10.3, consider the points of integration where you might
want monitors set into place. You will definitely want to watch server processing loads.
You'll need to record your network bandwidth utilization and throughput. You'll need to
know transaction rates between mainframes and inventory processing.
All these monitors illuminate different behaviors associated with the greater system at
large, and all provide another set of data that fills out the picture in Figure 10.3’s charts and
graphs. Now take a look at Figure 10.4, which shows how some of these monitoring
integrations can be laid into place for an example customer-facing business service.
One end goal of all this monitoring is the ability to create an overall sense of system
“health.” As should be obvious in this chapter, an APM solution has a far-reaching capability
to measure essentially every behavior in your environment. That’s a lot of data. A resulting
problem with this sheer mass of data is in ultimately finding meaning. Essentially, you can
gather lots of data, but it isn’t valuable if you don’t use it to improve the management of your
systems.
As a result, APM solutions include a number of mechanisms to roll up this massive quantity
of data into something that is useable by a human operator. This process for most APM
solutions is relatively automatic, yet requires definition by the IT organization who
manages it.
The concept of “service quality” is used to explain the overarching environment health. The
idea is quite simple: the “quality” of a service is a single metric—like a
stoplight—that tells you how well your system is performing. In effect, if you roll up every
system-centric counter, every application metric, every network behavior, and every
transaction characteristic into a single number, that number goes far in explaining the
highest-level quality of the service’s ability to meet the needs of its users.
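A minimal sketch of such a roll-up; the equal weighting and stoplight cut-offs are assumptions that a real APM product would let the IT organization define:

```python
def service_quality(metrics, weights=None):
    """Rolls normalized metric scores (0-100) into one quality number
    and a stoplight color. Weights default to equal; cut-offs at 90
    and 75 are illustrative thresholds."""
    weights = weights or {name: 1.0 for name in metrics}
    total_w = sum(weights.values())
    score = sum(metrics[n] * weights[n] for n in metrics) / total_w
    if score >= 90:
        return score, "green"    # Normal
    elif score >= 75:
        return score, "yellow"   # Degraded
    return score, "red"          # Critical

# Hypothetical normalized scores from four monitoring domains
metrics = {"server_cpu": 95, "app_response": 88, "network": 92, "transactions": 85}
print(service_quality(metrics))  # (90.0, 'green')
```

However simple the arithmetic, the value of the stoplight lies in what feeds it: each input must itself be a rolled-up product of the monitoring integrations discussed earlier.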
Consider the graphic shown in Figure 10.5. Here, a number of services in different locations
are displayed, all with a health of “Normal.” This single stoplight chart very quickly enables
the IT organization to understand when a service is meeting demands and when it isn't.
The graph also shows the duration the service has operated in the “normal” state, as well as
a monthly trend. This single view provides a heads-up display for administrators.
Yet actually getting to a graph like this requires each of the monitoring integrations
explained to this point in this chapter. The numerical analysis that goes into identifying a
service’s “quality” requires inputs from network monitors, on-board agents, transactions,
and essentially each of the monitoring types provided by APM.
Consider, for example, a set of fans watching a baseball game. If you and a friend are both
watching the game but sitting in different parts of the stadium, you’re sure to capture
different things in your view. Your friend who is sitting in the good seats down by the
batter is likely to pick up on more subtle non-verbal conversations between pitcher and
catcher. In contrast, your seats deep in the outfield are far more likely to see the big picture
of the game—the positioning of outfielders, the impact of wind speed on the ball, the
emotion and effects of the crowd on the players—than is possible through your friend’s
close-in view.
Relating this back to applications and performance, it is for this reason that multiple
perspectives are necessary. Their combination assists the business with truly
understanding application behaviors across the entire environment. An agent that is
installed to an individual server will report in great detail about that server’s gross
processing utilization. That same agent, however, is fully incapable of measuring the level
of communication between two completely separate servers elsewhere in the system.
This view is critically necessary because it is not possible—or, at the very least,
exceptionally difficult—to construct this experience using the data from other metrics.
Relating this back to the baseball example, no matter how much data you gather from your
seat in the outfield, it remains very unlikely that you’ll extrapolate from it what the pitcher
is likely to throw next.
For the needs of the business application, end user experience (EUE) enables
administrators, developers, and even management to understand how an application’s
users are faring. First and foremost, this data is critical for discovering how successful that
application is in servicing its customers. Applications whose users experience excessive
delays, drop off before accomplishing tasks, and don’t fulfill the use case model aren’t
meeting their users’ needs. And those that don’t meet user needs will ultimately result in a
failure to the business.
This line of thinking introduces a number of potential use cases where EUE monitoring can
benefit an application’s quality of service. EUE monitoring works for evaluating the
experience of the absolute end user as well as in other ways.
Think for a moment about a typical Internet-based application such as the one being
discussed in this chapter. Multiple systems combine to enable the various functions of that
application. Yet there is one set of servers that interfaces directly with the users
themselves: the External Web Cluster. Every interaction between the end user and the
application must proxy in some way through that Web-based system. This centralization
means that every interaction with users can also be measured from that single location.
EUE leverages transaction monitoring between users and Web servers as a primary
mechanism for defining the users’ experience. Every time a user clicks on a Web page, the
time required to complete that transaction can be measured. The more clicks, the more
timing measurements. As users click through pages, an overall sense of that user’s
experience can be gathered by the system and compared with known baselines. These
timing measurements create a quantitative representation of the user’s overall experience
with the Web page, and can be used to validate the quality of service provided by the
application as a whole.
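The timing comparison described above might be sketched as follows; the 25% tolerance and the session timings are illustrative assumptions:

```python
import statistics

def eue_assessment(click_times_ms, baseline_ms, tolerance=1.25):
    """Compares a user's page-transaction timings with a known
    baseline. A session is acceptable while its median timing stays
    within the tolerance band around the baseline."""
    median = statistics.median(click_times_ms)
    ok = median <= baseline_ms * tolerance
    return median, "acceptable" if ok else "degraded"

# Timings (ms) for one user's click-through session vs. an 800 ms baseline
print(eue_assessment([740, 810, 905, 780, 820], baseline_ms=800))  # (810, 'acceptable')
```

Each click adds another timing sample, so the assessment sharpens as the user moves through the site, exactly as the text describes.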
It is perhaps easiest to explain this through the use of an example. Consider the typical
series of steps that a user might undergo to browse an e-commerce Web site, identify an
item of interest, add that item to their basket, and then complete the transaction through a
check out and purchase. Each of these tasks can be quantified into a series of actions. Each
action starts with the Web server, but each action also requires the participation of other
services in the stack for its completion:
• Browse an e-commerce Web site. The External Web Cluster requests potential
items from the Java-based Inventory Processing System, which gathers those items
from the Inventory Mainframe. Resulting items are presented back to the External
Web Cluster, where they are rendered via a Web page or other interface.
• Identify an item of interest. This step requires the user to look through a series of
items, potentially clicking through them for more information. Here, the same
thread of communication between External Web Cluster, Inventory Processing
System, and Inventory Mainframe is leveraged during each click. Further
assistance from the ERP system can be used in identifying additional or alternative
items of interest to the user based on the user’s shopping habits.
• Add that item to the basket. Creating a basket often requires an active account by
the user, handled by the ERP system with its security handled by the Kerberos
Authentication System. The actual process of moving a desired item to a basket can
also require temporarily adjusting its status on the Inventory Mainframe to ensure
that item remains available for the user while the user continues shopping.
Information about the successful addition of the item must be rendered back to the
user by the External Web Cluster.
• Complete the transaction through a check out and purchase. This final phase
leverages each of the aforementioned systems but adds the support of the Credit
Card Proxy System and Order Management System.
In all these conversations, the External Web Cluster remains the central locus for
transferring information back to the user. Every action is initiated through some click by
the user, and every transaction completes once the resulting information is rendered for
the user in the user’s browser. Thus, a monitor at the level of the External Web Cluster can
gather experiential data about user interactions as they occur. Further, as the monitor sits
in parallel with the user, any delay in receiving information from down-level systems is
recognized and logged.
A resulting visualization of this data might look similar to Figure 10.6. In this figure, a top-
level EUE monitor identifies the users who are currently connected into the system.
Information about the click patterns of each user is also represented at a high level by
showing the number of pages rendered, the number of slow pages, the time associated with
each page load, and the numbers of errors seen in producing those pages for the user.
Figure 10.6: User statistics help to identify when an entire application fails to meet
established thresholds for user performance.
Adding a bit of preprogrammed threshold math to the equation, each user is then given
a metric associated with their overall application experience. In Figure 10.6, you can see
how some users are experiencing a yellow condition. This means that their effective
performance is below the threshold for quality service. Although this information exists at
a very high level, and as such doesn’t identify why performance is lower than expectations,
it does alert administrators that degraded service is being experienced by some users.
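A sketch of the kind of threshold math that might produce those green, yellow, and red conditions follows. The threshold values themselves are assumptions for illustration; a real APM product would expose them as configurable service-quality rules:

```python
def user_status(page_times: list[float], yellow: float = 1.5, red: float = 3.0) -> str:
    """Collapse a user's page-load times into a traffic-light status.

    `yellow` and `red` are assumed thresholds (seconds) on the average
    page-load time, not values taken from any particular product.
    """
    avg = sum(page_times) / len(page_times)
    if avg >= red:
        return "red"
    if avg >= yellow:
        return "yellow"
    return "green"
```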
An effective APM solution should enable administrators to drill down through high-level information like that seen in Figure 10.6 toward more detailed statistics. Those
statistics may illuminate more information about why certain users are experiencing
delays while others are not. Perhaps one server in a cluster of servers further down in the
application’s stack is experiencing a problem. Maybe the items being requested by some
users are not being located quickly enough by inventory systems. Troubleshooting
administrators can drill through EUE information to server and network statistics, network
analytics, or even individual transaction measurements to find the root cause of the
problem.
It is this process that requires attention at this point in our discussion. In reading through
the first five chapters of this guide, you’ve made yourself aware of where monitoring fits
into your environment. The next step is creating meaning out of its raw data. As John
mentioned earlier and as you’ll discover shortly, the real magic in an APM solution comes
through the creation and use of its Service Model.
To fully understand the quantitative approach to Service Quality, one must understand how
the different types of monitoring are aggregated into what is termed a Service Model. This
Service Model is the logical representation of the business service, and is the structure and
hierarchy in which each monitoring integration's data resides. The Service Model is
functionally little more than “boxes on a whiteboard,” with each box representing a
component of the business service and each connection representing a dependency. It
resides within your APM solution, with the sum total of its elements and interconnections
representing the overall system that the solution is charged with monitoring.
But before actually delving into a conversation of the Service Model, it is important to first
understand its components. Think about all the elements that can make up a business
service. There are various networking elements. Numerous servers process data over that
network. Installed to each server may be one or more applications that house the service’s
business logic. All these reside atop name services, file services, directory services, and
other infrastructure elements that provide core necessities to bind each component.
Take the concepts that surround each of these and abstract them to create an element on
that proverbial whiteboard. This guide’s External Web Cluster becomes a box on a piece of
paper marked “External Web Cluster.” The same happens with the Inventory Processing
System and the Intranet Router, and eventually every other component.
By encapsulating the idea of each service component, it is now possible to connect those
boxes and design the logical structure of the system. This step is generally an after-the-
implementation step, with the implemented service’s architecture defining the model’s
structure and not necessarily the opposite. Figure 10.7 shows a simple example of how this
might occur. There, the External Web Cluster relies on the Inventory Processing System for
some portion of its total processing. Both the External Web Cluster and the Inventory
Processing System rely on the Intranet Router for their networking support. As such, their
boxes are connected to denote the dependency.
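The "boxes and connections" of Figure 10.7 amount to a small dependency graph. The component names come from this guide's running example, but the dictionary representation is an assumption made purely for illustration:

```python
# Each key is a "box" on the whiteboard; each value lists the
# components that box relies on (its dependencies).
SERVICE_MODEL = {
    "External Web Cluster": ["Inventory Processing System", "Intranet Router"],
    "Inventory Processing System": ["Intranet Router"],
    "Intranet Router": [],
}

def dependents_of(component: str) -> list[str]:
    """Return every component that directly relies on `component`."""
    return sorted(c for c, deps in SERVICE_MODEL.items() if component in deps)
```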
This abstraction and connection of service components only creates the logical structure
for your overall business service. Internal to each individual component are metrics that
evaluate the internal behaviors of that component. As you already saw back in Figure 10.4,
those metrics for a network device might be Link Utilization, Network Latency, or Network
Performance. An inventory processing database might have metrics such as Database
Performance or Database Transactions per Second. Each individual server might have its
own server-specific metrics, such as Processor Utilization, Memory Utilization, or Disk I/O.
Even the installed applications present their own metrics, illuminating the behaviors
occurring within the application.
With this in mind, let’s redraw Figure 10.7 and map a few of these potential points of
monitoring into the abstraction model. Figure 10.8 shows how some sample metrics can be
associated with the Inventory Processing System. Here, the Database Performance and
Transactions per Second statistics arrive from application analytics integrations plugged
directly into the installed database. Agent-based integrations are also used to gather whole
server metrics such as Memory Utilization and Processor Utilization.
[Figure elements: Inventory Processing System with Component Health rolling up Database Performance, Transactions per Second, Memory Utilization, and Processor Utilization; connected to the Intranet Router]
Figure 10.8: Individual monitors for each element are mapped on top of each
abstraction.
You’ll also notice that the colors of each element are changed as well. At the moment Figure
10.8 is drawn, the Inventory Processing System’s box is colored red. This indicates that it is
experiencing a problem. Drilling down into that Inventory Processing System, one can
identify from its associated metrics that the server’s Processor Utilization has gone above
its acceptable level and has switched to red.
Each of the metrics assigned to the Inventory Processing System's box is itself part of a hierarchy. The four assigned metrics fall under a fifth that represents the overall
Component Health. This illustrates the concept of rolling up individual metrics to those that
represent larger and less granular areas of the system. It enables the failure of a down-level
metric to quickly rise to the level of the entire system.
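That roll-up can be sketched as taking the worst status among a component's metrics. The worst-of rule is an assumption made here for simplicity; real products may weight or combine metrics differently:

```python
# Assumed severity ordering for the traffic-light statuses.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}

def component_health(metric_statuses: dict[str, str]) -> str:
    """Roll individual metric statuses up into one Component Health value:
    the component is only as healthy as its worst metric."""
    return max(metric_statuses.values(), key=lambda status: SEVERITY[status])
```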
Going a step further, this model flows individual failures up to the greater system through the individual linkages between components that rely on each other. In this example, the
External Web Cluster relies on the failed Inventory Processing System. Therefore, when the
Inventory Processing System experiences a problem, it is also a problem for the External
Web Cluster. The model as a whole is impacted by the singular problem associated with
Processor Utilization in the Inventory Processing System.
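Propagation along those dependency links can be sketched as a walk upward through the model. The dictionary representation of the model is an assumption made for illustration:

```python
def impacted_components(model: dict[str, list[str]], failed: str) -> set[str]:
    """Walk dependency links upward: anything that relies on a failed
    component, directly or transitively, is also impacted."""
    impacted = {failed}
    changed = True
    while changed:
        changed = False
        for component, deps in model.items():
            if component not in impacted and impacted.intersection(deps):
                impacted.add(component)
                changed = True
    return impacted

# This chapter's example: the External Web Cluster relies on the
# Inventory Processing System, which relies on the Intranet Router.
MODEL = {
    "External Web Cluster": ["Inventory Processing System"],
    "Inventory Processing System": ["Intranet Router"],
    "Intranet Router": [],
}
```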
It is the summation of all these individual threshold values that ultimately drives the
numerical determination of Service Quality. A business service operates with high quality
when its configured thresholds remain in the green. That same service operates with low
quality when certain values flip from green to red, and it is no longer available when other critical values become unhealthy. The levels of functionality between these states become
mathematical products of each calculation.
In effect, one of APM’s greatest strengths is in its capacity to mathematically calculate the
functionality of your service. Taking this approach one step further, IT organizations can
add data to each element that describes the number of potential users of that component.
Combining this user impact data with the level of Service Quality enables the system to
report on which and how many users are impacted by any particular problem.
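Combining per-component user counts with component status might look like the following sketch; the user counts and the "any non-green component" rule are assumptions made for the example:

```python
def affected_users(component_users: dict[str, int], statuses: dict[str, str]) -> int:
    """Sum the recorded user counts of every component that is not healthy."""
    return sum(users for component, users in component_users.items()
               if statuses.get(component, "green") != "green")
```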
When it comes to visualizing this data, the most important word is "useful." "Useful" in this context means that the visualization is providing the right data to the right person. "Useful" also means providing that data in a way that makes sense for and provides value to its consumer.
The concept of digestibility was first introduced in this book’s companion, The Definitive
Guide to Business Service Management. In both guides, the digestibility of data relates to the
ways in which it can be usefully presented to various classes of users. For example, data
that may be valuable to a developer is not likely to have the same value for Dan the COO.
Dan’s role might care less about the failure of an individual network component compared
with how that component impacts the system’s customers. Each person in the business has
a role to fill, and as such, different views of data are necessary.
No visualization is effective unless it is created first with its consumer in mind. If that
consumer can’t digest what’s being presented to them, the information being displayed is
valueless. Think about the types of consumers in your business today who might benefit from the data an APM solution can gather.
Note
With a picture really being worth a thousand words, consider turning back to
Chapter 7 to see examples of APM visualizations for each of these classes of
consumer. There, you’ll see how APM’s graphical representation of your
business services enables a much improved situational awareness of their
inner workings and impacts to the business.
Visibility
Behaviors that occur outside expected thresholds are surfaced as alerts through high-level visualizations.
Through drill-down support, the perspective and data found in that high-level visualization
can be narrowed to one or more systems or subsystems that triggered the failure. Using
tools such as service quality metrics and hierarchical service health diagrams, triaging
administrators can be quickly advised as to initial steps in problem resolution.
Prioritization
Counts of affected users are predefined within an APM solution’s interface, enabling
triaging teams to identify the actual priority of one incident in relation to others that are
outstanding. As a result, those with higher numbers of affected users or greater impacts on
the business' bottom line can be prioritized higher than those with lesser effect.
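A sketch of that prioritization, assuming each incident record carries an affected-user count:

```python
def prioritize(incidents: list[dict]) -> list[dict]:
    """Order open incidents so those affecting the most users come first."""
    return sorted(incidents, key=lambda incident: incident["affected_users"], reverse=True)
```

In practice the sort key could also fold in business-impact weightings, but the user count alone already gives triaging teams a defensible ordering.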
Improvement
Throughout the entire process, the APM solution continues to gather data about the
system. This occurs both during nominal as well as non-nominal operations. The resulting
data can then be later used by improvement teams to identify whether additional
hardware, code updates, or other assistance is needed to prevent the problem from
reoccurring. By monitoring the environment through the entire process, after-action
review teams can identify whether the resolution is truly a permanent fix or if further work
is needed.
It should be obvious how this six-step process is much more data driven than the
earlier traditional approach. Here, every team remains notified about the status of the
problem and can provide input when necessary through the sharing of monitoring data.
When problems occur that cross traditional domain boundaries, those teams can work
together towards a common goal without the need for war rooms and their subsequent
finger-pointing.
Note
For a fictional narrative of the entire six-step process, consider turning back
to Chapter 8. There, a made-up storyline is used to show how a fully-realized
APM solution can and does improve the process of triaging, troubleshooting,
provisioning resources, and eventually solving what would otherwise be an
exceptionally painful problem.
Yet the topics in this chapter’s story so far have been fundamentally focused on the
technologies themselves, along with the performance and availability metrics associated
with those technologies. Its resulting visualizations were heavily focused on the needs of
the technologist:
• Service desk employees were able to track the larger issue directly into its problem
domain.
• Network administrators were able to identify whether metrics for network
utilization were within acceptable parameters.
• Administrators were able to use health and performance metrics to identify
symptoms of the problem.
• Developers were able to ultimately identify the failing lines of code and quickly
implement a fix.
Missing, however, in the previous chapter’s story is another set of business-related metrics
that convert technology behavior into useable data for business leaders. This class of data
tells the tale of how a business service ultimately benefits—or takes away from—the
business’ bottom line. It also creates a standard by which the quality of that service’s
delivery can be measured. It is the gathering, calculation, and reporting on these business-
related metrics that comprise the methodology known as Business Service Management
(BSM).
BSM and APM are two methodologies that are naturally linked by their requirements for
data. The information gathered through an APM solution's monitoring integrations feeds directly into the requirements of a BSM calculations engine. Performance, availability, and
behavioral data of the overall business service and its components are all metrics that aid
in calculating that service’s overall return. These metrics also provide the kind of raw data
that helps identify how well a business system is meeting the needs of its customers.
Figure 10.9 shows a logical representation of where BSM links into APM. Here, APM begins
with the creation of monitoring integrations across the different elements that make up a
business service. Those monitoring integrations gather behavioral information about the
end users’ experience. They collect application and infrastructure metrics as well as other
customized metrics from technology components. APM’s data by itself is used primarily by
the IT organization for the problem resolution and service improvement processes
discussed to this point in this guide.
[Figure 10.9 elements: Monitoring Integrations feeding Business Service Management, with Service-level Expectations, Service-level Reporting, and Business Service Views]
The addition of BSM creates a new layer atop this APM infrastructure. Here, the business
itself becomes a critical component of the monitoring solution. Business processes and
service level expectations are encoded into a BSM solution, with the goal of creating
business service views that validate and report on how well the technology is meeting the
needs of the business.
The metrics gained through a BSM implementation are also useful when fed into
management frameworks such as ITIL or Six Sigma. Like BSM with its roots in APM data, these
frameworks are often highly data-driven in how they accomplish and improve upon the
tasks of IT. One of the common limitations, however, in successfully implementing ITIL and
Six Sigma framework processes is in gathering enough data of the right kind to be useful.
The data gathering and calculation potential of a BSM/APM solution enables greater
success with both frameworks.
BSM provides a substantial added value to this process through its identification and
quantification of service quality. This quantification enables improvement teams to precisely identify gaps in service delivery, develop appropriate solutions, and see how well those solutions impact the overall quality of service delivery. In
essence, using BSM’s metrics, service improvement teams can measure the difference in as-
is and to-be levels of service quality, proving that their improvement activities have indeed
brought about improvement.
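Once service quality is expressed as a number, the as-is versus to-be comparison reduces to simple arithmetic; the percentage form shown here is just one common way to express the delta:

```python
def quality_improvement(as_is: float, to_be: float) -> float:
    """Percentage change in a service-quality score between the measured
    as-is baseline and the post-improvement to-be measurement."""
    return (to_be - as_is) / as_is * 100.0
```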