SLA Metrics, Measurement and Manipulation

Clarity for Complex Change

How to identify and correct flawed
SLA measurement and reporting in IT
outsourcing contracts
Offered by:
Manji Kerai
Senior Consultant
manji.kerai@xceedgroup.com
November 2013

Introduction

A lot has already been written about developing good SLA metrics for IT outsourcing contracts. Most of it is common sense, and as IT outsourcing has become commonplace, SLAs have been defined and refined to the extent that many standard SLA templates exist that can be used with few changes. In general, all good SLA metrics have the following characteristics:

"Service satisfaction, although subjective, provides a good double check on the validity of the SLA metrics." – CIO magazine

1. They are meaningful and reflect business priorities
2. They are objective, measurable and reportable
3. They reflect service performance entirely within the provider's control and scope of responsibility
4. The targets are realistic and fair to both parties
5. They have consequences that provide an incentive for the service provider to do a good job and keep doing it well
All SLA metrics include a clear definition, a specific target and consequences (penalties and rewards) for under- or over-achievement. Today, many organisations that outsource IT infrastructure simply take an SLA template supplied by the chosen provider, modify the targets and/or penalties, and feel comfortable in thinking that the SLA will ensure the provider does a good job.

"Seventeen percent of the actual IT budget in 2012 was spent on outsourcing." – ZDNet

If SLA metrics are well thought through and follow the five basic rules outlined above, why do IT managers sometimes find that the SLA report appears to show that service performance is within target even when business-impacting service disruptions occur? The most likely reason for this discrepancy is that the SLA metric's measurement or calculation is incorrect and therefore does not accurately reflect the service performance required by the business.
This white paper highlights some common areas where SLA metric mis-calculations occur, inadvertently or sometimes deliberately, leading to reports that reflect better service performance than was actually delivered. It also provides guidance on how to identify and correct or avoid these anomalies.


Averaging service availability and response times


Service availability and response time are very common SLA metrics used to measure service performance in many areas of IT, including networks and applications. Consider this example:
Application response time
Definition:

The application response time is the time in milliseconds it takes for a transaction to be
processed by the application server. It is measured from the time the request is recorded at
the ingress point of the server to the time the response is sent out by the application and is
recorded at the egress point of the server. This metric is reported on a weekly basis.

Target:

Less than 25 milliseconds average (presumably this target is deemed acceptable for this
app).

The provider is likely to have an automated system or tool that monitors and tracks this metric. The system will typically log the response times for all transactions each day; the daily average is then calculated and saved in a data collection system (more likely than not a simple text file or spreadsheet). The weekly averages are then calculated from the daily averages for reporting purposes. This method of measurement and reporting is very practical for the service provider and not uncommon.
In a typical week the metric measurements may look something like this:
Day                         Mon      Tue     Wed     Thu      Fri      Sat      Sun
Transactions                100 000  99 000  98 000  101 000  130 000  101 000  101 000
Average response time (ms)  21.5     21.1    19.9    19.4     22.1     20.5     21.5

Weekly average = 20.86 ms


The average response time is below the target maximum, so service performance for this week is perfectly fine.
However, consider a situation where, on one very busy day, the provider had capacity issues. The weekly report may then look something like this:
Day                         Mon      Tue     Wed     Thu      Fri      Sat      Sun
Transactions                100 000  99 000  98 000  101 000  130 000  101 000  101 000
Average response time (ms)  21.5     21.1    19.9    19.4     45.8     20.5     21.5

Weekly average = 24.24 ms


The average response time is still within SLA and all should be well. The reality is, of course, that for one day users suffered unacceptable latency which would have impacted the business; crucially, this is not an acceptable level of performance despite the weekly SLA target being met.


Averages of averages
The problem with the above calculation, as any 15-year-old fresh from a maths lesson will tell you, is that taking an average of averages does not give you anything meaningful unless every group is the same size. Unfortunately this basic arithmetic knowledge is sometimes forgotten when practicalities get in the way: it seems natural to average seven values, one for each day of the week, to obtain a weekly average, forgetting that each daily value is itself an average over a different number of transactions. To address this type of mis-calculation, all SLA metrics should include a very specific measurement and calculation definition. For example:
Measurement: The average response time will be calculated by summing the response times of all transactions over the week and dividing by the total number of transactions in the week. Mathematically:

\[ \text{Average response time} = \frac{\sum_{i=1}^{N} t_i}{N} \]

where
t_i is the response time of transaction i in milliseconds
N is the total number of transactions over the measurement period.

If calculating the average in this way is not possible or practical for the service provider due to limitations of their automated data collection systems, then the metric should be re-defined: for example, changed to a daily average or to a weighted average.

\[ \text{Weighted average response time} = \frac{\sum_{i=1}^{N} T_i A_i}{\sum_{i=1}^{N} T_i} \]

where
T_i is the number of transactions during measurement period i
A_i is the average response time for measurement period i
N is the number of measurement periods (seven in this case).

In the above example, using the correct calculation method would give a weekly average response time of 25.13 ms, which would accurately (and more fairly) reflect the service provider's performance for the week.
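As a minimal illustration (not from the original paper), here is a short Python sketch of the two calculations, using the daily transaction counts and averages from the capacity-issue week above:

```python
# Daily transaction counts and daily average response times (ms)
# from the week with the Friday capacity problem.
transactions = [100_000, 99_000, 98_000, 101_000, 130_000, 101_000, 101_000]
avg_response = [21.5, 21.1, 19.9, 19.4, 45.8, 20.5, 21.5]

# Naive "average of averages": every day counts equally,
# regardless of how many transactions it carried.
naive = sum(avg_response) / len(avg_response)

# Weighted average: each daily average is weighted by the number
# of transactions it summarises.
weighted = sum(t * a for t, a in zip(transactions, avg_response)) / sum(transactions)

print(f"Average of averages: {naive:.2f} ms")    # 24.24 ms - within a 25 ms SLA
print(f"Weighted average:    {weighted:.2f} ms")  # 25.13 ms - an SLA breach
```

The busy Friday carried 130,000 transactions, so giving it the same weight as a quiet day understates its impact; weighting by transaction volume surfaces the breach.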
Although weighted averages work well for metrics such as response times, they
are not always appropriate for measuring metrics such as availability. Consider the
availability values below:
Day           Mon      Tue     Wed     Thu      Fri      Sat      Sun
Transactions  100 000  99 000  98 000  101 000  130 000  101 000  101 000
Availability  100%     100%    99.99%  99.99%   100%     99.99%   100%

A weighted average gives a perfectly acceptable and fair weekly figure of 99.996%. However, suppose that on one of the days the service was down all day, so no transactions were processed, and the outage caused very severe business impact:
Day           Mon      Tue     Wed     Thu      Fri  Sat      Sun
Transactions  100 000  99 000  98 000  101 000  0    101 000  101 000
Availability  100%     100%    99.99%  99.99%   0%   99.99%   100%

Unfortunately, using weighted averages now gives a weekly average figure of 99.995%, which clearly does not accurately reflect the weekly performance: because no transactions were processed during the outage, the worst day of the week carries no weight at all in the calculation. In this example it would be safer, more practical and fairer to use a simple average, which would give a weekly performance figure of around 86% and would likely be an SLA failure.
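The same comparison in a minimal Python sketch, using the outage-week figures above, makes the failure mode plain:

```python
# Daily transaction counts and availability for the week with the
# all-day Friday outage (no transactions were processed that day).
transactions = [100_000, 99_000, 98_000, 101_000, 0, 101_000, 101_000]
availability = [100.0, 100.0, 99.99, 99.99, 0.0, 99.99, 100.0]

# Transaction-weighted average: the outage day has zero transactions,
# so it contributes nothing to the result.
weighted = sum(t * a for t, a in zip(transactions, availability)) / sum(transactions)

# Simple average: every day counts equally, so the outage shows up.
simple = sum(availability) / len(availability)

print(f"Weighted average availability: {weighted:.3f}%")  # 99.995% - looks fine
print(f"Simple average availability:   {simple:.2f}%")    # 85.71% - an SLA failure
```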


Large systems
Another common source of incorrect SLA calculation and reporting is calculating a metric for a large or distributed system. Take the following example:
System Availability
Definition:

The amount of time the system is up and running and is able to process transactions
expressed as a percentage over a one month period.

Target:

99.90% or better (presumably this target is deemed acceptable for this system).

A large system is typically deployed in one of two ways:
1. The system is distributed over a number of identical servers or sub-systems, each of which can work independently. Transactions are distributed over all sub-systems, providing load balancing.
2. The system is broken down into its component parts and the availability of each part is reported separately, e.g. a firewall, a web server and a database server. If one component fails then the whole service will not function correctly.
Consider the following availability performance results over one measurement
period (e.g. a month):
System type 1
Sub-system    1       2       3
Availability  99.90%  99.90%  100%
Calculating system availability in this case is simply a matter of averaging, provided that if one sub-system were down the others could take over (true load balancing).
System type 2
Component     1       2       3
Availability  99.90%  99.90%  100%

In this case it is tempting to report the SLA as the average of the three components (99.93%); however, this would be incorrect. For a system where each component is critical to the overall service, the correct availability is the product of the component availabilities:

\[ \text{System availability} = \prod_{i=1}^{n} A_i \]

where A_i is the availability of component i over the measurement period. In this example the product is 99.90% × 99.90% × 100% = 99.80%. Clearly this gives a different value than the average and more accurately reflects the service performance achieved over the measurement period.
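A short Python sketch, using the figures above, contrasts the two system types (the parallel calculation follows the averaging approach described for system type 1, assuming true load balancing):

```python
# Component availabilities over the measurement period, as fractions.
availabilities = [0.9990, 0.9990, 1.0000]

# System type 1 (parallel, load-balanced): a simple average is fair
# because any sub-system can take over from a failed one.
parallel = sum(availabilities) / len(availabilities)

# System type 2 (series): every component is critical, so the
# overall availability is the product of the component availabilities.
series = 1.0
for a in availabilities:
    series *= a

print(f"Parallel (average): {parallel:.4%}")  # 99.9333%
print(f"Series (product):   {series:.4%}")    # 99.8001%
```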


Redundant systems
Most IT infrastructure includes redundancy, from firewalls and circuits to servers and switches, in order to meet specific availability (uptime) performance levels. However, designing a system to meet a specific availability target is very different from measuring and reporting the availability actually achieved. Consider this typical situation:

Individual system availability: 99.90% per month
Required target overall system availability: 99.99% per month

In order to meet this target the service provider is likely to install a secondary
redundant system so if one system fails the secondary system takes over.
[Diagram: target system availability of 99.99%, achieved with a primary system (99.90% availability) and a secondary system (99.90% availability)]

The theoretical availability of such a system, according to the standard redundancy formula, is:

\[ \text{Availability} = 1 - P^{\,n} \]

where
P is the probability of a component failing, assuming that each component fails independently (99.90% availability = 0.1% unavailability = 0.001)
n is the number of redundant members of the system (2 in this case).
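Plugging in the numbers confirms that the design comfortably meets the 99.99% target:

\[ 1 - (0.001)^2 = 0.999999 = 99.9999\% \]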

Consider a situation in which the primary system fails for 10 hours (98.61% availability over a 30-day month), but the secondary takes over and service is not disrupted.

              Primary  Secondary
Availability  98.61%   100%

The calculated overall system availability is 100%, which, as the service was not disrupted, appears to be perfectly reasonable. Now consider a situation where both systems had an outage during the month but on different days, so again the service was unaffected.

              Primary  Secondary
Availability  98.61%   95.00%

The calculated system availability is now 99.93% (1 − 0.0139 × 0.05); however, as the overall service was clearly not affected, this should in reality be 100%.


The reason for this discrepancy is that the formula for calculating the theoretical
availability of a redundant system, as shown above, is not applicable when
reporting actual performance achieved. Reporting the correct achieved service
performance requires further checks to identify any overlap in the downtimes of
each system. In maths jargon, we need to calculate the intersection of the downtime
of all the redundant sub-systems to work out the actual availability.

\[ \text{Availability} = 1 - \frac{\left| D_1 \cap D_2 \cap \dots \cap D_n \right|}{T} \]

where
D_i is the set of downtime intervals of member i of the redundant system
n is the number of members in the redundant system
T is the length of the measurement period.

If there is no overlap in downtime, then even if one member was down for the entire period, the overall achieved system availability would remain 100%, which correctly reflects the actual service performance.
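One way to implement this check, sketched in Python under the assumption that each member's downtime is recorded as a list of (start, end) interval tuples; the interval values below are illustrative, chosen to match the 10-hour and 36-hour outages in the example:

```python
# Each member's downtime as (start, end) intervals, in hours
# from the start of a 720-hour (30-day) month.
primary_downtime = [(100.0, 110.0)]    # down for 10 hours (98.61% available)
secondary_downtime = [(300.0, 336.0)]  # down for 36 hours (95.00% available)

def intersect(a, b):
    """Return the intervals during which both members were down."""
    overlaps = []
    for a_start, a_end in a:
        for b_start, b_end in b:
            start, end = max(a_start, b_start), min(a_end, b_end)
            if start < end:
                overlaps.append((start, end))
    return overlaps

PERIOD_HOURS = 720.0
shared = intersect(primary_downtime, secondary_downtime)
shared_hours = sum(end - start for start, end in shared)

# The service is only down when every redundant member is down at once.
availability = 1.0 - shared_hours / PERIOD_HOURS
print(f"Achieved availability: {availability:.2%}")  # 100.00% - no overlap
```

With these example intervals the two outages never coincide, so the achieved service availability is 100% even though each member individually fell below its 99.90% design figure.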

Measurement period
One classic area for inadvertent or sometimes deliberate SLA manipulation is extending the metric measurement period. Consider the three SLA targets below:
1. System performance with a maximum of one outage per day
2. System performance with a maximum of one outage per month
3. System performance with a maximum of one outage per year

It is clear that the third target, which extends the measurement period to a year, is a much tougher target for the service provider to achieve. However, consider the targets below:
1. System availability of 99.95% per day
2. System availability of 99.95% per month
3. System availability of 99.95% per year

The third target again appears to be the toughest to achieve; however, upon closer inspection the opposite is actually true.

Availability  Downtime over a  Downtime over a    Downtime over a
              day (minutes)    month (minutes)    year (minutes)
99.95%        0.72             21.6               262.8
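The allowed downtime is simply the measurement period multiplied by the permitted unavailability; over a year, for example:

\[ 525{,}600 \text{ min} \times (1 - 0.9995) = 262.8 \text{ min} \]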

Over a day this allows only 0.72 minutes of downtime; over a year, however, it allows over 262 minutes. In reality, if a system had an outage, which hopefully would be a rare occurrence, the provider could and should resolve the issue within the hour. If such a failure occurred a few times a year, the service provider would still easily meet a performance target measured over the whole year. Under a shorter measurement period, such as a day or a month, the same outage would more accurately be identified as an SLA performance failure and would likely incur penalties.
One point to note: if the metric counts downtime events (e.g. a maximum number of outages), a longer measurement period makes the target harder to meet, whereas if the metric is an uptime percentage, a longer measurement period makes the target easier to meet.


Conclusion
These days the SLA service performance metrics used in IT outsourcing contracts are usually well designed, well defined and generally fit for purpose. Typically, the only areas that are changed or negotiated are the specific targets and penalties; the metrics themselves remain unchanged.

"When it comes to SLAs, it is worth spending the time to get them right." – CIO magazine

Unfortunately, while the performance metrics themselves may be satisfactory and truly reflect business priorities, one area that is often overlooked is how each service performance metric is measured, calculated and reported. An incorrect measurement or calculation method can make the reported service performance for an SLA metric unfit for purpose.
Things to consider when evaluating SLA metrics and setting targets:
1. Include information about how the performance metric will be measured, calculated and reported in the SLA metric definition.
2. Ensure that averages of averages are not used in any metric calculation. Weighted averages are a good alternative.
3. For large systems, be clear on whether the components or sub-systems work in series (the whole service fails if any part fails) or in parallel (each works independently).
4. For redundant systems, understand the design criteria used and ensure that the measurement, calculation and reporting method is clearly documented.
5. Ensure that the measurement period for each SLA metric truly reflects service requirements. In general, the longer the measurement period, the easier an availability target is to achieve.

For any business-critical service it is crucial to understand the measurement, calculation and reporting methodology, and it is also wise to run some what-if scenarios before negotiating, setting and agreeing SLA targets with IT outsourcing providers.
