SLA Metrics, Measurement and Manipulation
Introduction
A lot has already been written about developing good SLA metrics for IT
outsourcing contracts. Most of it is common sense, and as IT outsourcing has
become commonplace, SLAs have been defined and refined to the point where
many standard SLA templates exist that can be used with few changes. In general,
all good SLA metrics share the same handful of basic characteristics.
If SLA metrics are well thought through and follow these basic rules, why do IT
managers sometimes find that the SLA report appears to show that service
performance is within target even when business-impacting service disruptions
occur? The most likely reason for this discrepancy is that the SLA metric
measurement or calculation is incorrect and therefore does not accurately reflect
the service performance required by the business.
This white paper highlights some common areas where SLA metric miscalculations
occur, inadvertently or sometimes deliberately, leading to reports that reflect
better service performance than was actually delivered. It also provides guidance
on how to identify and correct, or avoid, these anomalies.
info@xceedgroup.com www.xceedgroup.com
Xceed Consultancy Services Ltd 1 Alie Street, London E1 8DE, Reg in England 04965100
Application Response Time
Definition:
The application response time is the time in milliseconds it takes for a transaction to be
processed by the application server. It is measured from the time the request is recorded at
the ingress point of the server to the time the response is sent out by the application and is
recorded at the egress point of the server. This metric is reported on a weekly basis.
Target:
Less than 25 milliseconds average (presumably this target is deemed acceptable for this
application).
The provider is likely to have an automated system or tool that monitors and tracks
this metric. Typically the system will log the response times of all transactions each
day, calculate the daily average and save it in a data collection system (a simple
text file or, more likely, a spreadsheet). The weekly averages are then calculated
from the daily averages for reporting purposes. This method of measurement and
reporting is very practical for the service provider, and not uncommon.
In a typical month the metric measurements may look something like this:

[Table: for each day, the number of transactions processed and the daily average
response time in milliseconds, with daily averages in the region of 19 to 22 ms.]
Averages of averages
The problem with the above calculation, as any 15-year-old who has just covered it
at school will tell you, is that an average of averages is only meaningful when every
group is the same size. Unfortunately this basic arithmetic is sometimes forgotten
when practicalities get in the way: it seems natural to average seven values, one for
each day of the week, to obtain a weekly figure, forgetting that each daily value is
itself an average over a different number of transactions. To address this type of
miscalculation, all SLA metrics should include a very specific measurement and
calculation definition. For example:
Measurement: The average response time will be calculated by taking the sum of the response times
of every transaction over the week and dividing by the total number of transactions in the week.
Mathematically:

    Average response time = (t1 + t2 + ... + tN) / N

Where ti is the response time of transaction i and N is the total number of transactions
processed during the week.
If calculating the average in this way is not possible or practical for the service
provider, due to limitations of their automated data collection systems, then the metric
should be re-defined: for example, changed to a daily average, or changed to use a
weighted average:

    Weighted weekly average = (n1 × a1 + n2 × a2 + ... + n7 × a7) / (n1 + n2 + ... + n7)

Where ad is the average response time on day d and nd is the number of transactions
processed on day d.
In the above example, using the correct calculation method would give a weekly
average response time of 25.13 ms, which would accurately (and more fairly) reflect
the service provider's performance for the week.
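The difference between the two calculation methods can be sketched in a few lines. The figures below are hypothetical (not the values from the example above); they are chosen so that one high-volume day with poor response times shows how the two methods diverge:

```python
# Hypothetical daily data: transaction counts and daily average response times (ms).
# Day 5 is a high-volume day with poor response times.
daily_counts = [1_000, 1_000, 1_000, 1_000, 50_000, 1_000, 1_000]
daily_avgs_ms = [20.0, 20.0, 20.0, 20.0, 60.0, 20.0, 20.0]

# Average of averages: every day counts equally, regardless of volume.
avg_of_avgs = sum(daily_avgs_ms) / len(daily_avgs_ms)

# Weighted average: each day's average is weighted by its transaction count,
# which equals the true per-transaction average for the week.
weighted_avg = sum(n * a for n, a in zip(daily_counts, daily_avgs_ms)) / sum(daily_counts)

print(f"Average of averages: {avg_of_avgs:.2f} ms")   # 25.71 ms
print(f"Weighted average:    {weighted_avg:.2f} ms")  # 55.71 ms
```

The average of averages badly understates the experience of the majority of transactions, which occurred on the slow, busy day.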
Although weighted averages work well for metrics such as response times, they
are not always appropriate for measuring metrics such as availability. Consider the
availability values below:
[Table: daily transaction counts of 100 000, 99 000, 98 000, 101 000, 130 000,
101 000 and 101 000, each day with an availability just below 100%.]

A weighted average gives a perfectly acceptable and fair weekly average of
99.996%. However, suppose that the service on one of the days was down all day
and caused very severe business impact:
[Table: the same daily pattern, but with one day showing 0% availability and,
because the service was down, few or no transactions.]

Because the weighting is done by transaction count, and few or no transactions are
processed while the service is down, the outage day contributes almost nothing to
the weighted result: the reported weekly availability remains close to 100% and
completely masks a full-day, business-impacting outage. For availability metrics, an
unweighted daily average, or a straight uptime-over-total-time calculation, reflects
the impact far more honestly.
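The masking effect can be sketched as follows, with hypothetical numbers in which day 5 is down for the whole day and therefore processes no transactions:

```python
# Hypothetical week: day 5 is down all day, so it processes no transactions.
daily_counts = [100_000, 99_000, 98_000, 101_000, 0, 101_000, 101_000]
daily_avail = [100.0, 100.0, 100.0, 100.0, 0.0, 100.0, 100.0]  # percent

# Transaction-weighted average: the outage day carries zero weight,
# so the severe outage vanishes from the reported figure.
weighted = sum(n * a for n, a in zip(daily_counts, daily_avail)) / sum(daily_counts)

# A simple (unweighted) daily average reflects the lost day.
simple = sum(daily_avail) / len(daily_avail)

print(f"Weighted weekly availability: {weighted:.2f}%")  # 100.00%
print(f"Simple weekly availability:   {simple:.2f}%")    # 85.71%
```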
Large systems
Another common area for incorrect SLA calculation and reporting occurs when
calculating a metric for a large or distributed system. Take the following example:
System Availability
Definition:
The amount of time the system is up and running and is able to process transactions
expressed as a percentage over a one month period.
Target:
99.90% or better (presumably this target is deemed acceptable for this system).
[Diagram: three components in series, with monthly availabilities of 99.90%,
99.90% and 100% respectively.]
In this case it is tempting to report the SLA as the average of the three components
(99.93%); however, this would be incorrect. For a system where each component
is critical to the overall service, the correct availability is the product of the
component availabilities:
    System availability = A1 × A2 × ... × An

Where An is the availability of each component over the measurement period. For
the example above this gives 99.90% × 99.90% × 100% = 99.80%. Clearly this is a
different value, and one that more accurately reflects the service performance
achieved over the measurement period.
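The serial-system calculation can be sketched in a few lines, using the component availabilities from the example above:

```python
import math

# Monthly availabilities of three components in series
# (each component is critical to the overall service).
components = [0.9990, 0.9990, 1.0000]

# Tempting but incorrect: averaging the components.
average = sum(components) / len(components)  # (0.9990 + 0.9990 + 1.0) / 3

# Correct for a serial system: the product of the availabilities.
product = math.prod(components)              # 0.9990 * 0.9990 * 1.0

print(f"Average of components: {average:.4%}")  # 99.9333%
print(f"Product (correct):     {product:.4%}")  # 99.8001%
```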
Redundant systems
Most IT infrastructure include redundancy from firewalls and circuits to servers
and switches, in order to meet specific availability (uptime) performance levels.
However, designing a system to meet specific availability performance is very
different to measuring and reporting achieved availability levels. Consider this
typical situation:
Individual system availability: 99.90% per month
Required overall system availability: 99.99% per month
In order to meet this target the service provider is likely to install a secondary,
redundant system, so that if the primary system fails the secondary takes over.

[Diagram: a primary system and a secondary system, each with 99.90% availability,
combining to meet a target system availability of 99.99%.]

The theoretical availability of such a redundant pair is:

    Overall availability = 1 − (1 − Aprimary) × (1 − Asecondary)

Where Aprimary and Asecondary are the availabilities of the two systems. With
99.90% for each, this gives 1 − (0.001 × 0.001) = 99.9999%, comfortably above the
99.99% target.
Consider a situation where the primary system fails for 10 hours (98.61%
availability) but the secondary takes over and service is not disrupted:

    Primary availability: 98.61%    Secondary availability: 100%
The calculated overall system availability is 100% which, as the service was not
disrupted, appears to be perfectly reasonable. Now consider a situation where both
systems had an outage during the month, but on different days, so that again the
service was unaffected:

    Primary availability: 98.61%    Secondary availability: 95.00%

The calculated system availability is now 99.93%; however, as the overall service
was clearly not affected, this should in reality be 100%.
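A quick sketch of the standard formula for a redundant pair, 1 − (1 − Aprimary)(1 − Asecondary), applied to the two scenarios above, makes the discrepancy concrete:

```python
def redundant_availability(primary: float, secondary: float) -> float:
    """Theoretical availability of a redundant pair: the service is down only
    when both systems are down at once (assumes failures are independent and
    downtime is spread uniformly over the period)."""
    return 1 - (1 - primary) * (1 - secondary)

# Scenario 1: primary down 10 hours, secondary never down.
print(f"{redundant_availability(0.9861, 1.00):.4%}")  # 100.0000%

# Scenario 2: both systems had outages, but on different days, so the
# service was never actually interrupted. The formula still reports a
# shortfall of about 0.07%, even though real availability was 100%.
print(f"{redundant_availability(0.9861, 0.95):.4%}")  # 99.9305%
```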
The reason for this discrepancy is that the formula for calculating the theoretical
availability of a redundant system, as shown above, is not applicable when
reporting actual performance achieved. Reporting the correct achieved service
performance requires further checks to identify any overlap in the downtimes of
each system. In maths jargon, we need to calculate the intersection of the downtime
of all the redundant sub-systems to work out the actual availability.
    Achieved availability = (T − |D1 ∩ D2 ∩ ... ∩ Dn|) / T

Where T is the total measurement period and Dn is the set of time intervals during
which system n was down; the intersection is the time during which every redundant
system was down simultaneously.
If there is no overlap in downtime, then even if one system was down for the entire
period, the overall achieved system availability would remain 100%, which correctly
reflects the actual service performance.
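The intersection approach can be sketched as follows. The downtime intervals are hypothetical, expressed as (start hour, end hour) within a 720-hour month, and the code assumes each system's own intervals do not overlap one another:

```python
# Achieved availability of a redundant pair: the service is only down when
# the downtime windows of every redundant system overlap.
PERIOD_HOURS = 720.0  # a 30-day month

def overlap(a: tuple, b: tuple) -> float:
    """Length of the intersection of two (start, end) intervals; 0 if disjoint."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def achieved_availability(primary_down: list, secondary_down: list) -> float:
    # Summing pairwise overlaps is correct provided the intervals within
    # each list are themselves disjoint.
    joint_downtime = sum(overlap(p, s) for p in primary_down for s in secondary_down)
    return (PERIOD_HOURS - joint_downtime) / PERIOD_HOURS

# Outages on different days: no overlap, so achieved availability is 100%.
print(achieved_availability([(100, 110)], [(300, 336)]))  # 1.0

# Partially overlapping outages: only the 5 overlapping hours count as
# service downtime, giving (720 - 5) / 720.
print(achieved_availability([(100, 110)], [(105, 141)]))
```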
Measurement period
One classic area for inadvertent, or sometimes deliberate, SLA manipulation is
extending the metric measurement period. Consider first three targets that each
allow the same fixed amount of downtime, measured respectively per day, per
month and per year. It is clear that the third target, which extends the
measurement period over the year, is a much harder target for the service
provider to achieve. However, consider the
targets below, each specifying the same availability percentage over a different
measurement period:
1. 99.95% availability measured over a day
2. 99.95% availability measured over a month
3. 99.95% availability measured over a year
The third target again appears to be the toughest to achieve; however, upon closer
inspection the opposite is actually true.
Availability   Downtime over a day   Downtime over a month   Downtime over a year
99.95%         0.72 minutes          21.6 minutes            262.8 minutes
                                     (30-day month)
Over a day this allows 0.72 minutes of downtime; over a year, however, it allows over
262 minutes. In reality, if a system had an outage, which hopefully would be a rare
occurrence, the provider could and should easily resolve the issue within the hour. If
this failure occurred a few times a year, the service provider would still easily meet
the performance target if it were measured over a whole year. Under a shorter
measurement period, such as a day or a month, each such outage would more
accurately be identified as an SLA performance failure and would likely incur penalties.
One point to note: if the metric is expressed as an absolute amount of downtime,
then the longer the measurement period, the higher the service performance
required; if the metric is expressed as an uptime percentage, then the shorter the
measurement period, the higher the service performance required.
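The downtime implied by a percentage target over different periods is simple to compute; a short sketch using the 99.95% figure from the table above:

```python
# Allowed downtime implied by a percentage availability target over
# different measurement periods (all values in minutes).
MIN_PER_DAY = 24 * 60              # 1 440
MIN_PER_MONTH = 30 * MIN_PER_DAY   # 43 200 (30-day month)
MIN_PER_YEAR = 365 * MIN_PER_DAY   # 525 600

def allowed_downtime(target: float, period_minutes: float) -> float:
    """Minutes of downtime permitted while still meeting the target."""
    return (1 - target) * period_minutes

for label, period in [("day", MIN_PER_DAY),
                      ("month", MIN_PER_MONTH),
                      ("year", MIN_PER_YEAR)]:
    print(f"99.95% over a {label}: {allowed_downtime(0.9995, period):.1f} min")
# 99.95% over a day:   0.7 min
# 99.95% over a month: 21.6 min
# 99.95% over a year:  262.8 min
```

The same hour-long outage that breaches a daily target three or four times over can disappear entirely inside an annual one.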
Conclusion
These days the SLA service performance metrics used in IT outsourcing contracts
are usually well designed, well defined and generally fit for purpose. Typically the
only areas that are changed or negotiated are the specific targets and penalties;
the metrics themselves remain unchanged.
Unfortunately, even where the performance metrics are satisfactory and truly reflect
business priorities, one area that is often overlooked is how each service
performance metric is measured, calculated and reported. An incorrect
measurement or calculation method can make the reported service performance
for an SLA metric unfit for purpose.
Things to consider when evaluating SLA metrics and setting targets:
1. Ensure each metric includes a precise definition of how it will be measured,
calculated and reported, not just what is measured and the target value.
2. Ensure that averages of averages are not used in any metric calculation.
Weighted averages are a good alternative.
3. For large or distributed systems where every component is critical, ensure
overall availability is calculated as the product of the component
availabilities, not their average.
4. For redundant systems, understand the design criteria used and ensure that the
measurement, calculation and reporting method is clearly documented.
5. Ensure that the measurement period for each SLA metric truly reflects service
requirements. In general, the longer the measurement period the easier the
target is to achieve.