Problem Management
■ Badly designed or operated back-end fulfilment processes that are incapable of dealing with the volume or nature of the requests being made
■ Inadequate monitoring capabilities so that accurate metrics cannot be gathered.

4.4 PROBLEM MANAGEMENT

Problem management is the process responsible for managing the lifecycle of all problems. ITIL defines a ‘problem’ as the underlying cause of one or more incidents.

4.4.1 Purpose and objectives

4.4.1.1 Purpose
The purpose of problem management is to manage the lifecycle of all problems from first identification through further investigation, documentation and eventual removal. Problem management seeks to minimize the adverse impact of incidents and problems on the business that are caused by underlying errors within the IT infrastructure, and to proactively prevent recurrence of incidents related to these errors. In order to achieve this, problem management seeks to get to the root cause of incidents, document and communicate known errors and initiate actions to improve or correct the situation.

4.4.1.2 Objectives
The objectives of the problem management process are to:
■ Prevent problems and resulting incidents from happening
■ Eliminate recurring incidents
■ Minimize the impact of incidents that cannot be prevented.

4.4.2 Scope
Problem management includes the activities required to diagnose the root cause of incidents and to determine the resolution to those problems. It is also responsible for ensuring that the resolution is implemented through the appropriate control procedures, especially change management and release and deployment management.

Problem management will also maintain information about problems and the appropriate workarounds and resolutions, so that the organization is able to reduce the number and impact of incidents over time. In this respect, problem management has a strong interface with knowledge management, and tools such as the KEDB will be used for both.

Although incident and problem management are separate processes, they are closely related and will typically use the same tools, and may use similar categorization, impact and priority coding systems. This will ensure effective communication when dealing with related incidents and problems.

The problem management process has both reactive and proactive aspects:
■ Reactive problem management is concerned with solving problems in response to one or more incidents.
■ Proactive problem management is concerned with identifying and solving problems and known errors before further incidents related to them can occur again.

While reactive problem management activities are performed in reaction to specific incident situations, proactive problem management activities take place as ongoing activities targeted to improve the overall availability and end user satisfaction with IT services. Examples of proactive problem management activities might include:
■ Conducting periodic scheduled reviews of incident records to find patterns and trends in reported symptoms that may indicate the presence of underlying errors in the infrastructure.
■ Conducting major incident reviews where review of ‘How can we prevent the recurrence?’ can provide identification of an underlying cause or error.
■ Conducting periodic scheduled reviews of operational logs and maintenance records identifying patterns and trends of activities that may indicate an underlying problem might exist.
■ Conducting periodic scheduled reviews of event logs targeting patterns and trends of warning and exception events that may indicate the presence of an underlying problem.
■ Conducting brainstorming sessions to identify trends that could indicate the existence of underlying problems.
■ Using check sheets to proactively collect data on service or operational quality issues that may help to detect underlying problems.
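
As a rough illustration of the periodic incident-record review described in this section, the sketch below groups incident records by configuration item and reported symptom, then flags any pair that recurs at least a chosen number of times as a candidate for a problem record. The record fields (`ci`, `symptom`), the function name, the sample data and the threshold of three are all assumptions made for this example; the text does not prescribe any of them.

```python
from collections import Counter

def find_candidate_problems(incidents, min_occurrences=3):
    """Return ((ci, symptom), count) pairs that recur often enough to review.

    `incidents` is an iterable of dicts with the hypothetical keys
    "ci" and "symptom"; `min_occurrences` is an arbitrary threshold.
    """
    counts = Counter((rec["ci"], rec["symptom"]) for rec in incidents)
    # Keep only pairs at or above the threshold, most frequent first.
    return [(pair, n) for pair, n in counts.most_common() if n >= min_occurrences]

# Invented sample records, for illustration only.
incidents = [
    {"ci": "mail-server-01", "symptom": "timeout"},
    {"ci": "mail-server-01", "symptom": "timeout"},
    {"ci": "print-queue-07", "symptom": "job stuck"},
    {"ci": "mail-server-01", "symptom": "timeout"},
]
candidates = find_candidate_problems(incidents)
print(candidates)
```

A flagged pair is only a prompt for investigation; whether it reflects a real underlying error still has to be judged by the people doing the review.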
98 | Service operation processes
Reactive and proactive problem management activities are generally conducted within the scope of service operation. A close relationship exists between proactive problem management activities and CSI lifecycle activities that directly support identifying and implementing service improvements. Proactive problem management supports those activities through trending analysis and the targeting of preventive action. Identified problems from these activities will become input to the CSI register used to record and manage improvement opportunities.

Further information on CSI activities can be found in ITIL Continual Service Improvement.

4.4.3 Value to business
The value of problem management includes:
■ Higher availability of IT services by reducing the number and duration of incidents that those services may incur. Problem management works together with incident management and change management to ensure that IT service availability and quality are increased. When incidents are resolved, information about the resolution is recorded. Over time, this information is used to speed up the resolution time and identify permanent solutions, reducing the number and resolution time of incidents.
■ Higher productivity of IT staff by reducing unplanned labour caused by incidents and creating the ability to resolve incidents more quickly through recorded known errors and workarounds.
■ Reduced expenditure on workarounds or fixes that do not work.
■ Reduction in cost of effort in fire-fighting or resolving repeat incidents.

4.4.4 Policies, principles and basic concepts

4.4.4.1 Policies
Examples of problem management policies might include:
■ Problems should be tracked separately from incidents. This will provide clear separation between many problem management activities that are proactive and incident management activities that are mostly reactive. This implies that technology capabilities may need to be in place to track problems separately from incidents.
■ All problems should be stored and managed in a single management system. This provides a definitive recognized source for problem information and supports easier access for reporting and investigation efforts. It implies that problem management records are kept. Supporting technologies should be well integrated throughout the business and interface easily to other service management technologies that use or provide problem-related information.
■ All problems should subscribe to a standard classification schema that is consistent across the business enterprise. This provides for faster access to problem and investigative information. It provides better support for problem management diagnostic and proactive trending activities. It implies that a well defined and communicated set of problem classification categories is in place.

4.4.4.2 Principles and basic concepts
There are some important concepts of problem management that must be taken into account from the outset. These include:

Reactive and proactive problem management activities
Both reactive and proactive problem management activities seek to raise problems, manage them through the problem management process, find the underlying causes of the incidents they are associated with and prevent future recurrences of those incidents. The difference between reactive and proactive problem management lies in how the problem management process is triggered:
■ With reactive problem management, process activities will typically be triggered in reaction to an incident that has taken place. Reactive problem management complements incident management activities by focusing on the underlying cause of an incident to prevent its recurrence and identifying workarounds when necessary.
■ With proactive problem management, process activities are triggered by activities seeking to improve services. One example might be trend analysis activities to find common underlying
causes of historical incidents that took place to prevent their recurrence. Proactive problem management complements CSI activities by helping to identify workarounds and improvement actions that can improve the quality of a service.

By redirecting the efforts of an organization from reacting to large numbers of incidents to preventing incidents, an organization provides a better service to its customers and makes more effective use of the available resources within the IT support organization.

Problem models
Many problems will be unique and will require handling in an individual way – but it is conceivable that some incidents may recur because of dormant or underlying problems (for example, where the cost of a permanent resolution will be high and a decision has been taken not to go ahead with an expensive solution but to ‘live with’ the problem). As well as creating a known error record in the KEDB (see section 4.4.5.7) to ensure quicker diagnosis, the creation of a problem model for handling such problems in the future may be helpful. This is very similar in concept to the idea of incident or request models described in previous chapters, but applied to problems.

Incidents versus problems
An incident is an unplanned interruption to an IT service or reduction in the quality of an IT service. A problem presents a different view of an incident by understanding its underlying cause, which may also be the cause of other incidents. Incidents do not ‘become’ problems. While incident management activities are focused on restoring services to normal state operations, problem management activities are focused on finding ways to prevent incidents from happening in the first place. It is quite common to have incidents that are also problems.

The rules for invoking problem management during an incident can vary and are at the discretion of individual organizations. Some general situations where it may be desired to invoke problem management during an incident might include situations where:
■ Incident management cannot match an incident to existing problems and known errors
■ Trend analysis of logged incidents reveals an underlying problem might exist
■ A major incident has occurred where problem management activities need to be undertaken to identify the root cause
■ Other IT functions identify that a problem condition exists
■ The service desk may have resolved an incident but has not determined a definitive cause and suspects that it is likely to recur
■ Analysis of an incident by a support group which reveals that an underlying problem exists, or is likely to exist
■ A notification from a supplier that a problem exists that has to be resolved.

4.4.4.3 Problem analysis techniques
There are many problem analysis, diagnosis and solving techniques available and much research has been done in this area. Examples of frequently used techniques are given below.

Chronological analysis
When dealing with a difficult problem, there may be conflicting reports about exactly what has happened and when. It is therefore very helpful briefly to document all events in chronological order, to provide a timeline of events. This often makes it possible to see which events may have been triggered by others – or to discount any claims that are not supported by the sequence of events.

Pain value analysis
This is where a broader view is taken of the impact of an incident or problem, or incident/problem type. Instead of just analysing the number of incidents/problems of a particular type in a particular period, a more in-depth analysis is done to determine exactly what level of pain has been caused to the organization/business by these incidents/problems. A formula can be devised to calculate this pain level. Typically this might include taking into account:
■ The number of people affected
■ The duration of the downtime caused
■ The cost to the business (if this can be readily calculated or estimated).

By taking all of these factors into account, a much more detailed picture of those incidents/problems
or incident/problem types that are causing most pain can be determined, to allow a better focus on those things that really matter and deserve the highest priority when determining resolution actions.

Kepner and Tregoe
Charles Kepner and Benjamin Tregoe developed a useful way of problem analysis that can be used formally to investigate deeper-rooted problems. They defined the following stages:
■ Defining the problem
■ Describing the problem in terms of identity, location, time and size
■ Establishing possible causes
■ Testing the most probable cause
■ Verifying the true cause.

The method is described in more detail in Appendix C.

Brainstorming
It can often be valuable to gather together the relevant people, either physically or by electronic means, and to ‘brainstorm’ the problem, with people throwing in ideas on what the potential cause may be and potential actions to resolve the problem. Brainstorming sessions can be very constructive and innovative but it is equally important that someone, perhaps the problem manager, documents the outcome and any agreed actions and keeps a degree of control in the session(s).

5-Whys
This simple yet highly effective approach is helpful as a way to get to the underlying root cause of a problem. It works by starting out with a description of what event took place and then asking ‘why this occurred’. The resulting answer is given, followed by another round of ‘why this occurred’. Usually by the fifth iteration, a true root cause will have been found.

Fault isolation
This approach involves re-executing the transactions or events that led to a problem in a careful stepwise fashion, one CI at a time, until the CI at fault is identified. The re-execution effort moves to the first CI encountered at the start of the transaction or event, which is then checked for correct operation. The effort then moves to the next CI in the chain of events, which in turn is checked, the next CI and then the next until a fault is encountered. If the fault cannot be recreated, a variation of this technique can be tried that involves interrogating the healthy state of the CIs involved with the transaction or event. For example, if one CI is deemed to be at fault, all other CIs in the transaction or event path from source to destination are probed for health.

Affinity mapping
This technique can be used to organize large amounts of data (ideas, opinions, issues) into groupings based on common characteristics. It is typically performed in a brainstorming session with key support staff. Key concepts, such as potential solutions, are written on individual cards and stuck to a wall or whiteboard. Participants and/or the facilitator should then move the cards so that they are grouped by similar traits. A ‘header’ should then be developed for each group for future identification. Each of the cards under the ‘header’ should be examined for potential of a root cause that may underlie all of them.

Hypothesis testing
This method can be used to generate a list of possible root causes based on educated guessing and then determining whether each hypothesis is true or false. Educated guesses may relate to relationships between variables or potential root causes of a problem. Using information gathered from incidents and other operational information, a team is assembled to brainstorm a list of potential causes that may be underlying the incidents being studied. Each cause is then converted into testable statements or hypotheses and assigned to one or more support staff. Further data should then be gathered as needed for each assigned statement and an appropriate analysis performed to accept or reject each hypothesis.

Technical observation post
In some cases problems may be linked to incidents that occur intermittently for unknown reasons or causes. This approach consists of a prearranged gathering of specialist technical support staff from within the IT support organization brought together to focus on a specific problem. Its purpose is to monitor events, real-time as they occur, with the specific aim of catching and identifying the specific situation and possible causes for the problem.
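
The text on pain value analysis earlier in this section says only that a formula can be devised from the number of people affected, the duration of downtime and the cost to the business; it does not give one. The sketch below shows one possible form, a simple weighted sum, purely as an illustration. The class and function names, the weights and the sample figures are all assumptions, and an organization would need to choose its own factors and scaling.

```python
from dataclasses import dataclass

@dataclass
class ImpactRecord:
    label: str
    people_affected: int
    downtime_hours: float
    business_cost: float  # estimated cost, in the organization's own currency

def pain_value(rec, w_people=1.0, w_downtime=10.0, w_cost=0.01):
    """Illustrative pain score: a weighted sum of the three factors.

    The weights are arbitrary placeholders, not values taken from ITIL.
    """
    return (w_people * rec.people_affected
            + w_downtime * rec.downtime_hours
            + w_cost * rec.business_cost)

# Invented sample figures, for illustration only.
records = [
    ImpactRecord("Printer queue stalls", 40, 2.0, 500.0),
    ImpactRecord("Billing run fails", 5, 8.0, 20000.0),
]
# Rank incident/problem types so the most painful comes first.
ranked = sorted(records, key=pain_value, reverse=True)
for rec in ranked:
    print(f"{rec.label}: {pain_value(rec):.1f}")
```

With these placeholder weights the billing failure ranks above the printer issue even though it affects fewer people, which is the kind of shift in focus the technique is meant to surface.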
Table 4.2 Problem situations and the most useful techniques for identifying root causes

Problem situation | Suggested analysis techniques
Complex problems where a sequence of events needs to be assembled to determine exactly what happened | Chronological analysis; Technical observation post
Uncertainty over which problems should be addressed first | Pain value analysis; Brainstorming
Uncertain whether a presented root cause is truly the root cause | 5-Whys; Hypothesis testing
Intermittent problems that appear to come and go and cannot be recreated or repeated in a test environment | Technical observation post; Kepner–Tregoe; Hypothesis testing; Brainstorming
Uncertainty over where to start for problems that appear to have multiple causes | Pareto analysis; Kepner–Tregoe; Ishikawa diagrams; Brainstorming
Struggling to identify the exact point of failure for a problem | Fault isolation; Ishikawa diagrams; Kepner–Tregoe; Affinity mapping; Brainstorming
Uncertain where to start when trying to find root cause | 5-Whys; Kepner–Tregoe; Brainstorming; Affinity mapping
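
For the first situation in Table 4.2, chronological analysis amounts to merging reports from different sources into a single timeline. A minimal sketch of that step follows, assuming each report is a tuple of an ISO 8601 timestamp, a source and a description; that layout, the function name and the sample events are invented for the example.

```python
from datetime import datetime

def build_timeline(reports):
    """Merge event reports from several sources into one list sorted by time.

    Each report is a hypothetical (iso_timestamp, source, description) tuple.
    """
    return sorted(reports, key=lambda report: datetime.fromisoformat(report[0]))

# Invented sample reports, for illustration only.
reports = [
    ("2024-03-01T09:15:00", "service desk", "Users report slow logins"),
    ("2024-03-01T08:50:00", "event management", "Disk latency warning on db-01"),
    ("2024-03-01T09:40:00", "network team", "Switch restarted"),
]
timeline = build_timeline(reports)
for when, source, what in timeline:
    print(when, source, what)
```

Reading the sorted list makes it easier to see which events could have triggered others, and which claimed causes occurred too late to be plausible.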
[Figure 4.7 Problem management process flow. Recoverable labels: inputs from the service desk, event management, incident management, proactive problem management, and supplier or contractor; steps of problem detection, problem logging, problem categorization, problem prioritization, problem investigation and diagnosis (with the CMS), problem resolution, a ‘Resolved?’ decision (Yes/No) and problem closure, ending at End; links to the service knowledge management system and continual service improvement.]
raise incidents that have to be re-diagnosed and resolved all over again!

4.4.5 Process activities, methods and techniques
The problem management process flow for handling a recognized problem is shown in Figure 4.7. This is a simplified chart to show the normal process flow, but in reality some of the states may be iterative or variations may have to be made in order to handle particular situations. For example, proactive problem management activities may raise new problem records which in turn can become input to this process flow.

this case, a problem record is raised once the underlying trend or cause is discovered.
■ Activities taken to improve the quality of a service that result in the need to raise a problem record to identify further improvement actions that should be taken.

Frequent and regular analysis of incident and problem data must be performed to identify any trends as they become discernible. This will require meaningful and detailed categorization of incidents/problems and regular reporting of patterns and areas of high occurrence. ‘Top ten’ reporting, with drill-down capabilities to lower levels, is useful in identifying trends.
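
One way to picture ‘top ten’ reporting with drill-down is a ranked list of incident categories, each carrying a breakdown by subcategory. The sketch below assumes incidents have already been categorized into (category, subcategory) pairs; the function name, the two-level scheme and the sample data are assumptions for illustration, not a prescribed report format.

```python
from collections import Counter, defaultdict

def top_categories(incidents, n=10):
    """Return the n most frequent categories, each with its subcategory counts.

    `incidents` is an iterable of hypothetical (category, subcategory) tuples.
    """
    totals = Counter()
    drill_down = defaultdict(Counter)
    for category, subcategory in incidents:
        totals[category] += 1
        drill_down[category][subcategory] += 1
    # Each entry: (category, total count, [(subcategory, count), ...]).
    return [(cat, count, drill_down[cat].most_common())
            for cat, count in totals.most_common(n)]

# Invented sample data, for illustration only.
incidents = [
    ("email", "mailbox full"), ("email", "cannot send"), ("email", "mailbox full"),
    ("network", "vpn drop"), ("printing", "job stuck"), ("network", "vpn drop"),
]
report = top_categories(incidents, n=2)
for cat, count, subs in report:
    print(cat, count, subs)
```

The usefulness of such a report depends on the categorization being meaningful and consistently applied, as the paragraph above notes.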
4.4.5.4 Problem prioritization
Problems should be prioritized the same way using the same reasons as incidents. The frequency and impact of related incidents must also be taken into account. The coding system described earlier in Table 4.1 (which combines incident impact with urgency to give an overall priority level) can also be used to prioritize problems. Definition and guidance should be provided to support staff on what constitutes a problem, and the related service targets for each priority code level in the table.

Problem prioritization should also take into account the severity of the problems. Severity in this context refers to how serious the problem is from a service or customer perspective as well as an infrastructure perspective, for example:
■ Can the system be recovered, or does it need to be replaced?
■ How much will it cost?
■ How many people, with what skills, will be needed to fix the problem?
■ How long will it take to fix the problem?
■ How extensive is the problem (e.g. how many CIs are affected)?

4.4.5.5 Problem investigation and diagnosis
At this stage, an investigation should be conducted to try to diagnose the root cause of the problem – the speed and nature of this investigation will vary depending upon the impact, severity and urgency of the problem – but the appropriate level of resources and expertise should be applied to finding a resolution commensurate with the priority code allocated and the service target in place for that priority level.

There are a number of useful problem-solving techniques that can be used to help diagnose and resolve problems, and these should be used as appropriate. Such techniques have been described in more detail in section 4.4.4.

The CMS must be used to help determine the level of impact and pinpoint and diagnose the exact point of failure. The KEDB should also be accessed and problem-matching techniques (such as keyword searches) should be used to see if the problem has occurred before and, if so, to find the resolution.

It is often valuable to try to recreate the failure to understand what has gone wrong, and then try various ways of finding the most appropriate and cost-effective resolution to the problem. It may be possible to recreate the problem in a test environment that mirrors the live environment. This allows for investigation and diagnosis activities to proceed effectively without causing further disruption to users.

4.4.5.6 Workarounds
In some cases it may be possible to find a workaround to the incidents caused by the problem – a temporary way of overcoming the difficulties. For example, a manual amendment may be made to an input file to allow a program to complete its run successfully and allow a billing process to complete satisfactorily, but it is important that work on a permanent resolution continues where this is justified – in this example the reason for the file becoming corrupted in the first place must be found and corrected to prevent this happening again.

When a workaround is found, it is therefore important that the problem record remains open and details of the workaround are documented within the problem record.

In some cases there may be multiple workarounds associated with a problem. As problem investigation and diagnosis activities carry on, there may be a series of improvements that do not resolve the problem, but lead to a progressive improvement in the quality of the workarounds available. These may impact on the prioritization of the problem as successive workaround solutions may reduce the impact of future related incidents, either by reducing their likelihood or improving the speed of their resolution.

4.4.5.7 Raising a known error record
A known error is defined as a problem with a documented root cause and workaround. The known error record should identify the problem record it relates to and document the status of actions being taken to resolve the problem, its root cause and workaround. All known error records should be stored in the KEDB. The KEDB and the way it should be used are described in more detail in section 4.4.7.2.

As soon as the diagnosis is complete, and particularly where a workaround has been found (even though it may not yet be a permanent
4.4.6 Triggers, inputs, outputs and interfaces

4.4.6.1 Triggers
With reactive problem management, the vast majority of problem records will be triggered in reaction to one or more incidents, and many will be raised or initiated via service desk staff. Other problem records, and corresponding known error records, may be triggered in testing, particularly the latter stages of testing such as user acceptance testing/trials (UAT), if a decision is made to go ahead with a release even though some faults are known. Suppliers may trigger the need for some problem records through the notification of potential faults or known deficiencies in their products or services (e.g. a warning may be given regarding the use of a particular CI and a problem record may be raised to facilitate investigation by technical staff of the condition of such CIs within the organization’s IT infrastructure).

With proactive problem management, problem records may be triggered by identification of patterns and trends in incidents when reviewing historical incident records. A review of other sources such as operation logs, operation communications or event logs may also proactively trigger problem records when the appearance of an underlying issue becomes apparent.

4.4.6.2 Inputs
Examples of inputs to the problem management process may include:
■ Incident records for incidents that have triggered problem management activities
■ Incident reports and histories that will be used to support proactive problem trending
■ Information about CIs and their status
■ Communication and feedback about incidents and their symptoms
■ Communication and feedback about RFCs and releases that have been implemented or planned for implementation
■ Communication of events that were triggered from event management
■ Operational and service level objectives
■ Customer feedback on success of problem resolution activities and overall quality of problem management activities
■ Agreed criteria for prioritizing and escalating problems
■ Output from risk management and risk assessment activities.

4.4.6.3 Outputs
Examples of outputs from the problem management process may include:
■ Resolved problems and actions taken to achieve their resolution
■ Updated problem management records with accurate problem detail and history
■ RFCs to remove infrastructure errors
■ Workarounds for incidents
■ Known error records
■ Problem management reports
■ Output and improvement recommendations from major problem review activity.

4.4.6.4 Interfaces
The primary relationship between incident and problem management has been discussed in detail in section 4.4.4. Examples of other key interfaces are listed below for each service lifecycle stage.

Service strategy
■ Financial management for IT services: assists in assessing the impact of proposed resolutions or workarounds, as well as pain value analysis. Problem management provides management information about the cost of resolving and preventing problems, which is used as input into the budgeting and accounting systems and total cost of ownership calculations.

Service design
■ Availability management: is involved with determining how to reduce downtime and increase uptime. As such, it has a close relationship with problem management, especially the proactive areas. Much of the management information available in problem management will be communicated to availability management.
■ Capacity management: Some problems will require investigation by capacity management teams and techniques, e.g. performance issues. Capacity management will also help in assessing proactive measures. Problem management provides management information relative
to the quality of decisions made during the capacity planning process.
■ IT service continuity management: Problem management acts as an entry point into IT service continuity management where a significant problem is not resolved before it starts to have a major impact on the business.
■ Service level management: The occurrence of incidents and problems affects the level of service delivery measured by SLM. Problem management contributes to improvements in service levels, and its management information is used as the basis of some of the SLA review components. SLM also provides parameters within which problem management works, such as impact information and the effect on services of proposed resolutions and proactive measures.

Service transition
■ Change management: Problem management ensures that all resolutions or workarounds that require a change to a CI are submitted through change management through an RFC. Change management will monitor the progress of these changes and keep problem management advised. Problem management is also involved in rectifying the situation caused by failed changes.
■ Service asset and configuration management: Problem management uses the CMS to identify faulty CIs and also to determine the impact of problems and resolutions.
■ Release and deployment management: This process is responsible for deploying problem fixes out into the live environment. It also assists in ensuring that the associated known errors are transferred from the development KEDB into the live known error database. Problem management will help resolve problems caused by faults during the release process.
■ Knowledge management: The SKMS can be used to form the basis for the KEDB and hold or integrate with the problem records.

Continual service improvement
■ The seven-step improvement process: The occurrence of incidents and problems provides a basis for identifying opportunities for service improvement and adding them to the CSI register. Proactive problem management activities may also identify underlying problems and service issues that if addressed, can contribute to increases in service quality and end user/customer satisfaction.

4.4.7 Information management
Most information used in problem management comes from the following sources:

4.4.7.1 Configuration management system
The CMS will hold details of all of the components of the IT infrastructure, as well as the relationships between these components. It will act as a valuable source for problem diagnosis and for evaluating the impact of problems (e.g. if this disk is down, what data is on that disk; which services use that data; which users use those services?). As it will also hold details of previous activities, it can also be used as a valuable source of historical data to help identify trends or potential weaknesses – a key part of proactive problem management.

4.4.7.2 Known error database
The purpose of a KEDB is to allow storage of previous knowledge of incidents and problems – and how they were overcome – to allow quicker diagnosis and resolution if they recur.

The known error record should hold exact details of the fault and the symptoms that occurred, together with precise details of any workaround or resolution action that can be taken to restore the service and/or resolve the problem. An incident count will also be useful to determine the frequency with which incidents are likely to recur and influence priorities etc.

It should be noted that a business case for a permanent resolution for some problems may not exist. For example, if a problem does not cause serious disruption and a workaround exists and/or the cost of resolving the problem far outweighs the benefits of a permanent resolution, then a decision may be taken to tolerate the problem. However, it will still be desirable to diagnose and implement a workaround as quickly as possible, which is where the KEDB can help.

It is essential that any data put into the database can be quickly and accurately retrieved. The problem manager should be fully trained and familiar with the search methods/algorithms used
by the selected database and should carefully ensure that when new records are added, the relevant search key criteria are correctly included.

Care should be taken to avoid duplication of records (i.e. the same problem described in two or more ways as separate records). To avoid this, the problem manager should be the only person able to enter a new record. Other support groups should be encouraged to propose new records, but these should be vetted by the problem manager before entry to the KEDB. In large organizations where a single KEDB is used (recommended) with problem management staff in multiple locations, a procedure must be agreed to ensure that duplication of KEDB records cannot occur. This may involve designating just one staff member as the central KEDB manager.

The KEDB should be used during the incident and problem diagnosis phases to try to speed up the resolution process – and new records should be added as quickly as possible when a new problem has been identified and diagnosed.

All support staff should be fully trained and conversant with the value that the KEDB can offer and the way it should be used. They should be able readily to retrieve and use data.

The KEDB is part of the CMS and may be part of a larger SKMS illustrated in Figure 4.8. Note that SCMIS stands for supplier and contract management information system. More information on the SKMS can be found in ITIL Service Transition.

4.4.8 Critical success factors and key performance indicators
The following list includes some sample CSFs for problem management. Each organization should identify appropriate CSFs based on its objectives for the process. Each sample CSF is followed by a small number of typical KPIs that support the CSF. These KPIs should not be adopted without careful consideration. Each organization should develop KPIs that are appropriate for its level of maturity, its CSFs and its particular circumstances. Achievement against KPIs should be monitored and used to identify opportunities for improvement, which should be logged in the CSI register for evaluation and possible implementation.
Figure 4.8 Examples of data and information in the service knowledge management system (diagram labels include SKMS and Service portfolio)
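
Several of the sample KPIs in this section reduce to simple counts and ratios over problem records. The following is a minimal sketch of how three of them might be calculated, not a prescribed method: the ProblemRecord structure, its field names and the sample data are hypothetical and are not defined by ITIL, and each organization would substitute the fields held in its own problem management tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple


@dataclass
class ProblemRecord:
    """Hypothetical problem record; field names are illustrative assumptions."""
    problem_id: str
    service: str                      # IT service affected by the problem
    opened: datetime
    target: timedelta                 # target resolution time for this problem
    resolved: Optional[datetime] = None


def backlog_by_service(problems: List[ProblemRecord]) -> Dict[str, int]:
    """KPI: size of the current problem backlog for each IT service."""
    backlog: Dict[str, int] = {}
    for p in problems:
        if p.resolved is None:
            backlog[p.service] = backlog.get(p.service, 0) + 1
    return backlog


def exceeded_target(problems: List[ProblemRecord], now: datetime) -> Tuple[int, float]:
    """KPI: number and percentage of problems that exceeded their target
    resolution time. Open problems are measured up to 'now'."""
    if not problems:
        return 0, 0.0
    late = sum(1 for p in problems if (p.resolved or now) - p.opened > p.target)
    return late, 100.0 * late / len(problems)


def resolved_within_target(problems: List[ProblemRecord]) -> float:
    """KPI: percentage of resolved problems that met their target."""
    resolved = [p for p in problems if p.resolved is not None]
    if not resolved:
        return 0.0
    on_time = sum(1 for p in resolved if p.resolved - p.opened <= p.target)
    return 100.0 * on_time / len(resolved)


# Invented sample data, for illustration only.
NOW = datetime(2024, 2, 20)
SAMPLE = [
    ProblemRecord("P1", "email", datetime(2024, 1, 1), timedelta(days=10),
                  resolved=datetime(2024, 1, 5)),
    ProblemRecord("P2", "email", datetime(2024, 1, 10), timedelta(days=10),
                  resolved=datetime(2024, 2, 1)),
    ProblemRecord("P3", "payroll", datetime(2024, 2, 1), timedelta(days=30)),
    ProblemRecord("P4", "email", datetime(2024, 1, 1), timedelta(days=14)),
]

if __name__ == "__main__":
    print(backlog_by_service(SAMPLE))
    print(exceeded_target(SAMPLE, NOW))
    print(resolved_within_target(SAMPLE))
```

Running the same functions over the records for successive reporting periods gives the comparison with previous periods, and the trend (static, reducing or increasing), that the KPIs call for.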
■ CSF Minimize the impact to the business of incidents that cannot be prevented
● KPI The number of known errors added to the KEDB
● KPI The percentage accuracy of the KEDB (from audits of the database)
● KPI Percentage of incidents closed by the service desk without reference to other levels of support (often referred to as ‘first point of contact’)
● KPI Average incident resolution time for those incidents linked to problem records
■ CSF Maintain quality of IT services through elimination of recurring incidents
● KPI Total numbers of problems (as a control measure)
● KPI Size of current problem backlog for each IT service
● KPI Number of repeat incidents for each IT service
■ CSF Provide overall quality and professionalism of problem handling activities to maintain business confidence in IT capabilities
● KPI The number of major problems (opened and closed and backlog)
● KPI The percentage of major problem reviews successfully performed
● KPI The percentage of major problem reviews completed successfully and on time
● KPI Number and percentage of problems incorrectly assigned
● KPI Number and percentage of problems incorrectly categorized
● KPI The backlog of outstanding problems and the trend (static, reducing or increasing?)
● KPI Number and percentage of problems that exceeded their target resolution times
● KPI Percentage of problems resolved within SLA targets (and the percentage that are not!)
● KPI Average cost per problem.

It is also helpful to break down and categorize problem metrics by category, time frame, impact, urgency, service impacted, location and priority and compare these with previous periods. This can provide input to CSI and other processes seeking to identify issues, problem trends or other situations.

4.4.9 Challenges and risks

4.4.9.1 Challenges
The following challenges will exist for successful problem management:
■ A major dependency for problem management is the establishment of an effective incident management process and tools. This will ensure that problems are identified as soon as possible and that as much work is done on pre-qualification as possible. A critical challenge exists in making sure that the two processes have formal interfaces and common working practices.
■ The skills and capabilities for problem resolution staff to identify the true root cause of incidents is sometimes a challenge. Many times, support staff will describe the root cause based on symptoms or resolution actions taken. The techniques described in section 4.4.4 can be used to help determine the true underlying cause of an incident. Creating a focus around ‘why did this happen?’ or ‘what can be done to prevent the incident from happening again?’ can also be helpful.
■ The ability to relate incidents to problems can be a challenge if the tools used to record incidents are different from those of problems. In some cases, incident tools might exist with no capabilities to track problems separately.
■ The ability to integrate problem management activities with the CMS to determine relationships between CIs and to refer to the history of CIs when performing problem support activities.
■ Ensuring that problem management is able to use all knowledge and service asset and configuration management resources available to investigate and resolve problems.
■ Ensuring that ongoing training of technical staff in both technical aspects of their job as well as the business implications of the services they support and the processes they use is in place.
■ The ability to have a good working relationship between the second- and third-line staff working on problem support activities and first-line staff.
■ Making sure that business impact is well understood by all staff working on problem resolution.