
Service operation processes | 97

■ Badly designed or operated back-end fulfilment processes that are incapable of dealing with the volume or nature of the requests being made
■ Inadequate monitoring capabilities so that accurate metrics cannot be gathered.

4.4 PROBLEM MANAGEMENT

Problem management is the process responsible for managing the lifecycle of all problems. ITIL defines a ‘problem’ as the underlying cause of one or more incidents.

4.4.1 Purpose and objectives

4.4.1.1 Purpose
The purpose of problem management is to manage the lifecycle of all problems from first identification through further investigation, documentation and eventual removal. Problem management seeks to minimize the adverse impact of incidents and problems on the business that are caused by underlying errors within the IT Infrastructure, and to proactively prevent recurrence of incidents related to these errors. In order to achieve this, problem management seeks to get to the root cause of incidents, document and communicate known errors and initiate actions to improve or correct the situation.

4.4.1.2 Objectives
The objectives of the problem management process are to:
■ Prevent problems and resulting incidents from happening
■ Eliminate recurring incidents
■ Minimize the impact of incidents that cannot be prevented.

4.4.2 Scope
Problem management includes the activities required to diagnose the root cause of incidents and to determine the resolution to those problems. It is also responsible for ensuring that the resolution is implemented through the appropriate control procedures, especially change management and release and deployment management.

Problem management will also maintain information about problems and the appropriate workarounds and resolutions, so that the organization is able to reduce the number and impact of incidents over time. In this respect, problem management has a strong interface with knowledge management, and tools such as the KEDB will be used for both.

Although incident and problem management are separate processes, they are closely related and will typically use the same tools, and may use similar categorization, impact and priority coding systems. This will ensure effective communication when dealing with related incidents and problems.

The problem management process has both reactive and proactive aspects:
■ Reactive problem management is concerned with solving problems in response to one or more incidents.
■ Proactive problem management is concerned with identifying and solving problems and known errors before further incidents related to them can occur again.
■ While reactive problem management activities are performed in reaction to specific incident situations, proactive problem management activities take place as ongoing activities targeted to improve the overall availability and end user satisfaction with IT services. Examples of proactive problem management activities might include conducting periodic scheduled reviews of incident records to find patterns and trends in reported symptoms that may indicate the presence of underlying errors in the infrastructure.
■ Conducting major incident reviews where review of ‘How can we prevent the recurrence?’ can provide identification of an underlying cause or error.
■ Conducting periodic scheduled reviews of operational logs and maintenance records identifying patterns and trends of activities that may indicate an underlying problem might exist.
■ Conducting periodic scheduled reviews of event logs targeting patterns and trends of warning and exception events that may indicate the presence of an underlying problem.
■ Conducting brainstorming sessions to identify trends that could indicate the existence of underlying problems.
■ Using check sheets to proactively collect data on service or operational quality issues that may help to detect underlying problems.
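
As an illustration of the periodic review of incident records mentioned above, the following is a minimal sketch that counts repeated symptoms per service and flags those that recur. The Incident fields, the recurring_symptoms function and the threshold of three occurrences are all assumptions made for this example; they are not defined by ITIL, and a real review would draw on much richer incident data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """Hypothetical, minimal incident record; real records carry far more detail."""
    incident_id: str
    service: str
    symptom: str


def recurring_symptoms(incidents, threshold=3):
    """Return ((service, symptom), count) pairs seen at least `threshold` times,
    most frequent first. The threshold is an illustrative assumption."""
    counts = Counter((i.service, i.symptom) for i in incidents)
    return [(pair, n) for pair, n in counts.most_common() if n >= threshold]


incidents = [
    Incident("INC001", "email", "login timeout"),
    Incident("INC002", "email", "login timeout"),
    Incident("INC003", "payroll", "report missing"),
    Incident("INC004", "email", "login timeout"),
]

# Pairs that recur often enough to be candidates for a problem record.
candidates = recurring_symptoms(incidents, threshold=3)
```

A pair that appears in the result is only a candidate for investigation; whether a problem record is raised remains a judgement for the people running the process.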

Reactive and proactive problem management activities are generally conducted within the scope of service operation. A close relationship exists between proactive problem management activities and CSI lifecycle activities that directly support identifying and implementing service improvements. Proactive problem management supports those activities through trending analysis and the targeting of preventive action. Identified problems from these activities will become input to the CSI register used to record and manage improvement opportunities.

Further information on CSI activities can be found in ITIL Continual Service Improvement.

4.4.3 Value to business
The value of problem management includes:
■ Higher availability of IT services by reducing the number and duration of incidents that those services may incur. Problem management works together with incident management and change management to ensure that IT service availability and quality are increased. When incidents are resolved, information about the resolution is recorded. Over time, this information is used to speed up the resolution time and identify permanent solutions, reducing the number and resolution time of incidents.
■ Higher productivity of IT staff by reducing unplanned labour caused by incidents and creating the ability to resolve incidents more quickly through recorded known errors and workarounds.
■ Reduced expenditure on workarounds or fixes that do not work.
■ Reduction in cost of effort in fire-fighting or resolving repeat incidents.

4.4.4 Policies, principles and basic concepts

4.4.4.1 Policies
Examples of problem management policies might include:
■ Problems should be tracked separately from incidents. This will provide clear separation between many problem management activities that are proactive and incident management activities that are mostly reactive. This implies that technology capabilities may need to be in place to track problems separately from incidents.
■ All problems should be stored and managed in a single management system. This provides a definitive recognized source for problem information and supports easier access for reporting and investigation efforts. It implies that problem management records are kept. Supporting technologies should be well integrated throughout the business and interface easily to other service management technologies that use or provide problem-related information.
■ All problems should subscribe to a standard classification schema that is consistent across the business enterprise. This provides for faster access to problem and investigative information. It provides better support for problem management diagnostic and proactive trending activities. It implies that a well defined and communicated set of problem classification categories is in place.

4.4.4.2 Principles and basic concepts
There are some important concepts of problem management that must be taken into account from the outset. These include:

Reactive and proactive problem management activities
Both reactive and proactive problem management activities seek to raise problems, manage them through the problem management process, find the underlying causes of the incidents they are associated with and prevent future recurrences of those incidents. The difference between reactive and proactive problem management lies in how the problem management process is triggered:
■ With reactive problem management, process activities will typically be triggered in reaction to an incident that has taken place. Reactive problem management complements incident management activities by focusing on the underlying cause of an incident to prevent its recurrence and identifying workarounds when necessary.
■ With proactive problem management, process activities are triggered by activities seeking to improve services. One example might be trend analysis activities to find common underlying

causes of historical incidents that took place to prevent their recurrence. Proactive problem management complements CSI activities by helping to identify workarounds and improvement actions that can improve the quality of a service.

By redirecting the efforts of an organization from reacting to large numbers of incidents to preventing incidents, an organization provides a better service to its customers and makes more effective use of the available resources within the IT support organization.

Problem models
Many problems will be unique and will require handling in an individual way – but it is conceivable that some incidents may recur because of dormant or underlying problems (for example, where the cost of a permanent resolution will be high and a decision has been taken not to go ahead with an expensive solution but to ‘live with’ the problem). As well as creating a known error record in the KEDB (see section 4.4.5.7) to ensure quicker diagnosis, the creation of a problem model for handling such problems in the future may be helpful. This is very similar in concept to the idea of incident or request models described in previous chapters, but applied to problems.

Incidents versus problems
An incident is an unplanned interruption to an IT service or reduction in the quality of an IT service. A problem presents a different view of an incident by understanding its underlying cause, which may also be the cause of other incidents. Incidents do not ‘become’ problems. While incident management activities are focused on restoring services to normal state operations, problem management activities are focused on finding ways to prevent incidents from happening in the first place. It is quite common to have incidents that are also problems.

The rules for invoking problem management during an incident can vary and are at the discretion of individual organizations. Some general situations where it may be desired to invoke problem management during an incident might include situations where:
■ Incident management cannot match an incident to existing problems and known errors
■ Trend analysis of logged incidents reveals an underlying problem might exist
■ A major incident has occurred where problem management activities need to be undertaken to identify the root cause
■ Other IT functions identify that a problem condition exists
■ The service desk may have resolved an incident but has not determined a definitive cause and suspects that it is likely to recur
■ Analysis of an incident by a support group which reveals that an underlying problem exists, or is likely to exist
■ A notification from a supplier that a problem exists that has to be resolved.

4.4.4.3 Problem analysis techniques
There are many problem analysis, diagnosis and solving techniques available and much research has been done in this area. Examples of frequently used techniques are given below.

Chronological analysis
When dealing with a difficult problem, there may be conflicting reports about exactly what has happened and when. It is therefore very helpful briefly to document all events in chronological order, to provide a timeline of events. This often makes it possible to see which events may have been triggered by others – or to discount any claims that are not supported by the sequence of events.

Pain value analysis
This is where a broader view is taken of the impact of an incident or problem, or incident/problem type. Instead of just analysing the number of incidents/problems of a particular type in a particular period, a more in-depth analysis is done to determine exactly what level of pain has been caused to the organization/business by these incidents/problems. A formula can be devised to calculate this pain level. Typically this might include taking into account:
■ The number of people affected
■ The duration of the downtime caused
■ The cost to the business (if this can be readily calculated or estimated).
By taking all of these factors into account, a much more detailed picture of those incidents/problems

or incident/problem types that are causing most pain can be determined, to allow a better focus on those things that really matter and deserve the highest priority when determining resolution actions.

Kepner and Tregoe
Charles Kepner and Benjamin Tregoe developed a useful way of problem analysis that can be used formally to investigate deeper-rooted problems. They defined the following stages:
■ Defining the problem
■ Describing the problem in terms of identity, location, time and size
■ Establishing possible causes
■ Testing the most probable cause
■ Verifying the true cause.
The method is described in more detail in Appendix C.

Brainstorming
It can often be valuable to gather together the relevant people, either physically or by electronic means, and to ‘brainstorm’ the problem, with people throwing in ideas on what the potential cause may be and potential actions to resolve the problem. Brainstorming sessions can be very constructive and innovative but it is equally important that someone, perhaps the problem manager, documents the outcome and any agreed actions and keeps a degree of control in the session(s).

5-Whys
This simple yet highly effective approach is helpful as a way to get to the underlying root cause of a problem. It works by starting out with a description of what event took place and then asking ‘why this occurred’. The resulting answer is given, followed by another round of ‘why this occurred’. Usually by the fifth iteration, a true root cause will have been found.

Fault isolation
This approach involves re-executing the transactions or events that led to a problem in a careful stepwise fashion, one CI at a time, until the CI at fault is identified. The re-execution effort moves to the first CI encountered at the start of the transaction or event, which is then checked for correct operation. The effort then moves to the next CI in the chain of events, which in turn is checked, the next CI and then the next until a fault is encountered. If the fault cannot be recreated, a variation of this technique can be tried that involves interrogating the healthy state of the CIs involved with the transaction or event. For example, if one CI is deemed to be at fault, all other CIs in the transaction or event path from source to destination are probed for health.

Affinity mapping
This technique can be used to organize large amounts of data (ideas, opinions, issues) into groupings based on common characteristics. It is typically performed in a brainstorming session with key support staff. Key concepts, such as potential solutions, are written on individual cards and stuck to a wall or whiteboard. Participants and/or the facilitator should then move the cards so that they are grouped by similar traits. A ‘header’ should then be developed for each group for future identification. Each of the cards under the ‘header’ should be examined for potential of a root cause that may underlie all of them.

Hypothesis testing
This method can be used to generate a list of possible root causes based on educated guessing and then determining whether each hypothesis is true or false. Educated guesses may relate to relationships between variables or potential root causes of a problem. Using information gathered from incidents and other operational information, a team is assembled to brainstorm a list of potential causes that may be underlying the incidents being studied. Each cause is then converted into testable statements or hypotheses and assigned to one or more support staff. Further data should then be gathered as needed for each assigned statement and an appropriate analysis performed to accept or reject each hypothesis.

Technical observation post
In some cases problems may be linked to incidents that occur intermittently for unknown reasons or causes. This approach consists of a prearranged gathering of specialist technical support staff from within the IT support organization brought together to focus on a specific problem. Its purpose is to monitor events, real-time as they occur, with the specific aim of catching and identifying the specific situation and possible causes for the problem.
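
The pain value analysis described earlier in this section notes that a formula can be devised from the number of people affected, the duration of downtime and the cost to the business. The sketch below shows one possible form, a simple weighted sum. The ImpactSample type, the pain_value function, the weights and the sample figures are all assumptions chosen for illustration; the text does not prescribe a formula, and each organization would need to agree its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImpactSample:
    """Hypothetical impact figures for one incident/problem type over a review period."""
    name: str
    people_affected: int
    downtime_hours: float
    business_cost: float  # estimated cost, in the organization's own currency


def pain_value(sample, w_people=1.0, w_downtime=10.0, w_cost=0.01):
    """Weighted sum of the three factors named in the text.
    The weights are illustrative assumptions, not recommended values."""
    return (w_people * sample.people_affected
            + w_downtime * sample.downtime_hours
            + w_cost * sample.business_cost)


samples = [
    ImpactSample("printer queue stalls", people_affected=40,
                 downtime_hours=2.0, business_cost=500.0),
    ImpactSample("billing batch failure", people_affected=5,
                 downtime_hours=8.0, business_cost=20000.0),
]

# Highest pain first, to help decide where resolution effort should be focused.
ranked = sorted(samples, key=pain_value, reverse=True)
```

With these assumed weights the billing failure ranks above the printer issue even though it affects fewer people, which is the kind of result a simple incident count would hide.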

Ishikawa diagrams
Kaoru Ishikawa (1915–1989), a leader in Japanese quality control, developed a method of documenting causes and effects that can be useful in helping identify where something may be going wrong, or be improved. Such a diagram is typically the outcome of a brainstorming session where problem solvers can offer suggestions. The main goal is represented by the trunk of the diagram, and primary factors are represented as branches. Secondary factors are then added as stems, and so on. Creating the diagram stimulates discussion and often leads to increased understanding of a complex problem. An example diagram is given in Appendix D.

Pareto analysis
This is a technique for separating the most important potential causes of failures from more trivial issues. Details of this approach are given in Appendix H.

Table 4.2 may be helpful in identifying the kinds of situations that each technique shown above might be used for.

4.4.4.4 Errors detected in the development environment
It is rare for any new applications, systems or software releases to be completely error-free. It is more likely that during testing of such new applications, systems or releases a prioritization system will be used to eradicate the more serious faults, but it is possible that minor faults are not rectified – often because of the balance that has to be maintained between delivering new functionality to the business as quickly as possible and ensuring totally fault-free code or components.

Where a decision is made to release something into the live environment that includes known deficiencies, these should be logged as known errors in the KEDB, together with details of workarounds or resolution activities. There should be a formal step in the testing sign-off that ensures this handover always takes place (see ITIL Service Transition).

Experience has shown that if this does not happen, it will lead to far higher support costs when the users start to experience the faults and

Table 4.2 Problem situations and the most useful techniques for identifying root causes

Problem situation | Suggested analysis techniques
Complex problems where a sequence of events needs to be assembled to determine exactly what happened | Chronological analysis; Technical observation post
Uncertainty over which problems should be addressed first | Pain value analysis; Brainstorming
Uncertain whether a presented root cause is truly the root cause | 5-Whys; Hypothesis testing
Intermittent problems that appear to come and go and cannot be recreated or repeated in a test environment | Technical observation post; Kepner–Tregoe; Hypothesis testing; Brainstorming
Uncertainty over where to start for problems that appear to have multiple causes | Pareto analysis; Kepner–Tregoe; Ishikawa diagrams; Brainstorming
Struggling to identify the exact point of failure for a problem | Fault isolation; Ishikawa diagrams; Kepner–Tregoe; Affinity mapping; Brainstorming
Uncertain where to start when trying to find root cause | 5-Whys; Kepner–Tregoe; Brainstorming; Affinity mapping
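
Pareto analysis, listed in Table 4.2 for problems that appear to have multiple causes, can be sketched as a ranking of causes by frequency followed by a cumulative cut-off. The sketch below assumes the commonly used 80% cut-off and a hypothetical pareto_vital_few function; the method as this publication describes it is in Appendix H, which is not reproduced here, so this should be read as a generic illustration only.

```python
from collections import Counter


def pareto_vital_few(causes, cutoff=0.8):
    """Return the most frequent causes, in order, up to the point where their
    cumulative share of all occurrences first reaches `cutoff`.
    The 0.8 default is a conventional assumption, not a value from the text."""
    counts = Counter(causes)
    total = sum(counts.values())
    selected, cumulative = [], 0
    for cause, n in counts.most_common():
        selected.append((cause, n))
        cumulative += n
        if cumulative / total >= cutoff:
            break
    return selected


# Hypothetical cause labels taken from ten closed incident records.
observed = (["network"] * 6) + (["disk full"] * 2) + ["expired certificate", "operator error"]
vital_few = pareto_vital_few(observed, cutoff=0.8)
```

Here two of the four causes account for 80% of the occurrences, so they would be the natural place to start investigation.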

[Figure: flowchart. Sources of problems: service desk, event management, incident management, proactive problem management, supplier or contractor. Steps: problem detection; problem logging; problem categorization; problem prioritization; problem investigation and diagnosis (supported by the CMS); workaround needed? (if yes, implement workaround, linked to incident management); raise known error record if required (known error database); change needed? (if yes, RFC to change management); problem resolution; resolved? (yes/no); problem closure (service knowledge management system); major problem? (if yes, major problem review, linked to continual service improvement); end.]

Figure 4.7 Problem management process flow



raise incidents that have to be re-diagnosed and resolved all over again!

4.4.5 Process activities, methods and techniques
The problem management process flow for handling a recognized problem is shown in Figure 4.7. This is a simplified chart to show the normal process flow, but in reality some of the states may be iterative or variations may have to be made in order to handle particular situations. For example, proactive problem management activities may raise new problem records which in turn can become input to this process flow.

4.4.5.1 Problem detection
It is likely that multiple ways of detecting problems will exist in all organizations. These can include triggers for reactive and proactive problem management:

Reactive problem management triggers:
■ Suspicion or detection of a cause of one or more incidents by the service desk, resulting in a problem record being raised – the desk may have resolved the incident but has not determined a definitive cause and suspects that it is likely to recur, so will raise a problem record to allow the underlying cause to be resolved. Alternatively, it may be immediately obvious from the outset that an incident, or incidents, has been caused by a major problem, so a problem record will be raised without delay.
■ Analysis of an incident by a technical support group which reveals that an underlying problem exists, or is likely to exist.
■ Automated detection of an infrastructure or application fault, using event/alert tools automatically to raise an incident which may reveal the need for a problem record.
■ A notification from a supplier or contractor that a problem exists that has to be resolved.

Proactive problem management triggers:
■ Analysis of incidents that result in the need to raise a problem record so that the underlying fault can be investigated further.
■ Trending of historical incident records to identify one or more underlying causes that if removed, can prevent their recurrence. In this case, a problem record is raised once the underlying trend or cause is discovered.
■ Activities taken to improve the quality of a service that result in the need to raise a problem record to identify further improvement actions that should be taken.

Frequent and regular analysis of incident and problem data must be performed to identify any trends as they become discernible. This will require meaningful and detailed categorization of incidents/problems and regular reporting of patterns and areas of high occurrence. ‘Top ten’ reporting, with drill-down capabilities to lower levels, is useful in identifying trends.

Further details of how detected trends should be handled are included in ITIL Continual Service Improvement.

4.4.5.2 Problem logging
Regardless of the detection method, all the relevant details of the problem must be recorded so that a full historic record exists. This must be date and time stamped to allow suitable control and escalation.

A cross-reference must be made to the incident(s) which initiated the problem record – and all relevant details must be copied from the incident record(s) to the problem record. It is difficult to be exact, as cases may vary, but typically this will include details such as:
■ User details
■ Service details
■ Equipment details
■ Date/time initially logged
■ Priority and categorization details
■ Incident description
■ Incident record numbers or other cross-reference
■ Details of all diagnostic or attempted recovery actions taken.

4.4.5.3 Problem categorization
Problems should be categorized in the same way as incidents (and it is advisable to use the same coding system) so that the true nature of the problem can be easily traced in the future and meaningful management information can be obtained. This also allows for incidents and problems to be more readily matched.
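
The logging and categorization requirements in sections 4.4.5.2 and 4.4.5.3 can be pictured as a pair of record types. The sketch below is one possible shape, covering only the reactive case where a problem is raised from one or more incidents. The class names, field names and the raise_problem_from_incidents helper are hypothetical, and taking the category and priority from the first incident is a simplifying assumption made for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentRecord:
    """Hypothetical incident record holding the details listed in section 4.4.5.2."""
    number: str
    user: str
    service: str
    equipment: str
    logged_at: datetime
    priority: str
    category: str
    description: str
    actions_taken: list = field(default_factory=list)


@dataclass
class ProblemRecord:
    """Hypothetical problem record, date and time stamped when it is created."""
    category: str
    priority: str
    description: str
    incident_numbers: list   # cross-reference to the initiating incidents
    copied_details: list     # relevant details copied from the incident records
    logged_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def raise_problem_from_incidents(incidents, description):
    """Create a problem record cross-referenced to its initiating incidents.
    Reactive case only: this sketch requires at least one incident."""
    if not incidents:
        raise ValueError("this sketch needs at least one initiating incident")
    first = incidents[0]
    return ProblemRecord(
        category=first.category,   # same coding system as incidents
        priority=first.priority,   # starting point only (assumption)
        description=description,
        incident_numbers=[i.number for i in incidents],
        copied_details=list(incidents),
    )


incident = IncidentRecord(
    "INC100", "j.smith", "email", "mail-server-01",
    datetime(2024, 1, 5, 9, 30, tzinfo=timezone.utc),
    "2", "software/email", "Mailbox unavailable",
    ["restarted mail service"],
)
problem = raise_problem_from_incidents([incident], "Recurring mailbox outages")
```

Keeping the incident numbers on the problem record is what allows incidents and problems to be matched later, as the text recommends.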

4.4.5.4 Problem prioritization
Problems should be prioritized the same way using the same reasons as incidents. The frequency and impact of related incidents must also be taken into account. The coding system described earlier in Table 4.1 (which combines incident impact with urgency to give an overall priority level) can also be used to prioritize problems. Definition and guidance should be provided to support staff on what constitutes a problem, and the related service targets for each priority code level in the table.

Problem prioritization should also take into account the severity of the problems. Severity in this context refers to how serious the problem is from a service or customer perspective as well as an infrastructure perspective, for example:
■ Can the system be recovered, or does it need to be replaced?
■ How much will it cost?
■ How many people, with what skills, will be needed to fix the problem?
■ How long will it take to fix the problem?
■ How extensive is the problem (e.g. how many CIs are affected)?

4.4.5.5 Problem investigation and diagnosis
At this stage, an investigation should be conducted to try to diagnose the root cause of the problem – the speed and nature of this investigation will vary depending upon the impact, severity and urgency of the problem – but the appropriate level of resources and expertise should be applied to finding a resolution commensurate with the priority code allocated and the service target in place for that priority level.

There are a number of useful problem-solving techniques that can be used to help diagnose and resolve problems, and these should be used as appropriate. Such techniques have been described in more detail in section 4.4.4.

The CMS must be used to help determine the level of impact and pinpoint and diagnose the exact point of failure. The KEDB should also be accessed and problem-matching techniques (such as keyword searches) should be used to see if the problem has occurred before and, if so, to find the resolution.

It is often valuable to try to recreate the failure to understand what has gone wrong, and then try various ways of finding the most appropriate and cost-effective resolution to the problem. It may be possible to recreate the problem in a test environment that mirrors the live environment. This allows for investigation and diagnosis activities to proceed effectively without causing further disruption to users.

4.4.5.6 Workarounds
In some cases it may be possible to find a workaround to the incidents caused by the problem – a temporary way of overcoming the difficulties. For example, a manual amendment may be made to an input file to allow a program to complete its run successfully and allow a billing process to complete satisfactorily, but it is important that work on a permanent resolution continues where this is justified – in this example the reason for the file becoming corrupted in the first place must be found and corrected to prevent this happening again.

When a workaround is found, it is therefore important that the problem record remains open and details of the workaround are documented within the problem record.

In some cases there may be multiple workarounds associated with a problem. As problem investigation and diagnosis activities carry on, there may be a series of improvements that do not resolve the problem, but lead to a progressive improvement in the quality of the workarounds available. These may impact on the prioritization of the problem as successive workaround solutions may reduce the impact of future related incidents, either by reducing their likelihood or improving the speed of their resolution.

4.4.5.7 Raising a known error record
A known error is defined as a problem with a documented root cause and workaround. The known error record should identify the problem record it relates to and document the status of actions being taken to resolve the problem, its root cause and workaround. All known error records should be stored in the KEDB. The KEDB and the way it should be used are described in more detail in section 4.4.7.2.

As soon as the diagnosis is complete, and particularly where a workaround has been found (even though it may not yet be a permanent

resolution), a known error record must be raised and placed in the KEDB so that if further incidents or problems arise, they can be identified and the service restored more quickly. In some cases it may be advantageous to raise a known error record even earlier in the overall process, even though the diagnosis may not be complete or a workaround found. This might be used for information purposes or to identify a root cause or workaround that appears to address the problem but hasn’t been fully confirmed. Therefore, it is inadvisable to set a concrete procedural point for exactly when a known error record must be raised. It should be done as soon as it becomes useful to do so!

4.4.5.8 Problem resolution
Once a root cause has been found and a solution to remove it has been developed, it should be applied to resolve the problem. In reality, safeguards may be needed to ensure that the resolution does not cause further difficulties. If any change in functionality is required, an RFC should be raised and authorized before the resolution can be applied. If the problem is very serious and an urgent fix is needed for business reasons, then an emergency RFC should be raised. The resolution should be applied only when the change has been authorized and scheduled for release. In the meantime, the KEDB should be used to help resolve quickly any further occurrences of the incidents/problems that occur.

There may be some problems for which a business case for resolution cannot be justified (e.g. where the impact is limited but the cost of resolution would be extremely high). In such cases a decision may be taken to leave the problem record open but to use a workaround description in the known error record to detect and resolve any recurrences quickly. Care should be taken to use the appropriate code to flag the open problem record so that it does not count against the performance of the team performing the process and so that unauthorized rework does not take place.

In some cases, workarounds may be in place to mitigate the impacts of the problem without a final resolution being found. In this event, the problem should be re-prioritized based on the impact of the workarounds applied and investigative and diagnostic activities should be continued.

4.4.5.9 Problem closure
When a final resolution has been applied, the problem record should be formally closed – as should any related incident records that are still open. A check should be performed at this time to ensure that the record contains a full historical description of all events – and if not, the record should be updated.

The status of any related known error record should be updated to show that the resolution has been applied.

4.4.5.10 Major problem review
After every major problem (as determined by the organization’s priority system), and while memories are still fresh, a review should be conducted to learn any lessons for the future. Specifically, the review should examine:
■ Those things that were done correctly
■ Those things that were done wrong
■ What could be done better in the future
■ How to prevent recurrence
■ Whether there has been any third-party responsibility and whether follow-up actions are needed.

Such reviews can be used as part of training and awareness activities for support staff – and any lessons learned should be documented in appropriate procedures, work instructions, diagnostic scripts or known error records. The problem manager facilitates the session and documents any agreed actions.

Major problem reviews can also be a source of input to proactive problem management through identification of underlying causes that may be discovered in the course of the review (see section 4.4.2 for more details about proactive problem management).

The knowledge gained from the review should be incorporated into a service review meeting with the business customer to ensure the customer is aware of the actions taken and the plans to prevent future major incidents from occurring. This helps to improve customer satisfaction and assure the business that service operation is handling major incidents responsibly and actively working to prevent their future recurrence.

4.4.6 Triggers, inputs, outputs and interfaces

4.4.6.1 Triggers
With reactive problem management, the vast majority of problem records will be triggered in reaction to one or more incidents, and many will be raised or initiated via service desk staff. Other problem records, and corresponding known error records, may be triggered in testing, particularly the latter stages of testing such as user acceptance testing/trials (UAT), if a decision is made to go ahead with a release even though some faults are known. Suppliers may trigger the need for some problem records through the notification of potential faults or known deficiencies in their products or services (e.g. a warning may be given regarding the use of a particular CI and a problem record may be raised to facilitate investigation by technical staff of the condition of such CIs within the organization’s IT infrastructure).

With proactive problem management, problem records may be triggered by identification of patterns and trends in incidents when reviewing historical incident records. A review of other sources such as operation logs, operation communications or event logs may also proactively trigger problem records when the appearance of an underlying issue becomes apparent.

4.4.6.2 Inputs
Examples of inputs to the problem management process may include:
■ Incident records for incidents that have triggered problem management activities
■ Incident reports and histories that will be used to support proactive problem trending
■ Information about CIs and their status
■ Communication and feedback about incidents and their symptoms
■ Communication and feedback about RFCs and releases that have been implemented or planned for implementation
■ Communication of events that were triggered from event management
■ Operational and service level objectives
■ Customer feedback on success of problem resolution activities and overall quality of problem management activities
■ Agreed criteria for prioritizing and escalating problems
■ Output from risk management and risk assessment activities.

4.4.6.3 Outputs
Examples of outputs from the problem management process may include:
■ Resolved problems and actions taken to achieve their resolution
■ Updated problem management records with accurate problem detail and history
■ RFCs to remove infrastructure errors
■ Workarounds for incidents
■ Known error records
■ Problem management reports
■ Output and improvement recommendations from major problem review activity.

4.4.6.4 Interfaces
The primary relationship between incident and problem management has been discussed in detail in section 4.4.4. Examples of other key interfaces are listed below for each service lifecycle stage.

Service strategy
■ Financial management for IT services: Assists in assessing the impact of proposed resolutions or workarounds, as well as pain value analysis. Problem management provides management information about the cost of resolving and preventing problems, which is used as input into the budgeting and accounting systems and total cost of ownership calculations.

Service design
■ Availability management: Is involved with determining how to reduce downtime and increase uptime. As such, it has a close relationship with problem management, especially the proactive areas. Much of the management information available in problem management will be communicated to availability management.
■ Capacity management: Some problems will require investigation by capacity management teams and techniques, e.g. performance issues. Capacity management will also help in assessing proactive measures. Problem management provides management information relative

to the quality of decisions made during the capacity planning process.
■ IT service continuity management: Problem management acts as an entry point into IT service continuity management where a significant problem is not resolved before it starts to have a major impact on the business.
■ Service level management: The occurrence of incidents and problems affects the level of service delivery measured by SLM. Problem management contributes to improvements in service levels, and its management information is used as the basis of some of the SLA review components. SLM also provides parameters within which problem management works, such as impact information and the effect on services of proposed resolutions and proactive measures.

Service transition
■ Change management: Problem management ensures that all resolutions or workarounds that require a change to a CI are submitted through change management through an RFC. Change management will monitor the progress of these changes and keep problem management advised. Problem management is also involved in rectifying the situation caused by failed changes.
■ Service asset and configuration management: Problem management uses the CMS to identify faulty CIs and also to determine the impact of problems and resolutions.
■ Release and deployment management: This process is responsible for deploying problem fixes out into the live environment. It also assists in ensuring that the associated known errors are transferred from the development KEDB into the live known error database. Problem management will help resolve problems caused by faults during the release process.
■ Knowledge management: The SKMS can be used to form the basis for the KEDB and hold or integrate with the problem records.

Continual service improvement
■ The seven-step improvement process: The occurrence of incidents and problems provides a basis for identifying opportunities for service improvement and adding them to the CSI register. Proactive problem management activities may also identify underlying problems and service issues that, if addressed, can contribute to increases in service quality and end user/customer satisfaction.

4.4.7 Information management
Most information used in problem management comes from the following sources:

4.4.7.1 Configuration management system
The CMS will hold details of all of the components of the IT infrastructure, as well as the relationships between these components. It will act as a valuable source for problem diagnosis and for evaluating the impact of problems (e.g. if this disk is down, what data is on that disk; which services use that data; which users use those services?). As it will also hold details of previous activities, it can also be used as a valuable source of historical data to help identify trends or potential weaknesses – a key part of proactive problem management.

4.4.7.2 Known error database
The purpose of a KEDB is to allow storage of previous knowledge of incidents and problems – and how they were overcome – to allow quicker diagnosis and resolution if they recur.

The known error record should hold exact details of the fault and the symptoms that occurred, together with precise details of any workaround or resolution action that can be taken to restore the service and/or resolve the problem. An incident count will also be useful to determine the frequency with which incidents are likely to recur and influence priorities etc.

It should be noted that a business case for a permanent resolution for some problems may not exist. For example, if a problem does not cause serious disruption and a workaround exists and/or the cost of resolving the problem far outweighs the benefits of a permanent resolution, then a decision may be taken to tolerate the problem. However, it will still be desirable to diagnose and implement a workaround as quickly as possible, which is where the KEDB can help.

It is essential that any data put into the database can be quickly and accurately retrieved. The problem manager should be fully trained and familiar with the search methods/algorithms used

by the selected database and should carefully ensure that when new records are added, the relevant search key criteria are correctly included.

Care should be taken to avoid duplication of records (i.e. the same problem described in two or more ways as separate records). To avoid this, the problem manager should be the only person able to enter a new record. Other support groups should be encouraged to propose new records, but these should be vetted by the problem manager before entry to the KEDB. In large organizations where a single KEDB is used (recommended) with problem management staff in multiple locations, a procedure must be agreed to ensure that duplication of KEDB records cannot occur. This may involve designating just one staff member as the central KEDB manager.

The KEDB should be used during the incident and problem diagnosis phases to try to speed up the resolution process – and new records should be added as quickly as possible when a new problem has been identified and diagnosed.

All support staff should be fully trained and conversant with the value that the KEDB can offer and the way it should be used. They should be able readily to retrieve and use data.

The KEDB is part of the CMS and may be part of a larger SKMS illustrated in Figure 4.8. Note that SCMIS stands for supplier and contract management information system. More information on the SKMS can be found in ITIL Service Transition.

4.4.8 Critical success factors and key performance indicators
The following list includes some sample CSFs for problem management. Each organization should identify appropriate CSFs based on its objectives for the process. Each sample CSF is followed by a small number of typical KPIs that support the CSF. These KPIs should not be adopted without careful consideration. Each organization should develop KPIs that are appropriate for its level of maturity, its CSFs and its particular circumstances. Achievement against KPIs should be monitored and used to identify opportunities for improvement, which should be logged in the CSI register for evaluation and possible implementation.

[Figure: diagram of the SKMS. Labels recoverable from the diagram: service portfolio (pipeline, catalogue, retired); CMS; DML; CMDB; incidents; service requests; problems; known errors; changes; releases; events; service strategy; service models; service design packages; financial data; demand data; business cases; SLAs; release plans; deployment plans; test plans and reports; ITSCM plans; technical documentation; standard operating procedures; policies and plans; AMIS; CMIS; SCMIS; SMIS; management reports; service reports; CSI register; improvement plans.]

Figure 4.8 Examples of data and information in the service knowledge management system
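
Figure 4.8 lists known errors among the records held in the SKMS. As a rough illustration of the KEDB guidance in section 4.4.7.2, the following Python sketch models a known error record (fault, symptoms, workaround, incident count), derives search keys when a record is created, and only lets the problem manager enter a record once it has been checked against possible duplicates. The class names, the role string "problem_manager", the keyword scheme and the 0.6 similarity threshold are all assumptions made for illustration; they are not part of ITIL or of any particular tool.

```python
from dataclasses import dataclass, field


def search_keys(text):
    """Reduce free text to a set of lowercase keywords (illustrative scheme only)."""
    words = (w.strip(".,;:()?!'\"").lower() for w in text.split())
    return frozenset(w for w in words if len(w) > 2)


@dataclass
class KnownErrorRecord:
    fault: str               # exact details of the fault
    symptoms: str            # symptoms that occurred
    workaround: str          # workaround or resolution action
    incident_count: int = 0  # number of linked incidents, used to judge recurrence
    keys: frozenset = field(init=False)

    def __post_init__(self):
        # Search keys are derived once, when the record is created.
        self.keys = search_keys(self.fault + " " + self.symptoms)


class KnownErrorDatabase:
    def __init__(self):
        self.records = []    # vetted records
        self.proposals = []  # records proposed by other support groups

    def propose(self, record):
        """Any support group may propose a record; it is not yet searchable."""
        self.proposals.append(record)

    def find_similar(self, record, threshold=0.6):
        """Return an existing record whose keys overlap heavily (Jaccard index)."""
        for existing in self.records:
            union = existing.keys | record.keys
            if union and len(existing.keys & record.keys) / len(union) >= threshold:
                return existing
        return None

    def approve(self, record, approver_role):
        """Only the problem manager enters records, after a duplicate check."""
        if approver_role != "problem_manager":
            raise PermissionError("only the problem manager may enter new records")
        duplicate = self.find_similar(record)
        if duplicate is not None:
            raise ValueError(f"possible duplicate of: {duplicate.fault}")
        if record in self.proposals:
            self.proposals.remove(record)
        self.records.append(record)

    def search(self, query):
        """Return matching records, best keyword overlap and highest count first."""
        wanted = search_keys(query)
        hits = [r for r in self.records if r.keys & wanted]
        return sorted(hits, key=lambda r: (len(r.keys & wanted), r.incident_count),
                      reverse=True)
```

In this sketch a service desk analyst would call `search()` with the symptoms reported by the user, while `propose()` and `approve()` separate suggesting a record from entering it. A real KEDB would normally rely on the search facilities of the chosen tool rather than a simple keyword overlap.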

■ CSF Minimize the impact to the business of incidents that cannot be prevented
  ● KPI The number of known errors added to the KEDB
  ● KPI The percentage accuracy of the KEDB (from audits of the database)
  ● KPI Percentage of incidents closed by the service desk without reference to other levels of support (often referred to as ‘first point of contact’)
  ● KPI Average incident resolution time for those incidents linked to problem records
■ CSF Maintain quality of IT services through elimination of recurring incidents
  ● KPI Total numbers of problems (as a control measure)
  ● KPI Size of current problem backlog for each IT service
  ● KPI Number of repeat incidents for each IT service
■ CSF Provide overall quality and professionalism of problem handling activities to maintain business confidence in IT capabilities
  ● KPI The number of major problems (opened and closed and backlog)
  ● KPI The percentage of major problem reviews successfully performed
  ● KPI The percentage of major problem reviews completed successfully and on time
  ● KPI Number and percentage of problems incorrectly assigned
  ● KPI Number and percentage of problems incorrectly categorized
  ● KPI The backlog of outstanding problems and the trend (static, reducing or increasing?)
  ● KPI Number and percentage of problems that exceeded their target resolution times
  ● KPI Percentage of problems resolved within SLA targets (and the percentage that are not!)
  ● KPI Average cost per problem.

It is also helpful to break down and categorize problem metrics by category, time frame, impact, urgency, service impacted, location and priority and compare these with previous periods. This can provide input to CSI and other processes seeking to identify issues, problem trends or other situations.

4.4.9 Challenges and risks

4.4.9.1 Challenges
The following challenges will exist for successful problem management:
■ A major dependency for problem management is the establishment of an effective incident management process and tools. This will ensure that problems are identified as soon as possible and that as much work is done on pre-qualification as possible. A critical challenge exists in making sure that the two processes have formal interfaces and common working practices.
■ The skills and capabilities for problem resolution staff to identify the true root cause of incidents is sometimes a challenge. Many times, support staff will describe the root cause based on symptoms or resolution actions taken. The techniques described in section 4.4.4 can be used to help determine the true underlying cause of an incident. Creating a focus around ‘why did this happen?’ or ‘what can be done to prevent the incident from happening again?’ can also be helpful.
■ The ability to relate incidents to problems can be a challenge if the tools used to record incidents are different from those of problems. In some cases, incident tools might exist with no capabilities to track problems separately.
■ The ability to integrate problem management activities with the CMS to determine relationships between CIs and to refer to the history of CIs when performing problem support activities.
■ Ensuring that problem management is able to use all knowledge and service asset and configuration management resources available to investigate and resolve problems.
■ Ensuring that ongoing training of technical staff in both technical aspects of their job as well as the business implications of the services they support and the processes they use is in place.
■ The ability to have a good working relationship between the second- and third-line staff working on problem support activities and first-line staff.
■ Making sure that business impact is well understood by all staff working on problem resolution.
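
Integrating problem management with the CMS to determine relationships between CIs is listed among the challenges in section 4.4.9.1. The kind of impact question posed in section 4.4.7.1 (if this disk is down, which data, services and users are affected?) amounts to walking the CI relationships outward from the failed component. The Python sketch below shows one possible way to do that with a breadth-first traversal. The CI names, the `DEPENDENTS` and `CI_TYPES` mappings and both function names are hypothetical; a real CMS would expose its relationships through its own data model or query interface.

```python
from collections import deque

# Hypothetical CMS extract: each CI maps to the CIs that directly depend on it.
DEPENDENTS = {
    "disk-07": ["db-orders"],
    "db-orders": ["svc-ordering", "svc-reporting"],
    "svc-ordering": ["user-alice", "user-bob"],
    "svc-reporting": ["user-bob"],
}

# Hypothetical CI classification, used only to group the result.
CI_TYPES = {
    "disk-07": "hardware",
    "db-orders": "data",
    "svc-ordering": "service",
    "svc-reporting": "service",
    "user-alice": "user",
    "user-bob": "user",
}


def impacted_cis(failed_ci, dependents):
    """Return every CI that depends, directly or indirectly, on failed_ci."""
    seen = {failed_ci}          # guards against cycles and repeated visits
    queue = deque([failed_ci])
    impacted = []
    while queue:
        ci = queue.popleft()
        for dependent in dependents.get(ci, []):
            if dependent not in seen:
                seen.add(dependent)
                impacted.append(dependent)
                queue.append(dependent)
    return impacted


def group_by_type(cis, ci_types):
    """Group impacted CIs by type so data, services and users can be reported."""
    grouped = {}
    for ci in cis:
        grouped.setdefault(ci_types.get(ci, "unknown"), []).append(ci)
    return grouped


if __name__ == "__main__":
    for ci_type, names in group_by_type(impacted_cis("disk-07", DEPENDENTS), CI_TYPES).items():
        print(ci_type, names)
```

The traversal only follows the relationships it is given, so the result is as accurate as the CMS data behind it, which is why the accuracy of configuration records matters for problem diagnosis.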

4.4.9.2 Risks
The risks to successful problem management are actually similar to some of the challenges and the reverse of some of the CSFs mentioned above. They include:
■ Being inundated with problems that cannot be handled within acceptable timescales due to a lack of available or properly trained resources
■ Problems being bogged down and not progressed as intended because of inadequate support tools for investigation
■ Lack of adequate and/or timely information sources because of inadequate tools or lack of integration
■ Problem support staff that may not be properly trained to investigate problems, find their underlying causes or identify appropriate actions to remove errors
■ Mismatches in objectives or actions because of poorly aligned or non-existent OLAs and/or UCs.

4.5 ACCESS MANAGEMENT

Access management is the process of granting authorized users the right to use a service, while preventing access to non-authorized users. It has also been referred to as rights management or identity management in different organizations.

4.5.1 Purpose and objectives

4.5.1.1 Purpose
The purpose of access management is to provide the right for users to be able to use a service or group of services. It is therefore the execution of policies and actions defined in information security management.

4.5.1.2 Objectives
The objectives of the access management process are to:
■ Manage access to services based on policies and actions defined in information security management (see ITIL Service Design)
■ Efficiently respond to requests for granting access to services, changing access rights or restricting access, ensuring that the rights being provided or changed are properly granted
■ Oversee access to services and ensure rights being provided are not improperly used.

4.5.2 Scope
Access management is effectively the execution of the policies in information security management, in that it enables the organization to manage the confidentiality, availability and integrity of the organization’s data and intellectual property.

Access management ensures that users are given the right to use a service, but it does not ensure that this access is available at all agreed times – this is provided by availability management.

Access management is a process that is executed by all technical and application management functions and is usually not a separate function. However, there is likely to be a single control point of coordination, usually in IT operations management or on the service desk.

Access management can be initiated by a service request.

4.5.3 Value to business
The value of access management includes:
■ Ensuring that controlled access to services will allow the organization to maintain effective confidentiality of its information
■ Ensuring that employees have the right level of access to execute their jobs effectively
■ Reducing errors made in data entry or in the use of a critical service by an unskilled user (e.g. production control systems)
■ Providing capabilities to audit use of services and to trace the abuse of services
■ Providing capabilities to revoke access rights when needed on a timely basis – an important security consideration
■ Providing and demonstrating compliance with regulatory requirements (e.g. SOX, HIPAA and COBIT).

4.5.4 Policies, principles and basic concepts

4.5.4.1 Policies
Examples of access management policies might include:
■ Access management administration and associated activities should be guided and directed by the policies and controls as defined in the information security policy (see ITIL Service Design).
