Incident Management (ITIL Standard)
Description/Summary
Incident Management provides the processes, tools and concepts for the fast recovery of service
quality in a defined service. It deals with service issues and with all other service and user
requests recorded by a service desk. It also monitors the completion of requests by the service
desk or by any other service unit. Finally, Incident Management has the task of informing the
service requester of the status of a service request.
Objectives
The purpose of Incident Management is to recover normal (i.e. agreed) service operation,
as quickly as possible, after an incident has been detected/recorded.
Roles
Process Owner: Initiator of the process, responsible for defining its strategic goals and
allocating all required resources
Process Manager (Incident Manager): Manager of the entire process, responsible for its
effectiveness and efficiency
First Level Support (in general equivalent to the Service Desk function): A person working in
the Service Desk function; acts as a Ticket Agent and Ticket Owner
Expert / Service Specialist
o Second Level Support member of staff: A person working in an internal IT
department and providing expertise in one or more specific areas; acts as a
ticket agent (see below)
o Third/n-th Level Support member of staff: A person working in an internal IT
department, or for an external supplier, and providing specialist knowledge in one or
more specific areas; acts as a ticket agent (see below)
Information Artifacts
This section outlines the information/data required or recorded by the process. In general, a process
record (here: the incident record) contains all information needed to execute the process and also
represents its current progress. Additional information items (artifacts) can typically be derived
from the information in one or more (up to all) process records by filtering, merging, correlating
or interpreting that information; sometimes this is also done in combination with information and
data from other sources.
The Incident Record holds all management-relevant information for a specific incident. On
creation, it is filled with the information provided by the user or the system/tool reporting
the new incident.
Unique Identifier
Incident Owner
Incident Agent
Caller
User
Customer
Status of the Incident (Record)
Description of the Incident Symptoms
Service Level Agreement (SLA-defined urgency)
Current Assessed Impact of the Incident
Priority (according to urgency and impact)
Services Affected by the Incident
Related Configuration Item
General check list result
Specific check list result
Investigation results
Problem(s) or Error(s) Related to the Incident
Applicable Workaround(s) (link to workaround database)
Request(s) for change(s) triggered
Resolution/Recovery description
Resolution date and time
Testing result
Result of Service Recovery Confirmation Request
Closure date and time
Additional Remarks
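The field list above maps naturally onto a record type. The following is a minimal sketch, assuming a Python dataclass representation; the class name `IncidentRecord`, the attribute names and their types are illustrative choices, not a schema defined by ITIL or by any particular ticket tool. The default status "new" is taken from the Incident Controls table.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class IncidentRecord:
    """Illustrative incident record; all attribute names are hypothetical."""
    identifier: str                      # Unique Identifier
    caller: str                          # person who reported the incident
    user: str                            # person affected by the incident
    customer: str
    symptoms: str                        # Description of the Incident Symptoms
    owner: Optional[str] = None          # Incident Owner
    agent: Optional[str] = None          # Incident Agent
    status: str = "new"                  # Status of the Incident (Record)
    urgency: Optional[int] = None        # SLA-defined urgency
    impact: Optional[int] = None         # current assessed impact
    priority: Optional[int] = None       # derived from urgency and impact
    services_affected: list[str] = field(default_factory=list)
    related_ci: Optional[str] = None     # Related Configuration Item
    general_checklist_result: Optional[str] = None
    specific_checklist_result: Optional[str] = None
    investigation_results: list[str] = field(default_factory=list)
    related_problems: list[str] = field(default_factory=list)
    workarounds: list[str] = field(default_factory=list)  # links into a workaround database
    rfcs_triggered: list[str] = field(default_factory=list)
    resolution_description: Optional[str] = None
    resolved_at: Optional[datetime] = None
    testing_result: Optional[str] = None
    recovery_confirmed: Optional[bool] = None  # Service Recovery Confirmation Request result
    closed_at: Optional[datetime] = None
    remarks: list[str] = field(default_factory=list)
```

Keeping caller and user as separate fields mirrors the Information Duty, which requires informing both when they are different persons; `field(default_factory=list)` gives every record its own list instead of a shared one.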
Key Concepts
Major Incident Handling
An incident of priority "1" is a Major Incident. In order to accelerate process execution in such a
case, the following special procedure for handling Major Incidents is proposed: the Service Owner
takes over ownership of the ticket from the Service Desk and is then accountable for the incident.
See Ticket, Ticket Owner and Ticket Agent for detailed information.
Priority
Priority is a control parameter in the process and supports the Incident Management staff in the
efficient management of resources (staff, capacity, time, etc.). Services with higher service levels
have a higher priority ("1" is the highest priority). Priority is defined by combining impact
and urgency:
Impact: describes the extent of an incident and is defined by the following
factors:
o Number of affected users
o Percentage of affected users
Urgency: distinguishes the priority of incidents whose impact is the
same.
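The combination of impact and urgency can be sketched as a small lookup. This sketch assumes a three-level scale for both (1 = high, 3 = low), an additive rule `impact + urgency - 1` that yields priorities 1 to 5, and hypothetical thresholds for deriving impact from the number and percentage of affected users; none of these values is prescribed here, and an organization would take them from its own service descriptions. The function names are illustrative.

```python
def assess_impact(affected_users: int, total_users: int) -> int:
    """Map number/percentage of affected users to an impact level (1 = high).

    The thresholds below are hypothetical placeholders.
    """
    if total_users <= 0:
        raise ValueError("total_users must be positive")
    share = affected_users / total_users
    if share >= 0.5 or affected_users >= 500:
        return 1
    if share >= 0.1 or affected_users >= 50:
        return 2
    return 3


def priority(impact: int, urgency: int) -> int:
    """Combine impact and urgency (each 1..3, 1 = highest) into a priority 1..5."""
    for name, value in (("impact", impact), ("urgency", urgency)):
        if value not in (1, 2, 3):
            raise ValueError(f"{name} must be 1, 2 or 3, got {value}")
    return impact + urgency - 1


def is_major_incident(prio: int) -> bool:
    """Priority "1" marks a Major Incident (see Major Incident Handling)."""
    return prio == 1
```

With impact held fixed, a more urgent incident always receives a higher priority, matching the role of urgency described above. An explicit two-dimensional table would work just as well if an organization's matrix is not symmetric.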
Information Duty
First level support / line support has the duty to inform the user about the status of an incident,
especially when the incident cannot be fixed immediately. The user should be kept informed of the
status of an incident between the statuses "work on incident started" and "work on incident
stopped". If the user and the caller are different persons, both should be informed of the status.
Regular status information helps to avoid escalations, because users who are informed know that
their request is being taken care of. Status information can be triggered by events and by time schedules.
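The two trigger types and the recipient rule can be sketched as follows. The sketch assumes that any status change counts as an event trigger and that the time trigger fires after a fixed interval without a notification while work on the incident is in progress; the four-hour default and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def recipients(user: str, caller: str) -> list[str]:
    """Inform the user and, if a different person, the caller as well."""
    return [user] if user == caller else [user, caller]


def notification_due(
    status_changed: bool,
    last_notified: datetime,
    now: datetime,
    work_in_progress: bool,
    interval: timedelta = timedelta(hours=4),  # hypothetical schedule
) -> bool:
    """Return True if a status notification should be sent.

    Event trigger: any status change. Time trigger: no notification for
    `interval` while work on the incident is in progress.
    """
    if status_changed:
        return True
    return work_in_progress and now - last_notified >= interval
```

Restricting the time trigger to `work_in_progress` reflects the window between "work on incident started" and "work on incident stopped".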
Hierarchical Escalation
Incidents can be escalated hierarchically when the process cannot be completed in the regular way.
This case occurs when:
Incident Controls
Status            Description
new               A new incident is identified or reported.
registered        A new incident is registered once all the necessary information has been stored in the Incident Record.
service selected  The affected service is identified.
classified        All important control factors have been determined, e.g. the priority of the affected service and the applicable Service Level Agreement. The type of procedure is also defined.
forwarded         The record has been forwarded.
assigned          An incident that could not be resolved by the 1st Level has to be forwarded/functionally escalated to the expert team.
solution found    A solution for the incident has been found.
recovered         The incident is resolved (with or without Change Management).
failed            A formerly recovered incident has not passed the required tests, or the user has rejected the solution.
solved            The incident has passed the required tests.
confirmed         The user has confirmed the solution of the incident and that the request has been handled successfully.
aborted           The incident did not complete the full lifecycle.
end               The final status of a completed incident.
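The status table can be read as a state machine. The transition graph below is an assumption inferred from the process activities described in this document (for example, that an "assigned" incident may be returned for reclassification and that a "failed" incident goes back to classification); the table itself only defines the statuses, not the allowed moves. `TRANSITIONS`, `advance` and `InvalidTransition` are hypothetical names.

```python
# Allowed moves between Incident Controls statuses (assumed, not prescribed).
TRANSITIONS: dict[str, set[str]] = {
    "new": {"registered", "aborted"},
    "registered": {"service selected", "aborted"},
    "service selected": {"classified", "aborted"},
    "classified": {"solution found", "forwarded", "aborted"},
    "forwarded": {"assigned", "aborted"},
    "assigned": {"solution found", "classified", "aborted"},  # may be returned for reclassification
    "solution found": {"recovered", "aborted"},
    "recovered": {"solved", "failed"},
    "solved": {"confirmed", "failed"},      # user may still reject the solution
    "failed": {"classified", "aborted"},    # back to initial support
    "confirmed": {"end"},
    "aborted": {"end"},
    "end": set(),
}


class InvalidTransition(Exception):
    """Raised when a status change is not allowed by TRANSITIONS."""


def advance(current: str, target: str) -> str:
    """Return the new status, or raise if the move is not allowed."""
    if target not in TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"{current!r} -> {target!r} is not allowed")
    return target
```

Keeping the graph as data rather than nested conditionals makes it easy to audit against the table and to reject out-of-order updates, such as confirming an incident that was never tested.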
Process
High Level Process Flow Chart
Critical Success Factors
These can be monitored using measures broken down per priority, per service, per user, per
customer, per location, etc.
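Since the individual critical success factors are not listed here, the following sketch uses mean resolution time purely as a stand-in measure to show how any measure can be broken down per dimension. Incidents are assumed to be plain dictionaries with hypothetical keys `opened_at` and `resolved_at` plus the dimension keys (e.g. `priority`, `service`).

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def resolution_hours_by(incidents: list[dict], dimension: str) -> dict[str, tuple[int, float]]:
    """Group resolved incidents by `dimension` (e.g. "priority", "service").

    Returns {value: (count, mean resolution time in hours)}. The keys
    "opened_at" and "resolved_at" are hypothetical record fields.
    """
    groups: dict[str, list[float]] = defaultdict(list)
    for inc in incidents:
        if inc.get("resolved_at") is None:
            continue  # only resolved incidents have a resolution time
        hours = (inc["resolved_at"] - inc["opened_at"]).total_seconds() / 3600
        groups[str(inc[dimension])].append(hours)
    return {key: (len(vals), mean(vals)) for key, vals in groups.items()}
```

Swapping the measure (e.g. share of incidents resolved within the SLA) only changes the value appended per incident; the grouping stays the same.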
Process Trigger
Event Triggers
Any reported incident triggers the incident management process. Incidents may be reported
using one of the following methods:
o User calls, mails, web form
o Event management monitoring tools
o Technical staff calls or mails
Time Triggers
For each reported incident, a new Incident Record with a unique identifier is created.
A change of the Incident Owner/Agent is only allowed if the new Incident Owner/Agent
agrees.
Preferably, a person rather than a group should be the Incident Owner/Agent.
Inform the user/caller of the status of ticket handling after each activity.
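The first three rules can be sketched in a few lines. This assumes identifiers built from `uuid.uuid4()` (a real trouble ticket system would more likely use its own sequence), treats "person rather than group" as a soft preference that only emits a warning, and uses the hypothetical names `Ticket`, `create_ticket`, `reassign` and `ReassignmentRefused`.

```python
import uuid
import warnings
from dataclasses import dataclass


class ReassignmentRefused(Exception):
    """Raised when the proposed new owner/agent has not agreed to take over."""


@dataclass
class Ticket:
    identifier: str
    owner: str
    owner_is_group: bool = False


def create_ticket(owner: str, owner_is_group: bool = False) -> Ticket:
    """Create a new record with a (practically) unique identifier."""
    if owner_is_group:
        warnings.warn("a named person is preferred over a group as Incident Owner/Agent")
    return Ticket(identifier=f"INC-{uuid.uuid4().hex}", owner=owner, owner_is_group=owner_is_group)


def reassign(ticket: Ticket, new_owner: str, accepted: bool,
             new_owner_is_group: bool = False) -> Ticket:
    """Change the owner/agent only if the new owner/agent has agreed."""
    if not accepted:
        raise ReassignmentRefused(f"{new_owner} has not agreed to take over {ticket.identifier}")
    if new_owner_is_group:
        warnings.warn("a named person is preferred over a group as Incident Owner/Agent")
    ticket.owner = new_owner
    ticket.owner_is_group = new_owner_is_group
    return ticket
```

A refused reassignment leaves the ticket untouched, so ownership never silently lapses.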
Process Activities
Incident Recording
All incoming incidents are logged; a ticket is created in the trouble ticket system for each
incident. In order to facilitate incident ownership, tracking and escalation, a ticket owner is
defined according to the roles indicated above.
Incident Classification
This classification aims at recording the class/type of the incident and provides essential
information for subsequent prioritization. This activity is crucial for the success of the subsequent
subprocesses.
First, the affected service and the applicable SLAs are classified (set). Then the priority
can be defined based on the SLAs for that service. Other control factors have to be set as
well.
If one of those questions can be answered with "YES", this may decrease the handling time
needed to solve the incident. If none of the questions above can be answered, a service-specific
checklist needs to be used. This service-specific checklist is provided in the service description.
Depending on the quality of the investigation, fewer incidents will need to be escalated to the
Service Expert Support Level.
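The fallback from the general checklist to the service-specific checklist can be sketched as below. The actual checklist questions are not given here, so every question, the incident keys they inspect, and the names `GENERAL_CHECKLIST`, `SERVICE_CHECKLISTS` and `run_checklists` are hypothetical placeholders.

```python
from typing import Callable

Checklist = list[tuple[str, Callable[[dict], bool]]]

# Hypothetical general checklist; real questions are defined by the organization.
GENERAL_CHECKLIST: Checklist = [
    ("Is there a known workaround?", lambda inc: bool(inc.get("known_workaround"))),
    ("Is a similar incident already open?", lambda inc: bool(inc.get("similar_open"))),
]

# Hypothetical service-specific checklists, as provided in each service description.
SERVICE_CHECKLISTS: dict[str, Checklist] = {
    "mail": [("Is the mailbox over quota?", lambda inc: bool(inc.get("mailbox_full")))],
}


def run_checklists(incident: dict) -> tuple[str, list[str]]:
    """Return which checklist produced a YES and the questions answered YES."""
    hits = [q for q, check in GENERAL_CHECKLIST if check(incident)]
    if hits:
        return "general", hits
    # No YES from the general checklist: fall back to the service-specific one.
    specific = SERVICE_CHECKLISTS.get(incident.get("service", ""), [])
    hits = [q for q, check in specific if check(incident)]
    return ("specific" if hits else "none"), hits
```

A result of "none" is the case in which escalation to the Service Expert Support Level becomes likely.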
Following the result of the Incident Initial Support, an incident should be assigned to the owner
of similar incidents. This avoids two different owners handling similar incidents.
Incidents occurring simultaneously will in most cases have the same cause and can therefore be
consolidated. Good documentation of the solutions of solved incidents can help to decrease the
handling time of new incidents. This correlation of incidents to problems is only possible if a
Problem Management Process exists. Known workarounds for incidents can also reduce the
handling time of new incidents.
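Assigning a new incident to the owner of similar incidents can be sketched as a simple correlation. This assumes "similar" means "same service, reported within a short time window" (the 30-minute default is hypothetical), that incidents are dictionaries with the hypothetical keys `service`, `reported_at` and `owner`, and that the closest match in time wins.

```python
from datetime import datetime, timedelta
from typing import Optional


def find_similar_owner(
    new: dict,
    open_incidents: list[dict],
    window: timedelta = timedelta(minutes=30),  # hypothetical correlation window
) -> Optional[str]:
    """Return the owner of an open incident on the same service reported
    within `window` of the new one, or None if nothing matches."""
    candidates = [
        inc for inc in open_incidents
        if inc["service"] == new["service"]
        and abs(inc["reported_at"] - new["reported_at"]) <= window
    ]
    if not candidates:
        return None
    # Prefer the incident closest in time to the new one.
    closest = min(candidates, key=lambda inc: abs(inc["reported_at"] - new["reported_at"]))
    return closest["owner"]
```

Routing simultaneous incidents to one owner is also what makes later consolidation practical.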
Incident Investigation
Incident Investigation is the most complex activity of the Incident Management Process. The
incident is investigated using the available information on the incident symptoms, with the aim of
achieving a quick resolution of the incident and a restoration of the disrupted service(s). This
information is provided by the general and specific checklists. Additional information
can include affected services, CIs, users, related incidents, errors, information from the CMDB, and
technical expert knowledge.
The first action in this activity is to check whether the classification, especially the determination
of the Service, has been completed correctly. If not, the incident is returned to the Service Desk
for reclassification.
If the classification was correct, the results of the general and service-specific checklists are
reviewed. If the initial support was incorrect or incomplete, the incident is returned to the
Service Desk for completion.
If both checks are passed successfully, 2nd level support (or 3rd level support) starts its
investigation. If a solution is found, they execute the recovery activity and can then pass the
Incident Ticket back to the Service Desk.
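The two entry checks and the hand-off can be summarized as a decision function. This is a sketch of the order of decisions described above; the `NextStep` enum and `investigation_step` are hypothetical names, and the boolean inputs stand for judgments made by the investigating staff.

```python
from enum import Enum


class NextStep(Enum):
    RECLASSIFY = "return to Service Desk for reclassification"
    COMPLETE_INITIAL_SUPPORT = "return to Service Desk to complete initial support"
    INVESTIGATE = "2nd/3rd level investigation"
    RECOVER = "execute recovery, then pass ticket back to Service Desk"


def investigation_step(classification_ok: bool, checklists_ok: bool,
                       solution_found: bool) -> NextStep:
    """Apply the entry checks of Incident Investigation in order."""
    if not classification_ok:   # especially the determination of the service
        return NextStep.RECLASSIFY
    if not checklists_ok:       # general and service-specific checklist results
        return NextStep.COMPLETE_INITIAL_SUPPORT
    if solution_found:
        return NextStep.RECOVER
    return NextStep.INVESTIGATE
```

The classification check comes first because a wrongly selected service invalidates the service-specific checklist results as well.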
Incident Recovery
Once a solution has been found, different recovery options are possible. If a change is
needed, the Change Management Process has to be triggered. Otherwise, the Incident Agent
can either perform the recovery or guide the user/caller to do so. The individual steps must
be documented.
Otherwise:
the Incident Agent has to perform the tasks to recover the service,
or
the Incident Agent has to guide the user/caller to perform the tasks;
o document the recovery
o go to the control activity "recovered".
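The recovery branch can be sketched as follows. This assumes the status stays at "solution found" while Change Management handles the change (the document only says the incident is later "recovered" with or without Change Management), and it models documentation as appending to a list; `recover` and its parameters are hypothetical.

```python
from typing import Callable


def recover(
    change_needed: bool,
    steps: list[str],
    agent_performs: bool,
    trigger_change: Callable[[], None],
    log: list[str],
) -> str:
    """Run the recovery branch and return the resulting status."""
    if change_needed:
        trigger_change()  # hand over to the Change Management Process
        return "solution found"  # assumed: unchanged until the change completes
    actor = "Incident Agent" if agent_performs else "user/caller (guided by Incident Agent)"
    for step in steps:
        log.append(f"{actor}: {step}")  # document each individual recovery step
    return "recovered"
```

Recording who performed each step also supports the later testing activity, where the tester should, if possible, be someone else.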
Incident Testing
If an incident is recovered, the functionality of the live system needs to be tested. The following
actions need to be performed during this testing:
The test result must be documented in the Incident Record. If Change Management was involved
in the recovery and a test was already performed, this testing can be skipped. If possible,
the tester should be a different person from the one who performed the recovery.
Incident Closure
When an incident is reported as successfully tested, the user has to be informed. An answer is
required from the user (caller/customer) to confirm or deny that there has been a successful
recovery to normal service operation. If an alert (event) triggered the incident, no
confirmation is needed.
The solution must then be documented; this is necessary in order to build a Knowledge
Database. Then the final task, closing the Incident Ticket, must be performed. If the user
(caller/customer) refuses to confirm that the recovery was successful, the ticket has to be sent
back to Initial Support.
If the user confirms that testing was successful, set the Incident Agent to the user.
If the user refuses to confirm, go to the control activity "closed – failed".
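The closure decision can be sketched as below. This assumes the knowledge database is represented by a plain list, that an unanswered confirmation request leaves the incident at "solved", and that a refusal maps to the "failed" status from the Incident Controls table; `close_incident` is a hypothetical name.

```python
from typing import Optional


def close_incident(event_triggered: bool, user_confirmed: Optional[bool],
                   solution: str, knowledge_db: list[str]) -> str:
    """Return the next status after the closure activity."""
    if not event_triggered:  # event-triggered incidents need no confirmation
        if user_confirmed is None:
            return "solved"  # still waiting for the user's answer
        if not user_confirmed:
            return "failed"  # ticket goes back to Initial Support
    knowledge_db.append(solution)  # document the solution for the knowledge database
    return "end"
```

The solution is only added to the knowledge database once the recovery is confirmed (or no confirmation is required), so rejected solutions do not pollute it.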