
Incident Management

Description/Summary
Incident Management provides the process, tools and concept for the fast recovery of service
quality in a defined service. It deals with service issues, and with all other service and user
requests recorded by a service desk. It also monitors the completion of requests by the service
desk or by all other service units. Finally, Incident Management has the task of informing the
service requester on the status of a service request.

Objectives
The purpose of Incident Management is to recover normal (i.e. agreed) service operation,
as quickly as possible, after an incident has been detected/recorded.

Incident Management contributes to an integrated Service Management approach by achieving
the following goals:

 Every incident and all required data is recorded.


 Every incident runs through a set of standardized activities and procedures, in order to
ensure effective and efficient processing.
 Every incident is categorized and prioritized regarding its (potential) impact and urgency,
in order to schedule its resolution in a business-oriented way.
 Functional and hierarchical escalation procedures are in place in order to ensure that each
incident is investigated by qualified members of staff, either by internal or external
experts.
 Well-defined functional escalation levels ensure that all incidents are handled in a cost-efficient
way and that experts are relieved from non-expert diagnosis and resolution
activities.

Roles and Functions


Incident Management Specific Roles

Static Process Roles

 Process Owner: Initiator of the process, responsible for defining its strategic goals and
allocating all required resources
 Process Manager (Incident Manager): Manager of the entire process, responsible for its
effectiveness and efficiency

Dynamic Process Roles


 Incident Agent: see IT Service Management Principles Ticket, Ticket Owner and Ticket
Agent for details
 Incident Owner: see IT Service Management Principles Ticket, Ticket Owner and Ticket
Agent for details

Service Specific Roles

 First Level Support – in general this is equivalent to the function of the Service Desk: A
person working in the Service Desk function; acts as a Ticket Agent and Ticket Owner
 Expert / Service Specialist
o Second Level Support member of staff: A person working in an internal IT
department and providing expert qualities in one or more specific areas; acts as a
ticket agent (see below)
o Third/n-th Level Support member of staff: A person working in an internal IT
department, or for an external supplier, and providing specialist qualities in one or
more specific areas; acts as a ticket agent (see below)

Customer Specific Roles

 User: Person who is affected by the Incident


 Caller: The person who reports the Incident. This may be the same as the „User“
 Service Desk

Information Artifacts
This section outlines information/data required or recorded by the process. In general, a process
record (here: the incident record) contains all information needed to execute this process and also
represents the current progress of a process. Additional information items (artifacts) typically can
be realized by considering the information from one or more (up to all) process records. This can
be done by either filtering, merging, correlating or interpreting information from these records;
sometimes this can also be done in the context of information and data from other sources.

The Incident Record

The Incident Record holds any management-relevant information for a specific incident. On
creation, it is based on (filled with) the information provided by the user or system/tool reporting
the new incident.

The following attributes must be considered for an Incident Record:

 Unique Identifier
 Incident Owner
 Incident Agent
 Caller
 User
 Customer
 Status of the Incident (Record)
 Description of the Incident Symptoms
 Service Level Agreement (SLA defined urgency)
 Current Assessed Impact of the Incident
 Priority (according to urgency and impact)
 Services Affected by the Incident
 Related Configuration Item
 General check list result
 Specific check list result
 Investigation results
 Problem(s) or Error(s) Related to the Incident
 Applicable Workaround(s) (link to workaround database)
 Request(s) for change(s) triggered
 Resolution/Recovery description
 Resolution date and time
 Testing result
 Result of Service Recovery Confirmation Request
 Closure date and time
 Additional Remarks
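
The attribute list above can be pictured as a simple data structure. The following is a minimal sketch, not a prescribed schema: all field names, types and defaults are assumptions chosen for illustration, and a real ticket system will define its own.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional
import uuid


@dataclass
class IncidentRecord:
    """Illustrative Incident Record; field names are assumptions, not a standard."""
    caller: str
    user: str
    customer: str
    symptoms: str                                  # description of the incident symptoms
    owner: str = "Service Desk"                    # Incident Owner
    agent: str = "Service Desk"                    # Incident Agent
    status: str = "new"
    sla: Optional[str] = None                      # SLA defining the urgency
    impact: Optional[int] = None                   # current assessed impact
    priority: Optional[int] = None                 # derived from urgency and impact
    affected_services: list[str] = field(default_factory=list)
    configuration_item: Optional[str] = None
    general_checklist_result: str = ""
    specific_checklist_result: str = ""
    investigation_results: str = ""
    related_problems: list[str] = field(default_factory=list)
    workarounds: list[str] = field(default_factory=list)
    change_requests: list[str] = field(default_factory=list)
    resolution: str = ""
    resolved_at: Optional[datetime] = None
    test_result: str = ""
    recovery_confirmed: Optional[bool] = None
    closed_at: Optional[datetime] = None
    remarks: str = ""
    # A random hex string stands in for the unique identifier.
    identifier: str = field(default_factory=lambda: uuid.uuid4().hex)
```

Only the data supplied by the reporting user or tool is mandatory at creation; the remaining attributes are filled in as the incident moves through the process.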

Key Concepts
Major Incident Handling

An incident of priority „1“ is a Major Incident. In order to accelerate process execution in such a
case, the following special procedure for the handling of Major Incidents is proposed: the
Service Owner takes over the Ownership of the Ticket from the Service Desk and is from then on
accountable for the Incident. See Ticket, Ticket Owner and Ticket Agent for
detailed information.

Priority

Priority is a control parameter in the process and supports the Incident Management Staff in the
efficient management of resources (staff, capacity, time etc.). Services with higher service levels
will have a higher priority („1“ is the highest priority). Priority is defined by combining impact
and urgency:

 Impact – this describes the situation of an incident and is defined by the following
factors:
o Number of affected users
o Percentage of affected users
 Urgency – this defines the priority of incidents when the impact of those incidents is the
same.
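
One common way to combine impact and urgency is a lookup matrix. The sketch below is an illustration only: the three-level scales, the thresholds in `assess_impact` and the values in `PRIORITY_MATRIX` are assumptions; the actual values would be agreed in the Service Level Agreement.

```python
# Hypothetical matrix: keys are (impact, urgency), each rated 1 (high) to 3 (low).
# The resulting priority follows the convention that 1 is the highest priority.
PRIORITY_MATRIX = {
    (1, 1): 1, (1, 2): 2, (1, 3): 3,
    (2, 1): 2, (2, 2): 3, (2, 3): 4,
    (3, 1): 3, (3, 2): 4, (3, 3): 5,
}


def assess_impact(affected_users: int, total_users: int) -> int:
    """Rate impact from the number and percentage of affected users.

    The thresholds (100 users / 50 %, 10 users / 10 %) are assumed values.
    """
    if total_users <= 0:
        raise ValueError("total_users must be positive")
    share = affected_users / total_users
    if affected_users >= 100 or share >= 0.5:
        return 1
    if affected_users >= 10 or share >= 0.1:
        return 2
    return 3


def priority(impact: int, urgency: int) -> int:
    """Look up the priority for a given impact and urgency rating."""
    return PRIORITY_MATRIX[(impact, urgency)]
```

With these assumed values, two incidents of equal impact are ordered by their urgency alone, which matches the definition of urgency given above.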
Information Duty

First level support / line support has the duty to inform the user about the status of an incident,
especially when the incident cannot be fixed immediately. The user should be kept informed about the
status of an incident between the statuses „work on incident started“ and „work on incident
stopped“. If user and caller are different persons, both should be informed of the status.
Regular status information helps to avoid escalations because the user knows that
the request is being taken care of. Status information can be triggered by events and by time schedules.

Hierarchical Escalation

Incidents can be escalated hierarchically when the process cannot be fulfilled in the „regular“ way.
This case occurs when:

 available resources are not sufficient


 agreements in ULA/UC contracts cannot be fulfilled
 the service level cannot be fulfilled

Escalation can be performed at each point of an incident’s life cycle. See Hierarchical
Escalation for detailed information.

Incident Controls

Status            Description
new               A new incident is identified or reported.
registered        A new incident is registered once all the necessary information has been stored in the Incident Record.
service selected  The affected service is identified.
classified        All important control factors have been determined, e.g. the priority of the affected service or the Service Level Agreement. The type of procedure is also defined.
forwarded         The record has been forwarded.
assigned          If an incident could not be resolved by the 1st Level, it has to be forwarded/functionally escalated to the expert team.
solution found    An incident for which a solution has been found.
recovered         An incident which is resolved (with or without Change Management).
failed            An incident with former status recovered which has not passed the required tests, or whose solution the user has rejected.
solved            An incident which has passed the required tests.
confirmed         The user has confirmed the solution of an incident and that the request has been handled successfully.
aborted           An incident which did not pass through the complete lifecycle.
end               The final status of a completed incident.
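
The statuses above can be modelled as an enumeration with a table of permitted transitions. The status names are taken from the table; the transition table `ALLOWED` is an assumption derived from the order of the statuses and is not defined by this document.

```python
from enum import Enum


class Status(Enum):
    NEW = "new"
    REGISTERED = "registered"
    SERVICE_SELECTED = "service selected"
    CLASSIFIED = "classified"
    FORWARDED = "forwarded"
    ASSIGNED = "assigned"
    SOLUTION_FOUND = "solution found"
    RECOVERED = "recovered"
    FAILED = "failed"
    SOLVED = "solved"
    CONFIRMED = "confirmed"
    ABORTED = "aborted"
    END = "end"


# Assumed transitions (illustrative only; a real process may allow others).
ALLOWED = {
    Status.NEW: {Status.REGISTERED},
    Status.REGISTERED: {Status.SERVICE_SELECTED},
    Status.SERVICE_SELECTED: {Status.CLASSIFIED},
    Status.CLASSIFIED: {Status.FORWARDED, Status.SOLUTION_FOUND},
    Status.FORWARDED: {Status.ASSIGNED},
    Status.ASSIGNED: {Status.SOLUTION_FOUND},
    Status.SOLUTION_FOUND: {Status.RECOVERED},
    Status.RECOVERED: {Status.SOLVED, Status.FAILED},
    Status.FAILED: {Status.ASSIGNED},
    Status.SOLVED: {Status.CONFIRMED, Status.FAILED},
    Status.CONFIRMED: {Status.END},
    Status.ABORTED: {Status.END},
    Status.END: set(),
}


def can_transition(current: Status, target: Status) -> bool:
    """Return True if the assumed transition table permits current -> target."""
    if target is Status.ABORTED:
        # Assumption: any incident that has not finished may be aborted.
        return current not in (Status.END, Status.ABORTED)
    return target in ALLOWED[current]
```

Keeping the transitions in one table makes it easy to reject invalid status changes in a ticket tool and to adapt the rules when the process is tailored.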

Process
High Level Process Flow Chart
Critical Success Factors

 Incident overload and backlog


 Reliable data from the Configuration Management
 Extensive “Knowledge Database” which includes Problem Descriptions, Known Errors,
Resolutions and Workarounds
 Automated Ticket Tracking Systems
 Lack of Service Level Agreements
 Users and IT staff bypassing Incident Management process

Performance Indicators (KPI)

 Total Number of Incidents


 Average Resolution Time
 Percentage of Incidents resolved by first-line support (without routing)
 Number (or percentage) of Incidents with incorrect initial classification
 Average Support Costs per Incident
 Number (or percentage) of Incidents Routed Incorrectly

These indicators can be broken down by the following dimensions: per priority, per service, per user, per
customer, per location, etc.
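
As an illustration, the indicators listed above could be computed from a set of closed incidents as follows. This is a sketch under assumptions: the `ClosedIncident` fields and the `kpis` function are hypothetical names, and resolution time is assumed to be recorded in hours.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ClosedIncident:
    """Hypothetical minimal data needed per incident for the KPIs above."""
    resolution_hours: float
    first_line: bool      # resolved by first-line support without routing
    misclassified: bool   # initial classification was incorrect
    misrouted: bool       # incident was routed incorrectly
    cost: float           # support costs for this incident


def percentage(incidents, flag):
    """Share of incidents (in percent) for which the named boolean field is True."""
    return 100.0 * sum(1 for i in incidents if getattr(i, flag)) / len(incidents)


def kpis(incidents):
    """Compute the performance indicators for a non-empty list of incidents."""
    if not incidents:
        raise ValueError("at least one closed incident is required")
    return {
        "total": len(incidents),
        "avg_resolution_hours": mean(i.resolution_hours for i in incidents),
        "pct_first_line": percentage(incidents, "first_line"),
        "pct_misclassified": percentage(incidents, "misclassified"),
        "pct_misrouted": percentage(incidents, "misrouted"),
        "avg_cost": mean(i.cost for i in incidents),
    }
```

To obtain the figures per priority, per service or per customer, the incident list would first be filtered or grouped by that attribute and `kpis` applied to each group.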

Process Trigger

Event Triggers

 Any reported incident triggers the incident management process, which may be reported
using one of the following methods:
o User calls, mails, web form
o Event management monitoring tools
o Technical staff calls or mails

Time Triggers

 None (Incident Management is typically event-triggered)

Process Specific Rules

 for each reported Incident, a new Incident Record with a unique identifier is created.
 a change of the Incident Owner/Agent is only allowed if the new Incident Owner/Agent
agrees.
 preferably, a person rather than a group should be the Incident Owner/Agent.
 inform the user/caller on the status of ticket handling after each activity

Process Activities
Incident Recording

All incoming incidents are logged; a ticket in a trouble ticket system is created for each
incident. In order to facilitate incident ownership, tracking and escalation, a ticket owner is
defined according to the roles indicated above.

Activity Specific Rules:

 Default Incident Owner is set to Service Desk


 if an Incident was reported by a caller, the member of the Service Desk recording the
incident is, from this point in time on, the Incident Owner.
 set Incident Agent = Incident Owner
 set User and Caller
 set Customer
 „Default Incident Priority“ is set to 1
 „Default Incident Status“ is set to new
 document the Incident symptoms in the Incident Description
 if the user/customer is valid and is allowed to trigger an incident for this service, then go
to control activity „recorded – accepted“

otherwise go to „recorded – rejected“

 set Configuration Item to „-“
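
The recording rules above can be sketched as a single function. This is an illustration under assumptions: the function name `record_incident`, the dictionary keys and the running counter used as unique identifier are hypothetical; the default values follow the rules listed above.

```python
import itertools

# Assumption: a simple running counter stands in for the unique identifier.
_next_id = itertools.count(1)


def record_incident(caller, user, customer, symptoms,
                    desk_member="Service Desk", authorized=True):
    """Create an incident record and return it with the next control activity.

    desk_member is the Service Desk member recording the incident; that person
    becomes Incident Owner and Incident Agent. authorized states whether the
    user/customer is valid and allowed to trigger an incident for the service.
    """
    record = {
        "id": next(_next_id),
        "owner": desk_member,
        "agent": desk_member,          # Incident Agent = Incident Owner
        "caller": caller,
        "user": user,
        "customer": customer,
        "priority": 1,                 # default Incident Priority
        "status": "new",               # default Incident Status
        "description": symptoms,
        "configuration_item": "-",
    }
    next_activity = "recorded – accepted" if authorized else "recorded – rejected"
    return record, next_activity
```

For example, `record_incident("A. Caller", "B. User", "ACME", "mail not delivered", desk_member="J. Doe")` would return a record owned by "J. Doe" together with the control activity „recorded – accepted“.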

Incident Classification
This classification aims at recording the class/type of the incident and provides essential
information for subsequent prioritization. This activity is crucial for the success of the following
sub processes.

In a first step, the affected service and the adequate SLAs are classified (set). Then the priority
can be defined based on SLAs for a certain service. Other control factors will have to be set as
well.

Activity Specific Rules:

 Service is set to the affected Service or Sub Service


 Service Level Agreement is set to the relevant SLA for the selected user (customer) and
service
 Impact is set based on the description of the Incident symptoms
 Priority is set according to urgency and impact, as described in the Service Level
Agreement
 Other control factors are set
 Go to control activity „classified“

Incident Initial Support


This activity checks if an incident has already been reported or if a solution exists for the
incident (known incidents).

Classified incidents are checked according to the following general checklist:

 are there any open incidents reported for the same service?
 are there any closed incidents documented for the same service?
 are there any open problems reported for the same service?
 are any workarounds known?
 are there any recently implemented changes addressing the same service?

If one of those questions can be answered with „YES“, this may decrease the handling time
needed for an incident’s solution. If none of the questions above can be answered with „YES“, a service specific
check list needs to be used. This service specific check list is provided in the service description.

Depending on the quality of the investigation, fewer incidents will need to be escalated to the
Service Expert Support Level.

Following the result of the Incident Initial Support, an incident should be assigned to the owner
of similar incidents. This prevents two different owners from handling similar incidents.
Incidents occurring simultaneously will in most cases have the same cause and can therefore be
consolidated. Good documentation of the solutions of solved incidents can help to decrease the
handling time of new incidents. The correlation of incidents to problems is only possible if a
Problem Management Process exists. Known workarounds can also improve the
handling times of new incidents.

Activity Specific Rules:

 execute general checklist


 execute service specific checklist
 describe the result of the general checklist
 describe the result of the service specific checklist
 describe the solution
 if a solution is found within this activity then go to control activity „supported –
successful“

otherwise

 set the Incident Agent to Service Expert by functional escalation


 go to control activity „supported – failed“
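
The Initial Support rules can be sketched as follows. The checklist questions are taken from the general checklist above; the function name `initial_support`, the dictionary keys and the way answers are passed in are assumptions made for illustration.

```python
GENERAL_CHECKLIST = [
    "open incidents for the same service",
    "closed incidents for the same service",
    "open problems for the same service",
    "known workarounds",
    "recent changes addressing the same service",
]


def initial_support(record, general_answers, solution=None):
    """Apply the general checklist and decide the next control activity.

    general_answers maps checklist questions to True ("YES") or False.
    solution is the solution text if one was found during this activity.
    """
    hits = [q for q in GENERAL_CHECKLIST if general_answers.get(q, False)]
    record["general_checklist_result"] = hits
    # The service specific checklist is needed when no general question is "YES".
    record["needs_specific_checklist"] = not hits
    if solution is not None:
        record["solution"] = solution
        return "supported – successful"
    # No solution found: functional escalation to the Service Expert.
    record["agent"] = "Service Expert"
    return "supported – failed"
```

The record is updated in place so that the checklist results are documented regardless of whether a solution is found.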

Incident Investigation
Incident Investigation is the most complex activity of the Incident Management Process. The
incident is investigated by using available information on the incident symptoms with the aim of
achieving a quick resolution of the incident and a restoration of the disrupted service(s). This
available information is provided by the general and specific checklists. Additional information
can be: affected services, CIs, users, related incidents, errors, information out of the CMDB, and
technical expert knowledge.

The first action in this activity is to check whether the classification, especially the determination
of the Service, has been completed correctly. If not, the incident is returned to the Service Desk
for reclassification.

If the classification was correct, then the results of the general and service specific checklists are
reviewed. If the initial support was not performed correctly, then the incident is returned to the Service Desk to
complete it.

If both checks are passed successfully, then the 2nd level support (or 3rd) starts its
investigation. If a solution is found, the support staff execute the recovery activity and can then pass the
Incident Ticket back to the Service Desk.

If no solution is found, the incident can be functionally escalated to a Specialist.

Activity Specific Rules:

 if classification was NOT correct
o set Incident Agent to either Service Desk or the former Incident Agent to
reclassify
o go to control activity „recorded – accepted“
 if initial support was NOT correct
o set Incident Agent to either Service Desk or the former Incident Agent to re-execute
initial support
o go to control activity „classified“
 if no solution was found
o set Incident Agent to an appropriate Service Specialist
o inform the Service Owner
o go to control activity „supported – failed“
 otherwise
o if the next activities (recovery, testing, …) are to be executed by Service Desk
 set Incident Agent to Service Desk
o go to control activity „supported – successful“
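
The decision sequence of the investigation rules can be sketched as a function that returns the next control activity. The function name `investigate`, its parameters and the dictionary keys are assumptions; the order of the checks follows the rules above, and the Service Desk is used where the rules allow either the Service Desk or the former Incident Agent.

```python
def investigate(record, classification_ok, initial_support_ok,
                solution_found, recovery_by_service_desk=False):
    """Apply the investigation rules in order and return the next control activity."""
    if not classification_ok:
        record["agent"] = "Service Desk"        # returned for reclassification
        return "recorded – accepted"
    if not initial_support_ok:
        record["agent"] = "Service Desk"        # returned to re-execute initial support
        return "classified"
    if not solution_found:
        record["agent"] = "Service Specialist"  # functional escalation
        record["service_owner_informed"] = True
        return "supported – failed"
    if recovery_by_service_desk:
        record["agent"] = "Service Desk"
    return "supported – successful"
```

The checks are evaluated strictly in this order, so an incorrect classification is always handled before the quality of the initial support is considered.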

Incident Recovery

Once a solution has been found, then different recovery options are possible. If a Change is
needed, then the Change Management Process has to be triggered. Otherwise the Incident Agent
can either perform the recovery or can guide the user/caller to do so. The individual steps must
be documented.

Activity Specific Rules:

 if a change is needed to recover the Incident, Change Management has to be triggered

otherwise

 the Incident Agent has to perform the tasks to recover the service

or

 the Incident Agent has to guide the user/caller to perform the tasks:
o document recovery
o go to the control activity recovered.

Incident Testing
Once an incident is recovered, the functionality of the live system needs to be tested. The following
actions need to be performed during this testing:

 see the Service Description for Specific Testing Procedures


 testing (test especially if all incident Symptoms have disappeared)
 approval of the implementation

The test result must be documented in the Incident Record. If Change Management was involved
in the recovery and a test was already performed, then this testing can be skipped. If possible,
the tester should be a different person from the one who performed the recovery.

Activity Specific Rules:

 set Incident Agent to Service Desk


 execute test
 record test result
 if test was successful, go to control activity „tested – successful“

otherwise

 set Incident Agent to a Service Expert


 go to control activity „tested – failed“
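
A minimal sketch of the testing rules follows. The function name `run_incident_test` and the dictionary keys are assumptions; the test itself is represented only by the list of symptoms that are still present after recovery.

```python
def run_incident_test(record, remaining_symptoms, already_tested_by_change=False):
    """Record the test result and return the next control activity.

    remaining_symptoms lists the incident symptoms still observed after recovery.
    already_tested_by_change is True if Change Management already tested the recovery.
    """
    record["agent"] = "Service Desk"
    if already_tested_by_change:
        # Testing may be skipped when Change Management has already tested.
        record["test_result"] = "skipped: already tested by Change Management"
        return "tested – successful"
    if not remaining_symptoms:
        record["test_result"] = "passed"
        return "tested – successful"
    record["test_result"] = "failed: " + ", ".join(remaining_symptoms)
    record["agent"] = "Service Expert"
    return "tested – failed"
```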

Incident Closure
When an incident is reported as successfully tested, the user has to be informed. An answer is
required from the user (caller/Customer) to confirm or deny that there has been a successful
recovery to normal service operation. If an alert (event) triggered the Incident, then no
confirmation is needed.

The solution must then be documented. This is necessary in order to build a Knowledge
Database. The final task to close the Incident Ticket must then be performed. If the user
(caller/Customer) refuses to confirm a successful recovery, then the Ticket has to be sent back to
Initial Support.

Activity Specific Rules:

 if the user confirms testing is successful, set the Incident Agent to the user
 when the user refuses to confirm, go to control activity „closed – failed“

otherwise

 set Incident Agent to Service Desk


 document the solution
 close the Incident
 go to control activity „closed – successful“
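
The closure rules can be sketched as follows. The function name `close_incident` and the dictionary keys are assumptions; the handling of event-triggered incidents follows the statement above that no confirmation is needed in that case.

```python
from datetime import datetime, timezone


def close_incident(record, user_confirmed, solution, triggered_by_event=False):
    """Close the incident if recovery is confirmed; return the next control activity."""
    if not triggered_by_event and not user_confirmed:
        # The user refuses to confirm: the ticket goes back to Initial Support.
        return "closed – failed"
    record["agent"] = "Service Desk"
    record["solution"] = solution                    # feeds the Knowledge Database
    record["closed_at"] = datetime.now(timezone.utc)
    record["closed"] = True
    return "closed – successful"
```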
