Ch06 WSP Problem Management v2023


Service Delivery

Reference Manual

for
WSP Global, Inc.

Chapter 6: Problem Management

19 December 2023
Version 5.0
WSP Problem Management

© Copyright 2023, Atos. All rights reserved. Reproduction in whole or in part is prohibited without
the prior written consent of the copyright owner.

AUTHOR(S) : Mistee Mouledous


DOCUMENT NUMBER : N/A
VERSION : 5.0
STATUS : Final
SOURCE : Atos
DOCUMENT DATE : 19 December 2023
NUMBER OF PAGES : 18

Release          Name (Role)                                      Date
Author           Mistee Mouledous, KM                             06.28.2018
Atos Reviewers   Rizwan Shaikh, PM                                12.14.2023
WSP Reviewers    Peter Hultgren, Pascal Beauchemin, Keith Ruth    12.18.2023
Owner            Daniel Herrera                                   12.14.2023

Atos 19 December 2023 6-2 of 18



Table of Contents
6 Problem Management
6.1 Objectives and Scope
6.2 Process Overview
6.2.1 Process Diagram
6.3 Roles and Responsibilities
6.4 Identify Problems
6.4.1 Problem Identification Diagram
6.4.2 Problem Identification Tasks
6.5 Record, Classify, and Update Problems
6.6 Update and Communicate Problem Status
6.6.1 Communicate Problem Status
6.7 Problem Escalation
6.8 Root Cause Analysis (RCA)
6.8.1 Prepare, Review, and Approve Formal RCA
6.8.2 RCA Process Diagram
6.9 Develop and Document Problem Resolution Plan
6.9.1 Problem Resolution Plan Diagram
6.9.2 Problem Resolution Plan Tasks
6.10 Interface with Other Service Management Processes
6.11 Implement Problem Resolution
6.11.1 Implement Problem Resolution Diagram
6.11.2 Implement Problem Resolution Tasks
6.12 Problem Closure
6.12.1 Problem Closure Diagram
6.12.2 Problem Closure Tasks
6.13 Governance Meetings
6.14 Terminology
6.14.1 Acronyms
6.14.2 Terms
6.15 Revision History


6 Problem Management
Problem Management identifies, manages, and resolves problems affecting the
• quality of service for WSP enterprise services, and
• efficiency of services provided to WSP.

Problem vs. Incident Management


Problem Management minimizes disruptions to the business by eliminating recurring
incidents and minimizing the impact of incidents that cannot be prevented. A problem is the
cause of one or more incidents.
Incident Management restores normal service as soon as possible, often using a
workaround. An incident is any event not part of the standard service operation that causes
or may cause an interruption to or reduction in the quality of a service. See Chapter 5,
Incident Management, for more information.

6.1 Objectives and Scope


The Problem Management process applies to all environments where centralized
coordination, tracking, monitoring, reporting, and resolution of problems are essential to
Atos’ delivery of services in support of WSP.
Problem Management is implemented across all hardware, software, and firmware elements
composing the Atos-to-WSP service delivery components.
The objectives of Problem Management are to
• minimize the disruption of IT services by organizing IT resources to resolve problems
according to business needs and preventing them from recurring, and
• improve Atos’ handling of problems by recording problem information, resulting in
higher levels of availability and productivity.
The Problem Management process
• establishes a method of categorizing, prioritizing, and assigning reported problems
within the Atos organization based on assessment of impact and risk;
• reduces the ratio of problems to incidents; and
• maintains an accurate known-error database.

6.2 Process Overview


Problem Management minimizes the disruption of IT services to WSP by
• organizing IT resources to resolve problems according to business needs,
• reviewing incidents for trends that could indicate potential problems (proactive
problem management),
• preventing problems from recurring by performing root cause analysis (RCA) and
resolution, and
• recording information to improve the Problem Management process.


6.2.1 Process Diagram


The following diagram provides a high-level overview of the Problem Management process.
Numbers in the diagram refer to sections in this chapter.

[Diagram: Problem Management. Swimlanes: Atos Account Team, Atos Service Desk, Atos TOSG, PM Site Rep. Boxes: Start; 6.4 Identify Problem; 6.5 Record, classify, and update Problem; 6.6 Update and communicate status; 6.7 Escalation; 6.8 Produce preliminary RCA and perform root cause analysis; 6.9 Develop and document Problem resolution plan; 6.10 Other service management processes; 6.11 Implement Problem resolution; 6.12 Close Problem and notify WSP (share lessons learned with other groups); End.]


6.3 Roles and Responsibilities


The following table provides the roles and responsibilities for Problem Management
activities.

Problem Manager and Technical/Operational Owner (Problem Record Owner)
  • Corrects identified problems in a timely manner to minimize user impact.
  • Generates RCAs.
  • Facilitates Atos RCA review.
  • Makes recommendations for infrastructure, application, configuration, code, and/or
    procedure changes based on problem analysis.
  • Informs WSP and the Account Team as required during problem resolution for
    coordination and communication with other business and technical groups.
  • Reviews problems daily and ensures problem correction is progressing.
  • Ensures problem investigation tickets are updated.
  • Practices proactive problem identification and resolution.
  • Communicates and coordinates problem resolution activities among various groups,
    including Atos, WSP, and third parties.
  • Maintains records of known errors and workarounds.

Problem Manager Support Group and Technical/Operational Support Group (TOSG)
(Support Groups)
  • Opens and assigns problem investigation tickets.
  • Ensures problem investigation tickets are opened from incident tickets when the root
    cause or known error is not identified.
  • Ensures the problem investigation and incident tickets are linked in the ITSM tool.
  • Assigns the Problem Manager and Technical/Operational Owner (problem owner).

Problem Manager Support Group and Technical/Operational Support Group (TOSG)
Management
  • Oversees, monitors, and ensures compliance with the Problem Management process.
  • Monitors problem status, resolutions, trends, and overall effectiveness of the process.
  • Oversees, monitors, and escalates problem resolution to ensure compliance with WSP
    service level agreements (SLAs).

Service Delivery Manager
  • Identifies and reports problems.
  • Reviews RCAs.
  • Provides feedback on RCAs.
  • Escalates problem resolution as needed.
  • Reviews problem closure.
  • Communicates RCA to WSP.

Atos Problem Management Process Representatives
  • Provides oversight for Problem Management process results.
  • Monitors problem status, resolutions, trends, and overall process effectiveness.
  • Reviews Problem Ticket Closure with WSP.


6.4 Identify Problems


A problem is typically identified when abnormal conditions/incidents are recognized by Atos
or WSP personnel. Incidents with an unknown error/cause and issues proactively identified
as potential problems require a problem investigation ticket.

6.4.1 Problem Identification Diagram


The following diagram provides an overview of problem identification.

[Diagram: Detecting and Logging Problems. Swimlanes: Incident Mgt, Problem Mgt (Atos Technical Experts). Flow: Start; 1 Resolve incident using workaround; 2 Open problem investigation ticket; 3 Assign Problem Manager; 6.5 Record, Classify, Update Problems.]

6.4.2 Problem Identification Tasks


The following table lists the problem identification tasks.

Step  Responsible Party   Action
1     Atos TOSG           Resolve an incident with an unknown cause using a workaround to
                          restore normal service operation.
2     Atos TOSG           Open a problem investigation ticket in the ticketing tool.
3     Atos TOSG           • Assign a Problem Manager.
                          • Go to the tasks in Section 6.5 to investigate and diagnose the
                            problem.
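The identification steps above can be sketched as a small routine that opens and links a problem investigation ticket. This is a minimal sketch only: the `Incident` and `ProblemTicket` record types, their fields, and `identify_problem` are hypothetical illustrations, not the ITSM tool's data model or API, and proactively identified problems (which also require a ticket) are not modeled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Incident:
    # Hypothetical incident record; field names are illustrative only.
    number: str
    cause_identified: bool  # True if the root cause or a known error is identified


@dataclass
class ProblemTicket:
    # Hypothetical problem investigation ticket.
    number: str
    related_incidents: List[str] = field(default_factory=list)
    problem_manager: Optional[str] = None


def identify_problem(incident: Incident, ticket_number: str,
                     problem_manager: str) -> Optional[ProblemTicket]:
    """Sketch of Steps 2-3 in Section 6.4.2, run after Step 1 has restored
    service with a workaround."""
    if incident.cause_identified:
        return None  # cause known: this rule opens no problem investigation ticket
    # Step 2: open a problem investigation ticket and link it to the incident.
    ticket = ProblemTicket(number=ticket_number)
    ticket.related_incidents.append(incident.number)
    # Step 3: assign a Problem Manager; investigation continues in Section 6.5.
    ticket.problem_manager = problem_manager
    return ticket
```

Linking the incident at creation time mirrors the Support Groups' responsibility to keep problem investigation and incident tickets related; an incident whose cause is already identified returns `None` instead of a ticket.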

Minimize and Avoid Problems


A workaround (temporary solution) minimizes the impact of a problem until Problem
Management identifies the root cause and implements a permanent solution. During RCA
development, the Problem Manager reviews the workaround to ensure it is the best solution
available.


6.5 Record, Classify, and Update Problems


By recording problems in the ITSM tool, Atos builds a searchable knowledge base that helps
resolve and circumvent future incidents and problems.
Classifying a problem involves determining
• impact on the business,
• urgency to the business,
• size, scope, and complexity of the problem, and
• resources available for correcting the problem and providing a permanent solution.
The Problem Manager classifies problems using the priorities and conditions in the following
table.

Priority     Conditions
Priority 1   • Related incident is Critical/Priority 1.
             • Major business impact to a network or infrastructure (significant loss of
               revenue, significant expense impact, or widespread impact) may occur.
             • Service level targets are missed.
             • Without immediate resolution, normal business operations may be severely
               affected.
             • Multiple business units and/or production systems could be affected.
Priority 2   • Related incident is High/Priority 2.
             • Potential for substantial/high business impact.
             • Normal business operations could be impeded until resolution is achieved.
             • Affected application or system has continual or repeated problems.
Priority 3   • Related incident is Medium/Priority 3.
             • Business impact is limited.
             • Problem is neither continual nor repeated.
Priority 4   • Related incident is Low/Priority 4.
             • No business impact.
             • Normal business operations are not impeded.
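The table above can be read as a set of conditions, any one of which points to a priority. The sketch below shows one way to apply it, assuming (the chapter does not say this) that the highest matching priority wins. The `ProblemFacts` fields, the impact level names, and `classify_problem` are hypothetical, and several judgment-based conditions are folded into a single business-impact level.

```python
from dataclasses import dataclass
from typing import Optional

# Business-impact levels mapped to the priority whose conditions they match
# (Section 6.5 table); the level names are illustrative.
IMPACT_PRIORITY = {"major": 1, "high": 2, "limited": 3, "none": 4}


@dataclass
class ProblemFacts:
    # Hypothetical inputs gathered by the Problem Manager.
    related_incident_priority: Optional[int] = None  # 1-4, or None if proactive
    business_impact: str = "none"                    # key of IMPACT_PRIORITY
    service_level_missed: bool = False               # a Priority 1 condition
    multiple_units_or_prod_systems: bool = False     # a Priority 1 condition
    continual_or_repeated: bool = False              # a Priority 2 condition


def classify_problem(facts: ProblemFacts) -> int:
    """Return a problem priority from 1 (highest) to 4.

    ASSUMPTION: any single matching condition places the problem at that
    priority, and the highest matching priority (lowest number) wins.
    """
    candidates = [IMPACT_PRIORITY[facts.business_impact]]
    if facts.related_incident_priority is not None:
        candidates.append(facts.related_incident_priority)
    if facts.service_level_missed or facts.multiple_units_or_prod_systems:
        candidates.append(1)
    if facts.continual_or_repeated:
        candidates.append(2)
    return min(candidates)
```

For example, a problem behind a Medium/Priority 3 incident with limited impact stays at Priority 3, but becomes Priority 2 once the affected system shows continual or repeated problems. The Problem Manager still makes the final classification; the function only shows how the listed conditions interact.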

6.6 Update and Communicate Problem Status


As additional information on the problem becomes available, the problem investigation
ticket is updated in the ITSM tool and the update is communicated to the appropriate Atos
and WSP personnel.
The problem-owning group has primary responsibility for updating the ticket.
• Updates are made any time the status changes.
• Any recurring incidents are associated with the ticket.
• Other groups may update the ticket or provide information to the problem-owning
group during problem resolution.
• All feedback from vendors is recorded in the ticket.


6.6.1 Communicate Problem Status


As problem status is updated, the Problem Manager notifies the appropriate Atos Service
Delivery Manager (SDM) and WSP personnel. Problem status is also communicated through
• notification by the Problem Manager, service desk, or Account Team of significant
changes in ticket status; and
• conference calls when available options or business demands require WSP input.
See Chapter 5, Incident Management, for information on communication regarding
Critical/Priority 1 incidents.

6.7 Problem Escalation


A problem investigation ticket is escalated based on
• potential impact to WSP or the Atos IT environment (problem priority); and
• time required to determine root cause.
Reactive Problem Priorities 1–2
• If problem root cause is not identified within 3 days, an email escalation is sent to the
appropriate problem-owning manager.
• If problem root cause is not identified within 5 days, an email escalation is sent to the
appropriate problem-owning manager.
Reactive/Proactive Problem Priorities 3–4
If problem root cause is not identified within 15–20 days, an escalation email is sent to the
appropriate problem-owning manager.
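The escalation timers can be expressed as a small check. This is a sketch under stated assumptions: the chapter does not say whether the thresholds are calendar or business days (calendar days are assumed), defines no timer for proactive Priority 1–2 problems, and gives a 15–20 day window for Priorities 3–4 (the sketch escalates when that window opens). `ProblemTicket` and `escalations_due` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

# Section 6.7 thresholds, in days without an identified root cause.
# ASSUMPTION: calendar days counted from the date the ticket was opened.
P1_P2_REACTIVE_THRESHOLDS = (3, 5)  # an email escalation at each threshold
P3_P4_WINDOW = (15, 20)             # escalation email sent within this window


@dataclass
class ProblemTicket:
    # Hypothetical ticket shape; not the ITSM tool's schema.
    number: str
    priority: int     # 1-4, per Section 6.5
    reactive: bool    # False for proactively identified problems
    opened: date
    root_cause_found: bool = False


def escalations_due(ticket: ProblemTicket, today: date) -> List[int]:
    """Return the day thresholds whose escalation email is due by `today`."""
    if ticket.root_cause_found:
        return []
    age_days = (today - ticket.opened).days
    if ticket.priority in (1, 2):
        if not ticket.reactive:
            return []  # the chapter gives no timer for proactive P1-P2 problems
        return [d for d in P1_P2_REACTIVE_THRESHOLDS if age_days >= d]
    # ASSUMPTION: escalate as soon as the 15-20 day window opens.
    window_start, _window_end = P3_P4_WINDOW
    return [window_start] if age_days >= window_start else []
```

Run daily against open tickets, a check like this would surface the 3-day and then the 5-day escalation for a reactive Priority 2 problem, and nothing once the root cause has been recorded.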

6.8 Root Cause Analysis (RCA)


The purpose of RCA is to
• determine the underlying cause of a problem significantly impacting service;
• document problem resolution, including any corrective actions;
• identify action(s) to prevent recurrence of the problem; and
• provide lessons learned.
Additionally, the RCA may identify service or process improvements, or other incidents
caused by known errors.
The Problem Manager is responsible for conducting the RCA. A formal RCA may be required
by
• any missed service level target;
• a contractual requirement;
• any Priority 1 ticket; or
• a WSP request for a ticket of any other priority.
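Because the chapter says a formal RCA "may be required", the sketch below only reports which listed triggers apply; it does not make the decision. `formal_rca_triggers` and its parameters are hypothetical names, and `priority` is assumed to be the ticket priority from Section 6.5.

```python
from typing import List


def formal_rca_triggers(priority: int, service_level_missed: bool = False,
                        contractual_requirement: bool = False,
                        wsp_requested: bool = False) -> List[str]:
    """List the Section 6.8 conditions under which a formal RCA may be required.

    An empty list means none of the listed triggers applies; a non-empty list
    flags the ticket for a formal-RCA decision rather than deciding it.
    """
    triggers = []
    if service_level_missed:
        triggers.append("missed service level")
    if contractual_requirement:
        triggers.append("contractual requirement")
    if priority == 1:
        triggers.append("Priority 1")
    if wsp_requested:
        triggers.append("WSP request")
    return triggers
```

Returning the reasons, rather than a bare yes/no, gives the Problem Manager something to record on the ticket when a formal RCA is started.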


The problem investigation ticket is closed only after the RCA is
• finalized (including any associated plans) and submitted to the appropriate areas;
and
• attached to the ticket in the ITSM tool.

6.8.1 Prepare, Review, and Approve Formal RCA


The Problem Manager researches all errors associated with the problem and prepares a
formal RCA to
• document the problem in clear terms;
• confirm the configuration item(s) at fault;
• identify the root cause of the incident; and
• identify preventive actions to avoid recurrence.
The RCA is reviewed and approved by Atos management.
Note: If needed, the Problem Manager creates a known error ticket and associates it with the
problem investigation ticket once the root cause is diagnosed.


6.8.2 RCA Process Diagram


The following diagram provides an overview of the RCA process.

[Diagram: Atos / WSP Root Cause Analysis (RCA) Process. Swimlanes: Atos Problem Management, Atos Service Owner, Atos Service Delivery Mgr., WSP Service Owner, WSP Regional Service Owner, Atos/WSP Oversight. An RCA is initiated from a P1 incident, by a trending P2 escalation, or at client request; the SDM, SO, and SDL are advised of the active RCA. Main steps: problem investigation (with WSP technical input if required); determine root cause and follow-up action items; update RCA; review and signoff; technical review by the SO; SDM review; delivery to the WSP SDL for review; decision whether the RCA is acceptable for closure (otherwise address RCA delivery concerns); signoff to Oversight for review and closure; review RCA ticket closures with WSP; End.]


6.9 Develop and Document Problem Resolution Plan

The problem resolution plan identifies action(s) required to prevent further problem
occurrence.

6.9.1 Problem Resolution Plan Diagram


The following diagram provides an overview of the problem resolution plan process.

[Diagram: Problem Resolution Plan (swimlane: Atos TOSG). Start; 1 Document corrective action, responsible person, resolution timeline, anticipated results, and follow-up action; 2 Change Management required? (Yes: 3 Initiate Change Mgt process in ServiceNow); 4 Schedule, apply, and verify correction in test environment; 5 Evaluate need for monitoring/automation customization; 6 Lessons Learned, document Known Errors relevant for a wider community (DR Plans, Knowledge Articles, Monitoring updates); 7 Review operational processes; End.]

6.9.2 Problem Resolution Plan Tasks


The following table provides an overview of problem resolution plan development.

Step  Responsible Party   Action
1     Atos TOSG           Document:
                          • Actions required to ensure the problem does not recur.
                          • Person(s) responsible for performing the actions.
                          • When actions are to be complete (timeline).
                          • Anticipated results of actions.
                          • Required follow-up actions.
2     Atos TOSG           Is Change Management required?
                          • If YES, go to Step 3.
                          • If NO, go to Step 4.
3     Atos TOSG           • Initiate the Change Management process in the ITSM tool to
                            implement identified resolution actions (see Chapter 7).
                          • Go to Step 4.
4     Atos TOSG           Schedule, apply, and verify the correction in a test environment
                          (when available) before applying the correction to the production
                          environment.
5     Atos TOSG           Evaluate the need for customization of monitoring and/or
                          automation to reduce the problem-detection timeframe.
6     Atos TOSG           • Evaluate Lessons Learned.
                          • Document Known Errors relevant for a wider community (DR
                            Plans, Knowledge Articles, Monitoring updates, etc.).
7     Atos TOSG           • Review operational processes to determine whether process
                            changes are required to improve service delivery to WSP.
                          • Implement problem resolution; see Section 6.11.
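The plan steps above can be sketched as a record plus two helpers: one lists Step 1 items that are still empty, the other lists the steps a plan passes through. This is a minimal sketch; `ResolutionPlan`, `STEP_LABELS`, `missing_items`, and `plan_steps` are hypothetical names, and treating follow-up actions as optional is an assumption.

```python
from dataclasses import dataclass, field
from typing import List

# Step labels paraphrase the Section 6.9.2 table.
STEP_LABELS = {
    1: "Document the problem resolution plan",
    2: "Decide whether Change Management is required",
    3: "Initiate Change Management in the ITSM tool (Chapter 7)",
    4: "Schedule, apply, and verify the correction in test (when available)",
    5: "Evaluate monitoring/automation customization",
    6: "Evaluate lessons learned; document known errors",
    7: "Review operational processes; implement the resolution (Section 6.11)",
}


@dataclass
class ResolutionPlan:
    # Hypothetical record; fields mirror the Step 1 documentation items.
    corrective_actions: List[str]
    responsible: List[str]
    timeline: str
    anticipated_results: str
    follow_up_actions: List[str] = field(default_factory=list)
    change_required: bool = False  # answer to Step 2


def missing_items(plan: ResolutionPlan) -> List[str]:
    """Step 1 items still empty.

    ASSUMPTION: follow-up actions may legitimately be empty, so they are
    not checked.
    """
    required = {
        "corrective actions": plan.corrective_actions,
        "responsible person(s)": plan.responsible,
        "timeline": plan.timeline,
        "anticipated results": plan.anticipated_results,
    }
    return [name for name, value in required.items() if not value]


def plan_steps(plan: ResolutionPlan) -> List[int]:
    """Steps the plan passes through; Step 3 is skipped when Change
    Management is not required ("If NO, go to Step 4")."""
    return [step for step in STEP_LABELS if step != 3 or plan.change_required]
```

Iterating over `plan_steps(plan)` and printing `STEP_LABELS[step]` yields a per-plan checklist; Step 3 appears only when the Step 2 answer is YES.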

6.10 Interface with Other Service Management Processes

Once the root cause of a problem is found, other service management processes may be
required to implement preventive/corrective actions.
• Change Management is usually initiated to correct the underlying cause of the problem.
To schedule the correction for implementation, the Problem Manager submits a
request for change (RFC).
• Release Management is initiated through the Change Management process to
introduce a release into the production environment.
• Knowledge Management is initiated to create known errors or new knowledge
articles.


6.11 Implement Problem Resolution


Implementing problem resolution includes
• following the problem resolution plan,
• verifying correction of the problem root cause, and
• evaluating the resolution for relevance to other systems susceptible to the identified
problem.

6.11.1 Implement Problem Resolution Diagram


The following diagram provides an overview of implementing problem resolution.

[Diagram: Implement Problem Resolution Plan (swimlane: TOSG/WSP). Steps 1–11 follow the decision flow in Section 6.11.2, returning to Develop Problem Resolution Plan (Section 6.9) if the permanent resolution is unsuccessful and ending at Problem Closure (Section 6.12).]

6.11.2 Implement Problem Resolution Tasks


The following table provides an overview of the tasks to implement problem resolution.

Step  Responsible Party   Action
1     Atos TOSG/WSP       Is Change Management required?
                          • If YES, go to Step 2.
                          • If NO, go to Step 3.
2     Atos TOSG/WSP       • Initiate the Change Management process in the ITSM tool (see
                            Chapter 7).
                          • Go to Step 4.
3     Atos TOSG/WSP       Follow procedures in the Problem Resolution Plan.
4     Atos TOSG/WSP       Verify the permanent resolution has corrected the root cause of
                          the problem.
5     Atos TOSG/WSP       Is the root cause corrected?
                          • If YES, go to Step 6.
                          • If NO, return to problem resolution plan development (see
                            Section 6.9).
6     Atos TOSG/WSP       Is the resolution applicable to other systems?
                          • If YES, go to Step 7.
                          • If NO, go to Section 6.12, Problem Closure.
7     Atos TOSG/WSP       Notify other groups as needed.
8     Atos TOSG/WSP       Solicit information on other relevant systems.
9     Atos TOSG/WSP       Is Change Management required?
                          • If YES, initiate the Change Management process (see
                            Chapter 7). Go to Step 11.
                          • If NO, go to Step 10.
10    Atos TOSG/WSP       Follow procedures in the Problem Resolution Plan.
11    Atos TOSG/WSP       • Verify the permanent resolution has corrected the root cause
                            of the problem.
                          • Go to Problem Closure, Section 6.12.
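The branching in the table above can be traced with a short function that returns the steps visited and the section to go to next. This sketch covers a single pass through the table; `implementation_path` and its boolean inputs are hypothetical, and each input stands for a decision the TOSG/WSP team makes at the corresponding step.

```python
from typing import List, Tuple


def implementation_path(change_required: bool, root_cause_corrected: bool,
                        applies_to_other_systems: bool,
                        change_required_elsewhere: bool = False
                        ) -> Tuple[List[int], str]:
    """Trace one pass through Section 6.11.2; returns (steps, next section)."""
    # Step 1: Change Management? YES -> Step 2 (then Step 4); NO -> Step 3.
    path = [1, 2] if change_required else [1, 3]
    path += [4, 5]  # Step 4: verify the fix; Step 5: is the root cause corrected?
    if not root_cause_corrected:
        return path, "6.9"  # back to problem resolution plan development
    path.append(6)  # Step 6: is the resolution applicable to other systems?
    if not applies_to_other_systems:
        return path, "6.12"  # Problem Closure
    path += [7, 8, 9]  # notify groups, solicit information, Change Management?
    # Step 9: YES -> initiate Change Management, go to Step 11; NO -> Step 10.
    path += [11] if change_required_elsewhere else [10, 11]
    return path, "6.12"  # Step 11 ends at Problem Closure
```

For example, a fix that needed Change Management, corrected the root cause, and applies to no other systems visits Steps 1, 2, 4, 5, and 6 before moving to Problem Closure.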


6.12 Problem Closure


Problem closure ensures
• the root cause is identified and documented;
• corrective and/or preventive actions are implemented in accordance with the Change
Management process; and
• WSP is notified of problem correction/closure.

6.12.1 Problem Closure Diagram


The following diagram provides an overview of problem closure.

[Diagram: Problem Closure. Swimlanes: Atos TOSG/WSP, Atos SDM. Flow from Implement resolution through Steps 1–5 as listed in Section 6.12.2, then End.]

6.12.2 Problem Closure Tasks


The following table provides an overview of problem closure.

Step  Responsible Party   Action
1     Atos TOSG/WSP       Ensure the root cause is identified and documented in the ITSM tool.
2     Atos TOSG/WSP       Ensure corrective/preventive actions are completed.
3     Atos TOSG/WSP       Flag the problem investigation ticket closed.
4     Atos SDM            Notify WSP of problem correction/resolution.
5     Atos TOSG/WSP       Review Problem Ticket Closure with WSP.
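Steps 1–3, together with the Section 6.8 rule that a ticket is closed only after the RCA is finalized and attached, amount to a closure gate. A minimal sketch, assuming hypothetical `ClosureCheck` fields that the team fills in by reviewing the ticket; `closure_blockers` is an illustrative name, not an ITSM tool feature.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClosureCheck:
    # Hypothetical closure inputs; field names are illustrative.
    root_cause_documented: bool       # Step 1
    actions_completed: bool           # Step 2
    rca_finalized_and_attached: bool  # Section 6.8 closure condition


def closure_blockers(check: ClosureCheck) -> List[str]:
    """Reasons the ticket cannot yet be flagged closed (Step 3).

    An empty list means Steps 1-2 and the Section 6.8 RCA condition are met;
    Steps 4-5 (notify WSP, review closure with WSP) follow closure.
    """
    blockers = []
    if not check.root_cause_documented:
        blockers.append("root cause not documented in the ITSM tool")
    if not check.actions_completed:
        blockers.append("corrective/preventive actions not completed")
    if not check.rca_finalized_and_attached:
        blockers.append("RCA not finalized and attached to the ticket")
    return blockers
```

Listing every blocker at once, instead of stopping at the first, lets the problem owner clear all outstanding items before the weekly reviews.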


6.13 Governance Meetings


The following table provides the governance meetings for Problem Management activities.

Meeting                              Purpose                                  Participants
Weekly RCA Review                    Review preliminary RCAs and open         Atos SDM, Atos Problem Manager
                                     corrective actions
Weekly Corrective Action Review      Review outstanding corrective actions    Atos SDM, Atos TOSG,
                                                                              Atos Problem Manager
Weekly RCA Review                    Review preliminary RCAs and open         Atos Problem Manager,
                                     corrective actions                       WSP Process Owner
Weekly Proactive Problem Review      Review top talkers                       Atos Problem Manager,
                                                                              WSP Process Owner
Biweekly Problem Management Review   Review metrics                           Atos Problem Manager,
                                                                              WSP Process Owner

6.14 Terminology
The following sections provide definitions for acronyms and terms used throughout this
chapter.

6.14.1 Acronyms
The following table provides definitions for acronyms used throughout this chapter.

Acronym Definition
RCA Root cause analysis
RFC Request for change
SDM Service Delivery Manager
TOSG Technical/Operational Support Group

6.14.2 Terms
The following table provides definitions for terms used throughout this chapter.

Term                        Definition
Known Error                 Condition identified by successful diagnosis of the root cause, when
                            the configuration item at fault is identified and confirmed.
Problem                     A state (identified from incidents) indicating an error in the IT
                            infrastructure or application software. A problem remains in this
                            state until a cause is found.
Problem Resolution Plan     Documented description of corrective actions required for problem
                            resolution. The problem resolution plan is stored in the ITSM tool.
Root Cause Analysis (RCA)   Investigation to determine the underlying cause of an incident and
                            identify corrective actions to prevent recurrence.
Service Level Agreement     A written agreement documenting the required levels of service. The
(SLA)                       SLA is agreed to by Atos and WSP, or by Atos and a third-party provider.
Priority 1                  • Significant event not part of standard service operation that
(Severity 1)                  disrupts or halts business operations;
(Major incident)            • Situation where an external or unknown entity has compromised a WSP
                              system or network; or
                            • Compromise of physical security of any vendor facility or hardware
                              used to provide services where any WSP data is in jeopardy of loss
                              or disclosure.
Priority 2                  Significant event not part of the standard service operation with
(Severity 2)                high impact where business is proceeding but is significantly
                            impaired (degraded service).
Priority 3                  Event not part of the standard service operation with medium impact
(Severity 3)                but no significant current business impact.
Priority 4                  Non-critical incident requiring no further action beyond monitoring
(Severity 4)                for follow-up.
(Minor Incident)

6.15 Revision History


Revision
Number    Date         Writer             Reference
1.0       06.28.2018   M. Mouledous, KM   First publication.
1.1       01.16.2020   M. Mouledous, KM   Updated the RCA table.
2.0       11.23.2020   B. Nolte, KM       • Revised numbering.
                                          • Changed to “ITSM tool”.
2.1       08.14.2021   M. Mouledous, KM   • Updated diagram in Section 6.2.1.
                                          • Added step to diagram in Section 6.8.2.
                                          • Updated Sections 6.3, 6.12, 6.12.1, and 6.12.2
                                            to reflect updated notification and review
                                            process of Problem Ticket Closure.
3.0       12.16.2021   B. Nolte, KM       Annual review and approval.
4.0       12.15.2022   M. Mouledous, KM   Annual review and approval.
5.0       12.19.2023   M. Mouledous, KM   Annual review and approval.
