Kepner-Tregoe Problem Solving

Managing Major Problems

Stuart Rance
Consultant, trainer, author
Information security and IT service management

@StuartRance
Agenda

• Introductions
• Quick Refresh – incidents, problems, major incidents
• How to identify a major problem
• Examples of major problems
• Steps for managing major problems
• Diagnosis techniques
• Workarounds
• Summary and close

Introduction

What will we do today?


• This is NOT training
– But there will be some “content” to facilitate discussions
• Everyone needs to contribute their experience
– So we can talk about the real world, not just theory
– The more you put in the more you will get out
• You can help to resolve other people’s issues
– And they can help you to resolve yours
• You should take away practical ideas
– To contribute to your continual service improvement

Introduction

Who are you?


• Your name
• Where you work
• Your role
• Your experience in problem management
• What you hope to get out of today’s session

Quick refresh

Working in small groups, document on a flipchart:

• What is an incident?
• What’s the goal of incident management?

• What is a problem?
• What’s the goal of problem management?

Major incidents

• In the ideal world:


– How do you identify a major incident?
– What is the goal of major incident management?
– What should you do differently for a major incident?
– Who should be involved in major incident management?

• In your organization:
– How frequently do you have major incidents?
– How effective is major incident management?
– Is there anything missing from your process?

How to prioritize problems

What should you take into account?


• Number of incidents
• Frequency of incidents
• Business impact or severity of incidents
– Incident duration, Number of users, Cost to the business
and to IT
• Recency of incidents
• Effectiveness of workaround
What combination would make a “major problem”?
How are problems prioritized in your organization?
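
One way to combine the factors above is a simple weighted score. The sketch below is only an illustration: the `ProblemStats` fields, the weights, the caps and the 0 to 1 scaling are all assumptions, not something prescribed in this session, and any real scheme would need to be agreed with the business.

```python
from dataclasses import dataclass


@dataclass
class ProblemStats:
    """Hypothetical summary of the incidents linked to one problem."""
    name: str
    incident_count: int              # number of incidents
    incidents_per_week: float        # frequency
    avg_business_impact: float       # 0 (none) to 1 (severe), agreed with the business
    days_since_last_incident: int    # recency
    workaround_effectiveness: float  # 0 (no workaround) to 1 (fully effective)


def priority_score(p: ProblemStats) -> float:
    """Return a relative score; higher means work on it sooner.

    The weights and caps are illustrative assumptions only.
    """
    volume = min(p.incident_count / 50, 1.0)
    frequency = min(p.incidents_per_week / 10, 1.0)
    recency = 1.0 / (1.0 + p.days_since_last_incident / 7)
    raw = 0.2 * volume + 0.2 * frequency + 0.4 * p.avg_business_impact + 0.2 * recency
    # A good workaround lowers the urgency but never removes it entirely.
    return raw * (1.0 - 0.5 * p.workaround_effectiveness)


problems = [
    ProblemStats("slow call collection", 120, 14, 0.9, 0, 0.2),
    ProblemStats("report formatting", 8, 0.5, 0.1, 30, 0.9),
]
ranked = sorted(problems, key=priority_score, reverse=True)
for p in ranked:
    print(f"{p.name}: {priority_score(p):.2f}")
```

A problem that scores far above the rest on a scale like this is a candidate for "major problem" handling, but the cut-off is a judgement call.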

What is a major problem?

Major incident
– Service IS NOT currently working
– Incident has a significant impact on the business
– You need to recover the service so the business can carry on working

Major problem
– Service MAY BE currently working
– There have been one or more incidents that had a significant business impact
– You (or your customer) expect that there may be repeats of the same incidents
– You need to ensure that future incidents have minimal business impact

Examples of major problems

• Group exercise
– Think of examples of major problems from your own
organization
– Don’t worry whether they were handled as major problems; we
can think later about how they could have been managed
– How many incidents, how severe, over what time period?
– How was this identified as a problem?
– How long did it take to diagnose the problem?
– How long to implement a workaround?
– How long to completely resolve?

Example of a major problem

Example of a major problem – the service

• Call handling for a large field service organization


• Thousands of agents, in almost every country
• Thousands of remote engineers accessing the
service to collect and update their calls
• Complex application with many tiers
– Each tier has many physical or virtual servers
– Lots of different databases
– Lots of feeds to customers, logistics organizations, etc.
• Lots of per-country customization
– Interfaces & feeds, legal and regulatory, working practices

Example of a major problem – the incidents

Remote engineers unable to collect their calls


• Very slow performance, or no response at all
• May impact all remote engineers, or just a subset
• Error messages for some incidents, but not usually
• Incidents last 10 minutes to many hours
• Several incidents per day for many months

Example of a major problem – business impact

• Remote engineers phoning call centres


– Cause huge telephone backlogs
– Takes a long time so they do fewer calls in a day
– Takes a long time for customers to get through
• Customers unable to log incidents
• Engineers unable to collect and update calls
• Missed contractual commitments with customers
• Increased costs
– Extra agents recruited into call centres
– Engineer overtime

Example of a major problem – incident resolution

• Level 1 or Level 2 engineers restart servers


– There are hundreds of servers in this configuration
• Eventually the symptoms go away
• Can take from a few minutes to a few hours

• Level 3 engineers are busy analysing root cause


– Many different Level 3 teams, in various countries
– Each team delivers regular “fixes” but nothing seems to
get better

Example of a major problem – discussion

• What advice would you give this customer?

• Who are the stakeholders?


– What does each stakeholder expect?

• Who should be involved in managing this problem?


– How can they turn this situation round?
– What is the most urgent thing to do next?

Now let’s consider some of your problems

• What advice would you give for this problem?

• Who are the stakeholders?


– What does each stakeholder expect?

• Who should be involved in managing this problem?


– How can they turn this situation round?
– What is the most urgent thing to do next?

Things you can do to manage a
major problem

How to use these process steps

• These steps are a pragmatic approach that has
worked for me
• Remember the mantra “adopt and adapt” – there
may be some ideas here that will work in your
environment, but don’t just copy these steps
• Document your own major problem management
steps, and make sure that people know how to
follow them

STEP 1: Form a problem team

• Identify the best people to work on the problem


• Get them physically together in a single location
– Even if this involves flights and hotel accommodation

• What roles do you think are required?


• For each role:
– What skills are required?
– What knowledge is required?
– What authority is required?

STEP 2: Separate problems

Poorly defined symptoms may hide multiple problems


• If you don’t separate out individual problems then
you can’t identify root causes or solutions
SO
• First step in every major problem should be
identifying how many distinct problems you have!
BUT
• Be careful not to confuse the customer if they
perceive the situation as one problem

STEP 2: Separate problems

STOP ROOT CAUSE ANALYSIS TILL YOU'VE DONE THIS


• Review all of the incidents related to the problem
• Define the problem symptoms as clearly as you can
• Check that every related incident matches the
symptoms
• If some incidents don’t match then define a new
problem
• Define as many problems as you need to clearly
separate each distinct cluster of symptoms
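
As a rough illustration of separating problems, the sketch below groups incidents by a "symptom signature". The incident records, the field names (`symptom`, `scope`) and the choice of keys are hypothetical; in practice the symptom definitions come from reviewing the real incident records.

```python
from collections import defaultdict

# Hypothetical incident records; the field names are assumptions.
incidents = [
    {"id": "INC1", "symptom": "slow response", "scope": "all engineers"},
    {"id": "INC2", "symptom": "no response", "scope": "subset"},
    {"id": "INC3", "symptom": "slow response", "scope": "all engineers"},
    {"id": "INC4", "symptom": "error message", "scope": "subset"},
]


def separate_problems(incidents, keys=("symptom", "scope")):
    """Group incidents whose symptoms match on every key in `keys`."""
    clusters = defaultdict(list)
    for inc in incidents:
        signature = tuple(inc[k] for k in keys)
        clusters[signature].append(inc["id"])
    return dict(clusters)


clusters = separate_problems(incidents)
for signature, ids in clusters.items():
    print(signature, ids)
```

Each distinct signature is a candidate problem. Here four incidents fall into three clusters, which suggests three problems rather than one.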

STEP 3: Document match criteria for each problem

• Define criteria to match incidents to each problem


• Investigate whether you can automate the match
– Ideally you should do this with infrastructure tools
– May be able to automate matching with service desk tool
• If matching is manual then train service desk and
ensure they understand what needs to be done
• Ensure that service desk log new problems for
anything similar that doesn’t exactly match
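
If matching can be automated, the criteria might be expressed as one rule per problem. This is a minimal sketch under stated assumptions: the problem IDs, the incident fields and the rules themselves are invented for illustration, and a real service desk or monitoring tool will have its own way of expressing them.

```python
from typing import Callable, Optional

Incident = dict
Criterion = Callable[[Incident], bool]

# Hypothetical match criteria, one predicate per known problem.
MATCH_CRITERIA: dict[str, Criterion] = {
    "PRB-1": lambda i: i["service"] == "call handling" and i["symptom"] == "slow response",
    "PRB-2": lambda i: i["service"] == "call handling" and i["symptom"] == "no response",
}


def match_incident(incident: Incident) -> Optional[str]:
    """Return the first problem whose criteria the incident meets, else None."""
    for problem_id, criterion in MATCH_CRITERIA.items():
        if criterion(incident):
            return problem_id
    return None  # no exact match: the service desk should log a new problem


print(match_incident({"service": "call handling", "symptom": "slow response"}))
print(match_incident({"service": "call handling", "symptom": "error message"}))
```

Returning `None` rather than forcing a nearest match reflects the last bullet above: anything similar that does not exactly match becomes a new problem.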

STEP 4: Prioritize problems

If you have separated a problem out into multiple new
technical problems then:
• Document the business impact of each problem
• Document frequency and recency of each problem
• Prioritize the problems relative to each other
• Select the most critical problems to work on, others
can wait till later

STEP 5: Set up reporting

• Discuss problem reporting with key stakeholders


– Could be daily reports for early stages of a major problem
• Document at least the following:
– Overall business impact
– Number and impact of incidents for each problem
– Agreed priority of each problem
– Actions that have been taken since previous report
– Next steps that will be taken
• What else do your stakeholders want reported?

STEP 6: Agree initial workarounds

DON’T WAIT TILL YOU UNDERSTAND THE ROOT CAUSE


• This won’t be perfect, but get the most senior
technical people to think through the best options
– Workarounds can be refined and improved later
• Stop RCA until you have acceptable workarounds
• Train service desk staff in workarounds if needed
• Document what should happen if problem recurs
• Document value of workarounds in your reports
• Automate the workarounds if you can

STEP 7: Monitor and improve workarounds

Each time one of the problems occurs


• Check how effective the matching was
– If it didn’t work well then stop RCA work while you
improve the problem matching
• Check how effective the workaround was
– If it didn’t work well then stop RCA work while you
improve the workaround
• Describe workaround effectiveness in next report

STEP 8: Provide regular stakeholder reports

• Demonstrate that workarounds are effective


– Include charts showing change in business impact
– For my example major problem, initial workarounds
reduced average weekly downtime by nearly 90%
– This can reduce pressure to provide instant fixes, if you
have the problem under control the customer will give
you time to analyse it properly
• Show how you are picking off problems
– Each time you fix one part of the problem you reduce
overall impact. Show this in your reports.
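
The reduction figure for a report can be calculated directly from recorded downtime. A minimal sketch with made-up weekly figures (the numbers below are illustrative and are not the data from the example problem):

```python
from statistics import mean

# Hypothetical weekly downtime in minutes, before and after the workaround.
before = [600, 540, 660, 600]
after = [70, 60, 65, 65]


def percent_reduction(before, after):
    """Percentage drop in average weekly downtime."""
    baseline = mean(before)
    if baseline == 0:
        raise ValueError("baseline downtime is zero; reduction is undefined")
    return 100 * (baseline - mean(after)) / baseline


print(f"Average weekly downtime reduced by {percent_reduction(before, after):.0f}%")
```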

STEP 9: Investigate and diagnose

There are lots of different techniques you can use. You
should be familiar with at least:
• Creating service models
• Timeline analysis
• Expanded incident lifecycle
• Kepner-Tregoe problem solving

We will discuss these later in the session

What other techniques do people in the room use?

STEP 10: Regular review

• Don’t wait till problem closure to carry out a review


• Carry out internal reviews before customer reports
– This could mean daily at the beginning of a major problem
• Carry out external reviews after customer reports
– Make sure you give the customer a chance to contribute

• What is working well, what needs improving?
• Are all problem priorities still right?
– Good workarounds should enable you to reduce priorities

STEP 11: Resolve problems

• It’s quite acceptable to retain a workaround
indefinitely, if this is good for the customer
• Use change management, don’t make uncontrolled
changes just because it’s a big problem
– If you’ve done a good job of workarounds you may not
even need emergency changes!
• As you resolve each problem this frees resources to
work on next highest priority problems

STEP 12: Final review

• Major problem stays open till customer is satisfied


• Involve all stakeholders in the review
• Blame-free post mortems result in high quality input
and most improvement

• Are you confident that underlying issue is resolved?


• Have you updated risk registers (or CSI registers)?
• What future improvements could you make to
infrastructure, processes, contracts, training etc.?

12 Steps for major problem management

Initial Activities
1. Form problem team
2. Separate problems
3. Document match criteria
4. Prioritize problems
5. Set up reporting
6. Agree initial workarounds

Ongoing Activities
7. Monitor / improve workarounds
8. Provide agreed reports
9. Investigate and diagnose
10. Regular review
11. Resolve problem(s)
12. Final review


Problem diagnosis techniques

Richard Feynman’s problem solving method

The famous physicist Richard Feynman was once
asked by a journalist how he solved physics problems.

He replied that he had a very simple method:

• Write down the problem


• Think very hard
• Write down the answer

Techniques to consider

We will discuss
• Service models
• Timeline analysis
• Expanded incident lifecycle
• Kepner-Tregoe problem solving

What other techniques do you use?

Which member(s) of the problem solving team should
have expertise in these techniques?

Service models – what are they?

Essential tool for understanding complex services

Static service model


• How do the bits fit together
• Servers, storage, networks etc…

Dynamic service model


• Relationships and timing of transactions

Service models – how do you create them?

• Check whether they already exist


– May have been created by developers
– Technical support people may have their own

• Get all the relevant technical people in one room


– Give them plenty of whiteboards, flipcharts etc.
– Ask them to document how the bits fit together and how
transactions get routed
– Provide plenty of coffee / pizza / encouragement
– Wait for magic to happen

Service models – how do you use them?

Make them visible to people working on the problem


• Maybe on the wall where you are all working

Use them to stimulate ideas of what might have failed


• Could it have failed here, there, etc.

Use them to review suggested causes


• Trace the suggested failure mode through the model
• If this really is the cause then how could it explain…

Timeline analysis – what is it?

Extremely simple tool for visualising what happened


• Also known as chronological analysis

• Collect all available data about the incident(s)


• Record date and time in a consistent format
– I use a simple spreadsheet with a column for each source
• Sort data by date and time, regardless of source
• Look for correlations

Timeline analysis – example

Date    Time   Interview with A          Error log from System X  Incident Record          BMS log
20 Oct  10:42                                                                              Sudden temperature increase
20 Oct  11:04                            Disk error “xxxxxx”
20 Oct  11:22                                                     912432 from user X “……”
20 Oct  11:25  Noticed red light on air
               handling unit
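
The same merge-and-sort can be done in a few lines of code instead of a spreadsheet. The sketch below is an assumption-laden illustration: the entries are taken from the example above, abbreviated, with each one attributed to the source it appears to belong to. The year is invented because the example gives none.

```python
from datetime import datetime

YEAR = 2020  # assumption: the example gives no year

# One list of (timestamp text, entry) per source, as collected.
sources = {
    "Interview with A": [("20 Oct 11:25", "Noticed red light on air handling unit")],
    "Error log from System X": [("20 Oct 11:04", "Disk error")],
    "Incident Record": [("20 Oct 11:22", "912432 from user X")],
    "BMS log": [("20 Oct 10:42", "Sudden temperature increase")],
}


def build_timeline(sources, year=YEAR):
    """Merge all sources into one list sorted by date and time."""
    events = []
    for source, entries in sources.items():
        for stamp, text in entries:
            # Parse every timestamp into one consistent format before sorting.
            when = datetime.strptime(f"{year} {stamp}", "%Y %d %b %H:%M")
            events.append((when, source, text))
    return sorted(events)


timeline = build_timeline(sources)
for when, source, text in timeline:
    print(when.strftime("%d %b %H:%M"), f"{source}: {text}")
```

Sorting regardless of source is what makes correlations visible: here the temperature rise recorded in the BMS log precedes the disk error and the user's incident.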
Expanded incident lifecycle – what is it?

Simple tool for understanding all the factors that
contribute to business impact of an incident

Useful at early stages of developing workaround

Look at entire lifecycle of the incident(s) and find ways
to reduce the duration

Helps identify how to reduce impact of future incidents
EVEN IF YOU DON’T UNDERSTAND ROOT CAUSE

Expanded incident lifecycle

[Diagram: a timeline running from Incident Start to Incident End. Labels: Uptime (Service Available), Downtime (Service Unavailable), and the stages Detect, Diagnose, Repair, Recover, Restore.]
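
To find where the time goes, the duration of each stage can be computed from incident timestamps. A minimal sketch, assuming you can record when each stage finished; the timestamps below are hypothetical.

```python
from datetime import datetime

# Hypothetical timestamps marking when each stage of one incident finished.
FMT = "%Y-%m-%d %H:%M"
incident_start = datetime.strptime("2020-10-20 10:00", FMT)
stage_ends = {
    "Detect": "2020-10-20 10:40",
    "Diagnose": "2020-10-20 11:30",
    "Repair": "2020-10-20 11:45",
    "Recover": "2020-10-20 12:05",
    "Restore": "2020-10-20 12:10",
}


def stage_durations(start, stage_ends):
    """Return minutes spent in each stage, in lifecycle order."""
    durations = {}
    previous = start
    for stage, end_text in stage_ends.items():
        end = datetime.strptime(end_text, FMT)
        durations[stage] = int((end - previous).total_seconds() // 60)
        previous = end
    return durations


durations = stage_durations(incident_start, stage_ends)
longest = max(durations, key=durations.get)
print(durations, "-> longest stage:", longest)
```

In this made-up incident, detection and diagnosis account for 90 of the 130 minutes, so better monitoring or diagnostic scripts would shorten future incidents more than a faster repair would.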

Kepner-Tregoe problem solving – overview

Define and describe the problem

Establish possible causes

Determine most probable cause

Verify the true cause

Think beyond the fix

Kepner-Tregoe – define and describe the problem

IS IS NOT
What

Where

When

Extent

Kepner-Tregoe – establish possible causes

IS IS NOT Differences Changes


What

Where

When

Extent

Kepner-Tregoe – Determine Most Probable Cause

Consider each possible cause that you identified


• Can it explain all of What / Where / When / Extent?
• Can it explain the exact IS and IS NOT?

Select the possible cause that best explains the
symptoms
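
The comparison can be sketched as a simple check of each candidate cause against the IS / IS NOT facts. This is an illustration only, not the official Kepner-Tregoe procedure: the yes/no flags and both candidate causes are invented, and real analysis relies on judgement rather than a count.

```python
# Observed facts from an IS / IS NOT table, reduced to hypothetical yes/no flags.
observations = {
    "affects_remote_engineers": True,
    "affects_call_centre_agents": False,
    "occurs_in_every_country": True,
    "limited_to_specific_times": False,
}

# What each candidate cause would predict; both causes are invented examples.
candidate_causes = {
    "shared database contention": {
        "affects_remote_engineers": True,
        "affects_call_centre_agents": True,
        "occurs_in_every_country": True,
        "limited_to_specific_times": False,
    },
    "remote access gateway fault": {
        "affects_remote_engineers": True,
        "affects_call_centre_agents": False,
        "occurs_in_every_country": True,
        "limited_to_specific_times": False,
    },
}


def unexplained(predictions, observations):
    """List the observations that a candidate cause fails to explain."""
    return [k for k, seen in observations.items() if predictions.get(k) != seen]


ranking = sorted(
    candidate_causes,
    key=lambda c: len(unexplained(candidate_causes[c], observations)),
)
most_probable = ranking[0]
print(most_probable)
```

A cause that leaves any IS NOT fact unexplained (here, why call centre agents are unaffected) drops down the ranking.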

Kepner-Tregoe – Verify the True Cause

DO NOT implement changes based on a probable
cause

Think of a test that can verify that your most probable
cause really is the true cause

This reduces the risk of introducing new errors when
trying to fix the original problem

Kepner-Tregoe – Think Beyond the Fix

• Don’t just fix the problem. Think about:


– What are other possible consequences of the cause?
– Could other things be impacted by the same cause?
– Could there be other damage you haven’t yet noticed?
– Could the resolution itself lead to further problems?

You should adopt this idea even if you don’t use any
other aspect of the Kepner-Tregoe approach

Kepner-Tregoe – Example

        IS                                        IS NOT
What    Poor performance of transactions          Data errors
                                                  Incorrect transactions
                                                  Error messages
Where   Remote engineers                          Call centre agents
        Every country in the world                Just some countries
When    Almost every day                          Only on specific days
        Multiple times per day                    Only at specific times or workloads
Extent  Sometimes affects all remote users        Always all users
        Sometimes only affects a subset of users  Always a subset of users

Kepner-Tregoe – Your Problems

IS IS NOT
What

Where

When

Extent

Workarounds

What is the purpose of a workaround?

• To reduce the impact of incidents


– On the business, AND on the IT organization
• To reduce the duration of incidents
– This is usually a good way to reduce impact
• To reduce the frequency of incidents
– This also helps to reduce impact

If you have an effective workaround then you may not
even need to identify and fix the root cause
• If failover takes 1 second then nobody minds failures

Components of a Workaround

• Things to do NOW to reduce impact or frequency


– If it only fails with >100 simultaneous users, then…
• Things to do when the problem occurs
– To reduce the duration or impact
– Consider each stage of the expanded incident lifecycle
• A trigger to identify when problem occurs
– Ideally an automated trigger to ensure it is recognised
– But could be a set of criteria for service desk

Improving any of these will help reduce the problem
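
An automated trigger can be as simple as a threshold check on a monitored value. The sketch below borrows the ">100 simultaneous users" figure from the example above and assumes, purely for illustration, that the user count is the signal being monitored; the function names are hypothetical.

```python
from typing import Callable, Iterable

USER_LIMIT = 100  # threshold taken from the ">100 simultaneous users" example


def check_trigger(simultaneous_users: int, limit: int = USER_LIMIT) -> bool:
    """Return True when the workaround should be invoked."""
    return simultaneous_users > limit


def monitor(samples: Iterable[int], on_trigger: Callable[[int], None]) -> int:
    """Call `on_trigger` for each sample that breaches the limit; return the count."""
    fired = 0
    for users in samples:
        if check_trigger(users):
            on_trigger(users)
            fired += 1
    return fired


alerts = []
fired = monitor([80, 95, 101, 150, 99], alerts.append)
print(fired, alerts)
```

Here `on_trigger` just records the value; in practice it might raise an alert to the service desk or start the workaround actions automatically.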

How to document workarounds

• Best is to fully automate the trigger and the actions


– If trigger isn’t automated then workaround may be missed
• Think about how service desk works and ensure
they can find information when they need it
– A known error database only works if people look at it
• Problem record should always include links to detail
of the workaround

Managing a Major Problem – Summary

Initial Activities
1. Form problem team
2. Separate problems
3. Document match criteria
4. Prioritize problems
5. Set up reporting
6. Agree initial workarounds

Ongoing Activities
7. Monitor / improve workarounds
8. Provide agreed reports
9. Investigate and diagnose
10. Regular review
11. Resolve problem(s)
12. Final review


Closing

What are your key learnings from the session?

What will you do differently as a result of this session?

Thank you
@StuartRance
StuartR@OptimalServiceManagement.com
