Kepner-Tregoe Problem Solving (PDFDrive)

Managing Major Problems
Stuart Rance
Consultant, trainer, author
Information security and IT service management
@StuartRance
Agenda
• Introductions
• Quick Refresh – incidents, problems, major incidents
• How to identify a major problem
• Examples of major problems
• Steps for managing major problems
• Diagnosis techniques
• Workarounds
• Summary and close
Introduction
What will we do today?

• This is NOT training
– But there will be some “content” to facilitate discussions
• Everyone needs to contribute their experience
– So we can talk about the real world, not just theory
– The more you put in the more you will get out
• You can help to resolve other people’s issues
– And they can help you to resolve yours
• You should take away practical ideas
– To contribute to your continual service improvement
Introduction
Who are you?

• Your name
• Where you work
• Your role
• Your experience in problem management
• What you hope to get out of today’s session
Quick refresh
Working in small groups, document on a flipchart:
• What is an incident?
• What’s the goal of incident management?
• What is a problem?
• What’s the goal of problem management?
Major incidents
• In the ideal world:

– How do you identify a major incident?
– What is the goal of major incident management?
– What should you do differently for a major incident?
– Who should be involved in major incident management?
• In your organization:
– How frequently do you have major incidents?
– How effective is major incident management?
– Is there anything missing from your process?
How to prioritize problems
What should you take into account?

• Number of incidents
• Frequency of incidents
• Business impact or severity of incidents
– Incident duration, number of users, cost to the business and to IT
• Recency of incidents
• Effectiveness of workaround
What combination would make a “major problem”?
How are problems prioritized in your organization?
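One way to make these factors concrete is a weighted score. The sketch below is only an illustration: the field names, the weights, the caps of 50 incidents and 10 incidents per week, and the 0 to 10 impact scale are all assumptions that you would need to agree with your own stakeholders.

```python
from dataclasses import dataclass

@dataclass
class ProblemStats:
    name: str
    incident_count: int              # number of incidents linked to the problem
    incidents_per_week: float        # frequency
    avg_business_impact: float       # 0 (none) to 10 (severe), agreed with the business
    days_since_last_incident: int    # recency
    workaround_effectiveness: float  # 0.0 (no workaround) to 1.0 (fully effective)

def priority_score(p: ProblemStats) -> float:
    """Hypothetical weighted score: higher means more urgent."""
    recency = 1.0 / (1 + p.days_since_last_incident)  # recent incidents weigh more
    raw = (
        0.2 * min(p.incident_count, 50) / 50
        + 0.2 * min(p.incidents_per_week, 10) / 10
        + 0.4 * p.avg_business_impact / 10
        + 0.2 * recency
    )
    # An effective workaround lowers the urgency of the underlying problem.
    return raw * (1 - 0.5 * p.workaround_effectiveness)

# Invented sample data, for illustration only.
problems = [
    ProblemStats("Report layout glitch", 2, 0.1, 1, 30, 0.0),
    ProblemStats("Slow call collection", 50, 10, 8, 0, 0.0),
]
ranked = sorted(problems, key=priority_score, reverse=True)
```

A score like this is a conversation starter, not a decision: a combination of factors that scores highly is a candidate for being treated as a major problem.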
What is a major problem?
Major incident
– Service IS NOT currently working
– Incident has a significant impact on the business
– You need to recover the service so the business can carry on working
Major problem
– Service MAY BE currently working
– There have been one or more incidents that had a significant business impact
– You (or your customer) expect that there may be repeats of the same incidents
– You need to ensure that future incidents have minimal business impact
Examples of major problems
• Group exercise
– Think of examples of major problems from your own organization
– Don’t worry whether they were handled as major problems; we can think about how they could have been managed later
– How many incidents, how severe, over what time period?
– How was this identified as a problem?
– How long did it take to diagnose the problem?
– How long to implement a workaround?
– How long to completely resolve?
Example of a major problem
Example of a major problem – the service
• Call handling for a large field service organization
• Thousands of agents, in almost every country
• Thousands of remote engineers accessing the service to collect and update their calls
• Complex application with many tiers
– Each tier has many physical or virtual servers
– Lots of different databases
– Lots of feeds to customers, logistics organizations, etc.
• Lots of per-country customization
– Interfaces & feeds, legal and regulatory, working practices
Example of a major problem – the incidents
Remote engineers unable to collect their calls

• Very slow performance, or no response at all
• May impact all remote engineers, or just a subset
• Error messages for some incidents, but not usually
• Incidents last 10 minutes to many hours
• Several incidents per day for many months
Example of a major problem – business impact
• Remote engineers phoning call centres

– Cause huge telephone backlogs
– Takes a long time so they do fewer calls in a day
– Takes a long time for customers to get through
• Customers unable to log incidents
• Engineers unable to collect and update calls
• Missed contractual commitments with customers
• Increased costs
– Extra agents recruited into call centres
– Engineer overtime
Example of a major problem – incident resolution
• Level 1 or Level 2 engineers restart servers

– There are hundreds of servers in this configuration
• Eventually the symptoms go away
• Can take from a few minutes to a few hours
• Level 3 engineers are busy analysing root cause
– Many different Level 3 teams, in various countries
– Each team delivers regular “fixes” but nothing seems to get better
Example of a major problem – discussion
• What advice would you give this customer?
• Who are the stakeholders?

– What does each stakeholder expect?
• Who should be involved in managing this problem?

– How can they turn this situation round?
– What is the most urgent thing to do next?
Now let’s consider some of your problems
• What advice would you give for this problem?
• Who are the stakeholders?

– What does each stakeholder expect?
• Who should be involved in managing this problem?

– How can they turn this situation round?
– What is the most urgent thing to do next?
Things you can do to manage a major problem
How to use these process steps
• These steps are a pragmatic approach that has worked for me
• Remember the mantra “adopt and adapt” – there may be some ideas here that will work in your environment, but don’t just copy these steps
• Document your own major problem management steps, and make sure that people know how to follow them
STEP 1: Form a problem team
• Identify the best people to work on the problem

• Get them physically together in a single location
– Even if this involves flights and hotel accommodation
• What roles do you think are required?

• For each role:
– What skills are required?
– What knowledge is required?
– What authority is required?
STEP 2: Separate problems
Poorly defined symptoms may hide multiple problems
• If you don’t separate out individual problems then you can’t identify root causes or solutions
SO
• First step in every major problem should be identifying how many distinct problems you have!
BUT
• Be careful not to confuse the customer if they perceive the situation as one problem
STEP 2: Separate problems
STOP ROOT CAUSE ANALYSIS TILL YOU'VE DONE THIS

• Review all of the incidents related to the problem
• Define the problem symptoms as clearly as you can
• Check that every related incident matches the symptoms
• If some incidents don’t match then define a new problem
• Define as many problems as you need to clearly separate each distinct cluster of symptoms
STEP 3: Document match criteria for each problem
• Define criteria to match incidents to each problem

• Investigate whether you can automate the match
– Ideally you should do this with infrastructure tools
– May be able to automate matching with service desk tool
• If matching is manual then train service desk and ensure they understand what needs to be done
• Ensure that service desk log new problems for anything similar that doesn’t exactly match
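Even when matching is manual, writing the criteria down precisely makes them easier to automate later. A minimal sketch, assuming incidents are simple dictionaries and each problem's criteria are a set of required field values plus keywords; the field names, problem IDs, and sample incidents are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class MatchCriteria:
    problem_id: str
    required: dict = field(default_factory=dict)  # field name -> exact value
    keywords: tuple = ()                           # all must appear in the summary

    def matches(self, incident: dict) -> bool:
        for key, value in self.required.items():
            if incident.get(key) != value:
                return False
        summary = incident.get("summary", "").lower()
        return all(word in summary for word in self.keywords)

def match_incident(incident: dict, criteria_list: list) -> list:
    """Return the IDs of every problem whose criteria the incident satisfies.
    An empty list means nothing matched, so a new problem should be logged."""
    return [c.problem_id for c in criteria_list if c.matches(incident)]

criteria = [
    MatchCriteria("PRB-1", {"user_group": "remote engineer"}, ("slow",)),
    MatchCriteria("PRB-2", {"user_group": "remote engineer"}, ("no response",)),
]
matched = match_incident(
    {"user_group": "remote engineer", "summary": "Call collection very slow"},
    criteria,
)
unmatched = match_incident(
    {"user_group": "agent", "summary": "Printer jam"},
    criteria,
)
```

Here `matched` contains only the first problem and `unmatched` is empty, which is the cue for the service desk to log a new problem.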
STEP 4: Prioritize problems
If you have separated a problem out into multiple new technical problems then:
• Document the business impact of each problem
• Document frequency and recency of each problem
• Prioritize the problems relative to each other
• Select the most critical problems to work on, others can wait till later
STEP 5: Set up reporting
• Discuss problem reporting with key stakeholders

– Could be daily reports for early stages of a major problem
• Document at least the following:
– Overall business impact
– Number and impact of incidents for each problem
– Agreed priority of each problem
– Actions that have been taken since previous report
– Next steps that will be taken
• What else do your stakeholders want reported?
STEP 6: Agree initial workarounds
DON’T WAIT TILL YOU UNDERSTAND THE ROOT CAUSE

• This won’t be perfect, but get the most senior technical people to think through the best options
– Workarounds can be refined and improved later
• Stop RCA until you have acceptable workarounds
• Train service desk staff in workarounds if needed
• Document what should happen if problem recurs
• Document value of workarounds in your reports
• Automate the workarounds if you can
STEP 7: Monitor and improve workarounds
Each time one of the problems occurs

• Check how effective the matching was
– If it didn’t work well then stop RCA work while you improve the problem matching
• Check how effective the workaround was
– If it didn’t work well then stop RCA work while you improve the workaround
• Describe workaround effectiveness in next report
STEP 8: Provide regular stakeholder reports
• Demonstrate that workarounds are effective

– Include charts showing change in business impact
– For my example major problem, initial workarounds reduced average weekly downtime by nearly 90%
– This can reduce pressure to provide instant fixes: if you have the problem under control the customer will give you time to analyse it properly
• Show how you are picking off problems
– Each time you fix one part of the problem you reduce overall impact. Show this in your reports.
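The figure behind a chart like this is simple arithmetic. A minimal sketch using made-up weekly downtime numbers (they are not the figures from the example above):

```python
def percent_reduction(before, after):
    """Percentage drop in average weekly downtime between two periods."""
    avg_before = sum(before) / len(before)
    avg_after = sum(after) / len(after)
    return 100 * (avg_before - avg_after) / avg_before

# Hypothetical downtime in minutes per week.
before_workaround = [600, 540, 660]  # average 600
after_workaround = [90, 60, 60]      # average 70
reduction = percent_reduction(before_workaround, after_workaround)
```

Reporting the same calculation every week, per problem, shows stakeholders the trend rather than a single headline number.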
STEP 9: Investigate and diagnose
There are lots of different techniques you can use. You should be familiar with at least:
• Creating service models
• Timeline analysis
• Expanded incident lifecycle
• Kepner-Tregoe problem solving
We will discuss these later in the session
What other techniques do people in the room use?
STEP 10: Regular review
• Don’t wait till problem closure to carry out a review

• Carry out internal reviews before customer reports
– This could mean daily at the beginning of a major problem
• Carry out external reviews after customer reports
– Make sure you give the customer a chance to contribute
• What is working well, what needs improving?
• Are all problem priorities still right?
– Good workarounds should enable you to reduce priorities
STEP 11: Resolve problems
• It’s quite acceptable to retain a workaround indefinitely, if this is good for the customer
• Use change management, don’t make uncontrolled changes just because it’s a big problem
– If you’ve done a good job of workarounds you may not even need emergency changes!
• As you resolve each problem this frees resources to work on next highest priority problems
STEP 12: Final review
• Major problem stays open till customer is satisfied

• Involve all stakeholders in the review
• Blame-free post mortems result in high quality input and most improvement
• Are you confident that underlying issue is resolved?
• Have you updated risk registers (or CSI registers)?
• What future improvements could you make to infrastructure, processes, contracts, training etc.?
12 Steps for major problem management
Initial Activities
1. Form problem team
2. Separate problems
3. Document match criteria
4. Prioritize problems
5. Set up reporting
6. Agree initial workarounds
Ongoing Activities
7. Monitor / improve workarounds
8. Provide agreed reports
9. Investigate and diagnose
10. Regular review
11. Resolve problem(s)
12. Final review
Problem diagnosis techniques
Richard Feynman’s problem solving method
The famous physicist Richard Feynman was once asked by a journalist how he solved physics problems. He replied that he had a very simple method:
• Write down the problem
• Think very hard
• Write down the answer
Techniques to consider
We will discuss
• Service models
• Timeline analysis
• Expanded incident lifecycle
• Kepner-Tregoe problem solving
What other techniques do you use?
Which member(s) of the problem solving team should have expertise in these techniques?
Service models – what are they?
Essential tool for understanding complex services
Static service model

• How do the bits fit together
• Servers, storage, networks etc…
Dynamic service model

• Relationships and timing of transactions
Service models – how do you create them?
• Check whether they already exist

– May have been created by developers
– Technical support people may have their own
• Get all the relevant technical people in one room

– Give them plenty of whiteboards, flipcharts etc.
– Ask them to document how the bits fit together and how transactions get routed
– Provide plenty of coffee / pizza / encouragement
– Wait for magic to happen
Service models – how do you use them?
Make them visible to people working on the problem

• Maybe on the wall where you are all working
Use them to stimulate ideas of what might have failed

• Could it have failed here, there, etc.
Use them to review suggested causes

• Trace the suggested failure mode through the model
• If this really is the cause then how could it explain…
Timeline analysis – what is it?
Extremely simple tool for visualising what happened

• Also known as chronological analysis
• Collect all available data about the incident(s)

• Record date and time in a consistent format
– I use a simple spreadsheet with a column for each source
• Sort data by date and time, regardless of source
• Look for correlations
Timeline analysis - example
Date | Time | Interview with A | Error log from System X | Incident Record | BMS log
20 Oct | 10:42 | | | | Sudden temperature increase
20 Oct | 11:04 | | Disk error “xxxxxx” | |
20 Oct | 11:22 | | | 912432 from user X “……” |
20 Oct | 11:25 | Noticed red light on air handling unit | | |
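The same merge-and-sort can be scripted when the sources are too large for a spreadsheet. A minimal sketch, assuming each source has already been parsed into (timestamp, description) pairs; the entries echo the example above, and the year is an arbitrary assumption because the example gives only the day and month.

```python
from datetime import datetime

def build_timeline(sources):
    """Merge events from every source into one list sorted by timestamp.
    `sources` maps a source name to a list of (datetime, description) pairs."""
    events = [
        (when, name, text)
        for name, entries in sources.items()
        for when, text in entries
    ]
    return sorted(events, key=lambda event: event[0])

sources = {
    "Error log": [(datetime(2020, 10, 20, 11, 4), "Disk error")],
    "Interview": [(datetime(2020, 10, 20, 11, 25), "Noticed red light on air handling unit")],
    "BMS log": [(datetime(2020, 10, 20, 10, 42), "Sudden temperature increase")],
}
timeline = build_timeline(sources)
for when, name, text in timeline:
    print(f"{when:%d %b %H:%M}  {name:<10}  {text}")
```

Keeping the source name on every row preserves where each fact came from, which matters when two sources disagree about timing.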
Expanded incident lifecycle – what is it?
Simple tool for understanding all the factors that contribute to business impact of an incident
Useful at early stages of developing workaround
Look at entire lifecycle of the incident(s) and find ways to reduce the duration
Helps identify how to reduce impact of future incidents EVEN IF YOU DON’T UNDERSTAND ROOT CAUSE
Expanded incident lifecycle
[Diagram: timeline from Incident Start to Incident End, contrasting Uptime (Service Available) with Downtime (Service Unavailable); the downtime is divided into the stages Detect, Diagnose, Repair, Recover, Restore.]
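To see where the time goes, it can help to record a timestamp at the end of each stage and work out how long each one took. A minimal sketch; the timestamps are invented, and treating the five stages as strictly sequential is an assumption.

```python
from datetime import datetime

STAGES = ["Detect", "Diagnose", "Repair", "Recover", "Restore"]

def stage_durations(incident_start, stage_end_times):
    """Minutes spent in each stage, given the time at which each stage finished."""
    durations = {}
    previous = incident_start
    for stage in STAGES:
        finished = stage_end_times[stage]
        durations[stage] = (finished - previous).total_seconds() / 60
        previous = finished
    return durations

# Hypothetical incident, for illustration only.
start = datetime(2020, 10, 20, 11, 0)
ends = {
    "Detect": datetime(2020, 10, 20, 11, 20),
    "Diagnose": datetime(2020, 10, 20, 12, 5),
    "Repair": datetime(2020, 10, 20, 12, 15),
    "Recover": datetime(2020, 10, 20, 12, 25),
    "Restore": datetime(2020, 10, 20, 12, 30),
}
durations = stage_durations(start, ends)
longest = max(durations, key=durations.get)
```

The longest stage is usually the best place to start when designing a workaround, whatever the root cause turns out to be.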
Kepner-Tregoe problem solving – overview
Define and describe the problem
Establish possible causes
Determine most probable cause
Verify the true cause
Think beyond the fix
Kepner-Tregoe – define and describe the problem
IS IS NOT
What
Where
When
Extent
Kepner-Tregoe – establish possible causes
IS IS NOT Differences Changes

What
Where
When
Extent
Kepner-Tregoe – Determine Most Probable Cause
Consider each possible cause that you identified

• Can it explain all of What / Where / When / Extent?
• Can it explain the exact IS and IS NOT?
Select the possible cause that best explains the symptoms
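This comparison can be laid out as a simple grid of causes against observations. The sketch below assumes you record, for each possible cause, which IS and IS NOT statements it accounts for; the observations are loosely based on the field-service example, while the two causes and the judgements about them are invented for illustration.

```python
def rank_causes(observations, explains):
    """Order possible causes by how many observations each one explains.
    `explains[cause]` is the set of observations that the cause accounts for."""
    results = []
    for cause, explained in explains.items():
        unexplained = [o for o in observations if o not in explained]
        results.append((cause, len(observations) - len(unexplained), unexplained))
    return sorted(results, key=lambda result: result[1], reverse=True)

observations = [
    "IS: remote engineers affected",
    "IS NOT: call centre agents affected",
    "IS: every country",
    "IS NOT: only at specific workloads",
]
explains = {
    "Hypothetical cause A": {
        "IS: remote engineers affected",
        "IS NOT: call centre agents affected",
        "IS: every country",
    },
    "Hypothetical cause B": {
        "IS: remote engineers affected",
        "IS: every country",
    },
}
ranking = rank_causes(observations, explains)
most_probable, explained_count, gaps = ranking[0]
```

The list of unexplained observations is as useful as the ranking itself: it tells you what the leading candidate still has to account for before you test it.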
Kepner-Tregoe – Verify the True Cause
DO NOT implement changes based on a probable cause
Think of a test that can verify that your most probable cause really is the true cause
This reduces the risk of introducing new errors when trying to fix the original problem
Kepner-Tregoe – Think Beyond the Fix
• Don’t just fix the problem. Think about:

– What are other possible consequences of the cause?
– Could other things be impacted by the same cause?
– Could there be other damage you haven’t yet noticed?
– Could the resolution itself lead to further problems?
You should adopt this idea even if you don’t use any other aspect of the Kepner-Tregoe approach
Kepner-Tregoe – Example
 | IS | IS NOT
What | Poor performance of transactions | Data errors; Incorrect transactions; Error messages
Where | Remote engineers; Every country in the world | Call centre agents; Just some countries
When | Almost every day; Multiple times per day | Only on specific days; Only at specific times or workloads
Extent | Sometimes affects all remote users; Sometimes only affects a subset of users | Always all users; Always a subset of users
Kepner-Tregoe – Your Problems
IS IS NOT
What
Where
When
Extent
Workarounds
What is the purpose of a workaround?
• To reduce the impact of incidents

– On the business, AND on the IT organization
• To reduce the duration of incidents
– This is usually a good way to reduce impact
• To reduce the frequency of incidents
– This also helps to reduce impact
If you have an effective workaround then you may not even need to identify and fix the root cause
• If failover takes 1 second then nobody minds failures
Components of a Workaround
• Things to do NOW to reduce impact or frequency

– If it only fails with >100 simultaneous users, then…
• Things to do when the problem occurs
– To reduce the duration or impact
– Consider each stage of the expanded incident lifecycle
• A trigger to identify when problem occurs
– Ideally an automated trigger to ensure it is recognised
– But could be a set of criteria for service desk
Improving any of these will help reduce the problem
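An automated trigger can be as simple as a threshold check that calls the agreed workaround action. A minimal sketch; the response-time metric, the 30-second threshold, the three-sample rule, and the restart action are hypothetical placeholders rather than details from the example.

```python
from typing import Callable, List

def make_trigger(threshold_seconds: float, consecutive: int,
                 action: Callable[[], None]) -> Callable[[float], bool]:
    """Return a function that is fed response times and runs `action`
    once `consecutive` samples in a row exceed the threshold."""
    breaches = 0

    def observe(response_seconds: float) -> bool:
        nonlocal breaches
        breaches = breaches + 1 if response_seconds > threshold_seconds else 0
        if breaches >= consecutive:
            breaches = 0  # reset so the action is not repeated on every sample
            action()
            return True
        return False

    return observe

actions_taken: List[str] = []
observe = make_trigger(30.0, 3, lambda: actions_taken.append("restart front-end tier"))
fired = [observe(seconds) for seconds in [2.0, 45.0, 50.0, 61.0, 3.0]]
```

Requiring several consecutive breaches is one way to avoid firing the workaround on a single slow sample; the right rule depends on how costly a false alarm is.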
How to document workarounds
• Best is to fully automate the trigger and the actions

– If trigger isn’t automated then workaround may be missed
• Think about how service desk works and ensure they can find information when they need it
– A known error database only works if people look at it
• Problem record should always include links to detail of the workaround
Managing a Major Problem – Summary
Initial Activities
1. Form problem team
2. Separate problems
3. Document match criteria
4. Prioritize problems
5. Set up reporting
6. Agree initial workarounds
Ongoing Activities
7. Monitor / improve workarounds
8. Provide agreed reports
9. Investigate and diagnose
10. Regular review
11. Resolve problem(s)
12. Final review
Closing
What are your key learnings from the session?
What will you do differently as a result of this session?
Thank you
@StuartRance
StuartR@OptimalServiceManagement.com
