
Lean RW Final.qxp 4/16/2007 9:56 AM Page 225

/ Reliability Basics

Basic Elements of a Comprehensive Root Cause Analysis Program
By: Mark Galley, ThinkReliability, Houston, TX
The purpose of incident investigation, problem solving, troubleshooting and root cause analysis is to determine why a particular issue occurred and to identify specific actions to be taken to prevent negative consequences from occurring. A complete investigation program defines the overall approach for a single person, group, site or even an entire company. Five of the basic elements are: 1 - methodology, 2 - measurements, 3 - the process, 4 - facilitation, documentation and storage, and 5 - regular review of the program. The principles that are part of any single incident investigation will be the same for a complete investigation program; only the scope of the implementation will change. A company-wide approach can be started with one person in one group solving problems effectively. It can then be scaled to a department, site and division. The principles are the same, but the scale regarding the number of people and level of coordination will change. Each of the five elements of a complete investigation program will be summarized in the next sections.

1. RCA Methodology
An effective root cause analysis (RCA) methodology has four basic characteristics: 1 - it must be based on cause-and-effect relationships supported with evidence, 2 - it must be scalable, 3 - it must be simple and 4 - it must connect back to work processes.

The basis of any sound investigation methodology is an extreme bias for the cause-and-effect principle supported with evidence. Regardless of the type of problem worked (a safety issue, an equipment failure, a service outage or a people issue), every incident breaks down into cause-and-effect relationships. The more an organization aligns with the basic lessons of cause-and-effect, the more effective all investigations, troubleshooting and communications will be. The opposite is also true: the more an organization makes decisions based on speculation, conjecture, hearsay and opinion while disregarding facts and data, the more ineffective that organization will be.

Because cause-and-effect is a principle it can be applied universally to any size and type of issue. The amount of detail required in a large investigation is obviously more than in a small incident, but the method shouldn’t change. Some organizations use one method for their large issues and a different investigation method for their smaller issues. Large or small, the lessons of the cause-and-effect principle do not change; only the level of detail does. Likewise some companies use one methodology for safety issues and a different approach for production or equipment issues. An actual incident investigation usually involves several different departments. Having a single, principle-based approach helps show the interconnectedness between different groups. Changing investigation methods from one group to another can create discontinuities that cause groups to remain disconnected and separate. It’s easier to establish alignment across groups when the focus is on the principles, not technique. There may be subtle variations in different group approaches, but the principles remain consistent for the people participating in and leading investigations. The more clearly and accurately the investigation reflects the actual incident, the better it is for the investigators, participants, managers and executives. When big issues, mid-size incidents and day-to-day troubleshooting of problems are worked in a similar manner it establishes a problem-solving culture in the organization. A consistent approach for managers and front-line personnel helps align problem solving to meet the overall goals.

The simpler the root cause analysis method, the better the results. Simple does not mean incomplete or inadequate. By simple we mean fundamental. An organization that focuses on the basics doesn’t have to be creative with the language of cause-and-effect. Most companies are inadvertently drawn to unnecessary adjectives when using cause-and-effect analysis. There is no need for differentiating types of causes during an investigation. The focus in the analysis step should be on the causes, not the type of cause. Any discussion about the types of causes (the adjectives) during an investigation can take the group’s focus away from evidence-based causes by wasting time on arbitrary labels. Companies have plenty of acronyms and terminology for their particular industry that simplify communication when people in the same field are talking about an issue. It is important to remember, however, how differently people see problems. What one person calls a problem is what another calls a symptom. Trying to add terms to an investigation or problem-solving method can create disagreements and miscommunications. One of the most effective ways to summarize this point is that the investigation methodology has got to be simple enough that people will use it. As soon as it’s confusing, frustrating or takes too long it is less likely to be used.

The last point is the importance of connecting every investigation back to the work processes that played a role in creating the original situation. Before any incident occurs there were already work processes in place. People were already doing some tasks when the incident occurred. The action items from a particular investigation are implemented back in the work processes. Every action item from an investigation creates some addition, modification or refinement in the original work process. Even though a group wants to improve the way they investigate issues, they don’t want to do more investigations. The ideal number of investigations a group can have is zero. No group wants the investigation. What they really want is to get better at executing the process without errors or difficulties. The objective isn’t to conduct more and more investigations - the
objective is to conduct fewer and fewer because the processes are being performed effectively. The entire concept of reliability is to be less reactive and more proactive. This bias for connecting every incident back to work processes keeps people’s focus not on firefighting and investigations, but on being proactive by improving processes.

2. Measurements - Criteria
Any time an improvement is made there is a baseline that we measure from. Someone knows that they lost 5 pounds after one month of exercise because they knew what they weighed one month ago. They had to have a baseline for weight and the starting date. Establishing a baseline in an investigation program is required to determine the specific improvements.

“What do you want to improve?” is one of the first questions when beginning an investigation program. The same approach applies to looking at a single incident or failure. If a piece of equipment has failed 6 times in the past two years we want to reduce those failures. This would be a fairly straightforward measurement. When establishing an investigation program for a particular group a Top 10 list may be identified. The list would have the top ten items that impact that organization’s goals. For an investigation program across an entire company there would be numerous baselines. It’s important to understand what the baselines are for safety rates, environmental issues, customer problems, production outages or delays and excessive material and labor costs.

These baselines are the basis for measuring improvements. A baseline is just a starting point (in performance) and a date. Without a baseline the gains cannot be validated with evidence. Measuring any deviation from the goals of the organization drives the entire root cause analysis program. Organizations should be investigating those things that have the biggest impact on the overall goals. The key performance measurements or indicators (KPIs) that an organization uses reflect the overall goals of that organization. The overall goals are the ideal state for that organization. As an example, the overall goal for safety in an industrial company is zero injuries. If the goal is zero injuries then any injury is a problem. Clearly some injuries are bigger than others, but the goals truly dictate and prioritize all of the problems. An organization doesn’t have one overall goal; it has several. Typically they are safety, environmental, customer, production-schedule and materials and labor. Any impact on any one of these overall goals, that is, any deviation from zero, is what people call a “problem.”

Depending on the impact to the overall goals some issues may be investigated to a much more detailed level than others. This is commonly referred to as severity or importance. The severity of an incident depends on how it impacts the goals. A large impact to the organization’s overall goals is a relatively larger issue. A potential impact or near miss may also be a high severity because of the potential risk of the issue. A near miss fatality with no injuries (no one was injured, but it could have been a fatality) will have a relatively large amount of detail compared to a minor injury that actually occurred.

Defining the overall goals (ideal state) up front in an investigation program aligns the entire focus of the program. Once the goals are defined, determining the problems and their severity is simply a matter of checking the impact or deviation from the goals (this is also referred to as the gap). Some companies use the measurements of overall equipment effectiveness (OEE), asset utilization (AU), or maintenance spending as ways of providing direction when determining the performance of equipment or operating units. The performance measurements can also be done by area, by incident type or by work process. These different types of measurements are helpful, but it’s important to remember that they all provide some relative measurement based on one or more of the overall goals. An investigation program should be tied directly to the overall goals of the organization so the small, mid-size and large issues can clearly be defined to give direction to individuals and groups. Every group that conducts investigations should be measuring against the overall goals of their organization.

3. Investigation Process - Roles
Just as detailed steps can be defined for any work process when operating a piece of equipment, the detailed steps of the complete investigation process must be defined for it to be established and maintained. This investigation process is not an explanation of the investigation methodology, but an overall view of all parts of the investigation program. It must define the specific roles within the program and what each one is responsible for. It also needs to include the measurement criteria for what makes something an incident and define its severity by type of issue. The severity helps dictate how detailed the investigation needs to be and the people in the organization that may need to be involved.

The investigation process must have an overall owner. For a given group, site or company there must be a person who owns the investigation process. There must also be an owner for each of the investigations. The owner of the investigation is not necessarily the person that solves the incident being worked, but they do organize and collect all of the information for that issue. The incident owner is typically called the facilitator, but there is more to the investigation than the investigation facilitation. The owner of the incident gets involved as soon as an incident occurs. They may be at the scene of the incident immediately to help contain the incident and preserve any information that would otherwise be lost. The owner coordinates the meetings (times and dates), opens and leads the investigation and ensures that all information is captured and documented. The training requirements and skills must also be defined for this incident owner.

The investigation process not only defines the role of the owner (facilitator or lead investigator), but it also defines the role of the people that will be participating in the investigation. What should people expect when participating in an investigation and what information (evidence) do they need to provide? What training is required or
suggested for the people that will be participating? The roles of the managers will also be defined. What should the managers be expecting and asking for as the output of an investigation?

The management of action items must also be defined in the investigation process. Not only do we need to know who the action items are assigned to, but we also need to know how they will be signed off or completed, as well as the process for past due action items to ensure their completion. The action item process must also consider any update in paperwork or process documentation because of any changes being made (management of change). The effectiveness of the solutions and whether or not the solutions delivered the planned results must be part of the follow-up. If this follow-up on the solutions is not part of a defined process it will not necessarily occur.

The investigation process not only defines all of the key roles, but also how the entire investigation program functions. The investigation process is a training tool for anyone participating in or leading an investigation and needs to be updated as any improvements are made.

4. Facilitation, Documentation and Storage
Facilitating an incident investigation and documenting it are connected to the storage of the incident. All three of these pieces are grouped together in this section. Each one will be broken down with its related information.

The facilitator is the owner of the incident investigation. Facilitate is defined as “to make easier.” One of the easiest ways to think of the investigation facilitator is as the person that collects and organizes all of the information related to an issue. The facilitator would be familiar with not only the investigation methodology, but also the entire investigation process. The facilitator is not necessarily the problem solver. The facilitator understands how cause-and-effect fit together, but the participants in the investigation are the ones providing the details and evidence.

It obviously helps if the investigation facilitator is comfortable in front of a group leading a review of the incident. There are some fundamental people skills that a facilitator would ideally possess. Keeping the group focused, keeping the investigation moving along, and being mindful of the time and the investigation progress are all part of the facilitator’s role. There are facilitator skills that have to do with people and meeting effectiveness. And there are investigation facilitator skills that have to do with capturing cause-and-effect relationships, providing supporting evidence and soliciting ideas and solutions from the group. Ideally both aspects of the facilitation are part of an effective incident investigation. The facilitator needs to be someone who wants the role. The level of interest by the facilitator will be reflected in the facilitation.

The facilitator may talk to people one-on-one or they may get a few people together in a group. It may take one meeting, it may take a dozen. The facilitator may collect a significant amount of preliminary information, photos, diagrams, a timeline, and statements of fact, before the first meeting is ever held. Participants in the investigation may arrive to find a preliminary packet of information at each seat with the cause-and-effect relationships already displayed on one wall in the meeting room so that the group can edit as necessary. In other cases, the facilitator may not have any information together because of the nature of the incident and the timing of the investigation. In this case the facilitator starts from scratch with the group and builds all of the related information in the investigation into a coherent picture.

The documentation aspect has a couple of parts: what to document and how it’s documented. What to document in an incident usually consists of a timeline (sequence of events), an outline that captures the impact on the overall goals, the causes with evidence and any related diagrams, photos, statements and information. The action items (solutions) are always captured in an action plan with owners and due dates assigned. This information can be put together in a hardcopy report or an electronic file.

How the information is documented is a function of the skill and background of the facilitator. Ideally the information is captured in an organized, usable format as it is collected in the investigation. That is, the causes, evidence and statements are captured as they are shared the first time. This can be done using pen and paper, on a dry erase board, on chart paper or electronically (using various software tools such as Microsoft Excel). This aspect of the documentation relates to the speed of the investigation. The investigation can take one hour or three hours depending on how it is documented during the investigation. The faster the documentation is, the faster the investigation is. Facilitator skills have a significant impact on this aspect of investigating an incident.

The storage of the incident can be done in a variety of ways depending on how it was documented and the tools that the organization has available. A simple way is to put a hardcopy of the report in a file folder to keep all of the information together, but this is only available to people that can get to the hardcopy folder. It can also be stored electronically using various tools; the simplest, least expensive, most flexible and most readily available is Microsoft Excel. Storing the incident electronically makes it much more accessible to other people across a network or even via email. This is where a database of incidents can be very helpful for retrieving and tracking incidents. A database of just the incidents is relatively simple and provides powerful sorting capability. The incident can be documented electronically using Excel, but the database keeps the entire Excel workbook as an attachment to the incident record within the database. It makes searching for incidents very easy and keeps the software platform simple.

One way to store the information from recurring incidents is by building a visual of all the past failure modes. A Cumulative Cause Map™ can be created containing all the incidents that have occurred for that piece of equipment or unit. These Cumulative Cause Maps
are built electronically, but because they typically have a large amount of information on them they can be plotted on wide paper and placed on a wall. The wall plot can be used as a summary of all the failure modes that have been identified to this point. Some may think of it as a troubleshooting guide. These Cumulative Cause Maps are another way to capture and use a significant amount of information in a simple format: people’s experience can be shared visually rather than stored in their heads. Many of the process maps for a given work process are also stored on a wall just like the Cumulative Cause Maps. Anyone with new, different, conflicting or additional information should add it to the existing map by writing directly on it. This approach provides a point for collecting everything we know that caused or causes a system or piece of equipment to fail. In concept, this approach is similar to a failure mode and effects analysis (FMEA), but it is captured visually. These visual documentation methods don’t necessarily replace a report, but they can complement it and help create a learning organization that shares organizational knowledge more easily.

5. Regular Program Review
The last element of a complete investigation program is the regular review of the process. This step is ongoing because the investigation program must be evolutionary. A complete investigation program can be defined by one person within their group. If the investigations are effective and incidents are reduced, people will copy what works, especially if it’s relatively straightforward.

The program review involves all elements of the program. The questions that need to be asked at least quarterly are “What works well and what needs to be improved in the methodology, the measurements, the investigation process, and in the facilitation, documentation and storage?”

When the investigation program is one person it’s fairly simple to get feedback. The facilitator, who is probably also the owner of the investigation process, can ask the people in the investigation what they thought of the approach and what they would like to see improved. As the program grows to encompass a site or entire business the number of people involved will increase and the feedback from participants can become scattered. It may take a little longer to collect all of the feedback, but the program will fundamentally be the same for working a single issue as it will be for a comprehensive investigation approach. A complete investigation program is only as effective as any single investigation. Regular review is important, along with a willingness to incorporate any improvements back into the definition of the process. Always connecting the investigation program to the overall goals of the organization will also help to keep everyone on track with what’s important and why. An investigation program should measure the results of the investigations (the effectiveness of solutions and improvements in performance measures), not the activity of conducting investigations.

There are several different aspects to the five elements covered in this paper. While the elements don’t provide a detailed blueprint with the specifics that must be defined for each organization, they are at least a foundational framework with which to build a comprehensive investigation program. Investigating and solving problems is part of every organization in every industry. The more effective individuals and groups become at solving incidents, the more effective the organization becomes at meeting its overall goals. Improving not only how the investigation was done, but also the speed, clarity and results of the investigation is an ongoing effort that should become part of the culture within an organization.
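
As an illustration only, the measurement ideas from section 2 (a baseline is a starting value plus a date, and severity follows the impact or potential impact on the goals) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a tool described in this paper: the names `Baseline`, `Issue`, `improvement`, `severity` and `top_issues` are hypothetical, the single common impact unit is an assumption, taking the larger of actual and potential impact is just one possible way to let a near miss outrank a minor injury, and every sample number except the six equipment failures is invented.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Baseline:
    """A starting point in performance plus the date it was recorded."""
    measure: str
    value: float
    recorded_on: date


@dataclass
class Issue:
    """An incident's deviation from a goal of zero (assumed common unit)."""
    name: str
    goal_area: str                 # e.g. "safety", "production-schedule"
    actual_impact: float           # what actually happened
    potential_impact: float = 0.0  # what could have happened (near miss)


def improvement(baseline: Baseline, current_value: float) -> float:
    """Gain relative to the baseline; positive means fewer problems."""
    return baseline.value - current_value


def severity(issue: Issue) -> float:
    """Assumption: severity is the larger of actual and potential impact."""
    return max(issue.actual_impact, issue.potential_impact)


def top_issues(issues: list[Issue], n: int = 10) -> list[Issue]:
    """Return the n issues with the largest impact on the overall goals."""
    return sorted(issues, key=severity, reverse=True)[:n]


# Six failures in two years comes from the text; the date is hypothetical.
pump = Baseline("equipment failures per two years", 6, date(2005, 1, 1))

# Hypothetical issues scored on an invented 0-10 impact scale.
issues = [
    Issue("minor hand injury", "safety", actual_impact=1.0),
    Issue("near-miss fatality", "safety", actual_impact=0.0,
          potential_impact=10.0),
    Issue("production delay", "production-schedule", actual_impact=3.0),
]

print(improvement(pump, 2))
print([issue.name for issue in top_issues(issues, n=2)])
```

Without the recorded baseline, `improvement` has nothing to subtract from, which mirrors the point that gains cannot be validated with evidence unless a starting value and date exist. The ranking function plays the role of the Top 10 list; an organization would substitute its own goals and its own way of scoring impact.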

228 2007 Conference Proceedings
