Introduction to Reliability-centred Maintenance

1.1 The changing world of maintenance

1.2 Maintenance and RCM
1.3 RCM: The seven basic questions
1.4 Applying the RCM process
1.5 What RCM achieves



Describing functions
Performance standards
The operating context
of functions
How functions should be listed




Functional failures

Functional Failures


Failure Modes and Effects Analysis



What is a failure mode?

Why analyse failure modes?
of failure modes
How much detail?
Failure effects
Sources of information about modes and effects
Levels of analysis and the information

Failure Consequences

5 .1
5 .2

feasible and worth

Hidden and evident functions
Safety and environmental consequences



5 .4

Operational consequences
Hidden failure consequences


Proactive Maintenance 1: Preventive Tasks


6.1 Technical feasibility
6.4 Scheduled restoration tasks
6.5 Scheduled discard tasks
6.6 Failures which are not

Proactive Maintenance 2: Predictive Tasks



7 .5
7 .8

Potential failures and on-condition maintenance

The P-F interval
The technical feasibility of on-condition tasks
of on-condition
On-condition tasks: some of the pitfalls
Linear and non-linear P-F curves
How to determine the P-F interval
When on-condition tasks are worth doing


Default Actions 1: Failure-finding


8.1

Default actions

task intervals
The technical feasibility of failure-finding


Other Default Actions


9.1
9.3

Walk-around checks



The RCM Decision Diagram


No scheduled maintenance

consequences and tasks

10.1
10.2 The RCM decision process
the decision worksheet
10.3




Implementing RCM Recommendations

11.1
11.5
11.7

Implementation the key

The RCM audit
Task descriptions


Actuarial Analysis and Failure Data

Work packages
Maintenance planning and control

12.1 The six failure patterns

Technical history data









Applying the RCM Process


How RCM should not be

Building skills in RCM


What RCM Achieves


Who knows?
RCM review groups



maintenance performance
Maintenance effectiveness
Maintenance efficiency
What RCM achieves


A Brief History of RCM


Appendix 1: Asset hierarchies

Appendix 2: Human error
Appendix 3: A continuum
Appendix 4: Condition monitoring


of the airlines
15.1 The
15 .2 RCM in other sectors
15.3 Why RCM





Humanity continues to depend to an [...] extent on the wealth [...]
mechanised and automated businesses. We [...] more and more on services
such as the [...] or trains which run on time. More than ever, these [...]
of physical assets.
Yet when these assets fail, not only is this wealth eroded and not only
are these services interrupted, but our very survival is threatened.
Equipment failure has played a part in some of the worst accidents and
environmental incidents in industrial [...] incidents which have become
[...] Bhopal and [...] which these failures occur and what must be done
[...] are rapidly [...] especially as it becomes [...] how many of these
failures are caused by the very activities which are [...]
The first industry to confront these issues was the international civil
aviation industry. On the basis of research which [...] and widely-held
beliefs about [...] continues to [...] as its users want it to perform.
This framework is known within the aviation industry as [...] and outside
it as Reliability-centred Maintenance.
Reliability-centred Maintenance was developed [...] years. One of the
principal milestones in its [...] commissioned by the United States
Department of Defense from United Airlines and [...] Nowlan and the late
Howard Heap [...] 1978. The report provided a [...] of the [...]ment and
[...] of RCM by the civil aviation [...] It forms the basis of both
editions of this book and of much of the work done in this field outside
the airline industry in the last fifteen [...]
Since the [...] 1980's, the author and his associates have [...] to apply
RCM in hundreds of industrial locations around the world [...] work which
led to the development of RCM2 for industries other than aviation in 1990.




The first edition of this book (published in the UK in 1991 and the USA
in 1992) provided a [...] introduction to RCM2. [...] the RCM [...] has
continued [...] to the extent that it became necessary to revise the first
edition to incorporate the new developments. Several new [...] have been
added, while others have been revised and extended. Foremost among the
[...]
a more [...] review of the role of functional analysis and the
definition of failed states in [...] 2 and 3
a much broader and [...] look at failure modes and effects analysis
in the [...] emphasis on the [...] of levels of detail required in
Chapter 4
new material on how to establish [...] levels of risk in Chapter 5
the addition of more [...] approaches to the determination of failure
[...] task intervals in Chapter 8
more about the implementation of RCM recommendations in Chapter
11, with extra [...] on the RCM [...]
more information on how RCM should and should not be [...] a more
[...] look at the role of the RCM facilitator
new material on the measurement of the overall performance of the
maintenance function in [...]
a brief review of asset hierarchies in Appendix 1, with a summary of
the [...] role played by functional hierarchies and functional block
[...] in the application of RCM
a review of different types of human error in Appendix 2, together with
a look at the part they play in the failure of physical assets
[...] condition monitoring (now Appendix [...]

In the second impression of the second edition, the word 'tolerable' has
been substituted for [...] in discussions about risk in [...] and 8 and in
[...] 3, in order to [...] this book more [...] with standard [...] used in
the world of risk. It also includes further material on [...] 8, and
slightly revised material on RCM implementation in Chapter 13.

The book is intended for maintenance, [...] and [...] managers who wish to
learn what RCM is, what it achieves and how it is applied. It will also
provide students on business or [...] courses with a [...] to financial)
[...] for the [...] Finally, the book will be invaluable for any students
of any branch of [...] who wish to review the [...] of the
state-of-the-art in [...]


[...] 2 to 10 describe the main elements of the [...] and will be of most
value to those who seek no [...] than reasonable technical grasp of the
[...] detail. Chapter 11 [...] be taken to [...] the recommendations [...]
Chapter 12 takes [...] look at the sometimes [...] of the relationship
between age and failure. [...] involved. After [...] should be [...] and
[...] achieves. Chapter 15 provides a brief history of RCM.



',;,:,vnt,:,.1rnh;:>.r 1997


It has [...] to write both editions of this book with the [...] of a [...]
many people around the world. In particular, I would like to record my
[...] to every one of the hundreds of [...] with whom I have been [...] to
work over the last ten years, each of whom has contributed [...] to the
material in [...]
I would like to pay [...] tribute to a number of people who [...] in
helping to [...] and refine the RCM philosophy to the point discussed in
this edition of this book.
[...] thanks are due to the late Stan Nowlan for [...] foundations for
both editions of this book so thoroughly, both through his [...] and in
person, and to all his [...] in the civil aviation industry for their [...]
work in this field.
[...] thanks are also due to Dr Mark Horton, for his help in [...] 5 and
8, and to Peter Stock [...] and helping to co-author Appendix 4.
I am also indebted to all members of the Aladon network for their help
in applying the [...] and for their continuous feedback about what works
and what doesn't work, much of which is also reflected in these pages.
Foremost among these are my colleagues Joel Black, Chris [...] Hugh Colman
and Ian Hipkin, and my associates Alan Katchmar, Sandy [...] Michael
Hawdon, Brian Oxenham, Ray [...] Simon Deakin and Theuns Koekemoer.
[...] the many clients who have [...] to prove that RCM is a viable force
in industry. I am especially indebted to the [...]

Gino Palarchio and Ron Thomas of Dofasco Steel
[...] Belton and [...] Camina of Ford of [...]
Joe Campbell of the British Steel Corporation
[...] and Frank O'Connor of the Irish Electricity Supply Board
Francis Cheng of [...]
Bill Seeland of the New Venture Gear [...]
Denis Udy, [...] Crouch, Kevin Weedon and Malcolm Regler of the Royal Navy
Don Turner and Trevor Ferrer of China Light & Power
John Pearce of the Mars [...]
Dick Pettigrew of Rohm & Haas
Pat McRory of BP Exploration
Al Weber and Jerry [...] of Opal [...]

The roles [...] by Don Humphrey, Richard Hall, Brian [...] David Willson
and the late Joe [...] to develop [...] or to [...] are also acknowledged
with gratitude.
Finally, a special word of thanks to my [...] an environment in which it
was [...] to write both editions of this book, and to Aladon Ltd [...] to
reproduce the RCM Information and Decision Worksheets and the RCM Decision
[...]

Introduction to
Reliability-centred Maintenance

1.1 The Changing World of Maintenance

Over the [...] than any other management discipline. [...] increase in the
number and [...] and buildings) which must [...] new maintenance techniques
[...] more complex [...] on maintenance [...] and responsibilities.
Maintenance is also [...] to changing expectations [...] awareness of the
extent to which equipment failure [...]

con nee-


[...] and to contain [...] attitudes and skills in all branches of [...]
are having to [...] to the limit. Maintenance [...] as [...] and managers.
[...] ways of thinking and [...] At the same time the limitations of
maintenance [...] apparent, no matter how much [...] are computerised.
In the face of this avalanche of [...] looking for a new [...] starts and
dead ends which [...]
[...] who operate and maintain those assets. It also enables new [...] into
effective service with [...] confidence and precision.
[...] a brief introduction to

Reliability-centred Maintenance

Since the [...] the evolution of maintenance can be traced through three
generations. RCM is rapidly becoming a cornerstone of the Third
Generation, but this [...] can only be viewed in [...] in the [...] of the
First and Second Generations.
The First Generation
The First Generation covers the period up to World War II. In those days
industry was not very highly mechanised, so downtime did not matter much.
This meant that the prevention of equipment failure was not a very high
priority in the minds of most managers. At the same time, [...] was simple
and much of it was over-designed. This made it reliable and easy to
repair. As a result, there was no need for [...] maintenance of any sort
beyond simple [...] and lubrication routines. The need for skills was also
lower than it is today.

The Second Generation

[...] World War II. Wartime pressures increased the demand [...] of all
kinds while the supply of industrial manpower dropped sharply. This led to
increased mechanisation. [...] 1950's machines of all [...] were more
numerous and more [...] Industry was [...] on them.
As this [...] grew, downtime came into [...] This led to the idea that
[...] failures could and should be [...] maintenance. In the 1960's, this
[...] overhauls done at fixed intervals.
[...] maintenance under control. [...] and are now an established part of
the practice of maintenance.
[...] the amount of capital tied up in fixed assets together with a [...]
increase in the cost of that capital led people to start [...]
The Third Generation


[...] the process [...] in industry has [...] momentum. The [...] can be
classified under the head[...]

Figure 1.1: Growing expectations of maintenance

Downtime has [...] assets [...] reducing output, increasing operating
[...] customer service. By the [...] concern in the [...] In recent [...]
the effects of downtime are [...] move towards [...] where reduced stocks
of work-in-progress mean that quite small breakdowns are [...] a whole
[...]

Greater automation also means that more and more failures affect our [...]
to sustain satisfactory [...] standards of service as it does [...]
networks as much as they can interfere with the consistent achievement of
[...]
More and more failures have serious [...] either conform to [...] they
cease to [...] on the [...] of our [...] which becomes a simple matter of
[...] survival.

At the same time as our [...] on physical assets is [...] and to own. To
secure the maximum return [...] must be kept [...] in absolute terms [...]
it is now the [...] thirty years it has moved from almost nowhere to the
top of the [...] as a cost control priority.
New research

[...] new research is [...] many of our most basic beliefs about age and
failure. In particular, it is apparent that there is less and less
connection between the [...] age of most assets and how [...] are to fail.
Figure 1.2 shows how the earliest view of failure was simply that as [...]
were more likely to fail. A growing awareness of 'infant mortality' [...]
widespread Second Generation belief in the 'bathtub' curve.


Figure 1.2 (timeline from 1940 to 1990)

Third Generation research has revealed that not one or two but six
failure patterns actually occur in practice. This is discussed in detail
[...] effect on maintenance.

New techniques
There has been [...] growth in new maintenance [...] and techniques.
Hundreds have been [...] over the past fifteen years, and more are [...]
every week.

Figure 1.3 shows how the classical emphasis on overhauls and
administrative systems has grown to include many new developments in a
number of different fields.

Figure 1.3: Changing maintenance techniques


The new developments include:

decision support tools, such as hazard studies, failure modes and effects
analyses and expert systems
new maintenance techniques, such as condition monitoring
designing equipment with a much [...]
a shift in organisational [...] towards participation, team-working and
flexibility.
A major challenge [...] maintenance people [...] is not only to learn what
these techniques are, but to decide which are worthwhile and which are not
in their own [...] If we make the [...] is possible to improve asset
performance and at the same time contain and even reduce the cost of
maintenance. If we make the wrong [...] problems are created while [...]
The challenges facing maintenance

In a nutshell, the key challenges facing modern maintenance managers can
be summarised as follows:
to select the most appropriate techniques
to deal with each type of failure process
in order to fulfil all the expectations of the owners of the assets, the
users of the assets and of society as a whole
in the most cost-effective and enduring fashion
with the active support and co-operation of all the [...]

RCM provides a framework which enables users to respond to these
challenges, quickly and [...] It does so because it never loses sight of
the fact that maintenance is about physical assets. [...] the maintenance
function itself would not exist. So RCM starts with a [...] zero-based
review of the maintenance [...]
[...] are taken for [...] This results in [...] the deployment of
resources [...] about the real needs of the assets. On the other hand, if
these [...] in the light of modern thinking, it is possible to achieve
[...] in maintenance efficiency and effectiveness.
The rest of this chapter introduces RCM in more detail. It [...] the [...]
of 'maintenance' itself. It goes on to define RCM and to describe the
seven key [...] involved in [...] this process.

1.2 Maintenance and RCM

From the [...] there are two elements to the management of any physical
asset. It must be maintained and from time to time it may also need to be
modified.
The major dictionaries define maintain as cause to continue (Oxford)
[...] state (Webster). This [...] that maintenance [...] On the other
hand, they agree that to modify something means to [...] it in some way.
This distinction between maintain and modify has profound implications
which are discussed at length in [...] However, we focus on maintenance at
this [...]
When we set out to maintain something, what is it that we wish to cause
to continue? What is the [...] state that we wish to preserve?
The answer to these questions can be found in the [...] physical asset is
put into service because someone wants it to do something. In other [...]
it to fulfil a [...] function or functions. So it follows that when we
maintain an asset, the state we wish to preserve must be one in which it
continues to do whatever its users want it to do.

Maintenance: Ensuring that physical assets
continue to do what their users want them to do


What the users want will [...] where and how the asset is [...] (the
operating context). This leads [...] formal definition of
Reliability-centred Maintenance:

Reliability-centred Maintenance: a process
used to determine the maintenance requirements
of any physical asset in its operating context

In the light of the earlier definition of maintenance, a fuller definition
of RCM could be 'a process used to determine what must be done to ensure
that any physical asset continues to do whatever its users want it to do
in its present operating context'.

1.3 RCM: The seven basic questions

The RCM process entails [...] seven questions about the [...] or system
under [...] as follows:

what are the functions and associated performance standards of the
asset in its present operating context?
in what ways does it fail to fulfil its functions?
what causes each functional failure?
what happens when each failure occurs?
in what way does each failure matter?
what can be done to predict or prevent each failure?
what should be done if a suitable proactive task cannot be found?

These questions are introduced [...] in the [...] then considered in
detail in Chapters 2 to 10.
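
The way the seven questions build on one another can be sketched as a small
record structure. This is only an illustrative sketch, not the layout of the
RCM Information or Decision Worksheets: the class names, field names and the
pump example (including its 800 litres per minute figure) are all assumptions
invented here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FailureMode:
    cause: str        # question 3: what causes the functional failure?
    effect: str       # question 4: what happens when it occurs?
    consequence: str  # question 5: in what way does it matter?
    policy: str       # questions 6 and 7: proactive task or default action


@dataclass
class FunctionalFailure:
    description: str  # question 2: a way the asset fails to fulfil the function
    modes: List[FailureMode] = field(default_factory=list)


@dataclass
class Function:
    statement: str    # question 1: function plus performance standard
    failures: List[FunctionalFailure] = field(default_factory=list)


def count_failure_modes(functions: List[Function]) -> int:
    """Count every failure mode recorded under every functional failure."""
    return sum(len(ff.modes) for fn in functions for ff in fn.failures)


# Hypothetical asset and figures, invented purely for illustration.
pump_function = Function(
    statement="Transfer water at not less than 800 litres per minute",
    failures=[
        FunctionalFailure(
            description="Transfers no water at all",
            modes=[FailureMode(
                cause="Bearing seizes",
                effect="Motor trips; downtime to replace the bearing",
                consequence="operational",
                policy="scheduled on-condition task",
            )],
        ),
        FunctionalFailure(
            description="Transfers less than 800 litres per minute",
            modes=[FailureMode(
                cause="Impeller worn",
                effect="Flow drops gradually below the required rate",
                consequence="operational",
                policy="scheduled restoration task",
            )],
        ),
    ],
)
```

Nesting failure modes under functional failures, and those under functions,
mirrors the order in which the questions are asked: nothing can be recorded
about a failure mode until the function it threatens has been stated.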


Functions and Performance Standards

[...] a process used to determine what must be done to ensure that any
physical asset continues to do whatever its users want it to do in its
present operating context, we need to do two [...]
determine what its users want it to do
ensure that it is capable of [...] its users want to [...]

[...] in the RCM process is to define the functions of [...] context,
together with the associated desired [...] What users [...] assets to be
able to do can [...]
[...] which summarise why the asset was acquired in the [...] This
category of functions covers issues such as [...] product quality and
customer service.
[...] asset is [...] to do [...] fulfil its primary functions. Users also
have expectations in areas such as [...] control, containment, [...] and
even the appearance of the asset.
The users of the assets are usually in the best position by far to know
[...] what contribution each asset makes to the [...] and financial [...]
of the [...] as a [...] so it is essential that they are involved in the
RCM process from the outset.
[...] alone usually takes up about a third of the time involved in an
entire RCM [...] It also causes the team [...] to learn a remarkable
amount, often a [...] about how the [...] actually works.
Functions are [...] in more detail in Chapter 2.

Functional Failures
[...] of maintenance are defined [...] the functions and associated [...]
of the asset under consideration. But how does maintenance achieve these
[...]
The only occurrence which is likely to stop any [...] to the [...] by its
users is some kind of failure. This [...] maintenance achieves its [...]
by adopting a suitable approach to [...] of failure. However, before we
can [...] a suitable blend of failure management tools, we need to [...]
what failures can occur.
The RCM process does this at two levels:
[...] what circumstances amount to a failed state
then by [...] what events can cause the asset to [...] into a failed
state.
In the world of RCM, failed states are known as functional failures
be[...] occur when an asset is unable [...] to [...] standard [...] which
is [...] to the user.
In addition to the total [...] this definition encompas[...] but at an
unacceptable level of performance [...] situations where the asset cannot
sustain [...] levels of quality or [...] these can [...] identified after
the functions and [...] standards of the asset have been defined.
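
As a rough illustration of why functional failures can only be identified
once a performance standard has been stated, the sketch below compares an
observed level of performance with a required one. The function name, the
three labels and the "higher is better" form of the standard are assumptions
made here for illustration.

```python
def assess_state(actual: float, required: float) -> str:
    """Classify an asset's state against a 'not less than' performance standard.

    Assumes a single numeric standard where higher is better (for example a
    flow rate). The label wording is illustrative only.
    """
    if required <= 0:
        raise ValueError("required performance must be positive")
    if actual <= 0:
        # The asset does nothing at all: total loss of function.
        return "functional failure (total)"
    if actual < required:
        # The asset still works, but below the level its users accept.
        return "functional failure (partial)"
    return "functioning"
```

For instance, with a hypothetical standard of 800 units, `assess_state(650, 800)`
returns the partial-failure label even though the asset is still running.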
Functional failures are discussed at [...] in Chapter 3.
Failure Modes
[...] once each functional failure has [...] all the events which are
[...] to [...] state. These events [...] failure modes. [...] failure
modes include those which have occurred on the same or similar equipment
[...] context, failures which are [...] and failures which have not [...]
considered to be real [...] in the context in [...]
Most traditional lists of failure modes [...] deterioration or normal wear
and tear. [...] failures caused by [...] flaws so that all reasonably
[...] the cause of each failure in [...] effort are not wasted [...] to
treat [...] instead of [...] it is equally [...] to ensure that time is
not wasted on the analysis itself [...] into too much detail.
Failure Effects
The fourth step in the RCM process entails listing failure effects, which
describe what [...] when each failure mode occurs. These descriptions
should include all the information needed to support the evaluation of the
consequences of the [...] such as:
what evidence [...] that the failure has occurred
in what ways (if [...] it poses a threat to [...] or the environment
in what ways [...] it affects production [...]
what [...] is caused [...] the failure
what must be done to [...] the failure.

[...] and also for eliminating waste

Failure Consequences
A detailed [...] of an average industrial [...] is likely to yield between
three and ten thousand [...] failure modes. Each of these failures affects
the [...] in some way, but in each case, the effects are different. [...]
may also affect product quality, customer [...] or the environment. They
will all take time and cost money to [...]
It is these consequences which most strongly influence the extent to
which we try to [...] each failure. In other words, if a failure has
serious consequences, we are likely to go to great [...] to try to avoid
it. On the other hand, if it has little or no effect, then we may decide
to do no routine maintenance beyond basic cleaning and lubrication.
[...] of RCM is that it [...] that the consequences of failures are far
more important than their technical characteristics. In [...] that the
only reason for [...] any kind of [...] se, but to avoid or at least to
reduce the [...] The RCM process classifies these consequences into four
groups, as follows:
Hidden failure consequences: Hidden failures have no direct impact, [...]
expose the [...] to multiple failures with [...] of these failures are
associated with [...]
Safety and environmental consequences: A failure has safety consequences
if it could hurt or kill someone. It has environmental consequences if it
could lead to a breach of any [...] or international environmental
standard.
Operational consequences: A failure has operational consequences if it
affects production (output, product quality, customer service or
operating costs in addition to the direct cost of repair)
Non-operational consequences: Evident failures which fall into this [...]
affect neither [...] nor production, so they involve only the direct cost
of repair.
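
The four groups above can be read as a sequence of yes/no questions. The
sketch below is one possible reading: the function, its parameter names and
the order in which the questions are asked are assumptions made here, and the
RCM Decision Diagram itself is not reproduced in this chapter.

```python
from enum import Enum


class Consequence(Enum):
    HIDDEN = "hidden failure consequences"
    SAFETY_ENVIRONMENTAL = "safety and environmental consequences"
    OPERATIONAL = "operational consequences"
    NON_OPERATIONAL = "non-operational consequences"


def classify_consequence(
    evident: bool,
    could_hurt_or_kill: bool,
    could_breach_environmental_standard: bool,
    affects_production: bool,
) -> Consequence:
    """Assign a failure to one of the four consequence groups.

    Assumed order of questions: hidden first, then safety or environment,
    then production. Whatever is left involves only the cost of repair.
    """
    if not evident:
        return Consequence.HIDDEN
    if could_hurt_or_kill or could_breach_environmental_standard:
        return Consequence.SAFETY_ENVIRONMENTAL
    if affects_production:
        return Consequence.OPERATIONAL
    return Consequence.NON_OPERATIONAL
```

Asking about safety and the environment before production means a failure
that threatens both is placed in the safety and environmental group rather
than being treated as an operational matter.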


We will see later how the RCM process uses these categories as the basis
of a [...] framework for maintenance decision-making. [...] structured
review of the consequences of each failure mode in terms of the [...] and
the [...] environment into the mainstream of maintenance management.
The consequence evaluation process also shifts [...] the idea that all
failures are bad and must be prevented. In [...] it focuses attention on
the maintenance activities which have most effect on [...] performance of
the [...] and diverts energy away from those which have little or no
effect. It also encourages [...] to think more [...]

[...] tasks: these are tasks undertaken before failure occurs, in order to
prevent the item from [...] into a failed state. They embrace what is
traditionally known as [...] although we will see later that RCM uses the
terms scheduled restoration, scheduled discard and on-condition
maintenance

[...] actions: these deal with the failed state, and [...] chosen when it
[...]

The consequence evaluation [...] and in much more detail [...] tasks in
more detail.
Proactive Tasks
[...] still believe that the best way to [...] is to do some kind of [...]
maintenance on a routine [...] Generation wisdom [...] that this should
consist of overhauls or component replacements at fixed intervals. Figure
1.4 illustrates the fixed interval view of failure.


Figure 1.4: The traditional view of failure










Figure 1.4 is based on the [...] 'X', and then wear out. [...] suggests
[...] records about failure will enable us to determine this life and [...]
to take preventive [...]
[...] often found where equipment comes into direct contact with the
product. [...] failures are also often associated with [...] abrasion and
[...]
However, equipment in [...] it was twenty years ago. This has led to
startling [...] in the [...] shown in Figure 1.5. The graphs show
conditional probability of failure [...] age for a [...] of electrical and
mechanical items.
Pattern A is the well-known bathtub curve. It [...] with a [...] followed
by a constant or [...] then by a wear-out zone. Pattern B shows constant
or [...] probability of [...] in a wear-out zone (the same as Figure 1.4).
Pattern C shows [...] but there is no identifiable [...] Pattern D shows
[...] probability of failure when the item is new or just [...] of the
[...] then a rapid increase to a constant [...] while pattern E shows a
constant conditional probability of failure at all [...] Pattern F starts
with [...] infant mortality, which [...] or very [...] of failure.
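
To get a feel for what 'conditional probability of failure against age'
means, the following sketch uses the Weibull hazard function. This is a
standard reliability model brought in here purely for illustration; it is
not the method used to produce the patterns described above, and the
parameter values are arbitrary assumptions. A shape of 1 gives a constant
rate (the behaviour described for pattern E), a shape above 1 gives a rate
that rises with age, and a shape below 1 gives a falling rate of the kind
associated with infant mortality.

```python
def weibull_hazard(age: float, shape: float, scale: float) -> float:
    """Instantaneous failure rate at a given age under a Weibull model.

    h(t) = (shape / scale) * (t / scale) ** (shape - 1)

    age, shape and scale must all be positive. The model and any parameter
    values passed in are illustrative assumptions, not data from the text.
    """
    if age <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("age, shape and scale must all be positive")
    return (shape / scale) * (age / scale) ** (shape - 1)


# Compare the rate at two ages for three arbitrary shape values.
for label, shape in [("constant", 1.0), ("rising", 2.0), ("falling", 0.5)]:
    early = weibull_hazard(2.0, shape, scale=10.0)
    late = weibull_hazard(4.0, shape, scale=10.0)
    print(f"{label}: rate at age 2 = {early:.4f}, at age 4 = {late:.4f}")
```

Only where the rate rises with age does replacing or overhauling an item at
a fixed age reduce its chance of failing, which is the point the fixed
interval view of failure depends on.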
Studies done on civil aircraft showed that [...] of the items conformed to
pattern A, [...] 7% to [...] 14% to E and no fewer than 68% [...] number of
times these [...] occur in aircraft is not [...] the same as in [...] But
there is no doubt that [...] assets become more [...] more and [...] E and
[...]
These findings contradict the belief that there is [...] a connection [...]
This belief led to the idea that the more often an item is [...] this is
seldom true. Unless there is a dominant [...] age limits do little or
[...] to improve the [...] In fact scheduled overhauls can [...] increase
overall failure rates [...] introducing infant [...] into otherwise stable
systems.
An awareness of these facts has led some [...] to do for failures with
minor consequences. But when the [...] consequences are [...] must be
[...] to prevent or predict the failures, or at least to [...] the
consequences.
This [...] us back to the [...] RCM divides proactive tasks into three
[...]
scheduled restoration tasks
scheduled discard tasks
scheduled on-condition tasks.

Scheduled restoration and scheduled discard tasks

Scheduled restoration entails [...] component or overhauling an assembly
at or before a [...] its condition at the time. Similarly, scheduled
discard entails [...] an item at or before a [...] regardless of its
condition at the [...]
[...] of tasks are now [...] preventive maintenance. [...] used to be by
far the most [...] of proactive maintenance. However for the reasons
discussed [...] are much [...] widely used than they were [...]


On-condition tasks
[...] need to [...] of failure, and the growing inability of classical
techniques to do so, are behind the growth of new [...] on the [...] of the
fact that they are about to occur. These warnings are known as potential
failures, and are defined as identifiable [...] conditions which indicate
that a functional failure is about to occur or is in the process [...]
The new techniques are used to detect potential failures so that action
can be taken to avoid the consequences which could occur if they
degenerate into functional failures. [...] are called on-condition tasks
because items are left in service on the condition [...] to meet desired
performance standards. [...] maintenance includes [...] condition-based
maintenance and condition monitoring.)
Used appropriately, on-condition tasks are a very [...] but they can also
be an [...] waste of time. RCM enables decisions in this area to be made
with [...]
Default Actions
[...] of default [...] as follows:

[...] hidden functions [...] to determine whether they have failed
(whereas condition-based tasks entail [...] is [...]

[...] entails making any one-off change to the built-in [...] This
includes modifications to the hardware and also covers once-off changes to
procedures.

no scheduled maintenance: as the name [...] this default entails making
[...] anticipate or prevent failure modes to which it is applied, and so
those failures are simply allowed to occur and then repaired. This default
is also called run-to-failure.
The RCM Task Selection Process
[...] of RCM is the way it [...] understood criteria [...] which (if [...]
proactive tasks [...] in any context, and if so for deciding how [...]
should be done and who should do them. These criteria are discussed in
more detail in [...] 6 and 7.



Whether or not a proactive task is technically feasible [...] the
technical characteristics of the task and of the failure which it is meant
[...] Whether it is [...] how well it deals with the consequences of the
failure. [...] task cannot be found which is both technically feasible and
worth [...] then suitable default action must be taken. The essence of the
task selection process is as follows:

[...] task is worth [...] if it reduces the risk of the multiple failure
associated with that function to an [...] low level. If such a task cannot
be found then a scheduled failure-finding task must be performed. [...]
then the [...] default decision is that the item may have to be redesigned
[...] on the consequences of the multiple [...]
[...] or environmental consequences, a [...] if it reduces the risk of
that failure on its own to a very low level indeed, if it does not
eliminate it [...] If a task cannot be found which reduces the risk of the
failure to an [...] the item must be redesigned or the process must be
changed.
if the failure has operational consequences, a [...] worth doing if the
total [...] the cost of the [...] consequences and the cost [...] the task
must be [...] on economic [...] the initial default decision is no
scheduled maintenance. (If this occurs and [...] consequences are still
unacceptable then the secondary default decision is [...]
if a failure has non-operational consequences [...] task is only [...] if
the cost of the task over a period of time is [...] same period. So these
tasks must also [...] on economic grounds. If it is [...] the initial
default decision is again no scheduled maintenance, [...] default decision
[...]
This approach means that [...] tasks are [...] for failures which really
need [...] which in turn leads to substantial reductions in routine
workloads. Less routine work also means that the [...] are more [...] to be
done [...] with the elimination of counterproductive tasks leads to more
effective maintenance.
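
The selection logic outlined above can be sketched as a single function.
This is a simplified illustration under stated assumptions, not the RCM
Decision Diagram: the category strings, parameter names and returned labels
are invented here, risk is reduced to a yes/no judgement, and the secondary
default decisions for failures with economic consequences are left out.

```python
def select_policy(
    category: str,
    technically_feasible: bool,
    reduces_risk_enough: bool = False,
    task_cost: float = 0.0,
    cost_without_task: float = 0.0,
    failure_finding_possible: bool = False,
) -> str:
    """Choose between a proactive task and an initial default action.

    category: 'hidden', 'safety_environmental', 'operational' or
    'non_operational' (labels assumed for this sketch).
    task_cost and cost_without_task must cover the same period of time.
    """
    if category in ("hidden", "safety_environmental"):
        # Worth doing only if risk falls to a level judged acceptable.
        worth_doing = reduces_risk_enough
    elif category in ("operational", "non_operational"):
        # Worth doing only if it costs less than the failures it avoids.
        worth_doing = task_cost < cost_without_task
    else:
        raise ValueError(f"unknown consequence category: {category!r}")

    if technically_feasible and worth_doing:
        return "proactive task"
    if category == "hidden":
        if failure_finding_possible:
            return "scheduled failure-finding task"
        return "redesign may be needed"
    if category == "safety_environmental":
        return "redesign or change the process"
    return "no scheduled maintenance"
```

For example, with hypothetical costs, a feasible task costing 5 000 over a
period in which unprevented failures would cost 3 000 is rejected for an
operational failure, and the initial default of no scheduled maintenance
applies.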



Compare this with the traditional [...] to the development of maintenance
policies. Traditionally, the maintenance [...] of each asset are assessed
in terms of its real or assumed technical characteristics, [...] the
consequences of failure. [...] resulting schedules are used for all
similar assets, [...] without considering that different [...] in
different [...] contexts. This results in [...] numbers of schedules which
are [...] not because they are [...] the technical sense, but because [...]
achieve nothing.
Note also that the RCM process considers the maintenance requirements of
each asset before [...] whether it is [...] reconsider the [...] This is
simply because the maintenance [...] who is on duty [...] to maintain the
equipment as it exists [...] not what should be there or what [...] be
there at some [...] in the future.

1.4 Applying the RCM process

in any onnm1sat101t1, we need to know what these assets are and to decide
vvhich of them are to be sut)tectea to the RCM review process. This means
if one does not exist already. In fact,
ii,....,...,,,.,. u,u orJi:an:1sa1:ior1s nowadays
......,,\.,l uu.u., for this purpose, so this book only touches
in Appendix l .
on the most desirable attributes of such
If it
RCM leads to remarkable improvements in maintenance
and often does so
the successful application of RCM
on meticulous planning and
The
elements of the planning process are as follows:
decide which assets are most likely to benefit from the RCM process,
how they will benefit
and if so,
assess the resources required to
the process to the selected assets
in cases where the likely benefits justify
in detail
and who is to audit each analysis, when and
who to
and arrange for them to receive appropriate training



Review groups

We have seen how the RCM process embodies seven basic questions. In
practice, maintenance people simply cannot answer all these questions on
their own. This is because many (if not most) of the answers can only be
supplied by production or operations people. This applies especially to
questions concerning functions, desired performance, failure effects and
failure consequences.
For this reason, a review of the maintenance requirements of any asset
should be done by small teams which include at least one person from the
maintenance function and one from the operations function. The seniority
of the group members is less important than the fact that they should
have a thorough knowledge of the asset under review. Each group member
should also have been trained in RCM. The make-up of a typical RCM
review group is shown in Figure 1.6:
The use of these groups not only enables management to gain access to the
knowledge and expertise of each member of the group on a systematic basis,
but the members themselves gain a greatly enhanced understanding of the
asset in its operating context.

Figure 1.6: A typical RCM review group
(surviving labels: (M and/or E); External Specialist (if needed) (Technical or Process))


RCM review groups work under the guidance of highly trained specialists
in RCM, known as facilitators. The facilitators are the most important
people in the RCM review process. Their role is to ensure that:
the RCM analysis is carried out at the right level, that system boundaries
are clearly defined, that no important items are overlooked and that
the results of the analysis are properly recorded
RCM is correctly understood and applied by the group members
the group reaches consensus in a brisk and orderly fashion, while retaining
the enthusiasm and commitment of individual members

quickly and finishes on time.

managers or sponsors to ensure

and receives appropriate managerial and

Facilitators and RCM review groups are
discussed in more detail in

The outcomes
in the manner
If it is
outcomes, as follows:
maintenance schedules to be done by the maintenance department
for the
of the asset
of the
a list of areas where
asset or the way in which it is operated to deal with situations where the
asset cannot deliver the desired
in its current configuration
in the process learn a
outcomes are that
Two less
deal about how the
and also tend to function better as teams.

"' ..

Auditing and implementation

after the review has been
for each asset, senior
managers with overall
for the
themselves that decisions made by the group are sensible and defensible.
the recommendations are implemented
After each review is
maintenance schedules into maintenance planning and
into the
control systems, by
authority. Key
changes to the
dations for
of and implementation are discussed in Chapter 11.

1.5 What RCM Achieves

Desirable as
are, the outcomes listed above should only be seen as
a means to an end.
they should enable the maintenance function
to fulfil all the expectations listed in
1.1 at the
do so is summarised in the following paragraphs
and discussed
in more detail in
14.



Greater safety and environmental integrity: RCM considers the safety
and environmental implications of every failure mode before considering
its effect on operations. This means that
are taken to minimise
all identifiable equipment-related safety and environmental hazards, if
not eliminate them altogether. By integrating safety into the mainstream of
maintenance decision-making, RCM also
attitudes to
Improved operating performance (output, product quality and customer service): RCM
that all types of maintenance have some
value, and provides rules for
which is most suitable in every
situation. By doing so, it helps ensure
the most effective forms
of maintenance are chosen for each asset, and that suitable action is
taken in cases where maintenance cannot help. This much more tightly
focused maintenance effort leads to quantum jumps in the
of existing assets where these are
RCM was developed to help airlines draw up maintenance programs
for new types of aircraft
they enter service. As a
ideal way to develop such programs
equipment for which no historical information is available. This saves
much of the trial and error which is so often
of the
new maintenance programs: trial which is
and frustrating, and error which can be very
Greater maintenance cost-effectiveness: RCM continually focuses
attention on the maintenance activities which have most effect on the
performance of the plant. This
to ensure that
maintenance is spent where it will do the most
In addition, if RCM is correctly applied to
maintenance systems, it reduces the amount of routine work (in other
nance tasks to be undertaken on a
issued in each period,
usually by 40% to 70%. On the other hand, if RCM is used to
a new maintenance program, the resulting scheduled workload is much
lower than if the program is
by traditional methods.
Longer useful life of expensive items, due to a carefully focused
emphasis on the use of on-condition maintenance techniques.
A comprehensive database: An RCM review ends with a comprehensive
and fully documented record of the maintenance
all the significant assets used by the
This makes it

to reconsider all maintenance policies from
scratch. It also enables equipment users to demonstrate that their maintenance
programs are built on rational foundations (the audit trail
more and more
the information stored on RCM
worksheets reduces the
with its attendant loss
An RCM review of the maintenance
of each asset also
a much clearer view of the skills
to maintain each
asset, and for
what spares should be held in stock. A valuable
and manuals.

Greater motivation of individuals, especially people who are involved
in the review process. This leads to
understanding of the equipment in its operating context,
and their solutions. It also means


Better teamwork: RCM
a common,
understood technical language for everyone who has
to do with maintenance.
maintenance and operations people a better understanding
of what maintenance can
achieve and what must be done
to achieve it.

All of these issues are part of the mainstream of maintenance management

feature of RCM is
all of them at once, and for
to do with the equipment in the process.
results very quickly. In
if they are
RCM reviews can pay for themselves in a matter
of months and sometimes even
as discussed in Chapter 14.
The reviews transform both the
maintenance
of the
assets used by the
and the way in which the
maintenance function as a whole is
The result is more cost-effective,
more harmonious and much more successful maintenance.


for things, be

electrical or structural. This


feel o ffended

assets in poor condition.

been at the heart of
These reflexes have
tive maintenance. They have
rise to
seeks to care for
which as the name
strategists to believe that maintenance
or built-in
of any asset.
this is not so.
As we
put into service because someone wants it to do something. So it follows
that when we maintain an asset, the state which we wish to preserve must
be one in which it continues to do whatever its users want it to do. Later
we will see that this state - what the users want - is fundamentally
different from the built-in
of the
emphasis on what the asset does rather than what it is
whole new way of
of maintenance for any
one which focuses on what the user wants. This
feature of the RCM process, and is
'TQM applied to physical assets'.
terms of user
of the functions
of each asset
with the associated performance
standards. This is
why the RCM process starts by
what are the functions and associated performance standards of the
asset in its present operating context?
in more detail. It describes how
two main
standards, reviews different
and shows how functions should be listed.

2.1 Describing functions
a function state
It is also helpful to start such
statements with the word 'to'
'to pump water', to
, etc).
explained at
in the next
of this chapter, users
an asset to fulfil a function.
it to do so
to an
So a function definition
- and by
implication the definition of the objectives
of maintenance for the asset
- is not complete unless it
the level of
desired
the user

For instance, the primary function of the pump in Figure 2.1 would be listed as:
To pump water from Tank X to Tank Y at not less than 800 litres per minute.


that a complete function statement consists of a

and the standard of
the user.

A function statement should consist of a verb,
an object and a desired standard of performance
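
The three-part structure of a function statement can be captured in a small data structure. The following Python sketch is illustrative only: the class name `FunctionStatement`, its field names and the `is_complete` check are assumptions made for this example and are not part of the RCM process as described in the text. The pump statement itself is taken from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionStatement:
    verb: str       # what the asset does, e.g. "pump"
    obj: str        # what it acts on, e.g. "water from Tank X to Tank Y"
    standard: str   # desired standard of performance

    def is_complete(self) -> bool:
        """A statement is treated as complete only if all three parts are present."""
        return all(part.strip() for part in (self.verb, self.obj, self.standard))

    def __str__(self) -> str:
        return f"To {self.verb} {self.obj} {self.standard}."


pump = FunctionStatement(
    verb="pump",
    obj="water from Tank X to Tank Y",
    standard="at not less than 800 litres per minute",
)
print(pump)
```

Keeping the standard as a separate field makes it easy to spot statements that name a verb and an object but omit the level of performance the user wants.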

2.2 Performance standards

of maintenance is to ensure that assets continue to do what
their users want them to do. The extent to which any user wants any asset
to do
can be defined by a minimum standard
we could build an asset which could deliver that minimum performance
in any way, then that would be the end of the matter.
The machine would run continuously with no need for maintenance.
in the real world,
are not that
The laws
tell us that any
to the real world will deteriorate. The end result of this deterioration
known as 'chaos' or
taken to arrest whatever process is
to deteriorate.

For instance, the pump in Figure 2.1 is pumping water into a tank from which the
water is drawn at a rate of 800
litres/minute. One process that causes the pump
to deteriorate
it can no

Figure 2.1: Initial capability vs desired performance
(surviving labels: what it can do; offtake from tank: 800 litres/minute)

So if deterioration is
it must
be allowed for. This means that when
any asset is put into
it must be
able to deliver more than the minimum
desired by the
user. What the asset is able to
ent reliability).
Figure 2.2 illustrates the
right relationship between this capability
and desired

Figure 2.2: Allowing for deterioration

For instance, in order to ensure that the pump shown in Figure 2.1 does what its
users want and to allow for deterioration, the system
must
pump which has an initial built-in capability of
greater than 800 litres/minute.
In the example shown, this initial capability is 1000 litres per minute.

This means that

can be defined in two ways, as follows:
desired performance (what the user wants the asset to do)
built-in capability (what it can do)
Later chapters look at how maintenance helps ensure that assets continue
to fulfil their intended functions, either by
that their capability
remains above the minimum standard desired by the user or by restoring
something approaching the initial capability if it drops below this point.
When considering the question of
bear in mind that:
the initial capability of any asset is established by its
and by how
it is made
maintenance can only restore the asset to this initial level of capability
it cannot go beyond it.
In practice, most assets are
and built, so it is usually
possible to develop maintenance programs which ensure that such assets
continue to do what their users want






Figure 2.3: A maintainable asset
(surviving labels: cannot raise the capability of the asset above this level; the objective of maintenance is to ensure that capability stays above this level)

In short, such assets are maintainable, as shown in Figure 2.3.

On the other hand, if the desired performance exceeds the initial capability,
no amount of maintenance can deliver the desired performance.
In other words, such assets are not maintainable,
as shown in Figure 2.4.

For instance, if the pump shown in Figure 2.1 had an initial capability of 750
litres/minute, it would not be able to keep
the tank full. Since the maintenance program
does not exist which makes pumps
bigger, maintenance cannot deliver the
desired performance in this context. Similarly,
if we make a habit of trying to draw
15 kW (desired performance) from a 10
kW electric motor (initial capability), the
motor will keep tripping out and will eventually
burn out prematurely. No amount
of maintenance will make this motor big
enough. It may be perfectly adequately
designed and built in its own right - it just
cannot deliver the desired performance
in the context in which it is being used.


Figure 2.4: A non-maintainable situation

Two conclusions which can be drawn from the above examples are that:
for any asset to be maintainable, the desired performance of the asset
must fall within the envelope of its initial capability
in order to determine whether this is so, we not only need to know the
initial capability of the asset, but we also need to know exactly what
minimum performance the user is prepared to accept in the context in
which the asset is being used.
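
The two conclusions above amount to a simple comparison between initial capability and desired performance. The Python sketch below uses the pump and motor figures from the examples in this section. The function names are hypothetical, and treating an asset whose capability exactly equals the desired performance as not maintainable (because it leaves no margin for deterioration) is an assumption of this sketch.

```python
def deterioration_margin(initial_capability: float, desired_performance: float) -> float:
    """Margin available for deterioration, in the same units as the inputs."""
    return initial_capability - desired_performance


def is_maintainable(initial_capability: float, desired_performance: float) -> bool:
    """True if the desired performance falls inside the asset's initial
    capability with some margin left for deterioration (assumed: margin > 0)."""
    return deterioration_margin(initial_capability, desired_performance) > 0


# Pump with 1000 litres/minute capability, 800 litres/minute desired.
print(is_maintainable(1000, 800), deterioration_margin(1000, 800))
# Pump with only 750 litres/minute capability.
print(is_maintainable(750, 800))
# 10 kW motor asked to deliver 15 kW.
print(is_maintainable(10, 15))
```

Only the first case is maintainable. In the other two, no maintenance program can close the gap, because maintenance can at best restore the initial capability.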



This underlines the importance of identifying
what the users
want when starting to develop a maintenance program. The
explore key
of performance standards in more detail.
Many function statements incorporate more than one and sometimes
several performance standards.
For example, one function of a chemical reactor in a
might be listed as:
To heat up to 500 kg of product X from ambient temperature to
(125°C) in one hour.
In this case, the weight of product, the temperature range and the time all present
different performance expectations. Similarly, the
of a motor car
might be defined as:
To transport up to 5 people along made roads at
of up to 140 km/h
Here the performance
relate to
and number of passengers.
Quantitative performance standards

Performance standards should be quantified where
quantitative standards are inherently much more
care should be taken to avoid
'to produce as many widgets as required by production', or 'to go as fast
as possible'. Function statements of this type are
because they make it impossible to define exactly when the item is failed.
In reality, it can be extraordinarily difficult to define
what is
required, but just because it is difficult does not mean that it cannot or
should not be done. One major user of RCM summed up this point
saying 'If the users of an asset cannot specify
they want from an asset, they cannot hold the maintainers accountable for
that performance.'


Qualitative standards
In spite of the need to be precise, it is sometimes impossible
quantitative performance standards so we have to live with
For instance, the primary function of a
not 'attractive'). What is meant by
from person to
person and is impossible to quantify. As a result, user and maintainer need to take
care to ensure that they share a common understanding of what is meant by
words like 'acceptable' before
up a system intended to preserve that
acceptability.

Absolute performance standards
A function statement which contains
an absolute.

For instance, the concept of containment is associated with nearly all enclosed
systems. Function statements
containment are often written as follows:

contain liquid X
The absence of a
that the system must contain
and that
at all amounts to a failed state. In cases where
an enclosed system can tolerate some leakage, the amount which can be tolerated

Variable performance standards

Performance
(or
between two extremes.

of 2.5 tons, and the distribution of loads is
as shown in
the
be more than the 'worst case' load, which in
is 5 tons. The maintenance
program in turn must ensure that the
does not drop below this level,
which case it would


Figure 2.5: Variable performance standards

Upper and lower limits

In contrast to variable
which simply cannot be set up to
These are
function to
the same standard every time they
machine used to finish grind a crankshaft will not produce
same finished diameter on every journal. The diameters will vary, if
only by a few microns. Similarly, a filling machine in a food factory will not fill two
The weights will vary,

Figure 2.6 indicates that
variations of this nature usually vary
about a mean. In order to accommodate this variability, the associated
desired standards of
incorporate an upper and lower limit.

For instance, the primary function of a sweet-packing machine
gm of sweets into
The primary function of the grinding machine
might be:
To finish grind main
a cycle time of 3.00 ± 0.03
diameter of 75 ± 0.1 mm with a surface
finish of Ra0.2.

(In practice, this kind of variability is
unwelcome for a number of
reasons. Ideally, processes should be
so stable that there is no variation at
all and hence no need for two limits.
In pursuit of this ideal, many industries
are
a great deal of time
and energy on
processes that vary
ever, this aspect of design and
now we are concerned
point of maintenance.)
How much variability can be tolerated in the
external factors.
uct is usually
For instance, the lower limit which can be tolerated on
such as noise, vibration and harshness
upper limit by the clearances needed
limit of the weight of the bag of sweets
standards legislation,
amount of product which the company can afford to

In cases like
the desired performance limits
limits. The limits of
and lower
three standard deviations either side of the
the upper and lower control limits. Quality management
that in a well
process, the difference between the control limits
should ideally be half the difference between the
limits. This
multiple should allow
a maintenance viewpoint.
and lower limits not only
apply to other functional
specifications such as the accuracy of
and the
of control systems and protective devices. This issue is
discussed further in
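
The relationship between control limits and desired performance limits described above can be sketched numerically. The Python example below is a hedged illustration: it assumes the control limits sit three sample standard deviations either side of the mean, and it checks the 'half the difference' guideline mentioned in the text. The function names and the measured diameters are hypothetical. Only the 75 ± 0.1 mm limits come from the grinding example.

```python
from statistics import mean, stdev


def control_limits(samples):
    """Return (lower, upper) control limits at three sample standard
    deviations either side of the mean."""
    centre = mean(samples)
    spread = 3 * stdev(samples)
    return centre - spread, centre + spread


def meets_half_width_guideline(samples, lower_limit, upper_limit):
    """True if the control-limit band is no wider than half the band
    between the desired lower and upper performance limits."""
    lcl, ucl = control_limits(samples)
    return (ucl - lcl) <= 0.5 * (upper_limit - lower_limit)


# Hypothetical journal diameters in mm, for illustration only.
diameters = [74.98, 75.00, 75.02, 75.01, 74.99]
print(control_limits(diameters))
print(meets_half_width_guideline(diameters, 74.9, 75.1))
```

With these made-up measurements the process meets the guideline for limits of 75 ± 0.1 mm, which leaves room for some deterioration before the desired limits are breached. Tightening the desired limits to 75 ± 0.05 mm would make the same process fail the check.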


2.3 The Operating Context

used to determine the maintenance
physical asset in its operating context'. This
context
the entire maintenance
formulation process,
starting with the definition of functions.

consider a situation where a maintenance program is being developed
for a truck used to transport material from Startsville to Endburg. Before the
functions and associated performance standards of this vehicle can be defined,
the program need to ensure that
thoroughly understand
operating context.
For instance, how far is Startsville from
Over what sort of roads and
what sort of terrain? What are the 'typical worst case' weather and traffic conditions
on this route? What load is
(fragile? corrosive? abrasive?
explosive?) What speed limits and other regulatory constraints apply to the route?
What fuel facilities exist
the way?
The answers to these
might lead us to define the primary function
of this vehicle as follows: 'To transport up to 40 tonnes of steel slabs at
mph) from Startsville to Endburg on one tank of fuel'.
The operating context also profoundly influences the requirements for
functions. In the case of the truck, the climate may demand air
conditioning, the remoteness of
spares be carried on board, and so on.
affect functions
could occur, their effects and consequences, how often they happen and
what must be done to manage them.

drawn at a rate of 900 litres per minute, the primary function would be:
To pump slurry into Tank B at not less than 900 litres per minute.
This is a
standard than in the previous location, so the standard
to which it has to be maintained rises accordingly. Because it is now pumping
instead of water, the nature,
and severity of the failure modes
As a result,
the pump itself is unchanged, it is likely to end
up with a completely
different maintenance program in the new context.

All this means that anyone
out to apply RCM to any asset or
process must ensure that
have a
of the
context before
start. Some of the most important factors
which need to be considered are discussed in the following paragraphs.


the most important feature of the operating
This ranges from flow process
where
is
where
most of the machines are
In flow processes, the failure of a
plant or
unless surge
available. On the other hand, in
curtail the output of a
machine or line. The consequences
of such failures are determined
duration of the
the
to an
These differences mean that the maintenance
of a flow process could
different from the
to an identical asset in a batch environment.


The presence of
or alternative means of
feature of the operating context which must be considered in detail when
the functions of any asset.

The importance of redundancy is illustrated by the three identical pumps
B has a stand-by, while pump A does not

Figure 2.7:
(surviving label: Stand Alone)

This means that the primary function of pump A is to transfer
from one
to another on its own, and that of pump B to do it in
presence of a stand-by.
This difference means that the maintenance
pumps will
how different we see later), even
the pumps are identical.

Quality standards and standards of customer service are two more
of the
context which can lead to differences between the descriptions
of the functions of otherwise identical machines.
stations on two transfer machines
For example, identical
same basic function - to mill a
of cut,
ness tolerance and surface finish
all be different.
lead to quite different conclusions about




Environmental standards
An increasingly important
of the operating context of any asset is
which it has
on the environment.
worldwide interest in environmental issues means that when
we maintain any asset, we actually have to
two sets of 'users'. The
first the people who
the asset itself. The second is
as a
which wants both the asset and the process of which it forms part
not to cause undue harm to the environment.
wants is
in the form of increasingly stringent
environmental standards and regulations. These are international, national,




simply to ensure that a
or process
sound at the moment it is commissioned.
have to be taken to ensure that
because all over
more and more incidents which
affect the environment
asset did not behave as it should


acceptable levels of risk. In
some cases,
in others to individual sites and
others to individual processes or assets.
they are an important
of the

Shift arrangements
context. Some plants
for
hours per
a week (and even less in bad
Others operate continuously for seven
somewhere in between.

In a
shift plant, production lost due to failures can
up by
overtime. This overtime leads to increased
so maintenance
evaluated in the
of these
On the other hand, if an asset is
24 hours per
week, it is seldom possible to make up for lost
lost sales. This costs a
deal more than extra
so it is worth

it is also more difficult to make equipment available for maintenance

need to be formulated with

As products move
their life
as economic conditions
organisations can move from one end of this
to the
quickly. For this reason, it is
to review maintenance
policies every time this
of the
context

Work-in-process refers to any material which has
of the
process. It may be stored in
in hoppers, on
on conveyors or in
The consequences
of the failure of any machine are
between it and the next machines in the process.
Consider an example where the volume of work in
the next
working for six hours and it
failure mode under consideration. In this case, the failure would be
affect overall output. Conversely, if it took eight
it could
overall output because the
would come
the next and
these consequences in turn
down the line, and
the amount of
the extent to which any of the operations affected
bottleneck operation (in
other words an
which governs
output of the whole
costs money to hold
stocks
because the pressure maintenance
in order to make it possible to do without the cushion is also

So from the maintenance viewpoint, a balance has to be struck between
the economic
of those failures
the cost
maintenance tasks with a view to
the cost
the failures.
To strike this balance
be particularly
understood in

times are influenced by

which is a function of the availability

of the person
tools and of the
These factors heavily influence the
and the consequences of
from one
to another. As
context also needs to be
of the
It is
to use a derivative of the RCM process to optimise spares
stocks and the associated failure
This derivative is
based on the fact that the
reason for
a stock
parts is
to avoid or reduce the consequences of failure.
between spares and failure consequences
the time it takes to procure spares from suppliers. If it could be done
there would be no need to stock any spares at all. But in the real
spares takes time. This is known as the lead
and it
ranges from a matter of minutes to several months or years. If the spare
not a stock
the lead time often dictates how long it takes to
and hence the
consequences. On the other hand,
holding spares in stock also costs money, so a balance needs to be struck,
on a
between the cost of holding a spare in stock and
the total cost of not
it. In some cases, the weight and/or dimensions
of the spares also need to be taken into account because of load and
in facilities like oil platforms and
the scope of this book.
In most cases, the best way to deal with spares is as follows:



review the failure modes associated

by establishing

If this approach
seen as part of the (initial) operating

Market demand

sometimes features
in demand for

For example, soft drink companies experience
in summer than in winter, while urban transport companies experience
demand during rush hours.

much more
of industry, this
understood when
assessing failure consequences.
Raw material
Sometimes the
context is influenced by
the supply of raw materials. Food manufacturers
periods of intense
harvest times and
activity at other times. This
to fruit processors and sugar
operational failures
of raw materials if these cannot
Documenting the operating context
For all the above reasons, it is essential to ensure that everyone involved
the development of a maintenance program for
context of that asset. The best way to do so is to document
if necessary up to and
the overall mission
statement of the entire organisation, as part of the RCM process.

Figure 2.8 overleaf shows a
context statement
machine mentioned earlier. The crankshaft is
used in type
in motor car model X



Figure 2.8: An operating context statement
(surviving fragments: Make car model X; asset: Motown; our target is not more; for two weeks per year; production workers to take their main annual vacations; Make Type 2; Finish grind main and big end journals; about 45 minutes before the line as a whole stops; 0.4% to the present overall scrap rate)



Note also that a context statement at any level

the asset under review.
it in the

The context statements at the

broad function statements. Performance standards at the
from the
of the overall business. At lower
levels, performance standards become
until one
reaches the asset under review. The primary and
the asset at this level are defined as described in the rest
2.4 Different Types of Functions

asset has more than one often several functions. If the
can continue to fulfil

expectations change - but still we find
was built. Defining
very close
between maintainers and users.
It is also usually a profound learning experience for everyone involved.
Functions are divided into two main
(primary and secondary functions) and then further divided into various
are reviewed on the following pages.

Primary functions

assets for one, possibly two, seldom more
than three main reasons. These
are defined
function statements. Because
are the 'main' reasons
are known as primary functions.
are the reasons
the asset exists at all, care should be taken to define them

Primary functions are
easy to recognise. In
the names
of most industrial assets are based on their

For instance the primary function of a packing machine is to
of a
crusher to crush
and so on.

As mentioned
the real
lies in
the current performance
associated with these functions. For
standards associated
1 mentioned that our ability to achieve and sustain
quality standards
on the capability and condition of
the assets which
These standards are usually associated
with
functions. As a result, take care to incorporate
criteria into primary function statements where relevant. These
include dimensions for
forming or
operations, purity
chemicals and
hardness in the case
of heat treatment,
levels or
and so on.

An asset can have more than one primary function. For
the very
two primary functions.
In such
both should be listed in the functional


A similar situation is often found in manufacturing, where the same asset may be
used to
different functions at different times. For instance, a single reactor
vessel in a chemical plant might be used at different times to reflux
continuously) three
under three different sets of conditions, as follows:

Pressure       2 bar        10 bar       6 bar
Temperature    120°C        140°C        180°C
Batch size     500 litres   600 litres   750 litres



In cases like this, one could list a separate function
duct. This would
lead to three
maintenance programs
for the same asset. Three programs may be
even desirable
if each product runs continuously for very long periods.
However, if the interval between long-term maintenance tasks is
than the change-over intervals, then it is impractical to
the tasks
every time the machine is changed over to a different product.
One way around this problem is to combine the 'worst case' standards
associated with each product into one function statement.
c' CU''VH> f'a


In the above example, a combined function statement could be 'to reflux up to 750
litres of product at temperatures up to 180°C and pressures up to 10 bar.'
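
The 'worst case' combination can be sketched in a few lines of Python. This is only an illustration: the `OperatingConditions` class and the `worst_case` function are hypothetical names, and the pairing of pressures, temperatures and batch sizes with individual products is assumed from the layout of the example. The sketch also assumes that the higher value of each parameter is always the more demanding one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingConditions:
    pressure_bar: float
    temperature_c: float
    batch_litres: float


def worst_case(conditions):
    """Combine the most demanding value of each parameter into one envelope."""
    return OperatingConditions(
        pressure_bar=max(c.pressure_bar for c in conditions),
        temperature_c=max(c.temperature_c for c in conditions),
        batch_litres=max(c.batch_litres for c in conditions),
    )


# Assumed pairing of values with the three products in the example.
products = [
    OperatingConditions(2, 120, 500),
    OperatingConditions(10, 140, 600),
    OperatingConditions(6, 180, 750),
]

envelope = worst_case(products)
statement = (
    f"to reflux up to {envelope.batch_litres:g} litres of product at "
    f"temperatures up to {envelope.temperature_c:g}°C and pressures up to "
    f"{envelope.pressure_bar:g} bar"
)
print(statement)
```

Because each maximum is taken independently, the result does not depend on which product a given value belongs to.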

This will lead to some overmaintenance some of the time,
but it will ensure that the asset can
handle the worst stresses to which it will be exposed.

Serial or dependent primary functions
One often encounters assets which must perform two or more primary
functions in series. These are known as serial functions.

For instance, the primary functions of a machine in a food factory may be 'to fill
300 cans with food per minute' and then 'to seal 300 cans per minute'.

The distinction between multiple primary functions and serial
functions is that in the former case, each function can be performed
independently of the other, while in the latter, one function must be performed
before the other. In other words, for the canning machine to work properly
it must fill the cans before it seals them.

Secondary Functions
Most assets are expected to fulfil one or more functions in addition to their
primary functions. These are known as secondary functions.

For example, the primary function of a motor car might be described as follows:
to transport up to 5 people at speeds of up to 140 km/h along made roads
If this was the only function of the vehicle, then the only objective of the
maintenance program for this car would be to preserve its ability to carry up to 5
people at speeds of up to 140 km/h along made roads. However, this is far from the
whole story, because most car owners expect far more from their vehicles,
ranging from the ability to carry luggage to the ability to indicate how much
fuel is in the fuel tank.

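To help ensure that none of these functions are overlooked, they are divided
into seven categories as follows:

environmental integrity
safety/structural integrity
control/containment/comfort
appearance
protection
economy/efficiency
superfluous functions.
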
The first letters of each line in this list form the word ESCAPES.
Although secondary functions are usually less obvious than primary
functions, the loss of a secondary function can still have serious
consequences, sometimes more serious than the loss of a primary function. As
a result, secondary functions often need as much if not more maintenance
than primary functions, so they too must be clearly identified. The following
pages explore the main categories of these functions in more detail.
Jr,Ql', I I H-,.J H'

nnf< V tM 1n,r1 how

have become a critical feature of the
context of many assets.
the process of compliance with the associated standards by
,,..,,,,,.-n..r, ..,, ... .-, .. them i n
worded fu nction statements.
For instance, one function of a car exhaust or a factory smoke stack might be 'to
contain no more than X micrograms of a specified chemical per cubic metre'. The
car exhaust system might also be the subject of environmental restrictions dealing
with noise, and the associated functional specification might be 'to emit no
more than X dB measured at a distance of Y metres behind the exhaust outlet'.
Most users want to be reasonably sure that their assets will not hurt or kill
them. In practice, most safety hazards emerge later in the RCM process
as failure modes. However, in some cases it is necessary to write function
statements which deal with specific threats to safety.
For instance, two safety-related functions of a toaster are 'to prevent users from
touching live components' and 'not to burn the users'.
are u nable to fulfil the safety expecta
Many processes and
rise to additional functions in
tions of users on their own. This has
some of the most diffi
the maintainers of modern industrial
they are dealt with



those which deal with
A further subset of
nucn,c.r,,,. TI1ese are most often found in the food
onannace11t11:a1 industri es . The (> <' <' r>f'"I 1 n,arr,r.rn. '1nra.
and lead to
maintenance routines ( cleaning and

,.>IJ'v'-'U. llv.._.,


, w ,._,,,.. , ,... ,

function. This
Many assets have a structural
''"'-''",,n or conu:>orrent.
supporting some other
"--v .. n....

For example, the primary function of the wall

from the weather, but it
the roof
bear the weight of shelves and


exrmctea to support

.._..__,,,.IJ...,.,,,, s tructures with multiple load bearing

levels of redundancy need to be
RCM. Typical examples of such structures are ...,... .. ..... . ...... ",
and the structural elements of offshore oil
Structures of this type are rare in
analytical ,_-.,-... uu.
ward, srn12"H-c.;ueu
as any other function described in this
.1.4 ,.,,._ ,J

In many cases,
to fulfil fu nctions to
standard of performance, but
want to be able to
fonnance. This expectation is summarised in sermr:ate
For instance, the primary function of a car as suggested earlier was 'to transport
up to 5 people at speeds of up to 140 km/h along made roads'. One control function
associated with this function could be 'to enable driver to regulate speed at will
between -15 km/h and +140 km/h'.
of functions. This includes functions which '"'r''
P' ,.:, :,., up1 u.Lv1 ., with real time information about the process
and control
or which record such information for later
or analog recording
functions not
Perfonnance standards associated with
late to the ease with which it should be
to read and assimilate
to playback the
For instance, the function of the speedometer of a car might be described as 'to
indicate the road speed to the driver to within +5 -0% of the actual speed'.

function is to contain
In the case of assets used to store things, a
whatever is being stored. However, containment should also be acknowlmaterial of
sec:onoruv function of all devices used to
pumps, conveyors, chutes,
function of items like gearremarks on

<'.l>V t"<Ol" t n fs r\>'H'

their assets not to cause them
are l isted under the
of ' comfort' because the
freedom from anxi
classified under

"' --, so it is undesirable from a human

who are anxious or in pain
viewpoint It also bad business because
are more likely to make incorrect decis ions.
is caused
for domestic
"""'"""n,,,:.,, or for oil refineries. Pain is
and furniture which are incompatible with the people
The best time to deal with
is of course

egory of functions
that this doesn 1t ...., .


For instance, one function of a control panel might be 'To indicate to a
colour-blind operator up to five feet away whether pump A is running or shut
down'. A control-room chair might be expected 'To allow operators to sit
comfortably for up to one hour at a time without inducing drowsiness'.

secondary function.
The appearance of many items embodies a
the primary fu nction of the
on most industrial
it from corrosion, but a bright colour might be
eami:>mint is to
used to enhance its
for safety reasons. Similarly, the main function of a
is to show the name of the company which
sec:onoa1':V function is to project an

Protective devices
assets become more complex. the number of ways
n-rr,nr,, nn almost
fail is 51vvvu.5
This has led to Pn,"'1",;u.-.,,.nf"f, n rr
in the
nate (or a t least t o reduce)
use i s
made o f automatic protective devices. These work i n one of
r.r,a r,,,
tQ abnormal conditions
to draw the attention of the vpvHH\It,')
and audible alanns which ro,nr),YJ/1
are monitored a
cells, overload o r
temperature or pressure


fl1!.t) r t: n,;) >rl

to eliminate or relieve abnormal conditions which

medical eu1u n,rmr::ru

rupture discs

to take over from a function which

redundant structural ,'f)Jvn nnnun,rt'
dangerous situations from
from failures or to proThe purpose o f these devices i s to
all three.
tect machines or to
Protective devices ensure that the failure of the function being protected
is much less serious than it would be if there were no protection. The
presence of protection also means that the maintenance requirements
of the protected function are often less stringent
than they would be otherwise.
Consider a
machine whose
a toothed belt. If the
belt were to b reak in the absence of any nrr.tol"'1r1f'ln
drive the stationary cutter i nto the
and cause
C'C.t'V'lnrt-:1 1ru damage. This can b e
two ways:
by implementing a comprehensive nrr'l,,til,IO maintenance routine aesm1nea to
prevent the failure of the belt
by provid ing
such as a b roken belt detector to shut down the machi ne
as soon as the belt b reaks. I n this case, the only
i s a brief stoppage while it is
so the m ost cost-effective
valid if the broken
might simply be to let the belt fail . But
belt detector is
and steps must b e taken to
that this is

The maintenance
which are not
fail-safe - is discussed in much more detail in rhmti::>rs 5 and 8. However,
demonstrates two fundamental points :
..._,U..,.IJ ..,.. L

devices often need more routine maintenance attention

than the devices
r\T'l">f't:>,'f't t,rr

that we cannot develop a sensible maintenance program for a protected
function without also considering the maintenance requirements of the
protective device.

It is only possible to consider the maintenance requirements of protective
devices if we understand their functions. So when listing the functions of
any asset, we must list the functions of all protective devices.
A final point about protective devices concerns the way their functions
should be described. These devices act by exception (in other words, when
something else goes wrong), so it is important to describe them correctly.
In particular, protective function statements should include the words 'if'
or 'in the event of', followed by a very brief summary of the circumstances
or the event which would activate the protection.

For instance, if we were to describe the function of a tripwire as being 'to
stop the machine', anyone reading this description could be forgiven for
thinking that the tripwire is the normal stop/start device. To remove any
ambiguity, the function of a tripwire should be described as follows:
to be capable of stopping the machine in the event of an emergency at any point
along its length
The function of a safety valve may be described as follows:
to be capable of relieving the pressure in the boiler if it exceeds 250 psi.

Anyone who uses assets of any sort only has finite financial resources.
This leads them to put a limit on what they are prepared to spend on
operating and maintaining it. How much they are prepared to spend is
governed by a combination of three factors:
the actual extent of their financial resources
how much they want whatever the asset will do for them
the same end.
and cost
At the
context level, fu nctional
out in the form of
economic issues can be addressed directly by function
in areas such as
statements w hich define what users
and loss
nv ,.-,.,:,1"\ rl 1 h 1 >C> hlirl<'rOf'c<

For instance, a car might be expected 'to consume not more than 6 litres of fuel
per 100 km at a constant 120 km/h, and not more than 4 litres of fuel per 100 km
at 60 km/h.' A fossil fuel power station might be expected 'to export at least 45%
of the latent energy in the fuel as electrical power.' A plant using
solvent might want 'to lose no more than 0.5% of solvent X per month'.

are sometimes encountered which are "''" "'"'''' ''"
Items or
has been modified
superfluous. Thi s usually
has been over,,..,,...,. ....,,,,,.,,....... '" built
1 1 11 "'.,." '""""'


For example, a pressure reducing valve was built into the supply line between a
gas manifold and a gas turbine. The original function of the valve was to reduce
the gas pressure from 120 psi to 80 psi. The system was later modified to reduce
the manifold pressure to 80 psi, after which the valve served
no useful purpose.

It is sometimes argued that items like these do no harm and it costs money
to remove them, so the simplest solution may be to leave them alone until
the whole plant is decommissioned.
this is seldom true in
ui,t.._.u,.,,.,. Although these items have no
can sti l l fail
and so reduce overal l
ance, which means that
still consume resources.
It is not unusual to fin d that between 5 % and
of the comr:>ontents
the sense described above. If they
of complex C' 1U <st.orn c are
..., ...,,u,., ............ ... . it stands to reason that the same
maintenance problems and costs will also be eliminated.
before this can
be done with confidence. the functions of these
first need to
be identified and


as 'to operate 7 days a

a function i n its
all the other functions. It is ......r,1.........r, ..,
with each of the failure modes which could
This issue discussed further in Chapter l 3,

There will often be doubt about which of the ESCAPES categories some
functions belong to. For instance, should the function of a seat adjustment
mechanism be classified under the heading of 'control' or 'comfort'?
In practice, the precise classification does not matter. What does matter
is that we identify and define all the functions which are likely to be
expected by the user.


2.5 How Functions should be Listed

.......,,,.-..o,rh r written functional C'n,:.,,,i-.r<'.l'ti<,"\n
quantified ,, v ,.,,.,h,... ,. defines the objectiv es of the Pn 1t"Pr1"\1"H! P
involved know s
wanted, which in turn
on the real needs of
ensures that maintenance activities remain
It also makes i t easier to absorb crnam!es
the users
the whole
vuuu5,1115 '-J\.P\.\.t.uLn.;u0 with out
Functions are listed on RCM Information Worksheets in the left hand
column. Primary functions are listed first and the functions
are numbered numerically, as shown in Figure 2.9 for
the exhaust system of a 5 MW turbine. A complete Information
Worksheet is shown at the
end of
Figure 2.9:

Describing functions


Functional Failures

Chapter 1 explained that the RCM process entails asking seven questions
about selected assets, as follows:
what are the functions and associated performance standards of the
asset in its present operating context?
in what ways does it fail to fulfil its functions?
what causes each functional failure?
what happens when each failure occurs?
in what way does each failure matter?
what can be done to predict or prevent each failure?
what should be done if a suitable proactive task cannot be found?

Chapter 2 dealt at length with the first question. This chapter considers
the second.

3.1 Failure
In the previous chapter, we saw that people acquire physical assets
because they want the assets to do something, and that they expect
their assets to fulfil the intended functions to an acceptable standard
of performance. Chapter 2 went on to explain that for any asset
both to do what its users want and to allow for deterioration, the initial
capability of the asset must exceed the desired standard of performance.
As long as the capability of the asset continues to exceed the desired
standard of performance, the user will be satisfied.
On the other hand, if for any reason the asset is unable to do what the
user wants, the user will consider it to have failed.


Reliability-centred Maintenance

This leads to a basic definition of failure:

'Failure' is defined as the inability of any
asset to do what its users want it to do.

This is illustrated in Figure 3.1.

For instance, if the pump shown in Figure 2.1 on Page 22 is unable to pump 800
litres per minute, it will not be able to keep
the tank full and so its users will regard it
as 'failed'.

Figure 3.1:
The general failed state

3.2 Functional Failures

The above definition treats the concept of failure as if it applies to an asset
as a whole. In practice, this definition is vague because it does not distinguish
clearly between the failed state (functional failure) and the events
which cause the failed state (failure modes). It is also simplistic, because
it does not take into account the fact that each asset has more than one
function, and each function often has more than one desired standard of
performance. The implications of this are explored in the following paragraphs.
Functions and Failures
We have seen that an asset is failed if it doesn't do what its users want it
to do. We have also seen that what anything must do is defined as a
function and that every asset has more than one and often several different
functions. Since it is possible for each one of these functions to fail, it
follows that any asset can suffer from a variety of different failed states.
For instance, the pump in Figure 2.1 has at least two functions. One is to pump
water at not less than 800 litres/minute, and the other is to contain the water. It is
perfectly feasible for such a pump to be capable of pumping the required amount
(not failed in terms of its primary function) while leaking excessively (failed in
terms of the secondary function).

Conversely, it is equally possible for the pump to deteriorate to the point where
it cannot pump the required amount (failed in terms of its primary function), while
it still contains the liquid (not failed in terms of the secondary function).
This shows why it is more accurate to define failure in terms of the loss of
specific functions rather than the failure of an asset as a whole. It also
shows why the RCM process uses the term 'functional failure' to describe
failed states, rather than 'failure' on its own. However, to complete the
definition of failure, we also need to look more closely at the question of
performance standards.
Performance Standards and Failure
As discussed in the first part of this chapter, the boundary between
satisfactory performance and failure is specified by a performance standard.
Given that performance standards apply to individual functions, 'failure'
can be defined precisely by defining a functional failure as follows:

A functional failure is defined as the inability
of any asset to fulfil a function to a standard of
performance which is acceptable to the user
The following paragraphs consider different aspects of functional failure
under the following headings:
partial and total failure
upper and lower limits
gauges and indicators
the operating context.

Partial and total failure

The above definition of functional failure covers complete loss of function.
It also covers situations where the asset still functions but performs
outside acceptable limits.

Figure 3.2:
Functional failure

For instance, the primary function of the pump discussed earlier is 'to pump water
from Tank X to Tank Y at not less than 800 litres/minute'. This function could
suffer from two functional failures, as follows:
fails to pump any water at all
pumps water at less than 800 litres per minute.



which could affect each function should be recorded.

Record all the functional failures associated with each function.

Note that partial failure should not
be confused with the situation where
the asset deteriorates slightly but its
capability remains above the level of
performance required by the user.
For instance, the initial capability of the
pump in Figure 2.1 is 1000 litres per minute.
Wear is inevitable, so this
capability will decline. As long as it does
not decline to the point where the pump
is unable to pump 800 litres per minute,
it will still be able to fill the tank and so
keep the users satisfied in the context
described.


However if the capability of the asset deteriorates so much that it falls
below the desired performance, users will consider it to have failed.
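
The distinction between acceptable deterioration, partial failure and total failure can be expressed as a simple comparison. The following Python sketch is illustrative only: the function name `assess_capability` and its labels are assumptions, and the figures are those of the pump example (an initial capability of 1000 litres per minute and a desired performance of 800 litres per minute).

```python
def assess_capability(current, desired, initial):
    """Classify an asset's state from its current capability.

    All three values are in the same units (litres per minute here).
    """
    if current <= 0:
        return "total failure"           # fails to pump any water at all
    if current < desired:
        return "partial failure"         # still works, but below the desired standard
    if current < initial:
        return "OK, some deterioration"  # margin for deterioration partly used up
    return "OK"


for flow in (1000, 900, 800, 750, 0):
    print(flow, assess_capability(flow, desired=800, initial=1000))
```

Note that a flow of exactly 800 litres per minute still counts as acceptable, because the function is stated as 'not less than 800 litres/minute'.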

Figure 3.3:

Asset still OK
some deterioration

Upper and lower limits

The previous chapter explained that the performance standards associated
with some functions incorporate upper and lower limits. Such limits mean
that the asset has failed if it produces products which are over the upper
limit or below the lower limit. In this case, the breach of the upper limit
usually needs to be identified separately from the breach of the lower
limit. This is because the failure modes and/or the consequences associated with
going over the upper limit are usually different from those associated with going
below the lower limit.
ter 2
per minute'. This machine has failed:
if it stops
if it packs more than 251 gm of sweets into any
if it packs less than 249 gm into any
if it packs at a rate of less than 75



The function of a crankshaft grinding machine was listed as 'To finish grind main
bearing journals in a cycle time of 3.00 ± 0.03 minutes to a diameter of 75 ± 0.1 mm
with a surface finish of Ra 0.2'.
Completely unable to grind workpiece
Grinds workpiece in a cycle time longer than 3.03 minutes
Grinds workpiece in a cycle time less than 2.97 minutes
Diameter exceeds 75.1 mm
Diameter is below 74.9 mm
Surface finish too rough.
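
The way upper and lower limits translate into separate functional failures can be sketched as follows. The `Parameter` class and the `functional_failures` function are hypothetical names used for illustration, and the limits are taken from the failed states listed for the grinding machine (2.97 to 3.03 minutes, 74.9 to 75.1 mm, and an upper limit only of Ra 0.2 on surface roughness).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Parameter:
    name: str
    unit: str
    lower: Optional[float] = None  # None means no lower limit applies
    upper: Optional[float] = None  # None means no upper limit applies


def functional_failures(total_loss, parameters):
    """List the total loss of function plus one failed state per limit."""
    failures = [total_loss]
    for p in parameters:
        if p.upper is not None:
            failures.append(f"{p.name} exceeds {p.upper:g} {p.unit}".strip())
        if p.lower is not None:
            failures.append(f"{p.name} is below {p.lower:g} {p.unit}".strip())
    return failures


grinder = [
    Parameter("Cycle time", "minutes", lower=2.97, upper=3.03),
    Parameter("Diameter", "mm", lower=74.9, upper=75.1),
    Parameter("Surface roughness Ra", "", upper=0.2),
]

failures = functional_failures("Completely unable to grind workpiece", grinder)
for failure in failures:
    print(failure)
```

A parameter with only one limit yields only one failed state, which is why the surface finish contributes a single entry.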

Of course, if only one limit applies to a particular parameter, then only one
failed state is possible. For instance, the absence of a lower limit on the
roughness specification in the above example suggests that it is not
possible to make the item too smooth. In some circumstances, this may not
actually be true, so care needs to be taken to verify this point when
analysing functions of this type.
In practice, the failed states associated with upper and lower limits can
manifest themselves in two ways. Firstly, the spread of capability could
breach the specification limits in one direction only. This is illustrated in
Figure 3.4, which shows that this type of failed state can be likened to a
number of shots hitting a target in a tight group but way off centre.

Figure 3.4:

Capability breaches upper limit only

The second failed state occurs when the spread of capability is so broad
that it breaches both the upper and the lower specification limits. Figure
3.5 shows that this can be likened to shots scattered all over the target.
Note that in both of the above cases, not all of the products produced
by the processes in question will be failed. If the breach is minor, only a
small percentage of out-of-spec products will be produced. However, the
further off centre the grouping in the first case, or the broader the spread in
the second case, the higher will be the percentage of failures.


Figure 3.5:

Capability breaches upper and lower limits

Figure 2.6 illustrated a process which is in control and in specification.

Figures 3.4 and 3.5 show that processes which are out of control and out
of spec are in a failed state. The failure modes which can cause these failed
states are discussed in the next chapter. (Chapter 7 deals with the
implications of a process which is out-of-control but within specification.)
Gauges and indicators
The above discussion has tended to focus on product quality. Chapter 2
mentioned that upper and lower limits also apply to the performance standards
associated with gauges, indicators, protection and control systems.
Depending on failure modes and consequences, it may also be necessary to
treat the breach of these limits separately when listing functional failures.

For instance, the function of a temperature gauge could be listed as 'to display the
temperature of process X to within (say) 2% of the actual process temperature'.
This gauge can suffer from three functional failures, as follows:
fails altogether to display process temperature
displays a temperature more than 2% higher than the actual temperature
displays a temperature more than 2% lower than the actual temperature.

Functional failures and the operating context

The exact definition of failure for any asset depends very much on its
operating context. This means that in the same way that we should not
generalise about the functions of identical assets, so we should take care
not to generalise about functional failures.

For example, we saw how the pump shown in Figure 2.1 fails if it is completely
unable to pump water, and if it is able to pump less than 800 litres/minute. If the
same pump is used to fill a tank from which water is drawn at 900 litres/minute,
the second failed state occurs if the throughput drops below 900 litres/minute.



Who should set the standard?

An issue which needs careful consideration when defining functional
failures is the 'user'. To this day, most maintenance programs in use around
the world are compiled by maintenance people working on their own. These
people usually decide for themselves what is meant by 'failed'.
In practice, their view of failure often turns out to be quite different
from that of the users, with sometimes disastrous consequences for the
effectiveness of their programs.
For example, one function of a hydraulic system is to contain oil. How well it should
fulfil this function can be subject to widely differing points of view. There are
production managers who believe that a hydraulic leak only amounts to a functional
failure if it is so bad that the equipment stops working altogether. On the other
hand, a maintenance manager might suggest that a functional failure has occurred
if the leak causes excessive consumption of hydraulic oil over a long period of
time. Then again, a safety officer might say that a functional failure has occurred
if the leak creates a pool of oil on the floor in which people could slip and fall or
which might create a fire hazard. This is illustrated in Figure 3.6.

Figure 3.6:
Different views
about failure


The maintenance manager (who controls the hydraulic oil budget) may ask the
operators for access to the hydraulic system to repair leaks 'because oil
consumption is excessive'. However access may be denied because the operators think
the machine 'is still working OK'. When this happens, the maintenance people (1)
record that the machine 'was not released for maintenance', and (2)
form the opinion that their production colleagues 'don't believe in PM'. For similar
reasons, the maintenance manager might not release a maintenance person to
repair a small leak when asked to do so by the safety officer.
In fact, all three parties almost certainly do believe in prevention. The real problem
is that they have not taken the trouble to agree exactly what is meant by 'failed' so
they do not share a common understanding of what they are trying to prevent.

This example illustrates three key points:

the performance standard used to define functional failure (in other
words, the point where we say 'so far and no further') defines the level
of maintenance needed to avoid that failure (in other words,
to sustain the required level of performance)
much time and energy can be saved if these performance standards are
established before the failures occur
the performance standards used to define failure must be set by operations
and maintenance people working together with anyone else who has something
to say about how the asset should behave.

How Functional Failures should be Listed

Functional failures are listed in the second column of the RCM Information
Worksheet. They are coded alphabetically, as shown in Figure 3.7.

© 1996 ALADON LTD

RCM Information Worksheet: 5 MW Turbine, Exhaust System

Function: To channel all the hot turbine [...] without restriction to a fixed [...] above roof of the turbine [...]
Functional failures: Gas flow restricted; Fails to contain the gas; D Fails to [...] gas to a [...] 10 m above the [...]

Function: To reduce exhaust noise levels to ISO [...] 30 at 150 metres
Functional failure: Noise level exceeds ISO Noise [...] 30 at 150 metres

Functional failure: Duct surface temperature exceeds 60°C

Functional failure: Incapable of sending warning [...] if exhaust temperature exceeds [...]

Functional failure: [...] of sending a shutdown [...] temperature exceeds [...]

Functional failure: Does not allow free movement of [...]

Figure 3.7:

Failure Modes and Effects Analysis (FMEA)

We have seen that by defining the functions and desired standards of
performance of any asset, we are defining the objectives of maintenance with
respect to that asset. We have also seen that defining functional failures
enables us to spell out exactly what we mean by 'failed'. These two issues
were addressed by the first two questions of the RCM process.
The next two questions seek to identify the failure modes which are
reasonably likely to cause each functional failure, and to ascertain
the failure effects associated with each failure mode. This is done
by performing a failure modes and effects analysis (FMEA)
for each functional failure.
This chapter describes the main elements of an FMEA, starting with a
definition of the term 'failure mode'.

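4.1 What is a Failure Mode?

A failure mode could be defined as any event which is likely to cause an
asset (or system or process) to fail. However, it is both vague and simplistic
to apply the term 'failure' to an asset as a whole. It is much more precise
to distinguish between a 'functional failure' (a failed state) and a
'failure mode' (an event which could cause a failed state).
This distinction leads to the following more precise definition
of a failure mode:

A failure mode is any event
which causes a functional failure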
The best way to show the connection and the distinction between failed
states and the events which could cause them is to list functional failures
first, then to record the failure modes which could cause each functional
failure, as shown in Figure 4.1 overleaf.






Cooling Water Pumping System



Figure 4. 1: Failure modes of a pump

Figure 4.1 also indicates that at the very least, a description of a failure
mode should consist of a noun and a verb. The description should contain
enough detail for it to be possible to select an appropriate failure
management policy, but not so much detail that excessive amounts of time are
wasted on the analysis process itself.
In particular, the verbs used to describe failure modes should be chosen
with care, because they strongly influence the subsequent failure
management policy selection process. For instance, verbs such as 'fails' or
'breaks' or 'malfunctions' should be used sparingly, because they give
little or no indication as to what might be an appropriate way of managing
the failure. The use of more specific verbs makes it possible to select from
the full range of failure management options.
For example, a term like 'coupling fails' provides no clue as to what might be done
to anticipate or prevent the failure. However, if we say 'coupling bolts come loose'
or 'coupling hub fails due to fatigue', then it becomes much easier to identify a
possible proactive task.
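
The guidance on wording can be turned into a rough review aid. The Python sketch below is an assumption-laden illustration rather than part of the RCM process: the `review_failure_mode` function is a hypothetical name, the word count is only a crude stand-in for 'a noun and a verb', and a vague verb is tolerated when the description goes on to state a cause with 'due to'.

```python
import re

# Verbs the text says should be used sparingly because they suggest no remedy.
VAGUE_VERBS = {"fails", "breaks", "malfunctions"}


def review_failure_mode(description):
    """Return a list of remarks about a failure mode description (empty if none)."""
    text = description.lower()
    words = re.findall(r"[a-z]+", text)
    remarks = []
    if len(words) < 2:
        remarks.append("too short: expected at least a noun and a verb")
    vague = sorted(VAGUE_VERBS.intersection(words))
    if vague and "due to" not in text:
        remarks.append(f"vague verb '{vague[0]}' with no stated cause")
    return remarks


for mode in (
    "Coupling fails",
    "Coupling bolts come loose",
    "Coupling hub fails due to fatigue",
):
    print(mode, "->", review_failure_mode(mode) or "no remarks")
```

A check like this can only prompt a second look; deciding whether a description carries enough detail to select a failure management policy remains a matter of judgement.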

In the case of valves, one should also indicate whether the loss
of function is caused by the item failing in the open or closed position:
'valve jams closed' says more than 'valve fails'. In the interests of complete
clarity, it may sometimes be necessary to take this one step further.

For instance, 'valve jams closed due to rust on lead screw' is clearer than 'valve
jams closed'. Similarly, one might need to distinguish between 'bearing seizes due
to normal wear and tear' and 'bearing seizes due to lack of lubrication'.

These issues are discussed at length later in this chapter, but first we look
at why we need to analyse failure modes at all.



4.2 Why Analyse Failure Modes?

A single machine can fail for dozens of reasons. A group of machines or
system such as a production line can fail for hundreds of reasons. For an
entire plant, the number can rise into the thousands or even tens of thousands.
Most managers shudder at the thought of the time and effort likely to
be involved in identifying all these failure modes. Many decide that this
type of analysis is just too much work, and abandon the whole idea.
In doing so, these managers overlook the fact that on a day-to-day basis,
maintenance is really managed at the failure mode level. For instance:
work orders or job requests are raised to cover specific failure modes
day-to-day maintenance planning is all about making plans to deal with
specific failure modes

in most industrial undertakings, maintenance and operations people hold
meetings every day. The meetings usually consist almost entirely of
discussions about what has failed, what caused it (and who is to blame),
what is being done to repair it and sometimes what can be done to
stop it happening again. In short, the entire meeting is spent discussing
failure modes
to a large extent, technical history recording systems record individual
failure modes (or at least, what was done to rectify them).

Sadly, in too many cases, these failure modes are discussed, recorded or
otherwise dealt with after they have occurred. Dealing with failures after
they have happened is of course the essence of reactive maintenance.
Proactive management on the other hand, means dealing with events
before they occur or at least, deciding how they should be dealt with if
they were to occur. In order to do this, we need to know beforehand what
events are likely to occur. The 'events' in this context are failure modes.
So if we wish to apply truly proactive maintenance to any physical asset,
we must try to identify all the failure modes which are reasonably likely
to affect that asset. Ideally, they should be identified before they occur at
all, or if this is not possible, before they occur again.
Once each failure mode has been identified, it then becomes possible
to consider what happens when it occurs, to assess its consequences and
to decide what (if anything) should be done to anticipate, prevent, detect
or correct it or perhaps even to design it out.

and much of the subse
at the failure mode level.
and then discussed at
4.1. This
e 2.1. Fiaure
_ 4.2 shows that

ut end-suction volute pump

1k more
at the three
he impeller onlv. These are

, failure by:
reen in suction fine?

, failure by: training
impellers correctly?
i;entrifugal pump

nenon. As shown in Figure

nd of the six failure patterns
n B). So if we know roughly
,equences of the failure are
his failure by changing the
fa foreign object appearing
to do with how long the im-

::ason that this failure mode

There would also be no
1r. So if the consequences
,ften enouqh, we would be
installing some sort of filter


If the failure mechanism is the impeller coming adrift, this would almost
certainly be because it wasn't put on properly in the first place. (If we
knew that this was so, then the failure mode should be described as
'Impeller fitted incorrectly'.) This in turn means that the failure mode is
most likely to occur soon after start up, as shown in pattern F, and we
would deal with it by improving training or procedures.
This example reinforces the point that the level at which we manage the
maintenance of any asset is not at the level of the asset as a whole (in this
case, the pump), and not even at the level of a component (in this case,
the impeller), but at the level of each failure mode. So before we can
develop a systematic maintenance management strategy for any asset, we
must identify what its failure modes are (or could be).
The example also suggests that one of the failure modes could be
eliminated by a design change and another by improving training or
procedures. In other words, not every failure mode is dealt with by
scheduled maintenance. Chapters 5 to 9 describe an approach to deciding
what is likely to be the most suitable way of dealing with each failure.
Note also that the failure management options suggested above are only
one of several possibilities in each case.
For instance, we could monitor impeller wear by monitoring the pump
performance and only change the impeller when it needs it. We also need to
bear in mind that adding a screen to the suction line adds three more failure
possibilities, which need to be analysed in turn (it could block up, it could
be holed and therefore cease to screen, and it could disintegrate and damage
the impeller).
Chapters 6 to 9 examine these alternatives in more detail.
These points all indicate that the identification of failure modes is one
of the most important steps in the development of any program intended
to ensure that any asset continues to fulfil its intended functions. In
practice, depending on the complexity of the item, its operating context and
the level at which it is being analysed, between one and thirty failure modes
are usually listed per functional failure.
The next two sections of this chapter consider two of the key issues in
this area under the headings of categories of failure modes and level of
detail. Thereafter, the last three parts of the chapter consider failure
effects, sources of information for an FMEA, and how failure modes and
effects should be listed.


Reliability-centred Maintenance

4.3 Categories of Failure Modes

Some people regard maintenance as being all about, and only about,
dealing with deterioration. Some even go so far as to specify that FMEAs
carried out on their assets should deal only with failure modes caused by
deterioration, and should ignore other categories of failure modes (such
as human errors and design flaws). This is unfortunate, because it often
transpires that deterioration causes a surprisingly small proportion of
failures. In these cases, restricting the analysis to deterioration only can
lead to a woefully incomplete maintenance strategy.
On the other hand, if one accepts that maintenance means ensuring that
physical assets continue to do whatever their users want them to do, then
a comprehensive maintenance program must address all the events that
are reasonably likely to threaten that functionality. Failure modes can be
classified into one of three groups, as follows:
when capability falls below desired performance
when desired performance rises above initial capability
when the asset is not capable of doing what is wanted from the outset.
Each of these categories is discussed in the following paragraphs.
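
The three groups can be told apart by comparing capability with desired performance at two points in time: when the asset enters service, and now. The following is only a minimal sketch in Python, assuming that both quantities can be reduced to single numbers in the same units; the function and parameter names are illustrative and are not part of the RCM process.

```python
def failure_mode_category(initial_capability, current_capability,
                          initial_desired, current_desired):
    """Classify a failed state into one of the three groups of failure modes.

    All four arguments are hypothetical numeric measures in the same units
    (for instance litres per minute). Returns None if the asset can still
    do what its users want.
    """
    if current_capability >= current_desired:
        return None  # the asset still meets desired performance
    if initial_capability < initial_desired:
        # Desired performance was outside the capability envelope from the outset.
        return "initial incapability"
    if current_desired > initial_capability:
        # Desired performance has risen above what the asset could ever deliver.
        return "increase in desired performance"
    # Otherwise capability itself has dropped below desired performance.
    return "falling capability"
```

With purely illustrative figures, `failure_mode_category(1000, 700, 800, 800)` returns `"falling capability"`, while `failure_mode_category(1000, 1000, 800, 1050)` returns `"increase in desired performance"`.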
Falling Capability
The first category of failure modes covers situations where capability is
above desired performance to begin with, but then drops below desired
performance after the asset is put into service, as illustrated in Figure 4.3.
The five principal causes of reduced capability are listed below:
deterioration
lubrication failures
dirt
disassembly
'capability reducing' human errors.

Figure 4.3: Failure Mode Category 1

Deterioration
Any physical asset that fulfils a function which brings it into contact with
the real world is subject to a variety of stresses. These stresses cause the
asset to deteriorate by lowering its capability or, more accurately, its
resistance to stress. Eventually this resistance drops so much that the
asset can no longer deliver the desired performance; in other words, it fails.
Deterioration covers all forms of 'wear and tear', including the degradation
of insulation. These failure modes should of course be included in a list of
failure modes wherever they are thought to be reasonably likely. The level of
detail with which they need to be recorded is discussed in the next part
of this chapter.
Lubrication failures
Lubrication is associated with two types of failure modes. The first
concerns lack of lubricant, and the second the failure of the lubricant itself.
The position with regard to lack of lubricant has changed a great deal
in the last two decades. Twenty years ago, most lubrication points
were replenished manually. The cost of lubricating each point was tiny
compared to the cost of not doing so. It was also tiny compared
to the cost of analysing the lubrication requirements of each point in detail.
This meant that it was not worth carrying out an in-depth analytical exercise
to set up a lubrication program. Instead, lubrication schedules were usually
set up on the basis of a quick survey by a lubrication specialist.
Since then, sealed components and centralised lubrication systems
have become the norm in most industries. This has led to
a massive reduction in the number of points where a human has to
apply oil or grease to a machine, and a massive increase in the
consequences of the failure of a centralised lubrication system. From the
RCM viewpoint, this means that it is now cost-effective to:
use RCM to analyse centralised lubrication systems in their own right
consider the loss of lubricant in the few remaining manually lubricated
points as individual failure modes.
The second category of failures associated with lubrication concerns
deterioration of the lubricant itself. It is caused by phenomena such as
shearing of the oil molecules. Deterioration of the oil may be hastened
by the build-up of deposits or the presence of water or other contaminants.
A lubricant may also fail to do its job because the wrong lubricant has been
used. If any or all of these failure modes are considered to be reasonably
likely in the context under consideration, they should be recorded and
analysed further. Similar considerations apply to transformer oil and
hydraulic oil.




Dirt
Dirt or dust is a very common cause of failure. It interferes directly with
machines by causing them to block, stick or jam. It is also a principal
cause of the failure of functions which deal with the appearance of assets
(things which should look clean look dirty). Dirt can also cause product
quality problems, either by getting into the mechanisms of machine tools
or by getting into products such as food or pharmaceuticals. As a
result, failures caused by dirt should be listed in the FMEA whenever they
are likely to interfere with a function of the asset.

Disassembly
If components fall off machines, assemblies fall apart or whole machines
come adrift, the consequences are usually very serious, so the relevant
failure modes should be listed. These are usually the failure of welds,
soldered joints or rivets due to fatigue or corrosion, or the failure of
threaded components such as bolts, electrical connections or pipe fittings,
which can also fail due to fatigue or corrosion, or which simply come undone.
Also take care to record the functions and associated failure modes of
locking mechanisms such as split pins when considering the
integrity of assemblies.
Human errors which reduce capability
The final subset of the 'falling capability' category of failure modes are
those caused by human error. As the name implies, these refer to errors
which reduce the capability of the process to the extent that it is unable
to function as required by the user.
Examples include manually operated valves left shut, causing a process to be
unable to start, parts incorrectly fitted by maintenance craftsmen or sensors
set in such a way that a machine trips out when nothing is wrong.

If failure modes of this type are known to occur, they should be recorded
in the FMEA so that appropriate failure management decisions can be
made later in the process. However, when listing failure modes caused by
people, take care to record what went wrong and not who caused
it. If too much emphasis is placed on 'who' at this stage, the analysis could
become adversarial, and people begin to lose sight of the
fact that it is an exercise in avoiding or solving problems, not attaching
blame. For instance, it is enough to say 'control valve set too high', not
'control valve incorrectly set by instrument technician'.

Failure Modes and Effects Analysis (FMEA)


Increase in Desired Performance (or Increase in Applied Stress)

The second category of failure modes occurs when desired performance
is within the envelope of the capability of the asset when it is first put into
service, but then the desired performance increases until it falls outside
the capability envelope. This causes the asset to fail in one of two ways:
the desired performance rises until the asset can no longer deliver it, or
the increase in stress causes deterioration to accelerate to the extent that
the asset becomes so unreliable that it is effectively useless.
An example of the first case occurs if the users of the pump shown in Figure 2.1
were to increase the offtake from the tank to 1 050 litres per minute. Under these
circumstances, the pump is unable to keep the tank full. (Note that in this case,
the users are not forcing the pump to work any faster; they have simply opened
a valve a bit wider somewhere else in the system.)
The second case occurs, for instance, if the owner of a motor car whose engine
is 'red-lined' at 6 000 rpm persists in revving the motor to 7 000 rpm. This causes
the engine to deteriorate more quickly than if the user keeps the revs within the
prescribed limits, so it fails more often.

This phenomenon is illustrated in Figure 4.4. It occurs for four reasons, the
first three of which embody some kind of human error:
sustained, deliberate overloading
sustained, unintentional overloading
sudden, unintentional overloading
incorrect process material.

Figure 4.4: Failure Mode Category 2

Sustained, deliberate overloading
In many industries, users quickly give in to the temptation simply to speed
up equipment in response to increased demand for existing products. In other
cases, people use assets acquired for one product to process a product with
different characteristics (such as larger, heavier unit sizes or higher
quality standards). People do this in the belief that they will get more out
of their facilities without any increase in capital investment. This may even
be true in the short term. However, this solution carries long-term penalties
in terms of reduced reliability and/or availability, especially when the
increased stresses begin to approach or exceed the ability of the asset to
withstand them.



This often leads people to claim that 'there must be something wrong with our
maintenance' while they are 'flogging the machine to death'. These people
focus on what they want the asset to do, while maintainers tend to think in
terms of what it can do.

'Better' maintenance procedures will do little or nothing to solve the problem
of a machine which cannot deliver the desired performance; this is like
rearranging the deck chairs on the Titanic. In such cases we need to look
beyond maintenance for solutions. The two options are to modify the asset to
improve its inherent capability, or to operate the machine within its
limitations.


Sustained, unintentional overloading
Many industries respond to increased demand by undertaking 'debottlenecking'
programs. These programs entail increasing the capacity of a facility such as
a production line to accommodate the new demand. However, much to the
frustration of their sponsors, these programs often seem to end up causing
more problems than they solve. This usually happens because a few small
subsystems or components are left out of the overall upgrade program, with
disappointing results. How this occurs is illustrated in Figure 4.5.

Demand for the product made by the facility illustrated in the figure has
increased to the extent that its users wish to increase output from 400 to 500 tons
per week. The dotted lines show the capability of each operation. They show
that most of the operations are capable of meeting the new requirement, but
operations 3, 8 and 10 are capable of less than 500 tons, so they are the
'bottlenecks'. To achieve the new target, the users 'debottleneck' these operations
by installing new machines or components which are capable of producing well
over 500 tons per week. They also upgrade the power supplies to match.
However, in this example the need to upgrade the instrument air supply was
overlooked, so the plant begins to suffer intermittent instrument failures when
demand for instrument air is at a maximum. (Note also that although the unchanged
operations can produce 500 tons, their margin for deterioration is reduced by the
upgrade program, so they also begin to fail more often.)
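
The check that was missed in this example can be sketched in a few lines of Python. This is an illustration only: the capability figures below are hypothetical (the text gives only the 400 and 500 ton targets and the numbers of the bottleneck operations), only the two services named in the text are shown, and the function names are invented for the sketch. The point is that shared services have to be screened against the new target in the same way as the operations themselves.

```python
TARGET = 500  # tons per week: the new desired output

# Hypothetical capabilities in tons per week, chosen so that operations
# 3, 8 and 10 fall short, as in the example.
operations = {1: 520, 2: 510, 3: 450, 4: 530, 5: 505, 6: 540,
              7: 515, 8: 430, 9: 525, 10: 470, 11: 510, 12: 535}

# Hypothetical equivalent capabilities of two shared services.
services = {"power": 480, "instrument air": 460}

def shortfalls(capabilities, target):
    """Return the items whose capability is below the target, sorted by name."""
    return sorted(name for name, cap in capabilities.items() if cap < target)

def margins(capabilities, target):
    """Return each item's spare capability relative to the target."""
    return {name: cap - target for name, cap in capabilities.items()}

bottleneck_operations = shortfalls(operations, TARGET)
bottleneck_services = shortfalls(services, TARGET)
```

Comparing `margins(operations, 400)` with `margins(operations, 500)` also shows how the margin for deterioration of the unchanged operations shrinks: with these figures, operation 1 goes from 120 tons of spare capability to 20.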

If an asset is thought likely to suffer from failure modes of this type, they
should be recorded in the FMEA so that they can be dealt with appropriately.


A production line with 12 operations and supplied with four






(In some industries, despite the best efforts of their engineers,
debottlenecking causes so much instability that
it is forbidden in all but the most tightly controlled and heavily restricted
circumstances. In these cases, growth is handled by allowing for it in the
design of the plant and/or by building new plants.)
Sudden, unintentional overloading

Many failures are caused by sudden and (usually) unintentional increases

in applied stress, usually caused in turn by one of the following:
incorrect operation (for instance, if a machine is put into reverse while
moving forward)
incorrect assembly (for instance, overtorquing a bolt)
external damage (for instance, if a fork lift truck smashes into a pump
or lightning strikes a poorly protected electrical installation).
These are not actually increases in desired performance, because no-one
wants the operator to put the machine into reverse at the wrong moment
or the fork lift to smash the pump. However, they are included in this
category because applied stress rises above the ability of the asset to
withstand it.
If any of these failure modes are thought to be reasonably likely in the
context under consideration, they should be incorporated in the FMEA.
Incorrect process material

Manufacturing processes often suffer functional failures caused by
process materials which are out of specification (in terms of such variables
as consistency, hardness or pH). Similarly, packaging plants often suffer
from inadequate or incompatible packaging materials.



In both cases, the machines fail or run badly because they cannot handle
the out-of-spec material. This can be seen as an increase in applied stress.
In practice, these 'failure modes' are seldom the result of a failure of the
asset under review, but are nearly always the effect of a failure elsewhere
in the system. This means that remedial action has to be applied to a
different asset. However, acknowledging these failures in the analysis of the
affected asset helps to ensure that they will receive attention when the
system which is really causing the problem is analysed. As a result, these
failure modes should be incorporated in the FMEA where they are known
to affect the asset under review, with a comment in the failure effects
column which directs attention to the real source of the problem.

Initial incapability
Chapter 2 explained that for any asset to be maintainable, its desired
performance must fall within the envelope of its initial capability. It went
on to mention that the majority of assets are in fact built this way.
However, situations do arise where desired performance is outside the envelope
of the initial capability right from the outset, as shown in Figure 4.6.
This incapability problem seldom affects entire assets. It usually affects
just one or two functions of one or two components, but these weak links upset
the operation of the whole chain. The first step towards rectifying design
problems of this nature is to list them as failure modes in an FMEA.

Figure 4.6: Failure Mode Category 3

4.4 How Much Detail?

Earlier in this chapter, it was mentioned that failure modes should be
described in enough detail for it to be possible to select an appropriate failure
management strategy, but not in so much detail that excessive amounts
of time are wasted on the analysis process itself.

Failure modes should be defined in enough detail for it to
be possible to select a suitable failure management policy

In practice, it can be difficult to find an appropriate level of detail.
However, it is important to do so, because the level of detail profoundly
affects the validity of the FMEA and the amount of time needed
to do it. Too little detail and/or too few failure modes lead to superficial
and sometimes dangerous analyses. Too many failure modes or too
much detail causes the entire RCM process to take much longer than it
needs to. In extreme cases, excessive detail can cause the process to take
two or even three times longer than necessary (a phenomenon known as
'analysis paralysis').
This means that it is essential to strike the right balance. Some of the
factors which need to be taken into account are discussed in the
following paragraphs.

The causes of any functional failure can be defined to almost any level of
detail, and different levels are appropriate in different situations. At one
extreme, it is sometimes enough to summarise the causes of a functional
failure in one statement, such as 'machine fails'. At the other, we may need
to consider what goes wrong at the molecular level and/or to explore the
remoter corners of the psyche of the operators and maintainers in a bid to
define so-called root causes of failure.
The extent to which failure modes can be described at different levels
of detail is illustrated in Figure 4.7 on the next three pages.
Figure 4.7 is based on the pump set shown in Figure 4.2, some of whose failure
modes were listed in Figure 4.1. It shows ways in which the pump set could
suffer from the functional failure 'unable to transfer water at all'. These
failure modes are considered at seven different levels of detail.
The top level (level 1) is failure of the pump set as a whole. The next
level covers the pump, the motor, the switchgear and inlet/outlet pipework,
and subsequent levels go into progressively more detail. When considering
Figure 4.7, note that:
the levels have been defined and failure modes allocated to them for the
purposes of this example only. They are not any kind of universal classification
Figure 4.7 does not show all failure possibilities at each level, so don't use
this example as a definitive model
it is possible to analyse some of the failure modes at even lower levels than
level 7, but it would very seldom be necessary to do so
the failure modes listed apply only to the functional failure 'unable to
transfer water at all'. Figure 4.7 does not show failure modes which would
cause other functional failures, such as loss of containment.













Figure 4.7 (continued): Failure modes at different levels of detail










The first point to emerge from this example is the connection between the
level of detail and the number of failure modes listed. The further one
'drills down' in an FMEA, the greater the number of failure modes that can
be listed.
For instance, five failure modes are listed at one of the upper levels of
Figure 4.7, but 64 at level 6.


Two more key issues which arise from Figure 4.7 concern 'root causes'
and human error. They are discussed below.

Root causes
The term 'root cause' is often used in connection with the analysis
of failures. It implies that if one drills down far enough, it is possible
to arrive at a final and absolute level of causation. In practice, this is
seldom the case.

For instance, in Figure 4.7 the failure mode concerning the impeller nut is
listed at level 6, which in turn is caused by an 'assembly error' at level 7.
If we were to go down one level further, the assembly error might have occurred
because the 'fitter was distracted' (level 8). He might have been distracted
because his 'child was ill' (level 9). This failure might have occurred because
the 'child ate bad food in restaurant' (level 10).

Clearly, this process of drilling down could go on almost forever, way
beyond the point at which the organisation doing the FMEA has any
control over the failure modes. This is why this chapter stresses repeatedly
that the level at which any failure mode should be identified is the level at
which it is possible to identify an appropriate failure management policy.
(This is equally true whether one is carrying out an FMEA before failures
occur or a 'root cause analysis' after a failure has occurred.)
The fact that the level which is appropriate varies for different failure
modes means that we do not have to list all failure modes at the same level
on the Information Worksheet. Some failure modes might be identified at
level 2, others at level 7, and the rest somewhere in between.

For instance, in one particular context, it may be appropriate to list only those
failure modes shaded in grey in Figure 4.7. In another context, it may be
appropriate for an entire FMEA for an identical pump set to consist of the single
failure mode 'pump set fails'. Another context may call for yet another selection.

Obviously, in order to be able to stop at an appropriate level, the people
doing such analyses need to be aware of the full range of failure management
policy options. These are discussed at length in Chapters 6 to 9.
Other factors which influence the level of detail are considered in the
rest of this part of this chapter.

Human error
Part 3 of this chapter mentioned a number of ways in which
human error could cause machines to fail. It went on to suggest that if the
associated failure modes are thought to be reasonably likely, they should
be incorporated in the FMEA. This has been done in Figure 4.7, where all
the failure modes ending with the word 'error' are some form of human error.
a brief summary
issues involved in the
classification and management of such errors.

Different failure modes occur at different frequencies. Some may occur
regularly, at average intervals measured in months, weeks or even days.
Others may be extremely improbable, with mean times between
occurrences measured in millions of years. When preparing an FMEA, decisions
must be made as to what failure modes are so unlikely that they can safely
be ignored. This means that we do not try to list every
failure possibility regardless of its likelihood.

When listing failure modes, do not try to list every
single failure possibility regardless of its likelihood
Only failure modes which might reasonably be expected to occur in the
context in question should be recorded. A list of 'reasonably likely' failure
modes should include the following:
failures which have occurred before on the same or similar assets.
These are the most obvious candidates for inclusion in an FMEA, unless
the asset has been modified so that the failure cannot occur again.
As discussed later, sources of information about these failures include
people who know the asset well (your own employees, vendors or other
users of the same equipment), technical history records and data banks.
In this context, note the comments in Part 6 of this chapter about the
shortcomings of most technical history records, and in Chapter 12 about
the dangers of too much reliance on historical data.
failure modes which are already the subject of proactive maintenance
routines, and so which would occur if no proactive maintenance was being
done. One way to ensure that none of these failure modes have
been overlooked is to study existing maintenance schedules and ask
"what failure mode would occur if we did not do this task?"

However, a review of existing schedules should only be carried out as
a final check after the rest of the RCM analysis has been completed,
in order to reduce the possibility of perpetuating the status quo.
(Some users of RCM are tempted to assume that all reasonably likely failure
modes are covered by their existing schedules, and hence that these
are the only failure modes which need to be considered in the FMEA.
This assumption should not be relied upon.)
Other failure modes which have not yet occurred may also need to be
considered. However, we don't want to waste time on failures which have
never occurred before and which are very unlikely in the context in question.
For instance, sealed-for-life bearings are installed on the motor of the
pump shown in Figure 4.7. This means that the likelihood of lubrication failure
is so low that it would not be included in most FMEAs. On the other hand,
failure due to lack of lubricant probably would be included in FMEAs prepared
for manually lubricated components, centralised lube systems and gearboxes.
However, the decision not to list a failure mode should be tempered by
careful consideration of the failure consequences.

Consequences

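If the consequences are likely to be very severe indeed, then less likely
failure possibilities should be listed and analysed further.
For instance, in most contexts a failure mode in which the pump set in
Figure 4.7 is struck by an object falling from the sky would be dismissed
immediately as improbable. However, if the pump were pumping something whose
escape would be very serious, this failure mode is more likely to be taken
seriously. (Appropriate failure management policies might be
to ban aircraft from flying over the site or to build a
roof which can withstand a crashing aircraft.)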
Another example from Figure 4.7 is 'motor not switched on'. This failure mode
is likely to be dismissed on the grounds of improbability in most situations.
Even if it does occur, the consequences may be so trivial that it is excluded
from the FMEA. (However, if it could occur and it does matter, especially in
cases where a number of items must be switched on in a particular sequence and
something could go wrong if they are not, then this failure mode should be
considered.)
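
The screening logic described in this part of the chapter can be summarised as a simple rule of thumb. The sketch below is only one possible reading of the text, written in Python with invented parameter names; in practice each of these judgements is made by the people doing the analysis, not computed.

```python
def should_list(occurred_before=False, covered_by_existing_maintenance=False,
                otherwise_considered_likely=False, severe_consequences=False):
    """Decide whether a failure mode belongs in the FMEA.

    Every argument is a yes/no judgement made by the review group
    (hypothetical names, used for this sketch only).
    """
    reasonably_likely = (occurred_before
                         or covered_by_existing_maintenance
                         or otherwise_considered_likely)
    # Unlikely failure modes are still listed when the consequences
    # would be very severe.
    return reasonably_likely or severe_consequences
```

For example, `should_list()` returns `False` for an improbable failure mode with trivial consequences, while `should_list(severe_consequences=True)` returns `True`.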

Cause vs Effect
Care should be taken not to confuse causes and effects when listing failure
modes. This is a subtle mistake most often made by people who are new
to the RCM process.
For example, one plant had some 200 gearboxes, all of the same design and all
performing more or less the same type of duty. Initially, two failure modes
were recorded for one of these gearboxes, one of which was:
Gear teeth stripped.
These failure modes were listed to begin with because the people carrying out the
review recalled that each failure had happened in the past to their knowledge
(some of the gearboxes were many years old). The failures did not affect safety
but they did affect production. So the implication was that it might be worth
doing preventive tasks like 'check gear teeth for wear' or 'check gearbox for
backlash'. However, further discussion revealed that both failures had occurred
because the oil level had not been checked when it should have been, so the
gearboxes had actually failed due to lack of oil. What is more, no-one could
recall that any of the gearboxes had failed if they had been properly
lubricated. As a result, the failure mode was eventually recorded as:
Gearbox fails due to lack of oil.
This underlined the importance of the obvious proactive task, which was to check
the oil level periodically. This is not to suggest that all gearboxes should be
analysed in this way. Some are much more complex or much more heavily loaded,
and so are subject to a wider variety of failure modes. In other cases, the
failure consequences may be much more severe, which would call for a more
defensive view of failure possibilities.

Failure Modes and the Operating Context

We have seen how the functions and functional failures of any item are
influenced by its operating context. This is also true of failure modes in
terms of causation, probability and consequences.
For instance, consider the three pumps shown in Figure 2.7. The failure modes
which are likely to affect the stand-by pump (such as brinelling of the
bearings, stagnation of water in the pump casing and even the 'borrowing' of
key components to use elsewhere in an emergency) are different from those which
might affect the duty pump, as set out in Figure 4.7.

Similarly, a vehicle operating in the Arctic would be subject to different
failure modes from the same make of vehicle operating in the Sahara desert.
Likewise, a gas turbine powering a jet aircraft would have different failure
modes from the same type of turbine acting as a prime mover on an oil platform.

These differences mean that great care should be taken to ensure that the
operating context is identical before applying an FMEA developed in one
set of circumstances to an asset which is used in another. (Note the
comments regarding the use of generic FMEAs in Part 6 of this chapter.)
The operating context also affects levels of analysis, and the causes
and consequences of failure. As discussed earlier, it may be appropriate
to identify failure modes for two identical assets at one level in one
operating context and at another level in another.

4.5 Failure Effects

The fourth step in the RCM review process entails listing what happens
when each failure mode occurs. These are known as failure effects.

Failure effects describe what happens
when a failure mode occurs

(Note that failure effects are not the same as failure consequences. A
failure effect answers the question "what happens?", whereas a failure
consequence answers the question "(how) does it matter?".)
A description of failure effects should include all the information needed
to support the evaluation of the consequences of the failure. Specifically,
when describing the effects of a failure, the following should be recorded:
what evidence (if any) that the failure has occurred
in what ways (if any) it poses a threat to safety or the environment
in what ways (if any) it affects production or operations
what physical damage (if any) is caused by the failure
what must be done to repair the failure.
These issues are reviewed in the following paragraphs. Note that one of
the objectives of this exercise is to establish whether proactive maintenance
is necessary. If we are to do this correctly, we cannot assume that some
sort of proactive maintenance is already being done, so the effects of a
failure should be described as if nothing was being done to prevent it.
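
The five items that a failure effect description should cover (evidence of failure, threats to safety or the environment, effects on production or operations, physical damage, and corrective action) can be pictured as the fields of a simple record. The Python sketch below is an illustration only; the class and field names are invented here and do not correspond to any particular worksheet layout.

```python
from dataclasses import dataclass

@dataclass
class FailureEffect:
    """One failure effect description, split into the five recorded items."""
    evidence: str = ""            # what shows that the failure has occurred
    safety_environment: str = ""  # threat to safety or the environment, if any
    operations: str = ""          # effect on production or operations, if any
    physical_damage: str = ""     # physical damage caused, if any
    corrective_action: str = ""   # what must be done to repair the failure

    def missing(self):
        """Return the names of the items that have not been filled in yet."""
        return [name for name, value in vars(self).items() if not value.strip()]

# A partly completed entry; the wording is illustrative.
effect = FailureEffect(
    evidence="Motor trips out and trip alarm sounds in the control room",
    operations="Tank runs dry after 30 minutes",
    corrective_action="Replace bearings",
)
```

Here `effect.missing()` returns `['safety_environment', 'physical_damage']`, prompting the analyst either to describe those effects or to state explicitly that there are none.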



Evidence of Failure
Failure effects should be described in a way which enables the RCM review
team to decide whether the failure will become evident to the
operating crew under normal circumstances.
For instance, the description should state whether the failure causes warning
lights to come on or alarms to sound (or both), and whether the warning is given
on a local panel or in a central control room (or both).

Similarly, the description should state whether the failure is accompanied
by obvious physical effects such as loud noises, fire, smoke, escaping steam,
unusual smells or pools of liquid on the floor. It should
also state whether the machine shuts down as a result of the failure.
For example, if we are considering the seizure of the bearings of the pump shown
in Figure 3.5, the failure effects might be described as follows (the italics
describe what would make it evident to the operators that a failure has occurred):
Motor trips out and trip alarm sounds in the control room. Tank Y low level alarm
sounds after 20 minutes, and tank runs dry after 30 minutes. Downtime required
to replace the bearings 4 hours.
In the case of a stationary gas turbine, a failure mode that occurred in practice
was the gradual build up of combustion deposits on the compressor blades. These
deposits could be partially removed by the periodic injection of cleaning
material into the air stream, a process known as 'jet blasting'. The failure
effects were described accordingly as follows:
Compressor efficiency declines and governor compensates to sustain power
output, causing exhaust temperature to rise. Exhaust temperature is displayed
on the local control panel and in the central control room. If no action is
taken, exhaust gas temperature rises above 475°C under full power. A high exhaust
gas temperature alarm annunciates on the local control panel and a warning
light comes on in the central control room. Above 500°C, the control system
shuts down the turbine. (Running at temperatures above 475°C shortens the
creep life of the turbine blades.) The blades can be partially cleaned by jet
blasting, which takes about 30 minutes.

This is an unusually complex failure mode, so the description of the failure
effects is somewhat longer than usual. The average description of a
failure effect usually amounts to between twenty and sixty words.
When describing failure effects, do not prejudge the evaluation of the
failure consequences by using the words 'hidden' or 'evident'. They are
part of the consequence evaluation process, and using them prematurely
could bias this evaluation incorrectly.
Finally, when describing the failure of a protective device, failure effect
descriptions should state what would happen if the protected device were
to fail while the protective device was unserviceable.


Safety and Environmental Hazards

Modern industrial plant design has evolved to the point where only a small
proportion of failure modes present a direct threat to safety or the
environment. However, if there is a possibility that someone could get hurt or
killed as a direct result of the failure, or an environmental standard or
regulation could be breached, the failure effect should describe how this
could happen. Examples include:
increased risk of fire or explosions
the escape of hazardous chemicals
falling objects
pressure bursts (for instance of pressure vessels)
exposure to very hot or molten materials
the disintegration of components
vehicle accidents or derailments
exposure to sharp edges or moving machinery
increased noise levels
the collapse of structures
the growth of bacteria
the ingress of dirt into food or pharmaceutical products.

When listing these effects, do not make qualitative statements like "this
failure has safety consequences" or "this failure affects the environment".
Simply state what happens, and leave the evaluation of the consequences
to the next stage of the RCM process.
Note also that we are not only concerned about possible threats to our
own staff, but also about threats to the safety of customers and the
community as a whole. This may call for some research into the environmental
and safety standards which govern the process under review.

Secondary Damage and Production Effects

Failure effect descriptions should also help with decisions about operational
and non-operational failure consequences. To do so, they should
indicate how production is affected (if at all), and for how long.
This is usually given by the amount of downtime associated with each failure.
In this context, downtime means the total amount of time the asset
would normally be out of service owing to this failure, from the moment
it fails until the moment it is fully operational again. As indicated in
Figure 4.8, this is usually much longer than the repair time.

Figure 4.8: Downtime vs repair time
Downtime as defined above can vary greatly for different occurrences of
the same failure, and the most serious consequences are usually caused
by the longer outages. Since it is these consequences which are of most
interest to us, the downtime recorded on the information worksheet should be
based on the 'typical worst case'.
For instance, if the downtime caused by a failure which occurs late on a weekend
night shift is usually much longer than it is when the failure occurs on a
normal day shift, and if such night shifts are a regular occurrence, we list
the former.

It is of course possible to reduce the operational consequences of a failure
by taking steps to shorten the downtime, most often by reducing the time
it takes to get hold of a spare part. However, as discussed in Chapter 2, we
are still in the process of defining the problem at this stage, so the analysis
should be based at least initially on current spares holding policies.
Note that if the failure affects operations, it is more important to record
downtime than the mean time to repair (MTTR), for two reasons:
in many people's minds, 'repair time' has the meaning shown in Figure
4.8. If this is used instead of downtime, it could upset the subsequent assessment of the
consequences of failure
we should base the assessment of consequences on the 'typical worst
case' and not the 'mean', as discussed above.
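The difference between the 'typical worst case' and the mean can be illustrated with a short Python sketch. The function name, the situation labels and the downtime figures below are illustrative assumptions, not data from any worksheet in this book. The sketch takes one possible reading of the rule: outages are grouped by the situation in which they occur, the typical (mean) downtime is worked out for each situation which occurs regularly, and the largest of these is the figure to record.

```python
from statistics import mean


def typical_worst_case(downtime_hours, regular_situations):
    """Return the typical downtime of the worst regularly occurring situation.

    downtime_hours: dict mapping a situation label to a list of observed
        downtimes in hours (hypothetical history, for illustration only).
    regular_situations: set of situation labels which occur regularly.
    """
    typical = {
        situation: mean(hours)
        for situation, hours in downtime_hours.items()
        if situation in regular_situations and hours
    }
    if not typical:
        raise ValueError("no downtime history for any regular situation")
    return max(typical.values())


# Hypothetical history for one failure mode.
history = {
    "normal day shift": [2.0, 3.0, 2.5],
    "weekend night shift": [9.0, 11.0, 10.0],
}
regular = {"normal day shift", "weekend night shift"}

recorded = typical_worst_case(history, regular)
overall_mean = mean(h for hours in history.values() for h in hours)

print(f"downtime to record: {recorded} hours")
print(f"overall mean:       {overall_mean} hours")
```

With these assumed figures the overall mean understates the outage which matters most, which is why the worksheet entry is based on the weekend night shift.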
If the failure does not cause a process stoppage, then the average amount
of time it takes to repair the failure should be recorded, because this can
be used to help establish manpower requirements.

Failure Modes and

In addition to downtime, any other ways in which the failure could have
an effect on the operational capability of the asset should be
listed. Possibilities include:
whether and how product quality or customer service is affected, and
if so whether any financial penalties are involved
whether any other equipment or activity also has to stop (or
slow down)
whether the failure leads to an increase in overall operating costs in
addition to the direct cost of repair (such as higher energy costs)
what secondary damage (if any) is caused by the failure.
Corrective Action

Failure effects should also state what must be done to repair the failure.
This can be included in the statement of downtime, as shown in italics
in the following examples:
Downtime to replace the bearings about four hours
Downtime to clear the blockage and reset the switch about 30 minutes
Downtime to strip the turbine and replace the disc about 2 weeks.

4.6 Sources of Information about Modes and Effects

When considering where to get the information needed to draw up a reasonably
comprehensive FMEA, remember the need to be proactive. This
means that as much emphasis should be placed on what could happen as
on what has happened. The most common sources of information are discussed
in the following paragraphs, together with a brief review of their
main advantages and disadvantages.
The manufacturer or vendor of the equipment
When carrying out an FMEA, the source of information which
springs to mind first is the manufacturer. This is especially so in the case of
new equipment. In some industries, this has reached the point where manufacturers
or vendors are routinely asked to provide a comprehensive FMEA
as part of the equipment supply contract. Apart from anything else, this
request implies that manufacturers know everything that needs to be known
about how the equipment can fail and what happens when this occurs.
This is seldom the case in reality.

Few manufacturers are involved in the day-to-day operation of the equipment. After the end of the
warranty period, almost none receive feedback from the users about what fails and why.
The best that many of them can do is try to draw conclusions about how
their equipment is performing from a combination of anecdotal evidence
and an analysis of spares sales (except when a really spectacular failure
occurs, in which case
tend to take over from
At this
rational technical discussion about root causes often
Manufacturers also have little access to information about the operating
context of the equipment, desired standards of performance, failure
consequences and the skills of the user's operators and maintainers. More
often the manufacturers know nothing about these issues. As a result,
FMEA's compiled by these manufacturers are usually generic and often
highly speculative, which limits their value.
The small minority of equipment manufacturers who are able to produce a
satisfactory FMEA on their own fall into one of two categories:
they are involved in maintaining the equipment throughout its useful
life, either directly or through associated vendors. For instance,
most privately-owned motor vehicles are maintained by the dealers
who sold the vehicles. This enables the dealers to provide the manufacturers
with copious failure data.
they carry out formal reliability studies on prototypes as part
of the initial procurement process. This is a common feature of military
procurement, but much less common in industry.
In most cases, the author has found that the best way to access whatever
knowledge manufacturers possess about the behaviour of the equipment
in the field is to ask them to supply field technicians to work
alongside the people who will eventually operate and maintain the asset,
to develop FMEA's which are acceptable to both parties. If this suggestion
is adopted, the field technicians should of course have unrestricted
access to the help they need to answer difficult questions.
When adopting this approach, issues such as the languages which the
participants should be able to speak fluently, technical support,
confidentiality, and so on should be handled at the contracting stage, so that
everyone knows clearly what to expect of each other.
Note the suggestion to use field technicians rather than designers.
Designers are often reluctant to admit that their designs can fail,
which limits their ability to help develop a sensible FMEA.


Generic lists of failure modes

'Generic' lists of failure modes are lists of failure modes, or sometimes
entire FMEA's, prepared by third parties. They may cover entire systems,
but more often cover individual assets or even single components. These
generic lists are touted as another method of speeding up or streamlining
this part of the maintenance program development process. In fact, they
should be approached with great caution, for the following reasons:

the level of analysis may be inappropriate: A generic list may identify
failure modes at a level equivalent to level 5 in Figure 4.7, when
all that may be needed is level 1. This means that far from streamlining
the process, the generic list would condemn the user to analysing far
more failure modes than necessary. Conversely, the generic list may
focus on level 3 or 4 in a situation where some of the failure modes
really ought to be analysed at level 5 or 6.

the operating context may be different: The operating context of your asset
may have features which make it susceptible to failure modes that do not
appear in the generic list. Conversely, some of the modes in the generic
list might be improbable (if not impossible) in your context.
performance standards may differ: your asset may operate to standards of
performance which mean that your whole definition of failure may be
completely different from that used to develop the generic list.

These three points mean that if a generic list of failure modes is used at
all, it should only ever be used to supplement an FMEA prepared for the asset
in its own context, and never used on its own as a definitive list.
Other users of the same equipment
Other users are an obvious and very valuable source of information about
what can go wrong with commonly used assets, provided of course that
competitive pressures permit the exchange of data. This is often done
through industry associations (as in the offshore oil industry), through
regulatory bodies (as in civil aviation) or between different branches of
the same organisation. However, note the above comments about the dangers of
generic data when considering these sources of information.
Technical history records
Technical history records can also be a valuable source of information.
However, they should be treated with caution for the following reasons:

they are often incomplete
more often than not, they describe what was done to repair the failure
('replaced main bearing') rather than what caused it
they do not describe failures which have not yet occurred
they often describe failure modes which are really the effect of some
other failure.
These drawbacks mean that technical history records should only be used
as a supplementary source of information when preparing an FMEA, and
never the sole source.

The people who operate and maintain the equipment

In nearly all cases, by far the best sources of information for preparing an
FMEA are the people who operate and maintain the equipment
on a day-to-day basis. They tend to know the most about how the equipment works,
what goes wrong with it, how much each failure matters and what must
be done to fix it, and if they don't know, they are the ones who have the
most reason to find out.
The best way to capture and to build on their knowledge is to arrange
for them to participate formally in the preparation of the FMEA as part
of the overall RCM process. The most efficient way to do this is in a series
of meetings under the guidance of a facilitator. (The most
valuable source of additional information at these meetings
is a comprehensive set of drawings, coupled with ready
access to process and/or technical specialists on an ad hoc basis.) This
approach to RCM was introduced in Chapter 1 and is discussed at much
greater length in Chapter 13.

4.7 Levels of Analysis and the Information Worksheet

Part 4 of this chapter showed how failure modes can be described at
almost any level of detail. The level of detail which is ultimately selected
should enable a suitable failure management policy to be identified. In
general, higher levels (less detail) should be selected if the component or
sub-system is likely to be allowed to run to failure or subjected
to failure-finding, while lower levels (more detail) need to be selected if the failure
mode is likely to be subjected to some sort of proactive maintenance.

The detail used to describe failure modes on Information Worksheets
is also influenced by the level at which the FMEA as a whole is carried
out. This in turn is influenced by the level at which the entire RCM
analysis is performed. For this reason, we review the principal factors which
influence the overall level of analysis (which is also known as the 'level of
indenture') before considering how this affects the detail with which
failure modes should be described.

Level of analysis
RCM is defined as a process used to determine what must be done to
ensure that any physical asset continues to do whatever its users want it
to do in its present operating context. In the light of this definition, we
have seen that it is necessary to define the context in detail before we can
apply the process. However, we also need to define exactly what the
'physical asset' is to which the process will be applied.
For example, if we apply RCM to a truck, is the entire truck the asset? Or should we
subdivide the truck and analyse (say) the drive train separately from
the chassis and so on? Or should we further subdivide the drive train and
analyse (say) the engine separately from the
differentials, axles and wheels? Or should the engine not be divided into
block, engine management system, cooling system and so on before
starting the analysis? What about subdividing the fuel system into tank, pump, lines
and filters?

This issue needs

thought because an
a level becomes too
and unintelligible. The
plore the
out the
Starting at a low level
One of the

out the

For example, when thinking about the failure modes which could affect a
vehicle, a possibility which comes to mind is a blocked fuel line. The fuel line is
part of the fuel system, so it seems sensible to address this failure mode by raising
a Worksheet for the fuel system. Figure 4.9 indicates that if the analysis is carried
out at this level, the blocked fuel line might be the seventh failure mode
identified out of a total of a dozen which could cause the functional failure
'unable to transfer any fuel at all'.

Figure 4.9: Failure modes of a fuel system

When we consider that the vehicle can be divided into dozens, if not hundreds, of
sub-systems, and that a separate analysis is carried out for each sub-system,
a number of problems arise:
the further down the hierarchy the analysis progresses, the more difficult it
becomes to define performance standards. (One could
also ask who actually cares about the
amount of fuel passing through the fuel line,
as long as the fuel economy of the vehicle is
within reasonable limits and the vehicle has enough power.)

at a low level, it also becomes difficult to visualise and hence to analyse
failure consequences.

the lower the level of the analysis, the more difficult it becomes to
decide which components belong to which system (for
instance, is the accelerator part of the fuel system?)

some failure modes can cause many sub-systems to cease to function
simultaneously (such as a failure in the supply of power to an industrial
plant). If each sub-system is analysed on its own, failure modes of this
type are repeated many times.
control and protective loops can become very difficult to deal with in
a low-level analysis, especially when a sensor in one sub-system drives
an actuator in another through a processor in a third.
For instance, a rev limiter which reads a signal off the flywheel in the 'engine
block' sub-system might send a signal through a processor in the 'engine
control' sub-system to a fuel shut-off valve in the 'fuel' sub-system.


the same function

different ways, and the same
three times in
task
more than once for the same

a new worksheet has to be raised for each

to the
of vast
ter memory space . The associated manual or electronic
structured if the
have to be very
manageable. In short, the whole exercise starts to become much more
extensive and much more
than it needs to be.
FMEA's are often carried out at too low a level in the equipment hierarchy
because of a belief that there is a correlation between the level at which
we identify failure modes and the level at which the FMEA (and the RCM
analysis as a whole) should be performed. In other words, it is assumed
that if we want to identify failure modes in detail, we must carry out an
FMEA for each sub-system or component.
In fact, this is not so. The level at which failure modes can be identified
is independent of the level at which the analysis is carried out, as shown
in the next section of this chapter.

Starting at the top
Instead of starting towards the bottom of the equipment hierarchy, we could start at the top.
For example, the primary function of a truck was listed on page 28 as follows: To
transport up to 40 tons of material at
of up to 75
from Startsville to Endburg on one tank of fuel.' The first functional failure
associated with this function is 'Unable to move at all'. The four failure modes shown
in Figure 4.9 could all cause this functional failure, so instead of being
listed on a Worksheet for the fuel system, they
could have been listed on a
Worksheet covering the entire truck, as shown in Figure 4.10.

Figure 4.10: Failure modes of a truck

The main advantages of starting the analysis at this level are as follows:
functions and performance expectations are much easier to define
failure consequences are much easier to assess
it is easier to identify and analyse control loops
and circuits as a whole
there is less repetition of functions and failure modes
it is not necessary to raise a new information worksheet for each new
sub-system, so analyses carried out at this level consume far less paper.
the analysis at this level
that there are hundreds of failure modes which could render the truck
,,T,. .,,.,,,,""'"' u nable to move. These range from a flat
crankshaft. So i f we were to
to list all the failure modes at this level,
likely that several would be overlooked am)gemer.
it is
For i n stance, we have seen how the b locked fuel
have been the
seventh failure mode out of twelve to be identified i n the
carried out at
level. However, at the truck level, Figure 4. 1 O shows that it might
the 'fuel
have been 73rd out of several h undred failure modes.

The problems associated with high-level and low-level analyses suggest that
it may be sensible to carry out the analysis at an intermediate level. In fact,
we are almost spoiled for choice, because most assets can be sub-divided
into many levels and the RCM process applied at any one of these levels.

Figure 4.11 shows how the 40 ton truck could be divided into at least
five levels. It traces the hierarchy from the level of the truck as a whole down to
the level of the fuel lines. It goes on to show how the primary function of the asset
might be defined at each level on an RCM Information Worksheet, and how the
blocked fuel line could appear at each level.

Given this choice, how do we select
the level at which to perform the analysis?
We have seen that the top level usually embodies too many failure
modes per function to permit sensible analysis. In spite of this, however,
we still need to identify the main functions of the asset or system
at the top level in order to provide a framework for the rest of the analysis.
For example, we want a truck to carry goods from A to B, not to pump
fuel through a fuel line. Although the latter function contributes to the former, the
overall performance of the asset and hence of its maintenance tends to be
judged at the higher levels. For instance, the chief executive of a truck fleet is
much more likely to ask 'how is truck X performing?' than 'how is the fuel system
on truck X performing?' (unless the fuel system is known to be causing a problem).

Figure 4.11: Functions and failures at different levels

Chapter 2 explained that in practice, a statement of the operating context
provides a record of the main functions and associated performance standards
of any asset or system at levels above the level at which the RCM
analysis is to be carried out.


Reliability-centred Maintenance

On the other hand, we have seen that the initial inclination is nearly always
to start too low in the asset hierarchy. For this reason, a good general rule
(especially for people new to RCM) is to carry out the analysis one level
or even two levels higher than at first seems sensible. This is because
it is easier to break complex sub-systems out of a high-level analysis
than it is to go up a level when one has started too low. This is discussed
in more detail in the next section of this chapter.
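The general rule can be sketched in a few lines of Python. The five level names below are an assumption pieced together from the truck example in this chapter (the labels used in Figure 4.11 are not reproduced here), and the function name is purely illustrative.

```python
# Assumed hierarchy for the 40 ton truck, from the whole asset downwards.
# These level names are illustrative, not the labels used in Figure 4.11.
HIERARCHY = ["truck", "drive train", "engine", "fuel system", "fuel lines"]


def suggested_level(first_inclination, levels_up=1):
    """Move one (or two) levels up from the level which first seems sensible.

    The result never goes above the top of the hierarchy.
    """
    index = HIERARCHY.index(first_inclination)
    return HIERARCHY[max(0, index - levels_up)]


print(suggested_level("fuel system"))              # one level higher
print(suggested_level("fuel lines", levels_up=2))  # two levels higher
```

The sketch only captures the direction of the rule. Choosing the level in practice still depends on whether a suitable failure management policy can be identified at that level.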
With a bit of practice (especially concerning what is meant by 'a level
at which it is possible to identify a suitable failure management policy'),
the most suitable level at which to carry out any analysis eventually
becomes intuitively obvious. In this context, note that it is not necessary to
analyse every system at the same level throughout the asset hierarchy.
For instance, the entire braking system could be analysed at level 2 as shown in
Figure 4.11, but it may be necessary to analyse the engine at level 3 or even level 4.

How Failure Modes and Effects Should be Recorded

Once the level of the entire RCM analysis has been established, we then
have to decide what degree of detail is necessary to define each failure
mode within the framework of that analysis. There is no technical reason
why all the failure modes cannot be listed with their effects at
a level which enables a suitable failure management policy to be selected.
However, even intermediate level analyses sometimes generate too
many failure modes per functional failure, especially for primary functions.
This usually happens when the asset incorporates complex subassemblies
which could themselves suffer from a large number of failure modes.
Examples of such subassemblies include small electric motors, small hydraulic
systems, small gearboxes, control loops, protective circuits and complex couplings.

Depending as usual on context and consequences, these sub-assemblies

can be handled in one of four different ways, as discussed below.

Option 1
List all the reasonably likely failure modes of the subassembly individually
as part of the main analysis, in other words at levels equivalent to
level 3, 4, 5 or 6 in Figure 4.7.

For example, consider an asset which could stop completely as a result of the
failure of a small gearbox. On the Information Worksheet for this asset, this
gearbox failure could be listed as shown below:

3 Gearbox seizes due to lack of oil
... etc

In general, the failure modes which could affect a subassembly should be listed
in a higher level analysis if the subassembly is likely to
suffer from no more than about 6 failure modes which are
considered to be worth identifying and which will cause any one
functional failure of the higher level assembly.
Option 2
List the failure of the subassembly as a single failure mode on the Information
Worksheet to begin with, and then raise a new worksheet to analyse the
functional failures, failure modes and effects of the sub-assembly.
For example, the failure of the gearbox discussed above could be listed
as follows:

is usually w orth
in this
ure modes of the
one function
could cause the loss
the main as;en101v
there are between 7 and 9 failure modes per functional
use
in mind that
option one or option two,
but fewer failure modes per
Option 3
List the failure of the sub-assembly on the Information Worksheet as a
single failure mode, in other words at a level equivalent to level one or
two, record its effects and leave it at that.
For example, if it was considered appropriate to treat the gearbox
discussed above in this fashion, it would be listed as shown overleaf:

This approach should only be adopted for a component or subassembly
which has the following characteristics:
it is not subjected to detailed diagnostic and repair routines when it
fails, but is simply replaced and either discarded or subjected to later repair
it is small but complex
it does not have any dominant failure modes
it is not likely to be susceptible to any form of proactive maintenance.
Option 4

A complex subassembly might suffer from one or two dominant failure modes
which are readily preventable, and a number of less common failures which
may not be worth preventing because prevention is not technically feasible
and/or the consequences of the failures do not warrant it.
For example, a small electric motor operating in a dusty environment might be
almost certain to fail due to overheating if its cooling fan gets blocked,
while its other failures are few and far between and not very serious if they
do occur. In this case, the failure modes for this motor might be listed as follows:
motor fan blocked by dust
motor fails

This option is a combination of options 1 and 3.

The failure of services (power, water, steam,
gases, vacuum, etc) are treated as a single
failure mode from the point of view of the asset which
is supplied by that service, because detailed analysis of these failures is
beyond the scope of the asset in question. Such failures are noted
for information purposes, their effects recorded and analysed
in detail when the service is analysed as a whole.
Failure effects are then
listed in the last column of the Information Worksheet alongside
the relevant failure mode, as shown in Figure 4.13.
Figure 4.13: Information Worksheet (5 MW Gas Turbine)


Failure Consequences

Previous chapters have explained how the RCM process asks the following
seven questions about each asset:
what are the functions and associated performance standards of the
asset in its present operating context?
in what ways does it fail to fulfil its functions?
what causes each functional failure?
what happens when each failure occurs?
in what way does each failure matter?
what can be done to predict or prevent each failure?
what if a suitable proactive task cannot be found?
The answers to the first four questions were discussed at length in Chapters
2 to 4. These showed how RCM Information Worksheets are used to
record the functions of the asset under review, and to list the associated
failure modes and failure effects.
The last three questions are asked about each individual failure mode.
This chapter considers the fifth question:
in what way does each failure matter?

5.1 Technically Feasible and Worth Doing

Every time a failure occurs, the organisation which uses the asset is affected
in some way. Some failures affect output, product quality or customer
service. Others threaten safety or the environment. Some increase operating
costs, for instance by increasing energy consumption, while a few
have an impact in four, five or even all six of these areas. Still others may
appear to have no effect at all if they occur on their own, but may expose
the organisation to the risk of much more serious failures.
If any of these failures are not prevented, the time and effort which need
to be spent correcting them also affects the organisation, because repairing
failures consumes resources which might be better used elsewhere.



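The nature and severity of these effects govern the way in which each
failure is viewed by the organisation. The precise impact in each case, in other
words the extent to which each failure matters, depends on the operating
context of the asset, the performance standards which apply and the effects of
each failure mode.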
This combination of context, standards and effects means that every
failure has a specific set of consequences associated with it. If the
consequences are very serious, then considerable efforts will be made to
prevent the failure, or at least to anticipate it in time to reduce or eliminate the
consequences. This is especially true if the failure could hurt or kill someone,
or if it is likely to have a serious effect on the environment. It is also
true of failures which interfere with production or operations.
On the other hand, if the failure only has minor consequences, it is
possible that no proactive action will be taken and the failure simply
corrected each time it occurs.
This suggests that the consequences of failures are more important
than their technical characteristics. It also suggests that the
whole idea of proactive maintenance is not so much about preventing
failures as it is about avoiding or reducing the consequences of failure.

Proactive maintenance has much more to do with

avoiding or reducing the consequences of failure
than it has to do with preventing the failures themselves
If this is accepted, then it stands to reason that any proactive task is only
worth doing if it deals successfully with the consequences
of the failure which it is meant to prevent.

A proactive task is worth doing if it deals successfully with

the consequences of the failure which it is meant to prevent
This of course presupposes that it is possible to anticipate or prevent the
failure in the first place. Whether or not a proactive task is technically
feasible depends on the technical characteristics of the task and of the
failure which it is meant to prevent. The criteria governing technical feasibility
are discussed in more detail in Chapters 6 and 7.
If it is not possible to find a suitable proactive task, the nature of the
failure consequences also indicates what default action should be taken.
Default tasks are reviewed in Chapters 8 and 9.



The rest of this chapter considers the criteria used to evaluate the
consequences of failure, and hence to decide whether any form of proactive
task is worth doing. These consequences are divided into categories in two
stages. The first stage separates hidden functions from evident functions.

5.2 Hidden and Evident Functions

We have seen that every asset has more than one and sometimes dozens
of functions. When most of these functions fail, it will inevitably become
apparent to someone that the failure has occurred.
For instance, some failures cause warning lights to flash or alarms to
sound, or both. Others cause machines to shut down or some other part
of the process to be interrupted. Others lead to product quality problems
or increased use of energy, and yet others are accompanied by obvious
physical effects such as loud noises, escaping steam, unusual smells or
pools of liquid on the floor.
For example, Figure 2.7 in Chapter 2 showed three pumps which are shown again
in Figure 5.1. If the stand-alone pump fails, pumping capability is lost. This
failure on its own will inevitably become apparent to the operators, either as soon
as it happens or when some downstream part of the process is interrupted. (The
operators might not know immediately that the problem was caused by the pump,
but they would eventually and inevitably become aware of the fact that
something unusual had happened.)
Figure 5.1: Three pumps

Failures of this kind are classed as evident because someone will eventually
find out about them when they occur on their own. This leads to the
following definition of an evident function:

An evident function is one whose failure will on

its own eventually and inevitably become evident
to the operating crew under normal circumstances
However, some failures occur in such a way that nobody knows that the item is
in a failed state unless or until some other failure also occurs.



For instance, if Pump C in Figure 5.1 failed, no-one would be aware of the fact
because under normal circumstances Pump B would still be working. In other
words, the failure of Pump C on its own has no direct impact unless or until
Pump B also fails (an abnormal circumstance).

Pump C exhibits one of the most important characteristics of a hidden
function, which is that the failure of this pump on its own will not become
evident to the operating crew under normal circumstances. In other
words, it will not become evident unless pump B also
fails. This leads to the
following definition of a hidden function:

A hidden function is one whose failure will

not become evident to the operating crew under
normal circumstances if it occurs on its own.
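The two definitions can be tested against the three pumps with a small Python sketch. It rests on assumptions which are not spelled out in the text: that Pump B is the duty pump with Pump C as its standby, that the stand-alone pump is called Pump A, and that the operating crew can observe both whether flow is available and which pump is running. The function names are illustrative only.

```python
def observed_state(duty, standby, failed):
    """What the operating crew can see: (flow available, pump running).

    Assumes the standby pump cuts in when the duty pump fails.
    """
    if duty not in failed:
        return (True, duty)
    if standby is not None and standby not in failed:
        return (True, standby)
    return (False, None)


def classify_function(item, duty, standby=None):
    """Classify a function as 'evident' or 'hidden'.

    The failure is evident if the item failing on its own changes what the
    crew observes under normal circumstances; otherwise it is hidden.
    """
    normal = observed_state(duty, standby, failed=set())
    after_failure = observed_state(duty, standby, failed={item})
    return "evident" if after_failure != normal else "hidden"


# Stand-alone pump (assumed name 'A'): flow is lost, so the failure is evident.
print("Pump A:", classify_function("A", duty="A"))
# Duty pump B with standby C: the switch to C is observable, so evident.
print("Pump B:", classify_function("B", duty="B", standby="C"))
# Standby pump C: nothing observable changes while B keeps running, so hidden.
print("Pump C:", classify_function("C", duty="B", standby="C"))
```

The sketch treats 'evident' as 'changes what the crew observes', which is only one way of modelling the definitions above.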
The first step in the RCM process is to separate hidden functions from
evident functions because hidden functions need special handling, as
discussed at length in Part 6 of this chapter. We will see later that these
functions are associated with protective devices which are not fail safe.
modes which
Since
can account for up
hidden functions could well become the
dominant issue in maintenance over the next ten years.
To place hidden functions in perspective, we first consider evident failures.

Categories of Evident Failures

Evident failures are classified into three categories in descending order of
importance, as follows:
safety and environmental consequences. A failure has safety consequences
if it could hurt or kill someone. It has environmental consequences
if it could lead to a breach of any regional or national
environmental standard

operational consequences. A failure has operational consequences if

it affects production or operations (output, product quality, customer
service or operating costs in addition to the direct cost of repair)
non-operational consequences. Evident failures in this category affect
neither safety nor operations, so they involve only the direct cost of repair.
mode are considered.

This approach also means that the safety, environmental and economic
consequences of each failure are assessed in one step, which is much
more cost-effective than considering them separately.
The next four sections of this chapter consider each of these categories
in detail, starting with the evident categories and then moving on to the
rather more complex issues surrounding hidden functions.

5.3 Safety and Environmental Consequences

Safety First
As we have seen, the first step in the consequence evaluation process is
to identify hidden functions so that they can be dealt with appropriately.
All remaining failure modes, in other words failures which are not classified
as hidden, must by definition be evident. The above paragraphs
explained that the RCM process considers the safety and environmental
implications of each evident failure mode first. It does so for two reasons:
a more and more firmly held belief among customers and society in general
that hurting or killing people in the course of business is
not tolerable, and hence that everything possible
should be done to minimise the possibility of any sort of safety-related
incident or environmental excursion.
the probabilities which are tolerated for
safety-related incidents tend to be several orders of magnitude lower
than those which are tolerated for failures which have operational consequences. As a result, in most of the cases where a proactive
task is worth doing from the safety viewpoint, it is also likely to be
adequate from the operational viewpoint.
At one level, safety refers to the safety of individuals in the workplace.
Specifically, RCM asks whether anyone could get hurt or killed, either as
a direct result of the failure mode itself or by other damage which may be
caused by the failure.

A failure mode has safety consequences

if it causes a loss of function or other
damage which could hurt or kill someone



At another level, 'safety' refers to the safety or well-being of society in
general. Nowadays, failures which affect society tend to be classed as
'environmental' issues. In fact, in many parts of the world the point is fast
approaching where organisations either conform to society's environmental
expectations, or they will no longer be allowed to operate. So quite
apart from any personal feelings which anyone may have on the issue,
environmental probity is becoming a prerequisite for corporate survival.
Chapter 2 explained how society's expectations take the form of municipal,
regional and national environmental standards. Some organisations
also have their own sometimes even more stringent corporate standards.
A failure mode is said to have environmental consequences if it could lead
to the breach of any of these standards.

A failure mode has environmental consequences

if it causes a loss of function or other damage
which could lead to the breach of any known
environmental standard or regulation
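The order in which the categories are considered can be sketched as a short Python function. The enum, the function and the four yes/no inputs are illustrative assumptions rather than part of the RCM worksheets. The sketch simply checks for a hidden failure first and then works through the evident categories in descending order of importance.

```python
from enum import Enum


class Consequence(Enum):
    HIDDEN = "hidden failure consequences"
    SAFETY_ENVIRONMENTAL = "safety and environmental consequences"
    OPERATIONAL = "operational consequences"
    NON_OPERATIONAL = "non-operational consequences"


def categorise(evident, could_hurt_or_kill, could_breach_env_standard,
               affects_operations):
    """Assign one failure mode to a consequence category.

    Each argument is a yes/no answer about the failure mode occurring
    on its own; the names are hypothetical.
    """
    if not evident:
        return Consequence.HIDDEN
    if could_hurt_or_kill or could_breach_env_standard:
        return Consequence.SAFETY_ENVIRONMENTAL
    if affects_operations:
        return Consequence.OPERATIONAL
    return Consequence.NON_OPERATIONAL


# A crane hook failure: evident, and someone could be hurt.
print(categorise(True, True, False, True).value)
# An evident failure which only stops production.
print(categorise(True, False, False, True).value)
```

Because safety and the environment are checked before operations, a failure which both threatens safety and stops production is placed in the safety and environmental category.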
Note that when considering whether a failure mode has safety or environmental
consequences, we are considering whether one failure mode on its
own could have the consequences. This is different from part 6 of this
chapter, in which we consider the failure of both elements of a protected system.
The Question of Risk
Much as most people would like to live in an environment where there is
no possibility at all of death or injury, it is generally accepted that there
is an element of risk in everything we do. In other words, absolute zero
is unattainable, even though it is a worthy target to keep striving for. This
immediately leads us to ask what is attainable.
To answer this question, we first need to consider the question of risk
in more detail.
Risk assessment consists of three elements. The first asks what could
happen if the event under consideration did occur. The second asks how
likely it is for the event to occur at all. The combination of these two
elements provides a measure of the degree of risk. The third and often the
most contentious element asks whether this risk is tolerable.



For example, consider a failure mode which could result in death or injury to ten
people (what could happen). The probability that this failure mode could occur is
one in a thousand in any one year (how likely it is to occur). On the basis of these
figures, the risk associated with this failure is:
10 x (1 in 1 000) = 1 casualty per 100 years
Now consider a second failure mode which could cause 1 000 casualties, but the
probability that this failure could occur is one in 100 000 in any one year. The risk
associated with this failure is:
1 000 x (1 in 100 000) = 1 casualty per 100 years
In these examples, the risk is the same although the figures upon which it is
based are quite different. Note also that these examples do not indicate whether
the risk is tolerable - they merely quantify it. Whether or not the risk is
tolerable is a separate and much more difficult question which is dealt with later.
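The arithmetic in the two examples above can be checked with a short sketch. It is only an illustration: the function name `annual_risk` is hypothetical, and exact fractions are used simply to avoid rounding noise.

```python
from fractions import Fraction


def annual_risk(casualties_per_event: int, annual_probability: Fraction) -> Fraction:
    """Expected casualties per year: what could happen x how likely it is."""
    return casualties_per_event * annual_probability


# First failure mode: ten casualties, 1 in 1 000 chance in any one year.
first = annual_risk(10, Fraction(1, 1_000))
# Second failure mode: 1 000 casualties, 1 in 100 000 chance in any one year.
second = annual_risk(1_000, Fraction(1, 100_000))

for label, risk in (("first", first), ("second", second)):
    # 1 / risk is the average number of years per casualty.
    print(f"{label}: {risk} casualties per year, or 1 casualty per {1 / risk} years")
```

Both calls give the same figure, which is the point of the example: the number quantifies the risk but says nothing about whether it is tolerable.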

The three elements of risk are now considered in more detail.
What could happen?
Two issues need to be considered when considering what could happen
if a failure were to occur. These are what actually happens and whether
anyone is likely to be hurt or killed as a result.
What actually happens if any failure mode occurs should be recorded
on the RCM Information Worksheet as its failure effects, as explained at
length in Chapter 4. Part 5 of Chapter 4 also listed a number of typical
effects which pose a threat to safety or the environment.
The fact that these effects could hurt or kill someone does not necessarily
mean that they will do so every time they occur. Some may even
occur quite often without doing so. However, the issue is not whether
such consequences are inevitable, but whether they are possible.
For example, if the hook were to fail on a travelling crane used to carry steel coils,
the falling load would only hurt or kill anyone who happened to be close
to it at the time. If no-one was nearby, then no-one would get hurt.
However, the possibility that someone could be hurt means that this failure mode
should be treated as a safety hazard and analysed accordingly.



This example demonstrates the fact that the RCM process assesses safety
consequences at the most conservative level. If it is reasonable to assume
that any failure mode could affect safety or the environment, we assume
that it can, in which case it must be subjected to further analysis. (We see
later that the likelihood that someone will be hurt is taken into consideration
when evaluating the tolerability of the risk.)
A more complex situation arises when dealing with safety hazards that
are already covered by some form of built-in protection. We have seen
that one of the main objectives of the RCM process is to establish the most
effective way of managing each failure in the context of its consequences.
This can only be done if these consequences are evaluated to begin with
as if nothing was being done to manage the failure (in other words, to
predict or prevent it or to mitigate its consequences).
Protective devices which are designed to deal with the failed or the
failing state (alarms, shutdowns and relief systems) are nothing more than
built-in failure management systems. As a result, to ensure that the rest
of the analysis is carried out from an appropriate zero-base, the
consequences of the failure of protected functions should ideally be assessed
as if protective devices of this type are not present.
For example, a failure which could cause a fire is always regarded as a safety
hazard, because the presence of a fire-extinguishing system does not necessarily
guarantee that the fire will be controlled and extinguished.

The RCM process can then be used to validate the suitability of the
protective device itself from three points of view:
• its ability to provide the required protection. This is done by defining
  the function of the protective device
• whether the protective device responds fast enough to avoid the
  consequences, as discussed in Chapter 7
• what must be done to ensure that the protective device continues to
  function in its turn, as discussed in part 6 of this chapter and Chapter 8.
How likely is the failure to occur?
Part 4 of Chapter 4 mentions that only failure modes which are reasonably
likely to occur in the context in question should be listed on the RCM
Information Worksheet. As a result, if the Information Worksheet has
been prepared on a realistic basis, the mere fact that the failure mode has
been listed suggests that there is some likelihood that it could occur, and
therefore that it should be subjected to further analysis.
(Sometimes it may be prudent to list a wildly unlikely but nonetheless
dangerous failure mode in an FMEA, purely to place on record the fact
that it was considered and then rejected. In these cases, a comment like
'This failure mode is considered too unlikely to justify further analysis'
should be recorded in the failure effects column.)

Is the risk tolerable?

One of the most difficult aspects of the management of safety is the extent
to which beliefs about what is tolerable vary from individual to individual
and from group to group. A wide variety of factors influence these beliefs,
by far the most dominant of which is the degree of control which
any individual thinks he or she has over the situation. People are nearly
always prepared to tolerate a higher level of risk when they believe that
they are personally in control of the situation than when they believe that
the situation is out of their control.
For example, people tolerate much higher levels of risk when driving their own
cars than they do as aircraft passengers. (The extent to which this issue governs
perceptions of risk is shown by the startling statistic that 1 person in 11 000 000
who travels by air between New York and Los Angeles in the USA is likely to be
killed while doing so, while 1 person in 14 000 who makes the trip by road is
likely to be killed. And yet some people insist on making this trip by road
because they believe that they are safer.)

This example illustrates the relationship between the probability of being
killed which any one person is prepared to tolerate and the extent to which
that person believes he or she is in control. In more general terms, this
might vary for a particular individual as shown in Figure 5.2.
[Figure 5.2: probability of fatal risk which one individual might tolerate,
plotted against four degrees of perceived control:
• I believe I have complete control (driving my car or in my home)
• I believe I have some control and some choice about exposing myself
  (on the site where I work)
• I believe I have no control, but I don't have to expose myself
  (in a passenger aircraft)
• I have no control, and no choice about exposing myself and/or my family
  (off-site exposure to industrial accidents)]

These figures do not necessarily reflect the views of the author - they
merely illustrate what one individual might decide that he or she is prepared
to tolerate. Note also that they are based on the perspective of one individual
going about his or her daily business. This view then has to be translated
into a degree of risk for the whole population (all the workers on a site, all
the citizens of a town and so on). In other words, if I tolerate a probability
of 1 in 100 000 (10⁻⁵) of being killed at work in any one year and I have
1 000 co-workers who all share the same view, then we all tolerate that on
average one person on our site will be killed at work every 100 years - and
that person may be me, and it may happen this year.
Bear in mind that any quantification of risk in this fashion can only be
approximate. In other words, if I say that I tolerate a probability of 10⁻⁵,
it is never more than a ballpark figure. It indicates that I am prepared
to tolerate a probability of being killed at work which is lower than that
which I tolerate when I use the roads.
Again bearing in mind that we are dealing with approximations, the next
step is to translate the probability which each of us is prepared to tolerate
that any one of us might be killed at work into a tolerable probability for
each single event (failure mode or multiple failure) which could kill someone.
For example, continuing the logic of the previous example, the probability that any
one of my 1 000 co-workers will be killed in any one year is 1 in 100 (assuming
that everyone on the site faces the same hazards). Furthermore, if the activities
carried out on the site embody (say) 10 000 events which could kill someone,
then the average probability that each event could kill one person must
be reduced to 10⁻⁶ in any one year (1 in 100 divided by 10 000). This means that
the probability of an event which is likely to kill ten people must be reduced
to 10⁻⁷, while the probability of an event which has a 1 in 10 chance of killing
one person must be reduced to 10⁻⁵.
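The step from an individual's tolerable risk to a tolerable probability for each event can be sketched as follows. This is a minimal illustration under the same assumptions as the example (everyone on site faces the same hazards, and the tolerated site-wide risk is shared equally between events); the function and parameter names are hypothetical.

```python
from fractions import Fraction


def tolerable_event_probability(individual_risk, workforce, lethal_events,
                                expected_deaths_per_event=1):
    """Tolerable annual probability of one event which could kill someone.

    Assumes equal exposure for all workers and an equal share of the
    tolerated site-wide risk for every potentially lethal event.
    """
    site_deaths_per_year = individual_risk * workforce
    deaths_per_event_per_year = site_deaths_per_year / lethal_events
    return deaths_per_event_per_year / expected_deaths_per_event


individual = Fraction(1, 100_000)  # each person tolerates 1 in 100 000 per year

# Event which kills one person, ten people, or has a 1 in 10 chance of killing one.
one_death = tolerable_event_probability(individual, 1_000, 10_000)
ten_deaths = tolerable_event_probability(individual, 1_000, 10_000, 10)
one_in_ten = tolerable_event_probability(individual, 1_000, 10_000, Fraction(1, 10))

print(one_death, ten_deaths, one_in_ten)
```

Dividing by the expected number of deaths per event is what makes the more lethal event ten times less tolerable and the less lethal one ten times more tolerable.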

The techniques by which one moves up and down a hierarchy of probability
in this fashion are known as probabilistic or quantitative risk assessments.
The key points to bear in mind at this stage are that:
• the decision as to what is tolerable should start with the likely victim.
  How one might involve such 'likely victims' in this decision in the
  industrial context is discussed later in this chapter
• it is possible to link what one person tolerates quantitatively to a
  tolerable probability of individual failure modes.

Although the degree of control usually dominates decisions about
the tolerability of risk, it is by no means the only issue. Other factors
which help us decide what is tolerable include the following:
• individual values: To explore this issue in any depth is well beyond the
  scope of this book. Suffice it to contrast the views on tolerable risk
  likely to be held by a mountaineer with those of someone who suffers
  from vertigo, or those of an underground miner with those of someone
  who suffers from claustrophobia
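• industry values: While every industry nowadays recognises the need to
  operate as safely as possible, there is no escaping the fact that some are
  intrinsically more dangerous than others. Some even compensate for higher
  levels of risk with higher pay. The views of any individual who works in
  that industry boil down to his or her perception of whether the intrinsic
  risks are 'worth it' - in other words, whether the benefits justify the risk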
• the effect on future generations: The safety of children - especially
  unborn children - has an especially powerful effect on people's views
  about what is tolerable. Adults frequently display a surprising and even
  distressing disregard for their own safety.

For example, the author worked with one group which had occasion to discuss
the properties of a certain chemical. Words like 'toxic' and 'carcinogenic' were
treated with indifference, even though most of the members of this group were
the people most at risk. However, as soon as it emerged that the chemical was
also mutagenic and teratogenic, and the meaning of these words was explained
to the group, the chemical was suddenly viewed with much greater respect.

• knowledge: Perceptions of risk are greatly influenced by how much
  people know about the asset, the process of which it forms part and the
  failure mechanisms associated with each failure mode. The more they
  know, the better their judgement. (Ignorance is often a two-edged sword.
  In some situations people take the most appalling risks out of sheer
  ignorance, while in others they wildly exaggerate the risks - also out of
  ignorance. On the other hand, we need to remind ourselves constantly
  of the extent to which familiarity can breed contempt.)

Many other factors also influence perceptions of risk, such as the value placed
on human life by different cultural groups, and even factors such as the age
and marital status of the individual.



All of these factors mean that it is impossible to specify a standard of
tolerability for any risk which is absolute and objective.

Who should evaluate risks?

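The very diversity of the factors discussed above means that it is simply
not possible for any one person - or even one organisation - to assess risk
in a way which will be universally acceptable. An assessor who is too
conservative is likely to be ignored, while one who is too relaxed might
end up being accused of playing with people's lives.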

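This suggests that a satisfactory assessment of risk can only be
done by a group. As far as possible, this group should represent people
who are likely to have a clear understanding of the failure mechanism, the
failure effects (especially the nature of any hazards), the likelihood of the
failure occurring and what possible measures can be taken to anticipate or
prevent it. The group should also include people who have a legitimate
view on the tolerability or otherwise of the risks. This means representatives
of the likely victims (most often operators or maintainers in the case
of direct safety hazards) and of management (who are usually
held accountable if someone is hurt or if an environmental standard is breached).
If it is applied in a properly focused and structured fashion, the collective
wisdom of such a group will do much to ensure that the organisation
does its best to identify and manage all the failure modes that could affect
safety or the environment. (The use of such groups is in keeping with the
worldwide trend towards laws which say that safety is the responsibility
of all employees, not just the responsibility of management.)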
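Groups of this nature can usually reach consensus fairly quickly when
dealing with direct safety hazards, because they include the people at risk.
Environmental hazards are not quite so simple, because society at large is the
'likely victim' and many of the issues involved are unfamiliar. So
any group which is expected to consider whether a failure could breach
an environmental standard or regulation must find out beforehand which
of these standards and regulations cover the process under review.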
Safety and Proactive Maintenance
If a failure could affect safety or the environment, the RCM process
stipulates that we must try to prevent it. The above discussion suggests that:

For failure modes which have safety or environmental consequences,
a proactive task is only worth doing if it reduces
the probability of the failure to a tolerably low level
If a proactive task cannot be found which achieves this objective to the
satisfaction of the group performing the analysis, we are dealing with a
safety or environmental hazard which cannot be adequately anticipated
or prevented. This means that something must be changed in order to
make the system safe. This 'something' could be the asset itself, a process
or an operating procedure. Once-off changes of this sort are classified as
'redesigns', and are usually undertaken with one of two objectives:
• to reduce the probability of the failure occurring to a tolerable level
• to change things so that the failure no longer has safety or environmental
  consequences.
The question of redesign is discussed in more detail in Chapter 9.
Note that when dealing with safety and environmental issues, RCM
does not raise the question of economics. If it is not safe we have an
obligation either to prevent it from failing, or to make it safe. This suggests
that the decision process for failure modes which have safety or
environmental consequences can be summarised as shown in Figure 5.3 below:


[Figure 5.3: Identifying and developing a maintenance strategy for a failure
which affects safety or the environment. The diagram refers the reader to
Parts 4 and 5 of this chapter.]

• total output. This occurs when equipment stops working altogether
  or when it works too slowly. This results either in increased production
  costs if the plant has to work extra time to catch up, or lost
  sales if the plant is already fully loaded.
• product quality. If a machine can no longer hold manufacturing
  tolerances or if a failure causes materials to deteriorate, the likely
  result is either scrap or expensive rework. In a more general sense,
  'quality' also covers concepts such as the precision of navigation
  systems, the accuracy of targeting systems and so on.

• customer service. Failures affect customer service in many ways, ranging
  from the late delivery of orders to the late departure of passenger
  aircraft. Frequent or serious delays sometimes attract heavy
  penalties, but in most cases they do not result in an immediate loss of
  revenue. However chronic service problems eventually cause customers
  to lose confidence and take their business elsewhere.


In enterprises such as military undertakings, certain failures affect the
ability of the organisation to fulfil its primary function, sometimes with
devastating results.
"For want of a nail, a shoe was lost. For want of a shoe, a horse was lost. For want
of a horse, a message was lost. For want of a message, a battle was lost. For want
of a battle, a war was lost. All for want of a horseshoe nail."
While it may be difficult to cost out the results of losing a war, failures
of this sort still have economic implications at a more mundane level. If
they occur too often, it may be necessary to keep two horses in order
to ensure that one will be available to do the job, or to keep extra
battle tanks. Redundancy on this scale can be very expensive.
The severity of these consequences means that if an evident failure does
not pose a threat to safety or the environment, the RCM process focuses
next on the operational consequences of failure.

A failure has operational consequences if it has a
direct adverse effect on operational capability
As we have seen, these consequences tend to be economic in nature, so
they are usually evaluated in economic terms. However, in certain more
extreme cases (such as losing a war), the 'cost' may have to be evaluated
on a more qualitative basis.
Avoiding Operational Consequences
The overall economic effect of any failure mode which has operational
consequences depends on two factors:
• how much the failure costs each time it occurs, in terms of its effect on
  operational capability plus repair costs
• how often it happens.
In the previous section of this chapter, we did not pay much attention to
how often failures are likely to occur. (Failure rates have little bearing on
safety-related failures, because the objective in these cases is to avoid any
failures on which to base a rate.) However, if the failure consequences are
economic, the total cost is affected by how often the consequences are
likely to occur. In other words, to assess the economic impact of these
failures, we need to assess how much they are likely to cost over a period
of time.

Consider for example the pump shown in Figure 2.1 and again in Figure 5.4.
The pump is controlled by one float switch which activates it when the level in
Tank Y drops to 120 000 litres, and another that turns it off when the level in
Tank Y reaches 240 000 litres. A low level alarm is located just below the
120 000 litre level. If the tank runs dry, the downstream process has
to be shut down. This costs the organisation using the pump 5 000 per hour.

[Figure 5.4: Stand-alone pump. Pump can deliver up to 1 000 litres of water
per minute. Offtake from tank: 800 litres/minute.]

Figure 5.5: FMEA for bearing failure on the stand-alone pump

Assume that it has already been agreed that one failure mode which can affect
this pump is 'bearing seizes due to normal wear and tear'. For the sake of
simplicity, assume that the motor on this pump is equipped with an overload
switch, but there is no trip alarm wired to the control room.
This failure mode and its effects might be described on an RCM Information
Worksheet as shown in Figure 5.5 above.
Water is drawn out of the tank at a rate of 800 litres per minute, so the tank runs
dry 2.5 hours after the low level alarm sounds. It takes 4 hours to replace the
bearing, so the downstream process stops for 1.5 hours. So this failure costs:
1.5 x 5 000 = 7 500
in lost production every three years, plus the cost of replacing the bearing.
Assume that it is technically feasible to check the bearing for audible noise
once a week (the basis upon which we make this kind of judgement is discussed
in the next chapter). If the bearing is found to be noisy, the operational
consequences of failure can be avoided by ensuring that the tank is full before
starting work on the bearing. This provides five hours of storage, so the bearing
can now be replaced in four hours without interfering with the downstream process.
Assume also that the pump is located in an unmanned pumping station. It has
been agreed that the check should be carried out by a maintenance craftsman,
and that the total time needed to do each check is twenty minutes. Assume further
that the total cost of his time is 24 per hour, in which case it costs 8 to do
each check. If the MTBF of the bearing is 3 years, he will do about 150 checks
per failure. In other words, the cost of the checks is:
150 x 8 = 1 200
every three years, plus the cost of replacing the bearing.
In this example, the scheduled task is clearly cost-effective relative to the
cost of the operational consequences of the failure.
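The figures in the stand-alone pump example can be reproduced with a few lines of arithmetic. The sketch below is only a check of the numbers quoted in the text; the function and constant names are hypothetical, and the cost of replacing the bearing is left out because it is incurred in both cases.

```python
def hours_of_storage(volume_litres: float, offtake_litres_per_min: float) -> float:
    """Time for the offtake to drain a given volume, in hours."""
    return volume_litres / offtake_litres_per_min / 60


DOWNTIME_COST_PER_HOUR = 5_000  # cost of stopping the downstream process
REPAIR_HOURS = 4                # time needed to replace the bearing
CHECKS_PER_FAILURE = 150        # weekly checks over a 3-year MTBF, rounded as in the text
COST_PER_CHECK = 8              # twenty minutes of a craftsman's time

# No scheduled maintenance: storage is measured from the low level alarm,
# which sits just below the 120 000 litre level.
time_to_dry = hours_of_storage(120_000, 800)
stoppage_hours = max(0.0, REPAIR_HOURS - time_to_dry)
cost_of_consequences = stoppage_hours * DOWNTIME_COST_PER_HOUR

# Weekly noise check: the tank is filled before work starts on the bearing.
buffer_when_full = hours_of_storage(240_000, 800)
cost_of_checks = CHECKS_PER_FAILURE * COST_PER_CHECK

print(f"runs dry after {time_to_dry} h, process stops for {stoppage_hours} h")
print(f"cost per failure without the task: {cost_of_consequences}")
print(f"storage with a full tank: {buffer_when_full} h, cost of checks: {cost_of_checks}")
```

Because a full tank gives more storage than the repair needs, the check removes the operational consequences entirely, and the comparison reduces to the cost of the checks against the cost of the stoppage.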
This suggests that if a failure has operational consequences, the basis for
deciding whether a proactive task is worth doing is economic, as follows:

For failure modes with operational consequences, a
proactive task is worth doing if, over a period of time, it
costs less than the cost of the operational consequences plus
the cost of repairing the failure which it is meant to prevent
Conversely, if a cost-effective proactive task cannot be found, then it is
not worth doing any scheduled maintenance to try to anticipate or prevent
the failure mode under consideration. In some cases, the most cost-effective
option at this point might simply be to decide to live with the failure.
However, if a proactive task cannot be found and the failure consequences
are still intolerable, it may be desirable to change the design of
the asset (or to change the process) in order to reduce total costs by:
• reducing the frequency (and hence the total cost) of the failure
• reducing or eliminating the consequences of the failure
• making a proactive task cost-effective.
Redesign is discussed in more detail in Chapter 9.
Note that in the case of a failure mode with safety and environmental
consequences, the objective is to reduce the probability of the failure to a very
low level indeed. In the case of operational consequences, the objective
is to reduce the probability (or frequency) to an economically tolerable
level. As mentioned at the start of part 3 of this chapter, this frequency is
likely to be several orders of magnitude greater than we would tolerate for
most safety hazards, so the RCM process assumes that a proactive task
which reduces the probability of a safety-related failure to a tolerable
level will also deal with the operational consequences of that failure.
To begin with, we again only consider the desirability of making changes
after we have established whether it is possible to extract the desired
performance from the asset as it is currently configured. However, in this
case modifications also need to be cost-justified, whereas they were the
compulsory default action for failure modes with safety or environmental
consequences.
In the light of these comments, the decision process for failures with
operational consequences can be summarised as shown in Figure 5.6:

[Figure 5.6: Identifying and developing a maintenance strategy for a failure
which has operational consequences. The diagram asks: does the failure mode
have a direct adverse effect on operational capability? (See Part 5 of this
chapter.) If a cost-effective proactive task cannot be found, the default
decision is no scheduled maintenance, but it might be worth redesigning the
asset or changing the process to reduce total costs.]

Note that this analysis is carried out for each individual failure mode, and
not for the asset as a whole. This is because each proactive task is designed
to prevent a specific failure mode, so the economic feasibility of each task
can only be assessed with reference to the costs of the failure mode which
it is meant to prevent. In each case, it is a simple go/no go decision.
When considering individual failure modes in this way, it is not always
necessary to do a detailed cost-benefit study based on actual
downtime costs and MTBFs as shown in the pump example above. This is
because the economic desirability of proactive tasks is often intuitively
obvious when assessing failure modes with operational consequences.
However, whether or not the economic consequences are evaluated
formally, this aspect of the RCM process must still be applied
thoroughly. (In practice, this step is surprisingly often overlooked by people
new to the process. Maintenance people in particular have a tendency to
implement tasks on the basis of technical feasibility alone, which results
in needlessly costly maintenance programs.)
Finally, bear in mind that the operational consequences of any failure
are heavily influenced by the context in which the asset is operating. This is
yet another reason why care should be taken to ensure that the context
is identical before applying a maintenance program developed for one
asset to another. The key issues were discussed in Part 3 of Chapter 2.

5.5 Non-operational Consequences

The consequences of an evident failure which has no direct adverse effect
on safety, the environment or operational capability are classified as
non-operational. The only consequences associated with these failures are the
direct costs of repair, so these consequences are also economic.
Consider for example the pumps shown in Figure 5.7. This set-up is similar to that
shown in Figure 5.4, except that there are now two pumps (both identical to the
pump in Figure 5.4).

[Figure 5.7: Pump with stand-by. Offtake from tank: 800 litres/minute.]

The duty pump is switched on by one float switch when the level in Tank Y
drops to 120 000 litres, and switched off by another when the level reaches
240 000 litres. A third switch is located just below the low level switch,
and this switch is designed both to sound an alarm in the control room if the water
level reaches it, and to switch on the stand-by pump. If the tank runs dry, the
downstream process has to be shut down. This also costs the organisation which
uses the pump 5 000 per hour.
As before, assume that it has been agreed that one failure mode which can
affect the duty pump is 'bearing seized', and that this seizure is caused by normal
wear and tear. Assume that the motor on the duty pump is also equipped with an
overload switch, but again there is no trip alarm wired to the control room. This
failure mode and its effects might be described on an RCM Information
Worksheet as shown in Figure 5.8:

Figure 5.9: FMEA for failure

In this example, the stand-by pump is switched on when the duty pump fails, so
the tank does not run dry. So the only cost associated with this failure is:
the cost of replacing the bearing.
Assume however that it is still technically feasible to check the bearing for
noise once a week. If the bearing were found to be noisy, the operators would
switch over manually to the stand-by pump and the bearing would be replaced.
Assume that these pumps are also located in an unmanned pumping station,
and that it has again been agreed that the check - which also takes twenty
minutes - should be done by a maintenance craftsman at a cost of 8 per check.
If the MTBF of the bearing is 3 years once again, he will do about 150 checks
per failure. In other words, the cost of the proactive maintenance program per
failure is:
150 x 8 = 1 200 plus the cost of replacing the bearing.

In this example, the cost of doing the scheduled task is now much greater
than the cost of not doing it. As a result, it is not worth doing the proactive
task even though the pump is technically identical to the stand-alone pump
described earlier. This suggests that it is only worth trying to prevent a failure
which has non-operational consequences if, over a period of time, the cost
of the preventive task is less than the cost of correcting the failure. If it is
not, then scheduled maintenance is not worth doing.

For failure modes with non-operational consequences, a
proactive task is worth doing if, over a period of time, it costs less
than the cost of repairing the failures it is meant to prevent
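The two pump examples apply the same economic test and reach opposite conclusions. The sketch below expresses that test as a single comparison. It is an illustration only: the function name is hypothetical, and the cost of replacing the bearing is an assumed figure, since the text gives none (it appears on both sides of the comparison, so the outcome does not depend on it).

```python
def proactive_task_worth_doing(task_cost: float,
                               operational_cost: float,
                               repair_cost_without_task: float,
                               repair_cost_with_task: float = 0.0) -> bool:
    """Compare the total cost per failure cycle with and without the task."""
    cost_without_task = operational_cost + repair_cost_without_task
    cost_with_task = task_cost + repair_cost_with_task
    return cost_with_task < cost_without_task


BEARING_REPLACEMENT = 300  # assumed value for illustration, not from the text
COST_OF_CHECKS = 150 * 8   # about 150 weekly checks at 8 each

# Stand-alone pump: a failure stops the downstream process (7 500 lost).
stand_alone = proactive_task_worth_doing(
    COST_OF_CHECKS, 7_500, BEARING_REPLACEMENT, BEARING_REPLACEMENT)

# Pump with stand-by: the tank does not run dry, so only the repair remains.
with_stand_by = proactive_task_worth_doing(
    COST_OF_CHECKS, 0, BEARING_REPLACEMENT, BEARING_REPLACEMENT)

print(f"stand-alone pump: {stand_alone}, pump with stand-by: {with_stand_by}")
```

With no operational consequences to avoid, the checks only add cost, which is why the same task on a technically identical pump is no longer worth doing.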

The third possibility is that the protective device fails while the protected
function is still working. In this case, the failure has no direct consequences.
In fact no-one even knows that the protective device is in a failed state.
For example, if the pressure relief valve was jammed shut, no-one would be
aware of the fact as long as the pressure in the vessel remained within normal
operating limits. Similarly, if Pump C were to fail somehow while Pump B was still
working, no-one would be aware of the fact unless or until Pump B also failed.

The above discussion suggests that hidden functions can be identified by
asking the following question:

Will the loss of function caused by this failure
mode on its own become evident to the
operating crew under normal circumstances?

If the answer to this question is no, the failure mode is hidden. If the answer
is yes, it is evident. Note that in this context, 'on its own' means that nothing
else has failed. Note also that we assume at this point in the analysis that no
attempts are made to check whether the hidden function is still working.
This is because such checks are a form of scheduled maintenance,
and the whole purpose of the analysis is to find out whether such maintenance
is necessary. These two issues are discussed in more detail later in
this chapter.
The fourth possibility is that the protective device fails, then the protected
function fails while the protective device is still in a failed state. This
situation is known as a multiple failure. It can arise because the failure of
the protective device is not evident, and so no-one would be aware of the need
to take corrective or alternative action to avoid the multiple failure.

A multiple failure only occurs if a protected function
fails while the protective device is in a failed state

[Figure 5.10: Failure of a protective device whose function is hidden.
1: Failure of non-fail-safe device is not evident to operators.
2: No action taken to shut down the protected function or to provide other
protection, because no-one knows that the protective device has failed.
3: If protected function fails here, the result is a multiple failure.]

The sequence of events which leads to a multiple failure is summarised
in Figure 5.10.

In the case of the relief valve, if the pressure in the vessel rises excessively while
the valve is jammed, the vessel will probably explode (unless someone acts very
quickly or unless there is other protection in the system). If Pump B fails while
Pump C is in a failed state, the result will be total loss of pumping.

Given that failure prevention is mainly about avoiding the consequences
of failure, this example also suggests that when we develop maintenance
programs for hidden functions, our objective is actually to prevent - or at
least to reduce the probability of - the associated multiple failure.

The objective of a maintenance program for a
hidden function is to prevent - or at least to reduce
the probability of - the associated multiple failure

How hard we try to prevent the hidden failure depends on the consequences
of the multiple failure.

For example, Pumps B and C might be pumping cooling water to a nuclear
reactor. In this case, if the reactor could not be shut down fast enough, the ultimate
consequences of the multiple failure could be a melt-down, with catastrophic safety,
environmental and operational consequences.
On the other hand, the two pumps might be pumping water into a tank which
has enough capacity to supply a downstream process for two hours. In this case,
the consequence of the multiple failure would be that production stops after two
hours, and then only if neither of the pumps could be repaired before the tank ran
dry. Further analysis might suggest that at worst, this multiple failure might cost
2 000 in lost production.

In the first of these examples, the consequences of the multiple failure are
very serious indeed, so we would go to great lengths to preserve the
integrity of the hidden function. In the second case, the consequences of the
multiple failure are purely economic, and how much it costs would influence
how hard we would try to prevent the hidden failure.

Further examples of hidden failures and the multiple failures which could
follow if they are not detected are:
• vibration switches: A vibration switch designed to shut down a fan might
  be configured in such a way that its failure is hidden. However, this only
  matters if the fan vibration rises above tolerable limits (a second failure),
  which could damage the fan bearings and possibly the fan itself (the
  multiple failure)
• ultimate level switches: Ultimate level switches are designed to activate
  an alarm or shut down equipment if a primary level switch fails to
  operate. In other words, if an ultimate low level switch fails, there are
  no consequences unless the primary switch also fails (the second failure),
  in which case the vessel or tank would run dry (the multiple failure)
• fire hoses: The failure of a fire hose has no direct consequences. It only
  matters if there is a fire (a second failure), when the failed hose may result
  in the fire spreading unchecked (the multiple failure).
Other typical hidden functions include emergency medical equipment,
most types of fire detection, fire warning and fire fighting equipment,
containment structures, emergency stop buttons and trip wires, overload or
overspeed protection devices, pressure and temperature switches, stand-by
plant, redundant structural components, over-current circuit breakers and
fuses and emergency power supply systems.

The Required Availability of Hidden Functions

So far, this part of this chapter has defined hidden failures and described
the relationship between protective devices and hidden functions. The
next question involves a closer look at the performance we require from
hidden functions.
One of the most important conclusions which has been drawn so far is
that the only direct consequence of a hidden failure is increased exposure
to the risk of a multiple failure. Since it is the latter which we most wish
to avoid, a key element of the performance required from a hidden function
must be connected with the associated multiple failure.
We have seen that where a system is protected by a device which is not
fail-safe, a multiple failure only occurs if the protected function fails
while the protective device is in a failed state, as illustrated in Figure 5.10.

So the probability of a multiple failure in any period must be given by
the probability that the protected function will fail while the protective
device is in a failed state during the same period. Figure 5.11 shows that
this can be calculated as follows:

Probability of a multiple failure =
    Probability of failure of the protected function
    x Average unavailability of the protective device

The tolerable probability of the multiple failure is determined by the users
of the asset, as discussed in the next part of this chapter and
Appendix 3. The probability of failure of the protected function is usually
given. So if these two variables are known, the allowed unavailability
of the protective device can be expressed as follows:

Allowed unavailability of the protective device =
    Tolerable probability of a multiple failure
    / Probability of failure of the protected function

So a crucial element of the performance required from any hidden function
is the availability required to reduce the probability of the associated
multiple failure to a tolerable level. The above discussion suggests that
this availability is determined in the following three stages:
• first establish what probability the organisation is prepared to tolerate
  for the multiple failure
• then determine the probability that the protected function will fail in the
  period under consideration (this is also known as the demand rate)
• finally, determine what availability the hidden function must achieve
  to reduce the probability of the multiple failure to the required level.
When calculating the risks associated with protected systems, there is
sometimes a tendency to regard the probability of failure of the protected
and protective devices as fixed. This leads to the belief that the only way
to change the probability of a multiple failure is to change the hardware
(in other words, to modify the system), perhaps by adding more protection
or by replacing existing components with ones which are thought to
be more reliable.
In fact, this belief is incorrect, because it is usually possible to vary both
the probability of failure of the protected function and (especially) the
unavailability of the protective device by adopting suitable maintenance
and operating policies. As a result, it is also possible to reduce the
probability of the multiple failure to almost any desired level within reason by
adopting such policies. (Zero is of course an unattainable ideal.)

1 17

Figure 5.11:

The probability that a protected function will fail in any period is the inverse of its mean time between failures, as illustrated in Figure 5.11a.

Figure 5.11a: Protected function. Probability of failure in any one year = 1 in 4.

The probability that the protective device will be in a failed state at any one time is given by the percentage of time for which it is in a failed state. This is measured by its unavailability (also known as downtime or fractional dead time), as shown in Figure 5.11b.

Figure 5.11b: Protective device. Availability 67%. If the average unavailability of the protective device is 33%, then the probability that it will be in a failed state at any point in time is 1 in 3.

The probability of the multiple failure is calculated by multiplying the probability of failure of the protected function by the average unavailability of the protective device. For the case described in Figures 5.11a and 5.11b, the probability of a multiple failure would be as follows:

The probability of a multiple failure in any one year: 1 in 4 x 1 in 3 = 1 in 12
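The multiplication described for Figure 5.11 can be sketched in a few lines of Python. This is only an illustration under the simplifying assumptions used in the text (the yearly probability of failure is taken as the inverse of the mean time between failures, and the two failures are treated as independent); the function name and parameters are hypothetical and do not come from the book.

```python
from fractions import Fraction


def multiple_failure_probability(mtbf_years, unavailability):
    """Probability of a multiple failure in any one year (illustrative sketch).

    mtbf_years: mean time between failures of the protected function, in years.
    unavailability: average fraction of time the protective device is failed.
    Assumes the yearly probability of failure is 1 / MTBF, as in the text.
    """
    p_protected_fails = Fraction(1) / Fraction(mtbf_years)
    return p_protected_fails * Fraction(unavailability)


# Figures from the example: 1 in 4 per year, device failed 1 in 3 of the time.
print(multiple_failure_probability(4, Fraction(1, 3)))  # 1/12
```

Exact fractions are used so that the "1 in 12" result appears without rounding error.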



The use of the pumps in Figure 5.7 may be such that the users are only prepared to tolerate a probability of multiple failure of less than 1 in 1 000 in any one year (10^-3). Assume that it has also been estimated that if the duty pump is suitably maintained, the mean time between unanticipated failures of the duty pump can be increased to ten years, which corresponds to a probability of failure in any one year of one in ten, or 10^-1. So to reduce the probability of the multiple failure to less than 10^-3, the unavailability of the stand-by pump must not be allowed to exceed 10^-2, or 1%. In other words, it must be maintained in such a way that its availability exceeds 99%. This is illustrated in Figure 5.12 below.
Figure 5.12: Availability of a protective device (protected function: probability of failure in any one year reduced to 1 in 10; protective device: unavailability 1%, availability 99%)
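The three-stage reasoning used in the pump example (decide the tolerable probability of the multiple failure, estimate the probability that the protected function fails, then derive the availability the protective device must achieve) can be sketched as follows. The function and variable names are hypothetical, and the sketch assumes the simple product relationship between the two probabilities that the text uses.

```python
from fractions import Fraction


def allowed_unavailability(tolerable_multiple_failure, protected_failure_probability):
    """Largest unavailability of the protective device that keeps the
    multiple failure within the tolerable level (illustrative sketch)."""
    p_protected = Fraction(protected_failure_probability)
    if not 0 < p_protected <= 1:
        raise ValueError("probability of failure must be in (0, 1]")
    ratio = Fraction(tolerable_multiple_failure) / p_protected
    # An unavailability above 100% is meaningless, so cap the result at 1.
    return min(ratio, Fraction(1))


# Stage 1: tolerable probability of the multiple failure (1 in 1 000 per year).
tolerable = Fraction(1, 1000)
# Stage 2: probability that the duty pump fails in any one year (1 in 10).
demand_rate = Fraction(1, 10)
# Stage 3: availability the stand-by pump must achieve.
unavailability = allowed_unavailability(tolerable, demand_rate)
print(unavailability, 1 - unavailability)  # 1/100 99/100
```

The cap at 1 covers the case where the protected function fails so rarely that no protection is needed to meet the tolerable level.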


The probability which is considered to be tolerable for any multiple failure depends on its consequences. In practice, it is the users of the asset who must assess what is deemed to be tolerable. To illustrate this point, Figure 5.13 shows such assessments for four different systems:

Failure of Protected Function | Failed State of Protective Device | Multiple Failure | Tolerable Rate of Multiple Failure
... | ... | ... memo ... | 10 per month?
... | ... | Motor burns out: 500 to rewind | 1 in 50
... | ... | Total loss of pumping capability: 10 000 in lost production | 1 in 1 000
Boiler over... | Relief valves ... | Boiler blows up: ... | 1 in 10 000 000

Figure 5.13: Multiple failure rates



Note that these levels of tolerability are not meant to be prescriptive and do not necessarily reflect the views of the author. They are meant to demonstrate that someone must decide what is tolerable before it is possible to decide on the level of protection needed, and that this assessment will differ for different systems.
Part 3 of this chapter suggested that if the multiple failure could affect safety, this 'someone' should be a group which includes the likely victims together with their managers. This is also true of multiple failures which have economic consequences.
In the case of the electric motor, the person most affected (the 'likely victim') will either be the manager responsible or the maintenance manager in person. In the case of loss of pumping capability, the sums involved mean that higher levels of management should become involved in setting tolerability criteria.

Figure 5.13 also suggests that the probability which any organisation might be prepared to tolerate for failures which have economic consequences tends to decrease as the magnitude of the consequences increases. This further suggests that it should be possible for any organisation to develop a schedule of tolerable 'standard' economic risks which could in turn be used to help develop maintenance programs designed to deliver those risks. This might take the form shown in Figure 5.14.

Figure 5.14: Tolerability of economic risk (vertical axis: tolerable probability, marked at 10^-1, 10^-2, 10^-3 and 10^-4; horizontal axis: cost of any one event, running from no cost through 1 000, 10 000, 100 000, 1 million and 10 million and over)

Note that these levels are given as an example and are not meant to be any kind of universal standard. The economic risks which any organisation is prepared to tolerate are a matter for that organisation to decide.
Figures 5.2 and 5.14 suggest that it may be possible to produce a schedule of risk which combines safety risks and economic risks in one continuum. How this might be done is discussed in Appendix 3.
In some cases, it may be unnecessary, and it is sometimes impossible, to perform a quantitative analysis of the probability of multiple failure in the manner described above. In such cases, it may be enough to make a judgment about the required availability of the protective device based on a qualitative assessment of the reliability of the protected function and the consequences of the multiple failure. This approach is discussed further in Chapter 8. However, if the multiple failure is particularly serious, a quantitative analysis should be performed.
The following paragraphs consider in more detail how it is possible to influence:
- the rate at which protected functions fail
- the availability of protective devices.

Routine Maintenance and Hidden Functions

In a system which incorporates a non-fail-safe protective device, the probability of a multiple failure can be reduced as follows:
reduce the rate of failure of the protected function by:
- doing some sort of proactive task
- changing the way in which the protected function is operated
- changing the design of the protected function.
increase the availability of the protective device by:
- doing some sort of proactive task
- checking periodically if the protective device has failed
- modifying the protective device.

We have seen that the probability of a multiple failure is partly based on the rate of failure of the protected function. This could almost certainly be reduced by improving the maintenance or operation of the protected function, or even (as a last resort) by changing its design. In particular, if the failures of a protected function can be anticipated or prevented, the mean time between failures of this function would be increased. This in turn would reduce the probability of the multiple failure.

For example, one way to prevent the simultaneous failure of Pumps B and C is to try to prevent unanticipated failures of Pump B. By reducing the number of these failures, the mean time between failures of Pump B would be increased, and so the probability of the multiple failure would be reduced, as shown in Figure 5.12.
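However, bear in mind that the reason for installing a protective device is that the protected function is vulnerable to unanticipated failures. If the protective device is left in a failed state, the protected function has no protection at all. This situation must be intolerable, or a protective device would not have been installed to begin with. This means that we must try to find a practical way of dealing with the failure of protective devices which are not fail safe.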

Prevent the hidden failure
In order to prevent a multiple failure, we must try to ensure that the hidden function is not in a failed state if and when the protected function fails. If a proactive task could be found which was good enough to ensure 100% availability of the protective device, then a multiple failure is theoretically almost impossible.
For example, if a proactive task could be found which could ensure 100% availability of Pump C while it is in the stand-by state, then we can be sure that C would always take over if B failed.

quences of the
of protection, as d iscussed

In practice, it is most unlikely that any proactive task would cause any function, hidden or otherwise, to achieve an availability of 100%. What it must do, however, is deliver the availability needed to reduce the probability of the multiple failure to a tolerable level.
For example, assume that a proactive task is found which enables Pump C to achieve an availability of 99%. If the mean time between unanticipated failures of Pump B is 10 years, then the probability of the multiple failure would be 10^-3 (1 in 1 000) in any one year, as discussed earlier. If the availability of Pump C could be increased to 99.9%, then the probability of the multiple failure would be reduced to 10^-4 (1 in 10 000), and so on.
So for a hidden failure, a proactive task is worth doing if it secures the availability needed to reduce the probability of the multiple failure to a tolerable level.
For hidden failures, a proactive task is worth doing

if it secures the availability needed to reduce the
probability of a multiple failure to a tolerable level
The ways in which failures can be prevented are discussed in Chapters 6 and 7. However, these chapters also explain that it is often impossible to find a proactive task which secures the required availability. This applies especially to the type of equipment which suffers from hidden failures. So if we cannot find a way to prevent a hidden failure, we must find some other way of improving the availability of the hidden function.

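Detect the hidden failure
If it is not possible to find a suitable way of preventing a hidden failure, it is still possible to reduce the risk of the multiple failure by checking the hidden function periodically to find out if it is still working. If this check is carried out at suitable intervals, and if the function is restored as soon as it is found to be defective, it is still possible to secure high levels of availability. Scheduled failure-finding tasks are discussed in Chapter 8.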
Modify the equipment
In a very small number of cases, it is either impossible to find any kind of routine task which secures the desired level of availability, or it is impractical to do it at the required frequency. In these cases, it is necessary to 'go back to the drawing board' and reconsider the design.
If the multiple failure could affect safety or the environment, redesign is compulsory. If the multiple failure only has economic consequences, the need for redesign is assessed on economic grounds.
The ways in which redesign can be used to reduce the risk or to change the consequences of a multiple failure are discussed in Chapter 9.


Hidden Functions: The Decision Process

All the points made so far about the development of a maintenance strategy for hidden functions can be summarised as shown in Figure 5.15.

Figure 5.15: Identifying and developing a maintenance strategy for a hidden failure

Will the loss of function caused by this failure mode on its own become evident to the operating crew under normal circumstances?
Yes: The failure is evident. See Parts 3 to 5 of this chapter.
No: The failure is hidden. Proactive maintenance is worth doing if it secures the availability needed to reduce the probability of a multiple failure to a tolerable level.
If a suitable proactive task cannot be found, check periodically whether the hidden function is working (do a scheduled failure-finding task).
Further Points about Hidden Functions

Six issues need special care when asking the first question in Figure 5.15. They are as follows:
the distinction between functional failures and failure modes
the question of time
the primary and secondary functions of protective devices
what exactly is meant by 'the operating crew'
what are 'normal circumstances'
'fail-safe' devices.
These are all discussed in more detail in the following paragraphs.

Functional failure and failure mode

At this stage in the RCM process, every failure mode which is reasonably likely to cause each functional failure will already have been identified on the RCM Information Worksheet. This has two implications:
firstly, we are not asking what failures could occur. All we are trying to establish is whether each failure mode which has already been identified as a possibility would be hidden or evident if it did occur.
secondly, we are not asking whether the operating crew can diagnose the failure mode itself. We are asking if the loss of function caused by the failure mode will be evident under normal circumstances. (In other words, we are asking if the failure mode has any effects or symptoms which, under normal circumstances, would lead the observer to believe that the item is no longer capable of fulfilling its intended function, or at least that something out of the ordinary had occurred.)

For example, consider a motor vehicle which suffers from a blocked fuel line. The average driver (in other words, the average "operator") would not be able to diagnose this failure mode without expert assistance, so there might be a temptation to call this a hidden failure. However, the loss of the function caused by this failure mode is evident, because the car stops working.

The question of time
There is often a temptation to describe a failure as 'hidden' if a considerable period of time elapses between the moment the failure occurs and the moment it is discovered. In fact, this is not the case. If the loss of function eventually becomes apparent to the operators, and it does so as a direct and inevitable result of this failure on its own, then the failure is treated as evident, no matter how much time elapses between the failure in question and its discovery.

For example, a tank fed by Pump A in Figure 5.4 may take weeks to empty, so the failure of this pump might not be apparent as soon as it occurs. This might lead to the temptation to describe the failure as hidden. However, this is not so, because the tank runs dry as a direct and inevitable result of the failure of Pump A on its own. Therefore the fact that Pump A is in a failed state will inevitably become evident to the operating crew.
Conversely, the failure of Pump C in Figure 5.7 will only become evident if Pump B also fails (unless someone makes a point of checking Pump C from time to time). If Pump B were to be operated and maintained in such a way that it is never necessary to switch on Pump C, it is possible that the failure of Pump C on its own would never be discovered.

This example demonstrates that time is not an issue when considering hidden failures. We are simply asking whether anyone will eventually become aware of the fact that the failure has occurred on its own, and not if they will be aware when it occurs.
Primary and secondary functions
Thus far we have focused on the primary function of protective devices, which is to be capable of fulfilling the function they are designed to fulfil when called upon to do so. As we have seen, this is usually after the protected function has failed. However, an important secondary function of many of these devices is that they should not work when nothing is wrong.

For instance, the primary function of a pressure switch might be listed as follows:
to be capable of transmitting a signal when pressure falls below 250 psi
The implied secondary function of this switch is:
to be incapable of transmitting a signal when pressure is above 250 psi

The failure of the first function is hidden, but the failure of the second is evident because if it occurs, the switch transmits a spurious signal and the machine stops. If this is likely to occur in practice, it should be listed as a failure mode of the function which is interrupted (usually the primary function of the machine). As a result, there is usually no need to list the implied second function separately, but the failure mode would be listed under the relevant function if it is reasonably likely to occur.
The operating crew
When asking whether a failure is evident, the term operating crew refers to anyone who has occasion to observe the equipment or what it is doing at any time in the course of their normal daily activities, and who can be relied upon to report that it has failed.
Failures can be observed by people with many different points of view. They include operators, drivers, quality inspectors, craftsmen, supervisors, and even the tenants of buildings. However, whether any of these people can be relied upon to detect and report a failure depends on four critical elements:
the observer must be in a position either to detect the failure mode itself or to detect the loss of function caused by the failure mode. This may be a physical location or access to information (including management information) which will draw attention to the fact that something is wrong.
the observer must be able to recognise the condition as a failure.
the observer must understand and accept that it is part of his or her job to report failures.
the observer must have access to a procedure for reporting failures.
Normal circumstances
Careful analysis often reveals that many of the duties assigned to operators are actually maintenance tasks. It is wise to start from a zero base when considering these tasks, because it may transpire that either the tasks or their frequencies need to be radically revised. In other words, when asking if a failure will become evident to the operating crew under 'normal' circumstances, the word normal has the following meanings:
that nothing is being done to prevent the failure. If a proactive task is currently successfully preventing the failure, it could be argued that the failure is 'hidden' because it does not occur. However, in Chapter 4 it was pointed out that failure modes and effects should be listed and the rest of the RCM process applied as if no proactive tasks are being done, because one of the main purposes of the exercise is to review whether we should be doing any such tasks in the first place.
that no specific task is being done to detect the failure. A surprising number of tasks which already form part of an operator's normal duties are in fact routines designed to check if hidden functions are working.

For example, pressing a button on a control panel every day to check if all the lights on the panel are working is in fact a failure-finding task.

We shall see later that failure-finding tasks are covered by the RCM task selection process, so once again, it should be assumed at this stage that this task is not being done (even though the task is currently included in the normal duties of the operator). This is because the RCM process might reveal a more effective task, or the need to do the same task at a higher or lower frequency.
Apart from the question of maintenance tasks, there is often considerable doubt about what the 'normal' duties of the operating crew actually are. This occurs most often where standard operating procedures are either poorly documented or do not exist. In these cases, the RCM review process does much to help clarify what these duties should be, and can do much to lay the foundations of a full set of operating procedures.
'Fail-safe' devices
It often happens that a protective circuit is said to be fail-safe when it is not. This usually occurs when only part of a circuit is considered instead of the circuit as a whole.
An example is again provided by a pressure switch, this time attached to a hydrostatic bearing. The switch was meant to shut down the machine if the oil pressure in the bearing fell below a certain level. It transpired during discussion that if the electrical signal from the switch to the control panel was interrupted, the machine would shut down, so the failure of the switch was initially judged to be evident.
However, further discussion revealed that a part of the switch could deteriorate with age, so the switch could become incapable of sensing changes in the pressure. This failure was hidden, and the maintenance program for the switch was developed accordingly.

To avoid this problem, take care to include the sensors and the actuators in the analysis of any control loop, as well as the electrical circuit itself.

5.7 Conclusion

Figure 5.16: The evaluation of failure consequences

This chapter has demonstrated how the RCM process provides a framework for managing failures. As summarised in Figure 5.16, this framework:
classifies all failures on the basis of their consequences. In so doing it separates hidden failures from evident failures, and then ranks the consequences of the evident failures in descending order of importance
provides a basis for deciding whether proactive maintenance is worth doing in each case
suggests what action should be taken if a suitable proactive task cannot be found.
The different types of proactive tasks and default actions are discussed in the next chapters, together with an approach to consequence evaluation and task selection.
6 Proactive Maintenance 1:
Preventive Tasks
6.1 Technical Feasibility and Proactive Tasks
As mentioned in Chapter 1, the actions which can be taken to deal with failures can be divided into the following two categories:
proactive tasks: these are tasks undertaken before a failure occurs, in order to prevent the item from getting into a failed state. They embrace what is traditionally known as 'predictive' and 'preventive' maintenance, although RCM uses the terms scheduled restoration, scheduled discard and on-condition maintenance
default actions: these deal with the failed state, and are chosen when it is not possible to identify an effective proactive task. Default actions include failure-finding, redesign and run-to-failure.
These two categories correspond to the sixth and seventh of the seven questions which make up the basic RCM decision process, as follows:
what can be done to predict or prevent each failure?
what if a suitable predictive or preventive task cannot be found?
Chapters 6 and 7 focus on the sixth question. This deals with the criteria used to decide whether proactive tasks are technically feasible. They also look in more detail at how we decide whether specific categories of tasks are worth doing. (Chapters 8 and 9 review default actions.)
When we ask whether a proactive task is technically feasible, we are simply asking whether it is possible for the task to prevent or anticipate the failure in question. This has nothing to do with economics; economics are part of the consequence evaluation process which has already been considered at length. Instead, technical feasibility depends on the technical characteristics of the failure mode and of the task itself.

Whether or not a proactive task is technically

feasible depends on the technical characteristics
of the failure mode and of the task


Two issues dominate proactive task selection from the technical viewpoint. These are:
the relationship between the age of the item under consideration and
how likely it is to fail
what happens once a failure has started to occur.
The rest of this chapter considers tasks which could apply when there is
a relationship between age (or exposure to stress) and failure. Chapter 7
considers the more difficult cases where there is no such relationship.

6.2 Age and Deterioration

Any physical asset which is required to fulfil a function which brings it into contact with the real world will be subjected to a variety of stresses. These stresses cause the asset to deteriorate by lowering its resistance to stress. Eventually this resistance drops to the point at which the asset can no longer deliver the desired performance; in other words, it fails. This process was first illustrated in Figure 4.3, and is shown again in a slightly different form in Figure 6.1.

Figure 6.1: Deterioration to failure

Exposure to stress is measured in a variety of ways including output, distance travelled, operating cycles, calendar time or running time. These units are all related to time, so it is common to refer to total exposure to stress as the age of the item. This connection between stress and time suggests that there should be a direct relationship between the rate of deterioration and the age of the item. If this is so, then it follows that the point at which failure occurs should also depend on the age of the item, as shown in Figure 6.2.
However, Figure 6.2 is based on two key assumptions, as follows:
deterioration is directly proportional to the applied stress, and
the stress is applied consistently.

Figure 6.2:

If this were true of all assets, we would be able to predict equipment life with great precision. The classical view of preventive maintenance suggests that this can be done; all we need is enough information about failures.
In reality, however, the situation is much less clear cut. This chapter starts looking at the real world by considering a situation where there is a clear relationship between age and failure. Chapter 7 moves on to a more general view of reality.
Age-related Failures
Even parts which seem to be identical vary slightly in their initial resistance to failure. The rate at which this resistance declines with age also varies. Furthermore, no two parts are subject to exactly the same stresses throughout their lives. Even when these variations are quite small, they can have a disproportionate effect on the age at which the part fails. This is illustrated in Figure 6.3, which shows what happens to two components that are put into service with similar resistance to failure.


Figure 6.3: A realistic view of age-related failures (horizontal axis: Age (x 10 000))

Part B is exposed to a generally higher level of stress throughout its life than part
A, so it deteriorates more quickly. Deterioration also accelerates in response to
the two stress peaks at 8 000 km and 30 000 km. On the other hand, for some
reason part A seems to deteriorate at a steady pace despite two stress peaks at
23 000 km and 37 000 km. So one component fails at 53 000 km and the other
at 80 000 km.

Figure 6.3 shows that the failure age of identical parts working under apparently identical conditions varies widely. In practice, although some parts last much longer than others, the failures of a large number of parts which deteriorate in this fashion would tend to congregate around some average life, as shown in Figure 6.4.

Figure 6.4: Frequency of failure and "average life" (horizontal axis: Age (x 10 000))
So even when resistance to failure does decline with age, the point at which failure occurs is often much less predictable than common sense suggests. Chapter 12 explores the quantitative implications of this situation in more detail. It also explains that the failure frequency curve shown in Figure 6.4 can be drawn as a conditional probability of failure curve, as shown in Figure 6.5 below. (The term useful life defines the age at which there is a rapid increase in the conditional probability of failure. It is used to distinguish this age from the average life shown in Figure 6.4.)

Figure 6.5: Conditional probability of failure and "useful life" (horizontal axis: Age (x 10 000))

If large numbers of apparently identical failure modes are analysed in this way, it is not unusual to find a number which occur prematurely. Why this occurs is also discussed in Chapter 12. The result of such premature failures is a conditional probability curve as shown in Figure 6.6. This is the same as failure pattern B in Figure 1.5.

Figure 6.6: The effect of premature failures (horizontal axis: Age (x 10 000))

Even this is actually a somewhat simplistic view of age-related failures, because there are in fact three sets of ways in which the probability of failure can increase as an item gets older. These are shown in Figure 6.7.

Figure 6.7: Failures which are age-related

These patterns were introduced in Chapter 1 and are discussed at much greater length in Chapter 12. The characteristic shared by patterns A and B is that they both display a point at which there is a rapid increase in the conditional probability of failure. Pattern C shows a steady increase in the probability of failure, but no distinct wear-out zone. The next three sections of this chapter consider the implications of these failure patterns from the viewpoint of preventive maintenance.

6.3 Age-Related Failures and Preventive Maintenance

For centuries, certainly since machines have come into widespread use, mankind has tended to believe that most equipment tends to behave as shown in Figures 6.4 to 6.6. In other words, most people still tend to assume that similar items performing a similar duty will perform reliably for a period, perhaps with a small number of random early failures, and then most of the items will 'wear out' at about the same time.
In general, age-related failure patterns apply to items which are very simple, or to complex items which suffer from a dominant failure mode. In practice, they are commonly found under conditions of direct wear, most often where equipment comes into direct contact with the product. They are also associated with fatigue, corrosion, oxidation and evaporation.

Wear-out characteristics most often occur where
equipment comes into direct contact with the product.
Age-related failures also tend to be associated with
fatigue, oxidation, corrosion and evaporation.
Examples of points where equipment comes into contact with the product include furnace refractories, pump impellers, valve seats, seals, machine tooling, screw conveyors, crusher and hopper liners, the inner surfaces of pipelines, dies and so on.
Fatigue affects items, especially metallic items, which are subjected to cyclic loads. The rate and extent to which oxidation and corrosion affect any item depend of course on its chemical composition, the extent to which it is protected and the environment in which it is operating. Evaporation affects solvents and the lighter fractions of petrochemical products.
For items which conform to one of the failure patterns shown in Figure 6.7, it may be possible to determine an age at which it is worth taking some sort of action to prevent the failures from happening in the future, or at least to reduce the consequences of the failure. The two preventive options which are available under these circumstances are scheduled restoration tasks and scheduled discard tasks. They are considered in more detail in the next two sections of this chapter.

6.4 Scheduled Restoration Tasks

As the name implies, scheduled restoration entails taking periodic action to restore an existing item or component to its original condition (or more accurately, to restore its original resistance to failure). Specifically:

Scheduled restoration entails remanufacturing

a single component or overhauling an entire
assembly at or before a specified age limit,
regardless of its condition at the time
Scheduled restoration tasks are also known as scheduled rework tasks. As the above definition suggests, they include overhauls which are done at pre-set intervals.

The Frequency of Scheduled Restoration Tasks

If the failure mode under consideration conforms to Pattern A or B, it is possible to identify the age at which wear-out begins. The scheduled restoration task is done at intervals slightly less than this age. In other words:

The frequency of a scheduled restoration task is governed
by the age at which the item or component shows a rapid
increase in the conditional probability of failure.
In the case of Pattern C, at least four different restoration intervals need to be analysed to determine the optimum interval (if one exists at all).
In general, the frequency of a scheduled restoration task can only be determined satisfactorily on the basis of reliable historical data. This is seldom available when assets first go into service, so it is usually impossible to specify scheduled restoration tasks in initial maintenance programs. (For instance, scheduled restoration tasks were only assigned to seven components in the initial program developed for the Douglas DC 10.) However, items subject to very expensive failure modes should be studied as soon as possible to find out if they would benefit from scheduled restoration tasks.
The Technical Feasibility of Scheduled Restoration
The above comments indicate that for a scheduled restoration task to be technically feasible, the first criteria which must be satisfied are that:
there must be a point at which there is an increase in the conditional probability of failure (in other words, the item must have a 'life')
we must be reasonably sure what the life is.
Secondly, most of the items must survive to this age. If too many items fail before reaching it, the nett result is an increase in unanticipated failures. Not only could this have unacceptable consequences, but it means that the associated restoration tasks are done out of sequence. This in turn disrupts the entire schedule planning process.
(Note that if the failure has safety or environmental consequences, all the items must survive to the age at which the scheduled restoration task is to be done, because we cannot risk failures which might hurt people or damage the environment. In this context the comments about safe-life limits which are made in the next section of this chapter apply equally to scheduled restoration tasks.)


Finally, scheduled restoration must restore the original resistance to failure of the asset, or at least something close enough to the original condition to ensure that the item continues to be able to fulfil its intended function for a reasonable period of time.
For example, no-one in their right mind would try to overhaul a domestic light bulb, simply because it is not possible to restore it to its original condition (regardless of the economics of the matter). On the other hand, it could be argued that retreading a tyre restores the tread to something approaching its original condition.

These points lead to the following general conclusions about the technical feasibility of scheduled restoration:
Scheduled restoration tasks are technically feasible if:
there is an identifiable age at which the item shows a rapid increase in the conditional probability of failure
most of the items survive to that age (all of the items if the failure has safety or environmental consequences)
they restore the original resistance to failure of the item.
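The three feasibility criteria can be written as a simple check. The sketch below is illustrative only: the class, field and function names are hypothetical, and the numeric threshold used for "most of the items" is an assumption made for the example, since the text does not give a figure.

```python
from dataclasses import dataclass


@dataclass
class RestorationCandidate:
    """Facts about a failure mode and a proposed restoration task (hypothetical)."""
    has_identifiable_wearout_age: bool
    fraction_surviving_to_age: float  # between 0.0 and 1.0
    safety_or_environmental: bool
    restores_original_resistance: bool


def restoration_is_technically_feasible(candidate, most_threshold=0.5):
    """Apply the three criteria; most_threshold is an assumed reading of 'most'."""
    if not candidate.has_identifiable_wearout_age:
        return False
    if candidate.safety_or_environmental:
        # All of the items must survive to the age limit.
        survives = candidate.fraction_surviving_to_age >= 1.0
    else:
        survives = candidate.fraction_surviving_to_age > most_threshold
    return survives and candidate.restores_original_resistance


liner = RestorationCandidate(True, 0.95, False, True)
print(restoration_is_technically_feasible(liner))  # True
```

Note that this only answers the technical question; whether the task is worth doing is a separate matter.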
The Effectiveness of Scheduled Restoration Tasks
Even if it is technically feasible, scheduled restoration might still not be worth doing because other tasks may be even more effective. Examples showing how this might occur in practice are discussed in Chapter 7.
If a more effective task cannot be found, there is often a temptation to select scheduled restoration tasks purely on the grounds of technical feasibility. An age limit applied to an item which behaves as shown in Figure 6.6 means that some items will receive attention before they need it while others might fail early, but the nett effect may be an overall reduction in the number of unanticipated failures. However, even then scheduled restoration might not be worth doing, for the following reasons:
as mentioned earlier, a reduction in the number of failures is not sufficient if the failure has safety or environmental consequences, because we want to eliminate these failures altogether.
if the consequences are economic, we need to be sure that over a period of time, the cost of doing the scheduled restoration task is less than the cost of allowing the failure to occur. When comparing the two, bear in mind that an age limit lowers the service life of any item, so it increases the number of items sent to the workshop for restoration. Why this is so is shown in Figure 6.8.

Figure 6.8: "Useful life" and "average life"

In this example, the useful life is 12 months, while the average life is 18
months. In a period of 3 years, the failure occurs twice if no preventive
maintenance is done, while the preventive task would be done three times. In
other words, the preventive task has to be done 50% more often than the
corrective task which would have to be performed if the failure was allowed
to occur on its own.
If each failure costs (say) 2 000 in lost production and repair, failures
would cost 4 000 over a three year period. If each scheduled restoration task
costs (say) 1 100, these tasks would cost 3 300 over the same period. So in
this case, the task is cost-effective.
On the other hand, if the average life was 24 months and all other figures
remained the same, failures would only occur 1.5 times every three years and
would cost 3 000 over this period. The scheduled restoration task still costs
3 300 over the same period, so it would not be cost-effective.
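
The arithmetic in this example can be checked with a short sketch. The
function and parameter names are illustrative assumptions, not part of the
RCM process itself; the figures are those quoted in the example.

```python
def compare_restoration_cost(horizon_months, average_life_months,
                             useful_life_months, cost_per_failure,
                             cost_per_task):
    """Compare the cost of run-to-failure with scheduled restoration.

    Failures are assumed to recur once per average life, and the
    restoration task is assumed to be done once per useful life.
    """
    expected_failures = horizon_months / average_life_months
    tasks_done = horizon_months / useful_life_months
    failure_cost = expected_failures * cost_per_failure
    task_cost = tasks_done * cost_per_task
    # The task is cost-effective only if it costs less than the failures.
    return failure_cost, task_cost, task_cost < failure_cost


# Average life 18 months, useful life 12 months, three year horizon.
first_case = compare_restoration_cost(36, 18, 12, 2000, 1100)
# Same figures, but with an average life of 24 months.
second_case = compare_restoration_cost(36, 24, 12, 2000, 1100)
print(first_case)
print(second_case)
```

The first call reproduces the 4 000 versus 3 300 comparison, and the second
the 3 000 versus 3 300 comparison, so the verdict flips when the average life
rises to 24 months.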

When considering failures which have operational consequences, bear in
mind that a scheduled restoration task may itself affect operations. In
most cases, this effect is likely to be less than the consequences of the
failure because:
• the scheduled restoration task would normally be done at a time when
  it is likely to have the least effect on production (during a so-called
  production window)
• the scheduled restoration task is likely to take less time than it would to
  repair the failure because it is possible to plan more thoroughly for the
  scheduled task.
If there are no operational consequences, scheduled restoration is only
justified if it costs substantially less than the cost of repair (which may be
the case if the failure causes extensive secondary damage).

6.5 Scheduled Discard Tasks

Again as the name implies, scheduled discard means replacing an item or
component with a new one at fixed intervals.

Scheduled discard tasks entail discarding an item or component at or before
a specified age limit, regardless of its condition at the time.

These tasks are done on the understanding that replacing the old component
with a new one will restore the original resistance to failure.
The Frequency of Scheduled Discard Tasks
Like scheduled restoration tasks, scheduled discard tasks are only
technically feasible if there is a direct relationship between failure and
operating age, as illustrated by Figure 6.7. The frequency at which they are
done is determined on the same basis:

The frequency of a scheduled discard task is governed by the age at which
the item or component shows a rapid increase in the conditional probability
of failure.
In general, there is a particularly widely held belief that all items 'have a
life', and that installing a new part before this 'life' is reached will
automatically make it 'safe'. This is not always true, so RCM takes care to
focus on safety when considering scheduled discard tasks.
For this reason, RCM recognises two different types of life-limits when
dealing with scheduled discard tasks. The first apply to tasks meant to avoid
failures which have safety consequences, and are called safe-life limits.
Those which are intended to prevent failures which do not have safety
consequences are called economic-life limits.
Safe-life limits
Safe-life limits only apply to failures which have safety or environmental
consequences, so the associated tasks must prevent all failures. In other
words, no failures should occur before this limit is reached. This means
that safe-life limits cannot apply to items which conform to pattern A,
because infant mortality means that some items must fail prematurely. In
fact, they cannot apply to any failure mode where the probability of failure
is more than zero when the item enters service.
In practice, safe-life limits can only apply to failure modes which occur in
such a way that no failures can be expected to occur before the wear-out zone
is reached.


Ideally, safe-life limits should be determined before the item is put into
service. It should be tested in a simulated operating environment to
determine what life is actually achieved, and a conservative fraction of this
life used as the safe-life limit. This is illustrated in Figure 6.9.

Figure 6.9: Safe-life limits
(horizontal axis: age)

There is never a perfect correlation between a test environment and the
operating environment. Testing a long-lived part to failure is also costly
and obviously takes a long time, so there is usually not enough test data
for survival curves to be drawn with confidence. In these cases safe-life
limits can be established by dividing the average life by an arbitrary factor
as large as three or four. This implies that the conditional probability of
failure at the life limit would essentially be zero. In other words, the
safe-life limit is based on a 100% probability of survival to that age.
The function of a safe-life limit is to avoid the occurrence of a critical
failure, so the resulting discard task is worth doing only if it ensures that
no failures occur before the safe-life limit.
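
As a minimal sketch of the "divide the average by a factor" rule, the
following assumes a handful of test lives are available. The data values, the
function name and the unit (operating hours) are hypothetical, and the factor
remains an arbitrary engineering choice, as the text says.

```python
from statistics import mean


def safe_life_limit(test_lives, safety_factor=3.0):
    """Return a safe-life limit as a conservative fraction of the average life."""
    if safety_factor <= 1:
        raise ValueError("the safety factor must be greater than 1")
    return mean(test_lives) / safety_factor


# Hypothetical lives achieved on test, in operating hours.
hypothetical_test_lives = [11000, 12000, 13000]
limit = safe_life_limit(hypothetical_test_lives, safety_factor=4)
print(limit)
```

A limit derived this way should still be compared with the shortest life seen
on test: if any item failed before the limit, the limit cannot be regarded as
safe.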
Economic-life limits
Operating experience sometimes suggests that the scheduled discard of
an item is desirable on economic grounds. This is known as an economic-life
limit. It is based on the actual age-reliability relationship of the item,
rather than a fraction of the average age at failure.
The only justification for an economic-life limit is cost-effectiveness.
In the same way that scheduled restoration increases the number of jobs
passing through the workshop, so scheduled discard increases the consumption
of the parts which are subject to discard. As a result, the
cost-effectiveness of scheduled discard tasks is determined in the same way
as it is for scheduled restoration tasks.
In general, an economic-life limit is worth applying if it avoids or
reduces the operational consequences of an unanticipated failure, or if the
failure which it prevents causes significant secondary damage. Clearly,
we must know the failure pattern before we can assess the cost-effectiveness
of scheduled discard tasks.


For new assets, this means that a failure mode which has major economic
consequences should also be put into an age-exploration program to find out
if a life limit is applicable. However, as with scheduled restoration, there
is seldom enough evidence to include this type of task in an initial
scheduled maintenance program.
The Technical Feasibility of Scheduled Discard Tasks
The above comments indicate that scheduled discard tasks are technically
feasible under the following circumstances:

Scheduled discard tasks are technically feasible if:
• there is an identifiable age at which the item shows a rapid increase in
  the conditional probability of failure
• most of the items survive to that age (all of the items if the failure has
  safety or environmental consequences).

There is no need to ask if the task will restore the original condition
because the item is replaced with a new one.
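
The two criteria can be expressed as a simple check against failure data.
This is only a sketch: the function name, the sample ages and the reading of
"most" as "more than half" are assumptions made for illustration.

```python
def discard_task_feasible(failure_ages, wearout_age, safety_or_environmental,
                          min_survival=0.5):
    """Check the two technical feasibility criteria for scheduled discard.

    failure_ages: observed ages at failure for a sample of items.
    wearout_age: age at which the conditional probability of failure rises
        rapidly, or None if no such age can be identified.
    min_survival: assumed meaning of "most" (more than this fraction).
    """
    if wearout_age is None:
        return False  # first criterion not met
    survivors = sum(1 for age in failure_ages if age >= wearout_age)
    survival_fraction = survivors / len(failure_ages)
    if safety_or_environmental:
        return survival_fraction == 1.0  # all items must survive
    return survival_fraction > min_survival


# Hypothetical ages at failure, in operating hours.
sample_ages = [900, 1100, 1150, 1200, 1250]
economic_ok = discard_task_feasible(sample_ages, 1000, False)
safety_ok = discard_task_feasible(sample_ages, 1000, True)
print(economic_ok, safety_ok)
```

With this sample, four items out of five reach the wear-out age, which is
enough for an economic-life limit but not for a safe-life limit.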

6.6 Failures which are Not Age-related

One of the most significant developments in modern maintenance management
has been the realisation that very few failure modes actually conform to any
of the failure patterns shown in Figure 6.7. As discussed in the following
paragraphs, this is due primarily to a combination of variations in applied
stress and increasing complexity.
Contrary to the assumptions listed in part 2 of this chapter, deterioration
is not always proportional to the applied stress, and stress is not always
applied consistently. For instance, part 3 of Chapter 4 mentioned that
many failures are caused by increases in applied stress, which are caused
in turn by incorrect operation, incorrect assembly or external damage.
Examples of such increases in stress include operating errors (starting up a
machine too quickly, accidentally putting it into reverse while it is going
forward, feeding material into a process too quickly), assembly errors
(over-tightened bolts, misfitting parts) and external damage (lightning, the
'thousand-year flood', and so on).



In all of these cases, there is little or no relationship between how long
the asset has been in service and the likelihood of the failure occurring.
This is shown in Figure 6.10, which is basically the same as Figure 4.4 with
a time dimension added. (Ideally, 'preventing' failures of this sort should
be a matter of preventing whatever causes the increase in stress levels,
rather than a matter of doing anything to the asset.)

Figure 6.10 (horizontal axis: time)

In Figure 6.11, the stress peak permanently reduces resistance to failure,
but does not actually cause the item to fail (an earthquake cracks a
structure but does not cause it to fall down). The reduced failure
resistance makes the part vulnerable to the next peak, which may or may not
occur before the part is replaced for another reason.

Figure 6.11 (horizontal axis: time)

In Figure 6.12, the stress peak only temporarily reduces failure resistance
(as in the case of thermoplastic materials which soften when temperature
rises and harden again when it drops).

Figure 6.12 (horizontal axis: time)

Finally in Figure 6.13 a stress peak accelerates the decline of failure
resistance and eventually greatly shortens the life of the component. When
this happens, the cause and effect relationship can be very difficult to
establish, because the failure could occur months or even years after the
stress peak.

Figure 6.13 (horizontal axis: time)

This often happens if a part is damaged during installation (which might
happen if a ball-bearing is misaligned), if it is damaged prior to
installation (the bearing is dropped on the floor in the parts store) or if
it is mistreated in service (dirt gets into the bearing). In these cases,
failure prevention is ideally a matter of ensuring that maintenance and
installation work is done correctly and that parts are looked after properly
in storage.

In all four of these examples, when the items enter service it is not
possible to predict when the failures will occur. For this reason, such
failures are described as 'random'.


The failure processes depicted in Figure 6.7 apply to fairly simple
mechanisms. In the case of complex items, the situation becomes even less
predictable. Items are made more complex to improve their performance (by
incorporating new or additional technology or by automation) or to make them
safer (using protective devices).
Developments in the field of civil aviation provide an example, with the
addition of anti-icing equipment, retractable landing gear, moveable
high-lift devices, pressure and temperature control systems for the cabin,
extensive navigation and communications equipment, complex instrumentation
and complex ancillary support systems.

In other words, better performance and greater safety are achieved at the
cost of greater complexity. This is true in most branches of industry.
Greater complexity means balancing the lightness and compactness needed for
performance with the size and mass needed for durability. This combination
of complexity and compromise:
• increases the number of components which can fail, and also increases
  the number of interfaces or connections between components. This in
  turn increases the number and variety of failures which can occur.
  For example, a great many mechanical failures involve welds or bolts, while
  a significant proportion of electrical and electronic failures involve the
  connections between components. The more such connections there are, the
  more such failures there will be.
• reduces the margin between the initial capability of each component
  and the desired performance (in other words, the 'can' is closer to the
  'want'), which reduces scope for deterioration before failure occurs.
These two developments in turn suggest that complex items are more likely
to suffer from random failures than simple items.
Patterns D, E and F
The combination of variable stress and erratic response to stress, coupled
with increasing complexity, means that in practice, a significant
proportion of failure modes conform to the failure patterns shown in
Figure 6.14.

Figure 6.14: Failures which are not age-related
For failures of this kind, age limits do little or nothing to improve
reliability. In fact, scheduled overhauls can actually increase overall
failure rates by introducing infant mortality into otherwise stable systems.
This is borne out by the high number of accidents around the world which have
occurred either while maintenance is under way or shortly after a maintenance
intervention. It is also borne out by the machine operator who says that
"every time maintenance works on it over the weekend, it takes us until
Wednesday to get it going again".
From the maintenance management viewpoint, the main conclusion to be drawn
from these failure patterns is that the idea of a wear-out age simply does
not apply to random failures, so the idea of fixed interval replacement or
overhaul prior to such an age cannot apply.
As mentioned in Chapter 1, an intuitive awareness of these facts has led
some people to abandon the idea of preventive maintenance altogether.
Although this can be the right thing to do for failures with minor
consequences, when the failure consequences are serious, something must be
done to prevent the failures or at least to avoid the consequences.
The continuing need to prevent certain types of failure, and the growing
inability of classical techniques to do so, are behind the growth of new
types of failure management. Foremost among these are the techniques known
as predictive or on-condition maintenance. These are discussed at length in
the next chapter.

7 Proactive Maintenance 2: Predictive Tasks

7.1 Potential Failures and On-condition Maintenance
The previous chapter explained that there is often little or no relationship
between how long an asset has been in service and how likely it is to fail.
However, although many failure modes are not age-related, most of them give
some sort of warning that they are in the process of occurring or are
about to occur. If evidence can be found that something is in the final
stages of failure, it may be possible to take action to prevent it from
failing completely and/or to avoid the consequences.
Figure 7.1 illustrates what happens in the final stages of failure. It is
called the P-F curve, because it shows how a failure starts, deteriorates to
the point at which it can be detected (point 'P') and then, if it is not
detected and corrected, continues to deteriorate, usually at an accelerating
rate, until it reaches the point of functional failure ('F').

Figure 7.1: The P-F curve
(point P: point where we can find out that it is failing ("potential
failure"); point F: point where it has failed)

The point in the failure process at which it is possible to detect whether
the failure is occurring or is about to occur is known as a potential failure.

A potential failure is an identifiable condition which indicates that a
functional failure is either about to occur or in the process of occurring.

In practice, there are thousands of ways of finding out if failures are in
the process of occurring.


Examples of potential failures include hot spots showing deterioration of
furnace refractories or electrical insulation, vibrations indicating imminent
bearing failure, cracks showing metal fatigue, particles in gearbox oil
showing imminent gear failure, excessive tread wear on tyres, etc.

If a potential failure is detected between point P and point F in Figure 7.1,
it may be possible to take action to prevent or to avoid the consequences
of the functional failure. (Whether or not it is possible to take meaningful
action depends on how quickly the functional failure occurs, as discussed in
part 2 of this chapter.) Tasks designed to detect potential failures are
known as on-condition tasks.

On-condition tasks entail checking for potential failures, so that action
can be taken to prevent the functional failure or to avoid the consequences
of the functional failure.

On-condition tasks are so called because the items which are inspected
are left in service on the condition that they continue to meet specified
performance standards. This is also known as predictive maintenance
(because we are trying to predict whether, and possibly when, the item is
going to fail on the basis of its present behaviour) or condition-based
maintenance (because the need for corrective action is based on an
assessment of the condition of the item).
7.2 The P-F Interval

In addition to the potential failure itself, we need to consider the amount
of time (or the number of stress cycles) which elapses between the point at
which a potential failure occurs (in other words, the point at which it
becomes detectable) and the point where it deteriorates into a functional
failure. As shown in Figure 7.2, this interval is known as the P-F interval.

The P-F interval is the interval between the occurrence of a potential
failure and its decay into a functional failure.

Figure 7.2: The P-F interval

The P-F interval tells us how often on-condition tasks must be done. If we
want to detect the potential failure before it becomes a functional failure,
the interval between checks must be less than the P-F interval.

On-condition tasks must be carried out

at intervals less than the P-F interval
The P-F interval is also known as the warning period, the lead time to
failure or the failure development period. It can be measured in any units
which provide an indication of exposure to stress (running time, units of
output, stop-start cycles, etc), but for practical reasons, it is most often
measured in terms of elapsed time. For different failure modes, it varies
from fractions of a second to several decades.
Note that if an on-condition task is done at intervals which are longer
than the P-F interval, there is a chance that we will miss the failure
altogether. On the other hand, if we do the task at too small a fraction of
the P-F interval, we will waste resources on the checking process.
For instance, if the P-F interval for a given failure mode is two weeks, the
failure will be detected if the item is checked once a week. Conversely, if
it is checked once a month, it is possible to miss the whole failure process.
On the other hand, if the P-F interval is three months it is a waste of
effort to check the item every day.
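
The trade-off in these examples can be sketched in a few lines. The function
names are illustrative assumptions, and a three month P-F interval is
approximated here as 90 days.

```python
def might_miss_failure(pf_interval_days, check_interval_days):
    """True if a whole P-F interval could pass between two checks."""
    return check_interval_days >= pf_interval_days


def checks_per_pf_interval(pf_interval_days, check_interval_days):
    """Number of checks that fall within one P-F interval."""
    return pf_interval_days / check_interval_days


weekly_on_two_weeks = might_miss_failure(14, 7)    # checked once a week
monthly_on_two_weeks = might_miss_failure(14, 30)  # checked once a month
daily_on_three_months = checks_per_pf_interval(90, 1)
print(weekly_on_two_weeks, monthly_on_two_weeks, daily_on_three_months)
```

A weekly check on a two week P-F interval cannot miss the failure, a monthly
check can, and a daily check on a three month P-F interval places some 90
checks inside a single warning period.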

In practice, it is usually sufficient to select task intervals equal to half
the P-F interval. This ensures that the inspection will detect the potential
failure before the functional failure occurs, while (in most cases) providing
a reasonable amount of time to do something about it. This leads to the
concept of the nett P-F interval.
The Nett P-F Interval
The nett P-F interval is the minimum interval likely to elapse between the
discovery of a potential failure and the occurrence of the functional failure.
This is illustrated in Figures 7.3 and 7.4, which both show a failure with a
P-F interval of nine months.

Figure 7.3: Nett P-F interval (1)
(P-F interval: 9 months; inspection interval: 1 month; nett P-F interval:
8 months; horizontal axis: time)


Figure 7.3 shows that if the item is inspected monthly, the nett P-F interval
is 8 months. On the other hand, if it is inspected at six monthly intervals
as shown in Figure 7.4, the nett P-F interval is 3 months. So in the first
case the minimum amount of time available to do something about the failure
is five months longer than in the second, but the inspection task has to be
done six times more often.

Figure 7.4: Nett P-F interval (2)
(horizontal axis: time)
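
The figures quoted for Figures 7.3 and 7.4 follow from subtracting the
inspection interval from the P-F interval. A minimal sketch, with an
illustrative function name and all times in months:

```python
def nett_pf_interval(pf_interval, inspection_interval):
    """Worst-case warning time left after a potential failure is found."""
    if inspection_interval >= pf_interval:
        raise ValueError("inspections must be more frequent than the P-F interval")
    return pf_interval - inspection_interval


monthly = nett_pf_interval(9, 1)       # case shown in Figure 7.3
six_monthly = nett_pf_interval(9, 6)   # case shown in Figure 7.4
extra_warning = monthly - six_monthly  # additional time bought by monthly checks
print(monthly, six_monthly, extra_warning)
```

The five extra months of warning are paid for with six inspections for every
one in the six monthly regime.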

The nett P-F interval governs the amount of time available to take whatever
action is needed to reduce or eliminate the consequences of the failure.
Depending on the operating context of the asset, warning of incipient
failure enables the users of an asset to reduce or avoid consequences in
a number of ways, as follows:
• downtime: corrective action can be planned at a time which does not
  disrupt operations. The opportunity to plan the corrective action properly
  also means that it is likely to be done more quickly.

  For example, if an electrical component is found to be overheating before
  it burns out, it may be possible to replace it when the machine is normally
  idle. Note that in this case, the failure of the component is not prevented
  (it might be doomed whatever happens) but the operational consequences of
  the failure are avoided.

• repair costs: users may be able to take action to eliminate the secondary
  damage which would be caused by unanticipated failures. This would
  reduce the downtime and the repair costs associated with the failure.

  For instance, a timely warning might enable users to switch a machine off
  before (say) a collapsing bearing allows a rotor to touch a stator.

• safety: warning of failure provides time either to shut down a plant
  before the situation becomes dangerous, or to move people who might
  otherwise be injured out of harm's way.

  For instance, if a crack in a wall is discovered in good time, it may be
  possible to shore up the foundations and so prevent the wall from
  deteriorating so much that it falls down. It is highly likely that we would
  have to vacate the premises while this work is done, but at least we avoid
  the safety consequences which would arise if the wall fell down.



For an on-condition task to be technically feasible, the nett P-F interval
must be longer than the time required to take action to avoid or reduce the
consequences of the failure. If the nett P-F interval is too short for any
sensible action to be taken, then the on-condition task is clearly not
technically feasible.
In practice, the time required varies widely. In some cases it may be a
matter of hours (say until the end of an operating cycle or the end of a
shift) or even minutes (to shut down a machine or evacuate a building). In
other cases it can be weeks or even months (say until a major shutdown).

In general, longer P-F intervals are desirable for two reasons:
• it is possible to do whatever is necessary to avoid the consequences of
  the failure (including planning the corrective action) in a more considered
  and hence more controlled fashion
• fewer on-condition inspections are required.
This explains why so much energy is devoted to finding potential failure
conditions and associated on-condition techniques which give longer P-F
intervals. However, note that it is possible to make use of very short P-F
intervals in certain cases.

For example, failures which affect the balance of large fans cause serious
problems very quickly, so on-line vibration sensors are used to shut the fans
down when such failures occur. In this case, the P-F interval is very short,
so monitoring is continuous. Note also that once again, the monitoring device
is being used to avoid the consequences of the failure.

P-F Interval Consistency
The P-F curves illustrated so far in this chapter indicate that the P-F
interval for any given failure is constant. In fact, this is not the case:
some actually vary over a considerable range of values, as shown in Figure
7.5.

Figure 7.5: P-F intervals
(horizontal axis: time)

For example, someone discussing the P-F interval associated with a change in
noise levels might say that an item rattles away for anything from two weeks
upwards before it collapses. In another case, tests might show that anything
from six months to five years elapses from the moment a crack becomes
detectable at a certain point in a structure until the moment the structure
fails.


Clearly, in these cases a task interval must be selected which is
substantially less than the shortest of the likely P-F intervals. In this
way, we can always be reasonably certain of detecting the potential failure
before it becomes a functional failure. If the nett P-F interval associated
with this minimum interval is long enough for suitable action to be taken to
deal with the consequences of the failure, then the on-condition task is
technically feasible.
On the other hand, if the P-F interval is very inconsistent (as some of them
can be) then it is not possible to establish a meaningful task interval, and
the task in question should be abandoned in favour of some other way of
dealing with the failure.

7.3 Technical Feasibility of On-condition Tasks

In the light of the above discussion, the criteria which any on-condition
task must satisfy to be technically feasible can be summarised as follows:

Scheduled on-condition tasks are technically feasible if:
• it is possible to define a clear potential failure condition
• the P-F interval is reasonably consistent
• it is practical to monitor the item at intervals less than the
  P-F interval
• the nett P-F interval is long enough to be of some use (in
  other words, long enough for action to be taken to reduce
  or eliminate the consequences of the functional failure).
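
The four criteria can be written down as a simple checklist. This is a
sketch only: the class, its field names and the example figures are
hypothetical, and the first two criteria remain matters of judgement which
are entered here as yes/no answers.

```python
from dataclasses import dataclass


@dataclass
class OnConditionTask:
    """Inputs for one candidate on-condition task (illustrative names)."""
    clear_potential_failure: bool   # can a clear potential failure be defined?
    pf_reasonably_consistent: bool  # judgement on P-F interval consistency
    shortest_pf_interval: float     # shortest of the likely P-F intervals
    task_interval: float            # practical interval between checks
    time_needed_to_act: float       # time required to avoid the consequences


def nett_pf_interval(task: OnConditionTask) -> float:
    """Minimum warning time: shortest P-F interval minus the task interval."""
    return task.shortest_pf_interval - task.task_interval


def technically_feasible(task: OnConditionTask) -> bool:
    """Apply the four criteria in the order in which they are listed."""
    return (
        task.clear_potential_failure
        and task.pf_reasonably_consistent
        and task.task_interval < task.shortest_pf_interval
        and nett_pf_interval(task) > task.time_needed_to_act
    )


# Hypothetical example, all times in months.
crack_check = OnConditionTask(True, True, 6, 3, 2)
rushed_check = OnConditionTask(True, True, 6, 5, 2)
print(technically_feasible(crack_check), technically_feasible(rushed_check))
```

In the second case the check is still done within the P-F interval, but the
nett P-F interval of one month is shorter than the two months needed to act,
so the task fails the last criterion.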

7.4 Categories of On-condition Techniques

The four major categories of on-condition techniques are as follows:
• condition monitoring techniques, which involve the use of specialised
  equipment to monitor the condition of other equipment
• techniques based on variations in product quality
• primary effects monitoring techniques, which entail the use of gauges and
  process monitoring equipment
• inspection techniques based on the human senses.
These are each reviewed in the following paragraphs.

Condition monitoring
The most sensitive on-condition maintenance techniques usually involve the
use of some type of equipment to detect potential failures. In other words,
equipment is used to monitor the condition of other equipment.
These techniques are known as condition monitoring to distinguish them
from other types of on-condition maintenance.
Condition monitoring embraces several hundred different techniques,
so a detailed study of the subject is well beyond the scope of this chapter.
However, Appendix 4 provides a brief summary of about 100 of the better
known techniques. All of these techniques are designed to detect failure
effects, such as:
• particle effects
• chemical effects
• physical effects
• temperature effects
• electrical effects.
These techniques can be seen as highly sensitive versions of the human
senses. Many of them are now very sensitive indeed, and a few give several
months (if not several years) warning of failure. However, a major
limitation of nearly every condition monitoring device is that it monitors
only one condition. For instance, a vibration analyser only monitors
vibration and cannot detect chemicals or changes in temperature. So greater
sensitivity is bought at the price of the versatility inherent in the human
senses.
The P-F intervals associated with different monitoring techniques vary
from a few minutes to several months. Different techniques also pinpoint
failures with varying degrees of precision. Both of these factors must be
considered when assessing the feasibility of any technique.
In general, condition monitoring techniques can be spectacularly effective
when they are appropriate, but when they are inappropriate they can be a
very expensive and sometimes bitterly disappointing waste of time. As a
result, the criteria for assessing whether on-condition tasks are technically
feasible and worth doing should be applied especially rigorously to
condition monitoring techniques.



Product quality variation

In some industries, an important source of data about potential failures is
the quality assurance function. Often the emergence of a defect in an
article produced by a machine is directly related to a failure mode in the
machine itself. Many of these defects emerge gradually, and so provide
timely evidence of potential failures. If the data gathering and evaluation
procedures exist already, it costs very little to use them to provide
warning of equipment failure.
One popular technique which can often be used in this way is Statistical
Process Control (SPC). SPC entails measuring some attribute of product
quality, such as a dimension, and using the measurements to draw conclusions
about the process.
Figure 2.6 in Chapter 2 showed how such measurements might appear
for a process which is in control and in specification. Figures 3.4 and 3.5
in Chapter 3 showed two ways in which a process could be out of control
and out of specification (in other words, failed). In many cases, the
transition from being in control to being out of control takes place
gradually. SPC charts can be used to track this transition.
Figure 7.6 shows a typical SPC chart on which readings are in control to
start with. A failure mode then occurs which causes the measurements to
start drifting in one direction.
For example, as a grinding wheel wears, the diameter of successive
workpieces increases until the wheel is adjusted or replaced.

In zone 2 in Figure 7.6 the process is out of control but still within
specification. (Oakland 1991 describes how shifts of this nature can be
detected using a cusum chart.) This is an identifiable condition which
indicates that a functional failure is about to occur. In other words, it is
a potential failure. If nothing is done to rectify the situation the process
eventually goes out of specification, as shown in zone 3 in Figure 7.6.
This example describes only one of many ways in which SPC can be
used to measure and manage process variability. A full description of all
of them is well beyond the scope of this book. However, the point to note at
this stage is that if deviations on charts like these can be related
directly to specific failure modes, then the charts are sources of
on-condition data which can make a valuable contribution to overall
proactive maintenance efforts.
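
The mapping from chart zones to maintenance terms can be sketched as follows.
The limits and diameters are hypothetical, and the single-reading test is a
deliberate simplification: practical SPC schemes judge whether a process is
in control from runs and trends rather than from one point.

```python
def classify_reading(value, control_limits, spec_limits):
    """Map one SPC reading onto OK / potential failure / functional failure.

    Assumes the control limits lie inside the specification limits.
    """
    lower_control, upper_control = control_limits
    lower_spec, upper_spec = spec_limits
    if not (lower_spec <= value <= upper_spec):
        return "functional failure"  # out of specification
    if not (lower_control <= value <= upper_control):
        return "potential failure"   # out of control, still in specification
    return "OK"                      # in control and in specification


# Hypothetical workpiece diameters (mm) drifting upwards as a wheel wears.
control = (24.98, 25.02)
spec = (24.95, 25.05)
readings = [25.00, 25.01, 25.03, 25.04, 25.06]
states = [classify_reading(r, control, spec) for r in readings]
print(states)
```

The drift first produces readings classed as potential failures, which is
the window in which the wheel can be adjusted or replaced before out-of-spec
product is made.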


Figure 7.6: On-condition maintenance and SPC
(zone 1: in control and in specification = OK; zone 2: out of control and in
specification = potential failure; zone 3: out of control and out-of-spec =
functional failure)

Primary effects monitoring

Primary effects (speed, flow rate, pressure, temperature, power, current,
etc) are yet another source of information about equipment condition.
The effects can be monitored by a person reading a gauge and perhaps
recording the reading manually, by a computer as part of a process control
system, or even by a traditional chart recorder.
The records of these effects or their derivatives are compared with
reference information, and so provide evidence of a potential failure.
However, in the case of the first option in particular, take care to ensure
that:
• the person taking the reading knows what the reading should be when
  all is well, what reading corresponds to a potential failure and what
  corresponds to functional failure
• the readings are taken at intervals which are less than the P-F interval
  (in other words, the interval between readings should be less than the
  time it takes the pointer on the dial to move from the potential failure
  level to the functional failure level when the failure mode in question is
  occurring)
• the gauge itself is maintained in such a way that it is sufficiently
  accurate for this purpose.

Figure 7.7: Gauge marked with potential failure and functional failure zones
for on-condition maintenance

The process of taking the readings can be simplified if the gauge is marked
up (or even coloured) as shown in Figure 7.7. All the operator or anyone else
needs to do is look at the gauge and report if the pointer is in the potential
failure zone, or take more drastic action if it is in the functional failure
(red?) zone. However, the gauge must still be monitored at intervals which are
less than the P-F interval. (For obvious reasons, this only applies to gauges
which are measuring a steady state. Also take care to ensure that gauges
marked up in this way are not taken off and remounted in the wrong place.)

The human senses

Perhaps the best known on-condition inspection techniques are those based
on the human senses (look, listen, feel and smell). The two main
disadvantages of using these senses to detect potential failures are that:
• by the time it is possible to detect most failures using the human senses,
  the process of deterioration is already quite far advanced. This means
  that the P-F intervals are usually short, so the checks must be done more
  frequently than most and response has to be rapid
• the process is subjective, so it is difficult to develop precise inspection
  criteria, and the observations depend very much on the experience and
  even the state of mind of the observer.
The advantages are as follows:
• the average human being is highly versatile and can detect a wide variety
  of failure conditions, whereas any one condition monitoring technique can
  only be used to monitor one type of potential failure
• it can be very cost-effective if the monitoring is done by people who are
  at or near the assets anyway in the course of their normal duties
• a human is able to exercise judgement about the severity of the potential
  failure and hence about the most appropriate action to be taken, whereas
  a condition monitoring device can only take readings and send a signal.
Selecting the Right Category
Many failure modes are preceded by more than one (often several) different
potential failures, so more than one category of on-condition task might be
appropriate. Each of these will have a different P-F interval, and each will
require different types and levels of skill.

For example, consider the failure of a ball bearing. Figure 7.8 shows how
this failure could be preceded by a variety of potential failures, each
of which could be detected by a different on-condition task.

Figure 7.8: Different potential failures which can precede one failure mode
(the curve starts at the point where failure starts to occur)

This does not necessarily mean that all ball bearings will exhibit these
potential failures, nor will they necessarily have the same P-F intervals.
The extent to which any one technique is technically feasible and worth
doing depends very much on the operating context of the bearing. For
instance:
• the bearing may be buried so deep in the machine that it is impossible to
  monitor its vibration characteristics
• it is only possible to detect particles in the oil if the bearing is
  operating in a totally enclosed oil-lubricated system
• background noise levels may be so high that it is impossible to detect the
  noise made by a failing bearing
• it may not be possible to reach the bearing housing to feel how hot it is.

This means that no one category of tasks will always be more cost-effective
than any other. It is important to bear this in mind, because there is a
tendency to present condition monitoring in particular as 'the answer' to
all our maintenance problems.


In fact, if RCM is correctly applied, it is not unusual to find that
condition monitoring as defined in this part of this chapter is technically
feasible for only a modest proportion of failure modes, and worth doing in
less than half these cases. (All four categories of on-condition maintenance
together are usually suitable for about 25 - 35% of failure modes.) This is
not meant to imply that condition monitoring should not be used (where it is
good it is very good) but that we must also remember to develop suitable
strategies for managing the other 90% of our failure modes. In other words,
condition monitoring is only part of the answer, and a fairly small part at
that.
So to avoid unnecessary bias in task selection, we need to:
• consider all the warnings which are reasonably likely to precede each
  failure mode, together with the full range of on-condition tasks which
  could be used to detect those warnings
• apply the RCM task selection criteria rigorously to determine which (if
  any) of the tasks is likely to be the most cost-effective way of
  anticipating the failure mode under consideration.
As with so much else in maintenance, the right choice depends on the
operating context of the asset.

7.5 On-condition Tasks: Some of the Pitfalls

When considering the technical feasibility of on-condition maintenance,
two issues need special care. These are the distinction between potential
and functional failures, and the distinction between potential failure
and age. These issues are discussed in more detail below.
Potential and functional failures
In practice, confusion often arises over the distinction between potential
and functional failures. This happens because certain conditions can
correctly be regarded as potential failures in one context and as functional
failures in another. This is especially common in the case of leaks.
For example, a minor leak in a flanged joint on a pipeline might be regarded
as a potential failure if the pipeline is carrying water. In this case, the
on-condition task would be 'Check pipe joints for leaks'. The task frequency
is based on the amount of time it takes for an 'acceptable' minor leak to
become an unacceptable major leak, and suitable corrective action would be
initiated whenever a minor leak was discovered.

However, if the same pipeline was carrying a toxic substance like cyanide, any
leak at all would be regarded as a functional failure. In this case it is not
feasible to check for leaks, so some other method would need to be found to
prevent them. This would almost certainly entail some sort of modification.



These examples show how important it is to agree what is meant by a
functional failure before deciding what should be done to prevent it.

The P-F interval and operating age
When applying these principles for the first time, people often have
difficulty in distinguishing between the 'life' of a component and the P-F
interval. This leads them to base on-condition task frequencies on the real
or imagined 'life' of the item. If it exists at all, this life is usually many
times greater than the P-F interval, so the task achieves little or nothing.
In reality, we measure the life of a component forwards from the moment
it enters service. The P-F interval is measured back from the functional
failure, so the two concepts are often completely unrelated. The distinction
is important because failures which are not related to age (in other words,
random failures) are as likely to be preceded by a warning as those which are
not random.

Figure 7.9: Random failures and the P-F interval
(Horizontal axis: age in years. Inspections are done at 2-monthly intervals,
so each potential failure is detected at least 2 months before the functional
failure.)

For example, Figure 7.9 depicts a component which conforms to a random failure
pattern (pattern E). One of the components failed after five years, a second
after six months and a third after two years. In each case, the functional
failure was preceded by a potential failure with a P-F interval of four months.
Figure 7.9 shows that in order to detect the potential failure, we need to do an
inspection task every 2 months. Because the failures occur on a random basis,
we don't know when the next one is going to happen, so the cycle of inspections
must begin as soon as the item is put into service. In other words, the timing of
the inspections has nothing to do with the age or life of the component.
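The independence of inspection timing from component age can be checked with a small calculation. The following is a minimal sketch in Python, assuming that inspections take place at fixed multiples of the task interval from the moment the item enters service and that a potential failure is always found by the first inspection at or after it appears. The function name and the use of months as the unit are illustrative choices only.

```python
def first_detection_age(failure_age, pf_interval, task_interval):
    """Age (in months) of the first inspection at or after the potential failure."""
    potential_failure_age = failure_age - pf_interval
    # Inspections happen at task_interval, 2 * task_interval, ... from entry into service.
    inspections_needed = -(-potential_failure_age // task_interval)  # ceiling division
    return max(1, inspections_needed) * task_interval


# Failure ages from the example: five years, six months and two years (in months).
for failure_age in (60, 6, 24):
    detected = first_detection_age(failure_age, pf_interval=4, task_interval=2)
    warning = failure_age - detected
    print(f"fails at {failure_age} months, detected at {detected}, warning {warning} months")
```

Whatever the age at failure, the same 2-monthly cycle gives a warning of at least the P-F interval minus the task interval, which is why the cycle has to start as soon as the item enters service.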

However, this does not mean that on-condition tasks apply only to items
which fail on a random basis. They can also be applied to items which
suffer age-related failures, as discussed later in this chapter.


7.6 Linear and Non-linear P-F Curves

Part 1 of this chapter explained that the final stages of deterioration can
be described by the P-F curve. In this part of this chapter, we consider the
P-F curve in more detail, starting with a look at non-linear P-F curves and
then going on to consider linear P-F curves.
The P-F curve suggests that deterioration accelerates in the final stages.
To see why this is so, let us consider in more detail what happens when a
ball bearing fails due to 'normal wear and tear'.
Figure 7.10 overleaf illustrates a typical ball bearing which is rotating
clockwise. The most heavily and frequently loaded part of the bearing will be
the bottom of the outer race. As the bearing rotates, the inner surface of the
outer race moves up and down as each ball passes over it. These cyclic
movements are tiny, but they are sufficient to cause subsurface fatigue cracks
which develop as shown in Figure 7.10.
Figure 7.10 also shows how these cracks eventually give rise to detectable
symptoms of deterioration. These are of course potential failures, and the
associated P-F intervals are shown in [...].
This example raises several points about potential failures. Because
deterioration accelerates, if a quantitative technique such as vibration
analysis is used to detect potential failures, we cannot predict when failure
will occur by drawing a straight line based on only two observations.
This in turn leads to the notion that after an initial deviation is observed,
additional vibration readings should be taken at progressively shorter
intervals until some further point is reached at which action should be
taken. In practice, this can only be done if the P-F interval is long enough
to allow time for the additional readings. It also does not escape the fact
that the initial readings need to be taken at a frequency which is known to
be less than the P-F interval.
(In fact, if the shape of the P-F curve is reasonably well known and the P-F
interval is reasonably consistent, it should not be necessary to take
additional readings after the first sign of deviation is discovered. This
suggests that the process of deterioration should only be tracked by taking
additional readings if the P-F curve is poorly understood or if the P-F
interval is highly inconsistent.)



Strains on the outer race eventually cause subsurface fatigue cracks.

Cracks migrate to the surface of the outer race.

Ball forces lubricant into the crack, causing a sliver of metal to stand
proud of the surface. This is sheared off, forming a particle which can be
detected by oil analysis in enclosed systems.

The crater left behind changes the vibration characteristics of the bearing,
and can be detected initially by vibration analysis. As the balls pass over
the crater, they make it bigger. Soon the balls themselves get damaged because
they are no longer rolling on a smooth surface. At some point, the bearing
becomes audibly noisy, and then starts getting hotter. Deterioration continues
at an accelerating pace until the balls eventually disintegrate and the
bearing seizes.

Figure 7.10:
How a rolling element bearing fails due to 'normal wear and tear'


Different failure modes can often exhibit similar symptoms.

For example, the symptoms described in Figure 7.10 are based on failure due
to normal wear and tear. Very similar symptoms would be exhibited in the final
stages of the failure of a bearing where the failure process has been initiated
by dirt, lack of lubrication or brinelling.

In practice, the precise root cause of many failures can only be identified
using sophisticated instruments. For instance, it might be possible to
determine the root cause of the failure of a bearing by using a ferrograph to
separate particles from the lubricating oil and examining the particles under
an electron microscope.
However, if two different failures have the same symptoms, and if the
P-F interval is broadly similar for each set of symptoms (as it probably
would be in the case of the bearing examples), the distinction between
root causes is irrelevant from the failure detection viewpoint. (The
distinction does of course become relevant if we are seeking to eliminate
the root cause of the failure.)
Failure only becomes detectable when the fatigue cracks migrate to the
surface and the surface starts breaking up. The point at which this happens
in the life of any one bearing depends on the speed of rotation of the
bearing, the magnitude of the load, the extent to which the outer race itself
rotates, whether the bearing surface is damaged prior to or during
installation, how hot the bearing gets in service, the alignment of the shaft
relative to the housing, the materials used to manufacture the bearing, how
well it was made, etc. Effectively this combination of variables makes it
impossible to predict how many operating cycles must elapse before the cracks
reach the surface, and hence when the bearing will start exhibiting the
symptoms mentioned in Figure 7.10. (For those interested in pursuing this
subject further, chaos theory, and in particular the 'butterfly effect',
shows how tiny differences between the initial conditions which apply to any
dynamic system lead to dramatic differences after the passage of time. This
may explain why minute variations between the initial conditions of two
rolling element bearings can lead to large differences between the ages at
which they fail. See [...].)
Deterioration accelerates in the final stages of most failures. For instance,
deterioration is likely to accelerate when bolts start to loosen, when filter
elements get blinded, when V-belts slacken and start slipping, when
electrical contactors overheat, when seals start to fail, when rotors become
unbalanced and so on. But it does not accelerate in every case.


Linear P-F curves

If an item deteriorates in a more or less linear fashion over its entire life,
it stands to reason that the final stages of deterioration will also be more
or less linear. A close look at Figures 6.2 and 6.3 suggests that this is likely
to be true of age-related failures.
For example, consider tyre wear. The surface of a tyre is likely to wear in a more
or less linear fashion until the tread depth reaches the legal minimum. If this
minimum is (say) 2 mm, it is possible to specify a depth of tread greater than 2
mm which provides adequate warning that functional failure is imminent. This
is of course the potential failure level.
If the potential failure is set at (say) 3 mm, then the P-F interval is the distance
the tyre could be expected to travel while its tread depth wears down from 3 mm
to 2 mm, as illustrated in Figure 7.11.
Figure 7.11: A linear P-F curve
(Vertical axis: tread depth, 12 mm when new. Horizontal axis: operating age,
x 1 000 km. P-F interval: at least 5 000 km.)

Figure 7.11 also suggests that if the tyre enters service with a tread depth of (say)
12 mm, it should be possible to predict the P-F interval based on the total distance
usually covered before the tyre has to be retreaded. For instance, if the tyres last
at least 50 000 km before they have to be retreaded, it is reasonable to conclude
that the tread wears at a maximum rate of 1 mm for every 5 000 km travelled. This
amounts to a P-F interval of 5 000 km. The associated on-condition task would
call for the driver to:
'Check tread depth every 2 500 km and report tyres whose tread
depth is less than 3 mm.'
Not only will this task ensure that wear is detected before it exceeds the legal limit,
but it also allows plenty of time (2 500 km in this case) for the vehicle operators
to plan to remove the tyre before it reaches the limit.
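The arithmetic behind the tyre example can be written out as a short calculation. This is a minimal sketch in Python, assuming strictly linear wear from the new tread depth to the failure level. The function and parameter names are illustrative, and the choice of half the P-F interval as the task interval simply mirrors the 2 500 km figure used in the example.

```python
def linear_pf_interval(new_level, failure_level, potential_failure_level, life):
    """P-F interval for linear deterioration, in the same units as `life`.

    The wear rate is (new_level - failure_level) / life, so the distance needed
    to wear from the potential failure level down to the failure level is
    life * (potential_failure_level - failure_level) / (new_level - failure_level).
    """
    return life * (potential_failure_level - failure_level) / (new_level - failure_level)


# Tyre example: 12 mm when new, 2 mm legal minimum, 3 mm potential failure level,
# and at least 50 000 km of life down to the legal minimum.
pf_interval_km = linear_pf_interval(12, 2, 3, 50_000)
task_interval_km = pf_interval_km / 2  # assumption: check at half the P-F interval
print(pf_interval_km, task_interval_km)
```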

In general, linear deterioration between 'P' and 'F' is only likely to be
encountered where the failure mechanisms are intrinsically age-related
(except in the case of fatigue, which is a somewhat more complex case.
This failure process is discussed in more detail later.)


Note that the P-F interval and the associated task frequency can only
be deduced in this way if deterioration is linear. As we have seen, the P-F
interval cannot be determined in this way if deterioration accelerates
between 'P' and 'F'.
A further point about linear failures concerns the point at which one
should start to look for potential failures.
For instance, Figure 7.11 suggests that it would be a waste of time to measure
the overall depth of the tyre tread at ten or twenty thousand km, because we know
that it only approaches the potential failure point at 50 000 km. So perhaps we
should only start measuring the tread depth of each tyre after it has passed the
point where we know tread depth will be approaching 3 mm, in other words, when
the tyre has been in service for more than 40 000 km.
However, if we want to ensure that this checking policy is adopted in practice,
consider how the checks for a 4-wheeled truck would have to be planned if the
history of a set of tyres is as follows:

Distance travelled by truck and by each tyre (km)

Truck               140 000   142 500   145 000   147 500
Left front tyre      47 500    50 000       ...    1 000*
Right front tyre     22 000    24 500    27 000       ...
Left rear tyre       12 500       ...     4 500     7 000
Right rear tyre      38 000    40 500    43 000    45 500

* Tread depth of L/F tyre dropped below 3 mm and tyre replaced with new spare
† 6 inch nail caused tyre to blow out at 13 000 km
If we are seriously going to try to ensure that the driver only checks each
tyre after it passes 40 000 km in service, we have to devise a system which
tells him to:
start checking the L/F tyre only when the truck reached 132 500 km
check the L/F and R/R tyres when the truck reached 142 500 km and again
at 145 000 km
but check the R/R tyre only at 147 500 km.
Clearly this is nonsense, because the cost of administering such a planning
system would be far greater than the cost of simply checking the depth of
every tyre on the vehicle every 2 500 km. In other words, in this example
the cost of fine-tuning the planning system would be far greater than the
cost of doing the tasks. So we would simply ask the driver to check the tread
depth of every tyre at 2 500 km intervals, rather than direct his attention to
specific tyres at specific times.
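The bookkeeping that such a planning system would need can be sketched in a few lines, which also shows why it is not worth the effort here. This is an illustrative Python sketch only: it uses the tyre distances at a truck reading of 140 000 km, ignores later replacements (which would each need their own entry), and the names are hypothetical.

```python
CHECK_FROM_KM = 40_000  # start checking a tyre once it has covered this distance


def truck_km_to_start_checking(truck_km, tyre_km, check_from=CHECK_FROM_KM):
    """Truck odometer reading at which each tyre passes `check_from` km in service."""
    return {position: truck_km + (check_from - km) for position, km in tyre_km.items()}


# Distance covered by each tyre when the truck has travelled 140 000 km.
tyre_km = {"L/F": 47_500, "R/F": 22_000, "L/R": 12_500, "R/R": 38_000}
start_at = truck_km_to_start_checking(140_000, tyre_km)
for position, km in sorted(start_at.items(), key=lambda item: item[1]):
    print(position, km)
```

Every tyre ends up with its own starting point, and every replacement changes the schedule again, so a single instruction to check all tyres every 2 500 km is far cheaper to administer.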

However, if the process of deterioration is linear and the task itself is very
expensive, then it might be worth ensuring that we only start looking for
potential failures when it is really necessary.
For instance, if an on-condition task entails shutting down and opening up a
turbine to check the turbine discs for cracks, and we are certain that
deterioration only becomes detectable after the turbine has been in service
for a certain amount of time (in other words, the failure is age-related),
then we should only take the turbine out of service to check for the cracks
after it has passed the age at which there is a reasonable likelihood that
detectable cracks will start to emerge. Thereafter, the frequency of checking
is based on the rate at which a detectable crack is likely to deteriorate into
a failure.
For the record, the age at which cracks are likely to start becoming detectable
is known as the crack initiation life, whereas the time (or number of stress
cycles) which elapses from the moment a crack becomes detectable until it grows
so large that the item fails is known as the crack propagation life.

In cases like these, the cost of doing the task would be much greater than
the cost of the associated planning systems, so it is worth ensuring that we
only start doing the tasks when it is really necessary. However, if it is felt
that this is worthwhile, bear in mind that the planning process has to employ
two completely different time-frames, as follows:
the first time-frame is used to decide when we should start doing the
on-condition tasks. This is the operating age at which potential failures are
likely to start becoming detectable.
the second time-frame governs how often we should do the tasks after this age
has been reached. This time-frame is of course the P-F interval.
For example, it might be felt that the turbine disc is unlikely to develop any
detectable cracks until it has been in service for at least 50 000 hours, but
that it takes a minimum of ten thousand hours for a detectable crack to
deteriorate into disc failure. This suggests that we don't need to start
checking for cracks until the item has been in service for 50 000 hours, but
thereafter it must be checked at intervals of less than ten thousand hours.
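The two time-frames can be illustrated with a short schedule generator. This is a minimal sketch in Python, not a description of any real planning system. The function name is hypothetical, and checking at half the P-F interval is an assumption made only to satisfy the requirement that the task interval be less than the P-F interval.

```python
def on_condition_check_ages(start_age, pf_interval, horizon, fraction=0.5):
    """Operating ages at which to do the task.

    No checks are scheduled before `start_age` (first time-frame); after that,
    checks recur at `fraction` of the P-F interval (second time-frame).
    """
    task_interval = pf_interval * fraction
    ages = []
    age = start_age
    while age <= horizon:
        ages.append(age)
        age += task_interval
    return ages


# Turbine disc example: no detectable cracks expected before 50 000 hours,
# and at least 10 000 hours from a detectable crack to disc failure.
ages = on_condition_check_ages(start_age=50_000, pf_interval=10_000, horizon=70_000)
print(ages)
```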

Planning with this degree of sophistication requires a very detailed
understanding of the failure mode under consideration, together with highly
sophisticated planning systems. In practice, few failure modes are this well
understood. When they are, even fewer organisations possess planning
systems which can switch from one time-frame to another as described
above, so this issue needs to be approached with care.
In closing this discussion, it must be stressed that all the curves (P-F
and age-related) which have been drawn in this part of this chapter have
been drawn for one failure mode at a time.
For instance, in the example concerning tyres, the failure process was 'normal'
wear. Different failure modes (such as flat spots worn on the tyres due to
emergency braking or damage to the carcass caused by hitting kerbs) would lead
to different conclusions because both the technical characteristics and the
consequences of these failure modes are different.

It is one matter to speculate on the nature of P-F curves in general, but it
is quite another to determine the magnitude of the P-F interval in practice.
This issue is considered in the next section of this chapter.


7.7 How to Determine the P-F Interval

It is usually a simple matter to determine the P-F interval for age-related
failure modes whose final stages of deterioration are linear. It is done by
applying logic similar to that used in the tyre example.
On the other hand, the P-F interval can be surprisingly difficult to
determine in the case of random failures where deterioration accelerates. The
main problem with random failures is that we don't know when the next
one is going to occur, so we don't know when the next failure mode is
going to start on its way down the P-F curve. So if we don't even know
where the P-F curve starts, how can we go about finding out how long it is?
The following paragraphs review five possibilities, only the fourth and
fifth of which have any merit.
Continuous observation
In theory, it is possible to determine the P-F interval by continuously
observing an item which is in service until a potential failure occurs,
noting when that happens, and then continuing to observe the item until it
fails completely. (Note that we cannot chart a full P-F curve by observing an
item intermittently, because when we eventually discovered that it was
failing we still wouldn't know precisely when the failure process started.
What is more, if the P-F interval is shorter than the intermittent
observation period we might miss the P-F curve altogether, in which case we
would have to start all over again with a new item.)
Clearly this approach is impractical, firstly because continuous observation
is very expensive, especially if we were to try to establish every P-F
interval in this way. Secondly, waiting until the functional failure occurs
means that the item actually has to fail. This might end up with us explaining
to the boss after (say) the compressor blew up: "Oh, we knew it was failing,
but we just wanted to see how long it would take before it finally went so
that we could determine the P-F interval!"
Start with a short interval and gradually extend it
The impracticality of the above approach sometimes leads people to suggest
that P-F intervals can be established by starting the checks at some fairly
short but arbitrary interval (say 10 days), and then gradually increasing the
interval until "we find out what the interval should be". However, this is
the point at which the functional failure occurs, so we would still end up
blowing up the compressor.

This approach is of course potentially very dangerous, because there is also
no guarantee that the initial interval, no matter how short, will be shorter
than the P-F interval to begin with (unless serious consideration is given to
the failure process itself).
Arbitrary intervals
The difficulties associated with the two approaches described above lead
some people to suggest quite seriously that some arbitrary 'reasonably
short' interval should be selected for all on-condition tasks. This
arbitrary approach is the least satisfactory (and the most dangerous) way
to set on-condition task frequencies, because there is again no guarantee
that the arbitrary interval will be shorter than the P-F interval. On the
other hand, the true P-F interval may be much longer than the arbitrary
interval, in which case the task ends up being done much more often than
necessary.
For instance, if a daily task really only needs to be done once a month, that
task is costing thirty times as much as it should.

The best way to establish a precise P-F interval is to simulate the failure
in such a way that there are no serious consequences when it eventually
does occur. For example, this is done when aircraft components are tested
to failure on the ground rather than in the air. This not only provides data
about the life of the components, as discussed in Chapter 6, but it also
enables the observers to study at leisure how failures develop and how
quickly this happens. However, laboratory testing is expensive and it
takes time to yield results, even when it is accelerated. So it is only worth
doing in cases where a fairly large number of components are at risk
(such as an aircraft fleet) and the failures have very serious consequences.
A rational approach
The above paragraphs indicate that in most cases, it is either impossible,
impractical or too expensive to try to determine P-F intervals on an
empirical basis. On the other hand, it is even more unwise simply to take a
shot in the dark. Despite these problems, P-F intervals can still be estimated
with surprising accuracy on the basis of judgement and experience.
The first trick is to ask the right question. It is essential that anyone who
is trying to determine a P-F interval understands that we are asking how
quickly the item fails. In other words, we are asking how much time (or how
many stress cycles) elapse from the moment the potential failure becomes
detectable until the moment it reaches the functionally failed state.
We are not asking how often it fails or how long it lasts.
The second trick is to ask the right people: people who have an intimate
knowledge of the asset, the ways in which it fails and the symptoms of each
failure. For most equipment, this usually means the people who operate it,
the craftsmen who maintain it and their first-line supervisors. If the
detection process requires specialised instruments such as condition
monitoring equipment, then appropriate specialists should also take part in
the analysis.

In practice, the author has found that an effective way to crystallise thinking
about P-F intervals is to provide a number of mental 'coat-hooks' on which
people can hang their thoughts. For instance, one could ask: "do you think that
the P-F interval is likely to be of the order of days, weeks or months?" If the
answer is weeks, the next step is to ask: "One, two, four or eight weeks?"
If everyone in the group achieves consensus, then the P-F interval has
been established and the group can go on to consider other task selection
criteria such as the consistency of the P-F interval and whether the nett
interval is long enough to avoid the failure consequences.
If the group cannot achieve consensus, then it is not possible to give
a positive answer to the question "what is the P-F interval?". When this
happens, the associated on-condition task must be abandoned as a way of
detecting the failure mode under consideration, and the failure must be
dealt with in some other way.
The third trick is to concentrate on one failure mode at a time. In other
words, if the failure mode is wear, then the analysis should concentrate
on the characteristics of wear, and should not discuss (say) corrosion or
fatigue (unless the symptoms of the other failure modes are almost identical
and the rate of deterioration is also very similar).
Finally, it must be clearly understood by everyone taking part in such an
analysis that the objective is to arrive at an on-condition task interval
which is less than the P-F interval, but not so much less that resources will
be wasted on the checking process.
The effectiveness of such a group is redoubled if management expresses an
appreciation of the fact that it is being carried out by humans, and that
humans are not infallible. However, the group must also recognise that if the
failure has safety consequences, getting it wrong could (literally) be fatal
for themselves or their colleagues, so they need to take special care in this
area.


7.8 When On-condition Tasks are Worth Doing

On-condition tasks must satisfy the following criteria to be worth doing:

if a failure is hidden, it has no direct consequences. So an on-condition
task intended to prevent a hidden failure should reduce the risk of the
multiple failure to an acceptably low level. In practice, because the
function is hidden, many of the potential failures which normally affect
evident functions would also be hidden. What is more, much of this type
of equipment suffers from random failures with very short or non-existent
P-F intervals, so it is fairly unusual to find an on-condition task
which is technically feasible and worth doing for a hidden function. But
this does not mean that one should not be sought.

if the failure has safety or environmental consequences, an on-condition
task is only worth doing if it can be relied on to give enough warning
of the failure to ensure that action can be taken in time to avoid the safety
or environmental consequences.

if the failure does not involve safety, the task must be cost-effective, so
over a period of time, the cost of doing the on-condition task must be
less than the cost of not doing it. The question of cost-effectiveness
applies to failures with operational and non-operational consequences,
as follows:
- Operational consequences are usually expensive, so an on-condition task
which reduces the rate at which operational consequences occur is likely
to be cost-effective. This is because the cost of inspection is usually
low. This was illustrated in the example on pages 104 and 105.
- The only cost of a functional failure which has non-operational
consequences is the cost of repair. Sometimes this is almost the same as
the cost of correcting the potential failure which precedes it. In such
cases, even though an on-condition task may be technically feasible,
it would not be cost-effective, because over a period of time, the cost
of the inspections plus the cost of correcting the potential failures
would be greater than the cost of repairing the functional failure (see
pages 107 and [...]). However, an on-condition task may be justified
if the functional failure costs a lot more to repair than the potential
failure, especially if the former causes secondary damage.
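The cost-effectiveness test described above can be sketched as a simple comparison over a fixed period. This Python fragment is illustrative only: it assumes that every failure in the period is caught as a potential failure when the task is done, and all names and figures are hypothetical rather than taken from the examples cited in the text.

```python
def on_condition_costs(inspections, cost_per_inspection, failures,
                       cost_to_correct_potential_failure, cost_of_functional_failure):
    """Return (cost of doing the task, cost of not doing it) over the same period."""
    cost_with_task = (inspections * cost_per_inspection
                      + failures * cost_to_correct_potential_failure)
    cost_without_task = failures * cost_of_functional_failure
    return cost_with_task, cost_without_task


# Hypothetical case 1: the functional failure is far more expensive than the
# potential failure (for example because of operational consequences).
with_task, without_task = on_condition_costs(120, 10, 2, 500, 5_000)
print(with_task < without_task)

# Hypothetical case 2: repairing the functional failure costs almost the same as
# correcting the potential failure, so the inspections are not worth doing.
with_task_2, without_task_2 = on_condition_costs(120, 10, 2, 500, 600)
print(with_task_2 < without_task_2)
```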


7.9 Selecting Proactive Tasks

It is seldom difficult to decide whether a proactive task is technically
feasible. The characteristics of the failure govern this decision, and they
are usually clear enough to make the decision a simple yes/no affair.
Deciding whether they are worth doing usually needs more judgement.
For instance, Figure 7.8 indicates that it may be technically feasible for
two or more tasks of the same category to prevent the same failure mode.
They may even be so closely matched in terms of cost-effectiveness that
which one is chosen becomes a matter of personal preference.
The situation is complicated further when tasks from two different
categories are both technically feasible for the same failure mode.

For example, most countries nowadays specify a minimum legal tread depth for
tyres (usually about 2 mm). Tyres which are worn below this depth must either be
replaced or retreaded. In practice, truck tyres, especially tyres on similar
vehicles in a single fleet working the same routes, show a fairly close
relationship between age and failure. Retreading restores nearly all the
original failure resistance, so the tyres could be scheduled for restoration
after they have covered a set distance. This means that all the tyres in the
truck fleet would be retreaded after they had covered the specified mileage,
whether or not they needed it.
Figure 6.4 in chapter 6, repeated below as Figure 7.12, could have been drawn
for just such a fleet. This shows that in terms of normal wear, all the tyres last
between 50 000 and 80 000 km. If a scheduled restoration policy were to be
adopted on the basis of this information, there is an increase in the conditional
probability of this failure mode at 50 000 km and none of these failures occur
before this age, so all of the tyres would be retreaded at 50 000 km. However, if
this policy were adopted many tyres would be retreaded long before it was really
necessary. In some cases, tyres which could have lasted as much as 80 000 km
would be retreaded at 50 000 km, so they could lose up to 30 000 km of useful life.
On the other hand, as discussed in part 6 of this chapter, it is possible to define
a potential failure condition for tyres related to tread depth. Checking tread depth
is quick and easy, so it is a simple matter to check the tyres every 2 500 km and
to arrange for them to be retreaded only when they need it. This would enable the
fleet operator to get an average of 65 000 km out of his tyres (in terms of normal
wear) without endangering his drivers, instead of the 50 000 km which he gets if
he does the scheduled restoration task described above, an increase in useful
tyre life of 30%. So in this case on-condition tasks are much more cost-effective
than scheduled restoration.
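The figures quoted in the tyre comparison follow from simple arithmetic, shown here as a short Python sketch. The variable names are illustrative, and the values are those given in the example.

```python
restoration_age_km = 50_000       # every tyre retreaded at this age under scheduled restoration
average_on_condition_km = 65_000  # average life when tyres are retreaded only when they need it
longest_life_km = 80_000          # upper end of the range of tyre lives in the example

gain = (average_on_condition_km - restoration_age_km) / restoration_age_km
max_life_lost_km = longest_life_km - restoration_age_km
print(f"increase in useful life: {gain:.0%}")
print(f"most life lost by one tyre under scheduled restoration: {max_life_lost_km} km")
```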

Figure 7.12: Failure of tyres due to normal wear in a hypothetical truck fleet
(Horizontal axis: Age, x 10 000)

On-condition tasks
On-condition tasks are considered first in the task selection process, for
the following reasons:
they can nearly always be performed without moving the asset from its
installed position and usually while it is in operation, so they seldom
interfere with the production process. They are also easy to organise.
they identify specific potential failure conditions so corrective action
can be clearly defined before work starts. This reduces the amount of
repair work to be done, and enables it to be done more quickly.
by identifying equipment on the point of potential failure, they enable it to
realise almost all of its useful life (as illustrated by the tyre example).
Scheduled restoration
If a suitable on-condition task cannot be found for a particular failure, the
next choice is a scheduled restoration task. It too must be technically
feasible, so the failures must be concentrated about an average age. If they
are, scheduled restoration prior to this age can reduce the incidence of
functional failures. This may be cost-effective for failures with major
economic consequences, or if the cost of doing the scheduled restoration
task is significantly lower than the cost of repairing the functional failure.
The disadvantages of scheduled restoration are that:
it can only be done when items are stopped and (usually) sent to the
workshop, so the tasks nearly always affect production in some way
the age limit applies to all items, so many items or components which
might have survived to higher ages will be removed
restoration tasks involve shop work, so they generate a much higher
workload than on-condition tasks.
However, scheduled restoration is more conservative than scheduled discard
because it involves restoring items instead of throwing them away.
Scheduled discard tasks
Scheduled discard is usually the least cost-effective of the three proactive
tasks, but where it is technically feasible, it does have a few desirable
features. Safe-life limits may be able to prevent certain critical failures,
while an economic-life limit can reduce the frequency of functional failures
that have major economic consequences. However, these tasks suffer from all
the same disadvantages as scheduled restoration tasks.

For a very small number of failure modes which have safety or environmental
consequences, a task cannot be found which on its own reduces the risk of
failure to an acceptably low level, and a suitable modification does not
suggest itself. In these cases, it is sometimes possible to find a combination
of tasks (usually from two different task categories, such as an on-condition
task and a scheduled discard task), which reduces the risk of the failure to
an acceptable level. Each task is carried out at the frequency appropriate for
that task. However, it must be stressed that situations in which this is
necessary are very rare, and care should be taken not to employ such tasks on a
'belt and braces' basis.

The task selection process
The task selection process is summarised in Figure 7.13. This basic order of
preference is valid for the vast majority of failure modes, but it does not
apply in every single case. If a lower order task is clearly going to be a more
cost-effective method of managing a failure than a higher order task, then the
lower order task should be selected.

Figure 7.13: The task selection process
Do the on-condition task at intervals less than the P-F interval
Do the scheduled restoration task at intervals less than the age limit
Do the scheduled discard task at intervals less than the age limit
Default action depends on the failure consequences
(See chapters 4, 8 and 9)
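The basic order of preference can be expressed as a short selection routine. This is a minimal sketch in Python under the assumption that each candidate task has already been judged on the two criteria used throughout this chapter (technically feasible, and worth doing). The names are hypothetical, and the sketch deliberately ignores the exception noted above, where a lower order task is clearly more cost-effective than a higher order one.

```python
PREFERENCE_ORDER = ["on-condition", "scheduled restoration", "scheduled discard"]


def select_task(assessments):
    """Return the first task category that is both technically feasible and worth doing.

    `assessments` maps a task category to a (technically_feasible, worth_doing) pair.
    Categories that were not assessed are treated as unsuitable.
    """
    for category in PREFERENCE_ORDER:
        feasible, worth_doing = assessments.get(category, (False, False))
        if feasible and worth_doing:
            return category
    return "default action"  # which one depends on the failure consequences


choice = select_task({
    "on-condition": (True, False),          # feasible, but not worth doing
    "scheduled restoration": (True, True),
})
print(choice)
```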

Default Actions 1:
Failure-finding Tasks

8.1 Default Actions

Preceding chapters have mentioned that if a proactive task cannot be found
which is both technically feasible and worth doing for any failure mode,
then the default action which must be taken is governed by the consequences
of the failure, as follows:

if a proactive task cannot be found which reduces the risk of the multiple
failure associated with a hidden function to a tolerably low level,
then a periodic failure-finding task must be performed. If a suitable
failure-finding task cannot be found, then the secondary default decision
is that the item may have to be redesigned.

if a proactive task cannot be found which reduces the risk of a failure
which could affect safety or the environment to a tolerably low level, the
item must be redesigned or the process must be changed.

if a proactive task cannot be found which costs less over a period of time
than a failure which has operational consequences, the initial default
decision is no scheduled maintenance. If this occurs and the operational
consequences are still unacceptable, then the secondary default
decision is once again redesign.

if a proactive task cannot be found which costs less over a period of
time than a failure which has non-operational consequences, the initial
default decision is no scheduled maintenance, and if the repair costs are too
high, the secondary default decision is once again redesign.

The location of the default actions in the RCM decision framework is
shown in Figure 8.1 opposite. At this point, we are answering the seventh
of the seven questions which make up the RCM decision process:
what should be done if a suitable proactive task cannot be found?
This chapter deals with failure-finding. Chapter 9 deals with the other
default actions, and also considers routine tasks which fall outside the RCM
decision framework such as walk-around checks.
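The default decisions listed above amount to a small lookup keyed on the failure consequences. The following Python sketch is one hypothetical way of writing that down. The dictionary keys, the wording of the actions and the `initial_default_acceptable` flag are illustrative names, not part of the RCM decision diagram itself.

```python
# (initial default action, secondary default action) for each consequence category
DEFAULT_ACTIONS = {
    "hidden": ("failure-finding task", "redesign may be needed"),
    "safety or environmental": ("redesign or change the process", None),
    "operational": ("no scheduled maintenance", "redesign"),
    "non-operational": ("no scheduled maintenance", "redesign"),
}


def default_action(consequence, initial_default_acceptable=True):
    """Return the default action when no suitable proactive task can be found."""
    initial, secondary = DEFAULT_ACTIONS[consequence]
    if initial_default_acceptable or secondary is None:
        return initial
    return secondary


print(default_action("hidden"))
print(default_action("operational", initial_default_acceptable=False))
```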
Figure 8.1: Default actions
(Decision diagram. The first question reads: "Will the loss of function
caused by this failure ... become evident to the ...")

8.2 Failure-finding
Why bother?
Much of what has been written to date on the subject of maintenance strategy
refers to three types of maintenance: predictive, preventive and corrective.
Predictive tasks entail checking if something is failing. Preventive
maintenance means overhauling items or replacing components at fixed
intervals. Corrective maintenance means fixing things either when they are
found to be failing or when they have failed.
However, there is a whole family of maintenance tasks which falls into
none of the above categories.
For example, when we periodically test a fire alarm, we are not checking if it
is failing. We are simply checking if it still works.
Tasks designed to check whether something still works are known as
failure-finding tasks. Failure-finding applies only to hidden or unrevealed
failures. Hidden failures in turn only affect protective devices.
If RCM is correctly applied to almost any modern, complex industrial system,
it is not unusual to find that up to 40% of failure modes fall into the hidden
category. Furthermore, up to 80% of these failure modes require
failure-finding, so up to one third of the tasks generated by comprehensive,
correctly applied maintenance strategy development programs are
failure-finding tasks.
A more troubling finding is that at the time of writing, many existing maintenance programs provide for fewer than one third of protective devices to receive any attention at all (and then usually at inappropriate intervals). The people who operate and maintain the plant covered by these programs are aware that another third of these devices exist but pay them no attention, while it is not unusual to find that no-one even knows that the final third exist. This lack of awareness and attention means that most of the protective devices in industry - our last line of protection when things go wrong - are maintained poorly or not at all.
This situation is untenable.
If industry is serious about safety and environmental integrity, then the whole question of failure-finding needs to be given top priority as a matter of urgency. As more and more maintenance professionals become aware of the importance of this neglected area of maintenance, it is likely to become a bigger issue in the next decade than predictive maintenance has been in the last ten years. The rest of this chapter explores this issue in some detail.

Multiple failures and failure-finding

A multiple failure occurs if a protected function fails while a protective device is in a failed state. This phenomenon was illustrated on page 114. Figure 5.11 on page 117 showed that the probability of a multiple failure can be calculated as follows:

Probability of a multiple failure = Probability of failure of the protected function x Average unavailability of the protective device   .... 1
This led to the conclusion that the probability of a multiple failure can be reduced by reducing the unavailability of the protective device - in other words, by increasing its availability. Chapter 5 went on to explain that the best way to do this is to prevent the protective device from getting into a failed state by applying some sort of proactive maintenance.
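The relationship between the probability of a multiple failure, the probability of failure of the protected function and the unavailability of the protective device can be sketched in a few lines of Python. The function name and the figures used below are illustrative assumptions only, not values taken from the text.

```python
def multiple_failure_probability(p_protected_fails: float,
                                 protective_unavailability: float) -> float:
    """Probability of a multiple failure in the period under consideration.

    p_protected_fails: probability that the protected function fails
        in the period (for example, in any one year).
    protective_unavailability: average fraction of time the protective
        device spends in a failed state (0.0 to 1.0).
    """
    return p_protected_fails * protective_unavailability


# Hypothetical figures: the protected function fails in 1 year out of 10,
# and the protective device is in a failed state 1% of the time.
print(multiple_failure_probability(1 / 10, 0.01))
```

Halving the unavailability of the protective device halves the probability of the multiple failure, which is why the rest of the chapter concentrates on that variable.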



Chapters 6 and 7 described how to decide whether any sort of proactive maintenance is technically feasible and worth doing. However, when the criteria described in these two chapters are applied to hidden functions, it transpires that fewer than 10% of these functions are susceptible to any form of predictive or preventive maintenance.
Even though proactive maintenance is often inappropriate, it is still essential to do something to reduce the probability of the multiple failure to the required level. This can be done by checking periodically whether the hidden function is still working.
For example, we cannot prevent the failure of a brake light bulb. So if there is no warning circuit to show that a bulb has failed, the only way to reduce the possibility that a burnt-out bulb will fail to warn other drivers of our intentions is to check if it is still working and replace it if it has failed.
Such checks are known as failure-finding tasks.

Scheduled failure-finding entails checking a hidden function at regular intervals to find out whether it has failed

This chapter looks at the technical aspects of failure-finding, explains how to determine failure-finding intervals, defines the formal technical feasibility criteria for failure-finding and considers what should be done if a suitable failure-finding task cannot be found.

Technical aspects of failure finding



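The objective of failure-finding is to satisfy ourselves that a protective device will provide the required protection if it is called upon to do so. In other words, we are not checking whether the device looks OK - we are checking whether it still works as it should. (This is why failure-finding tasks are also known as functional checks.) The following paragraphs consider some of the key issues in this area.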
Check the entire protective system
A failure-finding task must be sure of detecting all the failure modes which are reasonably likely to cause the protective device to fail. In most cases, this means that the entire protective system should be checked from sensor to actuator. Ideally, this should be done by simulating the conditions the circuit should respond to, and checking whether the actuator gives the right response.
For example, a pressure switch may be designed to shut down a machine if the lubricating oil pressure drops below a certain level. Wherever possible, switches like this should be checked by dropping the oil pressure to the required level and checking whether the machine shuts down.
Similarly, a fire detection circuit should be checked from smoke detector to fire
alarm by blowing smoke at the detector and checking if the alarm sounds.

Do not disturb it
Dismantling anything always creates the possibility that it will be put back together incorrectly. If this happens to a hidden function, the fact that it is hidden means that no-one will know it has been left in a failed state until the next check (or until it is needed). For this reason, we should always look for ways of checking the functions of protective devices without disconnecting or otherwise disturbing them.
Having said this, some devices do have to be dismantled or removed altogether to check if they are working properly. In these cases, great care must be taken to do the task in such a way that the devices will still work when they are returned to service. (The mathematical implications of the fact that a failure-finding task might induce a failure are considered later in this chapter.)
It must be possible to check the function
In a very small but still significant number of cases, it is impossible to carry out a failure-finding task of any sort. These are:
where it is impossible to gain access to the protective device in order to check it (this is almost always a result of thoughtless design)
when the function of the device cannot be checked without destroying it (as in the case of fusible devices and rupture discs). In most of these cases, other technologies are available (such as circuit breakers instead of fuses). However, in one or two cases our only options are to find some other way of managing the risks associated with untestable protection until something better comes along, or to abandon the processes concerned.
Minimise risk while the task is being done
It should be possible to carry out a failure-finding task without significantly increasing the risk of the multiple failure.
An example of a borderline task is overspeeding something in order to check whether the overspeed protection mechanism works.
If a protective device has to be disabled in order to carry out a failure-finding task, or if such a device is checked and found to be in a failed state, then alternative protection should be provided or the protected function should be shut down until the original protection is restored. This is discussed in more detail later.
Failure-finding should not be carried out on systems where it is called for but would be too dangerous. (Where this is so, it is debatable whether society should allow such systems to exist at all.)
The frequency must be practical
It must be practical to do the failure-finding task at the required intervals. However, before we can decide whether a required interval is practical, we need to determine what interval is actually required. This issue is considered next.

8.3 Failure-finding Task Intervals

This section of this chapter describes how to determine the frequency of failure-finding tasks. It will start by explaining that this frequency depends on two variables - the desired availability and the frequency of failure of the protective device. It goes on to look at how we establish the 'desired' availability, and then examines different methods which can be used to establish failure-finding intervals under different circumstances.
Failure-finding intervals, availability and reliability
We have seen that predictive and preventive maintenance intervals are each based on just one variable (P-F interval and useful life respectively). The following paragraphs will show that not one but two variables are used to set failure-finding intervals - availability and reliability.
Figure 8.2 shows a situation in which ten motorbikes have been in service for four years. This means that the total service life of the fleet of bikes over this period is:
10 bikes x 4 years = 40 years.
The brake light on each motorbike has been checked once a year for four years. (This example assumes that no attempt is made to check the lights between annual checks.) Over the four year period, the lights have been found to be in a failed state on four occasions, as shown in Figure 8.2. So the mean time between failures of the brake lights is:
40 years in service ÷ 4 failures = 10 years.

Figure 8.2: Brake light (legend: checked/OK; checked/failed; failed during this year)

In this case, the failure-finding interval of one year is equal to 10% of the MTBF of ten years. However, we don't know exactly when each failed light ceased to function. One might have failed the day after the last check, another the day before the current check, and the rest at some time in between. All we know for sure is that each of the four lights failed some time during the year preceding the check. So in the absence of any better information, we assume that on average, each failed light failed half way through the year. In other words, on average, each of the failed lights was out of service for half a year. This means that over the four year period, our failed lights were in a failed state for a total of:
4 failed lights x 0.5 years each in a failed state = 2 years.
So on the basis of the above information, it seems that we can expect an average unavailability from our brake lights of:
2 years in a failed state ÷ 40 years in service = 5%
This corresponds to an availability of 95%.
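The arithmetic of the motorbike example can be retraced with a short Python sketch. The function and parameter names are hypothetical, and the sketch makes the same simplifying assumption as the example: each failed item is taken to have failed half way through the interval in which it was found.

```python
def fleet_mtbf_and_unavailability(units: int,
                                  years_in_service: float,
                                  failures_found: int,
                                  check_interval_years: float = 1.0):
    """Return (MTBF in years, average unavailability) for a fleet of
    identical protective devices checked at a fixed interval."""
    service_years = units * years_in_service
    mtbf = service_years / failures_found
    # Assumption from the example: on average, each failed item spent
    # half of one check interval in a failed state.
    failed_years = failures_found * 0.5 * check_interval_years
    return mtbf, failed_years / service_years


mtbf, unavailability = fleet_mtbf_and_unavailability(10, 4, 4)
print(f"MTBF = {mtbf:.0f} years, unavailability = {unavailability:.0%}")
```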

The above example suggests that there is a linear correlation between the unavailability (5%), the failure-finding interval (1 year) and the reliability of the protective device as given by its MTBF (10 years), as follows:

Unavailability = 0.5 x failure-finding interval ÷ MTBF of the protective device

It can be shown that this linear relationship is valid for all unavailabilities of less than 5%, provided that the protective device conforms to an exponential survival distribution (failure pattern E or random failure). (See Cox & Tait 1991, pp 283 - 284 or Andrews & Moss 1993, pp 110 - 112.)
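To see how close the linear relationship is, it can be compared with the average unavailability obtained by integrating an exponential failure distribution over one check interval, which gives 1 - (MTBF/FFI) x (1 - exp(-FFI/MTBF)). The sketch below is a minimal illustration under that assumption (a device that is as good as new after every check and fails at random); the function names are hypothetical.

```python
import math


def linear_unavailability(ffi: float, mtbf: float) -> float:
    """Linear approximation: half the check interval divided by the MTBF."""
    return 0.5 * ffi / mtbf


def exponential_unavailability(ffi: float, mtbf: float) -> float:
    """Average of the failure probability 1 - exp(-t/MTBF) over one
    check interval, assuming random (exponential) failures."""
    return 1.0 - (mtbf / ffi) * (1.0 - math.exp(-ffi / mtbf))


# Brake light figures: checks every year, MTBF of ten years.
print(linear_unavailability(1, 10))
print(exponential_unavailability(1, 10))
```

For these figures the two results differ by less than two tenths of a percentage point, and the gap narrows further as the interval becomes a smaller fraction of the MTBF.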
Excluding task time and repair time
Note that the 'unavailability' of the protective device does not include any unavailability incurred while the failure-finding task is being carried out, nor does it include any unavailability caused by the need to repair the device if it is found to be failed. This is so for two reasons:
the unavailability required to carry out the failure-finding task and to effect any repairs is likely to be very small indeed relative to the unrevealed unavailability between tasks, to the extent that it will usually be negligible on purely mathematical grounds
both the failure-finding task and any repairs which might be needed should be carried out under tightly controlled conditions. These conditions should greatly reduce - if not totally eliminate - the chance of a multiple failure while the intervention is under way. This entails either shutting down the protected system or arranging alternative protection until the protective device has been fully restored. If this is done properly, the unavailability resulting from the (controlled) intervention can be ignored in any assessments of the probability of multiple failure.
In the RCM decision process, the latter point is covered by the criteria for deciding whether a failure-finding task is worth doing. If there is a significant increase in the likelihood of a multiple failure while the task is under way, the task is not worth doing and the decision process defaults to the secondary default actions discussed later.

Calculating FFI using availability and reliability only

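If we use the abbreviation 'FFI' to describe the failure-finding interval and '$M_{TIVE}$' to describe the MTBF of the protective device, the linear availability equation can be rearranged as follows:
$FFI = 2 \times \text{unavailability} \times M_{TIVE}$   .... 2
This tells us that in order to determine the failure-finding interval for a single device, we need to know its mean time between failures and the desired availability of the device (from which we can determine the unavailability to be used in the formula).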
For instance, assume that the riders of our motorbikes decide that they are not satisfied with an availability of 95%, and would like to see it increased to 99%. The associated unavailability is 1%. If the MTBF of the brake lights is put at four years, the checking interval needs to be reduced from once a year to:
FFI = 2 x 1% x 4 years
= 2% of 48 months
= 1 month (approximately).
In other words, based on their availability expectations and the failure data, the bikers need to check whether their brake lights are working once a month.
If they want an availability of 99.9%, they need to check about twice a week.
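The brake light calculation can be reproduced with a small helper. This is a minimal sketch of the linear formula (FFI equals twice the unavailability multiplied by the MTBF of the protective device); the function name is an assumption, and the result comes out in whatever time unit is used for the MTBF.

```python
def failure_finding_interval(desired_availability: float, mtbf: float) -> float:
    """FFI = 2 x unavailability x MTBF, in the same unit as mtbf.

    desired_availability is a fraction, for example 0.99 for 99%.
    """
    unavailability = 1.0 - desired_availability
    return 2.0 * unavailability * mtbf


mtbf_months = 48  # four years, as in the example
print(failure_finding_interval(0.99, mtbf_months))   # a little under one month
print(failure_finding_interval(0.999, mtbf_months))  # roughly a tenth of a month
```

A tenth of a month is about three days, which is where the figure of about two checks a week comes from.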

Note that the above calculations are only valid if the brake lights on all the bikes are used about the same number of times each week. If there is a wide variation, both the MTBF and the failure-finding interval should be calculated in terms of distance travelled, or even more accurately in terms of the number of times the brakes - and hence the brake lights - are used. The key point to note at this stage is the connection between the failure-finding interval, the desired availability and the MTBF.
For people who are uncomfortable with mathematical formulae, the formula above can be used to develop a simple table, as follows:
Availability required for the hidden function:   99.99%   99.95%   99.9%   99.5%   99%   98%   95%
Failure-finding interval (as % of MTBF):         0.02%    0.1%     0.2%    1%      2%    4%    10%

Figure 8.3: Failure-finding intervals, availability and reliability
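A table of this kind follows directly from the linear formula, because the failure-finding interval expressed as a percentage of the MTBF is simply twice the unavailability expressed as a percentage. The sketch below regenerates such a table; the function name and the list of availabilities are illustrative choices.

```python
def interval_as_percent_of_mtbf(availability_percent: float) -> float:
    """Failure-finding interval as a percentage of the MTBF of the
    protective device, using FFI = 2 x unavailability x MTBF."""
    return 2.0 * (100.0 - availability_percent)


for availability in (99.99, 99.95, 99.9, 99.5, 99.0, 98.0, 95.0):
    interval = interval_as_percent_of_mtbf(availability)
    print(f"{availability:>6}% availability -> check at {interval:.2f}% of MTBF")
```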

Required availability
Having established the relationship between availability, reliability and failure-finding intervals, the next issue to consider is how we decide what availability we require. Part 6 of Chapter 5 explained that this can be done as follows:
1: first ask what probability the organisation is prepared to tolerate for the multiple failure which could occur if the hidden function was not working when called upon to do so
2: then determine the probability that the protected function will fail in the period under consideration
3: finally determine what availability the hidden function must achieve to reduce the probability of the multiple failure to the desired level.
To set the failure-finding interval, we also need to find out the mean time between failures of the hidden function. Once this has been done, we are in a position to look at Figure 8.3 and select the task interval which corresponds to the level of availability established in step 3. This process is illustrated in the following example.

Figure 8.4 summarises the duty/standby pump example in Chapter 5, where:
for step 1 above, the users decided that they wanted the probability of the multiple failure to be less than 1 in 1 000 in any one year
for step 2, the rate of unanticipated failures of the duty pump could be reduced to an average of 1 in 10 years
this meant that the unavailability of the stand-by pump must not exceed 1%, so the availability of this pump has to be 99% or better.

Figure 8.4: Desired availability of a protective device (duty pump MTBF = 10 years)

Figure 8.3 suggests that to achieve an availability of 99% for the stand-by pump, someone would need to carry out a failure-finding task (in other words, check that it is fully functional) at an interval of 2% of its mean time between failures. Records might show that the stand-by pump has a mean time between failures of 8 years (or about 400 weeks), so the failure-finding task interval should be:
2% of 400 weeks = 8 weeks.
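The three steps and the pump example can be strung together as follows. This is a sketch only: the function names are hypothetical, the probabilities are per year, and the interval uses the linear relationship between unavailability, interval and MTBF described earlier.

```python
def required_unavailability(tolerable_multiple_failure_prob: float,
                            protected_failure_prob: float) -> float:
    """Step 3: largest unavailability of the hidden function which keeps
    the multiple failure within the tolerable probability."""
    return tolerable_multiple_failure_prob / protected_failure_prob


def failure_finding_interval(unavailability: float, mtbf: float) -> float:
    """Linear relationship: FFI = 2 x unavailability x MTBF."""
    return 2.0 * unavailability * mtbf


# Step 1: tolerable probability of the multiple failure, 1 in 1 000 per year.
# Step 2: duty pump fails unexpectedly about 1 year in 10.
u = required_unavailability(1 / 1000, 1 / 10)
print(f"Unavailability of stand-by pump must not exceed {u:.1%}")

# Stand-by pump MTBF of about 400 weeks.
print(f"Check every {failure_finding_interval(u, 400):.0f} weeks")
```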

Rigorous Methods for Calculating FFI

,.,,.......... u ..... suggests that it is
intervals which incorporates all the varifor determining
ables considered so far. In fact this can be done
( l ) and above, explained in the
by defining a few
a probability of a multiple fai]ure of l in I
of l 000 000 years. Let us call
a mean time between
this MMF If this is so, then the probability
96 )
note on
in any one year is 1/MMF '
we have seen that if the demand rate of the
once in 200 years, this
of failure for the
protected function of I in 200 in any one year, or a mean time
protectedjimction of 200 years. Let call this
so the probability of failure
in any one year
will be
This also known as the denumd

1 80

Reliability-centred Maintenance
is the mean time between failures of the
the failure-finding task interval.
is the allowed

ruv..-f'n-,rt f"n,n

r>rr)TL>,"TH U'J


If we substitute the above ,.u",1--1'""'''""''-J..,.,. equation (1) becomes:

This C afl be


:::: (1/MTED) X
as follows:

.... 4

above states that :

FFI 2 x
So substituting
from equation 4 into equation 2


2 x

.... 2

interval to be detennined in
as follows:
If we apply this formula to the figures used in the duty/stand-by pump system
is 1 000 years,
is 8 years and MrEo is 1 0 years, so:
mentioned above,
2 months
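The single-step calculation can be sketched as a function of the three mean times involved: the MTBF of the protective device, the mean time between failures of the protected function (the demand rate) and the tolerable mean time between multiple failures. The parameter names below are assumptions chosen for readability, and all three must be in the same time unit.

```python
def rigorous_ffi(mtbf_protective: float,
                 mtbf_protected: float,
                 mean_time_between_multiple_failures: float) -> float:
    """FFI = 2 x (MTBF of protective device) x (MTBF of protected function)
    divided by the tolerable mean time between multiple failures."""
    return (2.0 * mtbf_protective * mtbf_protected
            / mean_time_between_multiple_failures)


# Duty/stand-by pump figures, all in years.
ffi_years = rigorous_ffi(8, 10, 1000)
print(f"FFI = {ffi_years:.2f} years = {ffi_years * 12:.1f} months")
```

The result is slightly under two months, which is consistent with the eight week interval obtained from the table-based approach.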
modes in a
Note that in this example, all the failure possibilities which could cause this device to fail have been grouped together as one failure mode ('stand-by pump fails'). The vast majority of protective devices can be treated in this way, because all the failure modes which could cause the device to cease to function are checked when the function of the device as a whole is checked.
However, it is sometimes appropriate to carry out a detailed FMEA of the device in order to identify individual failure modes which might on their own cause the device to be unable to provide the required protection. This is usually done in two sets of circumstances:
when some of the failure modes are known to be susceptible to proactive maintenance but others are neither predictable nor preventable. In these cases, the appropriate on-condition or scheduled restoration/discard task should be applied to the failure modes which qualify, and failure-finding tasks applied to the remainder of the failure modes
when the protective device is new and the only failure data which is available (from data banks, for instance) applies to components of the device but not to the device as a whole. In these cases, the equation above can be modified to accommodate the MTBF of each component of the device.
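One common way to combine component data, shown here as a sketch rather than as the modification the text has in mind, is to assume that the failure of any one component disables the device and that each component fails at random with a constant failure rate. Under those assumptions the component failure rates (1/MTBF) add, and the device MTBF is the reciprocal of the sum. The component figures below are hypothetical.

```python
def device_mtbf(component_mtbfs):
    """MTBF of a device whose components are all needed for it to work
    (a series arrangement), assuming constant failure rates."""
    total_failure_rate = sum(1.0 / m for m in component_mtbfs)
    return 1.0 / total_failure_rate


# Hypothetical component MTBFs in years: sensor, relay, actuator.
print(device_mtbf([20, 30, 60]))
```

The combined MTBF is always shorter than that of the least reliable component, so using only the weakest component's MTBF in the interval calculation would overstate the interval.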
When the failure-finding task can cause the failure
A major practical problem which affects the whole question of failure-finding is that the task itself may cause the failure which it is supposed to detect. This usually happens in one of two ways:
the task stresses the system in such a way that it eventually causes it to fail (as might be the case when a switch is tested, if the act of switching imposes stresses on the mechanism)
if the system needs to be disturbed to do the task, there is a chance that the person doing it will leave the system in a failed state.
In both cases, the device will be in a failed state from the moment the task is completed. If p is the probability that it will be left in such a state after a test, then p (as a decimal) will be its unavailability caused by the test. If the mean time between failures of the device from causes other than the test is also known, it can be shown that:
FFI = 2 x
x M TJ\Q - p)



( 1 - p) .can be
In this
If the act of switching is the only cause of failure, and if the failure conforms to an exponential survival distribution, the probability of a multiple failure is the demand rate multiplied by the probability that the switch fails on any one operation.
For example, if the demand rate is once in 40 years and the switch lasts an average of 600 000 operations, then the probability of a multiple failure is:
1 in (40 x 600 000) = 1 in 24 000 000 years.
This is so because if the failure is caused only by operating the switch, the act of operating the switch to check if it has failed will simultaneously:
enable you to find out if the last operation of the switch caused it to fail
stress the switch and so create the possibility that it will fail as a result of the check.

So under this set of circumstances (random failure caused only by operating the item), a failure-finding task which operates the item to check whether it has failed will have no effect at all on the probability of a multiple failure, regardless of how often the task is done. In these cases, the answer to the question 'is it practical to do the task at the required interval?' is 'no', because there is no suitable interval. So in this case, if the organisation wants the probability of a multiple failure of the switch described above to be less than 1 in 100 000 000 years, the only way they can achieve this is by reducing the demand rate on the switch, and/or by fitting either more switches or a more reliable switch.
All of this suggests that quoted failure rates for devices such as switches:
seldom indicate whether the failure under consideration is hidden or evident
do not indicate whether the underlying failure pattern is age-related, in which case some form of scheduled restoration or scheduled discard may be appropriate, or whether it is random
do not indicate whether the failure is caused by the operation of the switch. If this is so, then it is unlikely that a failure-finding task could be identified which reduces the probability of a multiple failure to the required level.
This suggests that as a general rule, important switches - such as circuit breakers - should not be treated as single failure modes. Rather, they should be subjected to a detailed FMEA, and the most appropriate maintenance policy selected for each failure mode.

Sources of Data for FFI Calculations

Most modern industrial undertakings possess several thousand protected systems, most of which incorporate hidden functions. The multiple failures associated with many of these systems will be serious enough to warrant one of the rigorous approaches to failure-finding.
If accurate data about the probability of failure of the protected function and the mean time between failures of the hidden function are available, the calculations can be done quite quickly. If this information is not available - and very often it is not - then it is necessary to estimate what these variables are likely to be in the context under consideration. In rare cases, it might be possible to obtain data from one of the following:


the manufacturers of the equipment
commercial data banks
other users of similar equipment.
More often, however, the estimates have to be based on the knowledge and experience of the people who know the most about the equipment. In many cases these are operators and maintenance craftsmen. (When using data from external sources, take special note of the operating context of the items for which the data was gathered compared to the context in which your equipment is operating.)
Once a failure-finding task frequency has been established and the tasks are being done on a regular basis, it becomes possible to verify the assumptions used to determine the frequency quite rapidly. However, this does require the keeping of absolutely meticulous records, not only about when each failure-finding task is done, but also about:
whether or not the hidden function is found to be functional each time the task is done
how often the protected function fails (this can often be inferred from the number of times the protected function makes use of the protective device - for instance, from the number of times a pressure relief valve actually has to relieve the pressure in the system).
On the basis of this information the actual mean time between failures can be calculated and, if necessary, the task frequency revised.
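As a rough illustration of how such records might be used, the sketch below estimates the MTBF of a hidden function from a list of failure-finding results, in the same way as the motorbike example (total time in service divided by the number of failures found). The record structure and field names are hypothetical, not a prescribed format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CheckRecord:
    """One failure-finding task carried out on one device."""
    years_since_last_check: float
    found_failed: bool


def estimate_mtbf(records: List[CheckRecord]) -> Optional[float]:
    """Total time covered by the checks divided by the failures found.
    Returns None if no failures have been found yet."""
    total_years = sum(r.years_since_last_check for r in records)
    failures = sum(1 for r in records if r.found_failed)
    if failures == 0:
        return None
    return total_years / failures


# Forty annual checks across a fleet, four of which found a failed device.
history = [CheckRecord(1.0, i < 4) for i in range(40)]
print(estimate_mtbf(history))
```

With only a handful of failures the estimate is uncertain, so it is best treated as a check on the original assumption rather than as a precise figure.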
Failure modes where the MTBF and/or the associated failure patterns are completely unknown - and a satisfactory guess cannot be made - should be put into an age-exploration program right away to establish the true picture. If the situation is such that the uncertainty cannot be tolerated while the data are being gathered - in other words, if the consequences of guessing wrong are simply too serious for the organisation (or in some cases, society as a whole) to accept - then every effort should be made to change the consequences. This in turn will nearly always necessitate some form of redesign.

An Informal Approach to Setting Failure-finding Intervals

Not every hidden function is important enough to warrant the time and effort needed to do a full rigorous analysis. This applies mainly to multiple failures which do not affect safety or the environment. It could also apply to multiple failures which could affect safety but where the protected function is inherently very reliable and the hazard is marginal.
In these cases, it may be sufficient to take a general view of the entire operating context, and go directly to a decision on a desired level of availability for the hidden function. This decision is then used in conjunction with the MTBF of the hidden failure to set a task interval using the table in Figure 8.3. (Some organisations even go so far as to use an availability of 95% for all hidden functions where the associated multiple failure cannot affect safety or the environment. However, general policies of this nature can be dangerous, so they should only be used by people who have extensive experience with this type of analysis.)
If records about hidden failures are not available - and they seldom will be - it will be necessary to guess at the MTBFs to begin with. But these records should be compiled as quickly as possible to validate the initial estimates.
Other Methods of Calculating Failure-finding Intervals
The range of techniques for calculating failure-finding intervals described so far in this chapter is by no means exhaustive. Many additional variants have been developed by the Aladon network of RCM practitioners. These include formulae for:
multiple, independent, fully redundant protective devices
failure-finding intervals for cases where the multiple failures do not affect safety or the environment.
As this book is only intended to provide an introduction to this subject, these formulae are not included in this chapter.

The Practicality of Task Intervals

The methods described so far for calculating failure-finding intervals sometimes yield very short or very long intervals. In some cases these may be too long or too short to be practical, as follows:
a very short failure-finding task interval has two main implications:
- sometimes the interval is simply far too short to be practical. Examples would be failure-finding tasks which require major plant to be shut down every few days
- the task could cause habituation (which might happen if a fire alarm is tested too often).
In these cases, the proposed task is rejected and we move on to the next stage of the RCM decision-making process, as discussed later.
in rare cases, task intervals emerge which are longer than the demand rate ($M_{TED}$). It makes little sense to carry out a failure-finding task at intervals (FFI) which are longer than the demand rate itself ($M_{TED}$), so in these cases, the answer to the question 'is it practical to do the task at the required interval?' will be 'no'.
task is not done on
bear in mind that if a
is more than 4 or 5 times
can be shown that:
,Hr,_U-U.... ...,, ..., .. ..,.

If this value of
will almost
inadequate and
discussed in the next chapter.

8.4 The Technical Feasibility of Failure-finding

The issues discussed in Parts 2 and 3 of this chapter mean that for a failure-finding task to be technically feasible, it should be possible to do the task, it should be possible to do it without increasing the risk of the multiple failure, and it should be practical to do the task at the required interval.

Failure-finding is technically feasible if:

it is possible to do the task

the task does not increase the risk of a multiple failure
it is practical to do the task at the required interval.
The objective of a failure-finding task is to reduce the probability of the multiple failure associated with the hidden function to a tolerable level. It is only worth doing if it achieves this objective.

Failure-finding is worth doing if it reduces the probability

of the associated multiple failure to a tolerable level



Failure-finding is a Default Action!

Bear in mind that successful proactive maintenance prevents things from failing, whereas failure-finding accepts that they will spend some time - albeit not very much - in a failed state. This means that proactive maintenance is inherently more conservative (in other words, safer) than failure-finding, so the latter should only be specified if a more effective proactive task cannot be found. For this reason, it is wise to avoid RCM decision diagrams which place failure-finding ahead of proactive maintenance in the task selection process.
What if Failure-finding is Not Suitable?
If it is found that a failure-finding task is not technically feasible or worth doing, we have exhausted all the possibilities which might enable us to extract the desired performance from the existing asset. Where this leaves us is again governed by the consequences of the multiple failure, as follows:
if a suitable failure-finding task cannot be found and the multiple failure could affect safety or the environment, something must be changed in order to make the situation safe. In other words, redesign is compulsory
if a failure-finding task cannot be found and the multiple failure does not affect safety or the environment, then it is acceptable to take no action, but redesign may be desirable if the multiple failure has very expensive consequences.
This is summarised in Figure 8.5, which is a fuller description of this aspect of the process than the two boxes at the foot of the left hand column of the decision diagram.

Figure 8.5: the decision process

Other Default Actions

Three default actions are shown at the foot of Figure 8.1. The first of these - failure-finding - was covered in Chapter 8. This chapter focuses on no scheduled maintenance and redesign. It also considers the role of walk-around checks.

9.1 No Scheduled Maintenance

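We have seen that failure-finding is the initial default action if a suitable proactive task cannot be found for a hidden function. If a suitable failure-finding task cannot be found in turn, then redesign is the compulsory default action if the multiple failure has safety or environmental consequences. We have also seen that if an evident failure has safety or environmental consequences and a suitable proactive task cannot be found, something must be changed to make the situation safe.
However, if the failure is evident and it does not affect safety or the environment, or if it is hidden and the multiple failure does not affect safety or the environment, then the initial default decision is to do no scheduled maintenance. In these cases, the items are left in service until a functional failure occurs, at which point they are repaired or replaced. In other words, 'no scheduled maintenance' is only valid if:
a suitable scheduled task cannot be found for a hidden function, and the associated multiple failure does not have safety or environmental consequences
a cost-effective proactive task cannot be found for failures which have operational or non-operational consequences.
Note that if a suitable proactive task cannot be found for a failure under either of these circumstances, it simply means that we do not carry out scheduled maintenance on that component in its present form. It does not mean that we simply forget about it. As we see in the next section of this chapter, there may be circumstances under which it is worth changing the design of the component to reduce overall costs.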


9.2 Redesign
The question of equipment design has arisen again and again as we have considered the steps which must be followed to develop a successful maintenance program. In this part of this chapter, we consider general issues which affect the relationship between design and maintenance, and then consider the part played by redesign in the task selection process.
The term 'redesign' is used in its broadest sense in this chapter. Firstly, it refers to any change to the specification of any item of equipment. This means any action which should result in a change to a drawing or a parts list. It includes changing the specification of a component, adding a new item, replacing an entire machine with one of a different make or type, or relocating a machine. It also means any other once-off change to a process or procedure which affects the operation of the plant. It even covers training in a method of dealing with a specific failure mode (which can be seen as a way of 'redesigning' the capability of the person being trained).


Design and Maintenance

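Modifications are expensive. They involve the cost of developing the new idea, the cost of turning the idea into reality (making a new part, buying a new machine, compiling a new training program) and the cost of implementing the change (installing the part, conducting the training program). Further indirect costs are incurred if equipment or people have to be taken out of service while the change is being implemented. There is also the risk that a modification will fail to eliminate or even alleviate the problem it is meant to solve. In some cases, it may even create more problems.
As a result, the whole question of modifications should be approached with great caution. Two issues need particular attention:
what do we consider first - design or maintenance?
the relationship between inherent reliability and desired performance.
Which comes first - redesign or maintenance?
Reliability, design and maintenance are inextricably linked. This can lead to a temptation to start reviewing the design of existing equipment before considering its maintenance requirements. In fact, the RCM process considers maintenance first for two reasons.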

Most modifications take from six months to three years from conception to commissioning, depending on the cost and complexity of the new design. On the other hand, the maintenance person who is on duty today has to maintain the equipment as it exists today, not what should be there or what might be there some time in the future. So today's realities must be dealt with before tomorrow's design changes.
Secondly, most organisations are faced with many more apparently desirable design improvement opportunities than are physically or economically feasible. By focusing on failure consequences, RCM does much to help us to develop a rational set of priorities for these projects, especially because it separates those which are essential from those that are merely desirable. Clearly, such priorities can only be established after the review has been carried out.
Inherent reliability vs desired performance
Among other things, Part 2 of Chapter 2 stressed that the inherent reliability of any asset is established by its design and by how it is made, and that maintenance cannot yield reliability beyond that inherent in the design. This led to two important conclusions.
Firstly, if the inherent reliability or built-in capability of an asset is greater than the desired performance, maintenance can help to achieve the desired performance. Most equipment is adequately specified, designed and built, so it is usually possible to develop a satisfactory maintenance program, as described in previous chapters. In other words, in most cases, RCM helps us to extract the desired performance from the asset as it is currently configured.
On the other hand, if desired performance exceeds inherent reliability, then no amount of maintenance can deliver the desired performance. In these cases 'better' maintenance cannot solve the problem, so we need to look beyond maintenance for the solutions. Options include:
modifying the equipment
changing operating procedures
lowering our expectations and deciding to live with the problem.
This reminds us that maintenance is not always the answer to chronic reliability problems. It also reminds us that we must establish as soon and as precisely as possible what we want each piece of equipment to do in its operating context before we can start talking about the appropriateness of its design or its maintenance requirements.

Redesign as the Default Action

Figure 8.1 shows that redesign appears at the bottom of all four columns. In the case of failures which have safety or environmental consequences, it is the compulsory default action, and in the other three cases, it 'may be desirable'. In this part of this chapter, we consider each case in more detail, starting with the safety case.

Safety or environmental consequences
If a failure could affect safety or the environment and no proactive task can be found which reduces the risk of the failure to a tolerable level, something must be changed, simply because we are dealing with a safety or environmental hazard which cannot be adequately anticipated or prevented. In these cases, redesign is usually undertaken with one of two objectives:
to reduce the probability of the failure mode occurring to a level which is tolerable. This is usually done by replacing the affected component with one which is stronger or more reliable
to change the item or the process in such a way that the failure no longer has safety or environmental consequences. This is most often done by installing one or more protective devices, which are designed:
- to alert operators to abnormal conditions
- to shut down the equipment in the event of a failure
- to eliminate or relieve abnormal conditions which follow a failure and which might otherwise cause more serious damage
- to take over from a function which has failed
- to prevent dangerous situations from arising.
Remember that if such a device is added, its maintenance requirements must also be analysed.
Safety or environmental consequences can also be reduced by eliminating hazardous materials from a process, or even by abandoning a dangerous process altogether.

As mentioned earlier, RCM does not raise the question of economics in
these cases. If the level of risk associated with any failure is regarded as
unacceptable, we are obliged either to prevent the failure or to make the
process safe. The alternative is to accept conditions that are known to be
unsafe or environmentally unsound.




Hidden failures
In the case of hidden failures, the risk of a multiple failure can be reduced
by modifying the equipment in one of four ways:
* make the hidden function evident by adding another device: Certain
hidden functions can be made evident by adding another device which
draws the attention of the operator to the failure of the hidden function.
For example, a battery used to power a smoke detector is a classical hidden
function if no additional protection is provided. However, a light is fitted
to most such detectors in such a way that the light goes out if the battery
fails. In this way the additional function makes the failure of the battery
evident. (Note that the light only tells us about the condition of the
battery, not about the ability of the detector to detect smoke.)

Special care is needed in this area, because extra functions installed for
this purpose also tend to be hidden. If too many layers of protection are
added, it becomes increasingly difficult, if not impossible, to define
sensible failure-finding tasks. A much more effective approach is to
substitute an evident function for the hidden function, as explained in
the next paragraph.
* substitute an evident function for the hidden function: In most cases
this means substituting a genuinely fail-safe device for one which is not
fail-safe. This is often difficult to do in practice, but if it is done, the
need for a failure-finding task falls away at once.
For example, one widely used way to warn the driver of a vehicle that his
brake lights have failed is to install a warning light which is switched on if the
brake lights fail. (In many cases, this light is also switched on for a short while
when the ignition is switched on. However, so are all the other warning lights
on the dashboard. Under these circumstances one failed light is easily
overlooked, so its function is effectively hidden.)
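The system might also be configured in such a way that the warning light can
only be tested by disabling a brake light and checking whether the warning
light is switched on. This is an invasive task which is likely to be dismissed
as causing more problems than it solves. However, multiple failures associated
with this system could have safety consequences, so it is necessary to
reconsider the design.
One way to eliminate this problem is to make the function of the brake lights
and of the warning system evident. This can be done by substituting fibre-optic
cables for the warning light, and routing the cables so that the driver looks
through them at the brake lights every time he uses the brakes.
In this situation, a pinprick of light at the end of each cable would no longer
be visible to the driver if either a brake light or a cable fails. In other words,
the function of the protective device is now evident, so no failure-finding
task is needed.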

* substitute a more reliable (but still hidden) device for the existing hidden
function: Figure 8.3 suggests that a more reliable hidden function
(in other words, one which has a higher mean time between failures)
will enable the organisation to achieve one of three objectives:
- to reduce the probability of the multiple failure without changing the
failure-finding task intervals. This increases the level of protection
- to increase the interval between tasks without changing the probability
of the multiple failure. This reduces resource requirements
- to reduce the probability of the multiple failure and increase the task
intervals, giving increased protection with less effort.
* duplicate the hidden function: If it is not possible to find a single
protective device which has a high enough MTBF to give the desired level
of protection, it is still possible to achieve any of the above three
objectives by duplicating (or even triplicating) the hidden function.
Let us return to the example of a duty pump with a stand-by. It was explained
on page 179 that if the users want the probability of a multiple failure to be less
than 1 in 1 000, and the unanticipated failure rate of the duty pump is reduced
to 1 in 10 years, then the availability of the stand-by pump has to be 99% or
better. This led to the conclusion that a failure-finding task should be done on
the stand-by pump every 2 months in order to achieve an availability of 99%
(based on an MTBF for this pump of 8 years).
However, now let us assume that someone has decided that the probability
of a multiple failure in this system should not exceed 1 in 100 000 (or $10^{-5}$),
rather than 1 in 1 000. If the mean time between unanticipated failures of the
duty pump ($M_{TED}$) is unchanged at 10 years, applying formula 4 in Chapter 8
shows that the unavailability ($U_{TIVE}$) of the stand-by pump should not exceed:
$U_{TIVE} = M_{TED} / M_{MF} = 10 / 100\,000 = 10^{-4}$
So the unavailability of the stand-by pump must now not exceed $10^{-4}$ (0.01%).
If the MTBF of the stand-by pump is unchanged at 8 years, applying formula 2
from Chapter 8 yields the following failure-finding interval:
$2 \times 10^{-4} \times 8$ years $\approx$ 14 hours
Activating a stand-by pump this often is plainly impractical, so more thought has
to be given to the design of this system.
In fact, Figure 9.1 opposite shows that if we were to add a second stand-by
pump, and ensure that the availability of each stand-by pump on its own exceeds
99% (corresponding to an unavailability of 1%, or $10^{-2}$), the probability of the
multiple failure would be:
$10^{-1} \times 10^{-2} \times 10^{-2} = 10^{-5}$
or 1 in 100 000. Figure 8.3 suggests that this can be achieved by doing a
failure-finding task on each stand-by pump at the original frequency of once every 8
weeks. In other words, a much higher level of protection is achieved without
changing the task interval.
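
The arithmetic in this example can be checked with a short sketch. It assumes that formula 4 has the form M_MF = M_TED / U_TIVE and that formula 2 has the form interval = 2 x U_TIVE x M_TIVE, which is what the worked figures above imply. The function names and the 8 760 hours-per-year constant are illustrative assumptions, not part of the original text.

```python
import math

HOURS_PER_YEAR = 8760  # assumption: 365 days x 24 hours


def max_unavailability(m_ted_years, m_mf_years):
    """Largest tolerable unavailability of the protective device.

    Assumed form of formula 4, rearranged: U_TIVE = M_TED / M_MF.
    """
    return m_ted_years / m_mf_years


def failure_finding_interval(u_tive, m_tive_years):
    """Failure-finding interval in years.

    Assumed form of formula 2: interval = 2 x U_TIVE x M_TIVE.
    """
    return 2 * u_tive * m_tive_years


def multiple_failure_probability(m_ted_years, *unavailabilities):
    """Probability of a multiple failure in any one year.

    The protected function fails at a rate of 1 / M_TED per year, and every
    protective device must be unavailable at that moment.
    """
    probability = 1 / m_ted_years
    for u in unavailabilities:
        probability *= u
    return probability


# Single stand-by pump, target of 1 multiple failure in 100 000 years
u_single = max_unavailability(10, 100_000)
interval_hours = failure_finding_interval(u_single, 8) * HOURS_PER_YEAR

# Two stand-by pumps, each 99% available (1% unavailable)
p_duplicated = multiple_failure_probability(10, 0.01, 0.01)
interval_weeks = failure_finding_interval(0.01, 8) * 52
```

Under these assumptions `interval_hours` comes to about 14 hours for the single stand-by pump, while the duplicated arrangement reaches the same target with an interval of roughly 8 weeks per pump.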



Figure 9.1: The effect of duplicating a hidden function
(Duty pump: unanticipated failure rate unchanged at 1 in 10 years; stand-by pump 1; stand-by pump 2)
Probability of a multiple failure in any 1 year:
1 in 10 x 1 in 100 x 1 in 100 = 1 in 100 000

Operational and non-operational consequences

If a technically feasible preventive task cannot be found which is worth
doing for failures with operational or non-operational consequences, the
immediate default decision is no scheduled maintenance. However, it
may still be desirable to modify the equipment to reduce total costs. To
achieve this, the plant could be modified to:
* reduce the number of times the failure occurs, or possibly eliminate it
altogether, by making the component stronger or more reliable
* reduce or eliminate the consequences of the failure
* make a preventive task cost-effective (for instance, by making a
component more accessible).
Note that in this case the failure consequences are purely economic, so
modifications must be cost-justified, whereas they were the compulsory
default action if the failure had safety or environmental consequences.
There is no one way to determine whether a modification will be
cost-effective. Each case is governed by a different set of variables,
which include a before-and-after assessment of maintenance and operating
costs, the remaining technologically useful life of the asset, the likelihood
that the modification will work, the number of other projects competing for
the capital resources of the company and so on.
A detailed cost-benefit study which takes all these factors into account
can be very time-consuming, so it is helpful to know beforehand whether
this effort is likely to be worthwhile. To help make a preliminary
assessment, Nowlan & Heap developed the decision diagram shown in
Figure 9.2.


Figure 9.2: Decision diagram for a preliminary look at a proposed modification

No matter how reliable, all assets are eventually superseded by new
technology. So the first question to ask is whether the asset under
consideration is likely to remain in service for long. If it is about to be
superseded, it is not worth modifying it. On the other hand, if it is to be
around for a while, the modification has a chance to pay for itself. This is
why the first question in Figure 9.2 asks:
Is the remaining technologically useful life of the equipment high?
Some organisations demand that modifications should pay for themselves
within a specified period - say, two years. This effectively sets the
operating horizon of the equipment at two years. This type of policy
reduces the number of projects initiated on the basis of projected
cost-benefits and ensures that only those which will pay for themselves
quickly are submitted for approval. So if the answer to the first question
in Figure 9.2 is 'no', redesign is probably not justified.


For example, Figure 9.3 shows a stainless steel hopper which is periodically
blocked by lumps. So far, the RCM process has revealed that this failure mode
costs 400 each time it occurs, and that it cannot be prevented by maintenance.
It has been suggested that one way to eliminate the failure mode might be to
install a stainless steel grid above the hopper outlet at a cost of 6 000.
If the hopper were due to be superseded within two years, it is unlikely that
this modification would be justified in view of the fact that several months
would elapse before it could be commissioned. On the other hand, if the hopper
were to remain in service for several more years, the modification would be
worth further consideration.


Figure 9.3: A stainless steel hopper
(The problem: lumps which block the outlet. Proposed modification: a stainless steel grid.)

If the answer to the first question is 'yes', the next question to consider
is whether the failure rate is high.
This question eliminates items which fail so seldom that the cost of
redesign would probably be greater than the benefits to be derived from it.
One reason for a low failure rate is of course a preventive task.
This is why a 'no' answer to this question does not immediately abort the
modification - the maintenance task itself might be so costly that the
modification is still justified.
For example, if the blockage in the hopper occurred only very seldom,
no-one would pay much attention to it. If it occurred frequently, it would
deserve further consideration.

If the failure rate is high, the next question concerns the consequences
of the failure:
Does the failure involve major operational consequences?
If the answer is yes, then the modification should be taken further.
A 'no' to this question means that the failure only has a minor effect on
operating costs, but we must still consider the maintenance costs associated
with the failure:
Is the cost of maintenance high?
Note that this question is reached from two directions. As mentioned above,
a 'no' answer to the failure rate question may reflect a very costly
preventive task.



A 'no' answer to the question of operational consequences means that
failures might not be affecting operating capability, but they may result
in excessive repair costs. So a 'yes' answer to either of these two
questions brings us to the design change itself:
Are there specific costs which might be eliminated by the design change?
This question refers to the operational consequences and the direct costs
of proactive and/or corrective maintenance. However, if these costs are
not related to a specific design feature, it is unlikely that the problem will
be solved by a design change. So a 'no' answer to this question means that
it may be necessary to live with the economic consequences of the failure.
On the other hand, if the problem can be pinned down to a specific cost
element, then the economic potential of redesign is high.
In the case of the hopper, it is hoped that the grid would prevent the lumps from
reaching the hopper outlet, and so eliminate the cost of 400 per blockage.

But will the new design work? In other words:

Is there a high probability, with existing technology, that the modification
will be successful?
Although a particular design change might be very desirable economically,
there is a chance that it will not have the desired effect. A change
directed at one failure mode may reveal other failure modes, requiring
several attempts to solve the problem. Any design change which entails
adding hardware also adds more failure possibilities - maybe too many.
So if a cold-blooded assessment of the proposed change indicates a low
probability of success, the modification is unlikely to be economically viable.
For instance, in the case of the hopper we would need to be sure that lumps would
not simply accumulate on the grid and coagulate into a possibly much more costly
problem in the long term.

Any proposed modification which makes it this far deserves a detailed
cost-benefit study:
Does an economic trade-off study show an expected cost saving?
Such a study compares the expected reduction in costs over the remaining
useful life of the equipment with the costs of carrying out the modification.
To be on the safe side, the expected benefit should be regarded as the
projected saving if the first attempt at improvement is successful,
multiplied by the probability of success at the first try. Alternatively it
may be considered that the design change will always be successful, but only
some of the savings will be achieved.
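
The sequence of screening questions described above can be sketched as a small function. This is only an illustration of the logic as the text describes it. The class, field and function names are hypothetical, and the final trade-off study is reduced to the expected benefit rule given in the last paragraph.

```python
from dataclasses import dataclass

NOT_JUSTIFIED = "redesign is not justified"
STUDY = "carry out a detailed cost-benefit study"


@dataclass
class Proposal:
    """Answers to the preliminary screening questions (hypothetical names)."""
    useful_life_high: bool
    failure_rate_high: bool
    major_operational_consequences: bool
    maintenance_cost_high: bool
    specific_costs_can_be_eliminated: bool
    high_probability_of_success: bool


def preliminary_screen(p: Proposal) -> str:
    """Apply the screening questions in the order given in the text."""
    if not p.useful_life_high:
        return NOT_JUSTIFIED
    # The maintenance cost question is reached from two directions:
    # a low failure rate, or a high failure rate with minor consequences.
    reaches_design_question = p.maintenance_cost_high or (
        p.failure_rate_high and p.major_operational_consequences
    )
    if not reaches_design_question:
        return NOT_JUSTIFIED
    if not p.specific_costs_can_be_eliminated:
        return NOT_JUSTIFIED
    if not p.high_probability_of_success:
        return NOT_JUSTIFIED
    return STUDY


def expected_benefit(projected_saving: float, probability_of_success: float) -> float:
    """Projected saving multiplied by the probability of first-time success."""
    return projected_saving * probability_of_success


# Hypothetical answers for the hopper, assuming it stays in service for years
hopper = Proposal(True, True, True, False, True, True)
hopper_result = preliminary_screen(hopper)
```

A proposal which survives the screen still has to show an expected cost saving in the trade-off study itself. The screen only decides whether that study is worth doing.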


If we are certain that the modification to the hopper will work, a discounted cash
flow analysis on the figures provided for the hopper (at a discount rate of ...)
shows that the modification will pay for itself
* in five years if the blockage occurs four times per year,
* in seven years if it occurs three times per year and
* in more than ten years if it occurs twice per year.
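
A discounted payback calculation of this kind can be sketched as follows. The discount rate is not legible in the text above, so the 10% used here is purely an illustrative assumption, as is the convention that each year's saving arrives at the end of the year. The function name is hypothetical, and different rates or timing conventions will shift the payback year.

```python
def discounted_payback_years(capital_cost, annual_saving, discount_rate, max_years=50):
    """Return the first year in which cumulative discounted savings cover the cost.

    Savings are assumed to arrive at the end of each year. Returns None if the
    cost is not recovered within max_years.
    """
    cumulative = 0.0
    for year in range(1, max_years + 1):
        cumulative += annual_saving / (1 + discount_rate) ** year
        if cumulative >= capital_cost:
            return year
    return None


ASSUMED_RATE = 0.10      # assumption: the rate used in the source is not legible
GRID_COST = 6000         # cost of the proposed grid
COST_PER_BLOCKAGE = 400  # cost of each blockage

four_per_year = discounted_payback_years(GRID_COST, 4 * COST_PER_BLOCKAGE, ASSUMED_RATE)
two_per_year = discounted_payback_years(GRID_COST, 2 * COST_PER_BLOCKAGE, ASSUMED_RATE)
```

With the assumed 10% rate, four blockages per year give a payback in year five and two per year take well over ten years, which is in line with the pattern described above.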

This type of justification is not necessary, of course, if the reliability
characteristics of an item are the subject of contractual warranties or if the
changes are needed for reasons other than cost (such as safety).

9.3 Walk-around Checks

Walk-around checks serve two purposes. The first is to spot accidental
damage. These checks may include a few specific on-condition tasks for
the sake of convenience, but accidental damage can occur at any time and
is not related to any definable level of failure resistance.
As a result, there is no basis for defining an explicit potential failure
condition or a predictable P-F interval. Similarly, the checks are not based
on the failure characteristics of any particular item, but are intended to
spot unforeseen exceptions in failure behaviour.
Walk-around checks are also meant to spot problems due to ignorance
or negligence, such as hazardous materials or foreign objects left lying
around, spillage, and other items of a housekeeping nature. They also give
managers an opportunity to ensure that general standards of maintenance
are satisfactory, and can be used to check whether maintenance routines
are being done correctly. Again, there are rarely any explicit potential
failure conditions and no predictable P-F interval.
Some organisations distinguish between formal scheduled tasks and
walk-around checks on the pretext that one is mainly technical and the
other predominantly managerial, so they are sometimes done by different
people. In fact it does not matter who does them, as long as both are done
frequently and thoroughly enough to ensure a reasonable degree of
protection from the consequences of the failures concerned.


The RCM Decision Diagram

10.1 Integrating Consequences and Tasks

Chapters 5 to 9 have provided a detailed explanation of the criteria used
to answer the last three of the seven questions which make up the RCM
process. These questions are:
* in what way does each failure matter?
* what can be done to prevent each failure?
* what should be done if a suitable preventive task cannot be found?

This chapter summarises the most important of these criteria. It also
describes the RCM Decision Diagram, which integrates all the decision
processes into a single strategic framework. This framework is shown in
Figure 10.1 overleaf, and is applied to each of the failure modes listed on
the RCM Information Worksheet.
Finally, this chapter describes the RCM Decision Worksheet, which is
the second of the two key working documents used in the application of
RCM (the Information Worksheet shown in Figure 4.13 being the first).

10.2 The RCM Decision Process

The RCM Decision Worksheet is illustrated in Figure 10.2 opposite. The
rest of this chapter demonstrates how this worksheet is used to record the
answers to the questions in the Decision Diagram, and in the light of these
answers, to record:
* what routine maintenance (if any) is to be done, how often it is to be
done and by whom
* which failures are serious enough to warrant redesign
* cases where a deliberate decision has been made to let failures happen.





Figure 10.2: The RCM Decision Worksheet

Figure 10.1: The RCM Decision Diagram (1991 Aladon Ltd)
For each category of consequences, the diagram asks in turn:
* Is a task to detect whether the failure is occurring or about to occur
technically feasible and worth doing? If yes: scheduled on-condition task
* Is a scheduled restoration task to reduce the failure rate (or to avoid
failures) technically feasible and worth doing? If yes: scheduled
restoration task
* Is a scheduled discard task to reduce the failure rate (or to avoid
failures) technically feasible and worth doing? If yes: scheduled discard task
The default questions then follow:
* Is a failure-finding task to detect the failure technically feasible and
worth doing?
* Could the multiple failure affect safety or the environment?
* Is a combination of tasks to avoid failures technically feasible and
worth doing?
The default outcomes are 'redesign is compulsory', or 'no scheduled
maintenance' with 'redesign may be desirable'.

The decision worksheet is divided into sixteen columns. The columns
headed F, FF and FM identify the failure mode under consideration. They
are used to cross-refer the information and decision worksheets, as shown
in Figure 10.3 below:

Figure 10.3: Cross-referring the information and decision worksheets
(worksheet example: Cooling Water Pumping System; 1990 Aladon Ltd)

The headings on the next ten columns refer to the questions on the RCM
Decision Diagram in Figure 10.1, as follows:
* the columns headed H, S, E, O and N are used to record the answers to
the questions concerning the consequences of each failure mode
* the next three columns (headed H1, H2, H3 etc) record whether a
proactive task has been selected, and if so, what type of task
* if it becomes necessary to answer any of the default questions, the
columns headed H4 and H5, or S4 are used to record the answers.
The last three columns record the task which has been selected (if any),
the frequency with which it is to be done and who has been selected to do
it. The 'proposed task' column is also used to record the cases where
redesign is required or it has been decided that the failure mode does not
need scheduled maintenance.
In the following paragraphs, each of these four sections of the decision
worksheet is reviewed in the context of the associated questions on the
Decision Diagram.



Failure Consequences
The meanings of questions H, S, E and O in Figure 10.1 are discussed
at length in Chapter 5. These questions are asked for each failure
mode, and the answers recorded on the decision worksheet on the basis
shown in Figure 10.4 below.

Figure 10.4: Using the decision worksheet to record failure consequences
* question H answered 'no': write the letter N in column H and go to question H1
* question S answered 'yes': write the letter Y in column S and go to question S1
* question E answered 'yes': write the letter Y in column E and go to question S1
* question O answered 'yes': write the letter Y in column O and go to question O1
* question O answered 'no': write the letter N in column O and go to question N1
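
The recording rules in Figure 10.4 can be expressed as a short sketch. The function and argument names are hypothetical. The intermediate entries (a Y in column H and an N in columns S and E when the analysis moves on to the next question) are an assumption inferred from the worked worksheet rows later in this chapter.

```python
def record_consequences(evident, affects_safety, affects_environment, affects_operations):
    """Return (worksheet columns, next question) for one failure mode.

    Each failure mode is classified under one category of consequences only,
    so the function stops at the first category which applies.
    """
    if not evident:
        # Hidden failure: only column H is filled in
        return {"H": "N"}, "H1"
    if affects_safety:
        return {"H": "Y", "S": "Y"}, "S1"
    if affects_environment:
        # Environmental consequences follow the same branch as safety
        return {"H": "Y", "S": "N", "E": "Y"}, "S1"
    if affects_operations:
        return {"H": "Y", "S": "N", "E": "N", "O": "Y"}, "O1"
    return {"H": "Y", "S": "N", "E": "N", "O": "N"}, "N1"


# An evident failure which only affects operations
columns, next_question = record_consequences(True, False, False, True)
```

Because the function returns as soon as a category applies, a failure mode recorded with a Y in column E never receives an entry in column O.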

Figure 10.5 shows how the answers to these questions are recorded on the
decision worksheet. Note that:
* each failure mode is dealt with in terms of one category of consequences
only. So if it is classified as having environmental consequences,
we do not also evaluate its operational consequences (at least when
performing the first analysis of any asset). This means that if for instance
a 'Y' is recorded in column E, nothing is recorded in column O.
* once the consequences of the failure mode have been categorised, the
next step is to seek a suitable preventive task. Figure 10.5 also
summarises the criteria used to decide whether such tasks are worth doing.

Figure 10.5: Failure consequences - a summary
* A hidden failure: To be worth doing, any preventive task must reduce the
risk of a multiple failure to an acceptable level
* Safety consequences: To be worth doing, any preventive task must reduce
the risk of this failure on its own to an acceptable level
* Environmental consequences: To be worth doing, any preventive task must
reduce the risk of this failure on its own to an acceptable level
* Operational consequences: To be worth doing, over a period of time any
preventive task must cost less than the cost of the operational
consequences plus the cost of repair of the failure which it is meant to
prevent
* Non-operational consequences: To be worth doing, over a period of time
any preventive task must cost less than the cost of repairing the failure
which it is meant to prevent

Proactive Tasks
The eighth to tenth columns on the decision worksheet are used to record
whether a proactive task has been selected, as follows:
* the column headed H1/S1/O1/N1 is used to record whether a suitable
on-condition task could be found to anticipate the failure mode in time
to avoid the consequences
* the column headed H2/S2/O2/N2 is used to record whether a suitable
scheduled restoration task could be found
* the column headed H3/S3/O3/N3 is used to record whether a suitable
scheduled discard task could be found.



In each case, a task is only suitable if it is worth doing and technically
feasible. Chapters 6 and 7 explained in detail how to establish whether a
task is technically feasible. These criteria are summarised in Figure 10.6.
In essence, for a task to be technically feasible and worth doing, it must
be possible to provide a positive answer to all of the questions shown in
Figure 10.6 which apply to that category of task, and the task must fulfil
the worth doing criteria in Figure 10.5. If the answer to any of these
questions is 'no' or unknown, then that task as a whole is rejected. If all
of the questions can be answered positively, then a Y is recorded in the
appropriate column.
Figure 10.6:

Is a task to detect whether a failure is occurring or about to occur
technically feasible?
Is there a clear potential failure condition? What is it? What is the
P-F interval? Is this interval long enough to be of any use? Is the
P-F interval reasonably consistent? Is it practical to monitor the item at
intervals less than the P-F interval?

Is a scheduled restoration task to reduce the failure rate (avoid all
failures in the case of safety or environmental consequences)
technically feasible?
Is there an age at which there is a rapid increase in the conditional
probability of failure? Do most of the items survive to this age (all in
the case of safety or environmental consequences)? Is it possible to
restore the original resistance to failure of the item?

Is a scheduled discard task to reduce the failure rate (avoid all failures
in the case of safety or environmental consequences) technically feasible?
Is there an age at which there is a rapid increase in the conditional
probability of failure? What is this age? Do most of the items survive to
this age (all in the case of safety or environmental consequences)?

If a task is selected, a description of the task and the frequency with which
it must be done are recorded as explained later in this chapter, and the
analysts move on to the next failure mode. However, as mentioned in
Chapter 7, bear in mind that if it seems that a lower order task might be more
cost-effective than a higher order task, then the lower order task should
also be considered and the more effective of the two chosen.
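
The way the three proactive task columns are filled in can be sketched as follows. This is a simplified illustration with hypothetical names: it takes the answers to the feasibility questions as a list in which None stands for 'unknown', and it stops at the first suitable task. It does not attempt the comparison between higher and lower order tasks mentioned in the paragraph above.

```python
TASK_ORDER = [
    ("1", "scheduled on-condition task"),
    ("2", "scheduled restoration task"),
    ("3", "scheduled discard task"),
]


def is_suitable(feasibility_answers, worth_doing):
    """A task is suitable only if every answer is yes and it is worth doing.

    An answer of False or None (unknown) rejects the task as a whole.
    """
    if not feasibility_answers:
        return False
    return worth_doing is True and all(a is True for a in feasibility_answers)


def select_proactive_task(prefix, assessments):
    """Fill in columns such as H1/H2/H3 for one failure mode.

    prefix is the consequence letter (H, S, O or N). assessments maps the
    column digit to a (feasibility_answers, worth_doing) pair. Returns the
    recorded columns and the selected task, or None if no task is suitable.
    """
    columns = {}
    for digit, task_name in TASK_ORDER:
        answers, worth_doing = assessments.get(digit, ([], False))
        if is_suitable(answers, worth_doing):
            columns[prefix + digit] = "Y"
            return columns, task_name
        columns[prefix + digit] = "N"
    return columns, None


# On-condition task rejected because one answer is unknown
example_columns, example_task = select_proactive_task(
    "O",
    {"1": ([True, True, None], True), "2": ([True, True, True], True)},
)
```

If all three columns end up as N, the analysis moves on to the default questions.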



The Default Questions

The columns headed H4, H5 and S4 on the decision worksheet are used
to record the answers to the three default questions. The basis on which
these questions are answered is summarised in Figure 10.7. (Note that the
default questions are only asked if the answers to the previous three
questions are all 'no'.)
Figure 10.7: The default questions

H4: Is a failure-finding task technically feasible and worth doing?
Record yes if it is possible to do the task and it is practical to do it at the required
frequency and it reduces the risk of the multiple failure to an acceptable level.

H5: Could the multiple failure affect safety or the environment?
(This question is only asked if the answer to question H4 is no.) If the answer
to this question is yes, redesign is compulsory. If the answer is no, the default
action is no scheduled maintenance, but redesign may be desirable.

S4: Is a combination of tasks technically feasible and worth doing?
Yes if a combination of any two or more preventive tasks will reduce the risk
of the failure to an acceptable level (this is very rare). If the answer is no,
redesign is compulsory.

Operational and non-operational consequences: In these two cases, the
consequences of the failure are purely economic and no suitable preventive
task has been found. As a result, the initial default decision is no scheduled
maintenance, but redesign may be desirable.

Legible example worksheet entries in the figure: 4 B 4 N; 4 C 2 N; 5 B 2 Y Y;
2 A 5 Y Y; 1 A 5 Y N N Y N N N; 1 B 3 Y N N N N N N
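
The default questions can be summarised in a short sketch. The function and argument names are hypothetical, and the assignment of the column headings H4, H5 and S4 to the three questions is inferred from the wording of Figure 10.7.

```python
REDESIGN_COMPULSORY = "redesign is compulsory"
NO_SCHEDULED_MAINTENANCE = "no scheduled maintenance (redesign may be desirable)"


def default_action(category, failure_finding_ok=False,
                   multiple_failure_affects_safety=False, combination_ok=False):
    """Return (worksheet columns, default action) once no proactive task is suitable.

    category is one of 'hidden', 'safety', 'environmental', 'operational'
    or 'non-operational'.
    """
    if category == "hidden":
        if failure_finding_ok:
            return {"H4": "Y"}, "failure-finding task"
        # H5 is only asked if the answer to H4 is no
        if multiple_failure_affects_safety:
            return {"H4": "N", "H5": "Y"}, REDESIGN_COMPULSORY
        return {"H4": "N", "H5": "N"}, NO_SCHEDULED_MAINTENANCE
    if category in ("safety", "environmental"):
        if combination_ok:
            return {"S4": "Y"}, "combination of tasks"
        return {"S4": "N"}, REDESIGN_COMPULSORY
    if category in ("operational", "non-operational"):
        # Purely economic consequences: no default question is asked
        return {}, NO_SCHEDULED_MAINTENANCE
    raise ValueError("unknown consequence category: " + category)


hidden_columns, hidden_action = default_action("hidden", failure_finding_ok=False,
                                               multiple_failure_affects_safety=True)
```

Only the hidden and the safety or environmental branches can end in compulsory redesign. The two economic branches always default to no scheduled maintenance.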

Proposed Task
If a proactive task or a failure-finding task has been selected during the
decision-making process, a description of the task should be recorded in
the column headed 'proposed task'. Ideally, the task should be described
as precisely on the decision worksheet as it will be on the document which
reaches the person doing the task. If this is not possible, then the task
should at least be described in enough detail to make the intent absolutely
clear to whoever writes up the detailed task description.



For example, consider a situation where an on-condition task has been selected
for a rolling element bearing. Chapter 7 explained how such bearings can suffer
from a variety of potential failure conditions, including noise, vibration, heat, wear
and so on. Many machines have more than one and often several such bearings.
So at the very least, the 'proposed task' should specify which bearing is to be
checked for what condition. In other words, if a particular bearing is to be
checked for noise, the proposed task should name that bearing and say that it
is to be checked for noise, and not just read 'check bearing'.

This issue is discussed in more detail in the next chapter.

If the decision is that redesign is required, then the 'proposed task' should
describe briefly what the design change is meant to achieve. The actual form
of the new design should be left to the designers.

For example, if the RCM process reveals (say) that the fastening mechanism of
a guard has to be redesigned for safety reasons, the 'proposed task' should state
something like 'more secure fastening mechanism required for guard'. On the
other hand, it should be up to the designers to decide exactly what sort of
mechanism will be used.

This issue is also discussed further in the next chapter.

Finally, if a decision has been taken to allow the failure to occur, in
most cases the words 'no scheduled maintenance' should be recorded in
the 'proposed task' column. The exception is a hidden failure where
'the risk/reliability profile is such that failure-finding is not required', as
explained on page 185.
Initial Interval
Task intervals are recorded on the decision worksheet in the 'Initial
Interval' column. We have seen that they are based on the following:
* on-condition task intervals are governed by the P-F interval
* scheduled restoration and scheduled discard task intervals depend on
the useful life of the item under consideration
* failure-finding task intervals are governed by the consequences of the
multiple failure, which dictate the availability needed, and by the mean
time between occurrences of the hidden failure.
When completing the decision worksheet, record each task interval on its
own merits, without reference to any other task. Do not base the interval
of one task on that of another, because the reason for doing a task at a
particular frequency could change over time - indeed the reason for doing
the task at all could disappear. So if the frequency of task X is based on
the frequency of task Y and task Y is later eliminated, the frequency of
task X becomes meaningless.

As discussed in the next chapter, if we are confronted with a number
of tasks which need to be done at a wide range of different frequencies,
the time to consider consolidating them into a smaller number of work
packages is when compiling maintenance schedules. However, the initial
task frequencies should always remain on the decision worksheet to
remind us how the schedule frequencies were derived (in other words, to
preserve the 'audit trail').
Note also that task intervals can be based on any appropriate measure
of exposure to stress. This includes calendar time, running time, distance
travelled, stop-start cycles, output or throughput, or any other
measurable variable which bears a direct relationship to the failure
mechanism. However, calendar time tends to be used wherever possible because
it is the simplest and cheapest to administer.

Can Be Done By
The last column on the decision worksheet is used to list who should do
each task. Note that the RCM process considers this issue one failure
mode at a time. In other words, it does not approach the subject with any
preconceived ideas about who should (or should not) do maintenance
work. It simply asks who has the competence and confidence to do this
task correctly.
The answer could be anyone at all. Tasks can be allocated to maintainers,
operators, insurance inspectors, the quality function, specialist
technicians, vendors, structural inspectors or laboratory technicians.
A sometimes controversial issue which arises at this stage concerns
high-frequency on-condition and failure-finding tasks. It sometimes
makes sense to allocate these tasks to maintainers, but in many cases,
using maintainers to do these tasks has the following drawbacks
(especially if they are skilled tradespeople):
* if the P-F interval is short, the inspection frequency will be very high,
sometimes more than once per shift. This can lead to so many inspection
tasks that maintainers do little more than travel from one task
to the next. This travelling time between the tasks makes the use of
maintainers for this purpose expensive, often to the point where it is
simply not worth using them in this capacity.
* many skilled people find high-frequency inspection tasks boring and are
often reluctant to do them at all.
* skilled craftspeople are very scarce in many parts of the world, so it is
often difficult to spare them for this kind of work in the first place.
A second option is to use operators to do high frequency tasks. This option
can be attractive because it is usually more economical and organisationally
easier to use people who are near the equipment most of the time to
do high frequency tasks. Operators are also often more highly motivated
to look after 'their' machines. However, three conditions must be
satisfied before operators can be used with confidence to do these tasks:
* they must be properly trained in how to recognise the appropriate
potential failure conditions in the case of on-condition tasks, and must be
properly trained to do high-frequency failure-finding tasks
* they must have access to simple and reliable procedures for reporting
any defects which they do find. (The design of these procedures is
discussed in more detail in Chapter 11)
* they must be sure that action will be taken on the basis of their reports, or
that they will receive constructive feedback in cases of misdiagnosis.
Using operators for this purpose can also have profound implications in
terms of industrial relations and reporting relationships, so it is an issue
which needs to be handled with care.
In fact, as with most of the other decisions in the RCM process, who
is in the best position to do each task is best decided by the people
who know the equipment best. This issue is discussed at greater length in
Chapter 13.

10.3 Completing the Decision Worksheet

To illustrate how the decision worksheet is completed, we consider three
failure modes which have been discussed at length in previous chapters.
These are:
* the bearing which seizes on a pump with no stand-by, as discussed on
pages 105 and 106
* the bearing which seizes on an identical pump which does have a
stand-by, as discussed on pages 108 and 109
* the failure of the stand-by pump set as a whole, as discussed on pages
118 and 179.

[Figure 10.8: decision worksheet for the three failure modes; legible entries
include 'main pump bearing', 'scheduled maintenance' and '4 weekly']

The RCM Decision


The associated decisions are recorded on the decision worksheet shown in
Figure 10.8. Please note three points about this example:
the first two pumps could suffer from many more failure modes than the
failure under consideration. Each of these other failures would also be
listed and analysed on its own merits
a number of other preventive tasks could have been chosen to [...] the
failure of the [...]; the decisions in the [...] are for the purpose of
illustration
the stand-by pump is treated as a 'black box'. In practice, if such a pump
were known to suffer from one or more dominant failure modes, these
failures would be analysed individually.
In essence, the RCM worksheets show what course of action has been
selected to deal with each failure mode. They also show why it was
selected. This information is invaluable if the need to do any maintenance
task is challenged at any time.
The ability to trace each task back to the functions and desired
performance of the asset also makes it easier to keep the maintenance
program up to date. This is because [...] reassess tasks which are affected
by a change to the operating context of the asset (such as a change in
shift [...] or regulations), and avoid [...] be affected by the [...].

10.4 Computers and RCM

The information contained in the RCM Information and Decision Worksheets
lends itself readily to being stored in a computerised database. In fact,
if a large number of assets are to be analysed, it is almost essential to
use a computer for this purpose. A computer can also be used to sort the
proposed tasks by interval and skillset, and to generate a variety of other
reports (failure modes by consequence category, tasks by task category, and
so on). Finally, storing the analyses in a database makes it much easier to
revise and refine the analyses as more is learned and as the [...] changes
(as it surely will: see Chapter 13).
However, note that a computer should only ever be used to store and sort
RCM information, and perhaps to assist with the more complex failure-finding
interval calculations. For reasons discussed in [...], computers should
never be used to drive the RCM process.
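As a rough illustration of the storing and sorting role described above, the following Python sketch keeps a handful of worksheet decisions in an in-memory SQLite database and produces two simple reports. The table layout, column names, failure mode references and consequence categories are assumptions made for the example, and a monthly task is approximated as a four-week interval; nothing here is a prescribed format for RCM worksheets.

```python
import sqlite3

# Hypothetical schema: the table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE decisions (
        failure_mode   TEXT,     -- reference to the failure mode analysed
        consequence    TEXT,     -- consequence category
        proposed_task  TEXT,     -- routine task or default action
        interval_weeks INTEGER,  -- NULL when there is no routine task
        done_by        TEXT      -- NULL when there is no routine task
    )
    """
)

# Task names follow worksheet examples used elsewhere in the text; the
# failure mode references and consequence categories are invented here.
rows = [
    ("FM-1", "operational", "Check tension of main drive chain", 4, "Mechanic"),
    ("FM-2", "operational", "Check agitator gearbox oil level", 1, "Operator"),
    ("FM-3", "hidden", "Calibrate gauge", 52, "E&I technician"),
    ("FM-4", "safety", "Redesign guard", None, None),
    ("FM-5", "non-operational", "No scheduled maintenance", None, None),
]
conn.executemany("INSERT INTO decisions VALUES (?, ?, ?, ?, ?)", rows)


def routine_tasks_by_interval_and_skill(connection):
    """Return (interval_weeks, done_by, proposed_task) for routine tasks only."""
    cursor = connection.execute(
        """
        SELECT interval_weeks, done_by, proposed_task
        FROM decisions
        WHERE interval_weeks IS NOT NULL
        ORDER BY interval_weeks, done_by
        """
    )
    return cursor.fetchall()


def failure_modes_by_consequence(connection):
    """Count the failure modes recorded in each consequence category."""
    cursor = connection.execute(
        "SELECT consequence, COUNT(*) FROM decisions GROUP BY consequence"
    )
    return dict(cursor.fetchall())


for interval, skill, task in routine_tasks_by_interval_and_skill(conn):
    print(f"every {interval} weeks | {skill} | {task}")
```

Note that the sketch only stores and sorts decisions which people have already made; it makes no attempt to select tasks itself.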


11 Implementing RCM Recommendations
11.1 Implementation - The Key Steps
We have seen how the formal application of the RCM process ends with
completed decision worksheets. These specify a number of routine tasks
which need to be done at regular intervals to ensure that the asset
continues to do whatever its users want it to do, together with the default
actions which must be taken if an appropriate routine task cannot be found.
[...] in this process learn [...] about how the asset works and about how
it fails. This on its own frequently causes the participants to change
their behaviour in ways which often lead directly to remarkable
improvements in asset performance. However, in order to derive the maximum
long-term benefit from RCM, steps must be taken to implement the
recommendations on a formal basis. These steps should ensure that:
all the recommendations are approved [...] by the managers with [...]
for the assets
all routine tasks are described clearly and [...]
all actions which call for once-off changes (to designs, to the way the
[...] or to the capability of [...] and maintainers) are identified and
implemented [...]
routine tasks and operating procedure changes are incorporated into
appropriate work packages
[...] this in turn entails:
[...] the work packages into [...] which ensure that they [...] by the
right people at the right time and that they [...]
[...] that any faults found are dealt with [...].
[...] are summarised in Figure 11.1 opposite. The most important of them
are discussed in more detail in the rest of this chapter.

Figure 11.1: After RCM
[The surviving worksheet entries in this figure are: No scheduled
maintenance; Redesign guard; Check agitator gearbox oil level (Weekly);
Check tension of main drive chain; Calibrate gauge.]

11.2 The RCM Audit

If it is correctly applied, the RCM process provides the most robust
framework available for formulating asset management strategies. [...]
profoundly affect the safety, environmental integrity and economic
well-being of the [...] using the assets. [...] does go badly wrong in
spite of the best efforts of the people [...] the process, every decision
will be subjected to a [...] often intensely adversarial review [...] from
regulatory authorities, insurers and shareholders to representatives [...]
victims (or their survivors). As a [...] which uses RCM should take care to
ensure that the people who apply it know what [...], and also to [...]
itself that their decisions are sensible [...]. The latter step is known as
the RCM audit.
RCM audits entail a formal review of the contents of the RCM Information
and Decision Worksheets. This section of this chapter looks at who should
do the audit, when it should be done and what it entails.



Who should do the audit

[...] the overall [...] for the asset if something goes badly wrong, so it
is in the interests of themselves and their [...] to satisfy themselves
that reasonable steps are taken to prevent such occurrences. As mentioned
in [...], senior managers do not necessarily have to do the audits
themselves, but may delegate them to anyone in whose judgement they have
enough confidence. However, if this is [...], it should be understood that
the auditors are [...] management, so the latter still bear the ultimate
[...] for the decisions. (Whoever carries out the audits should also be
thoroughly trained in RCM.)
If the auditors [...] with any [...], they should discuss the matter with
the [...] who performed the [...]. In so doing, the auditors should be
prepared to [...] themselves may be wrong. (In most cases, no more than 5%
of the decisions are queried.)


When the audit should be done

Audits should be carried out as soon as possible after each review has
[...] (preferably within two [...]), for three reasons:
the people who did the [...] efforts put into [...]. (If this [...], they
start to lose [...], and more [...] whether management was serious about
involving them in the first [...])
[...] can still recall [...]
the sooner the decisions are implemented, [...] derives the full benefits
of the exercise.
When overall [...] is reached about each [...], the decisions are
implemented as described in the rest of this chapter.

What the audit entails

An RCM analysis needs to be audited from the point of view of method and
content. When [...] the method, the auditor seeks to ensure that the RCM
process has been [...]. When [...] the content, the auditor seeks to ensure
that the correct information has been [...] and conclusions drawn both
about the asset itself and [...] it forms part. Issues which most often
need attention are as follows:

The analysis should be carried out at the [...] level. The most common
fault is to analyse assets at too low a [...], and the usual [...] numbers
of items with only one or two functions defined per item.

All the functions of the asset should be [...]. [...] points to look for
include the following:
each function statement should define [...] one [...], although it may
incorporate more than one [...] standard. As a rule, each function
statement should contain [...] one verb ([...] is a protective device)
performance standards should be quantified [...] and should indicate what
the asset must be able to do in [...] operating context rather than its
rated [...] (what it can [...])
all protective devices should be [...] described ('to do X if Y [...]')
the functions of all gauges and indicators should be [...] desired levels
of accuracy.

All the functional failures associated with each function should be listed
(usually complete failure plus the [...] of each performance standard in
the function statement).

Failure modes
Ensure that failure modes which have happened or which are reasonably
likely have not been omitted. Failure mode descriptions should also be
[...]. In particular,
[...] should include a verb, not [...] a component
the verb should be a word other than 'fails' or 'malfunctions' [...]
appropriate to treat the failure of a sub-assembly as [...] (option [...]
on page 87)
switch and valve failures should indicate whether the item fails in the
open or closed position
Failure modes should relate directly to the functional failure under
consideration, and failure modes and effects should not be transposed, as
in:
Failure Mode: [...] out
Failure Effect: Pump impeller jammed by rock
Another common mistake is to combine two substantially different failure
modes in one description, as in 'Screens damaged or worn', where the two
modes '1 Screens damaged' and '2 Screens worn' should be listed separately.

Failure effects
[...] should make it possible to decide:
whether (and how) the failure will be evident to the operating crew
[...] how) the failure poses a threat to [...]
what effect (if any) the failure has on production or operations (output,
[...]
Failure effects should not incorporate actual 'consequence words' like
[...] or 'This failure is evident'. However, [...] should list likely total
downtime as opposed to [...] time, and should indicate what must be done to
rectify the failure (replace, repair, reset, etc).
Finally, auditors should satisfy themselves that [...] which is said to be
analysed [...]


Consequence evaluation
Special care should be taken to ensure that the hidden function question
(question H on page 200) has been answered [...]. In [...] should have been
attached to the terms 'on its own' and 'under normal circumstances' in this
[...] on pages 124 and 126. Special attention should also be [...] to the
evaluation of the safety and environmental consequences of evident [...],
and to the effectiveness of any tasks which might have been selected to
manage failures in these two categories.
Task selection
Any tasks which have been selected should not only [...] the criteria for
technical feasibility as [...] 6, [...] and 8, but they should also address
the consequences of the failure. Key points to look out for:
if the answer to question H is 'No' and the answer to question H4 is 'No',
then question H5 must be answered. If the answer to H5 is 'Yes', the
proposed task should not be 'no scheduled maintenance'
if the answer to question S or E is 'Yes', the proposed task should not
be 'no scheduled maintenance'
if the failure has operational or non-operational consequences, the task
must be cost-effective.
[...] tasks or default actions should be described in enough detail to
leave the auditor in no doubt as to what is intended. In [...], task
descriptions should not simply list the type of task ('[...] on-condition
tasks' or 'scheduled [...]') [...] to the failure mode in question. It
should not incorporate a combination of tasks because this signifies two
different failure modes ([...] the answer [...] S4 is yes). For [...]:
Inspect chain for wear and [...] tension of chain

Initial interval
[...] to the criteria set out [...] 6, 7 and 8. In particular, look out for
a [...] to confuse P-F intervals with useful life in on-condition task
intervals.
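The first two task selection checks above lend themselves to a mechanical cross-check before the auditor reads each decision in detail. The Python sketch below is one possible way of expressing them; the function name, the dictionary layout and the 'Y'/'N' codes are assumptions made for illustration, and the sketch covers only those two rules, not the whole audit.

```python
from typing import Dict, List, Optional

NO_SCHEDULED_MAINTENANCE = "no scheduled maintenance"


def audit_task_selection(
    answers: Dict[str, Optional[str]], proposed_task: str
) -> List[str]:
    """Return audit findings for one worksheet row (hypothetical layout).

    `answers` maps decision diagram question labels such as 'H', 'H4',
    'H5', 'S' and 'E' to 'Y' or 'N'; a missing key or None means the
    question was not answered on the worksheet.
    """
    findings = []
    is_nsm = proposed_task.strip().lower() == NO_SCHEDULED_MAINTENANCE

    # Rule 1: H = No and H4 = No means that H5 must be answered, and a
    # 'Yes' to H5 rules out 'no scheduled maintenance'.
    if answers.get("H") == "N" and answers.get("H4") == "N":
        if answers.get("H5") is None:
            findings.append("H5 must be answered when H and H4 are both 'No'")
        elif answers["H5"] == "Y" and is_nsm:
            findings.append("H5 is 'Yes' but task is 'no scheduled maintenance'")

    # Rule 2: a 'Yes' to S or E rules out 'no scheduled maintenance'.
    if (answers.get("S") == "Y" or answers.get("E") == "Y") and is_nsm:
        findings.append("S or E is 'Yes' but task is 'no scheduled maintenance'")

    return findings


# Example: a hidden failure where H5 was left blank on the worksheet.
print(audit_task_selection({"H": "N", "H4": "N"}, "Calibrate gauge"))
```

A check of this kind can only flag inconsistencies in the way the worksheet was filled in. Whether a task is cost-effective or adequately described remains a matter for the auditor's judgement.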



11.3 Task Descriptions

Before any task reaches the person who has to do it, it must be described
in enough detail to leave no doubt at all as to what is to be done. [...]
will be [...] overall level of skill and experience of the workers
involved. However, bear in mind that the more that is left out of a task
description, the [...] that someone will miss out a [...] or choose to do
the wrong task [...]. In this context, special care [...] to be taken with
the description of any failure-finding task which calls for a hazardous
situation to be simulated in order to test the function of a [...].
[...] what action must be taken if [...] defect is encountered. ([...]
should the defect be [...] to a [...] or to the maintenance [...], or
should it be rectified immediately?) Instructions like 'check component A
for condition B and [...]' should be used with caution, because the 'check'
part could take a few seconds while the [...] part could take several
hours. This can play havoc with the duration of planned downtime.
Instructions of this sort should in fact be written as 'check component A
for condition B and [...] defects to [...]'. Only use 'if necessary' for
quick servicing [...], such as 'check gearbox oil level and top up with
Wonderoil Type 900 [...]'.
Examples of the right and the wrong way to specify tasks are shown in
Figure 11.2 below:
Wrong:  Check coupling
Right:  Check feedscrew coupling for loose bolts and [...] if necessary
Right:  Visually check agitator coupling for cracks and report defects to
        the maintenance supervisor etc

Wrong:  Calibrate gauge
Right:  Fit 0 - 20 bar test gauge to test [...] and check if [...] on
        pressure gauge PI 1204 is within 0.5 bar of the reading on the test
        gauge when the test gauge reads 8 bar. Arrange to [...] out-of-spec
        gauges when plant is shut down for cleaning
Right:  Remove pressure gauge PI 1204 to workshop and calibrate following
        procedure in manual 27A

Figure 11.2: Task descriptions
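One reason the detailed gauge task is easier to carry out is that it states an unambiguous acceptance test. A minimal Python sketch of that test is shown below; the function and parameter names are assumptions made for the illustration, while the 0.5 bar tolerance and the 8 bar test point come from the task description in the figure.

```python
def gauge_within_spec(
    gauge_reading_bar: float,
    test_gauge_reading_bar: float,
    tolerance_bar: float = 0.5,
) -> bool:
    """Return True if the plant gauge agrees with the test gauge.

    The gauge is treated as in-spec when its reading is within
    `tolerance_bar` of the test gauge reading (inclusive).
    """
    return abs(gauge_reading_bar - test_gauge_reading_bar) <= tolerance_bar


# Test gauge reads 8 bar, as specified in the task description.
print(gauge_within_spec(8.4, 8.0))  # reading is 0.4 bar high
print(gauge_within_spec(8.6, 8.0))  # reading is 0.6 bar high
```

A task written simply as 'Calibrate gauge' gives the person doing the work no equivalent criterion to apply.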

Pages 206 and 207 explained that each task should be defined as [...] as
possible on the decision worksheet. This saves the duplication of effort
which occurs if detailed procedures have to be written up later [...]one
else. It also reduces the possibility [...]. [...] if time does not [...]
the procedures to be specified [...] the RCM analysis, then they must be
[...] later. As mentioned below, this can often be done as [...] of an
ISO 9000-type initiative.
Note that if detailed task [...] are to be [...], [...] should ideally be
done by someone who [...] analysis. If this is not possible, the third
[...] should understand that he or she is [...] asked to define the tasks
on the decision worksheet in more detail, and not to re-audit the analysis.

[...] the task is listed should also clearly state the [...]:
[...] number where relevant
who should do the task (operator, [...])
[...] with which the task is to be done
[...] and/or isolated while the task is [...]
[...] precautions which must be taken
[...] these items can [...] to and fro after the job [...]
ISO 9000 and RCM
A major [...] of RCM is to [...] what work should be [...]. (In other
words, to ensure that 'they do the [...]'.) [...] hand, a [...] thrust of
quality [...] like ISO 9000 [...] should be doing as [...] possible in
order to minimise the chance of errors. (In other words, to ensure that
[...].) [...] tasks from RCM decision worksheets to end-user documents can
be seen as the [...] where the output of an RCM [...] becomes the input to
an ISO 9000 [...] writing exercise. It also [...] that if both initiatives
are to [...], it makes sense to apply RCM first.

11.4 Implementing Once-off Changes

At the end of a typical RCM analysis, it is not unusual to find that
between [...] and 10% of the failure modes default to redesign. Part 2 of
Chapter 9 mentioned that in the context of RCM, [...] in any of the
following three areas:
[...] to the [...] configuration of an asset or system
[...] to a process or [...]
[...] to the capability of a person, usually by [...].
Once they have been [...] by the auditors, these changes need to be
implemented as thoroughly and as quickly as possible. Key issues in each
of these three areas are discussed below.

[...] to the [...]
All modifications should be:
[...] Chapter 9 explained that modifications should be [...] consequences.
Modifications intended to deal [...] or environmental consequences should
reduce the risk (frequency and/or severity) of the consequences to a level
which is acceptable. [...]rithm which can be used [...] modifications
intended to deal with failures that only have economic consequences
[...] suitably qualified engineers. As a rule, attempts should not be made
to [...] the RCM process, but the [...] should consult afterwards with the
people who did the review in order to develop a correctly focused [...]
[...] Steps must be taken to ensure that modifications are carried out as
intended in terms of [...] cost and quality, and that [...] manuals and
parts lists are updated correctly
[...] managed. Modifications should not interfere with essential routine
maintenance activities in other parts of the [...], and the maintenance
requirements [...] should be assessed and implemented.

[...] to the way in which the plant is [...]
Changes to the way in which plant must be operated are handled in the same
way as routine tasks which are incorporated into operating [...], as [...]
in the next part of this chapter.

Changes to the capability [...]
[...] in Chapter 4, the RCM process frequently reveals failure modes caused
by slips or [...] on the [...] of operators or maintainers [...] human
[...] become apparent [...] any operators or maintainers who participate in
the process, and [...] usually modify their behaviour [...] learn what they
are doing wrong.
However, we also need to ensure that people who have [...] in the process
acquire the relevant skills. In most cases, the most efficient way to do
this is to revise or extend [...] or to develop new programs. In most [...]
done in consultation with the training [...].

11.5 Work Packages

Once the maintenance [...] be packaged in a form which can be [...] and
which can be [...] in a neat and [...] to the people who will be [...] the
tasks. This can be done in [...]:
[...] maintenance procedures to [...] incorporated into the [...]
the balance of the maintenance routines are [...] into separate schedules
and checklists, [...]


Standard Operating Procedures

The previous part of this [...] which must be made to the way in which an
asset is [...] should be documented in standard operating [...] or SOP's.
(In situations [...] not exist, it will almost certainly be necessary to
[...] order to ensure that the [...] are implemented.) In many [...] are
also the simplest and [...] way to manage [...] which need to be done by
operators. [...] tasks should only be [...] need to be done at intervals of
one week [...]. [...] to be done by operators at [...] should be [...] into
separate schedules and planned, organised and controlled in the same way
[...] maintenance schedules, as described in Part 6 of this chapter.


Figure 11.3: Transferring a task from a decision worksheet to an SOP
[The surviving worksheet entries in this figure are: Check tension of main
drive chain (Monthly, Mechanic); Calibrate gauge (Annually, E&I
technician); No scheduled maintenance; and a partly legible entry ending
'if low' (4 yearly, Operator).]

Maintenance Schedules
A maintenance schedule is a document listing a number of maintenance
tasks to be done by a person with a specified level of skill on a [...]
asset at a [...]. Figure 11.4 shows the relationship between these
schedules and the RCM decision worksheets:

Figure 11.4: [...] tasks from a decision worksheet to a maintenance schedule

Compiling schedules from RCM decision worksheets is a fairly
straightforward process. [...] a few additional factors need to be taken
into account as [...] in the following paragraphs.
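The basic compilation step can be pictured as grouping worksheet tasks by asset, skill and interval, so that each group becomes one schedule. The Python sketch below illustrates this under stated assumptions: the `Task` record, its field names and the asset names are hypothetical, and only the task names, intervals and skills echo worksheet examples used in this chapter.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Task(NamedTuple):
    asset: str        # hypothetical asset name
    description: str  # task as worded on the decision worksheet
    interval: str     # initial interval from the worksheet
    done_by: str      # skill specified on the worksheet


def compile_schedules(tasks: List[Task]) -> Dict[Tuple[str, str, str], List[str]]:
    """Group tasks into schedules keyed by (asset, done_by, interval)."""
    schedules = defaultdict(list)
    for task in tasks:
        schedules[(task.asset, task.done_by, task.interval)].append(task.description)
    return dict(schedules)


worksheet_tasks = [
    Task("Mixer 1", "Check tension of main drive chain", "Monthly", "Mechanic"),
    Task("Mixer 1", "Check agitator gearbox oil level", "Weekly", "Operator"),
    Task("Mixer 1", "Calibrate gauge", "Annually", "E&I technician"),
    Task("Mixer 2", "Check tension of main drive chain", "Monthly", "Mechanic"),
]

for key, descriptions in compile_schedules(worksheet_tasks).items():
    print(key, descriptions)
```

Each key in the result corresponds to one schedule document: one asset, one skill and one interval.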

In Chapter 7 it was mentioned that if a wide range of different intervals
appear on a decision worksheet, [...] should be consolidated into a smaller
number of work [...] the schedules based on the worksheets. Figure 11.5
[...] an extreme [...] of the [...] task intervals which could appear on
decision [...], and how [...] be consolidated into a smaller number of
schedule [...].
[...] in terms [...] Intervals of [...] them and [...] of tasks on [...] to
do them, tend to dictate basic schedule [...] if schedule intervals are
multi[...] of one [...], as shown in the [...]. Note also that if a [...]
changed in this fashion, it should always be incorporated into a schedule
of higher frequency. Task intervals [...] so could move an on-condition
task frequency outside the P-F interval for that failure, [...] it could
[...] a scheduled discard task [...] the end of the 'life' of the [...].
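The consolidation rule can be sketched in a few lines of Python, assuming (as the passage implies) that a task interval may be shortened to fit a schedule but never lengthened, since lengthening could take an on-condition task outside its P-F interval or a discard task beyond the life of the item. The function name and the set of schedule intervals are assumptions chosen for the illustration.

```python
from typing import Sequence


def consolidate_interval(
    task_interval_weeks: float, schedule_intervals_weeks: Sequence[float]
) -> float:
    """Pick the longest schedule interval that does not exceed the task interval.

    Rounding down means the task is done at least as often as the
    worksheet requires. A ValueError is raised if every schedule interval
    is longer than the task interval, because no schedule could then
    accommodate the task without lengthening its interval.
    """
    candidates = [s for s in schedule_intervals_weeks if s <= task_interval_weeks]
    if not candidates:
        raise ValueError("no schedule interval is short enough for this task")
    return max(candidates)


# Hypothetical schedule intervals, in weeks, each a multiple of the last.
schedule_intervals = [1, 2, 4, 12, 24, 48]

print(consolidate_interval(6, schedule_intervals))   # 6-weekly task
print(consolidate_interval(13, schedule_intervals))  # 13-weekly task
```

In this sketch a 6-weekly task would move to the 4-weekly schedule and a 13-weekly task to the 12-weekly schedule.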

Figure 11.5: [...] task frequencies
[Surviving interval labels from this figure include '2-weekly' and
'3-monthly'.]

When a low frequency schedule incorporates a higher frequency schedule,
should the latter be incorporated as a [...] instruction, or should it be
rewritten in full? In other [...] an annual schedule in[...] 'do the three
[...] schedule'), or should all the [...] schedule be written out in [...]
annual schedule?
In fact it is wise to rewrite the schedules in order to avoid the
possibility of contradictions.

For instance, consider what could happen in a situation where a three
[...] schedule includes the instruction 'check gearbox oil [...] the annual
schedule for the same machine starts with the instruction 'do the [...]
monthly schedule', and later says 'drain, flush and refill [...]

[...]ever we [...] term 'life' [...]

Weibull distributions

[...] down into functions and [...] function into functional [...], then
identifies the failure modes which [...] functional failure. This [...] to
establish [...] in the case of failure modes which have [...] other words
it [...] establish the [...] from the [...] of information which [...]
include:
[...] that the failure has [...] which [...] obtained [...]

13.2 RCM Review Groups

In the light of the issues raised in Part 1 of this chapter, we now consider
who should participate in a typical RCM review group, what each group
actually does, and what the participants get out of this process.

Who should participate
The people mentioned most frequently in Part 1 of this chapter were first
line supervisors, operators and craftsmen. This suggests that a typical
RCM review group should include the people shown in Figure 13.1.

Figure 13.1: A typical RCM review group
[Surviving labels from this figure include '(M and/or E)' and 'External
Specialist (if needed) (Technical or Process)'.]

In practice, the places on every group do not have to be filled by exactly
the same people as shown in Figure 13.1. The objective is to assemble a
group which can provide most if not all of the information described in
Part 1 of this chapter. These are the people who have the most extensive
knowledge and experience of the asset and of the process of which it forms
part. To ensure that all the different viewpoints are taken into account,
this group should include a cross-section of users and maintainers, and a
cross-section of the people who do the tasks and the people who manage
them. In general, it should consist of not less than four and not more
than seven people, the ideal being five or six.
The group should consist of the same individuals throughout the analysis
of any one asset. If the faces present at each meeting change, too much
time is lost going over ground which has already been covered for the
benefit of the newcomers.
As suggested in Part 1 of this chapter, 'specialists' can be specialists
in any of the following: [...]

[...] each review group will be identified [...]
[...] it may become necessary to group the [...] order to [...] out a
sensible [...]
This means that the final decision [...] who then has to define the
boundaries of the [...]

14 What RCM Achieves

14.1 Measuring Maintenance Performance

As discussed at [...] 11, the application of RCM results in [...] outcomes,
as follows:
maintenance schedules to be done by the maintenance department
[...] for the operators of the asset
a list of areas [...] must be made to the design of the asset or the way in
which it is [...] in order to deal with situations where the asset cannot
deliver the desired performance in its current [...].
Two other less [...] outcomes which were mentioned in [...] are that
participants in the process learn a [...] deal about how the asset [...],
and also tend to function better as teams.
[...] deal of time and [...] all these outcomes [...] if RCM is applied as
described in Chapter 13. However, if [...], it yields returns which far
outweigh the costs involved. Most applications pay for themselves in a
matter of months, although some have paid for themselves in two weeks or
less. The wide [...] of ways in which RCM pays for itself are discussed at
length in [...] 4 of this [...]. In order to place this discussion in
perspective, we first need to consider different ways in which it is
possible to measure the performance of the maintenance function.
[...] can be considered from two quite distinct viewpoints. The first
focuses on how well maintenance ensures that assets continue to do what
their users want them to do. This is usually referred to as maintenance
effectiveness, and it is likely to be of most interest to the users or
'customers' of the maintenance service. The second viewpoint concentrates
on how well maintenance resources are [...] used. This is [...] referred to
[...]. It is usually of more interest to maintenance [...]. These two
issues [...]


14.2 Maintenance Effectiveness

Chapters 1 and 2 emphasised that the [...]sure that [...] desired [...] the
user. As a result, any [...] of how well maintenance is [...]ment of how
well the assets are [...] to fulfil their functions to the desired
standard. This is influenced in turn [...] three issues:
[...] can be measured in [...]
users have different [...] of different functions
individual assets can have more than one and often several [...] in
Chapter 2.
These issues are considered in more detail in the [...].

Different Ways of Measuring Maintenance Effectiveness

The primary function of any [...] mechanised and fully loaded manufacturing
facility is to produce at least [...] many units of [...] it was built
[...]. ('[...] loaded' means that [...] is operating seven days per week/24
hours per day, and that there is [...] ready market for every unit which
the [...].) In this context, any failure which reduces output results in
lost [...]. In cases like [...] the simplest overall measure of the
performance of the facility as a whole is total [...] what the users [...]
the owners feel [...] If the [...] will not be satisfied until the [...]
effectiveness in terms of total output [...]nised when [...] any system
[...]
All is not necessarily well if overall output is [...] is producing the
[...] number of units could still be [...] problems which affect product
quality, [...] and so [...], so these also [...] to be measured [...]
There are [...] ways in which we can measure how [...] an asset is
fulfilling its functions. Five of the most common [...] as follows:
[...] This is the most [...] understood meaning [...]. It is usually
measured [...] 'mean time between [...]' or 'failure rate'.
[...] it lasts. This is usually thought of as its 'life' or its [...], at
the end of which the item under consideration fails and is either rebuilt
[...] this phenom[...]
[...] it [...] out [...] when it [...]. This is usually referred to as
'downtime' or 'unavailability', and measures how much of the time the item
is incapable of fulfilling a stated function to the satisfaction of the
user, in relation to the amount of time the user would like it [...]
[...] it is [...] in the [...] that it has survived to [...] of that
period. We have seen that this is the conditional probability of failure.
This could be described as a measure of 'dependability', if only to
distinguish it from the other three variables. One common variation of this
measure is the 'B10 life'. Chapter 12 [...] that this is usually measured
from the moment the item is put [...], and is the [...] before which not
more than 10% of the items can be [...] to fail. (In other words, the
conditional probability of failure in the stated period is [...].)
[...] the term efficiency actually has two quite distinct [...]. The first
measures [...] to input, while the second measures how well [...] well it
should be [...].
[...] Depending on the technology used ([...] gas, combined [...] etc),
this usually ranges from about 35% to about 58%. However, if a station
which should average 40% energy [...] 38%, it will be [...] 95% of the
energy which it should be [...]. In the context of this book, the first
(40%) measure is a functional performance standard. As explained in
Chapter 3, this is used to judge whether the item has failed. The second
(95%) measure is used to judge the effectiveness with which the
organisation is achieving the desired performance on an on-going basis.
[...] and also does so in two ways - how fast an asset should work relative
to the pace at which it could work (desired performance versus initial
[...]), and how fast it [...] works relative to the pace at which it should
work (actual performance versus desired performance). We have seen that
desired performance must be less than initial capability because allowance
must be made for deterioration. So in the context of [...], efficiency
compares the pace at which an asset actually works with the pace at which
it should work, not with the pace at which it could work.
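The distinction between the two measures in the power station example can be made concrete with a short calculation. In the Python sketch below the function name is an assumption made for the illustration, and the figures are the 40% desired and 38% actual values quoted above.

```python
def performance_ratio(actual: float, desired: float) -> float:
    """Return actual performance as a fraction of desired performance."""
    if desired <= 0:
        raise ValueError("desired performance must be positive")
    return actual / desired


desired_efficiency = 0.40  # functional performance standard (should average 40%)
actual_efficiency = 0.38   # what the station is actually achieving

ratio = performance_ratio(actual_efficiency, desired_efficiency)
print(f"Achieving {ratio:.0%} of the desired energy efficiency")
```

The 40% figure is the standard against which failure is judged, while the ratio of 38% to 40% (95%) describes how well that standard is being achieved over time. The same ratio applies to speed, where actual pace is compared with the pace at which the asset should work rather than the pace at which it could work.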

[...] measurements can also apply in a [...] to the consumption of
maintenance consumables (such as [...] oil and hydraulic oil) and process
consumables ([...] solvents and reagents in chemical [...]), and in the
extraction of [...].
All five of these measures are valid. It is [...] a matter [...] is the
most appropriate in the context under consideration.


[...] if a [...] costs per unit of output among all those used by an
electric [...], it is [...] want it to generate [...] much of the time as
possible. In terms of this function, the most appropriate measure of
maintenance effectiveness is [...] load. They may even [...] operational
reasons. Slowdowns or shutdowns of this nature affect the utilisation of
the asset as opposed to its availability. In essence, [...] measures what
percentage of time the machine is available to fulfil its [...]
requirement, while utilisation measures how much it [...].
On the other hand, the [...] might only be used periodically [...] peak
demands for power (peak loads). [...] will be that the generator comes on
stream as soon as it [...] measure of effectiveness will be how often it
does so [...] by a failure [...] it fails to do so.
When measuring safety, [...] measured in terms of number of days or number
of manhours worked between lost time incidents [...] fatalities. This is a
form of 'mean time between failures'. Similar measures are [...] for
environmental incidents.
On the product quality front, a scrap [...] 4% can be seen as a measure of
unavailability, in the sense that while a machine is [...]. (A scrap [...]
of 96%). Scrap rates can also be expressed as 20 parts per million, which
is another way of expressing a failure rate. Both are valid measures of
maintenance [...] in highly mechanised or automated processes.
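The measures mentioned in these examples are all simple ratios, as the Python sketch below shows. The formulas are common working definitions adopted here as assumptions (the text itself defines availability and utilisation only in words), and the hours used in the example are invented; only the 4% scrap rate and the 20 parts per million figure come from the passage.

```python
def availability(required_hours: float, downtime_hours: float) -> float:
    """Fraction of the required time in which the asset was able to run."""
    return (required_hours - downtime_hours) / required_hours


def utilisation(available_hours: float, running_hours: float) -> float:
    """Fraction of the available time in which the asset was actually used."""
    return running_hours / available_hours


def scrap_to_yield(scrap_fraction: float) -> float:
    """Good output as a fraction of total output."""
    return 1.0 - scrap_fraction


def ppm_to_fraction(parts_per_million: float) -> float:
    """Convert a parts-per-million scrap rate to a plain fraction."""
    return parts_per_million / 1_000_000


# Invented example: a 168 hour week with 8.4 hours of downtime, of which
# the machine was run for 120 of the 159.6 hours it was available.
print(f"availability: {availability(168, 8.4):.1%}")
print(f"utilisation:  {utilisation(159.6, 120):.1%}")
print(f"yield at 4% scrap: {scrap_to_yield(0.04):.0%}")
print(f"20 ppm as a fraction: {ppm_to_fraction(20)}")
```

Keeping availability and utilisation as separate calculations mirrors the point made above: downtime caused by failures affects the first, while slowdowns or shutdowns chosen for operational reasons affect only the second.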

Different Expectations
Every function has associated with [...] and/or durability and/or
availability and/or dependability.
For instance, two of the functions associated with the bodywork of a car
[...] isolate the occupants of the car from the elements' and 'to look
[...]'. Most car owners expect the bodywork to be able to fulfil [...]
expected life of the [...] the car is a convertible or [...] or a window).
On the other hand, everyone knows that cars start 'to look unacceptable' in
the space of a few [...] or weeks. [...] case we have a continuity [...] be
measured in hundreds of thousands of kilometres or decades, while in the
second [...] expectation is measured in hundreds of kilometres or [...].

This issue is [...] by the fact that the loss of nearly every function
can be caused by more than one, sometimes dozens of, failure modes.
Each failure mode has associated with it a specific failure rate (or MTBF),
and each will take the function out of service for an amount of time which
[...] to that failure mode. As a result, the continuity characteristics
of any function will [...] be a [...] of the continuity characteristics of
all the failure modes which could cause the loss of that function.
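A simple way to see how failure modes combine is to assume that they occur independently and at roughly constant rates, so that their failure rates add. That assumption is introduced here for illustration and is not made in the text; Chapter 12 discusses failure patterns for which it does not hold. The function names and figures in the Python sketch below are likewise hypothetical.

```python
from typing import Sequence, Tuple


def function_mtbf(mode_mtbfs_years: Sequence[float]) -> float:
    """MTBF of a function lost by any one of several failure modes.

    Assumes independent modes with constant failure rates, so the
    combined failure rate is the sum of the individual rates (1 / MTBF).
    """
    combined_rate = sum(1.0 / mtbf for mtbf in mode_mtbfs_years)
    return 1.0 / combined_rate


def expected_downtime_hours_per_year(
    modes: Sequence[Tuple[float, float]]
) -> float:
    """Expected annual downtime for (mtbf_years, downtime_hours) pairs."""
    return sum(downtime / mtbf for mtbf, downtime in modes)


# Three hypothetical failure modes with MTBFs of 2, 4 and 4 years.
print(function_mtbf([2.0, 4.0, 4.0]))

# Two hypothetical modes: (MTBF in years, downtime per failure in hours).
print(expected_downtime_hours_per_year([(2.0, 8.0), (4.0, 4.0)]))
```

Under these assumptions the function fails more often than any single failure mode on its own, and each mode contributes to total downtime in proportion to both how often it occurs and how long it takes to put right.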

For instance, take the function 'to look [...]' which was mentioned above.
In addition to the accumulation of dirt, this function could be lost due
to rust or corrosion, fading of the paintwork, external [...] (sideswiped
in a parking lot) and vandalism, among others. It should also be apparent
that some of these failure modes have little or nothing to do with
maintenance. For instance, external damage is mainly a function of how this
car - or the other vehicle involved - is operated, although [...] may play
a small part by adding rubbing [...] to reduce damage and/or by making it
easier and cheaper to replace damaged panels. The probability of vandalism
is also a function of where the car is used (the opera[...]), so it is
almost completely beyond the control of the designer and the [...]. The
rate of accumulation of dirt is a function of where and when a car is used
([...] conditions and climatic conditions), and it is managed by a suitable
[...] the car). Corrosion and fading of paintwork can be influenced [...]
at the [...] stage (although yet again the operating context - climatic
conditions and the provision of shelter - and to some extent maintenance
activities - polishing and chassis washing - can play a part in mod[...] of
these [...]).

[...] conclusions, as follows:
we need a thorough understanding of all the failure modes which are likely to cause each loss of function in order to be able to design, operate and maintain an asset in such a way that the effectiveness expectations which we have of each function will be achieved
it is unreasonable to hold the maintainer of an asset alone accountable for the achievement of any continuity (reliability/availability/durability) [...] function of any asset. The achievement of these [...] is also a function of how it is designed, built and [...]. Accountability for the associated [...] between the people responsible for all of these functions. (In other words, 'maintenance' effectiveness as it is defined in this chapter is not only a measure of the effectiveness of the maintenance department. It measures how effectively everyone associated with the asset is doing whatever is necessary to ensure that it continues to do what its users want it to do.)
What RCM Achieves


Different Functions
The most important issue affecting the measurement of the effectiveness of maintenance activities is the fact that every asset has more than one, and sometimes dozens, of functions. A unique set of continuity expectations is associated with each function. This means that if [...]

For instance, let us consider how maintenance effectiveness might be measured by the owner of a typical suburban gas station. For the purpose of this example, the 'asset' is a storage and pumping system [...]. In this system, unleaded gasoline is stored in an [...] tank with a capacity of 50 000 litres. It is periodically filled by a road tanker to a level of 48 000 litres. An upper level switch in the tank switches on a local warning light if the tank has been filled to a level of 48 500 litres, and another switches on another light in the main office if the level drops to 5 000 litres. A low level alarm sounds in the main office if the tank level drops to 2 000 litres, and a local ultimate level alarm sounds if the tank level reaches 49 000 litres. The tank is double skinned to ensure that gasoline is contained in the event of a leak in the inner skin. A level indicator indicates the fuel level in the tank.
The tank supplies gasoline to five pumps. Each pump is switched on and off by a handle in the nozzle. The nozzle also incorporates a pressure switch which switches off the pump when the vehicle fuel tank is filled to the tip of the nozzle. A flow meter measures the amount of fuel delivered while the pump is activated and displays the volume and value of the fuel delivered to the customer. This meter is zeroed each time the nozzle is returned to its cradle.
(This system embodies additional secondary functions which deal with access onto and into the tank, drains, valving, ease of use by the customer, [...] protection, appearance and so on. These would also be listed in a real-life situation. However, for the purpose of this example, we only consider the functions described above.) On this basis, a list of functions might read as follows:
1 to pump between 25 and 40 litres/minute of gasoline to the vehicle
2 to indicate volume and value of fuel delivered to customer to within 0.03% of actual volume/value
3 to shut off pump when handle is released by customer or when customer's fuel tank is full
4 to contain the gasoline
5 to store between 2 000 and 48 000 litres of gasoline
6 to switch on a warning light in main office if the tank level drops to 5 000 litres
7 to switch on a local warning light if the tank level reaches 48 500 litres
8 to sound an alarm in main office if the tank level drops below 2 000 litres
9 to sound an alarm if the tank level reaches 49 000 litres
10 to contain the contents of the tank in the event of a leak
11 to indicate the level of fuel in the tank to within 0.05% of the actual level.
[...] the maintenance effectiveness of this system, the [...] gas station will [...] so each functional failure needs to be considered on its own merits, as follows:
Functional failure A: fails to pump at all: Obviously, if a pump isn't working, it cannot be [...] gasoline. However, there are five pumps in the station, so the level of availability [...] on the [...] of demand. For example, the station owner might tell us that he "hardly ever" has all five pumps in use at once - so seldom that we can ignore the possibility. He might also tell us that four pumps are in use simultaneously for a total of not more than one hour a day, and then never for more than about ten minutes at a time. If each pump has an average availability of 95%, two pumps will be out of service simultaneously for no more than about 2% of the time. In other words, four pumps would be available 98% of the time, while there is a demand for four pumps 4% of the time. Under these circumstances, only a tiny fraction of customers would need to wait, and then not for very long. This might tempt the owner to accept an [...]. (If he regularly had five or more customers wanting to buy gas at the same time, he would expect a much higher availability. But it may cost him somewhat more to achieve it, especially if he has to pay a premium for rapid response when calling out technicians to deal with failures.)
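The availability argument above can be checked with a short calculation. The sketch below is illustrative only: it assumes that the five pumps fail independently of one another and of customer demand, which the text does not state, and the function name is invented for this example.

```python
from math import comb


def prob_at_least_k_out(n_pumps: int, availability: float, k: int) -> float:
    """Probability that k or more of n independent pumps are out of service."""
    unavailability = 1.0 - availability
    return sum(
        comb(n_pumps, j) * unavailability**j * availability ** (n_pumps - j)
        for j in range(k, n_pumps + 1)
    )


# Figures taken from the example: five pumps, each available 95% of the time.
p_two_or_more_out = prob_at_least_k_out(n_pumps=5, availability=0.95, k=2)

# Demand for four pumps: about one hour in every 24 (roughly 4% of the time).
demand_for_four_pumps = 1 / 24

# Assumption: pump failures and demand peaks are independent of each other.
p_customer_waits = demand_for_four_pumps * p_two_or_more_out

print(f"two or more pumps out of service: {p_two_or_more_out:.1%}")
print(f"four pumps wanted but not available: {p_customer_waits:.2%}")
```

Under these assumptions the first figure comes out a little above 2%, in line with the rounded value quoted in the example, and the second is well under a tenth of one percent of the time.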
Functional failure B: pumps less than 25 litres/minute: [...] might find slow pumps [...] irritating [...] to take their business elsewhere, especially if there are faster alternatives [...]. Consequently, the owner is likely to want any of his pumps which hasn't failed completely to pump at the required rate "all the time - or at least, as close to all the time as you can make it". This might turn out to mean 99.8% of the time that the pump is not otherwise out of action - another form of 'availability'.
Functional failure C: pumps more than 40 litres/minute: If the pump pumps too fast, it is likely to generate sufficient back pressure to keep tripping the 'tank full' pressure [...] mechanism in the nozzle. Customers would have to learn to throttle back the [...] rate by not [...] the handle so much, which many regulars might also find irritating enough to cause them to take their business elsewhere. As a result, the owner is likely to say that he wouldn't want this failed state to occur "too often". He might then quantify this expectation as a failure rate - say not more than once in fifty years on any one pump.
Function 2: to indicate volume and value of fuel delivered to customer to within 0.03% of actual volume/value. This function can fail in two ways, as follows:
Functional failure A: indicates that more than 0.03% less fuel has been delivered than actual: If this happens, the station owner appears to be selling less fuel than is actually [...], so he loses money. The failure becomes [...] after a while, [...] ratio of fuel sold to fuel received will start to [...]. Nevertheless, the owner would probably still seek a low failure rate - say not more than one in 1 000 years on any one pump. (If the indicator fails completely, it shows that nothing has been delivered. If this [...], one lucky customer might get a free tank of fuel, then the station manager would shut down the affected pump until the problem is rectified.)

Functional failure B: indicates that more than 0.03% more fuel has been delivered [...]: [...] would lead [...] any one pump.

Function 3: to shut off pump [...] customer's fuel tank is full. This function can also fail in three ways, as follows:
- Functional failure A: carries on pumping after the customer releases the [...]: [...] sensor should shut it off when the tank is full. As a result, the customer will end up with much more gas in the tank than he or she wanted. This would almost certainly lead to a row about how much should be [...] for and [...] of a customer. As a result, [...] failure rate - say once in 1 000 years on any one pump.
- Functional failure B: fails to shut off when tank is full: Many customers [...] the sensor to tell them when the tank is full. If it fails to do so, the pump should shut off when the customer releases the handle. However, it [...] that the tank will overflow onto the shoes of the customer before he or she is able to take action. This too would lead the owner to [...] in 1 000 years for any one pump.
- Functional failure C: both local switches unable to switch off pump: If the sensor and the handle both fail to shut off the pump, it will [...] gasoline all over the forecourt until the electrical [...] is shut [...] circuit breaker. This would create a nasty fire hazard, so the owner would expect a very low [...] in 1 000 000 years. ([...] if each switch independently achieves 1 in 1 000.)
Function 4: containment: When asked about this function, the [...] might say something like "we have had one leak in the [...] system in the last ten [...] one too many". Here the user is [...] effectiveness in terms of a failure rate. When [...] one in 500 years for a 'small' leak, which he [...] 5 litres per hour. (It is highly unlikely that anyone would [...] in terms of availability, because [...] means that the system would be leaking 1% of the time - about 800 hours out of ten years. Even 99.9% still means that it would leak for 80 hours.)
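The reason availability is an unsuitable measure for containment can be seen from the arithmetic. The following is a minimal sketch, assuming a year of 8 760 hours; the helper name is hypothetical.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours, ignoring leap years


def hours_in_failed_state(availability: float, years: float) -> float:
    """Hours spent in the failed state over a period, for a given availability."""
    return (1.0 - availability) * years * HOURS_PER_YEAR


for availability in (0.99, 0.999):
    hours = hours_in_failed_state(availability, years=10)
    print(f"{availability:.1%} availability: about {hours:.0f} hours of leakage in ten years")
```

The results (roughly 880 and 88 hours) are the unrounded versions of the "about 800" and "80" hours quoted in the example.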
Function 5: to store between 2 000 and 48 000 litres [...]. This function can also fail in three ways, each of which [...] be considered separately as follows:

Functional failure A: [...] below 2 000 litres: Based on normal patterns of demand, fresh supplies of [...] are ordered when the tank level approaches 5 000 litres, and we are told that they are nearly [...] before the level reaches 2 000 litres. If the level in the tank drops much below 2 000 litres, there is a greatly increased chance that the tank will empty, causing the station to lose business. As a result, the station manager expedites the delivery if the level [...] to 2 000 litres ([...] indicated by the low level alarm). He says he needs to expedite deliveries about once a year, which he says is [...]. Here he is [...] effectiveness in terms of a [...] rate. (Note that this failed state is [...] by increased demand and/or [...]. It has nothing to do with the maintenance department in the [...] to 'cause the business to continue'.)

Functional failure B: level rises above 48 000 litres: The level in the tank is only likely to rise above 48 000 litres if the delivery driver is not paying attention to the tank level indicator when filling the tank or if the level indicator itself has failed. In both cases the warning light comes on at 48 500 litres. We are told that this happens "about once every six months" - another failure rate which [...] involved might say they accept.
Functional failure C: tank contains something other than gasoline: The tank [...] something other than [...] if it is filled with something else - (say) diesel. If this happens, customers could fill their tanks with the wrong [...] claims for [...] the resulting bad [...] could [...] him out of business, so he would rather this didn't happen at all. When reminded that 'never' is an unattainable ideal, he might decide to accept a failure rate of (say) once in 100 000 years.
Function 6: to switch on a local [...] light if level drops to 5 000 litres. The station manager [...] the level in all the fuel tanks every day in order to track consumption, [...] orders more fuel when levels approach 5 000 litres. The low level warning light serves as a reminder if the level indicator fails or if there is a sudden [...] between readings. This light is needed about once every two years. If it does not work when needed, the low level [...] level drops to 2 000 litres. If an initial order is placed at this late stage, the tank will almost certainly run dry and the station will be out [...]. The owner says he will accept a mean time between [...] (MMF) of 400 years. In the light of this expectation, [...] tells us that the maximum unavailability the station can tolerate for the low level warning light is MTED/MMF = 2/400 = 0.5%. This means that the low level alarm is [...] maintained effectively if its availability remains above 99.5%.
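The ratio used for the low level warning light can be written out as a small calculation. This is a sketch only: it reads the ratio as the interval between demands on the protective device (two years) divided by the tolerable mean time between multiple failures (400 years), and the function and parameter names are assumptions made for illustration.

```python
def max_unavailability(demand_interval_years: float,
                       multiple_failure_mtbf_years: float) -> float:
    """Largest tolerable unavailability of a protective device.

    Assumes the device is called on once per demand interval on average, and
    that a multiple failure occurs only if the device is failed at that moment.
    """
    return demand_interval_years / multiple_failure_mtbf_years


# Low level warning light: needed about once every two years,
# multiple failure tolerated no more than once in 400 years.
unavailability = max_unavailability(2, 400)
required_availability = 1 - unavailability

print(f"maximum unavailability: {unavailability:.1%}")
print(f"required availability:  {required_availability:.1%}")
```

The same function applies to any protective device in the example once its demand interval and the tolerable multiple-failure interval are known.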
Function 7: to switch on a local warning light if level rises to 48 500 litres. The high level warning light is backed up by an audible alarm, so following similar logic to the above [...], the owner might come to the conclusion that he will accept an [...] 97.5% for this warning [...].

Function 8: to sound an alarm if the level in the tank drops below 2 000 litres. If the level in the tank [...] to 2 000 litres and the low level [...] there is a 50% chance that the tank will run dry before the tanker arrives, and the station would be out of [...] for about one hour on average under [...] circumstances. This leads the station owner to conclude that he will not accept this multiple failure (level drops below 2 000 litres while low level alarm [...]) more than "once in a hundred [...]". As [...] level alarm of MTED/MMF = 1/100 [...] alarm is being maintained effectively [...].

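Similar logic would be followed to determine availabilities for functions 9, 10 and 11 in the [...]. It would also be used to establish effectiveness measures for the functions of this [...] in the above list. However, for the functions [...], the expectations of the gas station owner can be summarised as follows:

Functional failure    Applies to      Expectation
1-A                   Each pump       [...]
1-B                   Each pump       99.8% availability
1-C                   Each pump       once in 50 years
2-A                   Each pump       once in 1 000 years
2-B                   Each pump       once in 50 000 years
3-A                   Each pump       once in 1 000 years
3-B                   Each pump       once in 1 000 years
3-C                   [...]           once in 1 000 000 years
4                     [...]           once in 500 years
5-A                   [...]           once in 1 year
5-B                   [...]           once in 6 months
5-C                   Tank            once in 100 000 years
6                     [...]           99.5% availability
7                     [...]           97.5% availability
8                     [...]           99% availability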

[...] illustrates several important [...] about the measurement of maintenance effectiveness, as follows:

we are not [...] equipment effectiveness - we are [...] effectiveness. The distinction is important, because [...] emphasis from the [...] to its functions helps [...] maintainers in [...] to focus on what the equipment does rather than what it is.


[...] assets have a surprisingly large number of functions. Each of these functions has a unique set [...]. Before it is [...] maintenance effectiveness, we need to know what all [...] or otherwise in each case. This means that it is not possible to list a single continuity statement for an entire asset, such as "to fail not more than once every two years" or "to last at least eleven years". We need to be [...] about which function must not be lost more than once every two years or must not fail for at least eleven years (or which functional failure must not occur more than once every two years, or which functional failure must not occur before eleven years).

[...] functions often embody [...] functions. As a [...] fail than [...] function must be considered when [...] up maintenance effectiveness measures and [...].
For instance, the primary functions listed for the gasoline system are to pump and to store fuel ([...] 1 and 5 respectively). However, two of the highest expectations of the owner centred around two [...] functional failures: 2-B (a failure which could put him out of business) and 3-C (a failure with serious safety implications).

Multiple Performance Standards and the OEE

If a function embodies multiple performance standards, it is tempting to try to develop a single [...] measure of effectiveness for the entire function. For instance, the primary function of a machine performing a [...] in a manufacturing facility usually incorporates three performance standards, as follows:
it must work at all
it must work at the [...]
it must produce the required quality.
The effectiveness with which it continues to meet each of these expectations is measured by [...]. This [...] a composite measure of the effectiveness with which this machine is fulfilling its [...] function on an [...] basis could be determined by multiplying these three [...] as follows:
overall effectiveness = availability x efficiency x yield.

For instance, the primary function of a [...]: To mill 101 ± 1 workpieces per hour to a [...]. If this machine is out of action [...] is 95%. If it is only able to produce 96 [...] efficiency is 96%. If 2% of its output are [...]. Applying the above [...] an overall effectiveness of 0.95 x 0.98 x 0.96 = 0.894, or 89.4%.

[...] measure is sometimes referred to as 'overall equipment effectiveness', or OEE.
[...] measures of this sort are popular because they allow users to assess maintenance effectiveness at [...]. They also seem to offer a basis for [...] the performance of similar assets [...]. [...] these measures actually suffer from numerous [...]:
the use of three variables in [...] all three have equal weighting. This may not be the case in practice.

For instance, in the milling machine [...] above, the [...] be making a gross profit of 100 on a finished [...]. This means that 1% downtime or 1% loss of efficiency [...] the company one sale per hour - a lost profit of 100 per hour. On the other hand, [...] that the organisation has to write off [...] per hour, worth of work-in-process in addition to 100 lost [...] - a total loss of 300 per hour. Consequently, the machine in the above example [...]
(5 x 100) + (4 x 100) + (2 x 300) = 1 500 per hour
due to downtime, slow running and rejects. However, an identical machine producing the same product might suffer from 4% downtime, run at 98% of its rated speed and produce 4% scrap. In this case the 'overall effectiveness' would be 0.96 x 0.98 x 0.96 = 0.903, or 90.3%. This is [...] a better performance than the first machine. However, this machine is [...]
(4 x 100) + (2 x 100) + (4 x 300) = 1 800 per hour
which is actually a significantly worse [...] than the first machine!
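The comparison between the two machines can be reproduced with a few lines of code. This is an illustrative sketch using only the figures quoted in the example (a loss of 100 per hour for each 1% of downtime or slow running, and 300 per hour for each 1% of rejects); the function names are invented here.

```python
def overall_effectiveness(availability: float, efficiency: float, yield_: float) -> float:
    """Composite measure: availability x efficiency x yield."""
    return availability * efficiency * yield_


def hourly_loss(downtime_pct: float, slow_running_pct: float, reject_pct: float,
                loss_per_pct_lost_output: float = 100,
                loss_per_pct_rejects: float = 300) -> float:
    """Money lost per hour, weighting each kind of loss by what it actually costs."""
    return ((downtime_pct + slow_running_pct) * loss_per_pct_lost_output
            + reject_pct * loss_per_pct_rejects)


# First machine: 5% downtime, 4% slow running, 2% rejects.
oee_first = overall_effectiveness(0.95, 0.96, 0.98)
loss_first = hourly_loss(5, 4, 2)

# Second machine: 4% downtime, 2% slow running, 4% rejects.
oee_second = overall_effectiveness(0.96, 0.98, 0.96)
loss_second = hourly_loss(4, 2, 4)

print(f"first machine:  effectiveness {oee_first:.1%}, loss {loss_first:.0f} per hour")
print(f"second machine: effectiveness {oee_second:.1%}, loss {loss_second:.0f} per hour")
```

The second machine scores higher on the composite measure but loses more money per hour, because the product gives the three variables equal weight while the costs do not.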

It is possible for many assets to operate too fast as well as too slowly. [...] asset would increase the OEE as defined [...]. [...] means that it is possible to obtain an apparent [...] performance [...] the asset to [...] in a failed state.
For instance, a primary performance standard of the [...] machine was that it should produce 101 ± 1 [...] hour. The '± 1' means that if the machine produces more than 102 units [...] hour, it is in a failed state - [...] it starts going faster than a [...] or because [...] heat and damage the [...]. However, if it operates at 103 [...] 102%. This increases the 'overall' [...] at a time when the machine is actually [...].
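A short sketch shows how a composite measure can reward a failed state. It assumes that efficiency is calculated as actual rate divided by the nominal rate of 101 units per hour, which is one plausible reading of the example rather than a stated definition; both helper names are hypothetical.

```python
NOMINAL_RATE = 101  # units per hour
TOLERANCE = 1       # the performance standard is 101 +/- 1 units per hour


def efficiency(actual_rate: float, nominal_rate: float = NOMINAL_RATE) -> float:
    """Rate actually achieved as a fraction of the nominal rate."""
    return actual_rate / nominal_rate


def in_failed_state(actual_rate: float, nominal_rate: float = NOMINAL_RATE,
                    tolerance: float = TOLERANCE) -> bool:
    """True if the rate falls outside the band set by the performance standard."""
    return abs(actual_rate - nominal_rate) > tolerance


rate = 103
print(f"efficiency at {rate} units/hour: {efficiency(rate):.0%}")
print(f"in failed state: {in_failed_state(rate)}")
```

At 103 units per hour the efficiency term exceeds 100%, so the composite score rises even though the machine is outside its performance standard.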



the OEE as defined above only relates to the primary function of any asset. This [...] because as in the case of the [...], every asset - machine tools included - has many more functions than the primary [...], and each of these will have their own [...]. Consequently, the OEE is not a measure of 'overall' effectiveness at [...], but only a measure of the effectiveness with which the primary function of the asset is [...].
finally, for the reasons discussed [...], truly user-oriented maintenance [...] need to turn their attention away from equipment effectiveness towards functional effectiveness. So if measures of this sort must be [...], it is much more accurate to refer to them as measures of 'primary functional effectiveness' rather than 'overall equipment effectiveness'.
The two most important conclusions to emerge [...] are that:

[...] the contribution which maintenance is making to the performance of any asset, the effectiveness with which [...] fulfilled must be measured on an [...] basis. This in turn [...] understanding of all the functions of the asset, [...] understanding of what is meant when it [...]

the ultimate arbiter of effectiveness is the user (whose expectations must in turn be [...]). What users [...] will vary quite legitimately from function to function and from asset to asset, depending on the [...].

14.3 Maintenance Efficiency

As mentioned at the [...] of this [...], [...] measures how well the maintenance function is [...] the resources at its [...]. [...] in which this can be done are generally well [...]. The[...] are only discussed briefly in this part of this [...] for the sake [...]. These measures can be [...] into four [...]: maintenance costs, labour, spares and materials, and planning and control.

Maintenance costs
The costs referred to in [...] the direct costs of maintenance labour, materials and contractors, [...] the indirect [...] associated with poor asset [...]. The latter issues were discussed in Part 3 of this chapter.
[...] the direct cost of maintenance is now the third [...] costs, behind raw materials and either direct production labour or energy. In some cases, it has risen to second or even first place. As [...], controlling these costs has become a top [...].
Some industries offer scope for substantial reductions in direct maintenance costs, [...] those whose processes [...] mature or stable [...] and/or which have a [...] legacy of second [...] thinking embodied in their maintenance [...]. In other industries, especially those which are newly [...] their processes at [...], the sheer volume of maintenance work to be done is often [...] at such a pace that maintenance costs are [...] absolute terms over the [...] the pace and direction of [...] reductions in total maintenance costs.
The most common ways in which maintenance costs are measured and analysed are as follows:
Total cost of maintenance ([...] and [...])
- for the entire facility
- for each business unit
- for each asset or [...]
Maintenance cost per unit of output
Ratio of [...] to labour expenditure.

The cost of maintenance labour typically amounts to between one third and two thirds of total maintenance costs, [...] on the [...] overall wage levels in the [...]. Maintenance labour costs should include [...] on contract labour ([...] is often incorrectly grouped under 'spares and materials' because it is bought out). When considering maintenance labour, it is also wise not to make the common mistake of [...] maintenance work done by operators as a zero cost "because the [...]". [...] operators for this work, the organisation [...] maintenance, and the cost should be [...].


[...] maintenance labour efficiency [...]:
Time recovery ([...] of total time paid for)
[...] ([...] hours and as a [...] of normal hours)
Relative and absolute amounts of time spent on different [...] default actions and modifications, and subsets of [...]
[...] (by number of work orders and estimated hours)
[...] maintenance contractors to expenditure on full-[...]

Spares and materials
Spares and materials usually account for the portion of maintenance expenditure which does not come under the heading of 'labour'. How well they are [...] measured and [...] in the following ways:
[...] on spares and materials (total and per unit of output)
Total value of spares in stock
Stock turns ([...] value of spares and materials in stock divided by the total annual [...] on these [...])
[...] stock items which are in stock [...]
Relative and absolute values [...] of stocks (consumables, [...])

Planning and control

How well maintenance activities are planned and controlled affects all [...] effectiveness and efficiency, from the overall utilisation of maintenance labour to the duration of individual stoppages. [...] measures include:
[...] of total hours
The above hours as a [...]
[...] of the above tasks [...] as planned
Planned hours worked [...] unplanned hours
[...] of jobs for which the time was estimated
[...] of estimates ([...] hours vs actual hours for jobs which [...])

Some of these [...] measures are useful for making immediate decisions or initiating short-term [...] action ([...] budgets, time recovery, schedule completion rates, [...]). [...] more useful for tracking trends and comparing performance with similar facilities in order to plan [...] term remedial action ([...] per unit of output, service levels and ratios in [...]). [...] help in [...] attention on what must be done to ensure that maintenance resources are used as efficiently as possible.
Maintenance efficiency is [...] to measure. The issues which it addresses are usually under the direct control of maintenance managers. For these two reasons, there is often a tendency for these managers to focus too much attention on [...] and not [...] on maintenance effectiveness. This is unfortunate, because the issues discussed under the [...] on the overall physical and financial [...] those discussed under maintenance [...]. [...] maintenance managers [...] direct their attention [...]. [...] chapter explains, the [...] of RCM [...] helps them to do so.

14.4 What RCM Achieves

Figure 1.1 in Chapter 1, reproduced as Figure 14.1, [...] expectations of the maintenance function.

Figure 14.1: Growing expectations of maintenance (time axis from 1940 to 1990)

The use of RCM helps to fulfil all of the Third Generation expectations. The extent to which it does so is summarised in the [...], starting with safety and environmental [...].

Reliability-centred Maintenance

Greater Safety and Environmental Integrity

RCM contributes to improved safety and environmental protection in the following ways:
the [...] review of the safety and environmental implications of every evident failure [...] considering operational issues means that [...] and environmental integrity become - and are seen to become - top maintenance priorities.

from the technical viewpoint, the decision process dictates that failures which could [...] or the environment must be dealt with in some fashion - it simply does not tolerate inaction. As a result, tasks are selected which are [...] to reduce all equipment-related safety or environmental hazards to an acceptable level, if not eliminate them completely. The fact that these two issues are dealt with by groups which include both technical experts and representatives of the 'likely victims' means that they are also dealt with realistically.
the structured [...] to protected systems, especially the concept of the hidden function and the orderly approach to failure-finding, leads to substantial improvements in the maintenance of protective devices. [...] reduces the probability of multiple failures which have serious consequences. (This is perhaps the most powerful single feature of [...] it correctly significantly lowers the risk of doing business.)
involving groups of operators and maintainers directly in the analysis makes them much more sensitive to the real hazards associated with their assets. This makes them less likely to make dangerous mistakes, and more likely to make the right decisions when things do go wrong.
the overall reduction in the number and frequency of routine tasks (especially invasive tasks which upset basically stable systems) reduces the risk of critical failures occurring either while maintenance is under way or shortly after start-up.
This issue is particularly important if we consider that preventive maintenance played a part in two of the three worst accidents in industrial history (Bhopal, Chernobyl and Piper Alpha). One was caused directly by a proactive maintenance intervention which was currently under way (cleaning a tank full of methyl [...]). On Piper Alpha, an unfortunate series of incidents and oversights might not have turned into a catastrophe if a crucial relief valve had not been removed for preventive maintenance at the time.

As mentioned in Part 2 of this chapter, the most common way to track performance in the [...] and environmental [...] is to record the number of incidents which occur, typically [...] of lost-time accidents per million man-hours in the [...] number of excursions (incidents where a standard or regulation [...]).

To provide an indication of what RCM has achieved in the field of safety, Figure 14.2 shows the number of accidents per million take-offs recorded each year in the commercial civil aviation [...] philosophy (excluding accidents caused by sabotage, military action or turbulence). The percentage of these crashes which were caused [...] failure also declined. Much of the improved reliability is of course due to the use of [...] materials and greater redundancy, but most of these improvements [...] in turn by the realisation that maintenance on its [...] in Chapter 12, this shifted attention from a [...] reliance on fixed time overhauls [...] in the 1960's to doing whatever is necessary to avoid or eliminate the consequences of failures, be it maintenance or [...] cornerstone of the RCM philosophy). It also reduced the number of crashes which [...] been caused by inappropriate maintenance interventions.




Figure 14.2: accidents per million take-offs, by year
Source: C A Shifrin: "Aviation Safety [...]". Aviation Week & Space Technology: Vol [...]


Higher Plant Availability and Reliability
[...] on the performance [...] achieving 95% [...] potential than one which is currently only achieving 85%. Nonetheless, if it is [...], RCM achieves significant improvements [...] of the [...].

For instance, the [...] of RCM has contributed to the following:
a 16% increase in the total output of the [...] assets of a 24-hour 7-day milk-processing plant. This improvement was achieved in 6 months, and most of it was attributed to an exhaustive RCM review done during this [...]
[...] dragline in an open-cast coal mine whose availability rose from 86% to [...] in six months
a large holding furnace in a steel mill which achieved 98% availability in its first eighteen months of [...] an expectation of 95%.

[...] of course improved by [...] number and the [...] of unanticipated failures which have [...]. The RCM process [...] to achieve this in the following ways:
[...] consequences of every failure which has not already been dealt with as a [...] with [...] criteria used to assess task [...] ensure that only the most effective tasks are selected to deal with each failure mode.

[...] tasks helps to ensure that potential [...] they become functional failures. This helps reduce operational consequences in three ways:
- problems can be rectified at a time when stopping the machine will have the least effect on operations
- it [...] to ensure that all the resources needed to repair the failure are available before it occurs, which shortens the [...]
- rectification is only carried out when the assets [...] need it, which extends the intervals between corrective interventions. This in turn means that the asset has to be taken out of service less often.
For instance, the example concerning tyres on page 130 shows that the tyres need to be taken out of service 20% less often for [...] maintenance is used instead of scheduled restoration. In this case, the effect on the [...] of the vehicle would be [...] because removing a tyre [...] it with a new one can be done very quickly. However, in cases where the corrective action [...] extensive downtime, the improvement in availability [...] be substantial.

[...] each failure mode to the relevant functional [...], the information worksheet provides a tool [...] diagnosis, which leads in turn to shorter [...].
[...] on on-condition maintenance reduces [...] with a corresponding [...] increase in availability. In addition, [...] list of all the failure modes which are reasonably [...] a dispassionate assessment of the relationship between [...] reveals that there is [...] no reason at all [...] routine overhauls [...]. This leads to a reduction [...] scheduled downtime without a corresponding increase in unscheduled downtime.

For instance, RCM enabled a [...] steelworks to eliminate all fixed-interval overhauls from its [...] increased from 25 000 to 40 000 hours without [...] reliability.


[...] of the above comments, it is often necessary to [...] a shutdown or an overhaul for any of the following reasons:
- to prevent a failure which is genuinely [...]
- to [...] a potential failure
- to [...] a hidden functional failure
- to carry out a modification.
In these cases, the disciplined review of the need for [...] or corrective action that is part of the RCM process leads to shorter shutdown worklists, which leads in turn to shorter shutdowns. Shorter shutdowns [...] and hence more [...] to be completed [...].

short shutdown worklists also lead [...] when the plant is started up [...] after the shutdown, because it [...] disrupted as much. This too leads to an overall increase in [...].

as explained on page [...], RCM provides an [...] participate in the process to learn [...] how to operate and maintain new plant. This enables them to avoid many of the errors which would otherwise be made as a result of the [...] process, and to ensure that the plant is maintained [...] from the outset.
At least four organisations with whom the author has worked in the UK and USA achieved what each described as 'the fastest and smoothest start-up in the company's history' after [...] RCM to new installations. In each [...], RCM was applied in the final stages of [...] concerned are in the automotive, steel, paper and [...].

the elimination [...] and hence of superfluous failures. As mentioned [...], it is not unusual to find that between 5% and 20% of the [...] of a [...] plant are [...] superfluous, but can still [...] the plant when they fail. Eliminating such [...] leads to [...] increase in reliability.

[...] who know the [...] best to carry out [...] of failure modes, it becomes possible to identify and eliminate [...] which otherwise seem to defy detection, and to take appropriate action.
Improved Product Quality
[...] quality issues as shown on pages 48 and [...] much to improve the yield of automated processes.
For instance, an [...] used RCM to reduce scrap rates from 4% (4 000 parts per million) to 50 ppm.

Greater Maintenance Efficiency (Cost-effectiveness)

control the rate

of mainte-

Less routine ,naintenance:

Wherever RCM has been correctly applied to an existing preventive maintenance system, it has led to a reduction of 40% to 70% in the perceived routine maintenance workload. This reduction is partly due to a reduction in the number of tasks, but mainly due to an overall increase in the intervals between tasks. It also suggests that if RCM is used to develop maintenance programs for new equipment, or for equipment which is currently not subject to a formal preventive maintenance program, the routine workload would be 40% to 70% lower than if the maintenance program were developed by any other means.
Note that in this context, 'routine' or 'scheduled' maintenance means any work undertaken on a cyclic basis, be
it the
of the
on a pressure gauge, a monthly vibration reading, an annual functional check of a
switch or a
overhaul. In other words, it covers scheduled on-condition tasks, scheduled restoration tasks, scheduled discard tasks and scheduled failure-finding tasks.

For example, RCM has led to the following reductions in routine maintenance workloads when applied to existing systems:



a 50% reduction in the routine maintenance workload of

a 50% reduction in the routine maintenance
transformers in an electrical distribution system
an 85% reduction in the routine maintenance
of a
system on an oil platform
a 62% reduction in the number of low frequency tasks which needed to be done on a machining line in an automotive

Note that the reductions mentioned above are

In many PM
of the schedules issued by
the planning office are
is often as low as
and sometimes even lower. In these cases, a 70% reduction in the routine workload will only bring what is issued into line with what is actually done, which means that there will be no reduction in actual workloads.
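The arithmetic behind this point can be sketched in a few lines of Python. The figures are illustrative assumptions, not data from the text: 100 hours of routine work issued per period, a hypothetical 30% schedule completion rate, and a 70% reduction in the issued workload. The function name `workload_after_review` is also hypothetical.

```python
def workload_after_review(issued_hours, completion_rate, reduction):
    """Compare work actually done before a review with work issued after it.

    issued_hours    -- routine work issued by the planning office (hours per period)
    completion_rate -- fraction of the issued work actually completed (0 to 1)
    reduction       -- fractional cut in issued work after the review (0 to 1)
    """
    done_before = issued_hours * completion_rate
    issued_after = issued_hours * (1 - reduction)
    # Change in the actual workload if everything issued after the review is done
    actual_change = issued_after - done_before
    return done_before, issued_after, actual_change


# Hypothetical figures: 100 h issued, 30% completed, 70% reduction
done_before, issued_after, change = workload_after_review(100.0, 0.30, 0.70)
print(f"done before review:        {done_before:.1f} h")
print(f"issued after review:       {issued_after:.1f} h")
print(f"change in actual workload: {change:+.1f} h")
```

Under these assumed figures the work issued after the review roughly equals the work that was being done before it, so the perceived reduction does not translate into a reduction in hours actually worked.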
Ironically, the reason why so many traditional
suffer from such low schedule completion
that much of the routine
a third
done in any
of the prescribed work
out of control. A zero-based RCM review does much to
like these back under control.
Applying RCM to maintenance contracts leads to
in two areas,
Firstly, a clear
of failure consequences allows
response times more
different response times
Since rapid response is often the most
of contract maintenance, judicious fine-tuning in this area can lead to substantial
Secondly, the detailed
tasks enables
reduce both the content
of maintenance contracts, usually by the same amount
any other
schedules which have been prepared on a traditional basis. This
corresponding savings in contract costs.
Less need to use
If field technicians employed by equipment suppliers attend RCM
which takes
on page
the
of the maintainers
leads to a quantum jump in the
users to solve difficult problems on their own. This leads to
dramatic drop in the need to call for



acquiring new maintenance technology

The criteria used to decide whether a proactive task is technically feasible and worth doing apply directly to the acquisition of condition monitoring equipment. If these criteria are applied dispassionately to such acquisitions, a number of expensive mistakes can be avoided.
items listed under 'improved operating performance'.
Most of the items listed in the previous section of this chapter also improve maintenance cost-effectiveness. How they do so is summarised below:
quicker failure diagnosis means that less time is spent on each repair

detecting potential failures before they become functional failures not only means that repairs can be planned properly and hence carried out more efficiently, but it also reduces the possibility of the expensive secondary damage which could be caused by the functional failure
the reduction or elimination of overhauls together with shorter worklists for the shutdowns which are necessary can lead to very substantial savings in expenditure on parts and labour (usually contract labour)
the elimination of superfluous plant also means the elimination of the need either to prevent it from failing in a way which interferes with production, or to repair it when it does fail

how the plant should be operated together with the identification of chronic failures leads to a reduction in the number and severity
of failures, which leads to a reduction in the amount of money which
must be spent on

The most
of this phenomenon encountered by the author concerned a single failure mode caused by incorrect machine adjustment (operator error) in a large process plant. It was identified during an RCM review and was thought to have cost the organisation using the asset just under US$1 million in repair costs alone over a period of eight years. It was eliminated by asking the operators to adjust the machine in a slightly different way.

Longer Useful Life of Expensive Items

By ensuring that each asset receives the bare minimum of essential maintenance (in other words, the amount of maintenance needed to ensure that what it can do stays ahead of what the users want it to do), the RCM process does much to help ensure that just about any asset can be made to last as long as its basic supporting structure remains intact and spares remain available.



maximum useful life of individual components

maintenance in preference to other

._,.,,..,,.. U H..,....,,,,

Greater motivation of individuals

to improve the motivation of the
who are involved in
the review process in a number
a clearer
of the functions of the asset and of what they must do to
greatly enhances their
and hence their confidence.
Secondly, a clear understanding of the issues which are beyond the control of each individual - in other words, of the limits of what they can reasonably be expected to achieve - enables them to work more comfortably within those limits. (For instance,
supervisors automatically held responsible
as often happens
This enables them and those about them to deal with failures more calmly and rationally than might otherwise be the case.
Thirdly, the knowledge that each group member played a part in formulating goals, in deciding what should be done to achieve them and in deciding who should do it leads to a sense of ownership.
This combination of competence, comfort and ownership all mean that the people concerned are much more likely to want to do the right job right the first time.

Better Teamwork
In a curious way, teamwork seems to have become both a means to an end
and an end in itself in many
The ways in which the
structured RCM approach to maintenance problem
and decision
making contributes to teambuilding were summarised on page 268. Not only does this approach foster teamwork within the review groups themselves, but it also improves communication and co-operation between:
production or
and the maintenance function
management, supervisors, technicians and
vendors, users and maintainers.

A Maintenance Database
The RCM Information and Decision Worksheets provide a number of other benefits, as follows:



adapting to changing circumstances: the RCM database makes it possible to track the reason for every maintenance task right back to the
if any aspect of the operating context changes, it is easy to identify the tasks which
are affected and to revise them accordingly.
examples of such changes are new environmental
changes in the
cost structure which affect the evaluation of operational consequences,

the tasks which are not affected by
means that time is not wasted
these tasks.
In the case
often mean that the whole maintenance program has to be reviewed in
As often as not, this is seen as too big an undertaking, so
as a whole gradually falls into disuse.
an audit trail: Part
mentioned that rather than prescribing specific tasks at specific frequencies, more and more modern safety legislation is demanding that the users of assets must be able to produce evidence that their maintenance programs are built on defensible foundations. The RCM worksheets provide this evidence - the audit trail - in a logical and easily understood form.
more accurate drawings and manuals: the RCM process usually means that manuals and drawings are read in a
"what does it do?" instead of "what is it?". This leads them to
number of errors which may have gone unnoticed in as-built drawings (especially process and instrumentation drawings). This happens most often if the operators
work with the machines are included in the review teams.
reducing the effects of staff turnover: all organisations suffer when people leave or retire and take their knowledge and experience with them.
information in the RCM database, the organisation becomes much less vulnerable to these
For example, a major automotive manufacturer was faced with a situation where
was to be relocated and most of the workforce had chosen not to move to the new site. However, by
RCM to analyse the
with the
equipment before it was moved, the company was able to transfer much of the knowledge and experience of the departing workers to the people who were recruited to operate and maintain the equipment in its new location.


the introduction of expert systems: the information on the Information Worksheet in particular
an excellent foundation for an
this worksheet
especially if the information is
computerised database and sorted

An Integrative Framework
As mentioned in Chapter 1, all of the issues discussed above are
the mainstream of maintenance
and many are
feature of RCM
framework for tackling
all of them at once, and
in the
everyone who has anything to do with the
15 A Brief History of RCM

15.1 The Experience of The Airlines

The United States Department of Defense commissioned United Airlines to prepare a report on the processes used by the civil aviation industry to develop maintenance programs for aircraft. The resulting report was entitled Reliability-centered Maintenance.
Before reviewing the application of RCM in other sectors, the following paragraphs summarise the history of RCM up to the time of publication of the report by Nowlan and Heap (1978). The italicised paragraphs quote extracts from their report.
The Traditional Approach to Preventive Maintenance

The traditional approach to scheduled maintenance programs was based on the concept that every item on a piece of complex equipment has a 'right age' at which complete overhaul is necessary to ensure safety and reliability. Through the years, however, it was discovered that many types of failures could not be prevented or effectively reduced by such maintenance activities, no matter how intensively they were performed. In response to this problem, airplane designers began to develop design features that mitigated failure consequences - that is, they learned how to design airplanes that were 'failure tolerant'. Practices such as the replication of system functions, the use of multiple engines and the design of damage tolerant structures greatly weakened the relationship between safety and reliability, although this relationship has not been eliminated altogether.
Nevertheless, there was still a question concerning the relationship of preventive maintenance to reliability. By the late 1950's, the size of the commercial airline fleet had grown to the point at which there was ample data for study, and the cost of maintenance activities had become sufficiently high to warrant a searching look at the actual results of existing practices. At the same time the Federal Aviation Agency, which was


group led to the establishment

Reliability Program, described in the introduction
ument as follows:

area of maintenance in which it was most interested.

that must
a great deal was learned about the
Two discoveries were
maintenance to be
Scheduled overhaul has little effect on the overall reliability of a complex item unless the item has a dominant failure mode
There are many items for which there is no effective form of scheduled maintenance

The History of RCM Analysis

an attempt to
various reliability programs
cable approach to the
preventive maintenance programs. A
and in
was devised in
the AIAA
June 1967 a paper on its use
craft Design and Operations
embodied in a handbook on maintenance evaluation and
747 maintenance program has been successful.
led
later in a second document, MSG-2:
which were
Airline Manufacturer Maintenance Program Planning Document.
used to
the scheduled maintenance
These programs
the Lockheed 1011 and the
have also been
MSG-2 has also been applied to tactical military aircraft;
were for aircraft such as the Lockheed
was the
the Airbus Industrie A-300 and the Concorde.
outlined in MSG-1 and MSG-2 was
a scheduled-maintenance program that assured the maximum
the
and also provided them at the lowest cost. As an
under traditional maintenance policies
achieved with this
the Douglas DC-8 airplane
339 items, in contrast to seven such items in the DC-10 program. One
to overhaul limits in the later
Elimination of scheduled
turbine propulsion
led to major reductions in labour and materials
reduced the
this was a respectable
then cost more than US$1 million
the Boeing
under the MSG-1
expended only 66 000 manhours
a basic interval
000 hours for the first
Under traditional maintenance policies it took an expenditure of more than 4 million manhours to arrive at
the smaller and
the same structural
DC-8. Cost reductions of this magnitude are of obvious importance
maintaining large

Such cost reductions are achieved with no decrease in reliability. On the contrary, a better understanding of the failure process in complex equipment has actually improved reliability by making it possible to direct preventive tasks at specific evidence of potential failures.
Although the MSG-1 and MSG-2 documents revolutionised the

sequences that determine

purpose. The nr;,ntOfn
the role

as well as the need to

led to

centred Maintenance.
The Nowlan and Heap report provided the basis for MSG3, which was issued in 1980 and revised in 1988 and 1993. MSG3 remains to this day the process used to develop and refine maintenance programs for all major types of civil aircraft.

15.2 RCM in Other Sectors

the US
Since 1978, RCM has been applied
in the United States commenced a series
three nuclear power
of pilot applications
stitute in San
by nuclear facilities in North America, by EdF in France and now to an increasing extent by nuclear facilities throughout the world.
The author and his associates
with the
sectors in the
1980s. Since
RCM in the mining and
then, we have worked on more than 500 sites in countries
all six continents.



The projects

from in-company awareness training for

and maintenance managers to the full-scale application
of RCM to all the equipment on a site. The sectors in which projects have
been carried out in the
years prior to the date of publication of this
edition of this book include the following (sectors where we have worked
on more than 5 sites are shown in italics):

smelting,
assembly, components, tyres)
(nickel, aluminium, platinum)
base metal
breweries and soft drinks
buildings and building services
chemicals (industrial and household)
cosmetic manufacture
electric utilities
steam, gas and nuclear
canned fruit and vegetables,
meat products, milk)
gas distribution
iron mining
mass housing
undertakings
navies and air
nuclear facilities
cellulose, fibreglass
postal services
pulp & paper (tissue paper, fine paper,
water distribution
The fact that RCM has been
at all
levels, and has enabled users to achieve some remarkable successes in all these countries,
that it is much less affected by cultural differences than many other
this nature.
listed opposite all became aware of and started
applying RCM at different
so implementation is more advanced on some sites than others. The overall situation can be summarised as follows:
in about 25% of the cases, senior managers have
ary training
about 10% of the organisations have applied RCM to all of the equipment on at least one site
the remaining 65% have reviewed some of their
and most
plan to continue
RCM to
most if not all of their
in each case.
However, Chapter 14 provides a
summary of the results achieved
together with a brief review of some of the

15.3 Why RCM 2?

There are now a large number of interpretations and variations of the RCM decision logic in existence.
four account for the majority of the RCM work currently
the planet. The first is shown on pages 91 and 92 of the report by Nowlan and Heap (1978). This is the version first used by the
and it is also the
most other RCM
The second version of the decision
is the official MSG3 version currently used by the civil aviation industry. It is shown in the
System/Powerplant Logic
on pages 4 and 5 of the Maintenance Program
Development Document
the Air Transport
Association of America 199\ The third, which is similar to
Mil-Std-2173 used
the US Naval Air
The fourth is RCM2, which is the
of this book.



the RCM frame-

Dealt with



Not handled as a



failure could
Asks the multiple failure could
the affect
at the foot of the
I hidden function column
failure is evident. Yes and no
answers lead to two columns:
and leads to the same
'Yes' defaults to
defaults as MSG3
to desirable


Encourages users to select the first

task without

The environment Not considered


criteria for deciding whether failureis technically feasible and




default action if no
can be found: does not specify



Other l ubricaas individual

is a


about lubrication is
at the head of every task

selection criteria
users to consider tasks
from all categories before making a
Not considered

As for Nowlan and

Considered in question E

Appendix 1:
Asset Hierarchies and
Functional Block Diagrams
Plant registers and asset hierarchies
Most organisations own, or at least use, hundreds if not thousands of physical assets. These assets range in size from small pumps to steel rolling mills, aircraft carriers or office blocks. They may be concentrated on one small site or spread over thousands of square kilometres. Some of these assets will be mobile, others will be fixed.
Before any organisation can apply RCM - a process used to determine what must be done to ensure that any physical asset continues to do whatever its users want it to do - it must know what these assets are and where they are. In all but the smallest and simplest facilities, this means that a list of all the plant, equipment and buildings owned or used by the organisation, and
sort, must be
This list
designed in a way which makes it possible to keep track of the assets that have been analysed, those that have yet to be analysed and those that are not
is also needed for other purposes such
as the planning and scheduling of routine and non-routine maintenance
history recording and maintenance cost allocation. As a result, it should be set up and the associated
in such a way that it can be used for all these purposes.
Chapter 4
that RCM can be
at almost any level in a
hierarchy. It also suggested that the most appropriate level is the level which leads to a reasonably
number of failure modes per function.
'Appropriate' levels become much easier to
if the
as a hierarchy which makes it possible to identify
any asset at any level
down to and including
components (line replaceable
or even spare
The truck on page 85 provides one
A1.1 overleaf shows another
boiler house in a food
FoodCo Inc

Factory 3


A list drawn up for the assets in this hierarchy, with a hierarchical numbering system for each asset, might appear as shown in Figure A1.2.








Figure A1.1: Asset Hierarchy

FoodCo Inc
  Factory 1
  Factory 2
  Factory 3
    Preparation Dept
    Packing Dept
    Site Services
      Power Supply
      Compressed Air
      Water System
      Boiler House
        Coal Handling System
        Boiler No 1
          Shell & Tubes
          Feed Pump
          FD Fan
        Boiler No 2
        Ash Handling System
    Maintenance Dept
  Head Office
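
A hierarchical numbering scheme of this kind can be sketched in a few lines of Python. This is an illustration only: the two-digit, hyphen-separated code format, the `Asset` class and the `number_assets` helper are assumptions made for the example, not the numbering system used in the figure, and the parent-child relationships for the boiler house are inferred from the list above.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    """One node in an asset hierarchy (hypothetical structure, for illustration)."""
    name: str
    children: list["Asset"] = field(default_factory=list)


def number_assets(asset, prefix="01"):
    """Yield (code, name) pairs; each child's code embeds its parent's code."""
    yield prefix, asset.name
    for position, child in enumerate(asset.children, start=1):
        yield from number_assets(child, f"{prefix}-{position:02d}")


# Assumed grouping of the boiler house assets
boiler_house = Asset("Boiler House", [
    Asset("Coal Handling System"),
    Asset("Boiler No 1", [Asset("Shell & Tubes"), Asset("Feed Pump"), Asset("FD Fan")]),
    Asset("Boiler No 2"),
    Asset("Ash Handling System"),
])

for code, name in number_assets(boiler_house):
    depth = code.count("-")  # number of levels below the starting asset
    print(f"{code:<12}{'  ' * depth}{name}")
```

Because every code contains the code of its parent, all the assets below a chosen level can be selected by matching on a prefix, which is one way a register could support analysis at whatever level is judged appropriate.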

Functional hierarchies and functional block diagrams
oo :s s1101e to /'l e> '<1,:,. ! ,n
>"r'' "" '

be done for the asset

Variations of the functional
are used to show
between functions at the same level. These are
, and
can be used to
known as functional block
in a number of different ways. For instance,
defines a functional block diagram as a top level representation
major functions that the
. On the other hand, Blanchard
the term 'functional flow diagram'
at many different levels. Smith tends to
use the diagrams to show the movement of
energy and control
signals through and between different elements of a system, whereas Blanchard &
use them to


A functional block diagram for the boiler house in

plant to the two boilers, and the
flows from the coal
handling plant. It also shows what materials and services flow
boundaries. This is illustrated in Figure A1.4 on
for one of the boilers. A more complex
a more detailed functional block
be used to show what control
indication
signals pass across the system boundaries.

Functional hierarchies and functional

of the equipment
functions and
have to
is capable of fulfilling each functional
As mentioned in Chapter 2, functional block diagrams can be of some help when RCM is applied to facilities where the processes or the relationships between them are not intuitively obvious. These tend to be monolithic structures such as naval vessels, combat aircraft and the less accessible
of nuclear facilities.
However, in most other industrial
food and automotive manufacturing
is usually no need to draw up functional block diagrams
on an RCM
for the




Figure A1.3

Asset Hierarchy



Figure A1.3 (continued)

. . . . . with Corresponding Functional Hierarchy



Figure A1.4: Functional Block Diagrams

in most cases, the relationships between different processes are usually well enough understood by the participants in RCM review groups to make these diagrams unnecessary.

For example, the boiler house operators and maintainers would be fully aware of the fact that coal, water and air go into the boiler at one end and that steam, flue gas and ash (and occasionally dirty water) come out of the other. Most of them would probably regard the notion that these simple facts should be drawn in a diagram at best as a waste of time. As discussed at length in Chapter 2, the objective is not to identify the simple and usually obvious relationships between processes, but to define the desired performance relative to the initial capability for all the key elements of each system, and then to define what must be done to ensure that the system continues to deliver the desired performance.

In cases of uncertainty, equipment is usually
it to be
easy to go and see what goes
the required information
can be extracted from a set
and instrumentation
fact, a good set of P&IDs nearly always eliminates the need for functional block diagrams entirely as a precursor to the application of RCM. In such cases, block diagrams add significantly to the effort and cost of the RCM process while
to its value.
functional block diagrams only show primary functions at each level, so they only tell part of the story. (For instance, all the assets at the fourth level and below in Figure A1.1 have a containment function. This cannot be shown in a functional block diagram without making it unmanageably cumbersome.)


as explained in Part 3 of Chapter 2, the principal functions of the assets in the hierarchy above the level chosen for the analysis should be summarised in suitably worded operating context statements. These statements are written only for those assets which are relevant to the
in question. As a result, time is not wasted on the functions of assets which are not germane to the asset under consideration.
is far more detailed than a crude, single-statement-per-asset

assets at or below the level chosen for the analysis are dealt with as part of the normal RCM process. Part 7 of Chapter 4 showed that the functions of lower level assets are either listed as functions in the main analysis, or dealt with as failure modes, or, in unusually complex subsystems, broken out for separate analysis.
For instance, the example of the truck shown in Figure 4.11 showed how a blockage in the fuel line could simply be treated as a failure mode of either the engine or the drive system, without needing a separate function statement for the fuel system or the fuel line.

(In the author's experience, functional block diagrams tend to be of most value to outsiders seeking to apply RCM on behalf of end-users. Because they are outsiders, they need these diagrams - prepared at the expense of the owners of the assets - to improve their understanding of the processes which they are about to analyse. The best way to avoid this expense is not to employ outsiders as facilitators in the first place, but rather to train people who have a reasonable first-hand knowledge of the plant as RCM facilitators.)



System boundaries
RCM to any asset or
it is of course important to
to be analysed
and where it ends.
If a comprehensive asset hierarchy has been drawn up and a decision taken to analyse a particular asset at a particular level, then the
usually automatically encompasses all the assets below that system in the asset hierarchy. The only exceptions are subsystems which are judged to be so
that they will not be
at all, or very complex
which are set aside for separate analysis.
Care is needed with control loops which consist of a sensor in one system which sends a signal to a second system, which in turn activates an actuator in a third. Chapter 4 explained that this issue can often be dealt with either by conducting the analysis at a high enough level to ensure that the 'system' encompasses the entire loop, or by
(after the controlled systems have been analysed). However, sometimes this is not practical, in which case a decision must be made as to which
will encompass the control loop in its entirety.
Care is also needed to ensure that assets or
right on the
boundaries do not 'fall between the cracks'. This applies
items like valves and
It is wise not to be too
about boundary
because as understanding grows during the RCM process, perceptions about what should or should not be incorporated in the analysis frequently change.
This means that boundaries may need to be extended to incorporate some
others may be dropped and others which are included initially may be set aside for later analysis.
of rigid boundary definitions tend to be
external contractors
to apply RCM on behalf of end-users, because boundaries must be defined precisely in order to define the commercial scope of the contracts. The fact that the analysis is the subject of a formal contract means that boundaries have to be defined much more precisely - and much more rigidly - than is necessary from a technical point of view. Contracts of this type then either have to be renegotiated every time a boundary needs to be changed, or the boundary is not moved when it should be,
in a suboptimal
The best way to avoid the time and cost associated with these commercial manoeuvrings is to avoid contracting out this
of maintenance policy formulation

Appendix 2:
Human Error
Chapter 4 mentioned that a great many equipment failures are caused by 'human error'. It went on to mention that if a human error is considered to be a credible reason why a functional failure could occur, then that error should be included in the FMEA.
human error is an
in its own right. The purpose of this appendix is to provide a brief summary of the
of human error, and to suggest how they might be dealt with within the framework of RCM.
Principal Categories of Human Error
When considering the interaction between
Blanchard et al
group the main factors under four
anthropometric factors
human sensory factors
physiological factors
psychological factors.
every 'human error' can be traced to a failure or a problem which has occurred in at least one
we review them briefly in the first part of this appendix, before
in more detail at the fourth

Anthropometric factors
Anthropometric factors are those which relate to the size and/or strength of the operator or maintainer. Errors occur because a person (or part of a person, such as a hand or arm):
- simply cannot fit into the space available to do
- cannot reach
- is not strong enough to lift or move something.
If a failure is
or is reasonably likely to occur for any of these reasons, it is highly unlikely that a proactive maintenance task will be found to deal with it. Note also that if a human error occurs for one of these reasons, the human error is not the root cause. The failure mode is
and the
human error is a failure effect

If the consequences are such that something must be done about a failure which is
for anthropometric reasons, the only viable course of action is likely to be redesign. This will involve reconfiguring the asset in such a way that it becomes more accessible or easier to move. In this context, Figure A2.1 shows some dimensions which are considered by the US Navy to be adequate for reasonable human access in confined spaces.

Figure A2.1: Where people fit
(NAVSHIPS 94234, Maintainability Design Criteria Handbook for Designers of Shipboard Electronic Equipment. US Navy, Washington DC)


Human sensory factors

concern the ease with which
feel and even smell what is
this tends to apply to the
trol consoles.
the nooks and crannies
noise levels also affect the
maintainers to discern what is happening
that if errors are
The term 'physiological factors' refers to environmental stresses which affect human
The stresses include
or low temperatures,
festations of (human)
concerned will make a slip, lapse or mistake. (These three terms are defined in the next section of this appendix.)
If errors occur or are thought to be
for any of these reasons, the human is once again not the root cause, but the error is the effect of some other failure. Again, if the consequences warrant it, the
to be some form
of the
Either the
environment can be changed in such a way that the
are reduced (for instance,
protection), or
a chance to recover
or more
timed rest breaks).
Another environmental stress factor is
climate. While this does not necessarily

what RCM can do is alleviate if not eliminate the hostile
which so often exists between maintenance and
on page 268. This makes
less inclined to blame each other for errors, and more inclined to find solutions.

The three sets of factors discussed so far all relate to external phenomena which cause the human to make an error. As a
they are
easy to
and to deal with (although doing so may sometimes be
A far more
of errors are
those which find their roots in the psyches of the humans themselves. As a result, these
factors are discussed in more detail in the next section of this appendix.
Psychological errors
Reason (1991) divides the
of human error into those
which are unintended and those which are intended. An unintended error is one which occurs when someone does a task which he or she should be
but does it
An intended error
the job
but what they

Figure A2.2: Categories of psychological errors

Unintended errors are subdivided into slips and lapses, while intended errors are subdivided into mistakes and violations. These categories are illustrated in Figure A2.2 and are discussed in the following paragraphs.
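
The classification just described can be written down as a small lookup table. This is a sketch for illustration only: the names `ErrorType`, `INTENDED` and `is_skill_based` are hypothetical and do not come from the source, and the table simply encodes the split between unintended errors (slips and lapses) and intended errors (mistakes and violations).

```python
from enum import Enum


class ErrorType(Enum):
    SLIP = "slip"
    LAPSE = "lapse"
    MISTAKE = "mistake"
    VIOLATION = "violation"


# Whether each category counts as an intended or an unintended error
INTENDED = {
    ErrorType.SLIP: False,
    ErrorType.LAPSE: False,
    ErrorType.MISTAKE: True,
    ErrorType.VIOLATION: True,
}


def is_skill_based(error):
    """Slips and lapses (the unintended errors) are the skill-based errors."""
    return not INTENDED[error]


for error in ErrorType:
    kind = "intended" if INTENDED[error] else "unintended"
    print(f"{error.value:<10}{kind}")
```

A table like this could be attached to failure modes recorded in an FMEA so that errors of different kinds are routed to different remedies, although the text itself does not prescribe any such mechanism.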
Slips and lapses
are also known as skill-based errors.
somebody who is fully
to do
it correctly many times in the
when somebody does
wires a motor
when someone misses out a
in a sequence of activities
instance, if a mechanic leaves a tool behind after
in a machine or
to fit a

preoccupied or simply 'absent minded'. As a
predictable, although their likelihood increases if the person is
in a physically hostile environment, or if the task is
if the environment is
and the task is
simple, then this
of human errors is
it is fair to describe the error as the root cause of failure.
The possibility of a
tors and maintainers are involved directly
FMEA). This leaves them
the effects and consequences
which in turn results
motivation to do the job 'right first time'. This
where the consequences of failure are
to be most serious.
Another approach to
based on the
assumption that if something can be installed
it will be. The
remedy is to go back to the
board and:
be assembled in the
correct sequence

individual components in such a way that

installed the
way round and in the right
This is the essence of the Japanese
Ideally this philosophy should be applied to
assets, because it
than retrofitted to
than to modify out bad


Mistakes 1: Rule-based mistakes

Rule-based mistakes occur when people believe that they are following
the correct course of action when doing a task (in other words, applying
a 'rule'), but in fact the course of action is inappropriate. Rule-based
mistakes are further subdivided into misapplication of a good rule and
application of a bad rule.
In the first case, under a given set of conditions, a person selects a course
of action which seems appropriate, usually because it has been successful
with similar conditions in the past - hence the term 'good rule'.
However, some subtle variations on this occasion mean that the course of
action, although undertaken deliberately, is wrong.

For instance, a protected system might be set up in such a way that excessive
pressure should cause an alarm to sound and a warning light to illuminate. However,
a situation might arise where the alarm has failed, the pressure increases and
the light comes on. The absence of the alarm may lead the operator to believe that
the warning light on its own is only a false alarm, especially if it has a history of
spurious failures. In this case, the operator may choose to take no action until the
light is
a course of action which has been appropriate in the past. On
this occasion however, it is not the right thing to do.

The application of a bad rule means just what it says. The normally chosen
or prescribed course of action is just plain wrong.
A classic example of a bad rule is a maintenance program which schedules items
for fixed interval overhauls in order to deal with failure modes which conform to
failure pattern E or F (Figure 1.5 or 12.1). In the case of Pattern F especially,
an action designed to improve reliability will in fact make it worse, by upsetting a
stable system and inducing infant mortality.

the 'root cause' of the failure is the rule itself or the process
by which it is selected. If the rule is promulgated or selected by someone
other than the person who performs the task - in other words, if the person
doing the task is only following orders - then the mistake is really the effect
of another failure.
The RCM process helps to reduce the possibility of misapplying good
rules in two ways:
the thorough analysis of failure
especially what could happen
if a hidden function is in a failed state when it is needed, means that people
are less likely to jump to inappropriate conclusions when the situation
does arise (especially if they have been involved in the RCM process)
attention on the functions and maintenance of protective
the RCM process greatly reduces the probability that these
devices will be in a failed state in the first place.



The chances of bad habits
are also reduced if care is taken
rise to
the FMEA to
failure modes which
to reduce them to
consequences of a false
cases where the
and/or the
alarm warrant it, the most
entails
bad rules because the
to reduce
whole RCM process is all about defining the most
'rules' for
maintaining any asset
care must be taken to ensure that the rules of RCM itself are
not applied badly. This is
application of RCM

Knowledge-based mistakes occur when someone is confronted with a
situation which has not occurred before and which has not been
In situations like
ted (in other
one for which there are no
to make a decision about an appropriate
and a mistake occurs if this decision is wrong.
r..-,.,.....,,,,,a the author has found that a common ,,,,.,.,'"'''"'"'' which occurs
in this context is a belief on
managers and
if a crisis occurs late at
"I know, therefore my company knows". In
Vlr'>A\l< ! I P,i ,'tt."
when all the
less if it is not in the mind of the person who has to take the first
deal with the crisis.
that the first and most obvious way to avoid knowledge-based
mistakes is to improve the
of the
who have to make the
decisions. In most cases, these are the
and maintainers.
tors and maintainers are
to make appropriate decisions more often
if they clearly understand how the
what can go
wrong (functional failures and failure
of each
failure
effects). As mentioned several times in
2, 13 and

so that there is less to know



minimise novelty, because new and alien technologies put people at the
bottom of the learning curve, where mistakes are most likely to happen
avoid tight coupling. This means designing systems in such a way that
if failures do occur, consequences develop slowly enough
to give people
time to think and hence more opportunity to make the right decisions.

A violation occurs when someone knowingly and deliberately commits
an error. Violations fall into three categories:
routine violations. For instance, when people make a habit of not wearing items of protective
(such as hard hats) despite rules which
clearly state that they should

exceptional violations. For instance, if someone who usually wears a

hard hat knowingly rushes outside without the hat on "because they
couldn't find it and didn't have time to look for it"

This occurs when someone maliciously causes a failure.

for routine and exceptional violations usually consists of appropriate enforcement of the rules by management. However, once again,
involvement in the RCM process gives
people a clearer understanding of
the need for safety procedures and the risks they are running if they violate
them. The
is beyond the scope of this book.

The most important conclusions to emerge from this appendix are that:
not all human errors are necessarily the fault of the person who made
the error. In many cases, the error is either forced by external circumstances
or by inappropriate rules. So if blame is to be allocated for any
error, care must be taken to identify the real source
human error is at least as common a reason why equipment fails to do
what its users want it to do as deterioration, if not more so. As a result,
it should be dealt with as part of the RCM process, either as a failure
mode when it is a root cause, or as a failure effect
when it consists of
inappropriate responses to other failures
in the industrial context, it is only possible to come to grips with human
errors if the people involved in committing the errors are involved
directly in
them, and developing appropriate solutions.

Appendix 3:
A Continuum of Risk
that it
be possible to
a schedule of
risks and economic risks in one
acceptable risks which combines
continuum. It .,, ..... ,.,,..,...,"' ..,., ..... that this
be made possible
5.2 and 5.14 in some way.
5.14, repeated
A3.1, showed what an
sation might decide that it can
for one event that has
economic consequences only.


Figure A3.1: Acceptability of economic risk

Figure 5.2
one individual might be
prepared to tolerate in a
situation from
any event which could
prove fatal in that situation,
A3. l .
as summarised in
Figure A3.2: Acceptability of fatal risk

In fact, these two charts cannot be combined as they

event while
A3.1 is based on the probability of a
picts what one individual might consider to be tolerable for any event.
went on to show
to the latter,
However, with
that it is possible to use what one individual tolerates from
event in
to each
situation as a basis for
event which could
him or her at risk in that
The first step is to convert what one person tolerates to an
site. In other words, if I tolerate a probability of 1 in 100 000
work in any one year and I have 1 000 co-workers who all
then we all
every 100 years and that person may be me, and it may

was to translate the probability which myself and my co-workers are
tolerate that any one of us
be killed by any
event at work into a tolerable
for each
failure) which could kill someone.
previous example, the probability that any
one of my 1 000 co-workers will be killed in any one year is 1 in 100 (assuming
that everyone on
the site
faces the same
Furthermore, if the
activities carried
(say) 10 000 events which could kill someone,
then the probability that each
event could kill one person must be reduced
to 10^-6. This means that the probability
of an
reduced to 10^-7, while the probability
event that has a 1 in 10 chance of
person must be reduced to 10^-5. On a site
that is divided into several
and where
each area is further divided into several sec-
risk could be carried out in stages, as shown

Figure A3.3: From whole site to one event
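
The arithmetic in the example above can be retraced with a short sketch. The function name and its parameters are illustrative assumptions rather than anything defined in this appendix, and the even split of the site-wide risk across all lethal events is the simplification used in the example itself.

```python
def tolerable_event_probability(individual_risk, people_on_site, lethal_events,
                                deaths_per_event=1, chance_of_death=1.0):
    """Annual probability tolerated for one event (hypothetical helper).

    individual_risk   -- annual probability of death one person tolerates
    people_on_site    -- number of people who all tolerate the same risk
    lethal_events     -- number of events on the site which could kill someone
    deaths_per_event  -- people the event is likely to kill
    chance_of_death   -- chance that the event kills anyone at all
    """
    # Probability that any one person on the site is killed in a year.
    site_risk = individual_risk * people_on_site
    # Share of that risk allowed to each event which would kill one person.
    per_event = site_risk / lethal_events
    # Tighten for events killing several people, relax for events which
    # only have a limited chance of killing one person.
    return per_event / deaths_per_event / chance_of_death


one_person = tolerable_event_probability(1e-5, 1000, 10000)
ten_people = tolerable_event_probability(1e-5, 1000, 10000, deaths_per_event=10)
one_in_ten = tolerable_event_probability(1e-5, 1000, 10000, chance_of_death=0.1)
```

With the figures from the text (1 in 100 000 per person, 1 000 co-workers, 10 000 lethal events) the three results correspond to the 10^-6, 10^-7 and 10^-5 quoted above.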

In the
shown, an 'event' is either:
defined in the FMEA) which on its own has
lethal consequences. The probability allocated to this type of event defines
the 'tolerable level' which is referred to when the RCM process
asks the
"Does this task reduce the
of the failure
to an acceptable level?". See page 102.
failure where a protected
fails and the protective
device which should have rendered the system non-lethal is itself in a
failed state. The
allocated to this type of event defines the
'acceptable level' which is referred to when the RCM process asks
"Does this task reduce the probability of the
failure to a tolerable
level?" See page 122. It is also the probability used to establish
failure-finding intervals. See page 179.
it is
that an approach similar to a fault tree anaAndrew and
would be used to allocate probabilities
H r,n, .."""' r in this
we work downwards from
on the
probability of a fatal accident
for each
task and to determine failureto determine a top-event pro
...,,,,., u..... m aintenance program.
babiJity based on an "'v1dtt'Hl
,,,,,,Ti n l i"V c,,i cn:"n"'

A detailed examination of fault trees is beyond
the scope of this book.
The purpose of this
to "'"!C,......,,,.
to convert risks which individual members of
be nr,,n',\r;.:,,n
manifestation of 'desired ""' d,vr-n'\ ., n,,.,.
to tolerate
information which can be used to establish
to deliver that
The process described above can be used to
at work which would flow from the
the probabilities
risks which one individual is prepared to
by everyone
on the
his. or her
illustrated in
A3.4. Note that in the next four
represents the probability of any one event
in any one year,
the annual failure
be ..,..,,,,,u-

J;-''-'J1V1u1{{1\ev') into 1nean-

rl,>.('t lYt'\,:,.f'i

'> u. uvv.

Figure A3.4: Acceptability of one lethal event where I have some control and some choice

The same process could be ......,,--.

victims have no
be a typical "''"'"
'""..,"" of someone in this situation. From the maintenance
viewpoint, such
are likely to be users of mass transport
or people
these people could be called 'customers'.
and so on). In
In this case, if they all tolerate the same risk as the individual in
A3.2 (and there are the same number of potentially life-threatening
inherent in the system), the process of
A3.3 could lead to the
1 in 10

Figure A3.5: Acceptability of one lethal event where I have no control and some choice



to the no control/no choice scenario might

shown in Figure A3.6. (In practice,
most individuals are
to accept an even lower probability of being
killed for this reason than is shown in
the so-called 'dread'
factor. However, in most
fewer events would be likely to have
off-site consequences, so the
for each event might end up
about the

Figure A3.6: Acceptability of one lethal event where I have no control and no choice

Once acceptable probabilities
have been determined for single events as
shown in
A3.5 and
it is of course
combine them into a
'continuum of risk' as shown

Figure A3.7: A "continuum of risk"

Please note once
that these
are not meant to be prescriptive
and do not
reflect the
of the author or any other organisation or individual as to what should or should not be acceptable.



A3.7 is also not intended to imply that 1

is worth 10
million. That
a point at which two different value
happen to coincide. The financial risks which
and customers
which

point is that the criterion upon which

not what
is based is what is
current industry norm
these may
that the people who are both
in the
best position to decide what is tolerable are the
in the case of
the shareholders and their
customers and the managers who have to
clear up afterwards (and bear the
in the
of
risks. As mentioned
this appendix shows one way in which it may
be possible to turn informed consensus about tolerable risk into a framework
for setting
for maintenance programs
to deliver it.
please bear in mind that the
is not intended to be
If you have access to a different framework
which satisfies all the
then
all means use it

Appendix 4:

Condition Monitoring Techniques

1 Introduction
that most failures
of the
fact that they are about to occur. This
is called a potential failure,
and is defined as an
physical condition which indicates that
is either about to occur or is in
is defined as the inability of an item
to meet a
standard. Techniques to detect potential
failures are known as on-condition maintenance
because items are
and left in service on the condition that
meet
of these
is determined
the P-F
which is the interval between the emergence of the
potential failure and its
into a functional failure.
Basic on-condition maintenance
,,, rt lr t nrt in the form of the human senses
-u ""..,, 7, the main technical
of using
wide range of potential failure
conditions using these four
However, the disadvantages are that
humans is relatively imprecise and the associated P-F intervals
very short
failure can be detected, the longer the P-F
intervals mean that inspections need to be done less
often and/or that there is more time to take whatever action is needed to
avoid the consequences of the failure. This is
so much effort is
to define potential failure conditions and techniques for
detecting them which
possible P-F intervals.
shows that a
P-F interval means
that the potential failure must be detected at a point which is higher up the
P-F curve. But the
move up this curve, the smaller the deviation
if the final stages of deterioration
not linear. The smaller
the more sensitive must be the
monitoring
to detect the potential failure.
Figure A4.1: P-F intervals
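
The point that a longer P-F interval leaves more time to act can be put into numbers with a small sketch. It assumes, as an illustration only, that the worst case is a potential failure which emerges just after an inspection, so that the time left to act is the P-F interval minus the inspection interval. The function name is hypothetical and both intervals must be in the same units.

```python
def nett_pf_interval(pf_interval, inspection_interval):
    """Worst-case time left to act once a potential failure is found.

    Assumes the potential failure emerges immediately after one
    inspection and is therefore only found at the next one.
    """
    if inspection_interval >= pf_interval:
        # Inspections this far apart could miss the potential failure
        # altogether, so no warning time can be relied upon.
        raise ValueError("inspection interval must be shorter than the P-F interval")
    return pf_interval - inspection_interval


# A 9 month P-F interval checked monthly leaves at least 8 months to act;
# the same monthly check on a 2 month P-F interval leaves only 1 month.
long_warning = nett_pf_interval(9, 1)
short_warning = nett_pf_interval(2, 1)
```

The comparison shows why a technique which detects the potential failure earlier (higher up the P-F curve) either allows less frequent inspection or leaves more time to avoid the consequences of the failure.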



2 Categories of Condition Monitoring Techniques

range of the human
senses and can
be detected
instruments. In other
is used to monitor the condition of other

monitor, as follows:
Dynamic monitoring detects
failures
cially those associated with
equipment) which cause abnormal
to be emitted in the form of waves such
and acoustic effects.
detects potential failures which
cause discrete particles of different sizes and
the environment in which the item or component

Chemical monitoring detects potential failures which
cause traceable quantities of chemical elements to be

into the



failure effects encompass
in the
appearance or structure of the
which can be detec-
and the associated monitoring
detect potential
failures in the form
the visible effects of wear and
dimensional
Temperature monitoring techniques look for potential
failures which cause a rise in the temperature of the equipment
to a rise in the temperature of the material being
the equipment).
Electrical monitoring techniques look for
conductivity, dielectric strength and potential.


An enormous
of techniques has been
and more are
all the time, so it is not possible to produce an exhaustive list
of all the techniques
available at any time. This appendix provides a very
brief summary of 96 of the techniques currently available. Some of these
are well-known and well-established, while others are in their
even still under development.
whether or not any
is technically feasible
and worth
in any context should be assessed with the same rigour
as any other on-condition task. To help with this process, this appendix
lists the
for each technique:
the potential failure conditions which the technique is meant to detect
it is
the P-F intervals typically associated with the
for obvious reasons, this can only be a very rough 'ballpark' guide
how it works
and/or level of skill needed to apply the technique (skill)
of the technique
of the technique
techniques, it is worth
that a
deal of attention is
focused on condition monitoring nowadays.
Because of its novelty
it is often
completely separate from other aspects of scheduled maintenance. However, we should not lose
of the fact that condition monitoring is only
one form
maintenance. When it is used, it should
into normal schedules and schedule planning
and not made


3 Dynamic Monitoring
A Preliminary Note on Vibration

each individual component in order to
However, the situation is complicated
acceleration. So step
measured and what
is to decide which technique

now called
systems in vibration
The role
systems can now find and
vibration
readings with all the data from
of this part of this

al vibration


3.1 Broad Band Vibration

belt drives, compressors

motors, pumps.


of failure
of two parts: a
anical vibrations into


the root mean square

value and
sinusoidal wave rather than a
wave. Such meters have a
response over the range
Hz to 1 000 Hz
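
Since the fragment above refers to the root mean square value of a sinusoidal wave, a minimal sketch of that calculation may help. The function name and the sample values are illustrative assumptions; a real meter would work on a measured velocity or acceleration signal rather than a generated one.

```python
import math


def rms(samples):
    """Root mean square of a sequence of signal samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


# One full cycle of a pure sinusoid with a peak amplitude of 10 units,
# sampled at 1 000 evenly spaced points (hypothetical test signal).
peak = 10.0
cycle = [peak * math.sin(2 * math.pi * i / 1000) for i in range(1000)]
overall_level = rms(cycle)
```

For a pure sinusoid the result is the peak value divided by the square root of two, which is why a single overall RMS figure says little about which frequencies are contributing to the vibration.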

Skill: To

the equipment and record

vibration: a semi-skilled worker

the nature
much lower and contribute very
When these
do grow the
of deterioration. Difficult to set
3.2 Octave Band Analysis

for broad band vibration

and fractional octave filters divide the fre-
which have constant width

on a recorder
Skill: To

trained technician
to use when the measurement parameters have been
Good detection
engineer: Portable:


3.3 Constant Bandwidth


ranges can be

Skill: To

trained skilled

to interpret

3.4 Constant Percentage Bandwidth Analysis

Conditions monitored:

P-F interval:

and vibration


a suitably trained skilled worker, to interpret the
results an experienced technician
can be done in 'real time' and is therefore faster than FFT
and does not suffer from certain pitfalls caused by the batch nature of
FFT, such as loss of data
CPB spectra are very
for rapid
fault detection. Equipment

3.5 Real Time Analysis

Conditions monitored: Acoustic and vibrational

measurement and

of shock and transient

P-F interval: Several weeks to months
is recorded on
tape and played back through a real-
and transformed into the frequency domain.
A constant bandwidth spectrum is produced, measured at 400
intervals across a
range selectable from 0-10 Hz to 0-20
resolution mode can be selected, and the scan can also be adjusted
in the baseband spectrum
a 'slow motion'
to be observed the time window is stepped

Skill: To operate equipment and interpret results: an experienced engineer

bands over the entire

range simul-
display of analysed spectra is continuously
updated: No need to wait for level recorder readout: Suited to analysis of short
such as transient vibration and shock: X-Y recorders provide a
permanent record
Disadvantages: Equipment not portable and very
of skill: Off-line


3.6 Time Waveform Analysis

Conditions monitored:

cracked, broken gear teeth; pump cavitation,

misalignment, mechanical looseness,

Applications: Gearboxes, pumps, roller bearings, etc

P-F interval:

several weeks to months

is connected to a standard vibration analyser or a
real time
. A vibration
to the
vertical input.
The vertical axis on the CRT is scaled in amplitude and the horizontal axis


To reduce the
of the waveform.
pass, low pass and band

3.7 Time Synchronous

Conditions monitored: Wear,
to-metal impacting, microwelding, etc

Gearbox gear teeth, roller

a paper machine,

to months

bank of

rolls on

skilled worker.
the results

considerable

the individual gears can be

that has many components
Disadvantages: Care must be taken with machines with roller element bearings
tones are not
with the RPM and will be
3.8 Frequency Analysis
Conditions monitored: Changes in vibration characteristics

wear, imbalance,

mechanical looseness, turbulence, etc

belt drives, compressors, roller bear-
electric motors, pumps, turbines, etc

P-F interval: Several weeks to months

collected from measurement

in the time domain and
transformed into the
a Fast Fourier Transform
either the data collector itself or a host computer. The
range of the measurements is
on the
of the machine.
Each machine which has
parts will
a spectrum
A baseline spectrum of the machine in excellent condition
actual spectrum of the same machine running at the same
Increases over the baseline of more than one standard deviation at any
can indicate
One feature
the "waterfall" of FFT
Waterfalls are
taken at the same

over a time interval


to be trended
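
The comparison of an actual spectrum against a baseline can be sketched as follows. This is an illustration under stated assumptions, not the method used by any particular data collector: a plain discrete Fourier transform stands in for the FFT, the signals are synthetic, and the per-bin tolerance (standing in for "one standard deviation") is a made-up figure. The names `amplitude_spectrum` and `bins_above_baseline` are hypothetical.

```python
import cmath
import math


def amplitude_spectrum(samples):
    """Single-sided amplitude spectrum from a plain DFT (same result as an FFT)."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(samples))
        spectrum.append(2 * abs(acc) / n)
    return spectrum


def bins_above_baseline(baseline, actual, tolerance):
    """Frequency bins where the actual amplitude exceeds baseline + tolerance."""
    return [k for k, (b, a) in enumerate(zip(baseline, actual))
            if a - b > tolerance[k]]


n = 64
# Baseline: machine in good condition, one component at bin 4 (assumed).
good = [math.sin(2 * math.pi * 4 * i / n) for i in range(n)]
# Actual: same running component plus a new, smaller one at bin 10.
worn = [g + 0.5 * math.sin(2 * math.pi * 10 * i / n) for i, g in enumerate(good)]

baseline = amplitude_spectrum(good)
actual = amplitude_spectrum(worn)
tolerance = [0.1] * len(baseline)  # assumed standard deviation per bin
suspect_bins = bins_above_baseline(baseline, actual, tolerance)
```

Only the bin holding the new component is flagged, which mirrors the idea that an increase over the baseline at a particular frequency points to a specific part of the machine. Storing successive spectra taken at the same point gives the data needed for a waterfall-style trend.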

skilled worker. Requires considerable

interpret the results
software systems makes data interpretation easy.
in machine condition can be detected at an

small

and random noise can look similar.

3.9 Cepstrum
Conditions monitored:

harmonics and sidebands in vibration spectra

shafts, gears, gear meshing, belt rota-
tion, vane and blade pass frequencies
of pumps and fans

P-F interval: Several

to months


of the transform,
Skill:

and the expert software

3.10 Amplitude Demodulation



Skill: A


to months



3.11 Peak Value (PeakVue)

Conditions monitored: Stress waves caused by metal-to-metal
and abrasive wear
Anti-friction bearings

shafts and

or metal

P-F interval: Several weeks to months depending on the application

Separates low energy faults such as those that occur in anti-friction
bearings and gears, and enhances their
the faults to stand above
noise floor. This makes them easier to
PeakVue first
separates the
waves from the vibration waveform
a high pass filter.
It is then conditioned to enhance its amplitude and
width, making it FFT
The conditioned waveform is then processed
an FFT to determine the frequency at which the stress wave occurs



Reveals some faults that may have gone undetected in their earlier
stages or which are buried in the noise floor
of the vibration spectrum.
More consistent than demodulation. Outputs are independent of machine
and instrument Fmax
Applicable to a broad range
in excess of 1 kHz
very slow

3.12 Spike Energy™

pumps, cavitation, flow
metal to metal contact, surface flaws
pressure steam or air flow, control valves noise, poor


Applications: Seal-less pumps used in the chemical and petrochemical industry,

roller element
P-F interval: Several weeks to months
Some faults excite the natural
of components and structures.
The intense energy
by repetitive transient mechanical impacts
causes a
of high frequency energy in a spectrum
to appear as
which can be measured by an accelerometer. A
frequency band pass filter
is used to filter out low
vibration
detector that detects and holds the peak-to-peak
and measurement results are
amplitudes and high repetition rates
to a
signal can be
In the
spectrum, the
frequency and its harmonics.


3.13 Proximity

P-F interval:


Skill: A


Disadvantages: P-F interval short:

time: Diagnostic

3.14 Shock Pulse Monitoring

Conditions monitored: Surface deterioration and lack of
mechanical shock waves. With data




tion deteriorates from

increase up to 1 000 times.
Skill: A trained and

technician

to operate. Portable. Can be used on

condition and lubrication status
are not significantly influenced
tion and noise. Identifies subtle
condition or lubrication which
not be differentiated by conventional vibration

Disadvantages: Needs accurate

size and
information
measurement. Limited to roller element bearings


3.15 Ultrasonic Analysis
Conditions monitored:

in sound patterns (sonic

or deterioration

caused by

Applications: Leaks in pressure and vacuum systems (ie. boilers, heat
condensers, chillers, distillation columns, vacuum furnaces,
steam traps: valve and valve seat wear: pump
of seals and
pipe or tank leaks

P-F interval: Highly variable depending on the nature of the fault

frequency sound
between 20 kHz and
100 kHz.
sound waves are
short and tend to be fairly
directional, so it is easy to isolate these
from background noises and
their exact location. All operating equipment and most
a broad range of sound. As subtle changes
to occur with deterioration,
the nature of the airborne ultrasound allows these
to be
Ultrasonic translators convert the ultrasound sensed by the instrument into the audible range where users can hear and recognise them
filters out surrounding noise
The ultrasonic
may be
on headphones or as
on an electronic monitor or computer

Skill: A

trained skilled worker

3.16 Kurtosis
Conditions monitored: Shock
P-F interval:

Skill: A

anti-friction bearings


to months

semi-skilled worker

3.17 Acoustic Emission

Conditions monitored: Plastic deformation and
and wear





rials. Irrelevant electrical and mechanical noise can interfere with measurements.
Gives limited information on the type of flaw. Interpretation may be difficult

4 Particle Monitoring
4.1 Ferrography
Conditions monitored: Wear,


compressors and hydraulic systems

P-F interval:

several months
is diluted with a fixer solvent (tetrachloro
slide under the influence of a
of the

examination. This uses

known as bichromatic
both reflected and transmitted
sources (which may be used simultaneously).
Green, red and
filters are also used to
the size, composition,
and texture of both metallic and non-metallic particles. An electron microscope can also be used to determine particle
and provide an indication of
the cause of failure
worker. To

a suitably trained semi-skilled

and operate
and interpret the results: an experienced technician

Advantages: More sensitive than emission spectrometry

stages of wear:
and sizes: Provides a permanent pictorial record
Time consuming, and needs some very
equipment: Measures
an electron microscope for an in-depth

4.2 Analytical Ferrography

Conditions monitored: Wear,
Oils used in diesel and gasoline
compressors and hydraulic systems

gas turbines,


condition of the component

worker. To

4.3 Direct Reading

Conditions monitored: Machine

P-F interval:



trained semi-skilled worker



4.4 Mesh Obscuration Particle Counter (Pressure Differential)

oil systems caused
corrosion and contaminants

P-F interval:

several weeks to months.

This instrument measures the differential pressure across three

micron screens, each with a known number
As the
than the pores are trapped on the
which reduces the open area of the screen and increases the pressure
across the screen. Sensors measure the pressure
which is converted
to reflect the number
than the screen size. This is converted in
turn into ISO 4406 cleanliness codes.
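
The last step, turning a particle count into a cleanliness code, can be illustrated with a lookup of ISO 4406 scale numbers. This is a sketch only: the upper limits below are the commonly published scale values as recalled here and should be checked against the standard itself before use, and the function name is hypothetical. It returns the scale number for a single particle count per millilitre; a full code is made up of one such number for each particle size reported.

```python
from bisect import bisect_left

# Upper limit (particles per millilitre, inclusive) for scale numbers 0 to 28.
# Assumed values: verify against ISO 4406 before relying on them.
UPPER_LIMITS = [
    0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.3, 2.5, 5,
    10, 20, 40, 80, 160, 320, 640, 1300, 2500, 5000,
    10000, 20000, 40000, 80000, 160000, 320000, 640000,
    1300000, 2500000,
]


def iso4406_scale_number(particles_per_ml):
    """Scale number whose range contains the given particle count."""
    if particles_per_ml < 0:
        raise ValueError("particle count cannot be negative")
    # Index of the first upper limit that is not below the count.
    index = bisect_left(UPPER_LIMITS, particles_per_ml)
    if index == len(UPPER_LIMITS):
        raise ValueError("count is above the top of the assumed table")
    return index


# For example, 2 000 particles per ml at one size and 100 per ml at a
# larger size would be reported as scale numbers 18 and 14.
code_small = iso4406_scale_number(2000)
code_large = iso4406_scale_number(100)
```

Because each scale number covers roughly a doubling of the count, a change of one in the code represents a substantial change in contamination.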
Skill: To operate the
interpret the results: a
Advantages: No
and can be used
in the field or the
An in-line version of the
can be used for
real time continuous
Particle counts are calibrated to an ISO 4406
cleanliness standard. Most oils can be
in a matter of minutes. Not affected
bubbles, emulsions or dark oils that limit laser-based
Disadvantages: Provides no indication of the chemical composition of particles.
oil systems.
4.5 Pore-blockage (Flow Decay) Technique
Conditions monitored: Particles in lubricating and hydraulic oil caused by wear,
corrosion and contaminants
compressors and
P-F interval:

several weeks to months

between 30 and 150
(can go as high
precision calibration screen
in a sensor
the flow. Smaller


time curve. The hand held computer

to convert the

an ISO cleanliness code.


Extinction Particle Counter


in

corrosion and contaminants

compressors and
P-F interval:

Skill: To

of the contaminants.

the count From this information



Conditions monitored: Particles in lubricating and

corrosion and contaminants

oil caused


oil systems
to months


count. From this information a direct

determined automatically

trained skilled worker

where conditions are controlled.
small as 2 microns. Faster than the visual
the test

number of translucent
also vary
particles in the
beam. Provides no information on the chemical composition of contami-
avoid coinci-
dence error where
appear as one
4.8 Real Time Ferromagnetic Sensor

compressors and
P-F interval: Weeks to months

ferromagnetic sensor uses an

of ferrous
the sensor. The sensor
collect around
with an electromagnet.
in an oscillator
calibrated to indicate the mass of ferrous particles collected. After a measurement
been taken, the

On-line technique


4.9 All-Metal Debris Sensors

Ferrous and non-ferrous


P-F interval: Weeks to months

the stimulus coils. When a ferrous

the first field and


The sensor will detect and measure most of the

are
to alert/advise operators, or to
both ferrous and non-
of a false indication. On-board
of various
and store the time domain
identification of wear sources in near real time.


4.10 Graded Filtration
Conditions monitored: Particles
corrosion and contaminants
turbines, transmis-
P-F interval:

weeks to months

A small amount


material or dirt

4.11 Magnetic Chip Detection

transmis-

of the cap indicates imminent failure. The debris
colour, and texture) depending

v (, CU H H ,<Hl"JJl


4.12 Blot Testing

monitored: Wear metals,

compressors and
to few

and sometimes corrosion oa1m,:1e:s,

turbines, transmis-


4.13 Patch Test

Conditions monitored: Wear metals,

P- F


Skill: A



D- 1698)

and ,.,... ,,.,.,., """''"


in transformers, breakers, and cables


to separate the sediment from the oil.

measure soluble
is decanted and
., ' """"' pentane insolubles and filtration
Uh,,tV>.tJ::c\.,,1.,L and filtered
to obtain total
and the remainder

an electrician. To conduct the

be taken offline

Transformer does not have


4.15 LIDAR


be conducted in

Detection And Ranging

smoke from smokestacks



5 Chemical Monitoring
A Preliminary Note on the Chemical Detection of Contaminants in Fluids
to detect elements in
potential failure has occurred

of the fluid

listed below, and

Wear metals: the

Magnesium from turbine ac(:essorv \. ,c.., '" "''

- Manganese from valves and blowers

Molybdenum from
Nickel from valves, turbine

Tin from
turbine aircraft

Zinc from brass components, ne1Jrn1 e111e


Aluminium from
Boron from coolant leaks in oil
Calcium when found in fuel.
Copper from oil cooler

Magnesium from seawater contamination

Phosphorus from a coolant leak in oil

Potassium from contamination

- Silicon from contamination
Sodium from anti-corrosion
result of a coolant leak.

Corrosion: the
Aluminium from

u,11 1 r. 1u 1 n n ""'""''""'Yt'


in oil




tin, copper, nickel and

'"''""'"'" "' uw,.t;:u1..<,1uu1, or bariu rn: contaminants such

silicon: corrosion


turbines, transmis-

compressors and

to months
!ll'fC> f V I P t1,P

of a
flame or other

into its constituent atoms. The



converted into

the results: an

metal concentrations in used oil

at low
not suffer from

facilities for determining wear

1mc1su11n and repeata-


Fluorescence Spectroscopy

Conditions monitored: Wear metals such as iron, aluminium, chromium, lead,

tin, copper,
and si lver: oil additives vv,nu,,uuu;:;.,
calcium, magnesium, or barium: contaminants such

turbines, transmisP- F


source which raises

This causes the contaminants to emit
that the radiation measured is the

of the chemical elements in the

multi-channel data



5.6 Energy Dispersive

oil ,,,ui ,t, ,u,,, C"Olrll<IUHI1;!
or barium: contaminants such

P-F interval:


tion of the elements can be obtained.

Skill: To draw

line scans

Conditions monitored: The



Several months


'"""''", ,L,,,

to two
1816) apart, until breakdown
and trended. Five breakdowns
with one

an electrician.




Transformer does

to be taken offCt.>rll<' H H /P to

""'''--"''""" PCBs. Uses

5.8 Interfacial Tension (ASTM D-971)

in transformers, breakers and cables.

P-F interval: Months
""'"" " " "'' "' '""

the force needed to

"n" wire from the interface between a
of oil
the device
is immersed in the water to a
on the
of 10 mm. The oil-water interface is
r,n.,.hir..c.,, The inter-
lowered until
tension is then calculated. High and medium
u;;.,,.,.,..,, i .. ,.,, ., should
i n-service oil and 40 (l'\f lnt,,
1r1,n for new oil
"'1 "' ' " 1

\'f'I IT,,t'f/C>C n)TH, 1't)ll: 'rlr.>rc

electrician. To conduct the test,


,,,,,_,.,.""''""'""'' soluble in water. Test takes about



Conditions monitored:


- Arsenic from anti-corrosion

- Barium from oeirenientt cl1st:>er:sar1t


and/or m:;oc:rs,mt
Calcium from
- Chromium from an anti-oxidant in
Cobalt from

from natural trace

Iron from natural

Lead from anti-wear
anti-knock agent

in some lubricants

- Nickel



oils

in crude oils






levels in crude oils

Potassium from natural
levels in
Selenium from natural
crude oils and coal.
oils
Silicon from anti-foaming agent in

levels in crude oils and seawater

levels in crude oil and so1ne fuels. Used as an antiinr,n,,ir ,nrr oils.
levels in some crude
crude oils. Found as anti-wear additive in
anti-oxidant in marine lubricants.

5.10 Fourier Transform Infrared (FT-IR)

Conditions monitored: Deterioration, oxidation,

'"' '-'""''". ,,,, 1,1 1 h"" in mineral oils and
oils from combustion


to months

P-F interval:

spectroscopy, FT-IR measures absorbent

u,,v,:,, 1,c, ,n o:il'l1<;: to determine the
of the elements in
converted into a uniform
a Michelson interferometer
where it is altered
e tt::m,srn:s of the oil and contaminants.
detector where it
q h,. nrnt , n n

trained semi-skilled worker. To operate the

i ,,n, ...., tr.. ," technician. To
test results:
levels do not alter
unlike AA. Data can be

ref)earatnllty Total acid num

""'"'''"',..... from FT-IR data
fla mm able solvent for -,,,.. . ,.,,...
...,1,..q,.,, ,.,,.,.H use different data extraction
lo l

for oil condition


u n ., a v n n,,

P-F interval:




P!TH ' I .C'!");V

Fourier transform infrared spectrometer to record
eluted from the column. Different detectors


insolubles).



trained and

have broad features that are

5.14 Thin-layer Activation

electrical con-

interval: Months

Skill: To

5.15 Scanning Electron

Conditions monitored:


L,r,,,.,.,,.,.,,,,d i,trl':CIJt><

for the nn,,ntH''t' of

thin films and , n r,.-t ,, ,,;, c,

conductors, finished semiconductors, metal and






and creates a
ion then
atom. The
fills the core hole and a third
,nn c,::>r\./P energy This electron has

"''"U"'"I'-, atom, which allows dements to be

Skill: Skilled

determine root causes of fail-

5.17 Electrochemical Corrosion Monitoring




trained technician
msoe<;ttcm unless this
to do so

the extent or

location of corrosion:

5.18 Exhaust Emission Analysers (Four-gas Analysis)

Internal combustion

P-F interval: Weeks to months

Skill: Trained and


5.19 Colour Indicator Titration

monitored: Lubricant deterioration
m oil


P-F interval: Weeks to months

dissolved into a mixture


chemicals used in

5.20 Potentiometric Titration

Conditions monitored: Lubricant deterioration
of an oil





for petroleum based


5.21 Potentiometric Titration TBN (ASTM D2896)

turbines, transmis-

P-F interval: Weeks to

to neutralise corrosive acids formed

continued use.
regardless of colour of oil. Accurate
used for



5.22 Power Factor (ASTM D

Dielectric losses in electrical insulating oils caused



A Preliminary Note on Moisture Monitoring

Water in oil
reduces rn:wn,1n1:rv
reduce roller element
P l'1 Al l <' I \J wth
the l 1 1 !-,rtf"\ 1 1 n , nr1,n,rt,1,.>,'

it cnames

it dnrrr>l'lt::<'

"''"'i'><'1l"\/ 11nn1rn1,;,r,;:

other aspects of
Water also
rusts and corrodes metal "'"'''""'''"'"

follows:

n f' t;,,, ,N,H'.

- it gums up valves and
it shortens the life of filters
it entrains more air, which

5.23 Karl Fischer Titration Test

Conditions monitored: Water in oil

bulk modulus.

interval: Days to weeks
1n,:,rntu1,11 A measur ed
is reacted with Karl Fischer reagent which
contains iodine. When iodine is
current will pass between
with the
electrodes. Moisture entrained in
which has not reacted with the iodine remains. Once
the iodine and the test is
used to determine the titration
and calculate the water concentration. The duration of the test indicates the water
transformers should not exceed
at 20°C

per million).

are difficult to obtain:

for sensitive
diluted: Considerable skill needed

5.24 Moisture Monitor (Vapour Induced Scintillation)

Conditions monitored: Water in oil
turbines, transmis-
and transformers.



P-F interval: Several

""'"'/"'"'' C't"' and e m i t a d istinctive

n11crcmr1011, e m ounted near the
data collector for

Skill: A trained semi-skilled worker

Detects a wide
70 millilitres of fluid to conduct the test.
fluid's

5.25 Crackle Test (Human
Water in oil

P-F interval:


trained semi-skilled worker

of water present.

oil around hot surface,

5.26 Crackle Test


Conditions monitored: Water in oil


P-F interval: Weeks


5.27 Clear and


Conditions monitored: Water in oil

low as 25

10 000


a more

and economical.

Oil colour can

6.1 Liquid

discontinuities or cracks due to
corrosion stress
heat-treatment, corrosion
machined sur
structures, compressor receivers

P-F interval:

a,cc1Jrctm2. to
fluorescent or dual sento remove them from the test
penetrants) and the
post-emulsified or solvent

' ' " ''"""" '""'"' '1

trained semi-skilled worker. Interpretation:


Conditions
P-F interval:

Conditions monitored: Surface and n,,,..,.,.._,.,,.,

wear. laminations, mc:1wm111s,




should be carried out under ultraviolet

Conditions monitored: Surface discontinuities and
heat tre,atn1e11tL '"''"''''''"'""' cn1hriltlen1er1t.




co1ntam1ng fine iron oxide parmspec:ncm and a

trained semi-skilled worker. Evaluation:

Can be used on

with limited visual

Provides a record.

cracks: Not an on-line technique.

6.5 Ultrasonics Pulse Echo Technique
Surface and subsurface discontinuities caused
>1"\ Ptrtlf,n and
heal treatment, inclusions, lack of n,:
\11 Welds,
t' ,., .. .. ..
lamination. The thickness of materials
to wear and corrosion.
L UllU llUHl.\ u', n , J'r, ,-.,,/f


Applications: Ferrous and non-ferrous materials related to

tures, boilers, boiler tubes,
P-F interval: Several weeks

several months.

""'""''""- A transmitter sends an u ltrasonic

to the test surface. A receiver
n,,,-t-,,,,t feeds the return
to an ,,.,,,
echo a combination of
"fiPt';>,._, and from any 1 n r, ::.r\J'Pn 1.n
from the uv1,"-":' ll"' side ()f the wnr
'--" ''k'':,.,.,_,
between the initial and return
The time
and the
indicate the location and
of the defect can be
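
The pulse echo principle described above relates the time between the initial and return pulses to the position of the reflector. A minimal sketch of that arithmetic follows, assuming a single straight-beam path and a known speed of sound in the material. The function name and the nominal velocity used for steel are illustrative assumptions and do not come from the text.

```python
def reflector_depth_m(round_trip_time_s: float, sound_velocity_m_s: float) -> float:
    """Return the distance from the probe to a reflector (defect or back wall).

    The pulse travels to the reflector and back again, so the one-way
    distance is half of the velocity multiplied by the round-trip time.
    """
    if round_trip_time_s < 0:
        raise ValueError("round-trip time cannot be negative")
    if sound_velocity_m_s <= 0:
        raise ValueError("sound velocity must be positive")
    return sound_velocity_m_s * round_trip_time_s / 2.0


# Assumed nominal longitudinal sound velocity in steel (illustrative only).
STEEL_VELOCITY_M_S = 5900.0

if __name__ == "__main__":
    depth = reflector_depth_m(10e-6, STEEL_VELOCITY_M_S)
    print(f"Reflector depth: {depth * 1000:.1f} mm")
```

In practice the velocity would be taken from a calibration block of the same material, since it varies with alloy and temperature.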



Skill: A


of materials.

Difficult to differentiate types of defects.

6.6 Ultrasonics Transmission Technique
A transmitter emits continuous waves from one transducer which are
the test
Discontinuities reduce the amount
r"' '""r and so their presence can be detected.
1'"' 1

echo technique.
rec:tmt1u1n e: Problems of modulation associated
to be obtained.

,,,::,,,:,,--i ,.,.....

6.7 Ultrasonics Resonance
Conditions monitored:
(Also used for

ll/l./ l l !...IA v , ,

P-F interval: As for

Resonance in the


6.8 Ultrasonics Frequency Modulation

Conditions monitored:
send ultrasonic

Conditions monitored: General and localised
P-F interval: Several months.

stream for a
are removed and checked for
several months)
these measurements, relative metal loss from the
can be estimated.
Skill: A

trained technician.


and subsurface discontinuities

detection of dimensional ..., .. ,,.. 1_,"-"-'
corrosion; determination of material hardness.

100 kHz

4 MHz induces
meter.

,,,,n,-in ,- tt n fT T>"\ '1tr, -> l <c Can

work with
c h a rt .-, ,-. . -,,,,----

materials.

6.1 1

structures, metallic
pumps, shafts,


parts or structures
crack-like

Two-sided access


Conditions monitored:

and their

defects, corrosion,


Skill: A

6.14 Cold




Skill: As

6.15 Deep-Probe Endoscope

!nSrPTf"> P t'"

system which can penetrate bores with

by a


restricted entry. Illumination


Skill: As

6.16 Pan-view Fibrescopes


As for
unit is trans,.,.,...,.,,,"'r from a cold l ight
a flexible fibre cable into
built into its \Vhich
The instrument can be
to take a detailed ..-,n., u,-:n,"

1 \,;\.j \Ut v.l,

or cine cameras,

control knob built into the side

take photographs or mount TV viewers
can also be used with

fluorescent penetration to detect minute flaws in inaccessible areas

makes more detailed inNot a n on - line .,,..,.",,t,-..r,,., .. t,:,,,,tn,, ,,,,, ,.c
Resolution limited:
DfC)IOIH!ed rn1soiect1ons: Ultra violet nn1resco1Jes

6.17 Electron monitored: l'he

circumstances of


6.18 Colour
Conditions monitored: Oil colour and condition
P-F interval: Weeks to months

6.19 Oil
Conditions monitored: Oil oxidation, water contamination,



contamination.
No test
Particles less than microns cannot
and source of contamination cannot
6.20 Oil Odour
Conditions monitored: Oil oxidation

'""'''"" "'''""'""' odour when n ew and flP\fPln n

or ' burned' odour
oxidise in
An unusual odour
the stronger the odour,
contamination such as fuel dilution.
technique is also limited
have a more sensitive
to strong odours. In addition, vapours can
tanks, and
off a strong odour when the
concentrated vapours may therefore suggest
oxidation than

,n n ,T,CHH

worker


6 .1 8 Strain
tunnels, the load

P-F interval:

to months

foil and semiconductor
"""" . ..,,, that vvhen an electrical
" " /,,,-,,.,.


and
increased internal and
due to poor lubrication and reduced
,;, f j , r, ,t u n "1,'-P



Condirions monitored: '1 ,,, .,.,., c,!''1 cnair1gtS

mixed lubricants, fuel and

P-F interval:

weeks to months.

computer for

trained skilled technician.

HV> ;,c; n( ,c,d hc.H Y>,.._L;,' < d ,, , ,_,..,


6.23 Falling Ball Comparator

turbines, transmis-
P-F

,,,,,="""'"C,"i to a reference oil . Identical baHs are
and the reference oil. The time
r. r. . . . ,r1

Skill: A trained

Accurate to within

in most

needs to be translucent
falls, dark or oxidised oils may be unsuitable. Not a field


6.24 Kinematic Viscosity (ASTM D445)

Conditions monitored: Oil u H;,,,, ,h,
in diesel and gasoline
compressors and hydraulic

v l l,,",Jll\.,,

transmis-

P-F interval: Weeks to months.

This test (resistance to

measures the time it takes for a
a calibrated
at a
can be used to monitor oil deterioration over time or
fuel or other oils. The kinematic
time of flow and the cal.i bration factor of the instrument The '"""""'rn," ""'''""' 'T"'
of the liquid.
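
The calculation referred to above can be sketched as follows. It assumes the usual capillary viscometer relationship, in which kinematic viscosity is the measured time of flow multiplied by the calibration factor of the instrument, and it adds a simple percentage comparison against a baseline for trending. The function names and the example figures are illustrative assumptions, not values from the text.

```python
def kinematic_viscosity_cst(flow_time_s: float, calibration_factor_cst_per_s: float) -> float:
    """Kinematic viscosity (cSt) = time of flow (s) x calibration factor (cSt/s)."""
    if flow_time_s <= 0 or calibration_factor_cst_per_s <= 0:
        raise ValueError("flow time and calibration factor must be positive")
    return flow_time_s * calibration_factor_cst_per_s


def percent_change_from_baseline(current_cst: float, baseline_cst: float) -> float:
    """Percentage change of the current reading relative to a baseline reading."""
    if baseline_cst <= 0:
        raise ValueError("baseline viscosity must be positive")
    return 100.0 * (current_cst - baseline_cst) / baseline_cst


if __name__ == "__main__":
    # Hypothetical readings: new oil baseline and a later in-service sample.
    baseline = kinematic_viscosity_cst(250.0, 0.4)
    in_service = kinematic_viscosity_cst(275.0, 0.4)
    change = percent_change_from_baseline(in_service, baseline)
    print(f"Viscosity change from baseline: {change:+.1f}%")
```

A rise in viscosity over time is generally associated with oxidation or contamination, and a fall with dilution by fuel or a lighter oil, which is why the trend matters more than any single reading.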


are used in the

Not a field technique.


A Preliminary Note on
ation. It i::; based on the n .., ....,.,,,.1."
emit infra-red radiation. Thermal ,,"Y,un,n,Y

7.1 Infra-red Scanners

Conditions monitored:
oxidised or "'""'"">io,.,
Mechanical: heat op''"'"'..,,,,t.,,,r1
lubrication, misalignment


welds, buried steam lines, steam traps,
kilns, tyre
and rubber manufacture.

P-F interval: A
detectors. The detectors
amount of current
then nr,'V'>CC.>i1 an on- boa rd """'" ' c,u
ted on a view finder or monitor as

Skill: A




7.2 Focal Plane
Electrical: currentiresistance re1auonsm1 os fonn



radiation onto a
of {'ipr.:,,,. trc
and thermal resolutions that were
unknown.
composed of many small elements. The detectors convert the
which is


into a visible

""'"'" ""';;"'"" FPA' s have

versatile. Non-contact safe to

rw,"""'""'"" without
the temperature of
t,;,.,nt'\ ,.rJ,ll !lr<> a1mrenc<S as small as 0. 1 F or
Small and
than IR scanners.
7.3 Fibre
insulation deteriora-


P-F interval: Hours to months.

Skill: A

able in hazardous environments:


Conditions monitored: Surface temperature.
Hot spots, insulation failure.
P-F interval:


Permanent record



Colours do not
temperatures: Service life of each
colour in the interim).

8.1 Linear Polarisation Resistance

Conditions monitored: Rate of corrosion in systems with con-
ductive corrosive fluids.




trained technician.

and direct indication of corrosion
Some instruments record the
1,.,. 1 0
sion condition: Automatic and
Sensitive to corrosion
results.
fraction of a mil per
tor,, ,L:>nrq, ; l'\i'l ,:[i l< l lf'P<.:

nr.,... ,, ,,


information on total corrosion.

8.2 Electrical Resistance (Corrometer)
Integrated metal loss

total corrosion).
gas transmission

structur es, cathodi c f"\t,,tpr, f , J'\l

' .... ..'><. ..... ,...""

water distribution
paper mills,


HH,n , t r.,r, n n


or tube of the same metal as the

a wire,
monitored. The electrical resistance of the



corrosion rate.


both corrosion rate and

indicate whether the corrosion

eO:Ul(IITHllt nrrnn,nc no permanent record.

8.3 Potential Monitoring




mills, electrical "''"''"'""'""

etc; best suited to stainless
P-F interval:



can be influenced

8.4 Power Factor

Conditions monitored:
moisture in

P-F interval: Several months.

times the insulation

power factor. The measured current
"" "
called the watts loss. These values are measured and
"w.1 A, when the insulation
system is first installed to establish a baseline. Subsequent
to the initial
As the circuit urn,ed.arn.'.e ,...a,:1,,,,,,
ture, contamination, insulation shorts or

filled oil transformer should have a
in-service oil filled transformer

field technician.
One of the
Not an




monitored: Insulation resistance.

current flow. If there is no current
eqiu1i::,m1::m under
this current must be
cmTent' . The insn caHed
Skill: Technicians

be carried out on-line.

8.6 Breaker Timing

monitored: Breaker contact travel,

and bounce.

circuit breakers.

and medium

,,.,,,,. ..,,,,"""'' h' attached

the breaker mechanism,

then nn,<>t).:>f1
The test set measures contact
These results are
to the last

set. The breaker circuit

breakers can benefit from this test.

8.7 Breaker Contact Resistance Test

Conditions monitored: Breaker contact wear and deterioration.
Circuit breakers.

the contacts. The


Resistance values
ures before

8.8 Motor Circuit

interval: Several

A number of tests are taken ,.,,,r:,, ,.,.,,.

motor circuit condition.
of any defects which may be

circuit unbalances
of the motor
indicates imbalanced magnetic

\Xll1,t11l H1'C



l'-l 1r"\.1 t ' 1 t 1 , 1 ,:,

insulation and the outer 'skin' of the insulation


field technician.

'""'''""''- and minimum c urrent test
non-destructive. u1mtwc:u,,1nt and
can be used in
the field. Tests can be done at the MCC requiring no break in motor connections.

8.9 Electrical Surge Comparison

Conditions monitored: Turn-to-turn and phase-to-phase insulation deterioration,
and reversal or open circuit in the connection of one or more coils or coil groups.

P-F interval: Weeks to months, dependent on motor

under loaded conditions.
to two separate but
at high
waveforms reflected from each
1u\.,ul.l._,,,.1, each wave-

to third segment, and

which combination produces
turns cause
small differences in
shorted or
Mis-connections such as coil reversal
in waveform
tend to cause
differences or
With this method it is also often possible to determine the voltage at which
turn-to-turn or phase-to-phase conduction begins. If this is near operating
voltage, then the motor has a serious insulation fault and should be replaced as
soon as possible. If conduction is not detected up to twice operating voltage
plus 1000 V, the insulation is considered sound and the motor can be returned
to service.
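
The acceptance rule described above can be written down as a small decision function. This is only a sketch under stated assumptions: the reference voltage is taken to be the motor operating voltage, the meaning of "near" operating voltage is expressed by a hypothetical margin parameter, and the intermediate "investigate" outcome is an added assumption for onset voltages that fall between the two cases given in the text.

```python
from typing import Optional


def surge_test_limit_v(operating_voltage_v: float) -> float:
    """Upper test voltage: twice the operating voltage plus 1000 V."""
    return 2.0 * operating_voltage_v + 1000.0


def assess_winding(operating_voltage_v: float,
                   conduction_onset_v: Optional[float],
                   near_margin: float = 1.2) -> str:
    """Classify a surge comparison result.

    conduction_onset_v is the voltage at which turn-to-turn or phase-to-phase
    conduction was first seen, or None if none was detected up to the limit.
    near_margin is a hypothetical factor defining "near operating voltage".
    """
    limit = surge_test_limit_v(operating_voltage_v)
    if conduction_onset_v is None or conduction_onset_v >= limit:
        return "return to service"
    if conduction_onset_v <= near_margin * operating_voltage_v:
        return "replace as soon as possible"
    # Assumed intermediate outcome, not stated in the text.
    return "investigate and trend"


if __name__ == "__main__":
    for onset in (None, 500.0, 1500.0):
        print(onset, "->", assess_winding(480.0, onset))
```

The margin would need to be set by whoever owns the test procedure; the text gives only the two end cases.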

ment can also n;:.,ri"nrn,
Cannot evaluate one coil by itself.
complex and
determine the location and
of a fault.
8.10 Motor Current Signature Analysis
Conditions monitored: Broken rotor
between bars and
uneven rotor-stator air gaps, rotor misposition, deteriorated
or shorted rotor or stator core lamination.



Applications: AC or DC motors.

P-F interval: Several weeks to months.

Jn;;rn;f1n11 This
is based on the ..,,.,...,,..,,,nh> that an
a mechanical load acts as an efficient, continuously available transducer. The
motor senses mechanical load variations and
variations which are transmitted
the motor power
current
very small in relation to the
current drawn
electric motor, can be monitored and recorded at
from the
of the variations ,, ; ; U H l<>c'
of machine condition, which may be trended over
deterioration or process alteration. The test is done
on one of the power leads at the motor control
or starter cabinet. The raw waveform
filtered and further
processed to obtain a measurement of the instantaneous load variations within
the drive train and the ultimate load. In
should not differ more than
StatOf nr,,hlPn,i;:
pass hPnn,,:.n l"lr ll1nll1C(t}ate.1v
pared With the
significant difference in amplitude

current transformer around one of

electrician. To conduct the
and llH,e>rr,r>.t
of electric motors.
nician with an
Advantages: On-line measurements can
connections. No electrical connections
electrical shocks.
can be taken
or otherwise hazardous machines.

,-,.n">,nra l u


been HY\t"lrr."""''
the spectra (this has
interpretation standpoint). Equipment

8.11 Power Signature Analysis

Conditions monitored: Rotors, broken bars, cracked broken end
bowed or bent rotors; Stators, shorted lamination, eccentricity;
balance, resistive and inductive imbalance;
current and
or machine
Torque variations, wear or deterioration of machine clearances,
output restrictions, m,tc1111mrv augrnrnem: M,1cl11m(:rv
li.l)l}ll'Calwns: AC induction motors, sv1nc11rcmC)llS motors. f'('\t"'rU'\t',a. cc,,rc pumps
and motor operated valves.

P-F interval:

weeks to months,
attached to motor feed lines either at the Motor Control
conditioning unit condi
sensed from the feed lines. Data files



Skill: To attach
and interpret the

to live motor feed lines: an electrician. To conduct the test

an experienced technician.

down the equipment. One of the

can be done without
few techniques that enables broken rotor bars to be detected under load. Allows
equipment efficiencies to be determined.

8.12 Partial Discharge

Conditions monitored: Insulation breakdown.
Applications: All
of medium voltage
switchgear, bus ducts, transformers, arresters, bushings,
and the cables themheads, motors, generators, cable terminations, cable
> 2,000 volts AC
selves. Distribution systems and
P-F interval: Several

the shape of the void, am

"" ''"'"'''''"''"' how quickly the insulation fails).

occurs when a small void, crack, or
larity in an insulation system
electric field to build up. Sensors are used
sensor is connected between the grounded
to pick-up the PD. On
CT circuit. On cables, the sensor is connected around the
side of the
around the insulated conwire that connects the cable shield or
are placed on the motor frame or around the ground
ductor. On motors,
three issues
connection or around the insulated motor lead. In
are considered:
and the magnitude of the pulse (field
the number pulses per



the power of the

over time of the power ( trend
the rate
When trending data three levels of PD thresholds are set
continued use,


Skill: Experienced electrical technician.

Advantages: Allows
and more informed decisions.
type of electrical equipment
Disadvantages: Current available on-line
source of the PD while the equipment
little or no information. Several data
current standards available on maximum acceptable
for cables.

rPr h ,('\l {'"'"

to set PD thresholds.

8.13 High Potential (Hi-Pot) Testing

wall insulation deterioration .
P-F interval: Several weeks.
High DC
or ramps up to a limit,
twice the line
derived from the IEEE Standard 95. At the first
current or drop in insulation resistance with further
is recorded and the
removed in order to
breakdown. If the insulation withstands the
the motor can be returned to service.
u,",,, Qf1IlSll1at10n
LUI I \J1 'lt,

Skill: An experienced electrical technician.

Disadvantages: Motors have to be taken out of service to
v,.,,_..,, ,. ,.,_.,. , , destructive.
8.14 Magnetic Flux Analysis
Conditions monitored: Broken rotor bars, unbalanced
such as turn-to-turn,
P-F interval: Several weeks to months



A flux coil sensor is placed at the centre of the axial outboard end of
the motor. (Consistent positioning of this sensor is essential for reliable and trendable
data.) The signal
from the sensor is transformed into the frequency
an FFT analyser. A trend of certain magnetic flux frequencies will
associated with the rotor and stator windings.
indicate electrical
in a flux coil spectrum occur at frequencies which have some
Most of the
Broken rotor bars increase the sideband activity
relationship to running
(which causes motor
Unbalanced supply
and eventually leads to premature deterioration of the stator windings) shows no
except around the
occurring at line frequency + 1 x RPM. One of
the first faults a winding will encounter is turn-to-turn shorts, which then
or phase-to-ground shorts. A winding fault can be indicated
around the 3 x running
sideband of line frequency. A variation of this
technique is used to detect turn-to-turn shorts by looking at the family of 'slot
frequencies from measurements taken with a flux coil. Flux measurements
are taken as mentioned above, and the
is analysed at the
The principal slot pass frequency occurs at the product
of the number of rotor bars and running speed. The technique involves comparing
spectra over time to determine when changes occur.
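
The frequency relationships mentioned above can be sketched as follows. The slot pass calculation follows the text directly (number of rotor bars multiplied by running speed). The sideband helper assumes that the "line frequency + 1 x RPM" in the text refers to a pair of sidebands either side of line frequency, which is an interpretation rather than something the text states. Function names and the example motor are illustrative.

```python
from typing import Tuple


def running_speed_hz(running_speed_rpm: float) -> float:
    """Convert shaft speed from revolutions per minute to hertz."""
    if running_speed_rpm <= 0:
        raise ValueError("running speed must be positive")
    return running_speed_rpm / 60.0


def slot_pass_frequency_hz(rotor_bars: int, running_speed_rpm: float) -> float:
    """Principal slot pass frequency = number of rotor bars x running speed."""
    if rotor_bars <= 0:
        raise ValueError("number of rotor bars must be positive")
    return rotor_bars * running_speed_hz(running_speed_rpm)


def running_speed_sidebands_hz(line_frequency_hz: float,
                               running_speed_rpm: float) -> Tuple[float, float]:
    """Assumed sideband pair at line frequency -/+ 1 x running speed."""
    offset = running_speed_hz(running_speed_rpm)
    return (line_frequency_hz - offset, line_frequency_hz + offset)


if __name__ == "__main__":
    # Hypothetical motor: 40 rotor bars, 1800 rpm, 60 Hz supply.
    print(slot_pass_frequency_hz(40, 1800.0))
    print(running_speed_sidebands_hz(60.0, 1800.0))
```

These values only say where in the spectrum to look; the technique itself depends on comparing amplitudes at those frequencies over successive readings.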
Skill: To record the spectrum: an electrician/technician with an understanding of
motors. To interpret the results: an engineer.
Advantages: One of the few techniques that can detect faults associated with
electrical insulation of electric motors while the motor is online.

Disadvantages: High level of skill and knowledge of electric motors required
to interpret results.

8.15 Battery Impedance Test

Conditions monitored: Cell deterioration.
......,.,,.-n,,.n ,u power and DC control power batteries.

P-F interval: Several weeks.

its internal impedance
to lose
an AC signal between the terminals of the
measured and the
calculated. Two
,,.....,.., ... ,,,., ,.,.,,.", can then be made:
the impedance is compared with the last
for that battery and second, the reading is compared with other batteries
in the same bank. Each
should be within 10% of the others and 5% of its
outside these values indicates a cell problem or capacity
loss. There are no set guidelines and limits for this test. Each type, style and configuration
has its own impedance, so it is important to take a baseline
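
The two comparisons described above (against the last reading for the same battery, and against the other batteries in the same bank) can be sketched as a simple screening function. The 10% and 5% limits come from the text. Comparing each cell with the median of the other cells is an assumption made here so that one bad cell does not distort the reference; the text does not say how "the others" should be summarised. All names and readings are hypothetical.

```python
from statistics import median
from typing import Dict, List


def flag_cells(current: Dict[str, float],
               previous: Dict[str, float],
               bank_tolerance: float = 0.10,
               self_tolerance: float = 0.05) -> Dict[str, List[str]]:
    """Return the cells whose impedance falls outside either limit.

    current / previous map a cell identifier to its impedance reading.
    A cell is flagged "bank" if it differs from the median of the other
    cells by more than bank_tolerance, and "last reading" if it differs
    from its own previous reading by more than self_tolerance.
    """
    flagged: Dict[str, List[str]] = {}
    for cell, impedance in current.items():
        reasons: List[str] = []
        others = [z for name, z in current.items() if name != cell]
        if others:
            reference = median(others)
            if abs(impedance - reference) > bank_tolerance * reference:
                reasons.append("bank")
        last = previous.get(cell)
        if last is not None and abs(impedance - last) > self_tolerance * last:
            reasons.append("last reading")
        if reasons:
            flagged[cell] = reasons
    return flagged


if __name__ == "__main__":
    now = {"c1": 1.00, "c2": 1.02, "c3": 0.98, "c4": 1.01, "c5": 1.30}
    before = {"c1": 1.00, "c2": 1.00, "c3": 1.00, "c4": 1.00, "c5": 1.05}
    print(flag_cells(now, before))
```

Because each type and style of battery has its own impedance, the previous readings should trace back to a baseline taken on that same battery rather than to a catalogue figure.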

Skill: Field technician.
test can be
is low level and 'rides' on top of the DC of
Disadvantages: Test could take

time on

9 A Note on Leaks
With the exception
covered in much detail in this
storage tanks. This is because
which 1'""" ' 1 11 ,,
d.,.,,,,,..,, nt,,-r, of 36 different leak detection methods

and commissioned
Laboratory, Edison,
L>, .,d--n


available from the National Technical Information

United States
Co mmerce based in ,r.,rt n Cl'Tt,,. , n


& Le Bleu J. "Condition Monitoring
8 (6) 1995
/'lVJ1rr,M U',:>


Atomic emission
Abrasion: 59
Accelerometer: 351
risk: 30, 98- 1 0 1 , 256,
Acoustic emission:
Actuarial analysis:
A.ct1nm:1steJnng RCM:
Age-related failures: 11, 131-133, 160-161, 235-238, 243-246
Association of America
Aircraft accidents: 309
Airlines and RCM: 318-321
All-metal debris
Analysis paralysis: 65
Analytical ferrography:
Anxiety: 40
Appearance: 40
n nruut rHY RCM : 1 6- 1
Asset hierarchy: 327-330
Atomic absorption spectroscopy: 374
Atomic emission spectroscopy:
Attentional failures:
Audit trail: 20, 316
Auditing RCM: 18, 214-217,


Availability: 294-304,
of hidden functions: 1 1 5- 1 1 8,

Battery impedance
Benefits of RCM:



1 69, 206

Comfort: 40

Consequence evaluation: 11,
Consequences of failure: 10-11, 15, 71, 90-127,
lV(H(l;UlCe : 9 1
catc:gm-tes: 1 0

34, 49
Criticality assessment: 280
Customer service: 3, 19, 29, 104, 201, 280
Cusum chart: 151
Secondary damage
Database: 19, 267-268, 315-317
Data: 255-260
Technical history data
Debottlenecking: 62-63
enaoscicme: 394
,. nui:cuu
u . 200-201, 267
Decision support: 5
Decision Worksheet: 198-211, 267,
Default actions: 14, 91, 170-197
Defect reporting:

Desired performance: 23-24, 130, 189,
Deterioration: 58-50,

in maintenance schedules:

the ultimate:

Control limits:
Corrator: 401-402
Corrective maintenance: 171
39 1
1 9,
maintenance: 278. 305
operating: 104, 20 I
repair: 104, 108-110,
Maintenance craftsman

Dielectric strength: 375-376

Differential absorption lidar:
rerrogn1p11: 363-364
Disassembly: 60
Scheduled discard
exr)Onenttal: 239-24 1
normal: 1 32, 236-237
survival: 236,
Weibull: 242-243
Dirt: 60
LJ1s;assemt>Jtv: 60
1 47, 256, 278, 294
Downtime: 3,
lJrnlWUHlS: 2(), 3 1 6
1 50, 349, 35 1 -362
Economy: 4 2
consequences: 1 5, 1 05 ,
1 39- 1 40
risk: 1 19, 343, 346

see Maintenance effectiveness
Failure effects
energy: 294
time: 230-23 1
1 50, 350, 40 l -4 1 1
Electrical resistance meter: 402
Electrical surge comparison: 406
Electro-chemical corrosion monitoring:


Electrostatic fluorescent penetrants: 389

Emergency medical equipment:
Energy dispersive
Environmental consequences: 3, 10, 15,
93, 94-103, 204, 263
definition of: 95
and redesign: 102, 190
standards: 30, 95,
Environmental hazards: 75
Environmental integrity: 19, 38, 308-309
Equipment effectiveness: 301
Equipment vendors: 77-78,
Ergonomics: 40
Erosion: 59
'ESCAPES' functions: 38-44
Evaporation: 59, 1 34
Evidence of failure: 74
Evident failures: 92-93
categories: 93
Evident function: 92
Exceptional violations: 338, 342
Exhaust emission analysers: 382-383
Existing assets: 1 9
maintenance schedules:
Expectations of maintenance:
Expert systems: 317
Exponential distribution: 239-241
FAA: 318-323
Fast Fourier transform
Facilitators: 17-18, 269-277

Failure modes: 9, 53-73,

270 ,




Fast Fourier transform: 351

Fatigue: 59,
and bearing failure: 1 57- 1 59
Fault-tree analysis: 344-345
Technically feasible

First Generation: 2
Flow processes:
Fluorescence spectroscopy (X-ray):
FMEA: 53-89
see also Failure modes and Failure effects
Focal plan
Fourier transform infrared spectroscopy:
Fractional dead time: 117
Failure-finding: 175-185
Maintenance schedules
Maintenance schedules
on-condition: 145-149, 163-165
scheduled discard: 138-140
scheduled restoration: 135
Frequency analysis: 356
Fuel line: 81-85
Functions: 7-8, 21-44, 215, 261-262
different types: 35-44
Evident functions
Hidden functions
8, 35-37
secondary: 8, 34, 37-44
superfluous: 43
3 29-333
Functional hierarchies: 329-333
Functional failures: 8-9, 45-52, 124, 155,
216, 262
definition: 47

Gas chromatography: 379-380

Gas station:
Gearbox failure:
failure data: 79

Graded filtration: 367-368

Hidden failures: 1 0, 1 5 ,
1 11-128,
1 72. 204, 263
normal circumstances: 1 26
OPt!ratm_g crew: 1 25- 1 26
& secondary functions: 1 25
remsrnn: 122, 191-192
of time: 1 24
Hidden functions: 1 1 1 - 1 28, 1 70
decision process: 1 23
1 1 5- 1 1 8
asset: 85, 327-330
functional: 329-333
History: see Technical history
Hopper: 195-197
Human error: 60, 61, 70, 335-342
Human senses: 149, 153-154
Human sensory factors:
l-h1a1PnP 39
Implementing RCM recommendations:
18, 212-234, 276
'<UV.>htl1tv 23 -24 , 47-48, 64, 1 30
Ini tial
Initial interval: 207-208, 2 1 7
Infant mortality: 1 3,
247-249, 3 1 l
Information Worksheet:
52, 54, 89,
202, 267, 27 1
Infra-red scanners: 399-400
Infra-red spectroscopy: 379
Initial capability: 23-24
Initial interval: 207-208
Installation and infant mortality: 247-248
Integrative framework: 317
Interfacial tension: 376
ISO 9000: 219
Job card: 233-234
Just-in-time: 3, 31
Karl Fischer titration
Kinematic viscosity test: 398

Knowledge-based mistakes: 338, 341-342
Kurtosis: 361



"LlO" l ife: see BIO life

Lapses: 338-339
Lead time to failure: 146
Leak detection: 411
safety: 103
8188, 2 1 5
LIDAR: 370
Life: 294
average: 1 32, 1 37. 238
safe-: 1 38- 1 39
useful: 19, 1 32, 1 37, 238,
Light detection and
Light extinction
counter: 366
Likely victims: 99- 100,
Linear P-F curves: 160-162
resistance: 401-402
Liquid dye penetrants: 388
Living RCM program: 277
Lubrication failure: 59,
Mean time between failures
MTTR: see Mean time to
Magnetic flux analysis: 409-410
Magnetic particle inspection: 389
definitions: 6
Maintainable: 24
Maintenance: 1 88- 1 89
challenges: 5
costs: 312-314
craftsmen: 17,
definition: 6
effectiveness: 293-304, 312-314
efficiency: 304-307
labour: 305
& control systems:
and redesign: 1 88- 1 89
supervisors: 17, 228-234,
Management information:
Manuals: 20, 316

No scheduled maintenance: 14, 15, 107,




colour: 395
396
Once-off changes: 18, 220-221
11,
145-169, 205,
intervals: 145-149
155-156

28-35, 50,


procedure: 18
15, 93,
Operational consequences:
310
103-108, 170,
definition of: 104
193-197
Operations managers: 261-268, 291
Operations supervisors: 17, 262-268, 291
262-268, 291
Overhauls: see also Scheduled restoration
Overloading: 61-64
Oxidation: 134
144-145, 157-162, 343-349
linear: 160-161
and random
205, 256,
P-F interval: 145-149,
410

vs operating
156, 160-162
Pan-view fibrescopes:
Partial discharge: 408-409
failure: 47
Particle effects: 150, 349,
Particle monitoring: 349, 362-370

282-285, 315

Patch test: 369

Patterns of failure: 12, 235-249
A: 12, 133, 249
age-related: 11, 131-133
132, 133, 235-238
C: 12, 133, 243-246
D: 12, 143, 246
143, 239-242
F: 12, 143, 246-249

Performance standards: 7, 22-27, 47, 91,

Phase: 351
asset: 6,
81, 302, 304, 327-330
effects: 150, 350, 388-398
Physiological factors: 337-338
Piper Alpha: 308
Plant register: 16, 327-330

low-frequency schedules: 224-225,
time: 230
running time: 231
Poka yoke: 339
Pore-blockage technique: 364-365
Potential failure: 14, 144-145, 154, 155, 205, 256, 310, 348-349
definition: 144
Potential monitoring: 402-403
Potentiometric titration:
TAN/TBN: 383-384
Predictive tasks: 144-169, 171
Pressure relief valve: see Relief valve
Pressure switch: 125
Preventive tasks: 133-140, 171
Primary effects monitoring: 149, 152-153


Primary functions: 35-37, 125
PRN: 280
Proactive maintenance: 129-169
Proactive tasks: 1-14, 91, 102-103, 106,
107, 170,
combination: 169
hidden failures: 121
selection: 167-169
Mean Time between Failures
note on P96

Random failure:
140, 143,
and P-F intervals: 156
Raw material supply:

& instrumentation drawings:

batch: 29
3, 19, 60, 104, 149,
277, 312
monitoring: 151-152
Production effects:
261
Project managers: 17,
Protected function: 110
mean time between failures: 119-181,
185
Protective devices: 41-42, 111-1
185
Proven technology: 247
Proximity analysis: 359
Psychological factors: 338-342
22-23, 24
different operating contexts: 28,
hidden failures: 92, 93
failure modes: 54, 56-57, 61, 65-69
failure-finding task: 178-179,
functional failures: 46
failure: 54, 56-57

redesign: 192

maintenance: 188-189
non-operational consequences: 109-110,
193-197

and nn111r'e -1 m11111P-'

Reliability-centred Maintenance (RCM):
200-201

Scheduled restoration
17, 266-269
Rigid borescopes: 393
Risk: 95-101
acceptability: 98-101
continuum of: 118-120
economic: 119,
evaluation: 101
fatal: 98-99,
of failure: 69, 159, 335-341
Rotating disc electroche;:
Routine maintenance:
and hidden functions: 120-122
Routine violations:
340-341
Rule-based mistakes:
Run-to-failure: 11, 14
Running time:
S -N
SPC: see Statistical


Safe-life limits: 138-139

Safety: 19, 38,
170, 279, 308-309
10, 15, 93, 94-

Scheduled restoration: 11, 13, 134-137, 168, 205, 238, 265
technical feasibility: 135-136
worth doing: 136-137
rate: 295
59, 71
Second Generation: 2, 11
75-77, 110
Secondary functions: 8, 35, 37-44, 125
Sediment: 369-370
Selective approach to RCM: 278-282
Senses: see Human senses
Customer service
Seven questions:
Shape parameter:
Shift arrangements: 30
Shifted Weibull distribution: 245
Shock pulse
Shutdowns: 311
Significant assets:
Simultaneous learning: 269
Skill-based errors: 339
Skills in RCM:

Smoke detector: 191


Safety first: 93

_ see Relief valve
and actuarial analysis: 251
Scale parameter:
Scanning auger electron microscopy: 381-382
Scanning electron microscopy: 381
Schedule: see Maintenance schedule
Scheduled discard: 11, 13, 137-140, 168,
technical feasibility: 138-140

worth doing: 140
Scheduled on-condition tasks:
143
Scheduled overhauls: see Scheduled restoration

Staff turnover: 20, 316
Standard operating procedure: 221-222
Stand-by pump: see Pump
control: 151-152
Steel mill: 310, 311
Strain gauges: 396-397
Applied stress
Strippable magnetic film: 389-390
Structural integrity: 39
Superfluous functions: 43
Survival distribution: 237-245
27, 48
boundaries: 17, 270, 334

TQM: 21, 288

packaging: 223
proposed: 206-207
selection process:
Teamwork: 20, 268, 315
characteristics: 15, 129
Technically feasible: 14, 90-91, 129-130, 205, 324
failure-finding tasks: 185
on-condition tasks: 149
scheduled discard: 140
scheduled restoration tasks: 135-136
Techniques maintenance: 5
150, 350, 399-401
Temperature indicating paint: 401
Templating: 281
Thermography: 399
Thin-layer activation: 380
Third Generation:
Time and hidden failures: 124-125
Time synchronous averaging analysis:
355-366
Time waveform
Total failure: 47
Total quality management:
Traditional view of failure: 1 1
to maintenance: 16

Vibration switch: 115


needs analysis: 269

in RCM: 291
wire: 42, 115
Truck: 26, 28,
Truncated Weibull distribution:
disc failure: 161-162
exhaust system: 44, 52, 89
Staff turnover
Tyres: 160-161, 162, 310

Yield:


