
Tutorial book on Asset Management -
Maintenance and Replacement Strategies

at the IEEE PES GM 2007
IR-EE-ETK 2007:004
Authors:
Dr. George Anders
Dr. Lina Bertling
Dr. Gerard Cliteur
Dr. John Endrenyi
Dr. Andrew Jardine
Dr. Wenyuan Li
Edited by:
Dr. Lina Bertling

Contents
Preface
1 Introduction
2 Maintenance as a strategic tool for asset management
2.1 Introduction
2.2 Are Utility assets aging?
2.3 Condition Assessments
2.4 Driving today’s network into the future
2.5 Biography
3 Introduction to maintenance
3.1 What is maintenance?
3.2 Review of maintenance policies
3.3 Linking reliability and maintenance: a probabilistic approach
3.4 Conclusions
3.5 References
3.6 Appendix: Deterministic or probabilistic models
3.7 Biography
4 RCM and its extension into a quantitative approach RCAM
4.1 Introduction
4.2 Reliability-centred maintenance (RCM)
4.3 Reliability-centred asset management (RCAM)
4.4 RCAM application study for an electrical distribution system [5]
4.5 Conclusions
4.6 References
4.7 Biography
5 Optimizing condition monitoring decisions for maintenance planning
5.1 Introduction
5.2 Optimizing Condition Based Maintenance Decisions
5.3 Software for CBM Optimization
5.4 Recent Developments
5.5 EXAKT Summary
5.6 Conclusion
5.7 References
5.8 Biography
6 Computer program for decision support in the management of equipment maintenance
6.1 Introduction
6.2 Asset Management Planner (AMP) Program
6.3 Asset Reliability Model (ARM) Program
6.4 Optimal refurbishment strategy
6.5 Program description
6.6 Numerical example
6.7 Conclusions
6.8 References
6.9 Biography
7 Risk Based Asset Management – Applications at Transmission Companies
7.1 Introduction
7.2 Replacement Strategy of Aged HVDC Components
7.3 Determination of the Number and Timing of Spare Transformers
7.4 Further Discussions
7.5 References
7.6 Biography
IEEE Tutorial on Asset Management – Maintenance and Replacement Strategies, 24-28 June 2007, Tampa, USA
Preface
It is a pleasure to present this book, which has been prepared for the tutorial on Asset
Management - Maintenance and Replacement Strategies at the IEEE Power Engineering Society
General Meeting, 24-28 June 2007, Tampa, Florida, USA.
The tutorial is sponsored by the Reliability, Risk and Probability Applications (RRPA)
Subcommittee group, chaired by A. W. Schneider, Jr., and the Power System Planning &
Implementation Committee (PSPI), chaired by Dr. M. L. Chan. Dr. Lina Bertling, KTH (Royal
Institute of Technology), Sweden, is the tutorial chair and editor of the book.
The book shows how maintenance is turned into a strategic tool for asset management. It gives
a review of maintenance policies and shows the link to probabilistic approaches and to
reliability-centred maintenance methods. It shows how condition-based monitoring can be
used to optimize maintenance decisions. Furthermore, it introduces computer programs for
decision support in the management of equipment maintenance. Finally, it shows applications
of risk-based asset management at transmission companies.
The material in the book has been prepared by five more authors: Dr. George Anders, Dr.
Gerard Cliteur, Dr. John Endrenyi, Dr. Andrew Jardine, and Dr. Wenyuan Li. All these authors
are well-known experts in the field of maintenance and asset management.
The idea for this tutorial came up at the 9th International Conference on Probabilistic Methods
Applied to Power Systems (PMAPS2006), held at the KTH Campus during 11-15 June 2006. The
picture below shows a memory from a workshop during PMAPS2006, which gathered several
of the authors of this book. It has been a good and busy year since then, and maintenance keeps
getting more useful as time goes by!
Lina Bertling, Editor

Stockholm, March 15, 2007
Contact for further information:

Lina Bertling
Assistant Professor
KTH Electrical Engineering
100 44 Stockholm, Sweden
Phone: +46 8 7906508
E-mail: linab@kth.se
www: www.ee.kth.se/rcam or
www.ee.kth.se/users/linab
Picture, from left: Andrew Jardine, Ulf Sandberg, Gerard Cliteur, John Endrenyi and Lina Bertling
1 Introduction
Maximal asset value and minimal life cycle cost are typical economic objectives of the electric
utilities. However, attaining these objectives is constrained by the requirements of customers and
regulators concerning the reliability of power supply. De-regulation of the electricity market has
increased the incentives for cost effective and efficient use of available assets. Optimization of
maintenance is one possible technique to reduce life cycle costs while improving reliability, and
utilities need to implement new strategies for more effective maintenance techniques and asset
management methods. The term asset management here implies making the right decisions on:
what assets to perform maintenance on, what level of maintenance to perform, what specific
maintenance steps to perform, and when to perform the selected maintenance. However, to make
the right decisions the manager needs strategic tools, planning tools, data and various support
systems.
This book covers these different needs by: showing maintenance as a strategic tool for asset
management, introducing maintenance planning methods such as reliability-centered maintenance
(RCM), showing condition monitoring methods for collecting maintenance data and maintenance
software, and finally showing an example of asset management methods in practical use in a
transmission company.
2 Maintenance as a strategic tool for asset management

Dr. Gerard Cliteur
Power System Planning & Management
KEMA, Inc.
Abstract - The importance of Equipment Maintenance and Replacement strategies addressing system
reliability issues in North American power grids is growing. Reliability problems in these grids typically
stem from lightning and weather induced outages, trees, animals and equipment deterioration. Vegetation
management, automation (especially in distribution), insulation coordination and system hardening are
common initiatives. However, none of these addresses equipment deterioration directly. As the
infrastructure is aging (average ages approach 40 years, and some equipment categories have appreciable
numbers exceeding 55 years), the question really is how long failure rates will stay constant. If they go up
due to wear-out, how fast will they increase? Can we do something about this right now? Can we, for
instance, maintain more effectively and thereby extend the equipment’s useful life? Can we apply life extension kits?
The answer is yes, but it depends on the actual business case and what the respective Utilities are already
doing. What does it cost in terms of O&M labour and materials to do all of this, and what does it buy in
terms of deferred capital spending (replacement) and improved system reliability?
Similar questions can be raised for equipment replacements going forward. Should we spend more
capital to pro-actively replace certain equipment? If so, what equipment and at what rate? How does this
affect O&M spending and system reliability? And, more challenging, in light of the other above-mentioned
options to improve system reliability, what is the most cost-effective option?
This chapter addresses these issues and the options, and provides practical examples of how utilities deal
with project ranking, prioritization and optimization under given objectives, constraints and
uncertainties.
2.1 Introduction
Asset Management is more than Condition Based Maintenance. It is less than corporate portfolio
planning. It boils down to connecting execution and funding; connecting operations with asset
ownership and corporate objectives. Asset Management is not operational excellence but is instead
focused on effectiveness: getting the most out of every capital investment or expense from a
planning perspective. It has a long-term view, strives for balanced investment-risk-performance
levels and supports the data-driven decision-making required for all ‘discretionary spending’. Thus,
and most importantly, Asset Management is for utilities with an aging asset base.
Aging is not necessarily a bad thing. Equipment condition actually may improve for a certain
period. However, it is clear that every piece of equipment eventually deteriorates due to wear,
incidents, chemical processes, etc. This needs further elaboration on two issues that are at
hand here. First, as aging and condition deterioration are time dependent, forecasting becomes of
interest. Secondly, there is a quantification issue, with uncertainties that put engineers and
managers ill at ease, either because data for informed decision-making is lacking or because there is
too much information to put on paper… Both are long-standing topics in the industry and are fraught with
uncertainty, confusion and doubt. Omitting any Sarbanes-Oxley implications, let’s start with the forecasting
issue.
2.1.1 Condition forecasting

In the medical profession, health is an individual’s physical body condition: a momentary
snapshot of that person’s well-being and potential performance. “I am healthy (currently have no
diseases) and am trained, willing and capable of running a marathon within 2 hours and 15
minutes.” This is an example of someone expressing his or her condition, a useful statement for
the application screening committee. This claim can be tested and verified, if only by having the
person run the marathon once. If, however, we change perspective and look at this claim from a
sponsor’s point of view, we will want to know a couple of additional data points. Apart from the
looks of the runner… we will want to know the age and, most importantly, how long this person can
perform up to these specifications. Any physical body is subject to condition enhancements and
deterioration, which emphasizes the importance of forecasting condition over time, together with the
related certainty.
Any professional responsible for budgets in combination with a certain expected but repeated or
continued performance by assets that can deteriorate is in need of this information. Returning to
physical Utility assets, asset managers are similarly interested in such forecasted condition data.
Adding assets like transformers is not much of a deliberate decision, as it grows the (asset base of
the) company and typically yields incremental revenues. As Utility asset bases tend to age, the
successful Utility will increasingly be the one that can deploy fact-based
decision-making related to asset replacements, often earmarked as ‘discretionary’ spending. If one
is too late, the related performance goes down and other risk elements may become exposed. If
one is too early, capital is wasted. Forecasted equipment condition⁴ and system performance feed
multi-dimensional cost-benefit analysis⁵ and improved decision-making.
2.1.2 Condition quantification

The classifiers Utilities use for equipment condition are ‘as new’, ‘good’ or ‘acceptable’, ‘critical’ and
‘urgent attention required’. Many of these are convoluted due to the inherent mix of condition and
criticality of the unit in question. We will discuss this later in this paper. Other Utilities introduce
a condition index: a parameter between 0 and 10, or even higher if extended granularity is warranted
or needed. The question really is what zero and the maximum mean. The maximum typically
refers to the ‘as-new’ condition, even though ‘as-new’ equipment has many teething troubles related to a
potential new design, manufacturing or material defects. Focusing on aging infrastructure, the
question really is what a condition index of (close to) zero means. Is this imminent failure? Likely
failure? The CFO will still ask the infamous ‘when?’.
A hazard function expresses the annual likelihood of failure (i.e. not performing up to specified
values) as a function of age. This is not to be confused with the failure rate of a certain
population. The failure rate is the number of failures in a given year divided by the number of
units of equipment in service that year. It results from weighting the more physically meaningful
hazard rates by the actual age distribution of the total population.
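The distinction between hazard function and population failure rate can be illustrated with a minimal sketch. The hazard curve, the fleet sizes and the function names below are hypothetical and chosen only for illustration; they are not taken from any Utility data set.

```python
def population_failure_rate(hazard, age_counts):
    """Expected failures per unit per year: the hazard weighted by the age mix.

    hazard     -- function mapping age in years to annual probability of failure
    age_counts -- dict mapping age in years to number of units of that age
    """
    total_units = sum(age_counts.values())
    expected_failures = sum(hazard(age) * n for age, n in age_counts.items())
    return expected_failures / total_units


def example_hazard(age):
    # Hypothetical wear-out curve (assumption): flat 0.5% per year up to
    # age 30, then rising by 0.2 percentage points per additional year.
    return 0.005 if age <= 30 else 0.005 + 0.002 * (age - 30)


# Two fleets with the same hazard function but different age distributions.
young_fleet = {10: 500, 20: 500}
aged_fleet = {30: 500, 40: 500}

print(round(population_failure_rate(example_hazard, young_fleet), 4))  # 0.005
print(round(population_failure_rate(example_hazard, aged_fleet), 4))   # 0.015
```

Both fleets share one hazard function, yet the reported failure rate triples once the age mix shifts into the wear-out region. This is why a failure rate that has been constant so far says little about the years ahead.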
Based on a hazard function one can make a replace-retain decision because it is related to an
actual physical unit. Failure rates cannot be trended, nor can they be used for single-unit replace-retain
decisions, as they depend on the total population. Utilities reporting failure rates and their constant
trends need to be more self-critical and, for instance, consider the failure rate in an imaginary
family… having 4 family members over the age of 78 with no one perishing doesn’t mean that the
rate of perishing will stay constant at that favourable figure over the next decade or so.
⁴ Condition as a function of operations and life extension measures (typically periodic maintenance).
⁵ As opposed to analyzing and benchmarking performance separate from expenditures. Multi-dimensional
benchmarking additionally takes regional differences, network differences and, most importantly, time into account
by averaging several periods (i.e. years).
2.1.3 Why this matters

Condition forecasting and quantification are important because:
1) It is expected that performance is being stretched to the limits with the current grid,
increased loadings and deteriorating equipment conditions, as witnessed by the abundant
recent summer loading related outages.
2) Uncertainty puts operations in a scrambling mode to obtain replacement dollars for
unplanned replacements (typically from planned projects) and ultimately destroys credit
ratings and customer perception.
We all have an intuitive sense of risk. Examples that come to mind are typically related to
automobiles and children. When safety is an obvious factor we all agree without being too
critical, granular and quantified. Even though risk is defined as the probability of an event times
its impact, we readily have acceptable and unacceptable classifications at hand. However, when it
comes to events that are unprecedented but have extremely high impact (e.g. the flooding of New
Orleans during Hurricane Katrina) or events we know are going to happen but are hard to assess
(e.g. when is this nice 100 or 120 Hz humming piece of steel going to give up the ghost?), we
tend to be insufficiently critical and reluctant to take pro-active measures⁶.
With most Utility assets there is a clear responsibility and benefit with being critical and open to
assessments. The impact side of the equation for a power transformer failure for instance is
related to the congestion costs, non-delivered energy, replacement/repair cost and safety related
liabilities. A power transformer can fail violently with sharp porcelain debris that cuts through
walls, oil fires and spills. Not to mention the indirect impact of negative headlines related to such
a catastrophic failure. Planned replacements that are well-timed avoid all the negative energy,
indirect dollars and effort related to emergency replacements. The biggest savings are, however,
with improved supply chain management as procurement can now anticipate the need for units
and negotiate discounts for multi-unit advanced orders with a strategic Vendor. Here is where the
large volumes of distribution equipment kick in.
Other benefits relate to improved transparency of reinvestment plans and may be used in a long-
term regulatory strategy framework. Some Utilities are indeed deploying asset condition forecasts
in relation to expected system performance under different scenarios of spending in an interactive
discussion between planning, finance and the regulator.
Forecasting and quantification are beneficial to support prudent or, better, optimal spending.
Quantification needs to take uncertainty into account, especially when forecasting 5 to 15 years out.
It is important to understand the data and algorithms that underlie the quantified hazard functions
in order to verify and improve the forecasts with each newly obtained data point (to be discussed
in the section on condition assessment). The Utility with the best data and best forecasting
algorithms, like the best performers on Wall Street, will have the highest certainty and in
aggregate lose the least money on an aging asset base. Note that this is not a general plea for
pro-active replacement strategies. It is about getting your arms around cost, risks and performance
over a certain period of time, evaluating several scenarios and making deliberate choices.
⁶ It will probably take just one major incident with an obviously rusted bridge collapsing to trigger a nationwide aging
bridge assessment and management program with corresponding capital and maintenance budget.
The next two sections will address the questions related to aging asset bases and elaborate on
what data to store and algorithms to use for condition assessment (quantification) and forecasting.
2.2 Are Utility assets aging?

Yes. All asset bases are aging and this is a good thing. Every year, an asset base gets one year
older when omitting system expansion, load growth related upgrades (upgrades comprise new
equipment as opposed to uprates where only modifications to existing equipment are performed)
and replacements (e.g. replaced poles triggered by road widening). It is the deterioration
component of aging that should worry us. If an asset base does not deteriorate, or we have some
kind of proof that it won’t occur within the next 20 years or so, we have peace of mind and can
focus solely on other Utility issues (e.g. aging workforce). As long as we are pro-actively
replacing equipment at rates less than 1%, we are inherently assuming that the equipment has a
useful life exceeding 100 years. This implies we should be accruing the money for emergency
replacement up to the assumed lifespan.
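The arithmetic behind the 1% remark can be made explicit. The sketch below assumes a steady replacement rate and a constant fleet size, so that the implied useful life is simply the time needed to turn over the whole asset base; the function name is hypothetical.

```python
def implied_useful_life(annual_replacement_rate):
    """Years needed to replace the whole asset base at a steady rate.

    annual_replacement_rate -- fraction of the fleet replaced per year,
                               e.g. 0.01 for 1% (assumed constant over time)
    """
    if annual_replacement_rate <= 0:
        raise ValueError("replacement rate must be positive")
    return 1.0 / annual_replacement_rate


print(implied_useful_life(0.01))   # 100.0 years at 1% per year
print(implied_useful_life(0.005))  # 200.0 years at 0.5% per year
```

Any rate below 1% per year therefore corresponds to an assumed life above 100 years, whether or not anyone has stated that assumption explicitly.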
2.2.1 Do we accrue money for emergency replacements?

No, because we do not assume an actual lifespan, at least not one that is documented and acted on in terms
of a dedicated replacement budget. The general belief is that the variability is large⁷ and one would
hope for the longest lifespan. As a matter of fact, ignoring indirect costs of failure and
maintenance spending, the optimal replacement age of all assets is at failure. A big secret of
operations is that Utility staff keeps fingers crossed and maintains & repairs based on experience
and engineering judgment. Not to mention the water hosing of critical power transformers during
hot summers…
2.2.2 If this is true, do we have a time bomb?

No. Hot summers and other weather events will take out the weak units in a few isolated
incidents. There will be a budget to do what is deemed necessary (…) in a one-time effort. These
events however increase the awareness of an aging (and incapable) power grid and the magnitude
of indirect costs. As the number of such events seems to increase, it may be more appropriate to
speak about an aging asset mine field.
⁷ There is also the belief that newer equipment has shorter lifespans than older equipment.

2.3 Condition Assessments

Why is it done the way it is done now? Because it is difficult, your engineers will tell you.
Because we have no data - our crews only want to repair equipment without logging the details.
Because we have no time to sit and think - there is too much capital work (new construction) and
too few resources. Most often all this is true. The major omission, however, is the creation of a
‘case with inherent proof’.
We all know and have experienced that budgets swiftly become available to address issues that
have just become painfully apparent through actual failures and related outages. If only these could have
been predicted and articulated (on paper – different from the typical “I told you so” complaints about
denied past budget applications) with likelihood and impact, for verification when an actual
occasion took place, this would make a compelling case for non-discretionary spending in order to
avoid adverse events or, at least, mitigate their impacts. It is this single omission that
jeopardizes the discussion between execution and funding.
As long as there is no compelling case with actual proof, but only strict engineering condition
assessments in language unfamiliar to even the most willing CFO, there will be little money dedicated
to the case. In the CFO’s defense, it should not be hard to imagine a host of other initiatives to be
financed with a better (or better defined) ROI or any other measure of bang-for-the-buck. Again,
the way to go is forecasted hazard functions (as the engineering side of the risk equation) in
combination with impact of failure (as the financial side of the risk equation).
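Combining the two sides of the risk equation can be sketched in a few lines. The example below uses the definition given earlier in this chapter (probability of an event times its impact). The unit names, probabilities and dollar figures are invented for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    name: str
    annual_failure_probability: float  # engineering side: from the hazard function
    failure_impact_dollars: float      # financial side: direct plus indirect cost


def annual_risk(unit):
    """Expected annual cost of failure: probability times impact."""
    return unit.annual_failure_probability * unit.failure_impact_dollars


def rank_by_risk(units):
    """Units ordered from highest to lowest expected annual cost of failure."""
    return sorted(units, key=annual_risk, reverse=True)


# Hypothetical fleet (all numbers are assumptions for illustration).
fleet = [
    Unit("T1", 0.02, 4_000_000),
    Unit("T2", 0.10, 500_000),
    Unit("T3", 0.01, 12_000_000),
]

for unit in rank_by_risk(fleet):
    print(unit.name, round(annual_risk(unit)))
```

In this sketch the unit in the best condition (T3, 1% per year) carries the highest risk because of its impact, while the unit in the worst condition (T2) ranks last. A dollar figure per unit per year is also something a CFO can set against the ROI of competing initiatives.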
2.3.1 So, what is done?

Much Utility plant is assessed on a regular basis. In fact, all plant is in theory subject to preventive
maintenance based on inspections, as even distribution line equipment is eyeballed during
walkdowns every 10-15 years. Having said that, substation equipment is typically assessed on a
monthly basis and operational data is available through SCADA systems. The assessment
includes cross examinations of inspection parameters, operational data, maintenance data and
diagnostic measurement results. The cross examinations consist of comparing the raw data to
thresholds or applying them in algorithms published by professional organizations, etc. There is
much attention to power transformers, potentially because the important deterioration
mechanisms are thermal and mechanical, which allow for extrapolation and prediction better than the
sudden dielectric phenomena in circuit breakers, for instance. Also, condition assessment of power
transformers is well reported in the literature, with more commonly accepted standards and thresholds
than for other power system devices.
The assessed conditions are typically reported in a so-called risk matrix, representing the
condition or health index on one axis and the criticality (or ‘importance’) of each unit on the other
axis. The area is then divided into three or more arbitrary zones representing categories dubbed
‘normal operation’, ‘suspected / increased monitoring’, ‘alarm 1’ (plan replacement or more
detailed assessment) and ‘alarm 2’ (take out of service immediately). The problem with these risk
matrices is twofold. Firstly, they lack time dependency on both axes. Condition deteriorates over
time and criticality changes with availability of spare parts, topology changes, added customers or
load and a host of other influences. The risk matrices indicate immediate problems but are not
predictive. Secondly, the zones are arbitrary and coarse (not quantified). It is equally arbitrary
whether a red zone is actually red and deserves spending. Again, it is the forecasting and
quantification that allow for proper allocation of dollars that, in turn, provide the real benefits and
ensure a sustainable electric power supply.
2.3.2 And what is not done?

One of the most elegant yet often omitted applications is trending the assessment outcome. Last
year we measured this value and it was 80% away from the threshold (for failure or a certain alarm
value); now it is only 70%. Correcting for potential differences in operation and maintenance
regimes, this would yield an expected remaining lifetime (all else being equal) of 7 years.
Of course, the threshold is not deterministic. If only there was one single indicator that was easy
to determine yet 100% predictive... In reality, the measurement result has inaccuracy (related to the
repeatability of the measurement) and the threshold has uncertainty (related to restricted
knowledge, past data of comparable events, etc.). However, the accuracy of the
measurement should be known and the uncertainty related to the threshold can be diminished;
incremental research may deliver more predictive results. This ‘incremental research’ is not a
static, expensive, off-line R&D assignment but can be integrated into day-to-day operations. It
requires the same data as used for the assessment itself, augmented with failure data. Equipment
failure is a unique moment to learn and improve: track age and operational data relevant to
deterioration, and at the least log the actual failure mode, as a minimal set of parameters to be evaluated
post-mortem. The process is depicted in Figure 2.1.
[Figure 2.1: flowchart, not reproduced. Recoverable box labels: theoretical knowledge (physics, design); historical knowledge (failure database, OMS); capacity additions, reliability replacements and corrective replacements; asset list (equipment types, position, make, model, year); failure mode (per equipment type); cause (per failure mode); indicator (per cause); read checks (per indicator); failure threshold (per indicator); trigger level (per failure threshold); maintenance (Mx.) orders (per trigger); maintenance plan, including criticality (system impact, safety) and backlog; perform maintenance activities (can be another read check); record results. Steps are numbered (1) to (6).]
Figure 2.1 Improvement process for integrated condition assessments
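The trending example given earlier in this section (80% away from the threshold last year, 70% now, hence 7 years remaining) is a linear extrapolation. A minimal sketch follows. It assumes a constant deterioration rate and a deterministic threshold, which the text itself cautions is a simplification, and the function name is hypothetical.

```python
def remaining_years(margin_last, margin_now, years_between=1.0):
    """Linearly extrapolate the margin to the threshold down to zero.

    margin_last, margin_now -- distance from the threshold (e.g. in percent)
                               at the previous and the current measurement
    years_between           -- time between the two measurements in years
    """
    loss_per_year = (margin_last - margin_now) / years_between
    if loss_per_year <= 0:
        # No measurable deterioration: the trend never reaches the threshold.
        return float("inf")
    return margin_now / loss_per_year


print(remaining_years(80, 70))  # 7.0
```

In practice one would fit the trend over more than two points and carry the measurement inaccuracy and threshold uncertainty through to a range of remaining lifetimes rather than a single number.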

There are two reasons why such data is not available and such analyses are not made. First of all,
the assessment related IT tools (i.e. Computerized Maintenance Management Systems) are
predominantly used for admin purposes; work tickets are generated, followed up and closed out.
The problem of field crews being unwilling to fill out the relevant data can be solved by
providing them with concise pull-down lists of data entries and with training. The real problem is the
lack of analytical engine power in these tools to load and run queries or any type of algorithm over
historic data and selected assets. As such, there is simply no possibility for review and feedback.
Secondly, there are few Utilities that have a consolidated database spanning asset registry,
operations, maintenance and planning. The Utilities that want to review and improve spend a
handful of resources in an uncoordinated one-time effort to collect the data. After this effort there
are typically only a few process adaptations to facilitate a continued effort.
2.4 Driving today’s network into the future

The most useful approach to take responsibility for an adequate future power infrastructure is a
repeated and combined fleet assessment and bad-actor approach. Both will be discussed now,
including their interrelation.
A fleet assessment requires the regular asset registry data, inspection and maintenance data, and
operational data. One can either automate the recomputation of condition scores and of predictions of
when alarm values may be reached after each newly generated data point, or do this manually after a certain
period. This effort consists of reviewing condition data against operational data to detect or
refine correlations. Every time a failure has happened, or is detected before actual failure⁸, the failure
mode is evaluated. If it is aging related, this data point is included in a revised
hazard function computation. If we know further details, such as condition data before failure, this
may lead to revised alarm thresholds or inspection & maintenance intervals, etc. It may also
provide clues with respect to indicators that are predictive but are not yet being
considered.
The fleet assessment results in three sets of information: the actual bad actors (or suspected units),
individual hazard functions for all units, and a consolidated hazard function for all comparable
units together. The bad actors can be short-listed for replacement (with timings based on their
hazard functions; one can apply a Life Cycle Cost analysis of certain alternatives with time series
of costs, including direct and indirect cost of failure) or they can be put on a watch-list for
increased attention (e.g. condition monitoring). Other measures for consideration are extending
useful life by re-rating (uprating, by deploying latent margin without modification, or downrating,
by decreasing operational parameters), upgrading (increasing the ratings by a physical modification),
refurbishment (replacement of deteriorated components), or improving the effectiveness of
maintenance. Each measure can be considered for anything from individual units up to entire asset
categories. The actual measures and budget should depend on the criticality of each listed unit, as
risk is determined not only by the condition of the unit (i.e. the hazard function) but also by the
impact of failure. The impact of failure depends, among other things, on the unit's node in the network.
8 Failure is defined as not being able to perform the specified tasks. As such, a circuit breaker for instance has failed
already when its contacts are stuck. The implications will be noticed upon a tripping signal. The failed condition
needs to be detected before this trigger with a timely condition assessment. Note that with the suggested approach
this does not necessarily imply a diagnostic measurement or inspection.

The hazard function for the entire fleet can be used to forecast next year’s failures. In this case,
we do not know for sure which units are going to fail and exactly when, but we do have a measure
for the likely quantity of units failing. This concept is depicted and described in Figure 2.2.
[Figure: "Aging Asset Base - computations". Axes: number of units (0 to 20), hazard rate (0% to 60%), age (1 to 96).]
Units prone to failure: the actual number of units failing = hazard rate times number of units.
Failed units will be inserted at the age = 0 column, representing replacement with new
equipment. This estimates the capital budget required for replacements as a baseline.
Failure rates, impact on system reliability, average population age and
corresponding maintenance budget will be computed.
Figure 2.2 Concept of hazard rate and age distribution convolution
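The convolution of hazard rate and age distribution can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the computation behind Figure 2.2: the function name `forecast_failures`, the five one-year age bins and all unit counts and hazard rates are hypothetical, and expected (fractional) numbers of failures are used instead of random draws.

```python
def forecast_failures(age_counts, hazard, years):
    """Forecast expected failures per year for a fleet.

    age_counts[a] is the number of units of age a (hypothetical data),
    hazard[a] is the annual failure probability at age a (hypothetical data).
    Failed units are replaced by new units, i.e. re-inserted at age 0.
    """
    counts = [float(n) for n in age_counts]
    failures_per_year = []
    last = len(counts) - 1
    for _ in range(years):
        # Expected failures in each age bin: hazard rate times number of units.
        failed_by_age = [n * h for n, h in zip(counts, hazard)]
        failed = sum(failed_by_age)
        survivors = [n - f for n, f in zip(counts, failed_by_age)]
        # Survivors grow one year older; the oldest bin collects everything beyond it.
        aged = [0.0] * len(counts)
        for age, n in enumerate(survivors):
            aged[min(age + 1, last)] += n
        # Replacements for the failed units enter the age = 0 column.
        aged[0] += failed
        counts = aged
        failures_per_year.append(failed)
    return failures_per_year, counts


# Hypothetical fleet of 100 units in five age bins.
age_counts = [10, 20, 30, 20, 20]
hazard = [0.00, 0.01, 0.02, 0.05, 0.10]
failures_per_year, final_counts = forecast_failures(age_counts, hazard, years=3)
print([round(f, 2) for f in failures_per_year])
```

Multiplying the yearly failure counts by a unit replacement cost would give the baseline capital budget described in the figure notes; the fleet size stays constant because every failed unit is replaced.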
As discussed, this information supports supply chain management: procurement can now
anticipate the need for units and negotiate rebates for multi-unit advance orders with a strategic
vendor. In addition, there is no scramble to find money for replacements, which could otherwise disadvantage
planned projects. Most importantly, a Utility can establish the maintenance and replacement costs
for all Utility plant going forward as a baseline, including the effects on system performance. This
baseline can then serve to compare pro-active measures such as replacements, uprates, changes in
maintenance and inspection, etc.
This quantification and forecasting will support the shift from engineering- and standards-driven
planning to performance-based planning for those Utilities that are willing to bridge the gap
between execution and funding. Figure 2.3 represents such a baseline for one of these Utilities.

[Figure: "Baseline assessment – Equipment capital costs". Vertical axis: million $ (0 to 35); horizontal axis: equipment categories. Series: forecasted annual capital costs, maximum capital cost, current spending, capital cost at sustainable point.]
Figure 2.3 Baseline capital cost assessment result for a selected Utility
For completeness, it must be mentioned that this is just the groundwork for true Asset
Management covering aging infrastructure, with maintenance, replacement, monitoring and
rerating as strategic options. However, Utilities face more challenges, such as system
hardening (being able to withstand storms), lightning- and animal-induced outages, and
vegetation. All these need to be reviewed, and potential projects and programs need to be defined,
each with alternative capital and expense options for optimisation. It is the comprehensive
approach of evaluating all T&D issues (quantified and forecasted, dealing with uncertainties) and
tying them to system performance and investment levels that will define the successful Utility.

2.5 Biography
Dr Gerard Cliteur. Gerard is a senior principal consultant with KEMA and specializes in helping
utilities improve business performance through management and technical consulting. He has 14 years of
experience in equipment condition assessment and valuation, equipment modelling and design, failure
analyses and expert witness, maintenance strategy, and asset management. He is responsible for the
initiation and management of large volume projects including consulting, R&D, process improvement, and
technical audits.
Dr. Cliteur is a recognized expert in the interpretation of inspection, maintenance and operational data in
order to assess equipment health, O&M procedures, capital project planning, budgeting, and project
prioritization in order to minimize cost, achieve performance targets, and proactively manage risk. He has
published more than thirty technical papers in these areas, and is a regular instructor for international
courses and seminars.
Prior to joining KEMA, he worked for six years at Toshiba Corporation in Japan, developing Ultra High
Voltage switchgear and he has worked for Endesa in Spain. With KEMA, he has performed consulting
assignments for major utilities including Tennet (The Netherlands), El Paso (USA), CLP Power (Hong
Kong), Public Power Corporation (Greece), Dhofar Power Company (PSE&G subsidiary in Oman),
Tenaga National Berhad (Malaysia), National Hydro Power Company (India), Cinergy (USA), and many
others.
Dr. Cliteur holds an M.Sc. in electrical engineering from Eindhoven University of Technology (The
Netherlands) and a Ph.D. from Kanazawa University (Japan), and has completed several executive
training programs on business management and finance. He is an IEEE member and chairs the Asset
Management Working Group.

3 Introduction to maintenance
Dr. J. Endrenyi, Fellow IEEE
Scientist Emeritus, Kinectrics Inc.
Toronto, Ontario, Canada
Abstract – One goal of power system operators and asset managers is, now more than ever, to minimize
system operating costs and ensure that the system is running most economically. An important operating
cost is the cost of maintenance. Those making decisions about equipment maintenance must have a clear
understanding about what maintenance can achieve, what maintenance methods are available and what
are the assumptions used in the various approaches. This presentation describes the difference between
regular and as-needed maintenance, the effect of maintenance that does not achieve as-new conditions,
and empirical and mathematical maintenance models. Probabilistic mathematical methods and
Reliability Centered Maintenance are highlighted as two promising approaches in the future.
3.1 What is maintenance?

Maintenance, according to definitions published in an IEEE Task Force Report [1], is a form of
restoration of a device where restoration is “an activity which improves the condition of a
device”. Specifically, maintenance is a “restoration wherein an unfailed device has, from time to
time, its deterioration arrested, reduced or eliminated”. This contrasts with the activity of repair,
which is a “restoration wherein a failed device is returned to operable condition.” The quoted
definitions are reprinted in Reference 2.
The purpose of maintenance, as generally perceived, is to increase the lifetime of a device and
extend its time between failures, by restoring it to a “younger” condition. This is a worthwhile
goal, because it would help to increase component and system reliability. Electric utilities have
always relied on maintenance programs to keep their equipment in good working condition. It
must be pointed out, however, that maintenance is just one of the tools for increasing reliability.
Others include adding more generation, increasing transmission redundancy and installing more
reliable components. At a time, however, when these approaches are heavily constrained, electric
utilities are forced to get the most out of the devices they already own, through more effective
operating policies, including more effective maintenance programs.
An important relation can be observed in the above definition of maintenance: the concept is
linked with the process of equipment deterioration. It is obvious that a sequence of ever-
increasing deterioration would lead to failure. Maintenance is carried out in the hope that by
slowing deterioration the (mean) time to failure can be made longer. Asset managers might be
willing to pay for an increase in relatively inexpensive maintenance activities if thereby the
number of costly repairs following failures can be reduced. But it is clear that the sum of the two
expenses will reach a point of optimum where it is the lowest. It is the task of maintenance
planners to identify this point and install maintenance policies where the minimal cost is at least
approximated.
Not every failure is the consequence of deterioration. Devices can fail for many reasons. Some are
caused by external events such as weather phenomena (lightning, ice, wind, heat), or damages
inflicted by animals or humans. The device in question sees these as random phenomena and no
oiling, adjusting, cleaning or tuning will make any difference in the frequency of such failures.
These failures are called external failures, as opposed to failures intrinsic to the device itself,
being the consequences of deterioration and ageing, which are internal failures. The times to
internal failures can be controlled by maintenance performed on the device itself. Such
maintenance is called internal maintenance, or simply maintenance, if this does not cause any
confusion.
The rates of external failures can be reduced only by changes in design, such as the erection of
barriers and fences, or improved shielding of transmission lines against lightning, or burying the
circuits under ground. In some cases one can speak of external maintenance; for example, when
trees in the vicinity of overhead lines are regularly trimmed to avoid failures due to contact with
tree branches. Note that external maintenance is performed outside the device, not on the device.
This presentation will not be concerned with external failures and maintenance.
Maintenance is an important part of asset management. As deterioration increases, the asset value
(condition) of a device decreases. The connection between asset value, time, maintenance and
reliability is shown in Figure 3.1 [3]. The curves in the figure are called life curves. Since they
are derived from probabilistic information, the times shown represent means.
Figure 3.1 Life curves
Figure 3.1 illustrates conditions for three maintenance policies, including Policy 0 where no
maintenance is performed at all. If failure is defined as the asset condition where asset value
becomes zero, and lifetime as the mean time it takes to reach this condition, the extensions of
mean life from T0 to T1 when Policy 1 is applied instead of Policy 0, and from T1 to T2 when Policy 1 is
replaced by Policy 2, can be clearly seen in the figure. So can the changes in the asset condition
(value) at any time T. Note that both failure and lifetime can be defined differently; e.g., failure
could be tied to any asset condition which is deemed unacceptable.
As far as reliability is concerned (measured in this case by the mean time to failure), Policy 2 is
superior to Policy 1. Maintenance clearly affects component and system reliability. But
maintenance has its own costs, and when comparing policies, this has to be taken into account.
The increasing costs of carrying out maintenance more frequently must be balanced against the
gains resulting from improved reliability. When costs are also considered, Policy 2 in Figure 3.1
may be very costly and, therefore, may not be superior to Policy 1.
3.2 Review of maintenance policies

Maintenance has been performed for a long time on a great variety of devices and machines, and
over the decades many routines have been devised for the purpose. Originally, maintenance
policies were chosen on the basis of long-time experience and later by following the
recommendations of manuals issued by manufacturers. In most cases, maintenance has been
carried out at regular, fixed intervals. This practice is also called scheduled maintenance (see footnote 9) and it is
still the maintenance policy most often used.
3.2.1 Improvement vs. replacement

The simplest representation of scheduled maintenance in terms of life curves is shown in Figure
3.2a. Maintenance is commenced at equally spaced times TM, 2TM, . . . (scheduled maintenance).
The diagram is constructed on the assumption that maintenance would invariably result in as-new
conditions, an assumption frequently made or tacitly implied. From Figure 3.2a it appears that
the device would never fail, except for the fact that life processes are probabilistic and failure can
occur, with low probability, at every point of a deterioration curve. Neither the curves in Figure
3.1, nor those in the various models in Figure 3.2 give account of this possibility – these
representations are inherently deterministic.
If maintenance invariably resulted in as-new conditions, it would have the same effect as
replacing the device each time with an identical new component. Only costs would decide
which one to choose; and perhaps, nowadays, replacement would win more and more often.
However, the assumption is not realistic. Maintenance is not carried out to regain 100% of the
asset’s value but only a fraction of it; in most cases, this makes maintenance cheaper than
replacement. If it is assumed that maintenance is done to 90% of the asset condition level reached
at the previous maintenance, the resulting life curve will run as shown in Figure 3.2b.
Maintenance is still triggered by reaching its due time but terminated at the predefined level
(dotted line). Now failure would occur even in the deterministic process.
9 This presentation follows the terminology proposed in Reference 1. Other terms exist and are referred to in the
terminology. The IEEE Task Force which approved the proposed terms saw no reason why any of the other terms
should be preferred to those recommended.
Figure 3.2 Life curves for various maintenance approaches: (a) “perfect” regular
maintenance, (b) imperfect maintenance, (c) as-needed maintenance - All ordinates are “Asset Conditions”
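The difference between cases (a) and (b) can be reproduced with a small deterministic simulation. The sketch below is only an illustration under simplifying assumptions that are not in the text: deterioration is taken as linear (the curves in the figure are not), the function name `scheduled_life_curve` is hypothetical, and the decay rate and maintenance interval are made-up numbers.

```python
def scheduled_life_curve(decay_per_year, interval, restore_fraction, horizon=200.0):
    """Simulate scheduled maintenance on a linearly deteriorating device.

    The asset condition starts at 100%. Every `interval` years maintenance
    restores it to `restore_fraction` times the level reached at the previous
    maintenance (1.0 = as-new, 0.9 = the 90% rule of Figure 3.2b).
    Returns (failure_time, levels), where failure_time is None if the
    condition never reaches zero within `horizon` years.
    """
    level = 100.0      # current asset condition in percent
    restored = 100.0   # level reached at the previous maintenance
    t = 0.0
    levels = []
    drop = decay_per_year * interval  # deterioration between two maintenances
    while t < horizon:
        if level <= drop:
            # Condition reaches zero before the next scheduled maintenance.
            return t + level / decay_per_year, levels
        t += interval
        restored *= restore_fraction
        # Maintenance never lowers the condition below its current value.
        level = max(level - drop, restored)
        levels.append(level)
    return None, levels


# Hypothetical numbers: 15 percentage points lost per year, maintenance every 2 years.
no_maintenance_life = 100.0 / 15.0
perfect_failure, _ = scheduled_life_curve(15.0, 2.0, 1.0)
imperfect_failure, imperfect_levels = scheduled_life_curve(15.0, 2.0, 0.9)
print(no_maintenance_life, perfect_failure, imperfect_failure)
```

With as-new restoration the simulated device never fails, whereas with the 90% rule the restored level shrinks geometrically until one maintenance interval is enough to reach zero, which is the deterministic failure noted for Figure 3.2b.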
A large number of replacement policies are described in the literature; in fact, most of the
literature concerns itself with replacement only, neglecting the possibility that maintenance may
result in smaller improvements at smaller costs. Maintenance policies involving limited condition
improvement are mostly based on experience, and such empirical approaches cannot predict and
compare changes in reliability as a result of applying various maintenance policies.
3.2.2 Regular vs. as-needed maintenance

In the last decade or so, a growing number of industrial operators saw merit in freeing up the
regularity of maintenance intervals in favor of performing maintenance only when needed (see footnote 10). This
approach obviously offers savings, but it also requires new expenses for routines to identify times
for maintenance. To find out when maintenance is needed, condition monitoring – periodic or
continuous – and appropriate criteria for triggering action are required. Development of a life
curve for this approach is shown in Figure 3.2c.
10 Actually, as-needed maintenance has been practiced for centuries. The bearings on the wheels of horse-drawn
carriages were greased only when the driver noticed that they were running dry.
The lower dotted line represents the outcome of condition monitoring; it “triggers” maintenance
as soon as the component deterioration curves (the curved lines parallel to the appropriate
sections of the M0 curve) reach it. When the resulting improvements touch the upper dotted line,
maintenance is completed. It seems that maintenance frequency increases at old age and so does
(assuming the 90% rule) the “depth” of maintenance: at the beginning, minor maintenance may
suffice, but later on, major maintenance or even overhaul may be required. The lines for policies
1 and 2 in Figure 3.1 run between the two dotted lines, obtained by some arbitrary rule, and
provide a smooth representation of the process.
3.2.3 Empirical vs. mathematical approaches

Many empirical models are simple and the rules involved are easy to understand. But they are not
very flexible and the benefits obtained from their application cannot be clearly identified. Also,
cost and reliability optimization cannot be carried out.
Notwithstanding the above, some empirical approaches developed in the last 20 years are far from
simple, but their logic is very clear and they hold the promise of being used more generally.
One such approach is Reliability Centered Maintenance (RCM), first proposed about 20 years
ago [4,5]. It is based on condition monitoring and, therefore, does not follow rigid maintenance
schedules. It includes failure cause analysis and an investigation of operating needs and
priorities. From this information, it selects the critical components in a system (those that are
dominant contributors to system failure or to the resulting financial loss) and indicates more
stringent maintenance policies for these components; in fact, it assists in deciding where the next
dollar budgeted for maintenance should go. An important advantage of the RCM approach is that
it also considers external, non deterioration-originated failures (e.g., those caused by weather,
animals, humans).
Example
Consider the case of overhead lines in distribution systems. According to fault and interruption
statistics in the UK, the percentages of failure causes of such lines are the following [6] (since
only the dominant failure causes are shown, the percentages are rounded and do not add up to
100):
Weather 55%
Damage from animals 5%
Human damage 3%
Trees 11%
Ageing 14%
The conclusion appears to be that the maintenance budget for overhead lines should be divided
almost equally between internal and external programs. The external budget would be spent
mostly on tree trimming and some design changes, such as the erection of barriers and fences. 
The RCM approach is discussed in more detail in Chapter 4 by Dr. L. Bertling.
Maintenance policies based on mathematical models are much more flexible than heuristic
policies. Mathematical models can incorporate a wide variety of assumptions and constraints, but
in the process they can become quite complex. A great advantage of the mathematical approach
is that the outcomes can be optimized. Optimization with regard to changes in some basic model
parameter can be carried out for maximal reliability or minimal costs.
Mathematical models can be deterministic or probabilistic. Since maintenance models are used
for predicting the effects of maintenance in the future, probabilistic methods are more appropriate
than deterministic ones, even if the price for their use is increased complexity and a consequent
loss in transparency. For these reasons, the use of such methods is spreading only slowly.
The simpler mathematical models are still based on fixed maintenance intervals (scheduled
maintenance), and optimization will be carried out, in most cases, through sensitivity analysis, by
varying, say, the frequency of maintenance. More complex models [7,8,9] incorporate the idea of
condition monitoring where decisions about the timing and amount of maintenance are dependent
on the actual condition of the device (predictive maintenance). Such policies can be optimized
with respect to any of the model parameters, such as the frequency of inspections.
3.2.4 A simple deterministic model

This example is based on one in Reference [10]. Consider a device that breaks down from time to
time. To reduce the number of breakdowns, inspections are made n times a year when minor
modifications may be carried out. The optimal number of inspections that minimizes the total
yearly outage time, consisting of the repair times after failures and the inspection durations, is to
be determined.
Let the failure rate be λ(n) occurrences per year, where λ is independent of time but is a function
of the inspection frequency. Therefore, the total downtime T(n) is also a function of n. Further,
let it be assumed that
\[ \lambda(n) = \frac{k}{n+1} \tag{3.1} \]
where the numerical value of k indicates the failure frequency when no inspections are made.
If t_r is the average duration of one repair and t_i the average duration of one inspection, then
\[ T(n) = \lambda(n)\,t_r + n\,t_i \tag{3.2} \]
Substituting (3.1), taking the derivative of T(n) with respect to n, and equating it with zero,
\[ \frac{dT(n)}{dn} = -\frac{k\,t_r}{(n+1)^2} + t_i = 0 \tag{3.3} \]
From the second equality, the optimal value of n becomes
\[ n_{opt} = \left( \frac{k\,t_r}{t_i} \right)^{1/2} - 1 \tag{3.4} \]
With k = 5 per yr, t_r = 6 h and t_i = 0.6 h, one obtains n_opt = 6.07 per yr; that is, the optimal
inspection frequency is about one every two months. The total outage time is T(6) = 7.9 h/yr,
whereas without inspections it would be T(0) = 30 h/yr.
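The numbers of this example can be checked with a short script. The function names below are hypothetical; the formulas are (3.1), (3.2) and (3.4), and the search over whole numbers of inspections is an addition for illustration, since in practice n is an integer.

```python
import math


def failure_rate(n, k):
    """Equation (3.1): failures per year with n inspections per year."""
    return k / (n + 1)


def total_downtime(n, k, t_r, t_i):
    """Equation (3.2): yearly outage time in hours (repairs plus inspections)."""
    return failure_rate(n, k) * t_r + n * t_i


def optimal_inspections(k, t_r, t_i):
    """Equation (3.4): real-valued optimum of the inspection frequency."""
    return math.sqrt(k * t_r / t_i) - 1


# Values from the example: k = 5 per yr, t_r = 6 h, t_i = 0.6 h.
k, t_r, t_i = 5.0, 6.0, 0.6
n_opt = optimal_inspections(k, t_r, t_i)

# Best whole number of inspections, found by direct comparison (range is arbitrary).
best_n = min(range(0, 50), key=lambda n: total_downtime(n, k, t_r, t_i))

print(round(n_opt, 2), best_n)
print(round(total_downtime(best_n, k, t_r, t_i), 1), total_downtime(0, k, t_r, t_i))
```

Comparing neighbouring integers shows how flat the optimum is: T(5) = 8.0 h/yr and T(7) = 7.95 h/yr are only slightly above T(6).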
3.3 Linking reliability and maintenance: a probabilistic approach

As already mentioned, one of the tasks of maintenance studies is cost optimization, where the
costs include both the maintenance and repair costs. Repairs are assumed to be done, of course,
after each failure. If it is decided to do maintenance more often or to more exacting standards, its
costs will increase; as a result, however, lower failure frequency and associated repair costs can
be expected. The goal is to balance these expenditures. To do so, a model is needed which can
calculate the effect of changes in maintenance parameters on the various reliability parameters. In
other words, a model which can provide a fast answer to questions like “what is the effect on the
mean time to failure if the maintenance frequency is raised by 20%?”
As one can see from the “Simple deterministic model” above, optimization is easily included in
mathematical models. On the other hand, modelling the relation between maintenance
(inspection) and reliability (failure rate) is still a problem. In the example above, this relation is
given by (3.1). It should be observed that this relation is assumed, and not a result of calculations.
What is missing is a mathematical model where this relation is part of the model itself, and the
effect of maintenance on reliability is part of the solution.
In the following, probabilistic models will be presented for a device without and with
maintenance.
3.3.1 Basic models

A simple failure-repair process for a deteriorating device is shown in Figure 3.3. The various
states in the diagram are explained in the legend. The deterioration process is represented by a
sequence of stages of increasing wear, finally leading to equipment failure. Deterioration is, of
course, a continuous process in time, and only for easier modeling is it considered to occur in
discrete steps.
Figure 3.3 State diagram including stages of deterioration (D1, D2, . . .). F: failure state.
The number of deterioration stages may vary, and so do their definitions. In most applications, the
stages are defined through physical signs such as markers on wear or corrosion. This, of course,
makes periodic inspections necessary to determine the stage of deterioration the device has
reached. The mean times of the stages are usually uneven, and are selected from performance data
or by judgment based on experience.
The process in Figure 3.3 can be readily represented by a probabilistic mathematical model. If
the rates of transitions shown between the states can be assumed time-independent, the
mathematical model describing such a process is known as a Markov model. Well-known
techniques exist for the solution of these models [11,12,13]. It can be proven that in a Markov
model the times of transitions between states are exponentially distributed. This property and the
constant-rate property follow from each other.
One way of incorporating maintenance into the model in Figure 3.3 is shown in Figure 3.4. It is
immediately clear that in this arrangement there is no assumption made that maintenance would
produce “new” conditions; in fact, the effect of maintenance can now be limited: it is assumed
that it will improve the device’s condition to that which existed in the previous stage of
deterioration [14]. This contrasts with many strategies described in the literature where
maintenance is considered equivalent to replacement.
If a failure has external causes (e.g., inclement weather), there is a single step from the working to
the failed state. Now, the constant failure-rate assumption leads to the result that maintenance
cannot produce any improvement because the chances of failure in any future time interval are the
same with or without maintenance (a property of the exponential distribution). That maintenance
will not do any good in such cases agrees with experience as expressed by the oft-quoted piece of
wisdom: “If it ain’t broke, don’t fix it!” The situation is quite different for deterioration processes
where the times from new conditions to failure are not exponentially distributed even if the times
between subsequent stages of deterioration are (this can be rigorously proven). In such a process,
maintenance will bring about improvement, and one can conclude that if failures are the
consequences of ageing, maintenance has an important role to play.
Figure 3.4 State diagram including three deterioration stages and the corresponding maintenance states (F: failure state)
In Figure 3.4, the dotted-line transitions to and from state M1 indicate that maintenance while in
state D1 should really not be performed because it would lead back to state D1 and, therefore, it
would be meaningless. State M1 could be omitted if the maintainer knew that the deterioration
process was still in its first stage and, therefore, no maintenance was necessary. Otherwise,
maintenance must be carried out regularly from the beginning, and state M1 must be part of the
diagram.
It should be observed that this and similar models solve the problem of linking maintenance and
reliability. Upon changing any of the maintenance parameters, the effect on reliability (say, the
mean time to failure) can be readily computed.
A further comparison of the model in Figure 3.4 and similar deterministic models is given in the
Appendix.
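How such a computation might look is sketched below for a model of the type shown in Figure 3.4. This is a minimal illustration, not the procedure of References [11-14]: the function name `mean_time_to_failure` is hypothetical, all transition rates are made-up numbers, and the mean time to failure is obtained from the standard mean first passage time equations of a Markov chain, solved with a small pure-Python elimination routine.

```python
def mean_time_to_failure(rates, start, failed="F"):
    """Mean first passage time from `start` to `failed` in a Markov model.

    rates maps each non-failed state to {next_state: transition rate}.
    For every state s:  q_s * m_s - sum_j rate(s -> j) * m_j = 1,
    where q_s is the total exit rate of s and m_failed = 0.
    """
    states = [s for s in rates if s != failed]
    index = {s: i for i, s in enumerate(states)}
    n = len(states)
    A = [[0.0] * n for _ in range(n)]
    b = [1.0] * n
    for s in states:
        i = index[s]
        A[i][i] += sum(rates[s].values())
        for target, rate in rates[s].items():
            if target != failed:
                A[i][index[target]] -= rate
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        b[c], b[p] = b[p], b[c]
        for r in range(n):
            if r != c and A[r][c] != 0.0:
                f = A[r][c] / A[c][c]
                A[r] = [x - f * y for x, y in zip(A[r], A[c])]
                b[r] -= f * b[c]
    i = index[start]
    return b[i] / A[i][i]


# Hypothetical rates (per year): stage transitions, start of maintenance, end of maintenance.
lam, lam_m, mu_m = 0.5, 1.0, 365.0

no_maintenance = {"D1": {"D2": lam}, "D2": {"D3": lam}, "D3": {"F": lam}}

# Maintenance from stage i leads back to the previous stage (M1 returns to D1).
with_maintenance = {
    "D1": {"D2": lam, "M1": lam_m},
    "D2": {"D3": lam, "M2": lam_m},
    "D3": {"F": lam, "M3": lam_m},
    "M1": {"D1": mu_m},
    "M2": {"D1": mu_m},
    "M3": {"D2": mu_m},
}

# Single-stage (external) failure with the same maintenance effort, for comparison.
external = {"W": {"F": lam, "M": lam_m}, "M": {"W": mu_m}}

mttf_0 = mean_time_to_failure(no_maintenance, "D1")
mttf_m = mean_time_to_failure(with_maintenance, "D1")
mttf_ext = mean_time_to_failure(external, "W")
print(round(mttf_0, 2), round(mttf_m, 2), round(mttf_ext, 3))
```

With these made-up rates the staged model gains substantially from maintenance, while the single-stage model gains only the time spent in the maintenance state itself. Changing `lam_m` and recomputing answers "what if" questions about maintenance frequency directly.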
3.3.2 The Asset Management Planner (AMP): a practical model

A more sophisticated model [15] based on the scheme in Figure 3.4 and tested in practical
applications is shown in Figure 3.5. A program, called Asset Management Planner (AMP), using
this model, was developed by Kinectrics Inc. in Toronto, Canada. It computes the probabilities,
frequencies and mean durations of the states of a component exposed to deterioration but
undergoing regular inspections and receiving maintenance on an as-needed basis.
Without maintenance, the path from the onset (entering D1) would run through the stages of
deterioration to the failure state F. With maintenance, this straight path to failure is regularly
deflected by inspection and maintenance. According to the diagram, in all stages of deterioration
regular inspections take place (I1, I2, I3), possibly several times, and at the end of each inspection
a decision is made to continue with minor (M) or major (MM) maintenance, or forgo maintenance
and return the device to the state of deterioration it was in before the inspection. Another point of
decision is after minor maintenance when, if the results are considered unsatisfactory, major
maintenance can be initiated.
Figure 3.5 The AMP model

The result of all maintenance activities is expected to be a single-step improvement in the
deterioration chain, following the principle shown in Figure 3.4. However, allowances are made
for instances when no improvement is achieved or even when some damage is done during
maintenance, the latter resulting in the next stage of deterioration. The choice probabilities (at the
points of decision making) and the probabilities associated with the various possible outcomes are
based on user input and are estimated from historical records.
Another technique, developed for computing the so-called first passage times (FPT) between states
[16], will provide the average times of first reaching any state from any other state. Although not
shown, the technique is implemented in the AMP model. If the end-state is F, the FPTs are the
mean remaining lifetimes from any of the initiating states. This information is necessary for
constructing life curves.
It can be observed that the AMP model can handle both scheduled (regular) and predictive
(as-needed) maintenance policies. Figure 3.4 shows an arrangement for scheduled maintenance: the rate
of starting maintenance is always the same. (This rate is the reciprocal of the mean time to
maintenance; the actual times constitute a random variable.) The equivalent in Figure 3.5 would be
the removal of the inspection states. The scheme in Figure 3.5, as shown, also takes care of
as-needed maintenance. Condition monitoring is done through regular inspections, and if it is found
that no maintenance is needed, the device is returned to the “main line” without being sent for
maintenance. Maintenance is carried out only when needed.
For further elaboration and detailed applications, see Chapter 6 by G. Anders.
3.3.3 Generation of life curves

Life curves have been discussed in Section 3.2, and the present process starts out from the
diagram in Figure 3.2c. Now, however, the generation of a specific life curve that accommodates
the conditions in Figure 3.4 and Figure 3.5 will be discussed. The process occurs in several steps,
as explained below with the help of Figure 3.6.
• First, the borderlines between the deterioration stages D1, D2 and D3, expressed in terms
of percentages of equipment condition, are marked on the vertical axis and entered into the
program.
• Next, AMP/FPT calculations are carried out by the program, to determine the first passage
times between states D1 and D2, D1 and D3, and D1 and F. These are entered on the
time-axis of Figure 3.6. By using the AMP model, the effects of maintenance are already
incorporated.
• If there were no maintenance, the FPTs D1D2*, D1D3* and D1F* would be obtained and
the corresponding life curve would run as shown. (This is identical to the curves M0 in
Figure 3.2.) With maintenance, the life curve is no longer a smooth line but a rugged one
indicating the deterioration between maintenances and the improvements caused by them.
A crude realization of the process is shown in Figure 3.6. Note that the placement of the
dotted lines ensures that maintenances out of the state D2 should take the device into D1,
and those out of D3 into D2 – as prescribed in Figure 3.4. Some niceties in Figure 3.5 are
not considered.
• The equivalent smooth life curve is drawn by observing the following simple rules. At
time 0 it must be at 100%, at D1F it must be 0. At the remaining two ordinates, by
arbitrary decision, it should be near the lower quarter of the respective domains. (In Figure
3.6, the midpoints are used, an earlier convention.)
Figure 3.6 Development of life curves without maintenance (a), and with maintenance (b)
3.4 Conclusions
In this review, a survey is offered of the various maintenance methods available to operators. The
methods range from the simplest, “follow the manual”-types to detailed probabilistic approaches.
To get the most out of maintenance, one would have to select a mathematical model where
optimization is possible – optimization for highest reliability or lowest operating costs. There can
be little doubt that such probabilistic models would be the best tools for identifying policies that
provide the highest cost savings.
Another choice, of which operators are becoming increasingly aware, is to apply a maintenance
policy based not on a rigid schedule but on the “as needed” principle. This can be implemented with
or without mathematical models; an example of the latter is the RCM approach. RCM, steadily
gaining in popularity, is based on an analysis of failure causes and past performance, and helps to
decide where to put the next dollar budgeted for maintenance. The method is good for comparing
policies, but not for true optimization.
In today’s competitive environment, cost optimization is becoming even more important. This is
particularly true for transmission and distribution equipment where the maintenance choices
described in this review fully apply. While the maintenance times of generating units may be
determined by different considerations, many of the basic principles discussed in this Chapter will
still have relevance.
3.5 References
[1] IEEE/PES Task Force, “The Present Status of Maintenance Strategies and the Impact of Maintenance on
Reliability”, IEEE Trans. Power Systems, 16, 4, pp. 638-646, November 2001.
[2] IEEE Tutorial on Electric Delivery System Reliability Evaluation, 05TP175, Chapter 5, “Reliability and
Maintenance”, by J Endrenyi. IEEE/PES General Meeting, San Francisco, CA, 2005.
[3] Anders, G.J. and Endrenyi, J., “Using Life Curves in the Management of Equipment Maintenance”,
Proceedings of the 7th PMAPS Conference, Naples, 2002.
[4] Smith, A.M., Reliability-Centered Maintenance. McGraw-Hill, Inc., New York, 1993.
[5] Moubray, J., Reliability-centered maintenance. Industrial Press Inc., New York, 1992.
[6] Bertling, L., Reliability Centred Maintenance for Electric Power Distribution Systems, PhD thesis, Royal
Institute of Technology (KTH), Stockholm, 2002.
[7] Canfield, R.V., "Cost Optimization of Periodic Preventive Maintenance", IEEE Trans. on Reliability, 35, 1,
pp. 78-81, April 1986.
[8] Anders, G.J. et al. "Maintenance Planning Based on Probabilistic Modeling of Aging in Rotating Machines",
CIGRE Conference Paper No. 11-309, Paris, 1992.
[9] Reichman, B. et al. "Application of a Maintenance Planning Model for Rotating Machines", CIGRE
Conference Paper No. 11-204, Paris, 1994.
[10] Jardine, A.K.S., Maintenance, Replacement and Reliability. Pitman Publishing, London, 1973.
[11] Endrenyi, J., Reliability Modeling in Electric Power Systems. J. Wiley & Sons, Chichester, 1978.
[12] Anders, G.J., Probability Concepts in Electric Power Systems. J. Wiley & Sons, New York, 1990.
[13] Billinton, R. and Allan, R.N., Reliability Evaluation of Engineering Systems, Second Edition. Plenum Press,
London, 1992.
[14] Sim, S.H. and Endrenyi, J., "Optimal Preventive Maintenance with Repair", IEEE Trans. on Reliability, 37,
1, pp. 92-96, April 1988.
[15] Endrenyi, J., Anders, G.J. and Leite da Silva, A.M., "Probabilistic Evaluation of the Effect of Maintenance
on Reliability - An Application", IEEE Trans. on Power Systems, 13, 2, pp.576-583, May 1998.
[16] Anders, G.J. and Leite da Silva, A.M., “Cost Related Reliability Measures for Power System Equipment.”
IEEE Trans. On Power Systems, 15, 2, pp. 654-660, May 2000.
3.6 Appendix: Deterministic or probabilistic models

In this Appendix, a short comparison is made between a deterministic and a probabilistic
approach describing the same situation, and a potential weakness of the deterministic approach is
pointed out. Consider a deterioration-maintenance process similar to that shown in Figure 3.4. A
deterministic equivalent is presented in Figure 3.7(a). It is assumed that without maintenance the
device would fail after (exactly) 10 years, the (rigid) maintenance interval is 3 years, and the effect of
maintenance is a 1-year improvement in deterioration. Deterioration and maintenance are still linked
through an algorithm based on the diagram; this algorithm constitutes a deterministic mathematical
model. It can be seen that the time to failure now becomes 14 years as a result of the four
maintenances carried out in the interval.
Figure 3.7: Maintenance every 3 years, resulting in (a) 1-year improvement, (b) 3-year improvement
if total wear is 6 years or more, otherwise as in (a). M – maintenance, MM – overhaul, F – failure
While it is conceivable that the improvement due to a maintenance activity is less than the
deterioration between two consecutive maintenances, especially early in the life of a device when
only minor maintenances are performed, later the effect of maintenance should equal or exceed
the deterioration occurring between maintenances. This can be ensured by scheduling overhauls
(major maintenances) beyond a given stage of deterioration. If, for instance, in the above example
overhaul is required instead of maintenance after the deterioration stage of 6 years, and if the
effect of overhaul is a 3-year improvement in deterioration, the diagram will change to that shown
in Figure 3.7 (b). Note that now the expected time to failure is infinite.
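The two deterministic cases can be checked with a short simulation. This is only a sketch under the assumptions stated in the text (unit deterioration per year, maintenance at fixed intervals, failure when accumulated wear reaches the life limit); the function name `time_to_failure`, its parameters and the finite `horizon` cut-off are illustrative choices, not part of the original model.

```python
def time_to_failure(life=10, interval=3, gain=1,
                    overhaul_gain=None, overhaul_threshold=None,
                    horizon=200):
    """Return the year of failure, or None if no failure occurs within `horizon` years.

    Wear grows by one year of deterioration per calendar year. Every `interval`
    years a maintenance removes `gain` years of wear; if an overhaul threshold is
    given and the wear has reached it, an overhaul removes `overhaul_gain` instead.
    """
    wear = 0  # accumulated deterioration, in equivalent years
    for year in range(1, horizon + 1):
        wear += 1
        if wear >= life:
            return year  # failure state reached
        if year % interval == 0:
            if overhaul_threshold is not None and wear >= overhaul_threshold:
                wear -= overhaul_gain  # overhaul (MM)
            else:
                wear -= gain           # ordinary maintenance (M)
            wear = max(wear, 0)
    return None


# Case (a): 1-year improvement every 3 years.
print(time_to_failure())
# Case (b): overhaul with 3-year improvement once wear is 6 years or more.
print(time_to_failure(overhaul_gain=3, overhaul_threshold=6))
```

In case (a) the sketch reproduces the 14-year time to failure; in case (b) the wear oscillates and no failure is found within the simulated horizon, which is the deterministic artefact discussed next.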
The problem with this deterministic representation (and many others) becomes obvious in the last
example. It is easy to visualize that if the improvement resulting from maintenance is less than the
maintenance interval, the process will tend “to the right” and end in failure. However, this can be
considered an unlikely case. Every time the improvement equals the maintenance interval, the
process will oscillate within a given range, as in Figure 3.7 (b), and if it exceeds the maintenance
interval, the process will move “to the left”. In both latter cases the implication is that failure will
never occur. This is a false conclusion and is due to the assumptions that (a) failures cannot occur
during the various stages of deterioration, and (b) all quantities involved have fixed values. If
variability is allowed and in no state is the probability of failure assumed to be zero, as in a
probabilistic model, the failure state will, sooner or later, always be reached. This agrees with
experience and can be rigorously proven.
3.7 Biography
John Endrenyi (M’59, SM’76, F’87, LF’94) is Principal Scientist Emeritus at Kinectrics, Toronto
(formerly Ontario Hydro Technologies), and retired Adjunct Professor at the University of Toronto. He
received a Diploma of Electrical Engineering from the Technical University of Budapest, the MASc
degree from the University of Waterloo (Ontario) and the Ph.D. from the University of Toronto. He joined
Ontario Hydro’s Research Division in 1959 where he was first engaged in station and transmission line
grounding studies and, later, in the development of probabilistic models for power system reliability. He
has contributed to the methodology of power system reliability and maintenance through numerous papers,
seminars, tutorials, a book, and participation in several IEEE, EPRI, CIGRE and IEC committees. In 2004,
he received the biennial award of the PMAPS (Probability Methods Applied to Power Systems)
International Society. Dr. Endrenyi is a registered Professional Engineer in the Province of Ontario.
(e-mail: john.endrenyi@rogers.com)
4 RCM and its extension into a quantitative approach RCAM
Dr. Lina Bertling, Member IEEE

KTH (Royal Institute of Technology),
Stockholm, Sweden
Abstract - Reliability-centred maintenance (RCM) is a qualitative systematic approach to organizing
maintenance. It originates from the need to develop more efficient approaches for planning preventive
maintenance without lowering the level of reliability. The main feature of RCM is its focus on preserving
system function, where components critical for system reliability are prioritized for PM measures.
However, the method is generally not capable of showing the benefits of maintenance for system reliability
and costs. For this purpose a quantitative approach to RCM has been developed, i.e. the
reliability-centred asset management (RCAM) method.
This chapter provides an overview of two different approaches to RCM, i.e. RCM II and RCAM. The
chapter also presents application studies using the RCAM approach. Results from application studies
show how the RCAM method can be used to compare different maintenance methods and PM strategies
based on the total cost of maintenance, which includes the impact of the PM measure on the system
reliability. Relating maintenance effort and reliability improvement is, however, a complex problem, and
substantial input data is required to support the method. The RCAM, as well as the RCM, approach
consequently provides a means for creating resources to provide input data.
4.1 Introduction
Reliability overall can be improved by lowering either the frequency or the duration of
interruptions. Preventive maintenance (PM) activities could impact on the frequency by
preventing the actual cause of the failure. Consequently, PM is cost-effective when the reliability
benefit outweighs the cost of implementing the PM measure. There is, therefore, a need for
utilities to incorporate systematic methods which relate maintenance of system assets to the
improvement in system reliability. This is part of the wider concept of asset management. Asset
management involves making decisions to allow the network business to maximize long term
profits, while delivering high service levels to the customers with acceptable and manageable
risks.
Reliability evaluation and maintenance planning techniques have separately been well developed,
for example [1] and [2], with reliability assessment starting in the 1930s [3]. However, few
techniques relate system reliability to component maintenance. Furthermore, the available
techniques are not generally put into practice. Reasons for this, according to the author, are
typically the lack of suitable input data, and a general reluctance to use theoretical tools to address
the practical problem of maintenance planning. There is, however, an existing approach, shown to be
successful, for relating reliability to PM, known as reliability-centred maintenance
(RCM).
This chapter briefly describes two different RCM approaches. The first method described, RCM II,
is a well-known approach proposed by John Moubray in his book "Reliability centred
maintenance" [4]. The second method presented has been developed within a research
project at KTH (The Royal Institute of Technology) and involves a high degree of modelling
[5][6]. The different steps in the two approaches are presented, and for the RCAM approach
results from application studies are included. Finally a comparison of these methods is made, and
future challenges are summarized.
4.2 Reliability-centred maintenance (RCM)

4.2.1 The background and concepts of RCM
RCM is a qualitative systematic approach to organizing maintenance [4],[7] and [8]. It originated
in the civil aircraft industry in the 1960s with the introduction of the Boeing 747 series, and the
need to lower PM costs in attaining a certain level of reliability. The results were successful and
the methodology was developed further. In 1975 the US Department of Commerce defined the
concept RCM and declared that it should be used in all major military systems [4]. In the 1980s,
the Electric Power Research Institute (EPRI) introduced RCM into the nuclear power industry.
Today RCM is used or being considered by an increasing number of electrical utilities [9], [10].
The main feature of RCM is its focus on preserving system function where critical components
for system reliability are prioritized for PM measures. However, the method is generally not
capable of showing the benefits of maintenance for system reliability and costs.
There are different versions of the RCM approach in use. In the 1980s, questions concerning the
environment became important issues. This led to more focus being put into these issues
according to Moubray [4]. Streamlined reliability-centred maintenance (SRCM) methods are simplified
versions of RCM. The streamlined versions were developed to lower the resources needed to
perform RCM.
Maintenance and reliability are important because of the large costs associated with maintenance
tasks and costs due to loss in production and breakdowns. Breakdowns can also lead to
consequences that affect the environment or personal safety. These aspects could also be taken
into consideration when performing an RCM analysis.
4.2.2 RCM according to Moubray

The RCM II method has a strong focus on environmental and safety issues. A short summary of
the method found in Moubray's book [4] is presented in this section.
The RCM II process involves asking seven questions about the studied system:
1. What are the functions and associated performance standards of the asset in its present
operating context?
2. In what ways does it fail to fulfil its functions?
3. What causes each functional failure?
4. What happens when each failure occurs?
5. In what way does each failure matter?
6. What can be done to predict or prevent each failure?
7. What should be done if a suitable preventive task cannot be found?
These steps are described in more detail below, followed by some additional features of this
method.

4.2.2.1 What are the functions of the asset?

To answer this question the asset's functions are divided into primary and secondary functions.
The primary functions are the main purposes of the asset while secondary functions are additional
properties that the asset is expected to meet. Functions should be described by a verb, an object
and a standard of performance.
4.2.2.2 In what ways does it fail to fulfil its functions?

The next step is to identify in what ways the asset can fail to perform its functions established in
step one. There could be several ways in which the asset fails to fulfil its desired functions.
4.2.2.3 What causes each functional failure?

Each functional failure may have several causes, called failure modes. It is at this level that the
maintenance of the system is to be done. It is stressed that the analysis must be applied at an
appropriate level of detail: too much detail makes the work very extensive, while too little makes
it meaningless.
4.2.2.4 What happens when each failure occurs?

The effects of the failure should be recorded. This includes evidence that a failure has occurred,
environmental or safety threats, effects on production, physical damage and how to restore the
system after the failure.
4.2.2.5 In what way does each failure matter?

This step analyses what consequences each failure leads to. First the failures are classified as
evident or hidden. Hidden failures will not be noticed if they occur on their own, whereas evident
failures will become apparent if they occur on their own. Evident functional failures are
classified according to three groups that describe what the consequences of a failure are. The
three groups listed below are ordered according to importance.
1. Safety and environmental consequences
2. Operational consequences
3. Non-operational consequences
Operational failures affect costs in connection with production and operation. Non-operational
failures affect only the cost of repair.
4.2.2.6 What can be done to predict or prevent each failure?

Examine if there is any maintenance which can be done to prevent or predict the failure. These
tasks are called preventive tasks.
Predetermined tasks which may be used are scheduled restoration and scheduled discard. These
are often appropriate when dealing with age-related failures. To use these strategies there must be
a point in time when there is an increase in the probability of failure.
Condition based tasks are used to identify potential failures. If condition based tasks are feasible
the problem of how frequently to perform these tasks must be answered. This can be a difficult
problem if reliable information about the failure probabilities and P-F intervals is hard to acquire.
Condition based tasks are feasible if a potential failure condition can be identified, the P-F
interval is reasonably constant and not too short, and the item can be monitored at intervals shorter
than the P-F interval.

Condition based maintenance and monitoring are discussed in more detail in Chapter 5 by Dr. A.
Jardine.
4.2.2.7 What should be done if a suitable preventive task cannot be found?

If no appropriate preventive task is feasible or worth doing there are three choices: redesign, no
scheduled maintenance, or failure-finding tasks. Failure-finding tasks are intended for
hidden failures. When deciding which option to choose the consequences of failures must be
considered. If the consequence is non-operational, economy can rule the choice, but when there are
safety or environmental consequences redesign might be the only option.
When applying the RCM II process there is a decision diagram that should be followed. When a
maintenance strategy is found that is feasible and worth doing it is chosen, and further analysis of
other maintenance tasks is not required. Whenever possible, scheduled on-condition tasks should
be chosen. Otherwise scheduled restoration tasks and then scheduled discard tasks are selected.
The last choice when dealing with less severe consequences (operational and non-operational) is
no scheduled maintenance or redesign. If the consequence involves environmental or safety
hazards the problem must be addressed and no scheduled maintenance is not an option.
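The order of preference described above can be sketched as a small function. This is an illustrative simplification and not Moubray's decision diagram itself: the function name `select_task`, the string labels, and the choice to route hidden failures to failure-finding before looking at the consequence class are assumptions made for this example.

```python
CONSEQUENCES = ("safety_environmental", "operational", "non_operational")


def select_task(consequence, hidden, on_condition_ok, restoration_ok, discard_ok):
    """Pick a maintenance task in the order of preference described in the text.

    consequence: one of CONSEQUENCES.
    hidden: True if the failure would not be noticed when occurring on its own.
    *_ok: True if that task type is both feasible and worth doing.
    """
    if consequence not in CONSEQUENCES:
        raise ValueError(f"unknown consequence class: {consequence!r}")
    # Preventive tasks, most preferred first; the search stops at the first hit.
    if on_condition_ok:
        return "scheduled on-condition task"
    if restoration_ok:
        return "scheduled restoration"
    if discard_ok:
        return "scheduled discard"
    # No suitable preventive task was found: fall back to the default actions.
    if hidden:
        return "failure-finding task"
    if consequence == "safety_environmental":
        return "redesign"  # "no scheduled maintenance" is not an option here
    return "no scheduled maintenance or redesign"


print(select_task("operational", False, False, True, True))
```

Because the function returns at the first acceptable task, it mirrors the property that later alternatives are never evaluated once a feasible strategy has been found.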
4.2.2.8 Characteristics of RCM II

Moubray's method has a predetermined preference of maintenance strategies. The method steers
towards performing preventive tasks rather than corrective tasks after a failure. Of the preventive
strategies condition based maintenance is preferred to pre-determined maintenance.
Environmental and safety consequences have a high priority in the analysis.
Since the process stops when an acceptable maintenance strategy is found it is possible that
another type of strategy would be more efficient if it also were evaluated. On the other hand some
work can be saved and the process is made faster this way.
4.3 Reliability-centred asset management (RCAM)

The RCAM method is developed from RCM principles attempting to relate more closely the
impact of maintenance to the cost and reliability of the system. The method has been developed
from comprehensive application studies for real power distribution systems [5],[6] and [11].
As a first step in the method, the critical components for the system reliability are identified from
a sensitivity analysis. These components are further studied, focusing on the impact of
maintenance measures. The relationship between reliability and maintenance has been established
by relating the effect of PM to the causes of failures for the component being assessed. Two
different approaches have been used. The first approach assumes a constant reduction ratio
between failure rates and the effect of PM, whereas the second approach assumes this ratio to be
dependent on time. In the first case λ(PM) depends only on the effect of PM (Approach I). In the
second case, λ(t,PM) is also time-dependent (Approach II), and the failure rate reduction is a
consequence of the PM actions considered for the specific component that is studied.
Formulating the failure rate model for Approach II is a complicated task. Studies on this have
been made for the underground cable component [5], [6] and [11] and for breaker components
[11], [12],[13] and [14]. Results from these studies are presented in Section 4.4.

The main stages of the RCAM approach are:

Stage 1, System reliability analysis: defines the system and evaluates critical components affecting
system reliability.
Stage 2, Component reliability modelling: analyzes the components in detail and, with the support
of appropriate input data, defines the quantitative relationship between reliability and PM
measures.
Stage 3, System reliability and cost/benefit analysis: puts the results of Stage 2 into a system
perspective, and evaluates the effect of component maintenance on system reliability and the
impact on cost of different PM strategies.
These three stages emphasize a central feature of the method: that the analysis moves from the
system level to the component level and back to the system level.
4.3.1 Economic evaluation

The economic evaluation brings the RCAM analysis to its final step: relating the impact of
maintenance on reliability to its benefits in cost. The motivation for any PM strategy is that the
cost of applying the PM measure should be less than the cost of taking no action at all. If little or no PM is
done, then more system failures are likely to occur resulting in more repair actions being required,
i.e. in more corrective maintenance (CM) actions. Therefore, the important issue is to compare the
costs associated with different maintenance methods, including both PM and CM with the
objective of minimizing the total cost of maintenance.
There are several costs that can be related to the effect of system failures. Two direct utility costs
are: (a) cost of failure (CM), e.g. repair costs and losses in revenue due to non-delivered energy,
and (b) cost of the PM actions, e.g. planned maintenance or replacement of a component in
advance of failure. However, the cost of failure also depends on the customer cost [15]. A supply
interruption affects the customer, who will suffer supply unavailability and may suffer direct costs
and/or be compensated via a penalty payment. Consequently, the proposed cost analysis
considers:
• the cost of failure $C_f$
• the cost of preventive maintenance $C_{PM}$
• the cost of interruption $C_{int}$
The optimal maintenance method and PM strategy is the solution that minimizes the sum of these
three costs. However, in some cases it may not be necessary to include $C_{int}$, for example for a
simple or first order comparison of strategies.
Section 4.4 presents an application study following the RCAM approach. The economic
evaluations have been made using fundamental techniques. The costs are evaluated on an annual
basis with an assumed increase due to inflation $d_1$. Furthermore, the investments in PM measures
are spread over the remaining time of the assessment period $T$. Finally the present worth value of
the total annualized costs is evaluated. The present worth value of one outlay ($C$) to be paid after
$n$ years with the discount rate $d_2$ is obtained by multiplying by the present worth value factor
$$PV_f(n, d_2) = (1 + d_2)^{-n}$$
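As a small numerical illustration, the present worth value factor can be computed directly. The function names `present_worth_factor` and `present_worth` and the example figures are chosen here for illustration only.

```python
def present_worth_factor(n, d2):
    """Present worth value factor (1 + d2)**(-n) for an outlay paid after n years."""
    return (1.0 + d2) ** (-n)


def present_worth(cost, n, d2):
    """Present worth of a single outlay `cost` paid after n years at discount rate d2."""
    return cost * present_worth_factor(n, d2)


# An outlay of 110 cost units paid after one year, discounted at 10 % per year.
print(round(present_worth(110.0, 1, 0.10), 6))
```

With a 10 % discount rate, an outlay of 110 due in one year is worth 100 today, which is the kind of adjustment applied to each annualized cost before the strategies are compared.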

4.3.2 The steps in the RCAM approach [6]

Figure 8 illustrates the logic for the RCAM method. This figure includes the different stages and
steps in the method, and the systematic process for analyzing the system components and their
causes of failures.
The ten steps needed to perform the RCAM approach, as identified in Figure 8, are presented in
more detail in this section.
[Figure 8: flowchart of the RCAM method, leading to the RCAM plan.
Stage 1, System reliability analysis: 1. Define reliability model and required input data;
2. Identify critical components by reliability analysis.
Stage 2, Component reliability modelling (for each critical component i, PM method j, and failure
cause k): 3. Identify failure causes by failure mode analysis; 4. Define a failure rate model;
5. Model effect of PM on reliability; 6. Deduce PM plans and evaluate resulting model. The steps
are repeated while there are more causes of failures, alternative PM methods, or more critical
components.
Stage 3, System reliability cost/benefit analysis: 7. Define strategy for PM (when, what, how);
8. Estimate composite failure rate; 9. Compare reliability for PM methods and strategies;
10. Identify cost-effective PM strategy.]
Figure 8 Logic for the RCAM approach [6].

4.3.2.1 Define reliability model and required input data.
Define input data including: network data, component reliability data and customer data, and a
reliability model.
4.3.2.2 Identify critical voltage levels and components for the system reliability based on results
from reliability analysis.
The approach for the sensitivity analysis is as follows: categorize components according to their
type, vary their input failure rates for one type at a time, and evaluate the resulting indices for the
system and different load points. Perform this analysis for different voltage levels and load points.
The results provide a prioritized list of components for PM measures.
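The sensitivity analysis can be sketched as follows. The sketch assumes a deliberately simple stand-in for the reliability model, a series system whose failure rate is the sum of the component failure rates weighted by the number of components of each type; a real study would evaluate the system and load-point indices with the reliability model of Step 1. The function names, the 10 % variation and the input figures are illustrative assumptions.

```python
def system_failure_rate(rates, counts):
    """Failure rate of a series system: sum over types of (rate per component) x (count)."""
    return sum(rates[ctype] * counts[ctype] for ctype in rates)


def rank_component_types(rates, counts, factor=1.1):
    """Vary the failure rate of one component type at a time and rank the types
    by the resulting change in the system index (largest change first)."""
    base = system_failure_rate(rates, counts)
    impact = {}
    for ctype in rates:
        varied = dict(rates)
        varied[ctype] = rates[ctype] * factor  # vary this type only
        impact[ctype] = system_failure_rate(varied, counts) - base
    return sorted(impact, key=impact.get, reverse=True)


# Hypothetical input data: failures per year per component, and component counts.
rates = {"cable": 0.05, "breaker": 0.01, "transformer": 0.02}
counts = {"cable": 10, "breaker": 4, "transformer": 3}
print(rank_component_types(rates, counts))
```

The ranked list plays the role of the prioritized list of components for PM measures.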
4.3.2.3 Identify failure causes by failure modes analysis for each component identified as
critical and affected by PM
• Identify causes of failures from an understanding of: component functions, failure modes and
failure events.
• Determine the percentage each cause contributes to the total number of failures from
interruption data and expertise.
• Identify experience data for interruptions due to these causes of failures.
• Identify possible effect of alternative PM methods.
4.3.2.4 Define a failure rate model

For components $i$, $i = 1, \ldots, n$, model the failure rate function $\lambda^i$ as follows:
4.3.2.5 Approach I: Assume that the failure rate equals the average failure rate, $\lambda^i_{av}$,
from reliability input data (from Step 1):
$$\lambda^i = \lambda^i_{av} \qquad (4.1)$$
4.3.2.6 Approach II: Assume that the component failure rate function can be obtained as a sum
of contributions from the different causes of failures of type $k$, $k = 1, \ldots, m$. Deduce a model for the
failure rate as a function of time, using experience data from Step 2 for the failure rate modeling,
as follows:
$$\lambda^i(t) = \sum_{k=1}^{m} \lambda^i_k(t) \qquad (4.2)$$
4.3.2.7 Model the effect of PM methods on reliability for each failure cause
Assume that the PM method $j$, $j = 1, \ldots, z$, preventing failure cause $k$ is applied to component number
$i$. For each PM method $j$ define a failure rate model as follows:
4.3.2.8 Approach I
• Assume that the effect of applying PM is an $x_{jk}$ % reduction of the actual failure cause $k$,
where $x_{jk} \in [0, a]$ and $a$ is the percentage contribution of that failure cause to the total
failures, given from Step 3.
• Assume that the failure rate for the analysed component is reduced by the same percentage.
The resulting failure rate function can be evaluated from:
$$\lambda^i(PM) = \lambda^i_{av} \left( 1 - \sum_{j=1}^{z} \sum_{k=1}^{m} \frac{x_{jk}}{100} \right) \qquad (4.3)$$
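A minimal sketch of the Approach I calculation in equation (4.3) is given below. The function name `failure_rate_with_pm`, the dictionary layout keyed by (PM method, failure cause), and the optional check against each cause's share of the total failures are assumptions made for this example.

```python
def failure_rate_with_pm(lambda_av, reductions, shares=None):
    """Failure rate after PM under Approach I.

    lambda_av: average failure rate of the component without PM.
    reductions: dict mapping (j, k) -> x_jk, the reduction in percent of total
        failures achieved by PM method j on failure cause k.
    shares: optional dict mapping k -> a, the percentage of total failures due to
        cause k; if given, each x_jk must lie in [0, a].
    """
    for (j, k), x in reductions.items():
        if x < 0:
            raise ValueError(f"negative reduction for method {j}, cause {k}")
        if shares is not None and x > shares[k]:
            raise ValueError(f"reduction for cause {k} exceeds its share of failures")
    total = sum(reductions.values())
    if total > 100.0:
        raise ValueError("total reduction exceeds 100 %")
    return lambda_av * (1.0 - total / 100.0)


# Hypothetical data: one PM method (j = 1) acting on two failure causes.
print(failure_rate_with_pm(0.2, {(1, 1): 10.0, (1, 2): 15.0}, shares={1: 30.0, 2: 20.0}))
```

In this example a 25 % total reduction lowers the failure rate from 0.2 to about 0.15 failures per year.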

4.3.2.9 Approach II
• Deduce a model for functional relationship between reliability and PM activities as a function
of time. This model requires more knowledge about the component behaviour and the effect
of applying PM with method j and the impact on specific failure causes.
• The resulting failure rate function can be evaluated from:
$$\lambda^i(t, PM) = \sum_{j=1}^{z} \sum_{k=1}^{m} \lambda^i_{jk}(t, PM) \qquad (4.4)$$
4.3.2.10 Deduce different plans for applying PM, and evaluate the resulting effect on the
component failure rate
Note that for Approach II this requires the effect of applying PM at different times on the
resulting failure rate functions to be evaluated.
4.3.2.11 Define and implement different strategies for PM

A PM strategy, S, for the system is defined by:
• applied PM methods $j$ denoted by: $j \supseteq S$,
• proportion of the components of type $i$ that are affected by each PM method, denoted by: $s_j$,
and also for Approach II, and within the period $t \in [t_0, T]$:
• number of times PM is applied, $v$, and
• at what times PM is applied, $(t_{PM1}, t_{PM2}, \ldots, t_{PMv})$
4.3.2.12 Estimate the resulting composite failure rate.

This step implies developing the failure rate model for the component i applied with PM strategy
S. The resulting failure rate function provides the input data for component type i to the system
reliability model.
• Define which failure causes are affected by each PM method j in the strategy. Let
k ⊇ j denote the affected causes, and k ⊆ j denote the non-affected causes.
• The resulting failure rate function captures the average composite failure rate characteristic
for the component i. It is made up of several parts, depending on the PM strategy.
4.3.2.13 Approach I
• Define the extent of the effect for each failure cause, affected by PM method $j$, that is $x_{jk}$.
• Evaluate the resulting composite failure rate for component type $i$ which is given as follows:
$$\lambda^i(S) = \sum_{j \subseteq S} \lambda^i_j + \sum_{j \supseteq S} \left( (1 - s_j) \cdot \lambda^i_j + s_j \cdot \sum_{k \supseteq j} \lambda^i_{jk}(PM) + s_j \cdot \sum_{k \subseteq j} \lambda^i_{jk} \right), \quad \text{where } \lambda^i_j \Leftrightarrow \sum_{k=1}^{m} \lambda^i_{jk} \qquad (4.5)$$
4.3.2.14 Approach II
• The following equations define the resulting failure rate function:
$$\lambda^i(t, S) = \begin{cases} \lambda^i_0(t) & t_0 \le t \le t_{PM1} \\ \lambda^i_1(t) & t_{PM1} \le t \le t_{PM2} \\ \vdots & \vdots \\ \lambda^i_v(t) & t_{PMv} \le t \le T \end{cases} \qquad (4.6)$$
where
$$\lambda^i_0(t) = \lambda^i(t)$$
$$\lambda^i_1(t) = \sum_{j \subseteq S_1} \lambda^i_j(t) + \sum_{j \supseteq S_1} \left( (1 - s_{1j}) \cdot \lambda^i_j(t) + s_{1j} \cdot \sum_{k \supseteq j} \lambda^i_{jk}(t, PM) + s_{1j} \cdot \sum_{k \subseteq j} \lambda^i_{jk}(t) \right)$$
$$\vdots$$
$$\lambda^i_v(t) = \sum_{j \subseteq S_v} \lambda^i_j(t) + \sum_{j \supseteq S_v} \left( (1 - (s_{1j} + s_{2j} + \cdots + s_{vj})) \cdot \lambda^i_j(t) + \lambda^i_{v-1}(t) + s_{vj} \cdot \sum_{k \supseteq j} \lambda^i_{jk}(t, PM) + s_{vj} \cdot \sum_{k \subseteq j} \lambda^i_{jk}(t) \right)$$
4.3.2.15 Compare system reliability when applying different maintenance methods and PM
strategies.
• Perform system reliability analysis with result from Step 8 as input data for included
components. The output is the system and load-point reliability indices that show the
different effects of the PM strategy (S) on the system.
• Compare the impact of PM strategy (S) on system and load-point reliability indices.
• For Approach II, an alternative is to compare the average load-point indices during the
period, evaluated as follows:
$$\lambda_{av,Lpi} = \frac{\Delta t}{T - t_0} \sum_{i=1}^{(T - t_0)/\Delta t} \lambda_{Lpi}(t_i, S) \qquad (4.7)$$
and similarly: $U_{av,Lpi}$, $r_{av,Lpi}$, $E_{av,Lpi}$ for each load point, $L_{pi}$, in the system model.
• Analyse the effect of using different PM strategies on system reliability.
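The averaging of a load-point index over the assessment period can be sketched as below, reading equation (4.7) as a time average over steps of length Δt. The function name `average_index` and the sample values are assumptions made for this example; the same helper could be applied to the other indices.

```python
def average_index(values, dt, t0, T):
    """Average of a load-point index sampled at steps of length dt over [t0, T].

    values: index values for one load point under a given PM strategy,
        one per time step; there must be (T - t0) / dt of them.
    """
    steps = round((T - t0) / dt)
    if len(values) != steps:
        raise ValueError(f"expected {steps} values, got {len(values)}")
    return dt / (T - t0) * sum(values)


# Hypothetical yearly failure rates at one load point over a three-year period.
print(round(average_index([0.1, 0.2, 0.3], dt=1.0, t0=0, T=3), 6))
```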
4.3.2.16 Identify cost effective PM strategy

• Evaluate cost functions in [cost/yr], based on those that were introduced in Section 4.3.1:
  • the cost of failure $CCM_f$
  • the cost of preventive maintenance $CCM_{PM}$
  • the cost of interruption $CCM_{int}$
with and without PM respectively as follows:
4.3.2.17 Approach I
$$CCM_f = \sum_{i=1}^{n} \lambda^i \cdot c^i_f, \quad CPM_f(S) = \sum_{i=1}^{n} \lambda^i(S) \cdot c^i_f \qquad (4.8)$$
where $c^i_f$ is the cost of failure for component $i$ [cost/int].
4.3.2.18 Approach II
$$CCM_f(t) = \sum_{i=1}^{n} \lambda^i(t) \cdot c^i_f \cdot (1 + d_1)^t$$
$$CPM_f(t, S) = \sum_{i=1}^{n} \lambda^i(t, S) \cdot c^i_f \cdot (1 + d_1)^t \qquad (4.9)$$
where $d_1$ is the inflation rate.
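The time-dependent cost of failure can be sketched as follows: the failure rate of each component in year t is multiplied by its cost per failure, and the sum is scaled by the inflation factor. The function name `failure_cost` and the input figures are assumptions made for this example.

```python
def failure_cost(t, rates, unit_costs, d1):
    """Annual cost of failure in year t, in [cost/yr].

    rates: failure rates of the n components in year t [failures/yr].
    unit_costs: cost per failure for each component [cost/failure].
    d1: annual inflation rate.
    """
    if len(rates) != len(unit_costs):
        raise ValueError("rates and unit_costs must have the same length")
    base = sum(lam * c for lam, c in zip(rates, unit_costs))
    return base * (1.0 + d1) ** t


# Two hypothetical components, evaluated in year 2 with 10 % inflation.
print(round(failure_cost(2, [0.1, 0.2], [1000.0, 500.0], 0.10), 6))
```

Passing the failure rates without PM gives the CM cost, and passing the rates under a strategy S gives the corresponding PM cost, so the same helper serves both expressions.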

4.3.2.19 Approach I
$$CPM_{PM}(S) = \sum_{i=1}^{n} \sum_{j \supseteq S} C^i_{PMj} \qquad (4.10)$$
where $C^i_{PMj}$ is the cost of applying PM method $j$ for component $i$ [cost/measure].
4.3.2.20 Approach II
$$CPM_{PM}(t, S) = \begin{cases} 0 & t_0 \le t \le t_{PM1} \\ \sum_{i=1}^{n} \sum_{j \supseteq S_1} \dfrac{C^i_{PMj}}{T - t_{PM1} + 1} & t_{PM1} \le t \le t_{PM2} \\ \vdots & \vdots \\ \sum_{v} \left( \sum_{i=1}^{n} \sum_{j \supseteq S_v} \dfrac{C^i_{PMj}}{T - t_{PM1} + 1} \right) & t_{PMv} \le t \le T \end{cases} \qquad (4.11)$$
where the cost of applying PM, at each PM occasion, is equally spread over the remaining
time period.
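The spreading of PM costs can be sketched as below. The sketch follows the sentence above and assumes that the cost of each PM occasion is divided equally over the years from that occasion to the end of the period T, inclusive, and that the shares of successive occasions add up; the function name `annual_pm_cost` and the figures are illustrative.

```python
def annual_pm_cost(year, occasions, T):
    """Annualized PM cost in a given year, in [cost/yr].

    occasions: list of (t_pm, cost) pairs, one per PM occasion, where `cost` is
        the total cost of the PM measures applied at time t_pm.
    T: last year of the assessment period.
    Each cost is spread evenly over the years t_pm, t_pm + 1, ..., T.
    """
    total = 0.0
    for t_pm, cost in occasions:
        if t_pm <= year <= T:
            total += cost / (T - t_pm + 1)
    return total


# Hypothetical plan: PM costing 800 in year 3 and 400 in year 7, with T = 10.
plan = [(3, 800.0), (7, 400.0)]
print([annual_pm_cost(y, plan, 10) for y in range(1, 11)])
```

Summed over the whole period, the annualized values return the total amount spent on PM, which is a convenient consistency check.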
4.3.2.21 Approach I
CCM int = c int
Lpi
⋅ E Lpi , CPM int (S ) = c int
Lpi
⋅ E Lpi (PM ) (4.12)
Lpi
where c int is the customer interruption cost in [cost/kWh].
4.3.2.22 Approach II
$$CCM_{int}(t) = \sum_{i=1}^{nlp} E_{Lpi}(t, CM) \cdot c_{int}^{Lpi} \cdot (1 + d_1)^t$$
$$CPM_{int}(t, S) = \sum_{i=1}^{nlp} E_{Lpi}(t, S) \cdot c_{int}^{Lpi} \cdot (1 + d_1)^t \qquad (4.13)$$
• Evaluate the total annualized costs in [cost/yr]:

4.3.2.23 Approach I
$$TCCM = CCM_f + CCM_{int}$$
$$TCPM(S) = CPM_f(S) + CPM_{int}(S) + CPM_{PM}(S) \qquad (4.14)$$
4.3.2.24 Approach II
$$TCCM(t) = CCM_f(t) + CCM_{int}(t)$$
$$TCPM(t, S) = CPM_f(t, S) + CPM_{int}(t, S) + CPM_{PM}(t, S) \qquad (4.15)$$
• Evaluate present values in [cost].

4.3.2.25 Approach I
The same value as given by (4.14).
4.3.2.26 Approach II
$$TCCM_{PV}(t) = \sum_{t=t_0}^{T} TCCM(t) \cdot PV_f(t, d_2)$$
$$TCPM_{PV}(t, S) = \sum_{t=t_0}^{T} TCPM(t, S) \cdot PV_f(t, d_2) \qquad (4.16)$$
The cost-effective solution is the maintenance strategy that provides the lowest total cost when
comparing the total costs for PM with different sets of S , and with no PM, that is CM.
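The selection rule stated above can be sketched in a few lines for Approach II. The present value factor is defined elsewhere in the tutorial; the sketch assumes the usual discount form 1/(1 + d2)^(t − t0), and all cost figures, strategy names and the rate d2 are invented.

```python
def present_value_factor(t, d2, t0=0):
    # Assumed form of PV_f(t, d2); not taken from this section.
    return 1.0 / (1.0 + d2) ** (t - t0)


def total_present_value(annual_costs, d2, t0=0):
    """Sum annual total costs TC(t), t = t0, t0 + 1, ..., discounted to t0."""
    return sum(
        cost * present_value_factor(t0 + k, d2, t0)
        for k, cost in enumerate(annual_costs)
    )


# Hypothetical annual total costs [cost/yr] for CM and for two PM strategies.
strategies = {
    "CM": [100.0, 100.0, 100.0],
    "PM, S1": [120.0, 90.0, 80.0],
    "PM, S2": [150.0, 85.0, 75.0],
}
d2 = 0.05
pv = {name: total_present_value(costs, d2) for name, costs in strategies.items()}
best = min(pv, key=pv.get)
print(best, round(pv[best], 2))
```

Including the CM case in the same dictionary keeps "no PM" as an explicit candidate, as the comparison requires.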
4.4 RCAM application study for an electrical distribution system [6]

This section provides results from application studies using the RCAM approach to assess an
urban electrical distribution system, the Birka system. The application study includes failure rate
modelling for the underground cables, including the effect of PM on one failure cause
(water-treeing). For each of the results presented, the corresponding step in the RCAM method is noted.
4.4.1 Stage 1 - System reliability analysis for the Birka system (Step 1-3)
The disturbance data for the Stockholm city power system (from the 220, 110, 33 and 11 kV levels)
for the period 1982-1999 were surveyed [17]. The statistics showed that the 11kV voltage level
contributed most to the number of failures and customers affected. A system was selected to
investigate this voltage level in more detail.
This system includes the 220/110 kV Bredäng station and 33/11 kV Liljeholmen station, which
are connected to each other via two parallel 110 kV cables. From the Liljeholmen station (LH11)
there are 32 outgoing 11 kV feeders that supply the southern part of central Stockholm and 14,300
customers. Figure 9 shows the resulting model for the Birka system. Customers are represented
as one average 11kV load point. The following component types were included: bus bars,
breakers, underground cables, and transformers. Furthermore, these were categorized into the
different voltage levels between 220-11kV.

[Diagram: components c1 to c58 arranged across the 220 kV, 110 kV, 33 kV and 11 kV levels, with nodes labelled Sp, SJ, HD and LH11.]
Figure 9 Reliability model of the Birka system [5]
Figure 10 Identifying critical components for the Birka system with cases: (1) base case,
(2) bus bars, (3) breakers, (4) cables, and (5) transformers. (Step 2.)

The reliability of the Birka system was analysed using input reliability data from experience and
statistics and the RADPOW tool (a computer program developed at KTH [5]) [16]. Figure 10 shows
results from Step 2 in the RCAM method defining the critical components. For each case, a
specific component failure rate is assumed to be zero, and the resulting effect on the load point
indices is evaluated. Case 1 refers to the base case with no PM. The most significant reduction
occurs in Case 4, when cables are considered 100% reliable. This shows that these have the
greatest impact on the failure rate and the unavailability for the average 11kV customer. The
significant rise in average outage time is because the repair time for the dominant population of
cables, that is 11kV, is much lower than the repair times for the other components. Therefore the
average restoration time increases when the number of short interruptions is reduced. The
conclusion is that the 11kV cables are critical components for this system.
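The rise in average outage time when the cables are made perfect can be reproduced with a toy series system, for which λ = Σ λi, U = Σ λi·ri and r = U/λ. This is only a sketch: the Birka model is not a pure series system and was evaluated with RADPOW, and the component data below are invented.

```python
def load_point_indices(components):
    """Series-system indices: failure rate, unavailability, mean outage time."""
    lam = sum(c["lam"] for c in components.values())         # [int/yr]
    U = sum(c["lam"] * c["r"] for c in components.values())  # [h/yr]
    r = U / lam if lam > 0 else 0.0                          # [h/int]
    return lam, U, r


# Invented data: cables fail often but are restored quickly.
components = {
    "cables": {"lam": 0.40, "r": 1.0},
    "transformers": {"lam": 0.05, "r": 4.0},
    "breakers": {"lam": 0.05, "r": 2.0},
}

base = load_point_indices(components)
results = {}
for name in components:
    # Treat one component type as 100% reliable and re-evaluate the indices.
    perfect = {k: v for k, v in components.items() if k != name}
    results[name] = load_point_indices(perfect)

print("base:", base)
for name, indices in results.items():
    print(f"perfect {name}:", indices)
```

Removing the frequent, short cable interruptions lowers λ and U the most, while the remaining interruptions are the longer ones, so r increases.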
4.4.2 Stage 2 – Component reliability modeling (Step 3-6)

A comprehensive failure modes analysis was made (Step 3) using 18 years of data and 58
interruptions that were caused by the 11kV underground cables. The underlying causes of failures
for each of these interruptions were investigated. The class of material or method made the most
significant contribution with 59% of the total failures, including the underlying failure causes of
material faults.
Approach I
The information from the failure modes analysis provides input data for the failure rate modelling
(Step 4).
Approach II
Data from the statistics (Step 3) were complemented with practical experience. From discussions
with maintenance personnel a list of underlying causes of cable faults was defined. One of these
causes was water treeing. This is a tree-like phenomenon that involves water penetration through
the insulation, occurring primarily in the early produced (mid-1970s) XLPE insulation cables.
Data related to this failure were collected and selected. These include disturbance statistics [18],
measurements and modelling of the cable condition [19], and PM of cables [20]. One effective
method for preventing failures of water-treed cables is the rehabilitation method [20][21]. This
involves injecting a silicon-based liquid between the individual wires of the conductor, which
stops the growth of the current water trees. The water trees, on the other hand, impact on the
breakdown strength of the cable, which can be measured with diagnostic methods. Based on the
experience data and the logic shown in Figure 11, a failure rate model (Step 4) and a functional
relationship between the failure rate and the effect of PM measures (Step 5) were defined [5].
Water-tree growth → Decreased breakdown voltage → Increased failure rate
Figure 11 Process to relate underlying failure cause to reliability. (Step 3-5.)
Three different maintenance activities were considered for these studies: no PM activities, PM by
the rehabilitation method and PM by replacing cables systematically before they failed (the
replacement method) with notations: org, si and rp respectively.
Figure 12 shows the final result for modelling the failure rate, assuming one PM action on each
cable. The initial value for the cable failure rate is relatively small but not zero, as the figure

indicates. The failure rate characteristic with no PM is the resulting approximation of a function
obtained from experience data [5]. The data is assessed from a complete population of cables over
a 13-year aging period. It was assumed that the failure rate, after this time and due to this specific
failure cause, is constant. Furthermore, it was assumed that replacement is made with a cable
having the same characteristics as the current cable had when new. These assumptions were
motivated by two aspects: that the water trees grow to a maximum length (that of the insulation
thickness) and that this provides a worst-case scenario when showing the benefit of PM.
However, it should be noted that for these XLPE insulated cables, a new cable would not have the
same characteristics due to changes in the manufacturing techniques. Nevertheless, a changed
characteristic can be included quite readily.
In practice, PM procedures are likely to be performed several times during the lifetime of a
particular component, in which case the characteristic shown in Figure 12 would have a series of
decrements similar to that shown. The number of occasions and their timing should depend on the
cost of performing the PM actions and the cost-benefit of doing so. The RCAM approach
described in this paper allows this to be assessed objectively.
The resulting cable failure rate model was used for the Birka system. The characteristics of the
XLPE cables in this system are consequently assumed to follow those of the XLPE cables with
insulation degradation due to water treeing. (It should be stressed that this assumption enabled
complete demonstration of the RCAM method, rather than providing a true picture of the cables
in the Birka system.) To obtain the composite failure rate for the cable, it was assumed that
failures were due either to water trees or to other causes. The resulting input data for the
component then consisted of the developed failure rate model for failures due to water trees, and
the average failure rate for the 11kV cable in the Birka system due to other causes. (Step 6.)
Figure 12 Resulting failure rate model for a water-treed cable affected by PM measures after 11 years. (Step 4-5, Approach II.)
4.4.3 Stage 3 - System Reliability and Cost/Benefit Analysis (Step 7-10)
Approach I
Results from the survey of statistics provided input data for modelling the relationship between
PM and reliability using Approach I. Sensitivity studies were made to see the effect at the system

level if each of these causes of failures were decreased individually or in combination. The
different cases are as follows:
1. base case,
2. fabric or material faults =14%,
3. lack of maintenance =5%,
4. wrong method or instruction =15%,
5. total of (2-4) =34%, and
6. total for (material and method) =59%.
The difference in percentages between cases 5 and 6 (25%) relates to those causes that were
reported as included in material and method, but with no further detailed level of classification.
Figure 13 shows the benefit of these different cases on the system indices. It has been assumed for
each case that the causes of failures can be eliminated by the PM activities. Thus the
corresponding failures would be eliminated and the reliability indices influenced. The results
show that PM measures to reduce individual causes of failures for a critical component in the
system can significantly improve the system reliability. The cases represent different maintenance
strategies for the RCAM method with Approach I (Step 7).
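To make the cases concrete, the sketch below applies the stated percentages to the surveyed count of cable-caused interruptions (58 over 18 years, from Section 4.4.2). It gives interruption counts per year for the surveyed cable population only, not the system indices of the figure, and the variable names are illustrative.

```python
INTERRUPTIONS = 58  # cable-caused interruptions in the survey
YEARS = 18          # length of the survey period

# Share of cable failures assumed to be eliminated in each case.
cases = {
    "2. fabric or material faults": 0.14,
    "3. lack of maintenance": 0.05,
    "4. wrong method or instruction": 0.15,
    "5. total of (2-4)": 0.34,
    "6. total for (material and method)": 0.59,
}

base_rate = INTERRUPTIONS / YEARS  # interruptions per year, base case
remaining = {name: base_rate * (1.0 - share) for name, share in cases.items()}

print(f"1. base case: {base_rate:.2f} interruptions/yr")
for name, rate in remaining.items():
    print(f"{name}: {rate:.2f} interruptions/yr")
```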
Figure 13 Effect on system reliability for different maintenance strategies using Approach I for the Birka system. (Step 9.)
Approach II
A system analysis is performed for the Birka system including two strategies for applying the PM
with either rehabilitation (PMsi) or replacement (PMrp). Both of these involve PM applied on three
occasions (years tPM = 9, 11, 12), and with the following proportions of cables subject to PM per
occasion: 10% for S1 and 30% for S2 (Step 7). The results from the system reliability analysis, as
shown in Table 3-1 (Step 9), show consistently that the best reliability is achieved with PM by
replacement and with as much as possible of the component replaced, that is S2 .
Figure 14 shows one result from the economic evaluation according to the RCAM method. Input
data for the economic assessment was provided by the utility, and from the Swedish customer
interruption costs included in [22]. It is seen that the cost of failures is decreased for the Birka
system, when the 11kV cables are affected by PM measures. Furthermore, it is seen that the most
significant decrease in cost of failures is achieved with the replacement method.

Table 3-1 Reliability results applying different maintenance methods

Reliability factor   Unit       CM      PMsi, S1   PMsi, S2   PMrp, S1   PMrp, S2
λav,Lpi              [int/yr]   0.52    0.50       0.47       0.50       0.45
Uav,Lpi              [h/yr]     0.70    0.70       0.65       0.68       0.63
rav,Lpi              [h/int]    1.40    1.41       1.44       1.42       1.45
Eav,Lpi              [MWh/yr]   16.14   15.75      14.97      15.61      14.57
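The relative changes against the CM case can be read off Table 3-1 with a few lines of code. The sketch below only re-expresses the tabulated values as percentages; the dictionary layout and names are an illustrative choice.

```python
# Values copied from Table 3-1.
CASES = ["CM", "PMsi S1", "PMsi S2", "PMrp S1", "PMrp S2"]
TABLE = {
    "lambda_av [int/yr]": [0.52, 0.50, 0.47, 0.50, 0.45],
    "U_av [h/yr]": [0.70, 0.70, 0.65, 0.68, 0.63],
    "r_av [h/int]": [1.40, 1.41, 1.44, 1.42, 1.45],
    "E_av [MWh/yr]": [16.14, 15.75, 14.97, 15.61, 14.57],
}


def change_vs_cm(values):
    """Percentage change of each PM case relative to the CM case."""
    cm = values[0]
    return {case: 100.0 * (v - cm) / cm for case, v in zip(CASES[1:], values[1:])}


changes = {index: change_vs_cm(values) for index, values in TABLE.items()}
for index, row in changes.items():
    print(index, {case: round(pct, 1) for case, pct in row.items()})
```

The failure rate, unavailability and energy not supplied all fall under PM, while the average outage time rises slightly in every PM case.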
Figure 14 The impact of maintenance methods and PM strategies on cost of failure for the Birka system. (Step 10, Approach II.)

Figure 15 The impact of different maintenance methods on the total annual costs of applying a PM strategy for the Birka system. Results are shown for the case with the interest rate d1 = 2%. (Step 10, Approach II.)
The final step in the RCAM analysis is to evaluate the present worth values of the annualised total
costs of maintenance. Figure 15 presents annual costs for the different maintenance methods
using PM strategy S1. It can be seen directly from the annual costs that PM is a dominating cost.
Furthermore, it is clearly more cost-effective to rehabilitate the cable than to replace it, since the
greater benefit in reliability by the replacement method is offset by the higher investment cost.
Consequently, the cost-effective solution is not to carry out PM in this case, but if PM is carried
out, rehabilitation is better than replacement. This is, however, a constructed example considering
only one type of component and does not provide the complete result for the Birka system. It is
also important to note that cables compared with other components in a power system involve
extremely high PM costs with relatively few possible PM actions. It is, however, of significant
importance for efficient maintenance planning to evaluate the relative values of implementing
different maintenance strategies, as shown in this application example.
4.4.4 Further developments into maintenance prioritization

The question of prioritization of maintenance resources is fundamental for all types of systematic
and cost-efficient maintenance planning approaches. In the RCAM approach the first stage, and
Step 1, involves identifying the most critical components, i.e. those that have the greatest impact
on the system reliability.
This section briefly introduces a proposed approach for component reliability importance indices,
which have been developed as a first stage in maintenance optimization [23][24][25].
The proposed indices focus on customer interruption cost as a measure of system performance
and reliability. The customer interruption costs have been calculated based on customer specific

initial costs for every interruption plus a cost linearly dependent on the duration of the
interruption. The interruption cost based index is defined as follows:
$$I_i^H = \frac{\partial C_s}{\partial \lambda_i} \quad \text{[€/f]} \qquad (4.17)$$
where Cs [€/yr] is the total yearly customer interruption cost and λi [f/yr] is the failure rate of component i.
The index identifies components that are critical for the system with respect to their individual
impact on total interruption cost with changes in component failure rate [26]. One interpretation
of IH is that it corresponds to the total expected interruption cost (for all load points) that would
occur if component i failed. Hence, if one maintenance action were available that would
result in the same absolute change in failure rate for any component in the network, IH would
be the appropriate index for prioritizing which component the action should be
performed on.
The proposed index, IH, is not affected by the studied component’s failure rate but “only” by
component repair time and the position of the component and all other components in the system.
This is analogous with Birnbaum’s importance index [27]. Hence, the concept of maintenance
potential [26] is introduced. Maintenance potential corresponds to the expected system cost
reduction that would occur in the case of a perfect component, i.e. no failures for the studied
component (hence maintenance potential). Another way to express this measure is the expected
total interruption cost that the studied component's failures will result in (alone and/or together
with other components) during one year. Maintenance potential is defined as:
$$I_i^{MP} = C_S(1_i, \lambda) - C_S(\lambda) \quad \text{[€/yr]} \qquad (4.18)$$
where $C_S$ is the total system (interruption) cost and λ [f/yr] is the failure rate for the studied
components.
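A toy calculation may help to separate the two indices. The sketch assumes a series system with a cost per interruption made up of a fixed part k plus a part proportional to the repair time, in line with the cost description above; the cost function, the values of k and c and the component data are invented. The maintenance potential is reported here as a positive cost reduction, following the verbal definition.

```python
def interruption_cost(lam, r, k=1000.0, c=500.0):
    """Toy yearly interruption cost [EUR/yr] for a series system.

    k [EUR/f] is a fixed cost per interruption and c [EUR/h] a duration
    cost; both values are invented.
    """
    return sum(l * (k + c * ri) for l, ri in zip(lam, r))


def importance_index(lam, r, i, eps=1e-6):
    """Central finite-difference estimate of dCs/dlambda_i [EUR/f]."""
    up = list(lam)
    up[i] += eps
    down = list(lam)
    down[i] -= eps
    return (interruption_cost(up, r) - interruption_cost(down, r)) / (2 * eps)


def maintenance_potential(lam, r, i):
    """Expected yearly cost reduction if component i never failed [EUR/yr]."""
    perfect = list(lam)
    perfect[i] = 0.0
    return interruption_cost(lam, r) - interruption_cost(perfect, r)


lam = [0.40, 0.05]  # failure rates [f/yr]
r = [1.0, 4.0]      # repair times [h]
i_h = [importance_index(lam, r, i) for i in range(len(lam))]
i_mp = [maintenance_potential(lam, r, i) for i in range(len(lam))]
print("I_H :", [round(x, 1) for x in i_h])
print("I_MP:", [round(x, 1) for x in i_mp])
```

In this example the second component has the larger IH because of its long repair time, while the first has the larger maintenance potential because it fails more often, so the two indices can rank components differently.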
Results from applying these indices to the Birka system are presented below. First the different
component reliability indices have been calculated. Then one of three maintenance strategies is
implemented for each component. The three component strategies are:
1. Keep current preventive maintenance level, average failure rate is assumed to remain
unchanged, no change in cost.
2. Improve the preventive maintenance, the failure rate of the component is assumed to
become reduced, increased cost of preventive maintenance.
3. Decrease the preventive maintenance, the failure rate is assumed to increase for the
studied component, cost savings on preventive maintenance.
The selection process of the component strategies has to be performed in an optimization process
that recalculates the indices several times; this in order to assure that an optimal point is reached.

[Diagram: the Birka system model with components c1 to c58 across the 220 kV, 110 kV, 33 kV and 11 kV levels and nodes labelled Sp, SJ, SW, HD and LH11. Each component is marked according to the legend: Increase maintenance, Keep maintenance lev., Decrease maintenance.]
Figure 16 An optimal maintenance strategy for the Birka system [23].
Figure 16 illustrates the results for the Birka system using the component importance indices to
search for an optimal maintenance strategy. The results show, for each component, whether the
maintenance level should be kept, increased or decreased. From the figure it is seen that the
maintenance could be decreased for several components at the 33 kV level. This result is
reasonable since these components are located in a highly redundant part of the system.
4.5 Conclusions
This chapter has provided an overview of two different approaches to RCM, i.e. RCM II and
RCAM. RCAM differs mainly in that its approach is purely mathematical. Therefore RCAM needs
more data input and requires a higher level of research than RCM II. RCAM was developed
for a structurally complex system. For analysis of individual components the structure would be
very simple, since nearly all failures lead to a functional failure, and little would be gained from
the system analysis, which plays a major part in the RCAM method. The RCAM method has the
advantage of being able to view the system as a unit when deciding maintenance strategies, while
RCM II views the system at component and failure mode level when maintenance strategies are
determined.
The chapter has also presented application studies for the RCAM approach. Results from
application studies show how the RCAM method can be used to compare different maintenance
methods and PM strategies based on the total cost of maintenance, which includes the impact of

the PM measure on the system reliability. Furthermore, the application study shows that the
RCAM method can be performed and supported by real input data. Relating maintenance effort
and reliability improvement is, however, a complex problem, and substantial input data is
required to support the method. The RCAM, as well as the RCM, approach consequently provides
a means for creating resources to provide input data.
4.6 References
[1] Billinton, R., Fotuhi-Firuzabad, M. and Bertling, L. “Bibliography on the application of probability methods in
power system reliability evaluation 1996-1999”, IEEE Transactions on Power Systems, Vol. 16, No. 4,
November 2001.
[2] Endrenyi, J., et al, “The Present Status of Maintenance Strategies and the Impact of Maintenance on
Reliability”, IEEE Transactions on Power Systems, vol. 16, no. 4, November, 2001.
[3] R. Billinton, “Bibliography on the application of probability methods in power system reliability evaluation,”
IEEE Trans. Power App. Syst., vol. PAS-91, Mar./Apr. 1972.
[4] J. Moubray, Reliability-centred Maintenance, Butterworth-Heinemann, Oxford, 1995.
[5] Bertling, L. "Reliability Centred Maintenance for Electric Power Distribution Systems", ISBN 91-7283-345-9,
TRITA-ETS-2002-01, ISSN 1650-674X, KTH Electrical Engineering, August 2002.
[6] Bertling L., Allan R.N., Eriksson, R.,“A reliability-centred asset maintenance method for assessing the impact
of maintenance in power distribution systems”, TPWRS-00271-2003.R3, IEEE Transactions on Power Systems,
Vol. 20, No. 1, Feb. 2005.
[7] Nowlan, F. S. and Heap, H. F., Reliability-Centered Maintenance, National Technical Information Service, U.S.
Department of Commerce., Springfield, Virginia, US, 1978.
[8] Smith, A. M., Reliability-Centered Maintenance, McGraw-Hill, U.S, 1993.
[9] Swedenergy AB, ”RCM for Electrical Distribution Systems - A Simplified Decision Model for Maintenance
Planning Part I” (RCM För Elnät En förenklad beslutsmetod för underhållsplanering - Del 1
Användningsområden och arbetssätt, (ISBN 91-7622-167-9, In Swedish), 2001.
[10] Cigré Working Group 13.08, “Life Management of Circuit-Breakers, International Council on Large Electric
Systems”, Cigré, Paris, France, Working Group 13.08 Report 165, 2000.
[11] Eriksson R., Lindquist T., Bertling L. “Reliability modelling of aged XLPE cables”, Nordic Insulation
Symposium Tampere, June 11-13, 2003.
[12] Lindquist T., Bertling L, Eriksson R., “A Method for Age Modelling of Power System Components based on
Experiences from the Design Process with the purpose of Maintenance Optimization”, Presented at the
Reliability and Maintainability Annual Symposium (RAMS), January 2005.
[13] Lindquist T., Bertling L, Eriksson R., “A Feasibility Study for Probabilistic Modeling of Aging in Circuit
Breakers for Maintenance Optimization”, Proceedings of PMAPS, Ames, Iowa, September 2004.
[14] Lindquist T., Bertling, L., Eriksson, R., “Estimation of disconnector contact condition for modeling the effect of
maintenance and ageing”, IEEE PowerTech'05 St. Petersburg, June 2005.
[15] Kariuki, K.K. and Allan, R.N., Application of customer outage costs in system planning, design and operation,
IEE. Gener. Transm. Distrib., vol 143, no 2, March 1996.
[16] Bertling, L., Eriksson, R. and Allan, R.N., “Relation between preventive maintenance and reliability for a cost-
effective distribution systems”, Proceedings of IEEE PowerTech'01, vol 4, no 208, September 2001.
[17] Bertling, L., Eriksson, R., Allan, R.N., Gustafsson, L.Å. and Åhlén M. “Survey of Causes of Failures Based on
Statistics and Practice for Improvements of Preventive Maintenance Plans”, 14th PSCC in Seville, June 2002.
[18] Swedenergy AB, The Lifetime and Usefulness of XLPE Cables (PEX-kablar livslängd och användbarhet), (In
Swedish), 1990.
[19] Werelius, P., Thärning, P., Eriksson, R., Holmgren, B. and Gäfvert, U., ”Dielectric Spectroscopy for
Diagnostics of Water Tree Deteriorated XLPE Cables”, IEEE Transactions on Dielectrics and Electrical
Insulation, vol 8,no 1, February 2001.
[20] SINTEF, Faremo, H., Report: Rehabilitation of XLPE Cables with long Water-trees, (Energiforsyningens
Forskningsinstitutt (EFI), EFI TR A 4512, In Norwegian),Trondheim, Norway, 1997.
[21] Pilling, J. and Bertini, G., “Incorporating Cablecure injection into a Cost-Effective Reliability Program”, IEEE
Industry Applications Magazine, Vol. 3333, No 208333, September/October 2000.
[22] Cigré Task Force 38-06-01, Methods to Consider Customer Interruption Costs in Power System Analysis, Paris,
2001.

[23] Hilber, P., “Component reliability importance indices for maintenance optimization of electrical networks”,
Licentiate thesis, ETS, KTH. Usab ab. ISBN 91-7178-055-6, 2005.
[24] Hilber, P., and Bertling, L., “A method for extracting reliability importance indices from reliability simulations
of electrical networks”, Proceedings of the 15th PSCC in Liege, Belgium, Aug. 2005.
[25] Hilber, P., Hällgren, B. and Bertling, L. “Optimizing the replacement of overhead lines in rural distribution
systems with respect to reliability and customer value”, Accepted to be presented at the 18th International
Conference on Electricity Distribution (CIRED) in Turin, June 2005.
[26] Hilber, P., Bertling, L., and Hällgren, B., “Effects of correlation between failures and power consumption on
customer interruption cost”, Proceedings of the 9th international conference on Probabilistic Methods Applied to
Power Systems, PMAPS, Stockholm, Sweden, June 2006.
[27] Rausand, M. and Høyland, A. “System reliability theory”, 2nd ed, Hoboken, New Jersey: John Wiley & Sons.
ISBN 0-471-47133-X, 2004.
4.7 Biography
Lina Bertling (S’98-M’02) was born in Stockholm in 1973. She received her Ph.D in Electric Power
Systems in 2002 and M.Sc. in Systems Engineering in 1997, from KTH - the Royal Institute of
Technology, Stockholm, Sweden.
She is currently employed at KTH School of Electrical Engineering as Assistant Professor, and is the
leader for a research program at the Swedish Centre of Excellence in Electric Power Systems (EKC2) on
maintenance management. Since 2003 she has been working as a lecturer and research leader at KTH
developing a research group on reliability-centered asset management (RCAM). Her research interests are
in power system maintenance planning and optimization including reliability-centered maintenance (RCM)
methods, reliability modeling and assessment for complex systems, and lifetime and reliability modeling
for electrical components.
Dr. Bertling is a member of the IEEE Power Engineering Society (PES) Subcommittee on Risk,
Reliability, and Probability Applications (RRPA), and the IEEE PES Committee on Power System
Planning and Implementation. She was the general chair of the 9th international conference on probabilistic
methods applied to power systems (PMAPS) in Stockholm in 2006. (linab@kth.se, www.ee.kth.se/rcam)

5 Optimizing condition monitoring decisions for maintenance planning
Dr. Andrew K.S. Jardine
Department of Mechanical & Industrial Engineering
University of Toronto
Toronto, Ontario
Canada, M5S 3G8
Abstract - Condition monitoring is an activity that is widely used, mainly for expensive and complex
equipment/systems consisting of a large number of simpler components, and subject to different failure
modes. Due to the rapid development of IT, much more data from condition monitoring and all types of
maintenance and corrective activity is collected and stored in maintenance data bases.
The Chapter focuses on current industry-driven research that employs proportional hazards modeling to
identify the key risk factors that should be used to identify the health of equipment from amongst those
signals that are obtained during equipment health monitoring. Economic considerations are then blended
with the risk estimate to establish optimal condition-based maintenance (CBM) decisions.
Recent results of the research program are included in the Chapter including development of the EXAKT
software, and its successful application to the condition monitoring techniques of vibration monitoring and
oil analysis. The remaining useful life (RUL) of a system and its associated conditional reliability function
are also considered as a tool that may be used in optimizing CBM decisions.
5.1 Introduction
Condition Monitoring (CM) has become a recognized tool for assessment of the health of
equipment, such as the use of oil analysis for power transformers. Planning and scheduling of
maintenance decisions can be made based on the analysis of CM information. Examples of CM
information that can be utilized include, but are not limited to: Vibration monitoring, Infrared
Thermography, Oil Analysis, Ultrasonics, Motor Current Analysis, etc. [Dunn, 2005]
Control charts are one of the most commonly applied techniques for interpretation of CM data. At
each inspection, levels of some measurements are compared with the
corresponding predefined “warning limits” and a judgment is made based on the outcome. The
method has been applied for several decades and proved to be a helpful and simple to understand
technique.
However, control charts leave several important questions unanswered. Among the variety of
measurements related to the item's condition that one can collect, which ones should be paid
attention to? What if there is no single variable that can provide information on true condition of
the equipment? What are the optimal warning limits and should these limits change with
operating age of the item? [Jardine et al, 2006].
In this Chapter we present a procedure that takes into account both the age of the item and its
history. It significantly expands the space of available maintenance strategies and is termed
Condition-Based Maintenance (CBM).

A. K. S. Jardine
The CBM Consortium research laboratory was established in 1995 at the Department of
Mechanical and Industrial Engineering in the University of Toronto. The lab has developed theory
that combines age and condition monitoring data with economic and/or performance data that
may include the cost of failure, the cost of planned maintenance, the corresponding down times,
and produces a long-run optimal maintenance decision policy. Among current activities of the
project is development of software that can assist maintenance and reliability specialists to
optimize decisions in a CBM environment. The current state of development of the software, called
EXAKT™, is presented in section 8.5. Details of the CBM Lab can be obtained at
www.mie.utoronto.ca/cbm.
5.2 Optimizing Condition Based Maintenance Decisions

5.2.1 Introduction
Possibly the most common approach to understanding the health of equipment is through plotting
various measurements and comparing them to specified standards. This procedure is illustrated in
Figure 5.1 where measurements of iron deposits in an oil sample are plotted on the Y-axis and
compared to warning and alarm limits. The maintenance professional then takes remedial action if
deemed appropriate. Many software vendors addressing the needs of maintenance have packages
available to assist in interpreting CM measurements, with the goal of predicting failures.
[Plot: iron concentration against working age, with bands Normal < 200 ppm, Warning > 200 ppm and Alarm > 300 ppm.]
Figure 5.1 Classical Approach to Condition Monitoring.
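The classical approach amounts to a simple threshold check at each inspection. The sketch below uses the warning and alarm limits shown in Figure 5.1; the function name, the inspection readings and the treatment of a reading exactly on a limit (kept in the lower band) are assumptions made for the example.

```python
def classify(iron_ppm, warning=200.0, alarm=300.0):
    """Compare one oil-analysis reading with the warning and alarm limits."""
    if iron_ppm > alarm:
        return "alarm"
    if iron_ppm > warning:
        return "warning"
    return "normal"


# Hypothetical inspections: (working age in hours, iron concentration in ppm).
readings = [(1000, 120.0), (2000, 180.0), (3000, 240.0), (4000, 320.0)]
states = [(age, classify(ppm)) for age, ppm in readings]
for age, state in states:
    print(age, state)
```

Note that the limits are fixed and only one measurement is used, which is exactly the limitation the chapter goes on to address.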
Clearly there is a need to focus attention on the optimization of condition monitoring procedures.
In the following section we will present an approach for estimating the hazard (conditional
probability of failure) that combines the age of equipment and condition monitoring data using a
PHM. We will then examine the optimization of the CM decision by blending in with the hazard
calculation, the economic consequences of both preventive maintenance, including complete
replacement, and equipment failure. [Jardine & Tsang 2006]

5.2.2 The Proportional Hazards Model (PHM).
A valuable statistical procedure for estimating the risk of equipment failing when it is subject to
condition monitoring is the proportional hazards model [Cox, 1972]. There are various forms that
can be taken by a PHM, all of which combine a baseline hazard function along with a component
that takes into account covariates that are used to improve the prediction of failure. The particular
form used in this section is known as a Weibull baseline PHM which is:
$$h(t, Z(t)) = \frac{\beta}{\eta} \left( \frac{t}{\eta} \right)^{\beta - 1} \exp\left\{ \sum_{i=1}^{m} \gamma_i z_i(t) \right\} \qquad (5.1)$$
where h(t, Z(t)) is the (instantaneous) conditional probability of failure at time t, given the values
of z1 (t ), z 2 (t ),..., z m (t ) .
Each zi (t) in equation (5.1) represents a monitored condition data item at the time of inspection, t,
such as the parts per million of iron or the vibration amplitude at the second harmonic of shaft
rotation. These condition data are called covariates.
The γ's are the covariate parameters indicating the degree of influence each covariate has on the
hazard function. The model consists of two parts: the first part is a baseline hazard function that
takes into account the age of the equipment at time of inspection, $\frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta-1}$, and the second part,
$e^{\gamma_1 z_1(t) + \gamma_2 z_2(t) + \cdots + \gamma_m z_m(t)}$, takes into account the variables (which may be thought of as the key risk factors
used to monitor the health of equipment) and their associated weights.
In the study by Anderson et al [1982] the form of the hazard model for the aircraft engines
was:
$$h(t) = \frac{4.47}{24100} \left( \frac{t}{24100} \right)^{3.47} \exp(0.41 z_1 + 0.98 z_2) \qquad (5.2)$$
where z1 is Fe concentration and z2 is Cr concentration in parts per million and t is the age of the
aircraft engine in flying hours at the time of inspection. Since ß = 4.47 we know that the age of
the aircraft engine is an influencing factor in estimating the hazard rate of the engine. η = 24,100
hours is a parameter of the Weibull distribution. The values 0.41 and 0.98 are the weights to give
the iron and chrome measurements when calculating the hazard rate. They are estimated from the
data that is analyzed and will be different for different engines, and will depend on their operating
environment.
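As an illustration, equation (5.1) with the parameters of equation (5.2) can be evaluated directly. The following is a minimal sketch; the function name and the inspection readings (10,000 flying hours, 2 ppm Fe, 1 ppm Cr) are hypothetical and are not taken from the Anderson et al study.

```python
import math

def weibull_phm_hazard(t, z, beta, eta, gamma):
    """Hazard h(t, Z(t)) of equation (5.1) for age t and covariate values z."""
    if len(z) != len(gamma):
        raise ValueError("z and gamma must have the same length")
    # Baseline Weibull hazard: depends on the age of the equipment only
    baseline = (beta / eta) * (t / eta) ** (beta - 1.0)
    # Covariate term: exp of the weighted sum of condition measurements
    covariate_term = math.exp(sum(g * zi for g, zi in zip(gamma, z)))
    return baseline * covariate_term

# Parameters of equation (5.2)
BETA, ETA, GAMMA = 4.47, 24100.0, (0.41, 0.98)

# Hypothetical inspection: 10,000 flying hours, Fe = 2 ppm, Cr = 1 ppm
h = weibull_phm_hazard(10_000.0, (2.0, 1.0), BETA, ETA, GAMMA)
print(f"hazard = {h:.3e} per flying hour")
```

Because the covariates enter through an exponential, each additional ppm of iron multiplies the hazard by exp(0.41) and each additional ppm of chrome by exp(0.98), whatever the age of the engine.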
The procedure to estimate the values of β, η and the weights, along with determining the
condition monitoring variables to be included in the model, is discussed in a number of books and
papers, including Kalbfleisch and Prentice [2002].
Standard statistical software such as SAS and S-Plus have routines to fit a PHM.

A. K. S. Jardine
5.2.3 Blending Hazard and Economics: Optimizing the CBM Decision
Makis and Jardine [1992] presented an approach to identify the optimal interpretation of condition
monitoring signals. The approach is illustrated graphically in Figure 5.2 and Figure 5.3.
[Figure 5.2 panels: DATA PLOT (data versus age) and RISK PLOT (risk versus age)]
Figure 5.2 Calculating Hazard from Condition Monitoring Measurements.
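[Figure 5.3 panels: RISK PLOT (risk versus age, showing the "ignore risk" case and the optimal risk level) and COST PLOT (cost per unit time versus risk, showing the replace-at-failure-only cost and the minimal cost at the optimal risk)]
Figure 5.3 Establishing the Optimal Hazard Level for Preventive Replacement.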
Figure 5.2 illustrates that given a set of condition monitoring measurements (the data plot) it is
possible to convert the measurements to the equivalent hazard estimate (the risk plot). This
conversion is achieved through using a PHM.
Once we have a method of monitoring an equipment’s hazard value, the next question is: What
should we do about it to make an optimal maintenance decision? The answer is illustrated in
Figure 5.3. There it can be seen that one possibility is to ignore risk (Risk Plot). If risk
information is ignored, then the equipment will be used until it fails, and only then will it be
maintained (for the time being, assume that the maintenance action is equivalent to a replacement,
as is the case for some complex equipment, such as aircraft engines, where after maintenance the
engines are re-lifed and have the same guarantees as a new engine). The cost associated with this
decision (ignoring risk) is the cost of a failure replacement divided by the mean time to failure of
the equipment. Thus we obtain the cost of replacing only on failure as identified on the Cost Plot.
As the risk level at which replacement is triggered is reduced, there will be more preventive
replacement actions and fewer failure replacements. Assuming that the cost of a failure
replacement is greater than the cost of a preventive replacement, a cost function as illustrated on
the Cost Plot will be obtained. Thus
it is possible to identify the optimal hazard level at which the equipment should be replaced: if the
hazard rate is greater than a certain threshold value, preventive replacement should take place;
otherwise, operations can continue as normal.
In the Makis and Jardine [1992] paper it is shown that the expected average cost per unit time,
Φ(d), is a function of the threshold risk level, d, and is given by:
$$\Phi(d) = \frac{C\,(1 - Q(d)) + (C + K)\,Q(d)}{W(d)} \tag{5.3}$$
where C is the preventive replacement cost and C+K the failure replacement cost. Q(d) represents
the probability that failure replacement will occur, at hazard level d. W(d) is the expected time
until replacement, either preventive or failure.
The optimal risk, d*, is that value that minimizes the right hand side of equation (5.3), and the
optimal decision is then to replace the item whenever the estimated hazard, h(t, Z(t)), calculated
on completion of the condition monitoring inspection, equals or exceeds d*.
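The minimization of equation (5.3) can be sketched numerically. In the full model Q(d) and W(d) come from the PHM together with a model of how the covariates evolve, which is beyond a short example. The sketch below therefore makes a strong simplifying assumption that is not in the source: there are no covariates, so the hazard depends on age only and "replace when the hazard reaches d" is equivalent to "replace at the age at which the Weibull hazard equals d". The cost values and function names are hypothetical.

```python
import math

def cost_rate(C, K, Q, W):
    """Equation (5.3): expected average cost per unit time."""
    return (C * (1.0 - Q) + (C + K) * Q) / W

def q_and_w_weibull(d, beta, eta, steps=2000):
    """Q(d) and W(d) for a Weibull hazard with no covariates (needs beta > 1)."""
    # Age at which the baseline hazard reaches the threshold d
    t_d = eta * (d * eta / beta) ** (1.0 / (beta - 1.0))
    # Q: probability that the item fails before reaching age t_d
    Q = 1.0 - math.exp(-((t_d / eta) ** beta))
    # W: expected time to replacement = integral of R(t) over [0, t_d],
    # evaluated with the trapezoidal rule
    step = t_d / steps
    R = [math.exp(-((i * step / eta) ** beta)) for i in range(steps + 1)]
    W = step * (sum(R) - 0.5 * (R[0] + R[-1]))
    return Q, W

def optimal_threshold(C, K, beta, eta, d_grid):
    """Grid search for the hazard threshold d* that minimizes Phi(d)."""
    best_d, best_cost = None, math.inf
    for d in d_grid:
        Q, W = q_and_w_weibull(d, beta, eta)
        phi = cost_rate(C, K, Q, W)
        if phi < best_cost:
            best_d, best_cost = d, phi
    return best_d, best_cost

# Hypothetical costs: a failure replacement costs three times a preventive one
C, K = 1.0, 2.0
beta, eta = 4.47, 24100.0
d_grid = [10 ** (-6 + 3 * i / 300) for i in range(301)]  # 1e-6 to 1e-3
d_star, phi_star = optimal_threshold(C, K, beta, eta, d_grid)

# Cost rate when risk is ignored: failure cost divided by mean time to failure
run_to_failure = (C + K) / (eta * math.gamma(1.0 + 1.0 / beta))
print(f"d* = {d_star:.3e}, cost rate = {phi_star:.3e}, "
      f"run-to-failure = {run_to_failure:.3e}")
```

The comparison with the run-to-failure cost rate corresponds to the right-hand end of the Cost Plot in Figure 5.3, where risk is ignored and every replacement is a failure replacement.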
5.2.4 Applications
The optimization of CBM decisions has been an active research thrust at the University of
Toronto for some years, conducted in partnership with a number of companies, many
of them having global operations (www.mie.utoronto.ca/cbm). As a consequence, pilot studies
have been undertaken and published in the open literature. Brief summaries of three of them, each
utilizing a different form of condition monitoring, follow.
5.2.4.1 Use of vibration monitoring

A company undertook regular vibration monitoring of critical shear pump bearings. At each
inspection 21 measurements were provided by an accelerometer. Using the theory described in the
previous section, and its embedding in software called EXAKT, see Section 3.5.6, it was
established that of the 21 measurements there were 3 key vibration measurements: Velocity in the
axial direction in both the first band width and the second band width, and velocity in the vertical
direction in the first band width.
In the plant the economic consequence of a bearing failure was 9.5 times greater than when the
bearing was replaced on a preventive basis. Taking account of the risk obtained from the PHM and
of the costs, it was clear that by following the optimization approach total cost could be reduced
by 35%. Fuller details are available in Jardine et al. [1999].

5.2.4.2 Use of oil analysis

Electric wheel motors on a fleet of haul trucks in an open-pit mining operation were subject to oil
sampling on a regular basis. Twelve measurements resulted from each inspection. These were
compared to warning and action limits in order to decide whether or not the wheel motor should
be removed preventatively. These measurements were: Al, Cr, Ca, Fe, Ni, Ti, Pb, Si, Sn, Visc 40,
Visc 100, and Sediment.
After applying a PHM to the data set, it was identified that there were only two key risk factors,
that is, oil analysis measurements that were highly correlated to the risk of the wheel motor
failing; these measurements were of iron (Fe) and sediment. The economic advantage of
following the optimal replacement strategy was a cost reduction of 22%. The cost consequence of
a wheel motor failure was estimated as being three times the cost of replacing it preventatively.
Fuller details are available in Jardine et al. [2001].
5.2.4.3 Use of visual inspection: Transportation

Traction motor ball bearings on trains were inspected at regular intervals to determine the color of
the grease, which could be in one of four states: light grey, grey, light black, or black. Depending on the
color of the grease and knowing the next inspection time, a decision was made to either replace or
leave the ball bearings in service. As a result of building a PHM for the hazard of a bearing
failing before the next planned inspection, a decision was made to dramatically reduce the interval
between checks from 3.5 years to 1 year. Before the study was undertaken the transportation
organization was suffering, on average, 9 train stoppages per year. The expected number with a
reduced inspection interval was estimated to be one per year. The year following the study the
transportation system identified two system failures due to a ball bearing defect. The overall
economic benefit was identified as a reduction in total cost of 55%. It should be mentioned that
this included the cost of additional inspectors and took into account the reduction in passenger
disruption. A “notional” cost was identified with passenger delays.
5.2.5 Further Comments

Case studies dealing with the optimization of CBM decisions in the utilities sector include:
Nuclear plant refueling, Jardine et al [2003] and Turbines in a nuclear plant, Chevalier et al
[2004].
5.3 Software for CBM Optimization

To ease the application of the theory described in Section 5.2, a software package named EXAKT
(www.omdec.com) has been developed. As explained by Wiseman [2004], “EXAKT takes
processed signals, correlates them with past failure and potential failure events. Using modeling,
it subsequently provides failure risk and residual life estimates tuned to the economic
considerations and the availability requirements for that asset in its current operating context.”
Table 5-1 shows the form of condition monitoring data that EXAKT requires if the CM tool is
vibration monitoring. In addition, “event data” is required. This is information about when
equipment went into service and when it came out of service. It is also information about any
maintenance interventions that took place between installation and removal of the equipment,
such as the events defined in Table 5-2, which may affect interpretation of the CM data. A sample
of the vibration analysis event data for the example being illustrated in this section is provided in
Table 5-3, where the working age of the bearing being monitored is measured in days.
Table 5-1 Vibration Monitoring Data.

Table 5-2: Different Forms of Event Data
Definition of an event:
1. A beginning event. This indicates the start of a history (a “history” is
the time from installation to removal of an item). Designated by “B”.
2. A failure event. Designated by “EF” (ending with failure).
3. A preventive replacement. Designated by “ES” (ending by suspension).
An event is also an occurrence during a history which affects the condition data. Here are some examples:
1. An oil change
2. A rotor balance
3. A shaft/coupling alignment
4. A soft foot correction
5. Tightening, calibration, minor adjustments that affect the condition data
6. A filter replacement
7. And so on

Table 5-3 Vibration Analysis Event Data.
Data from Table 5-1 and Table 5-3 are used to obtain the PHM. The same data is used to obtain
the transition probabilities which are then used in combination with cost data to obtain the optimal
decision figure; see Banjevic et al [2001].
Table 5-4 is an example of the transition probability matrix for the vibration measurement
“velocity in the axial direction, first band width” when the interval for the transition is
specified as 30 days. Thus, if today the velocity is in the range 0.15 – 0.22 there is a probability of
0.37788 that the equipment will be in the same state 30 days from today. Similarly, the table can
be used to estimate the probability of the equipment being in a failure state in 30 days' time as
0.199714. Transition probabilities are provided for all possible combinations of states.
Table 5-4 Transition Probability Matrix (inspection interval = 30 days).
[Table values not reproduced. States: Very Smooth, Smooth, Rough, Very Rough, Failure.]
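To make the idea of a transition probability matrix concrete, the sketch below estimates one by counting transitions between consecutive, equally spaced inspections and normalizing each row. This counting estimator is an assumption made for illustration; it is not claimed to be the procedure used in EXAKT, which is described in Banjevic et al [2001]. The inspection histories are hypothetical.

```python
from collections import Counter

STATES = ["Very Smooth", "Smooth", "Rough", "Very Rough", "Failure"]

def estimate_transition_matrix(histories, states=STATES):
    """Row-normalized transition counts between consecutive inspections."""
    counts = Counter()
    for history in histories:
        # Pair each inspection with the one that follows it
        for current, following in zip(history, history[1:]):
            counts[(current, following)] += 1
    matrix = {}
    for s in states:
        row_total = sum(counts[(s, t)] for t in states)
        # A state never observed as a starting point gets no row estimate
        matrix[s] = ({t: counts[(s, t)] / row_total for t in states}
                     if row_total else None)
    return matrix

# Hypothetical state sequences, one per bearing history, inspected every 30 days
histories = [
    ["Very Smooth", "Smooth", "Smooth", "Rough", "Failure"],
    ["Very Smooth", "Very Smooth", "Smooth", "Rough", "Very Rough", "Failure"],
    ["Smooth", "Smooth", "Rough", "Rough"],
]

P = estimate_transition_matrix(histories)
print("P(Smooth -> Smooth in 30 days) =", P["Smooth"]["Smooth"])
```

Each row of the resulting matrix is read in the same way as the example in the text: given today's state, it lists the probability of being in each state at the next inspection.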
Finally using the PHM, transition matrices and the costs associated with preventive and failure
replacement, the figure used for decision-making is obtained – Figure 5.4.

Vibration Monitoring Decision
Figure 5.4 Optimizing the CBM Decision.
Thus whenever an inspection is made the values of the key risk factors are obtained. In this case
the key risk factors are: velocity in the axial direction, first band width; velocity in the axial
direction, second band width; and velocity in the vertical direction, first band width. These
measurements are then multiplied by their weighting factors, 5.8312, 36.552 and 24.053
respectively, then added together to give a Z-value which is marked on the Y-axis. The X-axis
defines the age of the item (a bearing in this example) at the time of inspection. The intersection
of a horizontal line from the Z-value and a vertical line from the age indicates the optimal
decision. If the intersection is in the light shaded area (green) the recommendation is to continue
operating – with reference to the lower figure in Figure 5.3 the cost curve is still declining. If the
intersection is in the dark shaded area (red) the recommendation is to replace – with reference to
the lower figure in Figure 5.3 the cost curve is now in the increasing range. If the intersection lies
in the clear area it indicates that the optimal change-out time is between two inspections.
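The Z-value described above is a weighted sum of the three key measurements. A minimal sketch follows; the weights are those quoted in the text, while the dictionary keys and the inspection readings are hypothetical and are assumed to be in the same units as the data used to fit the model. The boundaries of the decision regions in Figure 5.4 are not given in the text, so the sketch stops at the Z-value.

```python
# Weighting factors quoted in the text for the three key risk factors
WEIGHTS = {
    "axial_velocity_band1": 5.8312,
    "axial_velocity_band2": 36.552,
    "vertical_velocity_band1": 24.053,
}

def z_value(measurements, weights=WEIGHTS):
    """Weighted sum of the key vibration measurements (Y-axis of Figure 5.4)."""
    return sum(weights[name] * measurements[name] for name in weights)

# Hypothetical readings from one inspection
inspection = {
    "axial_velocity_band1": 0.18,
    "axial_velocity_band2": 0.02,
    "vertical_velocity_band1": 0.05,
}

print(f"Z = {z_value(inspection):.4f}")
```

The point (age at inspection, Z) is then located on the decision chart to read off the recommendation.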
The site www.omdec.com provides a detailed explanation of EXAKT along with the answers to
many frequently asked questions and a number of tutorial problems. The chapter “Interpretation
of inspection data emanating from equipment condition monitoring tools: Method and software” in
Mathematical and Statistical Methods in Reliability [Armijo, Y.M. (Editor), (2005)] provides an
overview of the theory and application of the CBM optimization approach presented in this
section.
5.4 Recent Developments

5.4.1 Conditional distribution of time to failure [Banjevic and Jardine, 2005]
Within the framework of the statistical models introduced in Section 5.2, the conditional reliability
function of the item, given the current state of the covariate process, can be expressed as follows:

$$R(t \mid x, i) = P(T > t \mid T > x, Z(x) = i) = \sum_{j} L_{ij}(x, t) \tag{5.4}$$
Once the conditional reliability function is calculated we can obtain the conditional density from
its derivative. We can also find the conditional expectation of T − t, termed the remaining useful
life (RUL), as
$$E(T - t \mid T > t, Z(t)) = \int_{t}^{\infty} R(x \mid t, Z(t))\,dx \tag{5.5}$$
In addition, the conditional probability of failure in a short period of time [t, t + Δt] can be found
as
$$P(\text{failure during } [t, t + \Delta t] \mid t, Z(t)) = R(t \mid t, Z(t)) - R(t + \Delta t \mid t, Z(t)) \tag{5.6}$$
For a maintenance engineer, predictive information based on current CM data, such as RUL and
probability of failure in a certain period of time, can be a valuable tool for assessment of risks and
planning appropriate maintenance actions.
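Equations (5.5) and (5.6) can be evaluated numerically once a conditional reliability function is available. The sketch below uses a strong simplification that is not the model of Banjevic and Jardine: the covariates are assumed to stay frozen at their current values z, so that R(x | t, z) = exp(−e^{γz}[(x/η)^β − (t/η)^β]) for a Weibull PHM. The function names, the integration horizon and the inspection readings are hypothetical.

```python
import math

def conditional_reliability(x, t, z, beta, eta, gamma):
    """R(x | t, z) = P(T > x | T > t), with covariates frozen at z."""
    scale = math.exp(sum(g * zi for g, zi in zip(gamma, z)))
    return math.exp(-scale * ((x / eta) ** beta - (t / eta) ** beta))

def remaining_useful_life(t, z, beta, eta, gamma, horizon, steps=20000):
    """Equation (5.5) by the trapezoidal rule, truncated at t + horizon."""
    step = horizon / steps
    r = [conditional_reliability(t + i * step, t, z, beta, eta, gamma)
         for i in range(steps + 1)]
    return step * (sum(r) - 0.5 * (r[0] + r[-1]))

def failure_probability(t, dt, z, beta, eta, gamma):
    """Equation (5.6): probability of failure in [t, t + dt] given survival to t."""
    # R(t | t, z) equals 1, so the difference reduces to 1 - R(t + dt | t, z)
    return 1.0 - conditional_reliability(t + dt, t, z, beta, eta, gamma)

# Parameters of equation (5.2); hypothetical inspection at 10,000 flying hours
beta, eta, gamma = 4.47, 24100.0, (0.41, 0.98)
t_now, z_now = 10_000.0, (2.0, 1.0)

rul = remaining_useful_life(t_now, z_now, beta, eta, gamma, horizon=5 * eta)
p_fail = failure_probability(t_now, 500.0, z_now, beta, eta, gamma)
print(f"RUL = {rul:.0f} hours, P(failure in next 500 hours) = {p_fail:.4f}")
```

Truncating the integral at a finite horizon is acceptable here because the conditional reliability is negligible well before five characteristic lives have elapsed.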
5.5 EXAKT Summary

The current state of development of the software, named EXAKT™, allows the user to:
• Create a convenient database by extracting the event and condition (inspection)
data from external databases;
• Detect logical errors in the databases;
• Perform data analysis and preprocessing, using graphical and statistical analysis;
• Estimate parameters of the PHM and Markov process model. The model can be
evaluated based on such statistical tests as Wald test, Log-likelihood test,
Kolmogorov-Smirnov test, χ² test for independence of covariates and for homogeneity of the Markov
process;
• Calculate and graphically present the conditional probability distribution for a
given item and provide such characteristics as RUL and probability of failure in a short
time period;
• Compute and save the optimal replacement policy. Alternate policies are also
available based on Age and Block replacement strategies;
• Perform separate analysis for different failure modes or components of the system
and create an integrated decision module;
• Make and save decisions for current records whenever it is required, using the
developed decision model.
Figure 5.5 illustrates the principle of the software and the way it can be used in decision-making.
As outlined above, the program utilizes the age data and the condition-monitoring data in order to
produce a statistical model, which in turn can be used to derive useful justified predictions and/or
to optimize economic considerations. It is our belief that when supplied with the results of these
analyses, an engineer can make better maintenance decisions.

Figure 5.5 Principle of EXAKT™.
5.5.1 Marginal Analysis in EXAKT™

For a multi-component system, or a system with multiple failure modes, the software has an
option called Marginal Analysis. Under this option, for a single set of data, separate models can
be built for different components (or failure modes) and then integrated to produce one general
decision model.
Separate analyses of different components (or failure modes) can support better planning and
scheduling of preventive maintenance activities, more targeted work orders, possibilities for
opportunistic preventive maintenance, etc. However, marginal analysis requires additional
information on the lifetime history of equipment, such as classification of failure events, which
might not always be accessible.
One of the case studies undertaken by the CBM lab was intended to analyze performance of
Diesel Engines employed on ships. As many as ten different failure modes have been defined,
five of which have been found related to the available condition monitoring data (oil analysis
data) collected by the user over the years. If ignored, interactions between different causes of
failure could have led to a conclusion that time was not a significant risk factor for the engine. At
the same time, when separated, analyses of different failure modes showed that at the component
level it was possible to build time-dependent statistical models and, thus, derive more targeted
policies for component replacements. In terms of the system, it translated into a component
replacement strategy which yielded a 20%-50% improvement (depending on the ratio of costs of
planned and failure replacements) in the long-run cost per unit time as compared with the
Run-to-Failure strategy.
A challenge remains: to develop theory revealing the relations between different components (or failure
modes) within a system. This problem, among others, is one of the current research interests of
the CBM lab. An approach to the analysis and modeling of complex systems, as well as a review of
the literature, can be found for example in Lugtigheid et al [2004].
5.6 Conclusion
The growing competitiveness in the industrial world is driving the interest in improvement of
asset effectiveness. Application of condition monitoring techniques is growing and produces a
challenge to develop appropriate decision making strategies. Statistical modeling of acquired data
and economic considerations of maintenance activities have proven to be useful for making
evidence-based decisions and building justified predictions for the future behavior of the
equipment. Development of theoretical optimization models should be followed by the
development of software for analysis of condition-monitoring and equipment lifetime data in
order to ensure successful implementation of new techniques in industry.
5.7 References
[1] Anderson, M., Jardine, A.K.S., and Higgins, R.T., (1982), The use of concomitant variables in reliability
estimation, Modeling and Simulation, Vol 13, pp 73-81
[2] Armijo, Y.M. (Editor), (2005), Interpretation of inspection data emanating from equipment condition
monitoring tools: Method and software in Mathematical and Statistical Methods in Reliability, World
Scientific Publishing Company
[3] Banjevic D., Jardine A.K.S., Calculation of reliability function and remaining useful life for a Markov
failure time process, IMA Journal of Management Mathematics, [Online] doi:10.1093/imaman/dpi029,
2005.
[4] Banjevic, D, Jardine, A.K.S., Makis, V and Ennis M., (2001), A control-limit policy and software for
condition-based maintenance optimization, INFOR, Vol 39, pp 32-50
[5] Barlow R.E., Hunter L.C., Optimum preventive maintenance policies, Operations Research, Vol. 8, pp. 90–
100, 1960.
[6] Chevalier, R., Benas, J-C, Garnero, M.A., Montgomery, N, Banjevic, D. and Jardine, A.K.S.(2004)
“Optimizing CM Data from EDF Main Rotating Equipment Using Proportional Hazard Model”,
Surveillance5 Conference, France, October 11- 13, 2004.
[7] Cox, D.R., (1972), Regression models and life tables (with discussion), J.Roy. Stat. Soc. B, 34, 187-220
[8] Dunn, S., Condition monitoring in the 21st century, [Online]
http://www.plant-maintenance.com/articles/ConMon21stCentury.shtml, 2005.
[9] Jardine, A.K.S., Banjevic, D., Montgomery, N., and Pak A, Repairable system reliability: recent
developments in CBM optimization, International Journal of Performability Engineering, (in press)
[10] Jardine, A.K.S., Banjevic, D., Wiseman, M., Buck, S, (2001), Optimizing a mine haul truck wheel motors’
condition monitoring program, Journal of Quality in Maintenance Engineering, No 1, pp. 286-301.
[11] Jardine, A.K.S., Joseph, T and Banjevic, D, (1999), Optimizing condition-based maintenance decisions for
equipment subject to vibration monitoring, Journal of Quality in Maintenance Engineering, Vol. 5, No. 3, pp
192-202
[12] Jardine, A.K.S., Kahn, K., Banjevic, D., Wiseman, M. and Lin, D. (2003), An Optimized Policy for the
Interpretation of Inspection Data from a CBM Program at a Nuclear Reactor Station, COMADEM,
Sweden, August 27-29
[13] Jardine, A.K.S., and Tsang, A. H. C., Maintenance, Replacement, and Reliability: Theory and Applications,
CRC Press, Taylor and Francis, 2006
[14] Kalbfleisch, J.D., and Prentice, R.L., (1980) The statistical analysis of failure times, Wiley
[15] Lugtigheid D., Banjevic D., Jardine A.K.S., Modelling repairable system reliability with explanatory
variables and repair and maintenance actions, IMA Journal of Management Mathematics, Vol. 15, pp. 89–
110, 2004.
[16] Makis, V., Jardine, A.K.S., (1992), Optimal Replacement in the Proportional Hazards Model, INFOR, Vol.
20, pp 172-183
[17] Wiseman, M. (2004), Private communication
5.8 Biography
Andrew K.S. Jardine, Ph.D., C.Eng., M.I.Mech.E., M.I.E.E., P.Eng. is Professor and Principal Investigator at the
Condition-Based Maintenance (CBM) Laboratory at the University of Toronto where the EXAKT software for
CBM optimization and the SMS software for the optimization of emergency spares have been developed. The
CBM Laboratory is funded by the following 10 organizations. From Canada: ABB, Department of National
Defence, Diavik Diamond Mines, Dofasco Steel, Hydro One, INCO, Irving Pulp and Paper, Syncrude Canada,
Teck Cominco and internationally: the Ministry of Defence (U.K.). CBM lab details can be found at
www.mie.utoronto.ca/cbm. Dr. Jardine also serves as an advisor to IBM’s Asset Management Centre of
Excellence.
Dr. Jardine is the author of the economic life software AGE/CON and PERDEC that is licensed to organizations
including transportation, mining, electrical utilities, and process industries and is author of the OREST software
used for optimizing component preventive replacement decisions and forecasting demand for spare parts.
Professor Jardine wrote the book, “Maintenance, Replacement and Reliability”, first published in 1973 and now in
its 6th printing. He is the co-editor with J.D. Campbell of the 2001 published book Maintenance Excellence:
Optimizing Equipment Life Cycle Decisions. His new book “Maintenance, Replacement & Reliability: Theory and
Applications”, co-authored with Dr. A.H.C. Tsang, was published by CRC Press, 2006.
Professor Jardine was the 1993 Eminent Speaker to the Maintenance Engineering Society of Australia
and in 1998 was the first recipient of the Sergio Guy Memorial Award from the Plant Engineering and
Maintenance Association of Canada in recognition of his outstanding contribution to the Maintenance
profession. He is listed in Who’s Who in Canada. (jardine@mie.utoronto.ca )

IEEE Tutorial on Asset Management – Maintenance and Replacement Strategies, June 24-28,
2007, Tampa, USA
6 Computer program for decision support in the management of equipment maintenance
Dr. G.J. Anders, Fellow IEEE
Kinectrics Inc.,
Toronto, Canada
Abstract – In the new business environment of competition and re-regulation, the determination of asset
values and methods of reaching the best investment decisions are of increasing interest. Traditional
approaches to establishing maintenance and replacement expenditures can no longer satisfy regulators or
bottom-line-driven decision-makers. Quantitative methods are needed which combine technical factors
with financial and business risk factors. This document describes three closely related computer programs
for selecting optimal maintenance policies. A combination of probabilistic, financial and engineering
information is used to compute the effects of several maintenance programs on equipment reliability and
the incurred costs. In two programs the approach is based on the evaluation of asset life curves which
describe equipment condition as a function of time. The computation of costs involves an analysis of the
present value, over a given time horizon, of future capital investment, and the costs of maintenance and
possible failures. A description of the features of the computer software implementing the above concepts is
presented, together with a numerical example. The third program looks at a more general picture of the
optimal timing for major interventions requiring large capital investments for power equipment
refurbishments. A numerical example is also included.
6.1 Introduction
In the emerging operating environment of deregulation and market-based competition, every
management decision involves a certain amount of risk. These risks need to be evaluated and
courses of action selected so that they are minimized.
For quantitative risk evaluations analytical tools are necessary. This paper describes two closely
linked computer programs for decision support in the area of equipment maintenance. For the
maintenance (or asset sustainment) function at an electric utility, the following question is of
particular interest: Faced with multiple options for re-investment in equipment maintenance,
what is the best course of action to take in order to maximize reliability at minimum cost?
Typical options could be, (1) to continue present maintenance policy; (2) to do nothing; i.e., to
run the equipment in the future without any maintenance; (3) to perform major overhaul, followed
by the original or a modified maintenance policy; (4) to replace aging or failed equipment with a
new one and apply the original or a modified maintenance policy for the replacement.
The decision-maker can use several criteria for selecting the best re-investment policy. In the
past, engineers operating an electric power system were mainly concerned about equipment
reliability, with the financial aspect playing a secondary role. However, in the new economic
environment the reliability and financial aspects of system operation will be equally important.
Hence, both reliability and cost should be considered in the selection of maintenance alternatives.
With this in mind, a substantial effort has been put into developing suitable decision support tools to
address the question of option selection. The following sections build on earlier studies [1, 2] and
describe three programs, AMP (Asset Management Planner), ARM (Asset Reliability Modeling)
and LcmPlus, and their application in the re-investment decision process.
6.2 Asset Management Planner (AMP) Program
The Asset Management Planner is a computer program designed for equipment exposed to
deterioration but undergoing maintenance at prescribed times [1]. It computes the probabilities,
frequencies and mean durations of the states of such equipment. The basic ideas in the AMP
model are the probabilistic representation of the deterioration process through discrete stages, and
the provision of a link between deterioration and maintenance (Figure 6.1). Clearly, maintenance
is expected to slow the rate of deterioration. In most applications, it is sufficient to represent
deterioration by three stages, an initial (D1), a minor (D2), and a major (D3) stage. This last is
followed, in due time, by equipment failure (F) which requires extensive repair or replacement. A
detailed description of the principles governing construction of an AMP model is given in
Chapter 3 by Dr John Endrenyi; the present chapter concentrates on the computer implementation
of this model.
Figure 6.1 Representation of deterioration and maintenance in the AMP model.
6.2.1 Studies of the effect of changes in maintenance policies

6.2.1.1 Input data requirements
The input information needed to apply the model in this program includes the mean durations of
the various stages of deterioration and of the inspection and maintenance activities, and the
probabilities associated with the various choice and outcome possibilities. Estimates of these
quantities are based on historical experience with similar units operated in similar conditions:
usually they are obtained by analyzing the records of all abnormal operating conditions,
maintenance activities and their results, and general observations from maintenance personnel.
6.2.1.2 Computed values

In order to assess the effect of maintenance policies on the remaining life and the associated cost,
a number of cases are normally examined. The first is a base case study which models the present
maintenance policy. The other cases may range from consideration of no maintenance at all to a
full replacement of the equipment in question. The results represent the remaining life of the
equipment, the probabilities of residing in each deterioration state and the expected lifetime cost
of the equipment. Sample output from the program is shown in Figure 6.2a.
In addition to displaying the remaining life of the equipment, a sensitivity study is usually carried
out to explore the effects of changing the inspection frequency. Figure 6.2b shows how the
remaining life from entering the initial stage of deterioration varies with the time between
inspections. The curve indicates that, not surprisingly, the higher the frequency of inspections,
the longer the life of the equipment. There is, however, a price to be paid for the extension of
the equipment life. To explore the cost aspect of the maintenance policy, the sensitivity study can
be repeated for costs, with sample results shown in Figure 6.2c.
The cost change for longer times between inspections is composed of two components: (1) a
decrease in the maintenance cost caused by the reduction of maintenance activities, and (2) an
increase in the total costs caused by an increase in the number of replacements of failed
equipment. In this study, the latter part of the cost is always smaller than the former. Therefore,
increasing intervals between inspections will result in a decrease of the total maintenance and
replacement costs.
Figure 6.2 Sample results from the AMP program: (a) remaining life and expected cost,
(b) sensitivity analysis: remaining life as a function of the time between inspections and
(c) expected cost as a function of the time between inspections.
As routinely used, these approaches will yield only the mean values of the durations involved.
Often, however, questions of the type, "what is the probability that the remaining time (to failure)
is shorter (or longer) than a given value?", need to be answered. For this, the probability
distributions of these times are required in addition to their mean values. These distributions can
be obtained by an extension of the standard Markov and first passage time (FPT) techniques
through Monte Carlo simulation [1].
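The kind of question quoted above can be illustrated with a small Monte Carlo simulation. The sketch below is a deliberately reduced version of the deterioration model and is not the AMP algorithm: it assumes no inspection or maintenance, and exponentially distributed times in stages D1, D2 and D3 with hypothetical mean durations. It estimates the probability that the time from entering D1 to failure F is shorter than a given value.

```python
import random

def simulate_time_to_failure(mean_durations, rng):
    """One sample of the time from entering D1 to reaching failure F."""
    # Sum of the sojourn times in the successive deterioration stages
    return sum(rng.expovariate(1.0 / m) for m in mean_durations)

def prob_failure_before(x, mean_durations, n_samples=100_000, seed=1):
    """Monte Carlo estimate of P(time to failure < x)."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    hits = sum(simulate_time_to_failure(mean_durations, rng) < x
               for _ in range(n_samples))
    return hits / n_samples

# Hypothetical mean durations (years) of stages D1, D2 and D3
MEAN_DURATIONS = (10.0, 6.0, 4.0)

p = prob_failure_before(15.0, MEAN_DURATIONS)
print(f"Estimated P(time to failure < 15 years) = {p:.3f}")
```

The mean of the simulated times equals the sum of the stage means, which is what the mean-value calculation alone would report; the simulation adds the spread around that mean.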
6.2.2 Generation of life curves

Equipment at substations, generating plants or transmission lines age with time and in-service
duty and the probability of failure generally increases with time and usage as well. A convenient
way to represent the aging process is by the life curve of the equipment. Such a curve shows the
relationship between asset condition, expressed in either engineering or financial terms and time.
Since there are many uncertainties related to the prediction of equipment life, probabilistic
analysis must be applied to construct and evaluate life curves. This analysis directly integrates
into established classical decision and financial analysis methods. The objective is to determine
the type of optimal asset sustainment action, and the year this action is to take place, so that NPV
is maximized while not violating financial and reliability constraints. The subject is treated in
more detail in Chapter 3 by Dr John Endrenyi; only a brief introduction to how the life
curves are generated in the AMP computer program is given below.
The generation of a life curve requires several steps, described below.
1. In the first step, decisions are made about where the borderlines lie between deterioration
stages D1, D2 and D3, in terms of the (percentage) equipment condition. The results are
entered into the program as shown on the screen in Figure 6.3, and marked on the vertical
axis of the life curve diagram.

Figure 6.3 Selection of asset condition ranges for deterioration stages.

2. Next, input data are collected for the AMP model and FPT calculations are carried out by
the program, to determine the first passage times between states D1 and D2, D1 and D3,
and D1 and F. These are entered on the time axis of the life curve diagram.
3. The life curve must lie somewhere between the two borders. At time 0 it should be at
100%; at D1F it should be 0. At the remaining two ordinates, by arbitrary decision, it will
be at the midpoint of the respective domains, as shown in Figure 3.6.
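The three steps above can be sketched in a few lines of Python. This is only an illustration, not the AMP implementation: the function names, the example first passage times and the example condition ranges are all assumptions, and which condition ranges count as the "respective domains" is left to the caller.

```python
def midpoint(domain):
    """Midpoint of a (low, high) condition range, in percent."""
    return 0.5 * (domain[0] + domain[1])


def life_curve_knots(fpt_d2, fpt_d3, fpt_f, domain_a, domain_b):
    """Knots (time, condition in %) of a piecewise-linear life curve.

    fpt_d2, fpt_d3, fpt_f: first passage times from D1 to D2, D3 and F.
    domain_a, domain_b: condition ranges whose midpoints are used at
    fpt_d2 and fpt_d3 (hypothetical inputs chosen by the analyst).
    """
    return [
        (0.0, 100.0),                  # as-new condition at time 0
        (fpt_d2, midpoint(domain_a)),
        (fpt_d3, midpoint(domain_b)),
        (fpt_f, 0.0),                  # failure
    ]


def condition_at(t, knots):
    """Linearly interpolate the asset condition at time t."""
    if t <= knots[0][0]:
        return knots[0][1]
    for (t0, c0), (t1, c1) in zip(knots, knots[1:]):
        if t <= t1:
            return c0 + (c1 - c0) * (t - t0) / (t1 - t0)
    return knots[-1][1]  # beyond the expected failure time


# Illustrative numbers only
knots = life_curve_knots(10.0, 20.0, 30.0, (40.0, 60.0), (10.0, 30.0))
print(condition_at(5.0, knots))
```

Linear interpolation between the knots is itself an assumption; the text only fixes the knots, not the shape of the curve between them.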
6.2.3 Finding an optimal inspection interval

Comparing the curves in Figure 6.2(b) and Figure 6.2(c), it is not obvious which inspection
policy is best in this case. On the one hand, an increase in the interval between inspections
reduces the remaining life of the equipment; on the other hand, it decreases the life cycle
costs. Recently, a new mathematical model for the selection of an optimal maintenance
policy using the AMP model has been proposed [5]. The original model proposed in [1]
presented a method for calculating the remaining life of equipment. The paper [5] defines several
possible optimization procedures for finding the best maintenance policy and demonstrates an
implementation of the simulated annealing algorithm for this purpose. The whole procedure is
illustrated on a practical numerical example involving high voltage circuit breakers.
The objective function is composed of three components: (1) the Remaining Life of Equipment
represented in the model as the First Passage Time (FPT) from the current deterioration state to
the failure state, (2) The Life Cycle Costs represented as the cost of maintenance and failure, and
(3) equipment Unavailability. The goal is thus to define an optimization model that would
minimize a function of these three parameters, that is:
$$F(\mathbf{r}) = \min f(\text{total\_cost},\; -\text{FPT},\; \text{unavailability}) \tag{6.1}$$
Vector r symbolizes parameters of the model that can be varied and are all related to the amount
of money that the utility is willing to spend on the maintenance activities for a particular piece of
equipment. Thus, the model assumes that putting more money into maintenance activities can
result either in faster repairs or more thorough work or both. More thorough work is translated
into an increased probability of the equipment ending up in a better deterioration state after the
repair.
Function f is a special function that transforms the three parameters so that they are expressed
in the same units of measurement. In addition, the model allows representation of various degrees
of risk aversion of the person performing the analysis.
The results of the analysis are new parameters of the semi-Markov model and the revised
(optimized) values of the first passage times, component unavailability and the lifetime cost.
6.3 Asset Reliability Model (ARM) Program

The aim of this program is to use the information provided by the life curves generated by AMP
to help in selection of the maintenance policy option.
6.3.1 Input required

A typical study involves the establishment of a semi-Markov model for the equipment in
question, including maintenance states and appropriate transitions to and from them. The study
can be directed to analyze life curves, cost curves, or probabilities of failure. These are carried out
by inputting data and other information with the help of a series of screens. The first screen asks
for instructions as to the analyses to be performed and the results to be displayed. On the next
three screens input data can be entered. This data can be automatically copied from the AMP
program. The results are shown in the form of graphs as illustrated in the example that follows.
On the screen, these graphs appear in color.
6.3.2 Cost computations

In many financial evaluations, the costs are expressed as present value quantities. The present
value approach is also used in this study because maintenance decisions on aging equipment
include timing, and the time value of money is an important consideration in any decision
analysis. In selecting the best course of action, the proposed alternatives are compared with some
reference action. The corresponding cost difference is often referred to as the Net Present Value
(NPV). In the case of maintenance, the NPV can be obtained for several re-investment options,
which are then compared with the present maintenance policy.
Cost computations involve calculation of the following cost components:
1. cost of maintenance activities,
2. cost of the action selected (overhaul or replacement),
3. costs associated with failures (cost of repairs, system cost, penalties).
The costs are given as Present Value (PV). To compute the PV, inflation and discount rates are
required for a specified time horizon. The time horizon is a period of time, starting at the present
and ending after a chosen number of years, for which the costs of the various operating and
maintenance options are calculated and compared. The costs associated with equipment failure
over the time horizon are computed as the sum of two components: one for failures that occur
before the action is taken (during the delay period) and one for failures that occur after. These
costs are multiplied by the probabilities of failures before and after the action, respectively, and
the two products are added.
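The present value and failure cost calculations described above can be illustrated with a minimal Python sketch. It assumes one common convention, in which a cost quoted in today's dollars is escalated by inflation and then discounted back to the present; the text does not state the exact formula used by ARM, and the function names and numbers below are hypothetical.

```python
def present_value(cost_today, year, inflation, discount):
    """PV of a cost quoted in today's dollars and incurred `year` years from now.

    Assumed convention: escalate by inflation, then discount.
    """
    return cost_today * ((1.0 + inflation) / (1.0 + discount)) ** year


def expected_failure_cost(p_before, cost_before, p_after, cost_after):
    """Failure cost split around the action date.

    Each cost is weighted by the probability of failure in its period
    (before or after the action) and the two products are added.
    """
    return p_before * cost_before + p_after * cost_after


# Illustrative numbers only
pv = present_value(1000.0, 3, inflation=0.03, discount=0.05)
cost = expected_failure_cost(0.1, 20000.0, 0.2, 15000.0)
print(round(pv, 2), round(cost, 2))
```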

6.3.3 Sample application: maintenance of high voltage air blast breakers
6.3.3.1 General
This study involves the analysis of several breakers with a total operating history of about 100
breaker-years [1]. According to the current policy, three types of maintenance are routinely
performed on each breaker. About every eight months, minor maintenance is performed
involving timing adjustments and lubrication at a cost of about $700. Its average duration is 0.25
day. Medium maintenance involving replacement of some parts, taking on the average 2 days at a
cost of about $6000, is performed approximately every ten years. Major maintenance involving
breaker overhaul takes place every twelve years with an average duration of 22 days and a cost of
about $75,000.
It follows that in this application a simplified form of the AMP model is used: instead of having
regular inspections, and maintenance performed only as needed, the various types of maintenance
are performed at regular (but still stochastically determined) intervals. Note that if at a given
point in time it is decided that the optimal maintenance policy is, say, to perform overhaul as
soon as possible and then continue by resuming the original maintenance routine, this overhaul is
out of step with the original policy and incurs extra costs. As mentioned before, other alternatives
include making no changes in the maintenance policy, stopping all maintenance altogether, or
installing a new breaker.
6.3.3.2 Financial information

The financial assumptions used below are usually well established in the approved financial
procedures of a corporation, and are available to the engineer. These assumptions must be
included because they have significant effect on the impact of re-investment action timing.
Generally, two sets of financial assumptions must be considered. The first set concerns the time
value of the dollar. It includes the projected inflation rate to account for the eroding value of
money with time, as well as the corporate discount rate used to set a required return on
investment. The second set has to do with the composite income tax and the property tax rates. In
the example presented here, only the first set of financial assumptions is considered with the
following numerical values.
Time horizon: 10 years
Inflation rate: 3%
Discount rate: 5%
The system and penalty costs associated with equipment failure are assumed to be $10,000 each.
In order to calculate the effect of the proposed action, we need to specify the asset condition, or
asset value, at “present time” (the beginning of the time horizon). In this example, it is at 80%
which, for the given equipment, corresponds to 20 years of service. This information determines
where the equipment is located on the life curve.
6.3.3.3 Engineering information
The engineering information required is a simple description of the current maintenance practices.
In the breaker example, the three types of maintenance routines mentioned above are modeled.
In order to analyze re-investment alternatives, possible options need to be defined. Such options
were discussed before. More can be added or some deleted. In case of failure, the user has a
choice of either repairing or replacing the equipment. In the former case, the condition of the
asset after repair has to be specified. In case of replacement, a new equipment type can be
entered, if desired.
The option “Continue As Before” represents the current maintenance policy and cannot be
deleted. This option does not require any additional parameters. The “Do Nothing” option
(named “Stop All Maintenance” in this example) requires only one additional parameter, the
delay period after which this “action” is implemented. The “Overhaul” and “Replacement”
options require three parameters: delay, cost of action, and the state of the equipment after the
action has been taken (in the replacement case, it is assumed that the equipment returns to the
100% condition level). Note that in these cases, the overhaul and replacement actions are carried
out just once, and it is assumed that after the action regular maintenance continues either in the
original or in a changed form. In the present example, the policy will slightly change: the minor
maintenance after overhaul or replacement will be performed once every 15 months rather than
every eight months.
With the financial and engineering data specified, the calculation of reliability and costs can now
proceed.
6.3.3.4 Life curves
Figure 6.4 shows three typical life curves for the selected breaker. Curve (a) describes the
existing maintenance policy as calculated by the program (action: “Continue As Before”). Curve
(b) is valid for the “reduced” maintenance policy where minor maintenance is performed less
frequently, as specified above. Curve (c) describes conditions where no maintenance is
performed at all. Note that the life curves always start at 100% asset condition and the policies
shown end when a failure occurs.
Figure 6.4 Life curves computed by the program: (a) present maintenance policy,
(b) reduced maintenance policy, (c) no maintenance.
The curves can be edited manually by inserting or deleting points. In the present study it is
assumed that the mid-point for the as-new stage, in terms of asset conditions, is 68%, for the
minor deterioration it is 25%, and for major deterioration 8% (not shown in the figure). Failure is
at 0%.
Figure 6.5 shows two life curves. Curve (a) represents the option where replacement is carried
out after a 3-year delay from the present time and following that, regular maintenance is
continued, but in the “reduced” form. The time horizon is indicated by a heavy line on the time
axis and it begins, as explained before, at the “present time”. In the replacement action, the
equipment is assumed to return to the “as new” conditions. Since the new policy prescribes
maintenance less frequently, the 40-year life expectancy of a new breaker (Figure 6.4) shrinks to
26 years, thus the replacement after 23 years of a fairly good breaker results in an only slightly
improved expected life of 49 years.
Figure 6.5 Life curves with (a) replacement, (b) no maintenance, after a 3-year delay.
Curve (b) in Figure 6.5 represents the interesting situation where a decision is made to abandon
all maintenance activities altogether after a 3-year delay period. Upon equipment failure repair is
performed that brings the equipment to an assumed 90% of its original condition but, again, no
further maintenance is performed afterwards. Since the entire life curve for a breaker without any
maintenance is about 7 years (see Figure 6.4), the repair after failure adds only about six years to
the life of the equipment.
6.3.3.5 Cost diagrams
Cost computations involve the calculation of the expected numbers of failures and the various
types of maintenance activities during the specified time horizon. These expected numbers are
computed separately for the periods before and after the action. The cost of each maintenance
activity is then expressed by its present value. The probabilities of failures before and after the
action are either computed by the program or entered by the user if the life curves are specified by
him/her. The cost curves are then presented as functions of the delay.
Figure 6.6 illustrates the present costs for all options with a three-year delay for each. This
diagram shows that in the case of a 3-year delay in starting a new policy, the best action (of those
considered) is to continue with the original maintenance policy. The expected cost of this is
$100,000 for the 10-year time horizon. The costs are the highest for the “Stop All Maintenance”
option because the probability of failure after 3 years is much higher than for the other options.
The maintenance cost is high for the “Continue As Before” policy because minor maintenance is
performed quite often and, during the time horizon, a major maintenance can also be expected to
occur.

Figure 6.6 Cost diagram for various actions performed after a three-year delay.
6.3.3.6 Probability of failure
In order to compute the expected costs during the specified time horizon, the probability of failure
within this time-period is required. The probabilities for each option are computed by ARM and
can be displayed as functions of the delay. The before-action and after-action values can be
obtained separately, or in a composite form [2].
6.3.3.7 Sensitivity studies
Looking at the results in Section 6.3.3.5 the question arises how “robust” the findings are if some
of the input values are subject to uncertainty. To find an answer, several of the inputs were varied
to see how these changes affect the costs of options, and the selection of the preferred option.
Some of the results are shown in Figure 6.7. The diagrams indicate the present values of the costs
associated with each option for two time horizons, 10 years and 20 years, and for a range of
delays in time before the actions are implemented. One can observe that the option “Continue As
Before” is the least expensive, closely approached by the “Do Overhaul” option in certain ranges.
Thus, in this example, “Continue As Before” appears to be a “robust” choice.
The sudden jump at 4.5 years occurs because during the delay period the original maintenance
policy is continued and in the course of this a major maintenance is expected at 4.5 years. If the
delay is less, this major maintenance will not happen because the maintenance schedule is
restarted at the time of action.

Figure 6.7 Costs of options for two time horizons in terms of the delay in action.
It is interesting to find that costs are not at a minimum if action is taken without delay. This is,
partly, because the present value of the cost of action becomes less if the action is delayed, and
partly, because during the delay period the comparatively cheap original maintenance policy is
applied.
If the inflation rate is varied between 1 and 10% (and the corresponding discount rate between 1.5
and 14%), all curves show a maximum near 2% inflation; and the “Continue” option is still the
most desirable in every case. The latter also holds if the rate of minor maintenance after action is
varied between 0.5 and 2 occurrences per year (this range includes the rate of 1.5 per year which
represents the case of no change in maintenance policies after action). The costs over this range
vary hardly at all.
6.4 Optimal refurbishment strategy

Both programs described above deal with analysis of various maintenance scenarios. Maintenance
is aimed at slowing down the deterioration process of the equipment. The problem of ageing
equipment is a universal engineering concern. Every system, structure or component (SSC) is
designed to function for some specified period of time; the actual degree of deterioration during
this specified time, however, will strongly vary by equipment and application. Carefully planned
use of equipment, including appropriately selected maintenance policies, can reduce the number
of failures and, thus, result in considerable savings. Moreover, it can prolong useful equipment
life, thereby increasing equipment reliability. Obviously, the savings obtained must be balanced
against the costs incurred by employing a possibly more costly maintenance plan.

The following developments present a software tool, which enables Life Cycle Management
(LCM) analysis. The first version of the software [6-7] allows for the comparison of up to four
alternative plans. This is done by computing, through simulation, the present values of the costs
expected for each alternative, and also the Benefit to Investment Ratios (BIR) for alternatives B,
C and D, assuming that plan A forms the base case (usually the present plan). In comparisons of
this type, results are relative, not absolute numbers. Relative results minimize the effect of
erroneous statistical inputs as all alternatives are affected the same way.
The above approach was implemented in the Electric Power Research Institute (EPRI) LCM
studies using a program called LcmVALUE [8-11], completed by several nuclear power plants in
the US. Deterministic calculations compared the total NPV of up to four alternatives. In another
approach, most of the parameters were treated as random variables with triangular distributions
and Monte Carlo simulations were performed to obtain the mean values of these costs [11]. The
industry has developed LCM evaluation tools, including Westinghouse’s Proactive Asset
Management (PAM), and the EPRI/STP/ABS Risk-Informed Asset Management (RIAM)
method. The methods are distinctly different tools with unique features, yet each assists in LCM
planning for important SSCs.
The LcmPlus program is an extension of the EPRI’s approach whereby an optimization is
performed to find the best timing of the possible investments. A genetic algorithm belonging to a
class of evolutionary optimization methods is employed to minimize the total life cycle cost of the
system, structure or components (SSCs). The life cycle cost includes operation, proactive
maintenance, cost of failure (corrective maintenance and lost revenue) plus the major investment
costs planned for the future. Thus, instead of defining various scenarios for timing of future major
refurbishments, the software finds the optimal timing of all investments planned for the SSC.
Since most of the parameters entering the analysis are not known with certainty, they are treated
as random variables with prescribed probability distributions and the whole process is treated as a
stochastic optimization problem.
6.4.1 Optimization of the timing of investments for LCM of SSCs

The problem that the software models can be briefly described as follows. In order to keep a
particular SSC in good operating conditions, the company monitors its operation and performs
routine predictive maintenance. In spite of the best company efforts, the equipment occasionally
fails. We are interested only in those failures whose occurrence is caused by equipment
deterioration because through improved maintenance activities we hope to reduce the rate at
which the SSC fails. The most promising way of achieving this goal is through system
refurbishments. Such refurbishments may have various beneficial effects for the operation of the
SSC. For example, a replacement of the major parts of the SSC or the installation of new
monitoring equipment may reduce the failure rate or can reduce the time the equipment will be
out of service following a forced outage.
The usual practice in the LCM of the SSCs is to postulate several possible investment
alternatives. Each alternative will result in a predefined outcome, such as a reduction of the
equipment failure rate or outage duration or both. Such investments are usually very costly and
because of the usual financial constraints, they are staggered in time. We will assume that each
investment can occur in a predefined time interval, which in the most general case may span the
period from the present moment to the end date of the study. Our objective is to minimize the
total life cycle costs (or maximize BIR) by optimally timing the investments. The constraints will
describe the intervals during which the refurbishments can take place.
This is a fairly complex optimization problem because each investment has a different effect on
the outage costs and some of those effects may be cumulative. LcmPlus employs a genetic
algorithm for this purpose. The evolutionary algorithms seldom find the absolute minimum of the
objective function but usually give a value in a close vicinity of it. Therefore, after the
neighbourhood of the absolute minimum is established, additional classic non-linear optimization
is performed to home in on the best timing of the investments. The approach yields a set of dates
at which various refurbishments will be undertaken and associated costs. It should be pointed out
that it is quite possible that some investments will not be selected at all.
As already mentioned, some quantities are treated as random variables. This further complicates
the search of the best timing of the investments. The following procedure is used to perform the
stochastic optimization.
1. Enter the most probable values of all the input variables and define their probability
distributions. Any distribution can be used.
2. Select time intervals during which major investments can take place. Costs of such
investments can also be treated as random variables.
3. Select via Monte Carlo simulation the values of all random variables.
4. Perform optimization to find the best timing of the investments.
5. For the optimal set of investment dates obtained in step 4, perform Monte Carlo
simulation a selected number of times and compute the mean value of the NPV of the total
cost and of the benefit-to-investment ratios.
6. Repeat steps 3 to 5 a specified number of times.
7. From the results obtained select one set that gives the optimal value of NPV cost or BIR.
The above process is summarized graphically in Figure 6.8.
If the number of MC simulations in each of the first and the second runs is equal to N, then, in
the worst case, the program will need to perform N^2 MC runs and N optimizations. Through
numerous tests it has been determined that N = 10000 gives satisfactory results. Thus, up to
100,000,000 MC simulations may be required. This number is much reduced in practice since
many of the date sequences are not acceptable because the order of investments on the same
SSC may be predetermined. Block four in Figure 6.8 selects allowable sequences. The details of
the calculations are described in the following sections.
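The nested structure of steps 3 to 7 can be sketched as follows. This is a toy illustration under loudly stated assumptions, not the LcmPlus code: there is a single investment, an exhaustive search over years stands in for the genetic algorithm of step 4, and the cost model (investment cost, failure cost, halving of the failure rate, discount rate) is entirely hypothetical.

```python
import random


def total_cost(invest_year, lam, years=20, invest_cost=300.0,
               failure_cost=800.0, rate=0.06):
    """Toy discounted life cycle cost; investing halves the failure rate."""
    cost = 0.0
    for k in range(1, years + 1):
        df = (1.0 + rate) ** -k                      # discount factor for year k
        invested = invest_year is not None
        if invested and k == invest_year:
            cost += invest_cost * df
        lam_k = 0.5 * lam if invested and k > invest_year else lam
        cost += lam_k * failure_cost * df            # expected failure cost
    return cost


def stochastic_optimization(n_outer, n_inner, rng, years=20):
    """Return (investment year or None, mean cost) of the best run."""
    candidates = [None] + list(range(1, years + 1))  # None = never invest

    def draw():
        # random.triangular takes (low, high, mode)
        return rng.triangular(0.01, 0.10, 0.03)

    best = None
    for _ in range(n_outer):                         # step 6: repeat steps 3 to 5
        lam = draw()                                 # step 3: sample random inputs
        timing = min(candidates,                     # step 4: optimize the timing
                     key=lambda t: total_cost(t, lam, years))
        mean = sum(total_cost(timing, draw(), years) # step 5: inner Monte Carlo
                   for _ in range(n_inner)) / n_inner
        if best is None or mean < best[1]:           # step 7: keep the best set
            best = (timing, mean)
    return best


print(stochastic_optimization(20, 50, random.Random(1)))
```

With n_outer = n_inner = N the inner evaluation runs N x N times, which is the worst-case count of N^2 MC runs quoted above.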

Figure 6.8 Flowchart of the stochastic optimization process.
6.4.2 Objective function

6.4.2.1 Cost computations
The total cost of a plan is composed of many components, including fixed and variable planned
costs as well as the unplanned costs of failures. In a probabilistic analysis, the last component is
computed by multiplying the equipment failure rate by the repair expenses and consequential
costs of failures. One of the greatest challenges of LCM is to build a probabilistic model of the
SSC’s failure rates that would be as close as possible to reality.
In comparing alternative plans, the most obvious way is to look at the total cost of each. The total
cost of an alternative is composed of the following two components:
1. Cost of maintenance (MC)
2. Cost of failure (FC)
In addition to regular maintenance costs, each alternative may have one-time costs associated
with the selected maintenance action. For example, in the case examined later, such costs would
include the purchase of a spare rotor, or the rewinding of a stator. Similarly, the cost of failure
can be composed of several components. The total cost, TC, of an alternative is equal to
TC = MC + FC (6.3)
The MC expenses include ongoing yearly costs (YC), planned refurbishment costs (RFC) and
special one-time costs (SC). Components of YC are the engineering expenses, operating
expenses and costs of craftsmen, all computed as man-hours times rates. Rates may change
yearly. To be added are the costs of subcontractors, materials, and other expenses not mentioned
above. The RFC costs are computed similarly, except that now the costs pertain to refurbishment
rather than to the ongoing yearly costs.
The second component in equation (6.3) represents the costs associated with the SSC failures. This
component includes the costs of repairs (RC), lost production (LPC) and consequential effects
(CC). Further, one needs to estimate the expected number of failures in each year, that is, the
failure rate. This last will be denoted by $\lambda_i(k)$ for failure mode i in year k. The total cost of
failures, FC, is then expressed as
$$FC = \sum_{k=1}^{K} \sum_{i=1}^{n} \lambda_i(k) \left[ RC_i(k) + D_i(k)\, LE(k)\, PL + CC_i(k) \right] \tag{6.4}$$
where
K = number of study years
n = number of failure modes
$D_i(k)$ = average outage duration of failure mode i in year k (h/occ)
LE(k) = PV of the lost production cost per each MWh energy loss in year k ($/MWh)
PL = power lost during each outage (MW)
The middle term of the right-hand side represents LPC, the cost of lost production due to failure
mode i in year k. All cost components are converted to the present values taking into account the
inflation and discount rates.
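Equation (6.4) translates directly into a double loop over years and failure modes. The sketch below is illustrative only: the function name, the nested-list data layout and the numbers are assumptions, and all inputs are taken to be present values already.

```python
def failure_cost(lam, rc, dur, cc, le, pl):
    """Total cost of failures, FC, following equation (6.4).

    lam, rc, dur, cc: nested lists indexed [year][failure mode]
        lam = failure rate (1/y), rc = repair cost ($),
        dur = outage duration (h/occ), cc = consequential cost ($)
    le: lost production cost per MWh for each year ($/MWh)
    pl: power lost during each outage (MW)
    """
    total = 0.0
    for k in range(len(lam)):                 # study years
        for i in range(len(lam[k])):          # failure modes
            lost_production = dur[k][i] * le[k] * pl   # h * $/MWh * MW = $
            total += lam[k][i] * (rc[k][i] + lost_production + cc[k][i])
    return total


# Illustrative two-year, one-mode example
fc = failure_cost(
    lam=[[0.1], [0.2]],
    rc=[[1000.0], [1000.0]],
    dur=[[10.0], [10.0]],
    cc=[[0.0], [500.0]],
    le=[20.0, 20.0],
    pl=5.0,
)
print(round(fc, 2))
```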
6.4.2.2 Benefit to Investment Ratio

Another criterion which is widely used in decision making is the Benefit to Investment Ratio (B/I
Ratio or BIR). In general, BIR is used in comparisons of alternatives, and is defined as the change
in failure costs (benefit), divided by the change in maintenance costs (investment) as one
alternative plan is substituted by another. All values used are present values. Thus,
$$BIR_{AB} = \frac{-\Delta FC}{\Delta MC} \tag{6.5}$$
The negative sign in the numerator accounts for the fact that if the failure costs decrease when
moving from plan A to plan B, the benefits increase. Equation (6.5) can be rewritten as
$$BIR_{AB} = -\frac{FC_B - FC_A}{MC_B - MC_A} \tag{6.6}$$
and substituting (6.3) in the numerator,
$$BIR_{AB} = -\frac{(TC_B - MC_B) - (TC_A - MC_A)}{MC_B - MC_A} = 1 - \frac{TC_B - TC_A}{MC_B - MC_A} = 1 - \frac{\Delta TC}{\Delta MC} \tag{6.7}$$
Values of BIR greater than one indicate that alternative B is better than the reference plan A. On
the other hand, if BIR is less than one, the investment required by plan B would be ineffective.
The BIR can also attain negative values; this occurs when the total value of the alternative is
higher than the total value of the base case.
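As a quick numerical check, the two forms of the ratio (in terms of failure costs and in terms of total costs) can be computed side by side. The function names and the cost figures below are hypothetical and serve only to illustrate the definitions.

```python
def bir(mc_a, fc_a, mc_b, fc_b):
    """Benefit to Investment Ratio of plan B relative to base plan A.

    Computed as minus the change in failure costs divided by the change
    in maintenance costs; all inputs are present values.
    """
    d_mc = mc_b - mc_a
    if d_mc == 0:
        raise ValueError("BIR is undefined when maintenance costs do not change")
    return -(fc_b - fc_a) / d_mc


def bir_from_totals(mc_a, tc_a, mc_b, tc_b):
    """Equivalent form: one minus change in total cost over change in MC."""
    return 1.0 - (tc_b - tc_a) / (mc_b - mc_a)


# Plan B spends 100 more on maintenance and saves 200 in failure costs
ratio = bir(mc_a=100.0, fc_a=500.0, mc_b=200.0, fc_b=300.0)
same = bir_from_totals(mc_a=100.0, tc_a=600.0, mc_b=200.0, tc_b=500.0)
print(ratio, same)
```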

6.4.3 Optimization problem
The optimization problem we are solving involves minimization of the objective function given
by equations (6.3) and (6.4) subject to the constraints representing permissible investment
periods. In our example, the investment period covers the entire study time horizon. Let R be a set
of all possible investments. Some parameters are dependent on a particular combination of the
investments; hence, the objective function takes the form
$$\min_{\mathbf{r} \in R} \sum_{k=1}^{N} \left\{ \sum_{r \in \mathbf{r}} MC_r(k) + \sum_{i=1}^{n} \lambda_i(k, \mathbf{r}) \cdot \left[ RC_i(k, \mathbf{r}) + D_i(k, \mathbf{r})\, LE(k)\, PL + CC_i(k, \mathbf{r}) \right] \right\} \tag{6.8}$$
where $\mathbf{r}$ denotes a sequence of investments in R and $MC_r(k)$ represents the rth investment that
took place in year k. The algorithm considers all feasible investment combinations. In this
application, we assumed that only the failure rate, the repair and consequential costs and the
duration of the outage are affected by the refurbishment scenarios.
As an alternative, the objective function could maximize the BIR value given by (6.7).
6.4.3.1 Evolutionary algorithm

Since the proposed optimization problem involves a class of functions that cannot be defined a
priori (the number of variables varies and is a function of the selected investments), the program
employs evolutionary algorithms that are well suited to handle such situations [12-14]. When
designing the optimization problem one has to remember that the analyzed solutions (dates of
refurbishments) are dependent on additional constraints that eliminate some investments. For
example, if we have two possible investments one involving replacement of the equipment and
the other only the replacement of some parts, the second investment cannot follow the first one
for obvious reasons, whereas the reverse order is permitted.
Evolutionary algorithms use techniques inspired by evolutionary biology such as inheritance,
mutation, natural selection, and recombination (or crossover). Discussion of the implementation
of an evolutionary algorithm for the LCM optimization problem is given in [6].
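To make the ideas of selection, crossover, mutation and ordering constraints concrete, here is a minimal generic genetic algorithm in Python. It is a sketch under stated assumptions and not the implementation discussed in [6]: the encoding (one year per investment, 0 meaning "not selected"), the tournament selection, the operator settings and the toy cost function are all hypothetical choices.

```python
import random


def feasible(dates, must_precede):
    """dates[j] is the year of investment j (0 = investment not made).

    must_precede holds pairs (a, b): if both are made, a must come before b.
    """
    return all(not (dates[a] and dates[b]) or dates[a] < dates[b]
               for a, b in must_precede)


def evolve(cost, n_inv, years, must_precede, rng,
           pop_size=30, generations=60, p_mut=0.2):
    """Minimize cost(dates) over feasible investment dates."""
    def fitness(d):
        # Infeasible sequences are rejected with an infinite cost
        return cost(d) if feasible(d, must_precede) else float("inf")

    pop = [[rng.randint(0, years) for _ in range(n_inv)]
           for _ in range(pop_size)]
    pop[0] = [0] * n_inv                      # "no investment" is always feasible
    best = min(pop, key=fitness)
    for _ in range(generations):
        new_pop = [best[:]]                   # elitism: keep the best so far
        while len(new_pop) < pop_size:
            p1 = min(rng.sample(pop, 3), key=fitness)   # tournament selection
            p2 = min(rng.sample(pop, 3), key=fitness)
            cut = rng.randint(0, n_inv)                 # one-point crossover
            child = p1[:cut] + p2[cut:]
            if rng.random() < p_mut:                    # mutation of one date
                child[rng.randrange(n_inv)] = rng.randint(0, years)
            new_pop.append(child)
        pop = new_pop
        best = min(pop, key=fitness)
    return best, fitness(best)


def toy_cost(dates):
    """Hypothetical cost: skipping an investment costs 100, making it costs less."""
    return sum(100 if y == 0 else 10 + abs(y - 7) for y in dates)


# Investment 0 = full replacement, investment 1 = partial replacement;
# if both are made, the partial replacement (1) must come first.
print(evolve(toy_cost, 2, 20, [(1, 0)], random.Random(3)))
```

Rejecting infeasible sequences through an infinite cost is only one way to handle the ordering constraints; repairing or never generating such sequences would be alternatives.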
6.5 Program description

The web application was developed using the J2EE technology. The project is based on the
Model-View-Controller (MVC) design pattern and is discussed in [6]. The input data consists of
economic information (inflation and discount rates, cost of energy, etc.) and the routine
maintenance and refurbishment costs. The possible investments and their mutual relationships are
also defined and the effects of the investments are specified (e.g., the change of the failure rate or
the outage duration or costs). The required data is illustrated in the numerical example in the next
section.
6.6 Numerical example

The example concerns the LCM plan for a main generator in an electric power station. The
licensed period of the station is 40 years and it is assumed that the license will not be renewed.
The study period starts at year 20 of the plant’s operation; therefore, the pay-off time must be
shorter than the remaining 20 years if the plan is to be successful. The data are based on real-life
industry experience.

6.6.1 Economic data
Table 6-1 Summary of the economic parameters and their bounds for this study.
Parameter                         Lower bound   Nominal value   Upper bound
Replacement Energy Cost ($/MWh)   12            24              72
Discount Rate (%)                 6.75          9               12
Inflation Rate (%)                1.5           3               4.5
Other costs are as follows: labor cost is 60 $/h and engineering cost equals 70 $/h.
6.6.2 Base case equipment parameters

6.6.2.1 Failure rates
For the analysis of the LCM plans for a large generator, four failure modes were considered as
follows.
1. Stator winding and core.
2. Rotor winding, forging and RR.
3. Exciter and voltage regulator.
4. Other.
Traditionally, the equipment failure rate is computed by dividing the number of outages by the
equipment equivalent operating years considered in the studies. The failure and maintenance
rates computed in such a way are usually assumed to be constant throughout equipment life.
However, many characteristics could influence equipment failure or maintenance rates causing
variation in equipment failure rates with time and usage. These could include, for example,
equipment age, manufacturer and the maintenance depth and frequency. The LCM Solutions
software can accommodate other mathematical models of failure rate histories such as linear and
Weibull. In this example, all the failure rates will be treated as linearly changing with time; that
is, they will take the form
$$\lambda_i(k) = a_i + b_i \cdot k \tag{6.9}$$
where k represents a year and i is the failure mode. In particular, when $b_i = 0$ the failure rate is
independent of age but still can be a random variable.
Four failure types are considered:

1. A stator failure
2. A rotor failure
3. Excitation system failure
4. Other equipment failure
6.6.2.2 Random variables

Most of the parameters in the LCM studies are uncertain, including in particular: failure rates,
outage costs and outage durations. For the purpose of this example, only the failure rates will be
changed following each investment. It was assumed that each parameter follows a triangular
distribution with a given mode and given minimum and maximum values. The assumed values of the
parameters are summarized in Table 6-2 where failure rate λ, outage cost C, and outage duration
D, are given for each failure mode of the unit.

Table 6-2 Parameters of random variables - base case.
Failure type   Parameter    Value of a                               Value of b
                            Lower bound   Nominal   Upper bound
1              λ1 (1/y)     0.02          0.038     0.08             0
1              C1 (k$)      200           800       10,000           0
1              D1 (days)    5             30        90               0
2              λ2 (1/y)     0.01          0.03      0.1              0.0005
2              C2 (k$)      200           500       6,000            0
2              D2 (h)       15            20        50               0
3              λ3 (1/y)     0.05          0.076     0.15             0.0005
3              C3 (k$)      5             20        100              0
3              D3 (days)    0.5           2         10               0
4              λ4 (1/y)     0.05          0.076     0.15             0
4              C4 (k$)      10            10        100              0
4              D4 (days)    0.5           1         10               0
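The triangular assumption of Section 6.6.2.2 can be sketched with the Python standard library as follows. This is only a minimal illustration, not the sampling scheme of the LCM software: the dictionary layout, the function name and the fixed seed are assumptions, while the bounds are the stator values from Table 6-2.

```python
import random

# Base-case bounds for failure mode 1 (stator), taken from Table 6-2:
# (lower bound, nominal value used as the mode, upper bound).
STATOR = {
    "failure_rate_per_year": (0.02, 0.038, 0.08),
    "outage_cost_k_usd": (200, 800, 10_000),
    "outage_duration_days": (5, 30, 90),
}


def draw_scenario(bounds, rng):
    """Draw one value per parameter from its triangular distribution."""
    # random.triangular takes its arguments in the order (low, high, mode).
    return {name: rng.triangular(lo, hi, mode)
            for name, (lo, mode, hi) in bounds.items()}


rng = random.Random(2007)  # fixed seed so the run is repeatable
scenario = draw_scenario(STATOR, rng)
print(scenario)
```

Each Monte Carlo trial would draw one such scenario per failure mode before the costs are evaluated.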
6.6.3 The alternative plans

The following investment alternatives have been defined.
A. The base case. The assumption is that the current maintenance program is being
continued. Maintenance is carried out regularly, and after failures, repairs are performed;
the associated lost production costs are the dominant expenses in this case. Failures are
expected to occur with some regularity – this frequency is estimated from past
performance.
B. In this plan, the rotor is rewound in the future. The estimated cost of this investment is
$300,000.
C. This plan invests even more in maintenance. It includes the purchase of a spare rotor at the
cost of $4,000,000.
D. This investment postulates a purchase of a digital voltage regulator at the price of
$1,000,000.
E. In this plan, a new exciter is purchased for $4,000,000.
The investment plans are selected in the hope of reducing the expenditures necessitated by
failures, including the lost production costs caused by curtailed energy. The parameters may have
different limits following each investment. The effect of each investment is summarized in Table
6-3.

Table 6-3 Bounds for failure rates following each investment.
Investment   Failure type   Value of parameter a             Value of parameter b
                            Lower     Nominal   Upper        Nominal
B            2              0.005     0.02      0.02         0.00025
C            2              0.01      0.02      0.05         0.00020
D            3              0.025     0.035     0.04         0.00050
E            3              0.01      0.057     0.10         0.00025
Since investments B and C both concern the rotor, an additional constraint is added that once a
new rotor is purchased, no rewind is required. Similarly, for the excitation system, we will
assume that if a new exciter is purchased (investment E), there is no need for a new voltage
regulator (investment D).
We will assume that each of the above investments can take place at any time during the entire
study period. We will also assume that the plant is 20 years old and the study period extends from
the present moment to the end of the planned life, which is assumed to be 40 years; that is, 20
years from now.
6.6.4 Study results

Several cases will be presented in this section with progressive complexity.
6.6.4.1 Deterministic parameters

In this study it is assumed that the parameters take their most probable (nominal) values. The cost
of the base case alternative is 319,315.97 k$. This cost includes operating expenses and the cost
of forced outages.
A sequence of the optimal dates for investments B, C, D and E is [35, 119, 14, 131] with the total
cost of 222,184.90 k$ which is equal to 69.6% of the base case cost. The investment dates are
represented in months from the beginning of the study period.
Normally, the installation of the new equipment can take place only during a planned outage. If
we were to select the sequence of the optimal investment scenario given above, we would install a
new voltage regulator during the first outage (outages take place every 18 months) and we would
rewind the rotor during the second outage. On the other hand, the more expensive investments
would take place further in the future. A new rotor would be installed 10 years from now and a
new excitation system would be purchased 1 year later. In practice, both investments could take
place during the same outage.
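The mapping from an optimal date to a planned outage can be sketched as below. It assumes, for illustration only, that outages fall at months 18, 36, 54 and so on, and that an investment waits for the first outage at or after its optimal date; the source states only the 18-month interval.

```python
import math

OUTAGE_INTERVAL_MONTHS = 18  # planned outages every 18 months (from the text)


def next_outage(month):
    """Return (outage number, outage month) of the first planned outage at
    or after the given month, assuming outages at months 18, 36, 54, ..."""
    number = max(1, math.ceil(month / OUTAGE_INTERVAL_MONTHS))
    return number, number * OUTAGE_INTERVAL_MONTHS


# Optimal dates in months for investments B, C, D and E from the study.
optimal = {"B": 35, "C": 119, "D": 14, "E": 131}
for name, month in optimal.items():
    print(name, month, next_outage(month))
```

Under this assumption the voltage regulator (D) falls in the first outage and the rotor rewind (B) in the second, as described above.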
The results might be quite different if a different allowable period was selected for each
investment.
Additional studies were performed in which only investments B and C were considered. The
optimal solution is given by a vector [35, 120] with the total cost of 232,145.17 k$. We can
observe that the optimal investment dates are the same as before but the total cost is about 10,000
k$ larger. Figure 6.9 shows these results in a graphical form.

Figure 6.9 Optimal investment dates for scenarios B and C.

If, on the other hand, only the excitation system is considered (investments D and E), the optimal
sequence is [15, 132] with the total cost of 309,355.84 k$.
We can observe that, in this case, installing a new excitation system or purchasing a new voltage
regulator has a very small influence on the total life cycle cost of the generator. This can be
explained by the fact that the cost of failure associated with the excitation system is much smaller
than the cost resulting from a rotor or a stator failure.
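The relative savings of the three deterministic studies can be checked with a few lines of arithmetic. The sketch below only restates the totals quoted above; the function and variable names are illustrative.

```python
BASE_CASE_COST = 319_315.97  # k$, deterministic base case (plan A)

# Total costs in k$ of the optimized deterministic studies quoted above.
studies = {
    "B, C, D and E": 222_184.90,
    "B and C only": 232_145.17,
    "D and E only": 309_355.84,
}


def percent_of_base(cost, base=BASE_CASE_COST):
    """Express a total cost as a percentage of the base-case cost."""
    return 100.0 * cost / base


for name, cost in studies.items():
    share = percent_of_base(cost)
    saving = BASE_CASE_COST - cost
    print(f"{name}: {share:.1f}% of base case, saving {saving:,.2f} k$")
```

The rotor investments alone bring the cost down to about 72.7% of the base case, whereas the excitation system investments alone leave it at about 96.9%.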
6.6.4.2 Probabilistic analysis

In order to assess the effect of the uncertainty in the input parameters on the cost and the B/I
ratios, a Monte Carlo study was performed. The purpose of the Monte Carlo analysis is to
establish the probability distributions of the Total Cost and the Benefit to Investment ratios for the
optimal sequence of investments. This will allow us to answer the following questions:
• What is the chance that the Total Cost of the selected sequence will be greater/smaller than
a specified value?
• What is the probability that the selected sequence will be better than an alternative one?
Through experimentation, 30,000 simulations were selected for the Monte Carlo runs. From the
first round of simulations, 30,000 optimal investment dates were obtained. For each of these
dates, the second set of Monte Carlo simulations was performed to find the sequence with the
lowest expected NPV.
The base case scenario costs 691,650.2 k$ and the optimal sequence of investments is given by
the vector [32, 122, 0, 122] with the cost of 499,497.03 k$, which is 72.2% of the original cost.
We can observe that the costs in this study are more than double the values of the deterministic
case. This can be explained by the fact that triangular probability distributions of the input costs
are skewed to the right with the upper limit much further away from the most probable value than
the lower limit.
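The effect of the skew can be made concrete by comparing the mean of each triangular cost distribution with its nominal (most probable) value. The sketch below uses the standard formula for the mean of a triangular distribution and the outage costs of Table 6-2; it is an illustration of the argument, not a reproduction of the study's cost model.

```python
def triangular_mean(lower, mode, upper):
    """Mean of a triangular distribution: (lower + mode + upper) / 3."""
    return (lower + mode + upper) / 3.0


# Outage costs in k$ from Table 6-2: (lower bound, nominal, upper bound).
outage_costs = {
    "C1 stator": (200, 800, 10_000),
    "C2 rotor": (200, 500, 6_000),
    "C3 excitation": (5, 20, 100),
    "C4 other": (10, 10, 100),
}

for name, (lo, mode, hi) in outage_costs.items():
    mean = triangular_mean(lo, mode, hi)
    print(f"{name}: nominal {mode} k$, mean {mean:.1f} k$, ratio {mean / mode:.2f}")
```

For every failure mode the mean outage cost is several times the nominal value, which is why the expected costs of the probabilistic study exceed the deterministic ones.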
In the stochastic optimization study, the sequence [131, 26, 0, 115] gave a cost of 502,156.3 k$,
which is only slightly higher than the optimal one. However, in this sequence, we would
purchase a new rotor during the first outage and would not rewind at all, since the rewind falls
after the new rotor installation, which is not an allowed sequence of investments. The utility may
prefer this scenario to the optimal one. To analyze these two alternatives further, the
probability density functions of the BIR for both sequences are plotted in Figure 6.10.

[Figure 6.10: frequency chart of probability (vertical axis, 0 to 0.08) against BIR (horizontal axis, 0 to 80) for the optimal solution and the alternative solution.]
Figure 6.10 The frequency chart of the BIR values of two alternative sequences of optimal
investment dates. The graph has been truncated for negative values of the BIR.
Figure 6.10 gives additional valuable insight into the decision-making process. The
higher mean value is confirmed, since for the original optimal investment dates the density of the
BIR is somewhat skewed to the left, whereas the alternative scenario has a fairly high probability of
large values of BIR.
6.7 Conclusions
This chapter presents advanced computer programs to help the decision-makers in choosing the
best maintenance strategy from a selection of options. The intention was to develop tools that can
be easily used by asset management planners and field engineers. Their development was guided
by the needs of users at Hydro One Networks Inc. of Toronto. The tools are used to complement
other methods in the area employed in Hydro One and other utilities.
The successful application of the method employed in AMP and ARM hinges on the proper
representation of the equipment deterioration process under various maintenance policies. These
processes are graphically represented by life curves. The creation and application of such curves
were described in this report.
The most important features are the following.

• Probabilistic modeling of all variables entering the analysis.
• Intensive application of semi-Markov models and Monte Carlo simulation, allowing
application of several types of standard probability distributions.
• Calculation of the First Passage Times for analysis of the remaining life of the power
equipment undergoing maintenance.
• Calculation of the benefit-to-investment ratios for alternative investment/asset-sustainment
plans that extend over the whole time horizon.
• Implementation of simulated annealing and evolutionary algorithms for the selection of
the optimal investment intervals.
The methods encoded in the programs use sophisticated probability techniques. In the real world,
many parameters are really random variables; that is, their values are uncertain and can be
described only by probability distributions. These distributions can take on many shapes and,
once chosen in an application, can be best evaluated through either semi-Markov models or
Monte Carlo simulation techniques, as implemented in these programs.

6.8 References
[1] Endrenyi, J., Anders, G.J. and Leite da Silva, A.M., "Probabilistic Evaluation of the Effect of Maintenance
on Reliability - An Application", IEEE Trans. on Power Systems, Vol. 13, No. 2, May 1998, pp. 575-583.
[2] Anders, G., Endrenyi, J. and Yung, C., “Risk-Based Planner for Asset Management”, IEEE Computer
Applications in Power, Vol. 14, No. 4, pp. 20-26, October 2001.
[3] Ross S, Stochastic processes, John Wiley & Sons, N.Y., 1995.
[4] Anders G.J., Leite da Silva, A.M., “Cost Related Reliability Measure For Power System Equipment”, IEEE
Trans. On Power Systems, Vol. 15, No.2, May, 2000, pp. 654-660.
[5] Stopczyk, M., Sakowicz B., Anders G.J., “Application of a semi-Markov model and a simulated annealing
algorithm for the selection of an optimal maintenance policy for power equipment”, submitted to IEEE
Trans. on Power Systems.
[6] Sakowicz, B., Stopczyk M., Anders G.J., “Scheduling of Major Investments for a Steam Generating Unit
Using a Stochastic Model”, submitted to IEEE Trans. on Energy Conversion.
[7] Anders G.J. and Sakowicz B., “Life Cycle Management – distributed Web-based software development with
evolutionary programming and stochastic optimization”, PMAPS’2006 Int. Conference, Stockholm, June
2006.
[8] EPRI Life Cycle Management Planning Tool, LcmVALUE, Beta Version 0.2, June 2002.
[9] EPRI Technical Report 1000806, “Demonstration of Life Cycle Management Planning for Systems,
Structures and Components” With Pilot Applications at Oconee and Prairie Island Nuclear Stations, January
2001.
[10] EPRI Technical Report 1003058, “Life Cycle Management Planning Sourcebooks-Overview Report”,
December 2001.
[11] Electric Power Research Institute, Inc. (EPRI), “Demonstration of Life Cycle Management Planning for
Systems, Structures and Components – LcmVALUE User Manual and Tutorial Final Version 1.0”, Project
no.6118, July 2002.
[12] Goldberg, David E, Genetic Algorithms in Search, Optimization and Machine Learning, Kluwer Academic
Publishers, Boston, MA, 1989.
[13] Goldberg, David E, The Design of Innovation: Lessons from and for Competent Genetic Algorithms,
Addison-Wesley, Reading, MA, 2002.
[14] Schmitt, Lothar M, Theory of Genetic Algorithms II: models for genetic operators over the string-tensor
representation of populations and convergence to global optima for arbitrary fitness function under scaling,
Theoretical Computer Science (310), pp. 181-231, 2004.
6.9 Biography
George Anders received a Masters Degree in Electrical Engineering from Technical University
of Lodz in Poland in 1973, an M.Sc. Degree in Mathematics and Ph.D. Degree in Power System
Reliability from the University of Toronto in 1977 and 1980, respectively. He also received a
Doctor of Science degree from the Technical University of Lodz in Poland in 2000. Since 1975
he has been employed by Ontario Hydro, first as a System Design Engineer in Transmission
System Design Department and currently as a Principal Engineer/Scientist in the Electrical
Systems Technologies Department of Kinectrics Inc. which is a successor company of Ontario
Hydro Technologies. For several years, Dr. Anders has been teaching at the University of
Toronto and he is now an Adjunct Professor in the Department of Electrical and Computer
Engineering. He is author of over 160 technical papers and several books. Dr. Anders is a
registered Professional Engineer in the Province of Ontario and a Fellow of the IEEE.

G.J. Anders
IEEE Tutorial on Asset Management – Maintenance and Replacement Strategies, June 24-28, 2007, Tampa, USA
7 Risk Based Asset Management – Applications at Transmission Companies

Wenyuan Li, Fellow IEEE
British Columbia Transmission Corporation,
Vancouver, Canada
Abstract - This chapter of the tutorial discusses two actual applications of risk based asset management
approaches at British Columbia Transmission Corporation, Vancouver, Canada. The concepts and methods
presented are general and can be applied in any utility.
The first application is the risk evaluation based approach to the replacement strategy of aged HVDC
components. It includes estimation of unavailability of individual HVDC components due to repairable and
aging failures, calculations of capacity state probabilities of the HVDC system, quantified risk evaluation of
the power system containing the HVDC link and benefit/cost analysis for different replacement strategies. The
approach can also be used for other system components. The replacement strategy for an aged submarine
cable of the HVDC link in a power supply system at BCTC is analyzed as an example to demonstrate the
actual aspects. The procedure of the analysis is explained in detail in the example.
The second application is a probabilistic approach to determining the number of spare transformers for a
group of transformers and the timing requirement for each spare transformer to meet the specified reliability
criterion. The historical reliability performance metric designated as System Average Interruption Duration
Index (SAIDI) is used to establish the specified reliability criterion. The proposed method considers both
repairable and aging failures of transformers. The 138/25 kV 25 MVA transformer group in the BCTC system,
consisting of both fixed turn ratio and on-load tap changing transformers, is used for an illustration. The
detailed analysis in determining the number of spare transformers and their timing requirements during a
10-year planning period is presented.
7.1 Introduction
Asset management is associated with a variety of topics, including maintenance, replacement, aging
and retirement, life cycle assessment, equipment spare planning, risk management and reliability
evaluations, etc. Both traditional and risk based asset management methods have been addressed in
the past [1 - 16].
This chapter of the tutorial discusses two actual applications of risk based asset management
approaches at British Columbia Transmission Corporation (BCTC). The concepts and methods
presented are general and can be applied in any utility.
The first application is the risk evaluation based approach to the replacement strategy of aged HVDC
components. It includes estimation of unavailability of individual HVDC components due to
repairable and aging failures, calculations of capacity state probabilities of the HVDC system,
quantified risk evaluation of the power system containing the HVDC link and benefit/cost analysis
for different replacement strategies. The approach can also be applied to other system components.
The replacement strategy for an aged submarine cable of the HVDC link in a power supply system at
BCTC is analyzed as an example to demonstrate the actual aspects. The procedure of the analysis is
explained in detail in the example.

W. Li
The second application is a probabilistic approach to determining the number of spare transformers
for a group of transformers and the timing requirement for each spare transformer to meet the
specified reliability criterion. The historical reliability performance metric designated as System
Average Interruption Duration Index (SAIDI) is used to establish the specified reliability criterion.
The proposed method considers both repairable and aging failures of transformers. The 138/25 kV 25
MVA transformer group in the BCTC system, consisting of both fixed turn ratio and on-load tap
changing transformers, is used for an illustration. The detailed analysis in determining the number of
spare transformers and their timing requirements during a 10-year planning period is presented.
7.2 Replacement Strategy of Aged HVDC Components [17]

7.2.1 Problem description
HVDC links have been widely used in electric power systems across the world for many years. An
HVDC link is much more complex than a simple AC circuit since it consists not only of overhead
lines or underground/submarine cables but also of a variety of converter station equipment including
valves, converter transformers, smoothing reactors, filters and auxiliary protection and control
devices. It is actually a sub-system of multiple components.
Many HVDC systems in the world have been operated for 25 to 35 years or even longer and some
components have reached their end-of-life stage [18]. An important issue is the replacement strategy
for an aged HVDC component. Utilities have different practices for replacement, including:
• The aged component is continuously used until it dies. The problem with this policy is that for
major transmission system components (e.g. cables, transformers, reactors, etc.), it will take
more than one year to complete the whole replacement process including purchase,
transportation, installation and commissioning of a new component. The power system may be
exposed to severe risks of being unable to meet security criteria during the replacement period.
• The aged component is continuously used with close field monitoring. The process of
purchasing a new component for replacement starts when phenomena associated with fatal
failure are observed. Unfortunately, some components cannot be monitored in this way. For
example, it is extremely difficult to monitor a cable since sampling a section of cable cannot
represent the status of the whole cable. For a power transformer, although oil sampling can be
performed to partially monitor the status of its wear-out, the decision on replacement is still
difficult.
• The replacement is set at a given retirement age which is normally around the estimated mean
life of the component. Once a component reaches this age, the replacement is imposed. The
problem with this policy is the fact that any aged component may die before or after the
specified retirement age. If it dies before, it will result in a high system risk that is caused by
its absence from the system. If it can survive longer, its early retirement will result in a waste
of capital because of unnecessary earlier investment for replacement.
The questions utilities are facing for replacement strategy are:

• Should a piece of equipment be replaced?
• If yes, when should it be replaced: before or after it fails?

This section presents a risk evaluation based approach to answer these two questions for aged
components. The basic idea is to quantify the expected system risks and risk costs due to three
replacement options: replacing the aged component before it fails, replacing it after it fails and not
replacing it at all. The difference in the expected risk cost between the options can be compared with
the difference in the capital cost between them. Although the approach can be applied to any
equipment in power systems, the descriptions and example given here focus on an aged HVDC
component since evaluating the impact of an HVDC component on the system risk requires more effort
than for an AC component.
7.2.2 Methodology
7.2.2.1 Procedure of the approach
Conceptually, the value of a component in a power system depends on the variation of system risk
caused by its absence from the system. If the absence of a component creates very marginal
degradation in system reliability, the benefit of replacing it becomes minor. This situation may not
occur often since the majority of power system components are installed for a specific purpose that
contributes to the reliable delivery of power. However, when system configuration is changed or
system enhancement is performed, some equipment may become less important to the system.
Generally speaking, the impact of any equipment on the system risk is an extremely complex
function of system configuration, effects of other new equipment, load levels and failure probabilities
of all system components that vary from year to year. In other words, the decision to replace and the
choice of replacing before or after component failure will have different impacts on the total system
risk in a period. Therefore, quantified system risk evaluation is the key for selecting a replacement
strategy. Calculating failure probabilities due to aging failures is one of the crucial steps in the risk
assessment.
Considerable efforts have been devoted to risk evaluation of power systems in the past [19 - 22].
However, relatively little literature has discussed risk evaluation of power systems containing HVDC
links. It is difficult to directly evaluate the system risk of a power system containing HVDC
sub-systems using traditional methods. An HVDC system consists of multiple components and can be
operated at different capacity levels. The proposed method is to calculate a capacity probability
distribution of the HVDC system and incorporate it into the risk evaluation of the whole power
system as an equivalent component with multiple states. The presented approach includes the
following steps with a focus on the equivalent modeling of the HVDC system:
1. Estimating average unavailability of individual HVDC components including both repairable
and end-of-life failure modes
2. Calculating capacity levels and capacity probability distributions of the HVDC system for
three cases: with all existing components, with the replacement of a component whose
replacement strategy is investigated, and with the component out-of-service without
replacement
3. Evaluating the risks of the power system containing the HVDC system for the three cases in
Step 2
4. Performing the analysis for the replacement strategy of the component under consideration
It can be seen that Steps 1 and 2 serve to obtain an equivalent component of the HVDC system under
different replacement strategies. For a system without an HVDC link, the procedure is simpler. An AC
component generally can be represented using a two-state model (up and down), so only the
unavailabilities of the AC components need to be prepared. The three cases to be evaluated are the same for an AC
component under investigation for replacement strategy.
7.2.2.2 Estimating unavailability of system components

The unavailability of a system component due to repairable failure is defined as [1]:
$$U_r = \frac{f \cdot \mathrm{MTTR}}{8760} \qquad (7.1)$$
where f is the average failure frequency (failures/year) and MTTR is the mean time to repair
(hours/repair).
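Equation (7.1) can be evaluated directly, as in the short sketch below. The failure frequency and repair time are hypothetical values chosen only to show the order of magnitude; the function name is likewise illustrative.

```python
HOURS_PER_YEAR = 8760


def repairable_unavailability(failures_per_year, mttr_hours):
    """Equation (7.1): U_r = f * MTTR / 8760."""
    return failures_per_year * mttr_hours / HOURS_PER_YEAR


# Hypothetical component: 0.5 failures per year, 48 hours per repair.
print(repairable_unavailability(0.5, 48.0))
```

The result is a dimensionless fraction of the year during which the component is out for repair.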
For an aged system component, particularly the component under investigation for replacement, its
aging failure mode should be considered. The unavailability due to aging failures depends on the age
of a component and a subsequent period to consider. By denoting its age and the subsequent period
by T and t respectively and dividing the t into N equal intervals with an interval length D, the
unavailability due to its aging failure can be calculated by [1, 14-15].
$$U_a = \frac{1}{t} \sum_{i=1}^{N} P_i \cdot \left[ t - \frac{(2i-1)D}{2} \right] \qquad (7.2)$$
where
$$P_i = \frac{\int_{T}^{T+iD} f(x)\,dx - \int_{T}^{T+(i-1)D} f(x)\,dx}{\int_{T}^{\infty} f(x)\,dx} \qquad (i = 1, 2, \ldots, N) \qquad (7.3)$$
Here f(x) is the failure probability density function. The Weibull distribution is often used and in this
case, equation (7.3) becomes:
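$$P_i = \frac{\exp\left[-\left(\frac{T+(i-1)D}{\alpha}\right)^{\beta}\right] - \exp\left[-\left(\frac{T+iD}{\alpha}\right)^{\beta}\right]}{\exp\left[-\left(\frac{T}{\alpha}\right)^{\beta}\right]} \qquad (i = 1, 2, \ldots, N) \qquad (7.4)$$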
where α and β are the scale and shape parameters for the Weibull distribution, which can be
estimated using historical data [23].
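Equations (7.2) and (7.4) translate into a few lines of code. The sketch below is a minimal illustration under stated assumptions: all times are in years, and the age, horizon, number of intervals and Weibull parameters are hypothetical values rather than data from the study.

```python
import math


def weibull_survival(x, alpha, beta):
    """Weibull survival function exp[-(x / alpha)^beta]."""
    return math.exp(-((x / alpha) ** beta))


def interval_failure_prob(i, T, D, alpha, beta):
    """Equation (7.4): probability of an aging failure in interval i,
    given that the component has survived to age T."""
    s_start = weibull_survival(T + (i - 1) * D, alpha, beta)
    s_end = weibull_survival(T + i * D, alpha, beta)
    return (s_start - s_end) / weibull_survival(T, alpha, beta)


def aging_unavailability(T, t, N, alpha, beta):
    """Equation (7.2): U_a = (1/t) * sum_i P_i * [t - (2i - 1) * D / 2]."""
    D = t / N
    total = sum(
        interval_failure_prob(i, T, D, alpha, beta) * (t - (2 * i - 1) * D / 2)
        for i in range(1, N + 1)
    )
    return total / t


# Hypothetical component: age 37 years, 1-year horizon, 100 intervals,
# Weibull scale 40 years and shape 5 (illustrative values only).
print(aging_unavailability(T=37.0, t=1.0, N=100, alpha=40.0, beta=5.0))
```

The bracketed term in equation (7.2) is the time remaining in the period after the midpoint of interval i, so a failure early in the period contributes more unavailable time than a late one.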
The total unavailability of the two failure modes is obtained using a union concept:
$$U_t = U_r + U_a - U_r U_a \qquad (7.5)$$
The above equations are general and apply to both AC and HVDC components in the power
system.
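The union in equation (7.5) can be checked numerically. The sketch below assumes that the two failure modes are independent, which is what the product term implies; the input values are hypothetical.

```python
def total_unavailability(u_r, u_a):
    """Equation (7.5): U_t = U_r + U_a - U_r * U_a."""
    return u_r + u_a - u_r * u_a


# Hypothetical inputs: 0.3% repairable and 10% aging unavailability.
u_t = total_unavailability(0.003, 0.10)
print(u_t)
# Equivalent form: the component is available only if neither mode has it out.
print(1 - (1 - 0.003) * (1 - 0.10))
```

Both expressions give the same value, which is slightly less than the plain sum of the two unavailabilities.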
7.2.2.4 Calculating state capacity probability of HVDC system
The unavailability alone is sufficient to define a two-state model for AC components, whereas an
equivalent multiple capacity state model is needed for an HVDC pole with multiple components. The
HVDC pole has its full capacity when all HVDC components are available. The probability at the full
capacity is calculated as follows:
$$P_{full} = \prod_{i=1}^{K} (1 - U_i) \qquad (7.6)$$
where Ui is the unavailability of Component i and K the number of components in the HVDC pole.
A failure of some components leads to the derated state, which can be called the half-pole operation
mode. The probability of the derated capacity level is calculated as follows:
$$P_{dr} = \sum_{j=1}^{M} \frac{\prod_{n=1}^{N_j} U_n}{\prod_{n=1}^{N_j} (1 - U_n)} \, P_{full} \qquad (7.7)$$
where M is the number of the failure events that lead to the derated capacity level and Nj the number
of failed components in the jth failure event. In most cases, a failure event involves only one critical
component, so that Nj = 1.
The probability of the full HVDC pole being down (at the zero capacity) is:
$$P_{dw} = 1 - P_{full} - P_{dr} \qquad (7.8)$$
Multiple derated states can be modeled in a similar way if necessary [24].
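Equations (7.6) to (7.8) can be combined into one small routine. The sketch below is illustrative only: the component names, their unavailabilities and the choice of which failures lead to the derated state are assumptions, not data from the BCTC system.

```python
import math


def capacity_state_probabilities(unavailability, derating_events):
    """Return (P_full, P_dr, P_dw) for one HVDC pole.

    unavailability: dict mapping component name to its total unavailability U.
    derating_events: list of failure events, each a list of the component
        names whose simultaneous failure leaves the pole derated.
    """
    # Equation (7.6): all K components available.
    p_full = math.prod(1.0 - u for u in unavailability.values())

    # Equation (7.7): swap (1 - U_n) for U_n for the failed components.
    p_dr = 0.0
    for event in derating_events:
        failed = math.prod(unavailability[n] for n in event)
        available = math.prod(1.0 - unavailability[n] for n in event)
        p_dr += failed / available * p_full

    # Equation (7.8): the remaining probability is the pole-down state.
    p_dw = 1.0 - p_full - p_dr
    return p_full, p_dr, p_dw


# Hypothetical pole with four components; names and values are illustrative.
pole = {"cable_1": 0.10, "cable_2": 0.02, "valves": 0.01, "transformer": 0.005}
# Assume losing either cable alone leaves the pole at half capacity.
events = [["cable_1"], ["cable_2"]]
print(capacity_state_probabilities(pole, events))
```

Dividing by the product of (1 - U_n) in equation (7.7) removes the "available" factors of the failed components from P_full before multiplying by their unavailabilities.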
7.2.2.5 Evaluating risk of power system
The purpose is to evaluate the impact of different replacement strategies on the risk exposure of the
power system. Generally, it is necessary to evaluate the risk of the composite generation and
transmission system that contains the component for replacement. The procedure and details of
composite system risk evaluation can be found in References [1, 3]. However, in some cases, a
simplified risk evaluation model can be applied. For the replacement of an HVDC component, the
subsystem impacted by the replacement is the region that the HVDC link supplies power to. In this case, a
power source-demand system risk model is sufficient for comparison between different replacement
strategies. In the risk evaluation model, the HVDC poles and transmission
lines supplying the region, as well as local generators in the region, are all treated as power
sources while the total load with an annual load curve is the demand. The risk evaluation method for
such a model is summarized as follows:
1. A multiple level load model is created using chronological hourly load records during one
year. All the load levels are considered successively and the resulting indices for each load
level are weighted by their probability to obtain annual indices.
2. System states at each load level are selected using Monte Carlo simulation techniques. This
includes:
• The HVDC pole states are modeled using a multiple-state random variable (full up, down
and derated states)

• Generating unit states are modeled using multiple-state random variables or two-state
random variables (up and down states) depending on the generators.
• AC transmission equipment states are modeled using two-state random variables (up and
down states).
Take a three-state random variable for an HVDC component as an example. A uniformly
distributed random number Rj is drawn from [0, 1] for each power source component. The
state of the jth power source component is determined by
$$s_j = \begin{cases} 0 \ (\text{up}) & \text{if } R_j > (P_{dr})_j + (P_{dw})_j \\ 1 \ (\text{down}) & \text{if } (P_{dr})_j < R_j \le (P_{dr})_j + (P_{dw})_j \\ 2 \ (\text{derated}) & \text{if } 0 \le R_j \le (P_{dr})_j \end{cases} \qquad (7.9)$$
where Pdw and Pdr are the probabilities of the down and derated states, respectively. For a two-state random variable
representing an AC component, the sampling concept is the same but without the derated state.
3. The capacity of each power source component is determined according to its state so that the total
system power capacity can be obtained. For a given load level, the demand not supplied in the kth
sampling is calculated by
$$DNS_k = \max\left\{0,\; L_i - \sum_{j=1}^{m} G_{jk}(s_j)\right\} \qquad (7.10)$$
where Li is the load at the ith level, Gjk the available capacity of the jth power source in the kth
sampling and m the number of power sources supplying the subsystem considered.
If uncertainty of the load is considered, the load level Li is used as the mean with the
uncertainty represented by a standard deviation σi. A standard normal distribution random
number Xk is created using an approximate inverse transformation method [1, 3]. The sampled
value of the load in the kth sampling is given by
$$L_{\sigma i} = X_k \sigma_i + L_i \qquad (7.11)$$
The Lσi is used to replace Li in Equation (7.10) in order to capture the uncertainty of the load.
4. The EENS (Expected Energy Not Supplied), also termed LOEE (Loss of Energy Expectation), which reflects the system supply risk, is calculated by
$$EENS = \sum_{i=1}^{N_L} \left( \frac{T_i}{S_i} \sum_{k=1}^{S_i} DNS_k \right) \qquad (7.12)$$
where NL is the number of the load levels in the multiple step model of an annual load curve,
Ti the time length of the ith load level and Si the number of samples at the ith load level.
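The four steps above can be sketched as a short Monte Carlo routine. This is a simplified illustration, not the program used in the study: the source data, load levels and sample count are hypothetical, the data layout is an assumption, and the standard library's normal generator is used in place of the approximate inverse transformation method cited in the text.

```python
import random

UP, DOWN, DERATED = 0, 1, 2


def sample_state(p_dr, p_dw, r):
    """Equation (7.9): map a uniform random number r in [0, 1] to a state."""
    if r <= p_dr:
        return DERATED
    if r <= p_dr + p_dw:
        return DOWN
    return UP


def demand_not_supplied(load_mw, available_mw):
    """Equation (7.10): shortfall of total available capacity against the load."""
    return max(0.0, load_mw - sum(available_mw))


def estimate_eens(sources, load_levels, samples_per_level, rng):
    """Weight the mean DNS at each load level by the hours spent at that level.

    sources: list of dicts with keys p_dr, p_dw and capacity, where capacity
        maps each state to the available MW in that state.
    load_levels: list of (load_mw, sigma_mw, hours) tuples.
    """
    eens_mwh = 0.0
    for load_mw, sigma_mw, hours in load_levels:
        dns_sum = 0.0
        for _ in range(samples_per_level):
            available = [
                src["capacity"][sample_state(src["p_dr"], src["p_dw"], rng.random())]
                for src in sources
            ]
            # Equation (7.11): sampled load with normally distributed uncertainty.
            load = rng.gauss(0.0, 1.0) * sigma_mw + load_mw
            dns_sum += demand_not_supplied(load, available)
        eens_mwh += hours * dns_sum / samples_per_level
    return eens_mwh


# Hypothetical data: one three-state HVDC pole and one two-state AC line.
# The AC line has p_dr = 0, so its DERATED entry is set equal to its UP capacity.
sources = [
    {"p_dr": 0.10, "p_dw": 0.02, "capacity": {UP: 312.0, DERATED: 156.0, DOWN: 0.0}},
    {"p_dr": 0.0, "p_dw": 0.01, "capacity": {UP: 1200.0, DERATED: 1200.0, DOWN: 0.0}},
]
# Two load levels: (load in MW, standard deviation in MW, hours per year).
load_levels = [(1400.0, 30.0, 760.0), (900.0, 20.0, 8000.0)]
print(estimate_eens(sources, load_levels, 20_000, random.Random(42)))  # MWh/year
```

Running the routine once per replacement option, with the capacity state probabilities of that option, gives the risk figures that enter the benefit/cost comparison.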
7.2.2.6 Benefit/cost Analysis in Comparison between Replacement Strategies
Different replacement strategies – replacing before the component fails, or replacing after it fails, or
not replacing at all – have different system risks and costs. Therefore they can be compared using a
benefit/cost analysis approach. The analysis may vary slightly depending on the case. The details of the
benefit/cost analysis are illustrated using an actual example in the following subsection.

7.2.3 Actual Example

7.2.3.1 Case description
The Vancouver Island region in the BCTC system is supplied through two 500 kV lines, a bipolar
HVDC link and several local generators. The schematic diagram of the island supply system is
shown in Figure 7.1. The HVDC link is an aged system with Pole 1 in service for 37 years and
Pole 2 for 30 years. The schematic diagram of the HVDC system is shown in Figure 7.2.
According to system planning studies, a new 230 kV AC line will be added to the power supply
system in 2008 to replace the aged HVDC system. On the other hand, the existing HVDC system
must be available at least until the new 230 kV AC line is in service. A recent field inspection (in
2005) found that cable 1 of HVDC Pole 1 has some armor damage with three broken wire
strands [25]. Cable experts estimated that the damaged section (5 km) of cable 1 has a very
high probability of fatal failure within a couple of years. The questions the utility faces are:
Should the damaged section be replaced? If yes, should it be replaced before or after it fails?
7.2.3.2 Study Conditions
The main study conditions include:
• A new 230 kV AC line is expected to be in service in 2008. The HVDC system has a much
smaller effect on the reliability of the island supply system after the 230 kV line is in service
than before.
• The HVDC system is an old system. Once the 230 kV line is in service, the HVDC system will
be kept for a transition period and possibly retired around 2010 when the cost of maintenance
and repairs exceeds the benefit. The time frame in the study is the 5 years from 2006 to 2010.
• The replacement of the damaged cable section will take about one year because marine work
can only be performed in fair weather. Preparation for replacement also takes a long time to
complete.

[Figure: diagram labels include two 1200 MW 500 kV lines, HVDC Pole 1 (312 MW/156 MW), HVDC Pole 2
(476 MW/238 MW), local generators, and the future 600 MW 230 kV AC line.]
Figure 7.1 Schematic diagram of Vancouver Island supply system.
• The peak loads in the island region from 2006 to 2010 are based on the recent load
forecast. It has been assumed the annual load curves for all the 5 years follow the same
shape that is based on the hourly load records in 2005.
• Both Poles 1 and 2 of the HVDC system were modeled using three capacity states (full
up, derated to half and full down). If the cable 1 of Pole 1 has the end-of-life failure with
no replacement, the maximum capacity of HVDC Pole 1 will be derated to 156 MW from
312 MW whereas the maximum capacity of Pole 2 will be derated to 336 MW from 476
MW according to the HVDC configuration.
• Both repairable and aging failure modes of all the components in the HVDC system are
modeled whereas only repairable failure modes for AC transmission components and
local generators are considered. The repairable failure data are obtained from historical
records.
7.2.3.3 Capacity state probabilities of HVDC system

The capacity state probabilities of the existing HVDC system (Poles 1 and 2) and the HVDC
system with the replacement of damaged cable section or without replacement are evaluated
using the methods given in Section 7.2.2.2 and 7.2.2.3. The results are shown in Table 7-1 to
Table 7-6 respectively. The following observations can be made:

[Figure: diagram labels include Pole 1, Pole 2, transformers, valves, reactors, filters, submarine
cables 1 to 4 and a submarine return cable.]
Figure 7.2 Schematic diagram of HVDC system.
• Pole 1 has an extremely high failure probability because its age has greatly exceeded its mean life.
The failure probability of Pole 2 is also high because the age of its major components is close to the
mean life.
• Replacing the damaged cable section slightly increases the probabilities of both poles operating at
their maximum capacity levels. The increase is very small, however, because only one 5 km section is
replaced while the remaining portion (27.5 km) is still aged cable, and because the impact of the
cable 1 on the capacity probability distribution of the whole HVDC system is minimal.
• The probabilities of HVDC Poles 1 and 2 operating at the maximum capacity without the cable 1 are
slightly higher than those with the cable 1, which results in slightly lower probabilities at the zero
and/or derated capacity levels for the case without the cable 1. This is because all the cables are
required to reach the maximum capacity; in other words, all the cables are logically in series in the
reliability model. A basic concept in reliability evaluation is that removing a component from a
series logical model leads to a higher success probability (at the maximum capacity) or a lower
failure probability. In this example, the main impact of the cable 1 being out-of-service is the
reduced capacities of both Poles 1 and 2, not a change in the capacity state probabilities.
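The series-logic argument in the last observation can be checked numerically. The following is only a minimal sketch and not the model used in the study: the component availabilities are hypothetical values chosen for illustration, and the function name is an assumption.

```python
from math import prod


def series_success_probability(availabilities):
    """Probability that every component of a series logic model is up."""
    return prod(availabilities)


# Hypothetical component availabilities (NOT taken from the study).
# The last entry stands for the cable 1.
with_cable_1 = [0.95, 0.90, 0.85, 0.80]
without_cable_1 = with_cable_1[:-1]  # cable 1 removed from the series model

p_with = series_success_probability(with_cable_1)
p_without = series_success_probability(without_cable_1)

print(f"success probability with cable 1:    {p_with:.4f}")
print(f"success probability without cable 1: {p_without:.4f}")
```

Because every availability is at most 1.0, dropping a factor from the product can never lower the success probability. The maximum capacity that "success" refers to is lower without the cable 1, which is why the EENS still increases in that case.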
Table 7-1 Capacity state probabilities of Pole 1 for the existing HVDC system.
Year  at 312 MW    at 156 MW    at zero MW
2006  0.106243735  0.152434503  0.741321762
2007  0.075725132  0.124754433  0.799520435
2008  0.051009050  0.097306577  0.851684374
2009  0.032753449  0.072326656  0.894919895
2010  0.019887959  0.050931581  0.929180460

Table 7-2 Capacity state probabilities of Pole 2 for the existing HVDC system.
Year  Full capacity  Derated capacity  Zero capacity
2006  0.554333069    0.216997424       0.228669507
2007  0.512838492    0.217244321       0.269917187
2008  0.463541606    0.218515517       0.317942876
2009  0.413689862    0.216221708       0.370088431
2010  0.362198344    0.211159543       0.426642113

Table 7-3 Capacity state probabilities of Pole 1 for the cable 1 replaced.
Year  Full capacity  Derated capacity  Zero capacity
2006  0.106944494    0.152709123       0.740346383
2007  0.076228654    0.125058682       0.798712664
2008  0.051351387    0.097602359       0.851046254
2009  0.032975628    0.072585300       0.894439072
2010  0.020024502    0.051138621       0.928836877

Table 7-4 Capacity state probabilities of Pole 2 for the cable 1 replaced.
Year  Full capacity  Derated capacity  Zero capacity
2006  0.557989321    0.214615684       0.227394995
2007  0.516248523    0.215131435       0.268620042
2008  0.466652574    0.216735362       0.316612064
2009  0.416496079    0.214758473       0.368745447
2010  0.364685055    0.210011597       0.425303347

Table 7-5 Capacity state probabilities of Pole 1 for the cable 1 out-of-service.
Year  Full capacity  Derated capacity  Zero capacity
2006  0.122508347    0.147066434       0.730425219
2007  0.087353876    0.123346221       0.789299902
2008  0.059378386    0.098648047       0.841973567
2009  0.038348715    0.074954605       0.886696679
2010  0.023438967    0.053882509       0.922678524

Table 7-6 Capacity state probabilities of Pole 2 for the cable 1 out-of-service.
Year  Full capacity  Derated capacity  Zero capacity
2006  0.578098707    0.201516114       0.22038518
2007  0.535003695    0.203510560       0.261485745
2008  0.483762895    0.206944509       0.309292596
2009  0.431930277    0.206710684       0.361359039
2010  0.378361967    0.203697894       0.417940139
7.2.3.4

7.2.3.5 Risk evaluation of the power supply system
The risk of the power system supplying the Vancouver Island region was evaluated for three cases:
with the existing cable 1, with the cable 1 replaced, and with the cable 1 out-of-service. The EENS
(Expected Energy Not Supplied) index is used as the indicator of system risk. The EENS indices for
the three cases from 2006 to 2010 are shown in Table 7-7. The EENS indices for using the existing
damaged cable 1 and for replacing the damaged section of the cable 1 are almost the same because the
two cases have the same state capacities and only very minor differences in their capacity
probability distributions. The EENS indices for the case with the cable 1 out-of-service are higher
than those of the other two cases. Note that the EENS indices drop starting in 2008 because the
230 kV AC line is expected to be in service from that year.
Table 7-7 EENS for VI supply system (MWh/year).
Year  With existing Cable 1  With replaced Cable 1  Without Cable 1
2006  4850                   4843                   6097
2007  5655                   5642                   6881
2008  1140                   1138                   1406
2009  1271                   1268                   1504
2010  1542                   1541                   1755
7.2.3.6 Replacement strategy analysis
Using the results in Table 7-7, a replacement strategy analysis for the cable 1 can be performed. The
following three options are considered for comparison:
1. Replacing the damaged section of the cable 1 in 2006 before it fails.
2. Replacing the damaged section of the cable 1 after it fails.
3. Not replacing the damaged section of the cable 1 (using it until it fails and operating the
HVDC system without it after its failure).
As mentioned earlier, the replacement duration is assumed to be one year and the period of the five
years from 2006 to 2010 is considered in the analysis.
1. If the cable 1 is replaced in 2006 before it fails, the HVDC system will be operated without the
cable 1 for replacement in 2006 and with it (after replacement) from 2007 to 2010. The total
EENS for the period of the 5 years is: 6097+5642+1138+1268+1541 = 15,686 MWh.
2. If the cable 1 is replaced after it fails, there are different possibilities since it can fail in
any year from 2006 to 2010. If it fails in 2006 and is replaced right away, the total EENS for the
5-year period is the same as that for Option (1). If it fails in a later year and replacement
starts right after the failure, the HVDC system will be operated without the cable 1 in that year,
with the existing cable 1 in the years before it and with the replaced cable 1 in the years after
it. For example, if it fails in 2007, the total EENS for the 5-year period is:
4850 + 6881 + 1138 + 1268 + 1541 = 15,678 MWh. The total EENS indices for replacement after
failure over the 5-year period for the different failure years are summarized in Table 7-8.
3. If the cable 1 is never replaced, the Vancouver Island supply risk also depends on the year in
which it fails. The later it fails, the lower the risk. For example, if it fails in 2008, the total
EENS for the 5-year period is: 4850 + 5655 + 1406 + 1504 + 1755 = 15,170 MWh. The total EENS
indices for not replacing the cable 1 after its failure over the 5-year period for the different
failure years are also summarized in Table 7-8. Note that if the cable 1 fails in early 2010, the
total EENS without replacement over the 5-year period is the same as that with replacement because
the replacement is assumed to take one year and the HVDC system will therefore still be operated
without the cable 1 during replacement. Performing the replacement in 2010 would only benefit the
island reliability after 2010, and that benefit would be minimal: as mentioned earlier, according
to the previous planning studies, once the 230 kV line is in service, the HVDC system will be kept
for just a few years before its complete retirement.
Comparing the EENS indices of Options 1 and 2 shows that replacing the cable 1 after its failure
results in an equal or lower risk than replacing it in 2006 before it fails, and the later the
failure occurs, the lower the risk. Between Options 2 and 3, the reduced risk due to replacing the
cable 1 should be compared against the cost required to replace it. The reductions of EENS and risk
cost due to replacing the cable 1 for different failure years are given in Table 7-9. The reduced
risk cost is the product of the reduced EENS and the unit interruption cost. The unit interruption
cost is obtained by dividing the Provincial Gross Domestic Product by the electric energy
consumption of the province where the utility is located, and is $CAN3.07/kWh.
The cost of replacing the damaged section (5 km) of the cable 1 is estimated to be $8 million. The
reduction of risk cost due to replacement is the benefit, and the benefit/cost ratios for different
failure years are listed in Table 7-10. The benefit/cost ratio is less than 1.0 for any year in
which the cable 1 may fail. This indicates that not replacing the cable 1 is more cost effective
than replacing it.
Table 7-8 Total EENS (MWh) in the 5-year period for Options 2 and 3.
Failure year of Cable 1  Option 2  Option 3
In 2006                  15,686    17,643
In 2007                  15,678    16,396
In 2008                  14,720    15,170
In 2009                  14,690    14,904
In 2010                  14,671    14,671

Table 7-9 Reduction of EENS (MWh) and risk cost (M$) due to replacing the cable 1.
Failure year of Cable 1  Reduction of EENS (MWh)  Reduction of risk cost (M$)
In 2006                  1,957                    6.008
In 2007                  718                      2.204
In 2008                  450                      1.382
In 2009                  214                      0.657
In 2010                  0                        0.000
Table 7-10 Benefit/cost ratios for replacement of the cable 1.
Failure year of Cable 1  Benefit/cost ratio
In 2006                  0.751
In 2007                  0.276
In 2008                  0.173
In 2009                  0.082
In 2010                  0
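The arithmetic behind Tables 7-8 to 7-10 can be reproduced from the EENS values in Table 7-7. The sketch below is illustrative only: the function and variable names are assumptions, while the EENS values, the unit interruption cost ($CAN3.07/kWh) and the replacement cost ($8 million) are taken from the text.

```python
YEARS = [2006, 2007, 2008, 2009, 2010]

# EENS in MWh/year, copied from Table 7-7.
EENS_EXISTING = {2006: 4850, 2007: 5655, 2008: 1140, 2009: 1271, 2010: 1542}
EENS_REPLACED = {2006: 4843, 2007: 5642, 2008: 1138, 2009: 1268, 2010: 1541}
EENS_WITHOUT = {2006: 6097, 2007: 6881, 2008: 1406, 2009: 1504, 2010: 1755}

UNIT_INTERRUPTION_COST = 3.07  # $CAN per kWh
REPLACEMENT_COST_M = 8.0       # million $


def total_eens_option2(failure_year):
    """Option 2: existing cable before the failure year, no cable during the
    one-year replacement, replaced cable in the years after it."""
    total = 0
    for year in YEARS:
        if year < failure_year:
            total += EENS_EXISTING[year]
        elif year == failure_year:
            total += EENS_WITHOUT[year]
        else:
            total += EENS_REPLACED[year]
    return total


def total_eens_option3(failure_year):
    """Option 3: existing cable before the failure year, no cable from then on."""
    return sum(
        EENS_EXISTING[year] if year < failure_year else EENS_WITHOUT[year]
        for year in YEARS
    )


def replacement_benefit(failure_year):
    """Return (EENS reduction in MWh, risk cost reduction in M$, benefit/cost ratio)."""
    reduction_mwh = total_eens_option3(failure_year) - total_eens_option2(failure_year)
    # MWh -> kWh (x1000), dollars -> million dollars (/1e6).
    risk_cost_m = reduction_mwh * 1000 * UNIT_INTERRUPTION_COST / 1e6
    return reduction_mwh, risk_cost_m, risk_cost_m / REPLACEMENT_COST_M


for failure_year in YEARS:
    reduction, risk_cost, ratio = replacement_benefit(failure_year)
    print(
        f"{failure_year}: option 2 = {total_eens_option2(failure_year)} MWh, "
        f"option 3 = {total_eens_option3(failure_year)} MWh, "
        f"reduction = {reduction} MWh, risk cost = {risk_cost:.3f} M$, "
        f"B/C = {ratio:.3f}"
    )
```

Option 1 (replacement in 2006 before failure) has the same 5-year total as Option 2 with a failure in 2006, so it needs no separate function.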
7.2.4 Summary
In this application, a risk evaluation based approach to the replacement strategy of aged HVDC
components has been presented. The approach includes the following four steps:
1. Estimating average unavailability of individual HVDC components
2. Calculating capacity probability distributions of the HVDC subsystem for different
replacement strategies
3. Assessing risks of the power system containing the HVDC subsystem
4. Performing a probabilistic benefit/cost analysis for different replacement strategies
Conceptually, the approach is not limited to the replacement strategy of aged HVDC components but
can be applied to the replacement of any other system component.
The replacement strategy for an aged submarine cable of the HVDC link in a power supply system at
British Columbia Transmission Corporation has been analyzed as an example to demonstrate the
actual application of the presented approach. The procedure of the analysis has been explained in
detail through the example. The results show that not replacing the damaged cable is the most cost
effective option in this particular case.

7.3 Determination of the Number and Timing of Spare Transformers [1, 26]
7.3.1 Problem description
A sophisticated spare analysis is a challenge in asset management. So far, most utilities have used
a deterministic method in this area, which is based essentially on engineering judgment.
There are several drivers for the need for spares. First, a repairable failure of power equipment
such as a transformer, reactor, capacitor or generator may often require a relatively long repair
time. If a system lacks adequate equipment because no spares are available, it may experience an
extensive loss of energy supply and a financial loss of revenue. Second, equipment aging has been a
major concern for utilities for years. Aged equipment implies a higher failure probability and thus
a greater need for spares. In addition, the policy of common spares shared by an equipment group is
becoming popular in the competitive environment of the power industry. Traditionally, for example,
the N-1 security principle has been widely used for substation transformers. Each substation is
often designed to have two or more transformers in parallel so that the peak load can still be
carried when one of the transformers fails. This is a secure but very expensive criterion. Compared
to applying the N-1 security principle in each substation, the common spare transformer strategy
can avoid considerable capital expenditure and still assure a sufficient reliability level.
The following are two basic questions in the spare analysis:
1. How many spares are needed and when should each of them be in place in order to maintain
the system reliability?
2. How can spares be financially justified?
Generally, there are two risk-evaluation based methods for the spare analysis. The first one is
based on reliability criteria and the second one is based on probabilistic risk cost models. More
details on the two methods can be found in Reference 3. In this section, only the reliability
criterion method is discussed, and an example of a transformer group is used to demonstrate the
application of the method.
7.3.2 Methodology
7.3.2.1 Procedure of the method
Spares are considered for an equipment group. Each component in the group has its failure
probability or unavailability and when it fails, a spare must be put in service to assure normal
operation of the system. Therefore how many spares are needed depends on the requirement for
group reliability. With the unavailability of individual components, a Monte Carlo simulation or state
enumeration technique can be used to conduct evaluations of group failure probability with and
without spares. The spare analysis for an equipment group includes the following steps:
1. Calculating unavailability of components in the group
2. Evaluating individual failure event probabilities and the total group failure probability
3. Performing spare analysis based on a specified reliability criterion
4. Repeating Steps 1 to 3 for all years in consideration
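As an illustration of Step 2, the group reliability with a given number of spares can be estimated by Monte Carlo simulation. This is only a sketch under stated assumptions: component failures are taken to be independent, the group is counted as successful whenever the number of failed components does not exceed the number of spares, and the unavailability values and function name are hypothetical.

```python
import random


def monte_carlo_group_reliability(unavailabilities, spares, samples=100_000, seed=1):
    """Estimate P(number of failed components <= number of spares)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    successes = 0
    for _ in range(samples):
        # A component is sampled as "down" with probability equal to its unavailability.
        failed = sum(rng.random() < u for u in unavailabilities)
        if failed <= spares:
            successes += 1
    return successes / samples


# Hypothetical unavailability values for a three-component group.
U = [0.05, 0.04, 0.03]

for spares in range(len(U) + 1):
    estimate = monte_carlo_group_reliability(U, spares)
    print(f"{spares} spare(s): estimated group reliability = {estimate:.4f}")
```

For a small group the same quantity can be obtained exactly by state enumeration; sampling becomes attractive when the number of components makes full enumeration impractical.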

7.3.2.2 Unavailability of components
There are two failure modes for power system equipment: repairable and aging failures. In many risk
evaluations of power systems, only the unavailability due to repairable failures is considered.
However, a model for unavailability due to aging failures must be taken into consideration in the
spare analysis, as in the replacement strategy analysis given in Section 7.2, since aging failure
is one of the reasons why spares are needed, particularly for an aged equipment group.
The unavailability values of components due to both repairable and aging failures are calculated using
the same equations as given in Section 7.2.2.2, i.e., Equations (7.1) – (7.5). It should be noted that the
input data (scale and shape parameters of the Weibull distribution) for unavailability estimation of
different components (cable or transformers) are different and based on respective historical statistics.
7.3.2.3 Group reliability and spare analysis

As mentioned above, the evaluation of group reliability can be conducted using a Monte Carlo or
state enumeration technique. The procedure using the state enumeration method is given to explain
the concept. Consider a three-component group. It is assumed that the unavailability values of the
three components have been calculated and they are U1, U2 and U3. An event probability table is
built as shown in Table 7-11.
Table 7-11 Event probability.
Event No.  Event               Event probability
1          1 down, 2 & 3 up    U1·(1-U2)·(1-U3)
2          2 down, 1 & 3 up    U2·(1-U1)·(1-U3)
3          3 down, 1 & 2 up    U3·(1-U1)·(1-U2)
4          1 & 2 down, 3 up    U1·U2·(1-U3)
5          1 & 3 down, 2 up    U1·U3·(1-U2)
6          2 & 3 down, 1 up    U2·U3·(1-U1)
7          all 1, 2 & 3 down   U1·U2·U3
8          all 1, 2 & 3 up     (1-U1)·(1-U2)·(1-U3)
Cumulative failure probabilities for each failure level can be calculated from the table.
Probability for any one failure:
P(a) = U1·(1-U2)·(1-U3) + U2·(1-U1)·(1-U3) + U3·(1-U1)·(1-U2)
Probability for any two failures:
P(b) = U1·U2·(1-U3) + U1·U3·(1-U2) + U2·U3·(1-U1)
Probability for all three components failing:
P(c) = U1·U2·U3
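The state enumeration above generalizes directly to a group of any size. The following sketch enumerates all up/down states and accumulates the probability of exactly k components being down. The unavailability values are hypothetical, the function name is an assumption, and independence of component failures is assumed, as in Table 7-11.

```python
from itertools import product


def failure_count_probabilities(unavailabilities):
    """Return a list p where p[k] = P(exactly k components are down)."""
    n = len(unavailabilities)
    probs = [0.0] * (n + 1)
    # Each state is a tuple of 0/1 flags, where 1 means the component is down.
    for state in product((0, 1), repeat=n):
        p = 1.0
        for down, u in zip(state, unavailabilities):
            p *= u if down else (1.0 - u)
        probs[sum(state)] += p
    return probs


# Hypothetical unavailability values for the three-component group.
U1, U2, U3 = 0.05, 0.04, 0.03
p_none, p_a, p_b, p_c = failure_count_probabilities([U1, U2, U3])

print(f"P(no failure) = {p_none:.6f}")
print(f"P(a), any one failure    = {p_a:.6f}")
print(f"P(b), any two failures   = {p_b:.6f}")
print(f"P(c), all three failures = {p_c:.6f}")
```

The number of states grows as 2^n, so for large groups a Monte Carlo estimate or a recursive convolution over components is usually preferred to explicit enumeration.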

Given a system failure criterion, the spare analysis can be conducted. For instance, if the system
failure criterion for this example is that any failure of one or more components results in a group
failure, the spare analysis is shown in Table 7-12. Note that the reliability values in the column of
“Example value” are arbitrarily given here just for the purpose of explanation. If an acceptable group
reliability level is specified, the number of spares can be determined. For instance, if the acceptable
group reliability level is 0.9, the first spare is needed. If the acceptable level is selected as 0.98, the
second one is also needed.
Table 7-12 Spare analysis based on a group reliability criterion.
Spare   Group reliability      Example value  Spare contribution
Zero    1.0-[P(a)+P(b)+P(c)]   0.85
First   1.0-[P(b)+P(c)]        0.95           0.10
Second  1.0-P(c)               0.99           0.04
Third   1.0                    1.00           0.01
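The selection rule illustrated by Table 7-12 can be written as a short routine. This is a sketch rather than the SPARE program mentioned later in the chapter: the function names are assumptions, and the probabilities are the arbitrary example values implied by Table 7-12 (P(no failure) = 0.85, P(a) = 0.10, P(b) = 0.04, P(c) = 0.01).

```python
def group_reliability_by_spares(failure_count_probs):
    """Given p[k] = P(exactly k components down), return r where
    r[s] = group reliability with s spares = P(failures <= s)."""
    reliabilities = []
    cumulative = 0.0
    for p in failure_count_probs:
        cumulative += p
        reliabilities.append(cumulative)
    return reliabilities


def spares_needed(reliabilities, acceptable_level):
    """Smallest number of spares whose group reliability meets the level,
    or None if no number of spares in the list is sufficient."""
    for spares, reliability in enumerate(reliabilities):
        if reliability >= acceptable_level:
            return spares
    return None


# Example values consistent with Table 7-12 (arbitrary, for explanation only).
example_probs = [0.85, 0.10, 0.04, 0.01]
reliabilities = group_reliability_by_spares(example_probs)

for level in (0.90, 0.98):
    print(f"acceptable level {level}: {spares_needed(reliabilities, level)} spare(s)")
```

With these values, a level of 0.9 requires the first spare and a level of 0.98 requires the second one as well, in agreement with the text.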
7.3.2.4 Reliability criterion
The historical reliability performance metric designated as System Average Interruption Duration
Index (SAIDI) has been utilized in BCTC for setting the company performance target [27]. The
SAIDI of 2.1 hours/year/delivery point is used as a specified reliability criterion in the actual example
given in the next section. Conceptually, the SAIDI can be converted to an unavailability target for a
group of substations as shown in the following example:
Assume that 35 substations (delivery points) are considered as a group in the study. Therefore,
Total average interruption duration target for the group is:
SAIDI × (number of delivery points) = 2.1×35 = 73.5 hrs/year
Therefore, the unavailability target = 73.5/8760 = 0.0084
The availability target = 1 – 0.0084 = 0.9916 (or 99.16%)
The above example indicates that the availability of 0.9916 is required as a specified reliability
criterion for this substation group in order to maintain the company performance target in SAIDI of
2.1 hours/year/delivery point.
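The SAIDI conversion above is simple enough to express as a small helper. The function name is an assumption; the SAIDI value of 2.1 hours/year/delivery point and the 35 delivery points come from the text, while 29 and 12 are the delivery-point counts of the two other groups in the actual example that follows.

```python
HOURS_PER_YEAR = 8760


def availability_target(saidi_hours, delivery_points):
    """Convert a SAIDI target (hours/year/delivery point) for a group of
    delivery points into a group availability target."""
    total_interruption_hours = saidi_hours * delivery_points
    unavailability = total_interruption_hours / HOURS_PER_YEAR
    return 1.0 - unavailability


for points in (35, 29, 12):
    target = availability_target(2.1, points)
    print(f"{points} delivery points: availability target = {target:.4f}")
```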
It should be noted that converting the SAIDI target into availability is not a unique approach to set the
reliability criterion. Other approaches can be used depending on different cases or utility’s
requirements [1].

7.3.3 Actual example [28]

7.3.3.1 Case description
The 138/25 kV transformers, which have capacities of 10-30 MVA, are considered as a transformer
group that is backed up by 138/25 kV 25 MVA spare transformers. Three study scenarios are
presented in this example. The first one focuses on the fixed turn ratio transformer group, which
consists of 34 transformers located in 29 substations. The second one focuses on the on-load tap
changing (LTC) transformer group, which consists of 16 transformers located in 12 substations. The
third one combines both fixed turn ratio and LTC transformers altogether, which consists of 50
transformers located in 35 substations. The planning period for the transformer group is 10 years
from 2006 to 2015. The Weibull distribution model for the aging failure of transformers has an
estimated mean life of 57.1 years with a standard deviation of 14.5 years. These two parameters were
obtained from historical records for the same type of transformers at BCTC. The reliability criterion
for each scenario and the results are presented in the following.
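A Weibull model is usually parameterized by a scale and a shape parameter rather than by a mean and a standard deviation. As a rough sketch, assuming the standard two-parameter Weibull distribution and a simple moment-matching fit (the estimation procedure actually used for the BCTC data is not described here and may differ), the two parameters can be recovered from the mean life of 57.1 years and the standard deviation of 14.5 years as follows. The function names are assumptions.

```python
from math import gamma, sqrt


def weibull_mean_std(scale, shape):
    """Mean and standard deviation of a two-parameter Weibull distribution."""
    g1 = gamma(1.0 + 1.0 / shape)
    g2 = gamma(1.0 + 2.0 / shape)
    return scale * g1, scale * sqrt(g2 - g1 * g1)


def weibull_from_mean_std(mean, std, lo=0.5, hi=50.0, tol=1e-10):
    """Find (scale, shape) by bisection on the shape parameter.

    The coefficient of variation std/mean depends only on the shape and
    decreases as the shape increases, so bisection is sufficient."""
    target_cv = std / mean

    def coefficient_of_variation(shape):
        m, s = weibull_mean_std(1.0, shape)
        return s / m

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if coefficient_of_variation(mid) > target_cv:
            lo = mid  # spread too wide: the shape must be larger
        else:
            hi = mid
    shape = 0.5 * (lo + hi)
    scale = mean / gamma(1.0 + 1.0 / shape)
    return scale, shape


scale, shape = weibull_from_mean_std(57.1, 14.5)
print(f"scale = {scale:.2f} years, shape = {shape:.3f}")
```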
7.3.3.2 Fixed turn ratio transformer group

Total average interruption duration target is:
SAIDI×( number of delivery points) = 2.1×29 = 60.9 hours/year
The unavailability target = 60.9/8760 = 0.007
The availability of 0.993 is used as the specified reliability criterion for the 34 fixed turn ratio
transformers located in 29 substations. The transformer group reliability must be equal to or above
this specified reliability level at all times during the planning period (2006 – 2015). The SPARE
program, which was designed for the spare analysis, was used. The results obtained are shown in
Table 7-13 and graphically presented in Figure 7.3. Table 7-13 shows the annual availability of the
138/25 kV fixed turn ratio transformer group for different numbers of spare transformers (from zero
up to 3 spares). Note that the annual availability decreases over the years because the aging
failure probability of the transformers increases with time.
Figure 7.3 shows that two fixed turn ratio spare transformers are needed in year 2006, and these two
spare transformers are able to meet the specified reliability level (0.993 availability) until the end of
the planning period (2015).
Table 7-13 Availability of the 138/25 kV fixed turn ratio transformer
group (34 units) for different numbers of spare transformers.
      Number of spare transformers
Year  0       1       2       3
2006  0.8757  0.9922  0.9997  1.0000*
2007  0.8651  0.9908  0.9996  1.0000*
2008  0.8537  0.9891  0.9995  1.0000*
2009  0.8417  0.9872  0.9993  1.0000*
2010  0.8289  0.9849  0.9991  1.0000*
2011  0.8154  0.9824  0.9989  0.9999
2012  0.8011  0.9794  0.9986  0.9999
2013  0.7862  0.9761  0.9982  0.9999
2014  0.7706  0.9723  0.9978  0.9999
2015  0.7542  0.9680  0.9972  0.9998
* The values of 1.0000 were obtained by rounding in order to present only 4 digits after the decimal point.

[Figure: availability (/year) versus year, 2006 to 2016, showing the curve for 2 spares against the
specified reliability criterion.]
Figure 7.3 The number of fixed turn ratio spare transformers required to meet the specified reliability level.
7.3.3.3 On-load tap changing (LTC) transformer group

Total average interruption duration target is:
SAIDI×(number of delivery points) = 2.1×12 = 25.2 hrs/year
The unavailability target = 25.2/8760 = 0.0029
The availability of 0.9971 is used as the specified reliability criterion for the 16 on-load
tap-changing transformers located in 12 substations. The transformer group reliability must be equal
to or above this specified reliability level at all times during the planning period (2006 – 2015).
The results obtained using the SPARE program for the 138/25 kV LTC transformers are shown in
Table 7-14 and graphically presented in Figure 7.4. It can be seen from Figure 7.4 that one LTC
spare transformer is required in year 2006 in order to maintain the specified reliability level
(0.9971 availability) for the 138/25 kV LTC transformer group. In 2012, one spare transformer will
no longer be sufficient to meet the specified reliability criterion, and a second spare LTC
transformer will be required from that year.

Table 7-14 Availability of the 138/25 kV LTC transformer group
(16 units) for different numbers of spare transformers.
      Number of spare transformers
Year  0       1       2
2006  0.9514  0.9989  1.0000*
2007  0.9470  0.9987  1.0000*
2008  0.9422  0.9984  1.0000*
2009  0.9371  0.9981  1.0000*
2010  0.9316  0.9978  1.0000*
2011  0.9257  0.9974  0.9999
2012  0.9194  0.9969  0.9999
2013  0.9127  0.9963  0.9999
2014  0.9055  0.9957  0.9999
2015  0.8979  0.9950  0.9998
[Figure: availability versus year, 2006 to 2016, showing the curves for 1 spare and 2 spares against
the specified reliability criterion.]
Figure 7.4 The number of LTC spare transformers required to meet the specified reliability level.
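The number and timing of spares for the LTC group can be read off Table 7-14 mechanically. The sketch below is not the SPARE program itself: it only post-processes the availability values copied from Table 7-14 against the 0.9971 criterion, and the function and variable names are assumptions.

```python
TARGET_AVAILABILITY = 0.9971

# Availability with 0, 1 and 2 spares, copied from Table 7-14.
LTC_AVAILABILITY = {
    2006: (0.9514, 0.9989, 1.0000),
    2007: (0.9470, 0.9987, 1.0000),
    2008: (0.9422, 0.9984, 1.0000),
    2009: (0.9371, 0.9981, 1.0000),
    2010: (0.9316, 0.9978, 1.0000),
    2011: (0.9257, 0.9974, 0.9999),
    2012: (0.9194, 0.9969, 0.9999),
    2013: (0.9127, 0.9963, 0.9999),
    2014: (0.9055, 0.9957, 0.9999),
    2015: (0.8979, 0.9950, 0.9998),
}


def minimum_spares(availability_by_spares, target):
    """Smallest number of spares whose availability meets the target,
    or None if the tabulated numbers of spares are not sufficient."""
    for spares, availability in enumerate(availability_by_spares):
        if availability >= target:
            return spares
    return None


schedule = {
    year: minimum_spares(row, TARGET_AVAILABILITY)
    for year, row in LTC_AVAILABILITY.items()
}

previous = 0
for year in sorted(schedule):
    if schedule[year] != previous:
        print(f"{year}: {schedule[year]} spare(s) required")
        previous = schedule[year]
```

Applied to these values, the routine reports one spare from 2006 and a second spare from 2012, matching the reading of Figure 7.4 given in the text.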
7.3.3.4 Combined fixed turn ratio and LTC transformer group

The advantage of an LTC spare transformer is that it can replace either a fixed turn ratio or an LTC
transformer. The number of LTC spare transformers needed to back up all the 138/25 kV fixed turn
ratio and LTC transformers at BCTC can be determined using the same method.

SAIDI×(number of delivery points) = 2.1×35 = 73.5 hrs/year
The availability of 0.9916 is used as the specified reliability criterion for the 50 transformers
(fixed turn ratio and LTC) located in 35 delivery points (substations). The transformer group
reliability must be equal to or above this specified reliability level at all times during the
planning period (2006 – 2015). The results obtained using the SPARE program for this group are
shown in Table 7-15 and graphically presented in Figure 7.5. It can be seen from Figure 7.5 that two
LTC spare transformers are needed in year 2006 to back up both the fixed turn ratio and LTC
transformers, and these two spare transformers are able to maintain the specified reliability level
until the end of the planning period (2015).
The results in the three sub-sections above indicate that if the fixed turn ratio spare and LTC spare
transformers are considered separately, the system would need 4 spare transformers (2 fixed turn
ratio spare transformers and 2 LTC spare transformers) by the end of 2015. However, if the LTC
spare transformers are used to back up both the fixed turn ratio and LTC transformers, the
system would need only two LTC spare transformers. This strategy leads to a considerable saving in
capital investment while still maintaining the specified reliability criterion for the 138/25 kV 25
MVA transformer group.
Table 7-15 Availability of 138/25 kV fixed turn ratio and LTC transformer
group (50 units) for different numbers of spare transformers.
      Number of spare transformers
Year  0       1       2       3
2006  0.8331  0.9856  0.9992  1.0000*
2007  0.8192  0.9829  0.9989  1.0000*
2008  0.8044  0.9799  0.9986  0.9999
2009  0.7887  0.9764  0.9983  0.9999
2010  0.7722  0.9724  0.9978  0.9999
2011  0.7548  0.9678  0.9972  0.9998
2012  0.7366  0.9626  0.9964  0.9997
2013  0.7175  0.9566  0.9955  0.9997
2014  0.6978  0.9499  0.9944  0.9995
2015  0.6772  0.9423  0.9931  0.9994
[Figure: availability versus year, 2006 to 2016, showing the curve for 2 spares against the
specified reliability criterion.]
Figure 7.5 The number of LTC spare transformers required to meet the specified reliability
level for the transformer group composed of both fixed turn ratio and LTC transformers.

7.3.4 Summary
In this second application, a reliability based method for spare equipment planning is presented. It
can be applied to any power system equipment. The method is useful in the practical planning and
decision making processes of utilities for minimizing capital investment cost without sacrificing
reliability requirements. The method includes the following main aspects:
• Estimating average unavailability of individual equipment due to both repairable and aging
failures
• Evaluating reliability of the equipment group with different numbers of spares
• Selecting a reliability level that the equipment group should meet in the planning period
• Performing the spare analysis to determine the number and timing of spares needed to meet the
specified reliability criterion.
The 138/25 kV 25 MVA transformer group in the BCTC system is used to illustrate the application
procedure of the presented spare equipment analysis method. The reliability criterion in this example
is based on the corporate reliability performance target for the SAIDI index at BCTC. The results
indicate that two spare LTC transformers are required to meet the specified reliability criterion for
the 50 138/25 kV transformers in the 10-year period from 2006 to 2015.
7.4 Further Discussions

This chapter discussed two applications of risk based asset management. The first application is the
risk evaluation based approach to the replacement strategy of aged equipment in power systems. The
decision on the replacement of an aged HVDC cable in the Vancouver Island supply system at BCTC
was used as an example to demonstrate the application procedure. The second application is a risk
evaluation based method to determine the number and timing of spare equipment. A 138/25 kV
transformer group was used as an example to illustrate the application procedure. The risk evaluation
based techniques can be applied to other aspects of asset management such as preventive
maintenance planning, maintenance scheduling, workforce planning in maintenance, equipment
retirement strategy and life cycle management. More material can be found in the references. Many
risk/reliability evaluation studies have been conducted in probabilistic planning and asset
management at BCTC. The 17 technical reports in this area are available at the BCTC website [29].
Traditional asset management focuses on individual equipment, including investigation into the
physical condition of equipment, operating performance and field environment. A basic fact, which
has been more or less ignored in traditional asset management, is that the importance of an
individual piece of equipment in a system does not depend on the equipment itself but on the impact
that its absence from the system has on overall system reliability. If the absence of a piece of
equipment from the system due to maintenance, retirement or failure has little or only marginal
impact on system operation risk, that equipment should be given much lower priority in the asset
management process. Conversely, if the absence of a piece of equipment from the system has very
large effects on system reliability, any issue associated with its maintenance, replacement or
retirement should be emphasized. Quantified probabilistic assessment of the impact of equipment
unavailability on system reliability is the key idea of the risk evaluation based asset management
method presented in this chapter. It should be emphasized that there is no conflict between
traditional considerations in asset management and risk evaluation based asset management methods.
Both can be performed to enhance the asset management process.

Another important point associated with asset management is the aging failure modeling of equipment.
In traditional risk evaluation, only repairable failures are considered, while aging failures are
ignored or improperly modeled. Equipment aging is a basic fact in the majority of power systems. One
of the objectives of asset management is to deal with aged system components. Both aging and
repairable failure models have been incorporated in the presented risk based asset management
method.
The input data for repairable and aging failure models are crucial for risk evaluation based asset
management. The collection, processing, storage, reporting and utilization of historical failure
records is one of the keys to asset management. A computerized reliability database management
system is becoming increasingly important. More information on the data management system can be
found in Reference 30.
7.5 References
[1] W. Li, Risk Assessment of Power Systems: Models, Methods, and Applications, IEEE Press and Wiley & Sons,
2005
[2] R. Billinton and R. N. Allan, Reliability Evaluation of Power Systems, Plenum Press, New York, 1996
[3] R. Billinton and W. Li, Reliability Assessment of Electric Power Systems Using Monte Carlo Methods, Plenum
Press, New York, 1994
[4] J. Endrenyi, Reliability Modeling in Electric Power Systems, Wiley & Sons, Chichester, 1978
[5] G. J. Anders, Probability Concepts in Electric Power Systems, Wiley & Sons, New York, 1990
[6] A. K. S. Jardine, Maintenance, Replacement and Reliability, Pitman Publishing, London, 1973
[7] N. B. Bloom, Reliability Centered Maintenance, McGraw-Hill, Inc., New York, 2006
[8] IEEE Tutorial Course Text, Electric Delivery System Reliability Evaluation, 05TP175, 2005
[9] IEEE Task Force, “The Present Status of Maintenance Strategies and the Impacts of Maintenance on
Reliability”, IEEE Trans. on Power Systems, Vol. 16, No. 4, November 2001, pp638-646
[10] J. Endrenyi, G. J. Anders and A. M. Leite da Silva, “Probabilistic Evaluation of the Effect of Maintenance on
Reliability – An Application”, IEEE Trans. on Power Systems, Vol. 13, No. 2, May 1998, pp576-583
[11] W. Li, E. Vaahedi and P. Choudhury, “Power System Equipment Aging – Assessment, Maintenance and
Retirement”, IEEE Power& Energy, Vol. 4, No. 3, May/June, 2006, pp52-58
[12] W. Li, J. Zhou, J. Lu and W. Yan, “A Probabilistic Analysis Approach to Making Decision on Retirement of
Aged Equipment in Transmission Systems”, accepted for publication in IEEE Trans. on Power Delivery
[13] W. Li and J. K. Korczynski, "A Reliability Based Approach to Transmission Maintenance Planning and Its
Application in BCTC System", IEEE Trans. on Power Delivery, Vol. 19, No. 1, January 2004, pp303-308
[14] W. Li, “Incorporating Aging Failures in Power System Reliability Evaluation”, IEEE Transactions on Power
Systems, Vol. 17, No. 3 August 2002, pp. 918 – 923
[15] W. Li and S. Pai, “Evaluating Unavailability of Equipment Aging failures”, IEEE Power Engineering Review,
February, 2002, pp52-54
[16] L. Bertling, Reliability Centered Maintenance for Electric Power Distribution Systems, Ph.D. thesis, Royal
Institute of Technology (KTH), Stockholm, 2002
[17] W. Li, P. Choudhury, D. Gillespie and J. Jue, “A Risk Evaluation Based Approach to Replacement Strategy of
Aged HVDC Components and Its Application at BCTC”, accepted for publication in IEEE Transaction on
Power Delivery
[18] EPRI, High-Voltage Direct Current Handbook, EPRI TR-104166, prepared by GE Industrial and Power
Systems, 1994
[19] R.N. Allan, R. Billinton, A.M. Breipohl, C.H. Grigg, “Bibliography on the Application of Probability Methods
in Power System Reliability Evaluation: 1992-1996”, IEEE Transactions on Power Systems, Vol. 14, No. 1,
1999, pp. 51-57
[20] R. Billinton, M. Fotuhi-Firuzabad and L. Bertling, “Bibliography on the Application of Probability Methods in
Power System Reliability Evaluation 1996-1999”, IEEE Transactions on Power Systems, Vol. 16, No. 4, Nov.
2001, pp595 – 602
[21] EPRI Report, Framework for Stochastic Reliability of Bulk Power System, TR-110048, Palo Alto, California
1998

[22] CIGRE Task Force 38-03-10, Composite Power System Reliability Analysis, CIGRE Symposium on Electric
Power System Reliability, September 16-18, 1991
[23] W. Li, “Evaluating Mean Life of Power System Equipment with Limited End-of-Life Failure Data”, IEEE
Trans. on Power Systems, Vol. 19, No. 1, February 2004, pp. 236-242
[24] W. Li, “Probability Distribution of HVDC Capacity Considering Repairable and Aging Failures”, IEEE Trans.
on Power Delivery, Vol. 21, No. 1, January 2006, pp. 523-525
[25] BC Hydro Report, Pole 1 and Pole 2 DC Cable - 2005 ROV Inspection (Summary of Results), January 2006
[26] W. Li, E. Vaahedi and Y. Mansour, “Determining Number and Timing of Substation Spare Transformers Using
a Probabilistic Cost Analysis Approach”, IEEE Trans. on Power Delivery, Vol. 14, No. 3, July 1999, pp.
934-939
[27] British Columbia Transmission Corporation, “Service Plan: For Fiscal Year 2005/06 to 2007/08”, February 2005,
available at: http://www.bcbudget.gov.bc.ca/2005/sp/crownagency/bctc.pdf
[28] W. Wangdee, W. Li, W. Shum and P. Choudhury, “Applying Probabilistic Methods in Determining the Number
of Spare Transformers and their Timing Requirements”, IEEE CCECE 2007 conference, Vancouver, April 2007
[29] British Columbia Transmission Corporation, 17 technical reports on reliability assessment, available at:
http://www.bctc.com/the_transmission_system/reliability_assessment/
[30] W. Li, H. C. Jonas, S. Yan, B. Corns, P. Choudhury and E. Vaahedi, “Reliability Decision Management System:
Experience at BCTC”, IEEE CCECE 2007 conference, Vancouver, April 2007
7.6 Biography
Dr. Wenyuan Li (SM'86, F'02) is currently a Principal Engineer at BCTC in Canada and an advisory
professor at Chongqing University in China. He is an IEEE Fellow. Dr. Li is the author or coauthor of a
considerable number of papers on power system planning, operation, optimization, reliability and asset
management. He has published four books on power system operation and risk assessment, including
“Risk Assessment of Power Systems: Models, Methods, and Applications”, IEEE Press and Wiley & Sons,
2005, and has completed more than sixty technical reports for industry applications. He has also delivered
many tutorials and seminars at international conferences (IEEE, PMAPS and CEA) and industrial
workshops (EPRI, WECC and NWPP). Dr. Li received the 1996 “Outstanding Engineer Award” from
IEEE Canada and the “Significant Reviewer Award” from IEEE PES in 2006. He can be
reached at wen.yuan.li@bctc.com.
