Modern societies depend on the smooth operation of many complex systems
(designed and built by humans) that provide a variety of outputs (products and
services). These include transport systems (trains, buses, ferries, ships and
aeroplanes), communication systems (television, telephone and computer networks),
utilities (water, gas and electricity networks), manufacturing plants (to produce
industrial products and consumer durables), processing plants (to extract and process
minerals and oil), hospitals (to provide services) and banks (for financial
transactions), to name a few.
Every system built by humans is unreliable in the sense that it degrades with
age and/or usage. A system is said to fail when it is no longer capable of delivering
the designed outputs. Some failures can be catastrophic in the sense that they can
result in serious economic losses, affect humans and do serious damage to the
environment. Typical examples include the crash of an aircraft in flight, the failure of a
sewage processing plant and the collapse of a bridge. Degradation can be controlled,
and the likelihood of catastrophic failures reduced, through maintenance
actions, including preventive maintenance, inspection, condition monitoring and
design-out maintenance. Corrective maintenance actions are needed to restore a
failed system to operational state through repair or replacement of the components
that caused the failure.
Maintenance has moved from being an engineering activity after a system has
been put into operation into an important issue that needs to be addressed during
the design and manufacturing or building of the system. Maintenance impacts on
reliability (a technical issue) with serious economic and commercial implications.
This implies that operators of complex systems need to look at maintenance from
an overall business perspective that integrates the technical and commercial issues
in an effective manner.
The literature on maintenance is vast. Over the last 50 years, there have been
dramatic changes due to advances in the understanding of the physics of failure, in
technologies to monitor and assess the state of the system, in computers to store
and process large amounts of relevant data and in the tools and techniques needed
to build models to determine the optimal maintenance strategies.
The aim of this book is to integrate this vast literature with different chapters
focusing on different aspects of maintenance and written by active researchers
and/or experienced practitioners with international reputations. Each chapter
reviews the literature dealing with a particular aspect of maintenance (for example,
methodology, approaches, technology, management, modelling analysis and
optimisation), reports on the developments and trends in a particular industry sector,
or deals with a case study. It is hoped that the book will help to narrow the gap
between theory and practice and will trigger new research in maintenance.
The book is written for a wide audience. This includes practitioners from indus-
try (maintenance engineers and managers) and researchers investigating various
aspects of maintenance. Also, it is suitable for use as a textbook for postgraduate
programs in maintenance, industrial engineering and applied mathematics.
We would like to thank the authors of the chapters for their collaboration and
prompt responses to our enquiries which enabled completion of this handbook on
time. We also wish to acknowledge the support of the University of Salford and the
award of CAMPUS Fellowship in 2006 to one of us (PM). We gratefully acknowl-
edge the help and encouragement of the editors of Springer, Anthony Doyle and
Simon Rees. Also, our thanks to Sorina Moosdorf and the staff involved with the
production of the book.

An Overview

K.A.H. Kobbacy and D.N.P. Murthy

1.1 Introduction
The efficient functioning of modern society depends on the smooth operation of
many complex systems comprised of several pieces of equipment that provide a
variety of products and services. These include transport systems (trains, buses,
ferries, ships and aeroplanes), communication systems (television, telephone and
computer networks), utilities (water, gas and electricity networks), manufacturing
plants (to produce industrial products and consumer durables), processing plants
(to extract and process minerals and oil), hospitals (to provide services) and banks
(for financial transactions) to name a few. All equipment is unreliable in the sense
that it degrades with age and/or usage and fails when it is no longer capable of
delivering the products and services. When a complex system fails, the consequences
can be dramatic: serious economic losses, harm to humans and serious damage to the
environment, as in, for example, the crash of an aircraft in flight, the failure of a
sewage processing plant or the collapse of a bridge.
Through proper corrective maintenance, one can restore a failed system to an
operational state by actions such as repair or replacement of the components that
failed and in turn caused the failure of the system. The occurrence of failures can
be controlled through maintenance actions, including preventive maintenance,
inspection, condition monitoring and design-out maintenance. With good design
and effective preventive maintenance actions, the likelihood of failures and their
consequences can be reduced but failures can never be totally eliminated.
The approach to maintenance has changed significantly over the last one
hundred years. Over a hundred years ago, the focus was primarily on corrective
maintenance delegated to the maintenance section of the business to restore failed
systems to an operational state. Maintenance was carried out by trained technicians
and was viewed as an operational issue and did not play a role in the design and
operation of the system. The importance of preventive maintenance was fully
appreciated during the Second World War. Preventive maintenance involves
additional costs and is worthwhile only if the benefits exceed the costs. Deciding
the optimum level of maintenance requires building appropriate models and using
sophisticated optimisation techniques. Also, around this time, maintenance issues
started to be addressed at the design stage and this led to the concept of
maintainability. Reliability and maintainability (R&M) became major issues in the
design and operation of systems.
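The cost-benefit trade-off for preventive maintenance can be illustrated with a small numerical sketch. The example assumes a textbook age-based replacement policy with a Weibull lifetime: an item is replaced preventively at age T (at a lower cost) or on failure (at a higher cost), whichever comes first. The model choice, the function names and all numbers are illustrative assumptions, not results taken from the handbook.

```python
import math


def weibull_reliability(t, shape, scale):
    """Probability that an item survives beyond age t (Weibull lifetime, assumed)."""
    return math.exp(-((t / scale) ** shape))


def cost_rate(T, shape, scale, cost_preventive, cost_failure, steps=2000):
    """Long-run expected cost per unit time when replacing at age T or on failure.

    cost rate = expected cost per cycle / expected cycle length, where the
    expected cycle length is the integral of the reliability from 0 to T
    (evaluated here with the trapezoidal rule).
    """
    h = T / steps
    area = 0.5 * (1.0 + weibull_reliability(T, shape, scale))
    for i in range(1, steps):
        area += weibull_reliability(i * h, shape, scale)
    expected_cycle_length = area * h

    survive = weibull_reliability(T, shape, scale)
    expected_cycle_cost = cost_preventive * survive + cost_failure * (1.0 - survive)
    return expected_cycle_cost / expected_cycle_length


def best_interval(candidates, shape, scale, cost_preventive, cost_failure):
    """Pick the candidate replacement age with the lowest cost rate (grid search)."""
    return min(
        candidates,
        key=lambda T: cost_rate(T, shape, scale, cost_preventive, cost_failure),
    )


if __name__ == "__main__":
    # Hypothetical item: wear-out behaviour (shape > 1), failure ten times
    # as costly as a planned replacement.
    shape, scale = 2.5, 1000.0
    cost_preventive, cost_failure = 100.0, 1000.0
    T_best = best_interval(range(50, 3001, 50), shape, scale, cost_preventive, cost_failure)
    run_to_failure = cost_failure / (scale * math.gamma(1.0 + 1.0 / shape))
    print("best replacement age:", T_best)
    print("cost rate with PM:", cost_rate(T_best, shape, scale, cost_preventive, cost_failure))
    print("cost rate without PM:", run_to_failure)
```

Under these assumptions preventive replacement pays off only when the item wears out (shape greater than 1) and a failure costs more than a planned replacement; with a constant failure rate (shape equal to 1) the same calculation shows no benefit from replacing early.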
Degradation and failure depend on the stresses on the various components of the
system. These depend on the operating conditions that are dictated by commercial
considerations. As a result, maintenance moved from a purely technical issue to a
strategic management issue with options such as outsourcing of maintenance, leasing
equipment as opposed to buying, etc. Also, advances in technologies (new materials,
new sensors for monitoring, data collection and analysis) added new dimensions
(science, technology) to maintenance. These advances will continue at an ever-
increasing pace in the twenty-first century.
This handbook tries to address the various issues associated with the main-
tenance of complex systems. The aim is to give a snapshot of the current status and
highlight future trends. Each chapter deals with a particular aspect of maintenance
(for example, methodology, approaches, technology, management, modelling
analysis and optimisation) and reports on developments and trends in a particular
industry sector or deals with a case study. In this chapter we give an overview of
the handbook. The outline of the chapter is as follows. Section 1.2 deals with the
framework that is needed to study the maintenance of complex systems and we
discuss some of the salient issues. Section 1.3 presents the structure of the book
and gives a brief outline of the different chapters in the handbook. We conclude
with a discussion of the target audience for the handbook.

1.2 Framework for Study of Maintenance

A proper study of maintenance requires a comprehensive framework that incorpo-
rates all the key elements. However, not all the elements would be relevant for a
particular maintenance problem under consideration.
The systems approach is an effective approach to solving maintenance prob-
lems. In this approach, the real world relevant to the problem is described through
a characterisation where one identifies the relevant variables and the interaction
between the variables. This characterisation can be done using language or a
schematic network representation where the nodes represent the variables and the
connected arcs denote the relationships. This is good for qualitative analysis. For
quantitative analysis, one needs to build mathematical models to describe the
relationships. Often this requires stochastic and dynamical formulations as system
degradation and failures occur in an uncertain manner. In this section, we discuss
the various key elements and some related issues.
We use the term “asset” to denote a complex system or individual equipment. It
can include infrastructures such as buildings, bridges etc. in addition to those listed
in Section 1.1.

1.2.1 Stakeholders

For an asset there can be several stakeholders as indicated in Figure 1.1.

Figure 1.1. Stakeholders for maintenance of an asset

The number of parties involved would depend on the asset under consideration.
For example, in case of a rail network (used to provide a service to transport people
and goods) the customers can include the rail operators (operating the rolling
stock) and the public. The owner can be a business entity, a financial institution or
a government agency. The operator is the agency that operates the track and is
responsible for the flow of traffic. The service provider refers to the agency
carrying out the maintenance (preventive and corrective). It can be the operator (in
which case maintenance is done in-house) or some external agent (if maintenance
is outsourced) or both (when only some of the maintenance activities are
outsourced). The regulator is the independent agency which deals with safety and risk
issues. It defines the minimum standards for safety and can impose fines on the
owner, operator and possibly the service provider should the safety levels be
compromised. Government plays a critical role in providing the subsidy and
assuming certain risks. In this case all the parties involved are affected by the
maintenance carried out on the asset. If the line is shut frequently and/or for
long durations, it can affect customer satisfaction and patronage, the returns to the
operators and owners and the costs to the government.

1.2.2 Different Perspectives

We focus our attention on the case where the asset is owned by the owner and
maintenance is outsourced. In this case, we have two parties – (i) owner (of the
asset) and (ii) service agent (providing the maintenance). Figure 1.2 is a very
simplified system characterisation of the maintenance process where the
maintenance activities are defined through a maintenance service contract. The problem
is to determine the terms of the service contract.

Figure 1.2. System characterisation for maintenance out-sourcing

Each of the elements of Figure 1.2 involves several variables. For example, the
maintenance service contract involves the following: (i) duration of contract, (ii)
price of contract, (iii) maintenance performance requirements, (iv) incentives and
penalties, (v) dispute resolution, etc. The maintenance performance requirements
can include measures such as availability, mean time between failures and so on.
The characterisation of the owner’s decision-making process can involve costs,
asset state at the end of the contract, risks (service agent not providing the level and
quality of service) and so on. The interests and goals of the owner are different
from those of the service agent.
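As a small illustration of the performance measures mentioned above, the sketch below computes mean time between failures and availability from a record of operating and repair periods. The definitions used (mean operating time between failures, and uptime divided by uptime plus downtime) are common conventions assumed here; an actual service contract may define these measures differently, and the data and the availability target are hypothetical.

```python
def mean_time_between_failures(uptimes):
    """Average operating time between successive failures."""
    return sum(uptimes) / len(uptimes)


def mean_time_to_repair(downtimes):
    """Average time taken to restore the asset after a failure."""
    return sum(downtimes) / len(downtimes)


def availability(uptimes, downtimes):
    """Fraction of the observed period during which the asset was operational."""
    total_up = sum(uptimes)
    total_down = sum(downtimes)
    return total_up / (total_up + total_down)


def meets_requirement(uptimes, downtimes, required_availability):
    """Check the observed availability against a (hypothetical) contract target."""
    return availability(uptimes, downtimes) >= required_availability


if __name__ == "__main__":
    # Hypothetical log: hours of operation before each failure and the
    # hours of downtime that followed it.
    uptimes = [100.0, 200.0, 300.0]
    downtimes = [10.0, 20.0, 30.0]
    print("MTBF (h):", mean_time_between_failures(uptimes))
    print("MTTR (h):", mean_time_to_repair(downtimes))
    print("availability:", availability(uptimes, downtimes))
    print("meets 0.95 target:", meets_requirement(uptimes, downtimes, 0.95))
```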
The study of maintenance is complicated by unknown and uncontrollable
factors. These can include the rate of degradation (which depends on several factors such as
material properties, operating environment, etc.) and commercial factors (for example, high
demand for power in the case of a power plant due to very hot weather).

1.2.3 Key Issues and the Need for Multi-disciplinary Approach

The key issues in the maintenance of an asset are shown in Figure 1.3. Asset
acquisition is influenced by business considerations and the asset's inherent reliability is
determined by the decisions made during design. The field reliability and degradation
are affected by operations (usage intensity, operating environment, operating
load, etc.). Through the use of technologies, one can assess the state of the asset. The
analysis of the data and models allows for optimizing the maintenance decisions
(either for a given operating condition or jointly optimizing the maintenance and
operations). Once the maintenance actions have been formulated it needs to be

Figure 1.3. Key Issues in maintenance of an asset

To execute effective maintenance one needs to have a good understanding of a
variety of concepts and techniques for each of the issues. Another issue is the
computer packages that allow one to collect and analyze data and build models and
derive the optimal solutions.

The linking of the technical and commercial issues is indicated in Figure 1.4
and this requires an inter-disciplinary approach.

Figure 1.4. Linking technical and commercial issues

The disciplines involved are as follows.

Engineering

The degradation of an asset depends to some extent on the design and building (or
production) of the asset. Poor design leads to poor reliability that in turn results in
a high level of corrective maintenance. On the other hand, a well-designed system is
more reliable and hence less prone to failures. Maintainability deals with
maintenance issues at the design and development stage of the asset.

Science

This is very important in the understanding of the physical mechanisms that are at
play and have a significant influence on the degradation and failure. Choosing the
wrong material can have serious consequences and an impact on the subsequent
maintenance actions needed.

Economic

Maintenance costs can be a significant fraction of the total operating budget for a
business depending on the industry sector. There are two types of costs: annual
cost and cost over the life cycle of the asset. The costs can be divided into direct
(labour, material, etc.) and indirect (consequence of failure).

Legal

This is important in the context of maintenance outsourcing and maintenance of
leased equipment. In both cases, the central issue is the contract between the
parties involved. Of particular importance is dispute resolution when there is a
disagreement between the parties over the violation of some terms of the
contract.

Statistics

Degradation and failures occur in an uncertain manner. As such, the analysis of
such data requires the use of statistical techniques. Statistics provides the concepts
and tools to extract information from data and for the planning of efficient collection
systems.

Operational Research

Operational research provides the tools and techniques for model building, analysis
and optimization. Often, analytical approaches fail and one needs to use a simulation
approach to evaluate the outcomes of different decisions and to choose the optimal
(or near optimal) strategies.

Reliability Theory

Reliability theory deals with the interdisciplinary use of probability, statistics and
stochastic modelling, combined with engineering insights into the design and the
scientific understanding of the failure mechanisms, to study the various aspects of
reliability. As such, it encompasses issues such as (i) reliability modelling, (ii)
reliability analysis and optimization, (iii) reliability engineering, (iv) reliability
science, (v) reliability technology and (vi) reliability management.

Information Technology and Computer Science

The operation and maintenance of complex assets generate a lot of data. One
needs efficient ways to store and manipulate the data and to extract relevant
information from data. Computer science provides a range of artificial intelligence
techniques such as data mining, expert systems, neural networks, etc., which are
very important in the context of maintenance.

1.2.4 Maintenance Management

Maintenance management deals with the overall management of the maintenance
of an asset. The management needs to be done at three different levels (strategic,
tactical and operational) as indicated in Figure 1.5.

Figure 1.5. Maintenance management

The strategic level deals with maintenance strategy. This needs to be formu-
lated so that it is consistent and coherent with other (production, marketing,
finance, etc.) business strategies. The tactical level deals with the planning and
scheduling of maintenance. The operational level deals with the execution of the
maintenance tasks and collection of relevant data.

1.3 Structure of the Handbook

The handbook integrates the vast literature on maintenance with each chapter
focusing on a different aspect of maintenance and written by active researchers
with international reputations and/or experienced practitioners from industry. Each
chapter either reviews the literature dealing with a particular aspect of maintenance
(for example, methodology, approaches, technology, management, modelling
analysis and optimisation), reports on developments and trends in a particular industry
sector, or deals with a case study.
The book is structured into six parts and each of the last five parts contains
several chapters. The topics of the different chapters are as indicated below.

Part A: An Overview

Chapter 1: An Overview (Khairy Kobbacy and Pra Murthy)

Part B: Evolution of Concepts and Approaches

Chapter 2: Maintenance: An Evolutionary Perspective (Liliane Pintelon and
Alejandro Parodi Herz)
Chapter 3: New Technologies for Maintenance (Jay Lee and Haixia Wang)
Chapter 4: Reliability Centred Maintenance (Marvin Rausand and Jorn Vatn)

Part C: Methods and Techniques

Chapter 5: Condition-based Maintenance Modelling (Wenbin Wang)

Chapter 6: Maintenance Based on Limited Data (David F. Percy)
Chapter 7: Reliability Prediction and Accelerated Testing (Elsayed A. Elsayed)
Chapter 8: Preventive Maintenance Models for Complex Systems
(David F. Percy)
Chapter 9: Artificial Intelligence in Maintenance (Khairy A.H. Kobbacy)

Part D: Problem Specific Models

Chapter 10: Maintenance of Repairable Systems (Bo Henry Lindqvist)

Chapter 11: Optimal Maintenance of Multi-component Systems: A Review
(Robin P. Nicolai and Rommert Dekker)
Chapter 12: Replacement of Capital Equipment (Philip A. Scarf and Joseph
C. Hartman)
Chapter 13: Maintenance and Production: A Review of Planning Models
(Gabriella Budai, Rommert Dekker and Robin P. Nicolai)
Chapter 14: Delay Time Modelling (Wenbin Wang)

Part E: Management

Chapter 15: Maintenance Outsourcing (Pra Murthy and Nat Jack)

Chapter 16: Maintenance of Leased Equipment (Pra Murthy and Jarumon
Chapter 17: Computerised Maintenance Management Systems (Ashraf Labib)
Chapter 18: Risk Analysis in Maintenance (Terje Aven)
Chapter 19: Maintenance Performance Measurement (MPM) System
(Uday Kumar and Aditya Parida)
Chapter 20: Forecasting for Inventory Management of Service Parts
(John E. Boylan and Aris A. Syntetos)

Part F: Applications (Case Studies)

Chapter 21: Maintenance in the Rail Industry (Jorn Vatn)

Chapter 22: Condition Monitoring of Diesel Engines
(Renyan Jiang and Xinping Yan)
Chapter 23: Benchmarking of the Maintenance Process at Banverket
(The Swedish National Rail Administration)
(Ulla Espling and Uday Kumar)
Chapter 24: Integrated e-Operations–e-Maintenance: Applications in North Sea
Offshore Assets (Jayanta P. Liyanage)
Chapter 25: Fault Detection and Identification for Longwall Machinery Using
SCADA Data (Daniel Bongers and Hal Gurgenci)

A brief outline of each chapter is as follows.

Chapter 2: Maintenance: An Evolutionary Perspective

In the past few decades industrial maintenance has evolved from a non-issue into a
strategic concern. During this period the role of maintenance has drastically been
transformed. This chapter, while considering the fundamental elements of main-
tenance and its environment, describes the evolution path of maintenance manage-
ment and the driving forces of such changes. It basically explains how and why
maintenance practice has evolved in time. It includes basic notions of maintenance
and clearly classifies and distinguishes between different types of maintenance
actions, policies and concepts currently available. The chapter concludes by
introducing the reader to some new challenges in maintenance.

Chapter 3: New Technologies for Maintenance

Predictive maintenance is critical to any engineering system, especially complex
systems, in order to avoid system breakdown. With the recent advances in pervasive
computing, prognostics can be easily embedded in any devices and systems. When
smart machines are networked and remotely monitored, and when their data is
modelled and continually analyzed with sophisticated embedded systems, it is
possible to go beyond mere “predictive maintenance” to intelligent “prognostics”, the
process of pinpointing exactly which components of a machine are likely to fail and
then autonomously triggering service and ordering spare parts. This chapter addresses the
paradigm shift in modern maintenance systems from the traditional “fail and fix”
practices to a “predict and prevent” methodology. Recent advances in prognostic
technologies and tools are presented, and future work directions are discussed.

Chapter 4: Reliability Centred Maintenance

This chapter gives an introduction to reliability centred maintenance (RCM). The
RCM analysis process is divided into 12 distinct steps. Each step is thoroughly
described and discussed. The main RCM process is similar to the processes
outlined in RCM standards and guidelines, but has more focus on the optimization
of maintenance intervals. A new approach is proposed based on generic RCM
analyses related to specified classes of consequences. The new approach will
significantly reduce the workload of the RCM analysis. A computer tool, OptiRCM,
developed by the authors, is used to illustrate the new approach.
Several examples from railway applications are provided.

Chapter 5: Condition-based Maintenance Modelling

This chapter presents a model for supporting condition based maintenance decision
making. The chapter discusses various issues related to the subject, such as the
definition of the state of an asset, direct or indirect monitoring, relationship be-
tween observed measurements and the state of the asset, and current modelling
developments. In particular, the chapter focuses on a modelling technique used
recently in predicting the residual life via stochastic filtering. This is a key element
in modelling the decision making aspect of condition based maintenance. A few
key condition monitoring techniques are also introduced and discussed. Methods of
estimating model parameters are outlined and a numerical example based on real
data is presented.

Chapter 6: Maintenance Based on Limited Data

Reliability applications often suffer from paucity of data for making informed
maintenance decisions. This is particularly noticeable for high reliability systems
and when new production lines or new warranty schemes are planned. Such issues
are of great importance when selecting and fitting mathematical models to improve
the accuracy and utility of these decisions. This chapter investigates why reliability
data are so limited and proposes statistical methods for dealing with these
difficulties. It considers graphical and numerical summaries, appropriate methods
for model development and validation, and the powerful approach of subjective
Bayesian analysis for including expert knowledge about the application area.

Chapter 7: Reliability Prediction and Accelerated Testing

This chapter presents an overview of accelerated life testing (ALT) methods and
their use in reliability prediction at normal operating conditions. It describes the
most commonly used models and introduces new ones which are “distribution
free”. Design of optimum test plans in order to improve the accuracy of reliability
prediction is also presented and discussed. The chapter provides, for the first time,
the link between accelerated life testing and maintenance actions. It develops
procedures for using the ALT results for estimating the optimum preventive
maintenance schedule and the optimum degradation threshold level for degrading
systems. The procedures are demonstrated using two numerical examples.

Chapter 8: Preventive Maintenance Models for Complex Systems

Preventive maintenance (PM) of repairable systems can be very beneficial in
reducing repair and replacement costs, and in improving system availability.
Strategies for scheduling PM are often based on intuition and experience, though
considerable improvements in performance can be achieved by fitting mathemati-
cal models to observed data. For simple repairable systems comprising few compo-
nents or many identical components, compound renewal processes are appropriate.
This chapter reviews basic and advanced models for complex repairable systems
and demonstrates their use for determining optimal PM intervals. Computational
difficulties are addressed and practical illustrations are presented, based on sub-
systems of oil platforms and

Chapter 9: Artificial Intelligence in Maintenance

AI techniques have been used successfully in the past two decades to model and
optimise maintenance problems. This chapter reviews the application of Artificial
Intelligence (AI) in maintenance management and introduces the concept of
developing an intelligent maintenance optimisation system. The chapter starts with an
introduction to maintenance management, planning and scheduling and a brief
definition of AI and some of its techniques that have applications in maintenance
management. A review of the literature is then presented covering the applications of
AI in maintenance. The review focuses on five AI techniques, namely Knowledge
Based Systems, Case Based Reasoning, Genetic Algorithms, Neural Networks and
Fuzzy Logic. It also covers “hybrid” systems where two or more AI
techniques are used in an application. A discussion then follows of the development of the
prototype hybrid intelligent maintenance optimisation system (HIMOS), which was
developed to evaluate and enhance preventive maintenance (PM) routines of complex
engineering systems. The chapter ends with a discussion of future
research and concluding remarks.

Chapter 10: Maintenance of Repairable Systems

A repairable system is traditionally defined as a system which, after failing to per-
form one or more of its functions satisfactorily, can be restored to fully satisfactory
performance by any method other than replacement of the entire system. An
extended definition used in this chapter includes the possibility of additional main-
tenance actions which aim at servicing the system for better performance, referred to
as preventive maintenance (PM). The common models for the failure process of a
repairable system are renewal processes (RP) and non-homogeneous Poisson processes
(NHPP). The chapter considers several generalizations and extensions of the
basic models, for example the trend renewal process (TRP), which includes NHPP
and RP as special cases and has the property of allowing a trend in processes of
non-Poisson type. When several systems of the same kind are considered, there may
be an unobserved heterogeneity between the systems which, if overlooked, may lead
to wrong decisions. This phenomenon is considered in the framework of the
TRP. We then consider the extension of the basic models obtained by introducing
the possibility of PM using a competing risks approach. Finally, models for peri-
odically inspected systems are studied, using a combination of time-continuous and
time-discrete Markov chains.

Chapter 11: Optimal Maintenance of Multi-component Systems: A Review

This chapter gives an overview of the literature on multi-component maintenance
optimization focusing on work appearing since the 1991 survey by Cho and Parlar.
A classification scheme primarily based on the dependence between components
(stochastic, structural or economic) is introduced. Next, the papers are also classi-
fied on the basis of the planning aspect (short-term vs. long-term), the grouping of
maintenance activities (either grouping preventive or corrective maintenance, or
opportunistic grouping) and the optimization approach used (heuristic, policy
classes or exact algorithms). Finally, attention is paid to the applications of the


Chapter 12: Replacement of Capital Equipment

This chapter deals with models of replacement of capital equipment. Capital replace-
ment models may be classified as economic life models or dynamic programming
models. The former are concerned with determining the optimal lifetime of an item
of equipment taking account of costs over some planning horizon. The latter
consider replacement decisions dynamically, determining whether plant should be
retained or replaced after each period. We begin by looking at simple economic life
models. These are applied in a case study on escalator replacement. Economic life
models are then extended to consider first an inhomogeneous fleet and then second a
network system viewed as an inhomogeneous fleet with interacting items. A number
of different dynamic programming models are introduced for singular systems and
then expanded to homogeneous and inhomogeneous fleets and networks of assets.

Chapter 13: Maintenance and Production: A Review of Planning Models

This chapter gives an overview of the relation between planning of maintenance
and production. Production planning and scheduling models where failures and
maintenance aspects are taken into account are considered first. The planning of
maintenance activities is considered next, where both preventive and
corrective maintenance are discussed. Third, the planning of maintenance activities
at moments in time when the items to be maintained are not needed, or are less needed,
for production, also called opportunity maintenance, is considered. Apart from
describing the main ideas, approaches, and results a number of applications are

Chapter 14: Delay Time Modelling

This chapter presents a modelling tool that was created to model the problems of
inspection maintenance and planned maintenance interventions, namely Delay
Time Modelling (DTM). This concept provides a modelling framework readily
applicable to a wide class of actual industrial maintenance problems of assets in
general and inspection problems in particular. The delay time defines the failure
process of an asset as a two-stage process. The first stage is the normal operating
stage from new to the point that a hidden defect has been identified. The second
stage is defined as the failure delay time from the point of defect identification to
failure. It is the existence of such a failure delay time which provides the oppor-
tunity for preventive maintenance to be carried out to remove or rectify the
identified defects before failures. With appropriate modelling of the durations of
these two stages, optimal inspection intervals can be identified to optimise a
criterion function of interest. This chapter first gives an outline of the delay time
concept, then introduces two delay time inspection models, for a single component
and for a complex system respectively. The parameter estimation techniques used in
DTM are discussed. Extensions to the basic delay time model are highlighted and
a discussion of future research in DTM concludes the chapter.
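
The two-stage failure process described above can be illustrated with a small simulation. The following is only a sketch under assumptions that are not taken from the chapter: the time to defect arrival and the delay time are both assumed to be exponentially distributed, inspections are assumed perfect and instantaneous, the component is renewed after either a failure or a preventive replacement, and all function names, parameters and cost figures are hypothetical.

```python
import math
import random


def simulate_cost_rate(T, defect_rate, mean_delay, c_inspect, c_prevent, c_fail,
                       cycles=20000, seed=0):
    """Estimate the long-run cost per unit time for inspection interval T.

    Assumed model (illustrative only): exponential time to defect,
    exponential delay time, perfect inspections at T, 2T, 3T, ...
    """
    rng = random.Random(seed)
    total_cost = 0.0
    total_time = 0.0
    for _ in range(cycles):
        u = rng.expovariate(defect_rate)        # stage 1: new -> defect arises
        h = rng.expovariate(1.0 / mean_delay)   # stage 2: defect -> failure
        # First inspection epoch at or after the defect arrival.
        next_inspection = max(1, math.ceil(u / T)) * T
        if u + h < next_inspection:
            # The defect causes a failure before it can be detected.
            inspections_done = math.floor((u + h) / T)
            total_cost += inspections_done * c_inspect + c_fail
            total_time += u + h
        else:
            # The defect is found at the inspection and removed preventively.
            inspections_done = round(next_inspection / T)
            total_cost += inspections_done * c_inspect + c_prevent
            total_time += next_inspection
    # Renewal-reward estimate: total cost divided by total elapsed time.
    return total_cost / total_time


def best_interval(candidates, *model_args):
    """Return the candidate interval with the lowest estimated cost rate."""
    return min(candidates, key=lambda T: simulate_cost_rate(T, *model_args))
```

Calling `best_interval` over a grid of candidate intervals gives a rough idea of where the cost rate is lowest under the assumed distributions; a fitted delay time model would replace the exponential assumptions with distributions estimated from data.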

Chapter 15: Maintenance Outsourcing

It is often uneconomical for businesses to carry out their own maintenance on
complex equipment. The alternative is to ‘out-source’ the maintenance function
and use an external agent, under a service contract, to carry out some or all of the
maintenance actions (preventive and corrective). This chapter develops the frame-
work needed to study decision-making for maintenance outsourcing from both the
customer (equipment owner) and service agent perspectives. The relevant literature
is reviewed, and a game theoretic approach to maintenance outsourcing and the use
of agency theory are discussed. The link between maintenance outsourcing and
extended warranties is highlighted and the scope for future research in both areas is

Chapter 16: Maintenance of Leased Equipment

For leased equipment, the lessor has to carry out the maintenance of the equipment
over the lease period. To ensure satisfactory performance and maintenance, the
lease contract has penalty terms which result in the lessor having to compensate the
lessee if the number of failures exceeds some specified number and/or the time to
rectify each failure exceeds some specified value. This implies that the lessor needs
to take into account these penalties in determining the optimal maintenance
strategy. The chapter starts with a conceptual framework to discuss the different
issues involved and then looks at models to help the lessor in developing the
optimal maintenance strategy.
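
The penalty structure described above can be written down as a simple function. This is a minimal sketch, assuming that both penalties are linear (a fixed amount per failure beyond the allowed number, and a fixed amount per unit of repair time beyond the allowed limit); the function name, parameter names and the linear form are illustrative assumptions, not terms of any actual lease contract.

```python
def lease_penalty(repair_times, max_failures, max_repair_time,
                  penalty_per_failure, penalty_per_time_unit):
    """Penalty owed by the lessor over one lease period (assumed linear form).

    repair_times: one entry per failure, giving the time taken to rectify it.
    """
    # Penalty for each failure beyond the number allowed in the contract.
    excess_failures = max(0, len(repair_times) - max_failures)
    # Penalty for the part of each repair that exceeds the allowed time.
    excess_time = sum(max(0.0, t - max_repair_time) for t in repair_times)
    return (excess_failures * penalty_per_failure
            + excess_time * penalty_per_time_unit)
```

A maintenance strategy that reduces the number of failures or shortens repairs lowers this penalty but costs more to carry out, which is the trade-off the lessor has to balance.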

Chapter 17: Computerised Maintenance Management Systems

Computerised maintenance management systems (CMMSs) are vital for the co-
ordination of all activities related to the availability, productivity and maintainability
of complex systems. Modern computational facilities have offered a dramatic scope
for improved effectiveness and efficiency in, for example, maintenance. CMMSs
have existed, in one form or another, for several decades. In this chapter, the
characteristics of CMMSs are investigated, the need for them in industry is
highlighted and their current deficiencies are identified.
A proposed model is then presented to provide a decision analysis capability
that is often missing in existing CMMSs. The effect of such a model is to contribute
towards the optimisation of the functionality and scope of CMMSs for enhanced
decision analysis support. The use of AI techniques in CMMSs is illustrated. The
features of next generation maintenance systems are finally highlighted.

Chapter 18: Risk Analysis in Maintenance

Risk analysis can be used for selection and prioritisation of maintenance activities,
and this application of risk analysis has been given increased attention in recent
years. This chapter presents and discusses the use of risk analysis for this purpose.
The chapter reviews some critical aspects of risk analysis important for the
successful implementation of such analyses in maintenance. This relates to risk
descriptions and categorisations, uncertainty assessments, risk acceptance and risk
informed decision making, as well as selection of appropriate methods and tools.
Both qualitative and quantitative approaches are covered. A detailed risk analysis
is outlined showing the effect of maintenance on risk.

Chapter 19: Maintenance Performance Measurement (MPM) System

It is important that the factors influencing the performance of the maintenance
process be identified and measured, so that they can be monitored and controlled for
improvement. In this chapter, besides an overview of performance measurement,
maintenance performance indicators, associated issues and challenges for developing
a maintenance performance measurement framework, and indicators in use in
different industries are discussed. The framework considers stakeholders, the business
environment, and multi-criteria and hierarchical needs, amongst others.

Chapter 20: Forecasting for Inventory Management of Service Parts

This chapter addresses issues pertinent to forecasting for the inventory management
of service parts. In some sectors, such as the aerospace and automotive industries, a
very wide range of service parts are held in stock, with significant implications for
availability and inventory holding. Their management is therefore an important task.
First, a number of possible approaches to classifying service parts for forecasting
and inventory management related purposes are reviewed. Second, parametric and
non-parametric approaches to forecasting service parts requirements are discussed
followed by the presentation of appropriate metrics for measuring the performance
of the inventory management system. The existing empirical evidence on various
forecasting methods is then summarised. Finally, the conclusions of this work are
presented along with the identification of some natural avenues for further research.

Chapter 21: Maintenance in the Rail Industry

The chapter presents two case studies in railway maintenance. The first case study
presents an optimisation model for preventive maintenance of a train bogie. In the model
a dynamic approach to grouping of maintenance activities is used, enabling, e.g.,
opportunity maintenance. Data from the Norwegian State Railways have been used
in the calculation example. The second case study presents a life cycle cost approach
to prioritisation of larger maintenance and renewal projects under budget constraints.

Chapter 22: Condition Monitoring of Diesel Engines

Various techniques have been widely used to monitor the condition of diesel
engines. Analysis of engine lubricant is one of the most widely used condition monitoring
techniques. In this chapter, a case study applying oil analysis to monitor
the condition of marine diesel engines is presented. The case study focuses on
analysis and modelling of oil monitoring data. The study first introduces the con-
cept of state discriminant capability of condition variables and uses it to identify
the significant condition variables, and then develops a state discriminant model to
determine the state of the monitored system based on the current observation. The
model parameters are obtained by directly minimizing the misjudgment probabil-
ity. We believe that the proposed model has a great potential to be used due to its
plausible mathematical basis and simplicity though it needs further testing with
new data.

Chapter 23: Benchmarking of the Maintenance Process at Banverket (The Swedish National Rail Administration)

For sustaining a competitive edge in the business, railway companies all over the
world are looking for ways and means to improve their maintenance performance.
Benchmarking is a very effective tool that can assist the management in their
pursuit of continuous improvement of their operation. Three different benchmarks
have been studied, based on a project benchmarking the maintenance process
across borders, another project dealing with benchmarking of maintenance
outsourcing by different track regions in Sweden, and a third project studying the level
of transparency among the European railway administrations. The chapter discusses
the pros and cons, the areas for improvement and the need for improvement of
benchmarking metrics and framework.

Chapter 24: Integrated e-Operations–e-Maintenance: Application in North Sea Offshore Assets

Ongoing developments in Norway provide a good example of how an industry-wide
re-engineering process has triggered major changes in operations and maintenance
practice of complex and high-risk assets leading towards what is termed integrated
e-operations e-maintenance. It aims towards a step-change to the conventional
operations and maintenance practices of offshore assets. Initiatives have already been
taken to exploit new methods, smart techniques, and digital technologies to enable
remote monitoring of offshore equipment condition and asset performance in land-
based onshore support facilities using large ICT networks. This has already proved to
have direct positive implications on the technical and safety integrity of assets, and
subsequently on the plant economics. This chapter shares current experience and
knowledge with reference to ongoing developments in the Norwegian oil and gas
industry. It highlights current offshore asset maintenance practice, the changing technical
and economic environment that leads towards an e-approach, the development and
implementation of integrated e-operations and e-maintenance solutions in the North Sea,
key features of the e-approach in North Sea assets, and the future challenges of becoming
fully integrated and fail-safe.

Chapter 25: Fault Detection and Identification for Longwall Machinery Using SCADA Data

In an attempt to improve equipment availability and facilitate informed, preventa-
tive maintenance, engineers may choose to implement one or more fault detection
and identification (FDI) technologies. For complex systems (systems for which
component interactions are not understood and model uncertainties are significant),
data-driven methods of FDI are often the only practicable solution. The develop-
ment of a data-driven FDI system for longwall mining equipment using SCADA
data is described here.
Significant data preprocessing was required to generate a quality example set.
Missing value estimation (MVE) techniques were required to complete the high-
dimensional stream of condition monitoring data from existing sensors. A cost
function, in combination with a linear discriminant analysis, was used to ‘align’ the
inaccurate, categorical delay records with those delays inferred by the SCADA
data. A neural network was developed to determine the state of the system as a
function of the real-time SCADA data input. Validation of this algorithm with
unseen condition monitoring data showed misclassification rates of machine faults
as low as 14.3%.

1.4 Target Audience

The unique features of the book are as follows:

1. Coverage of the different approaches to maintenance.
2. Treatment of many different aspects (scientific, technical, commercial,
management, quantitative modelling, etc.).
3. A blend of theory with practice.

As such it should appeal to both researchers and practitioners. For researchers
(from different disciplines) it should provide a starting point for new research into
different aspects of maintenance. For practitioners it should provide the concepts and
tools so that these can be used for improvements in the overall business performance.
We also hope that it will serve as a reference book for use in postgraduate programs
in maintenance.

Maintenance: An Evolutionary Perspective

Liliane Pintelon and Alejandro Parodi-Herz

2.1 Introduction
Over the last decades industrial maintenance has evolved from a non-issue into a
strategic concern. Perhaps few other management disciplines have undergone
so many changes over the last half-century. During this period, the role of
maintenance within the organization has been drastically transformed. At first
maintenance was nothing more than an inevitable part of production; now it is
an essential strategic element in accomplishing business objectives. Without a doubt,
the maintenance function is better perceived and valued in organizations. One
could consider that maintenance management is no longer viewed as an underdog
function; it is now regarded as an internal or external partner for success.
In view of fierce competition, many organizations seek to survive by
producing more, with fewer resources, in shorter periods of time. To meet these
pressing needs, physical assets take a central role. However, installations have
become highly automated and technologically very complex and, consequently,
maintenance management had to become more complex having to cope with
higher technical and business expectations. Now the maintenance manager is
confronted with very complicated and diverse technical installations operating in
an extremely demanding business context.
This chapter, while considering the fundamental elements of maintenance and
its environment, describes the evolution path of maintenance management and the
driving forces of such changes. In Section 2.2 the maintenance context is described
and its dynamic elements are briefly discussed. Section 2.3 explains how maintenance
practices have evolved over time and distinguishes different epochs.
Further, this section devotes special attention to describing a common lexicon for
maintenance actions and policies, and then focuses on the evolution of maintenance
concepts. Section 2.4 underlines how the role of the maintenance manager has been
reshaped as a consequence of the changes of the maintenance function. Finally, the
chapter concludes with Section 2.5 identifying the new challenges for maintenance.
22 L. Pintelon and A. Parodi-Herz

2.2 Maintenance in Context

To discuss the context in which maintenance management is embedded, one may
raise the question what is maintenance as such? Most authors in maintenance
management literature, one way or another, agree on defining maintenance as the
“set of activities required to keep physical assets in the desired operating condition
or to restore them to this condition”. While this defines what maintenance is about,
it may suggest that maintenance is simple, which it is not, as will be confirmed by
any maintenance practitioner. Hence “maintenance management” is needed to
ingrain maintenance practice in a complex and dynamic context. From a pragmatic
view, the key objective of maintenance management is “total asset life cycle
optimization”. In other words, maximizing the availability and reliability of the
assets and equipment to produce the desired quantity of products, with the required
quality specifications, in a timely manner. Obviously, this objective must be
attained in a cost-effective way and in accordance with environmental and safety
regulations. Figure 2.1 clearly shows that maintenance is embedded in a given
business context to which it has to contribute. What is more, it shows that the
maintenance function needs to cope with multiple forces and requirements within
and outside the walls of the organization. Beyond any doubt, the tasks of main-
tenance are complex, enclosing a blend of management, technology, operations
and logistics support elements.


[Figure 2.1: diagram; labels recoverable from the original: Legislation, Technological evolution, Society, Technology, Total asset life cycle, Operations, Logistics Support, Outsourcing Market, Information Technology, e-business]

Figure 2.1. Maintenance in context

To cope with and to coordinate the complex and changing characteristics that
constitute maintenance in the first place, a management layer is imperative.
Management is about “what to decide” and “how to decide”. In the maintenance
arena, a manager juggles with technology, operations and logistics elements that
mainly need to harmonize with production. Technology refers to the physical
assets which maintenance has to support with adequate equipment and tools.
Operations indicate the combination of service maintenance interventions with
core production activities. Finally, the logistics element supports the maintenance
activities in planning, coordinating and ultimately delivering, resources like spare
parts, personnel, tools and so forth. In one way or another, all these elements are
always present, but their intensity and interrelationships will vary from one
situation to another. For example, elevator maintenance in a hospital and
plant maintenance in the chemical process industries call for different maintenance
recipes, each tailored to the specific needs. Clearly, the choice of the structural elements
of maintenance is not independent of the environment. Besides, other factors
like the business context, society, legislation, technological evolution and the outsourcing
market will be important. Furthermore, relatively new trends, such as the e-business
context, will influence current and future maintenance management enormously.
A whole new era for maintenance is expected as communication barriers
are bridged and coordination opportunities of maintenance service become more

2.2.1 Changes in the Playing Field of Maintenance

One should expect that neither maintenance management nor its environment is
stationary. The constant changes in the field of maintenance are acknowledged to
have enabled new and innovative developments in the field of maintenance
The technological evolution in production equipment, an ongoing evolution
that started in the twentieth century, has been tremendous. At the start of the
twentieth century, installations were barely mechanized, if at all, had simple designs,
worked in stand-alone configurations and often had considerable overcapacity.
Not surprisingly, nowadays installations are highly automated and technologically
very complex. Often these installations are integrated with production lines that are
right-sized in capacity.
Installations not only became more complex, they also became more critical in
terms of reliability and availability. Redundancy is only considered for very critical
components. For example, a pump in a chemical process installation can be
considered very critical in terms of safety hazards. Furthermore, built-in equipment
characteristics such as modular design and standardization are considered in order
to reduce downtime during corrective or preventive maintenance. However, these
principles are commonly applied mainly to some newer, very expensive installations,
such as flexible manufacturing systems (FMS). Fortunately, a move towards higher
levels of standardization and modularization is beginning to
be witnessed at all levels of the installations. As life cycle optimization concepts are
commendable, it becomes mandatory that supportability and maintainability
requirements are well thought out at the early design stages.
Parallel to the technological evolution, the ever-increasing customer focus
causes even higher pressure, especially on critical installations. As customer
service in terms of time, quality and choice becomes central to production
decisions, more flexibility is required to cope with these varying needs. This calls
for well-maintained and reliable installations capable of meeting shorter and more
reliable lead-time estimates. Physical assets are ever more important for business

Maintenance does not escape from the (r)evolution in information and communication
technology (ICT), which has tremendously changed business practices. However,
we comment further on this topic in Section 2.3, by illustrating the impact on
the role of the maintenance manager as such.
Furthermore, new production and management principles such as Just-in-time
(JIT) philosophy, Lean principles, total quality management (TQM) and so forth,
have emerged. These production trends intend, by all means, to reduce waste and
remove non-value added transactions. It is not surprising that work-in-process
(WIP) inventories are one of the key issues for improvement. Clearly, WIP inven-
tories incur high costs as a consequence of the capital immobilization, expensive
floor space, etc. As processes are streamlined, WIP inventories are no
longer a buffer for problems; accordingly, asset availability and reliability are ever
more imperative. Although these principles were initially conceived for production and
manufacturing environments, they are currently also applied and translated in service
Above all, the business environment has also changed. Competition has
become fierce and worldwide due to globalization. The latter not only implies
that competitors are located all over the world, but also that decisions to move
production or service activities from a non-efficient site (e.g. due to high opera-
tions and maintenance costs) to another site are quickly taken, even if the other
location belongs to another continent. Obviously, with the advent of globalization
and intense competitive pressures, organizations are looking for every possible
source of competitive advantage. This implies that the nature of business environ-
ment has become more complex and dynamic requiring different competitive
strategies. Many companies are critically evaluating their value chain and often
decide to drastically reorganize it. This results in focusing on the core business.
Consequently outsourcing of some non-core business activities and the creation of
new partnerships and alliances are being considered by many organizations.
Not surprisingly, maintenance as a support function is no exception to
outsourcing. Yet, it may not be so simple. Outsourcing maintenance of technical
systems can become a sensitive issue if it is not handled with diligence. Technical
systems are unique and situation specific. For example, outsourcing maintenance
of utilities or elevators can be relatively straightforward, but when it comes to
production floor equipment it can be a strategic issue that has to be handled with
extreme care. These circumstances suggest that outsourcing needs to be considered
at the operational, tactical and strategic levels; see Figure 2.2.
The simplest, and also the most common, form of outsourcing is “operational
outsourcing”. At this level, a specific task is outsourced and the relationship
between supplier and customer is strictly limited to a sell-buy situation. The impact
on the internal organization of the customer is also limited. As outsourcing moves
up in the organizational pyramid the relationship between supplier and customer
changes and “tactical outsourcing” may be required. At this level of outsourcing the
customer shares management responsibility with the supplier and a simple kind of
partnership is established. The impact on the internal organization is also greater.
Finally, moving towards the organization’s top and for more critical maintenance
services, a new form of outsourcing is created, the so-called “strategic
outsourcing”. This type of outsourcing is also labelled “transformational
outsourcing” because of its impact on the customer’s internal organization. Here a
complete outsourcing is carried out: the maintenance department is cut away from
the customer and moved to the supplier. The relationship between customer and
supplier is a strong partnership: the customer has fully entrusted the supplier with
one of its strategic maintenance activities. This level of outsourcing is as yet less
common than the former ones. The rationales for whether or not to outsource
maintenance activities are complex and require a well-thought-out and structured outsourcing
process. As mentioned, maintenance outsourcing can cover a lot of alternatives.
Fortunately, besides traditional outsourcing of maintenance activities to equipment
suppliers or the use of some small local firms, there is nowadays a growing market
of medium-sized and large outsourcing firms. These firms offer a range of
consulting support, specialized services and even full service to allow strategic
outsourcing to work.

[Figure 2.2: pyramid of outsourcing levels. Relationship labels: “Transformational”, “Partnership”, “Supplier – Customer”. Service labels with examples, top to bottom: “Transformational” service (e.g. outsourcing of all maintenance, BOT, ...); Service package (e.g. MRO, utilities, facilities, ...); e.g. renovation, shutdown, ...; Specialised services (e.g. high tech equipment, piping, insulation, ...); Generic services (e.g. temporary extra capacity (painting, welding, ...)). Side labels: To think with…, To manage…, To organise…, To carry out…]

Figure 2.2. Outsourcing decision levels

Societal expectations concerning technology are also creating boundary conditions
for maintenance management. The attention paid to sustainability (3P: people,
profit, planet) is a clear sign of this. Legislation is getting more and more stringent.
This is especially important here because of its impact on occupational safety and
environmental standards.
Note that most of the above-mentioned trends for industrial installations can be
easily translated to the service sector. Think, for example, of automated warehouses
in distribution centres, hospital equipment or building utilities.

2.3 Maintenance Practices Over Time

As a consequence of the transformation of the maintenance context, the maintenance
function has also drastically evolved from a non-issue into a strategic concern (see
Figure 2.3). At first maintenance was nothing more than an inevitable part of
production; it simply was a necessary evil. Repairs and replacements were tackled
when needed and no optimization questions were raised. Later on, maintenance
came to be seen as a technical matter. This not only included optimizing
technical maintenance solutions, but also involved attention to the organization
of the maintenance work. Further on, maintenance became a full-blown function,
instead of a production sub-function. Clearly, now maintenance management has
become a complex function, encompassing technical and management skills, while
still requiring flexibility to cope with the dynamic business environment. Top
management recognizes that having a well thought out maintenance strategy
together with a careful implementation of that strategy could actually have a
significant financial impact. Nowadays, this has led to treating maintenance as a
mature partner in business strategy development and possibly at the same level as
production. In turn, these strategies formally consider establishing external
partnerships and outsourcing of the maintenance function.

[Figure 2.3: timeline from 1940 to 2000 with four successive views of maintenance: “Necessary evil”, “Technical matter”, “Profit contributor”, “Cooperative partnership”]

Figure 2.3. The maintenance function in a time perspective

The fact that maintenance has become more critical implies that a thorough
insight into the impact of maintenance interventions, or the omission of these, is
indispensable. Per se, good maintenance stands for the right allocation of resources
(personnel, spares and tools) to guarantee, by deciding on the suitable combination
of maintenance actions, a higher reliability and availability of the installations.
Furthermore, good maintenance foresees and avoids the consequences of the
failures, which are far more important than the failures as such. Bad or no main-
tenance can appear to render some savings in the short run, but sooner or later it
will be more costly due to additional unexpected failures, longer repair times,
accelerated wear, etc. Moreover, bad or no maintenance may well have a signi-
ficant impact on customer service as delivery promises may become difficult to
fulfil. Hence, a well-conceived maintenance program is mandatory to attain busi-
ness, environmental and safety requirements.
Despite the particular circumstances, if one intends to compile or judge any
maintenance programme, some elementary maintenance terms need to be unam-
biguous and handled with consistency. Yet, both in practice and in the literature a
lot of confusion exists. For example, what for some is a maintenance policy others
refer to as a maintenance action; what some consider preventive maintenance
others will refer to as predetermined or scheduled maintenance. Furthermore, some
argue that some concepts can almost be considered strategies or philosophies, and
so on. Certainly there is a lot of confusion, which perhaps is one of the breathing
characteristics of such a dynamic and young management science. The terminol-
ogy used to describe precisely some maintenance terms can almost be taken as
philosophical arguments. However, the adoption of a rather simplistic, but truly
germane classification is essential. Not intending to disregard preceding terminol-
ogies, neither to impose nor dictate a norm, we draw attention, in particular, to
three of those confusing terms: maintenance action, maintenance policy and
maintenance concept. In the remainder of this chapter the following terminology is
Maintenance Action. Basic maintenance intervention, elementary task carried out
by a technician. (What to do?)
Maintenance Policy. Rule or set of rules describing the triggering mechanism for
the different maintenance actions. (How is it triggered?)
Maintenance Concept. Set of maintenance policies and actions of various types and
the general decision structure in which these are planned and supported. (The logic
and maintenance recipe used?)

2.3.1 Maintenance Actions

Basically, as depicted in Figure 2.4, maintenance actions or interventions can be of
two types. They are either corrective maintenance (CM) or precautionary
maintenance (PM) actions.

Corrective Maintenance Actions (CM)

CM actions are repair or restore actions following a breakdown or loss of function.
These actions are “reactive” in nature; this merely implies “wait until it breaks,
then fix it!”. Corrective actions are difficult to predict as equipment failure behavior
is stochastic and breakdowns are unforeseen. Maintenance actions such as
replacement of a failed light bulb, repair of a ruptured pipeline and the repair of a
stalled motor are some examples of corrective actions.

Precautionary Maintenance Actions (PM)

PM actions can be “preventive, predictive, proactive or passive” in nature.
These types of actions are moderately more complex than the former. A whole book
could be written to describe each of them fully. Nonetheless, the
fundamental ideas aim at diminishing the failure probability of the physical asset
and/or to anticipate, or avoid if possible, the consequences if a failure occurs. Some
PM actions (preventive and predictive) are somewhat easier to plan, because they
can rely on fixed time schedules or on prediction of stochastic behaviours. How-
ever, other types of PM actions become ongoing tasks, originating from the attitude
concerning maintenance. Somehow they became part of the tacit knowledge of the
organization. Some precise examples of precautionary actions which can be
mentioned are lubrication, bi-monthly bearing replacements, inspection rounds,
vibration monitoring, oil analysis, design adjustments, etc. All these tasks are
considered to be precautionary maintenance actions; however, the underlying prin-
ciples may be different.


[Figure 2.4: diagram of three levels. Concepts: Ad hoc, Optimizing existing concept, LCC, Customized concept. Policies: reactive, preventive (T/UBM), predictive (CBM), proactive, passive. Actions: Corrective (reactive), Precautionary (predictive, preventive, proactive and passive)]

Figure 2.4. Actions, policies and concepts in maintenance¹

Although it seems a very clear-cut way of defining elementary maintenance
interventions, it still may be difficult in practice to assign some interventions to
either class. An example here is routine maintenance on medical equipment such as
a breathing device. Cleaning and sterilizing this equipment can be called
precautionary maintenance since the equipment is not defective at the moment of the
intervention. On the other hand, it is very difficult to predict when an intervention
will be needed, and this is a typical characteristic of a corrective intervention.
Furthermore, even within precautionary maintenance, it is not always simple to
classify certain actions into simple types. This is due to the changing perception of
maintenance and the fast evolution of its techniques.

Acuity of Maintenance Actions

As maintenance knowledge is enhanced and more advanced enabling technologies
become available, the perception of which maintenance action is “right” has changed
considerably over the last decades. In the 1950s almost all maintenance actions were
corrective. Maintenance per se was regarded as an annoying and unavoidable
cost, which could not be managed. Later on, in the 1960s, many companies
switched to precautionary (preventive) maintenance programs as they
recognized that some failures of mechanical components had a direct relation with
the time or number of cycles in use. This belief was mainly based on physical wear
of components or age-related fatigue characteristics. At that time, it was accepted
that preventive actions could avoid some of the breakdowns and would lead to cost
savings in the long run. The main concern was how to determine, based on
historical data, the adequate period to perform preventive maintenance. Certainly,
not enough was known about failure patterns, which, among other reasons, has
led to a whole separate branch of engineering and statistics: reliability engineering.

¹ See abbreviations list at the end of this chapter

In the late 1970s and early 1980s, equipment became in general more complex.
As a result, the superposition of the failure patterns of individual components
started to alter the failure characteristics of simpler equipment. Hence, if
there is no dominant age-related failure mode, preventive maintenance actions are
of limited use in improving the reliability of complex items. At this point, the
effectiveness of applying preventive maintenance actions started to be questioned
and was considered more carefully. A common concern about “over-maintaining”
grew rapidly. Moreover, as the ingrained belief in the benefits of preventive maintenance
was called into question, new precautionary (predictive) maintenance techniques emerged.
This meant a gradual, though not complete, switch to predictive (inspection and
condition-based) maintenance actions. Naturally, predictive maintenance was, and
still is, limited to those applications where it was both technically feasible and
economically interesting. Supportive to this trend was the fact that condition-
monitoring equipment became more accessible and cheaper. Prior to that time,
these techniques were reserved for high-risk applications such as airplanes or
nuclear power plants.
In the late 1980s and early 1990s maintenance history took a different turn
with the emergence of concurrent engineering or life cycle engineering.
Here maintenance requirements were already under consideration at earlier product
stages such as design or commissioning. As a result, instead of having to deal with
built-in characteristics, maintenance became active in setting design
requirements for installations and became partly involved in equipment selection
and development. All this led to a different type of precautionary (proactive)
maintenance, the underlying principle of which was to be proactive at earlier product
stages in order to avoid later consequences. Furthermore, as the maintenance
function was better appreciated within the organization, more attention was paid to
additional proactive maintenance actions. For example, as operators are in direct
and regular contact with the installations, they could intuitively identify and “feel”
right or wrong working conditions of the equipment. Conditions such as noise,
smell, rattle, vibration, etc., that at a given point are not really measured, represent
tacit knowledge that the organization can use to foresee, prevent or avoid failures and their
consequences in a proactive manner. These actions are typically not
performed by maintenance people themselves, but are certainly part of the
structural evolution of maintenance as a formal or informal partner within the
organization.
The last type of precautionary (passive) maintenance actions are driven by the
opportunity of other maintenance actions being planned. These maintenance
actions are precautionary since they occur prior to a failure, but are passive as they
“wait” to be scheduled depending on other, probably more critical, actions. Passive
actions are in principle low priority for the maintenance staff as, at a given moment
in time, they may not really pose a threat of functional or safety failures. However,
these actions can save significant maintenance resources as they may reduce the
number of maintenance interventions, especially when the set-up cost of maintenance
is high. For example, when maintenance actions are planned or need to be
carried out on offshore oil platforms or on windmills in remote locations, getting to
the equipment can be costly. Therefore, finding the best combination
of maintenance actions at that point in time is essential. This may involve
replacing components with significant residual life that in different circumstances
would not be replaced.
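
The trade-off in the offshore example can be made concrete with a little arithmetic. The following is only a minimal sketch: the function names, the cost figures and the simple rule (advance a replacement when the avoided set-up cost exceeds the value of the residual life given up) are illustrative assumptions, not taken from the chapter.

```python
def visit_costs(setup_cost, job_costs):
    """Compare doing each job on its own trip with doing all jobs in one trip."""
    separate = sum(setup_cost + job for job in job_costs)
    combined = setup_cost + sum(job_costs)
    return separate, combined


def worth_advancing(setup_cost, residual_life_value):
    """Assumed rule: pull a replacement forward if the avoided trip is worth
    more than the residual life that is discarded."""
    return setup_cost > residual_life_value


# Hypothetical figures for a remote installation (arbitrary currency units).
separate, combined = visit_costs(setup_cost=50_000, job_costs=[2_000, 3_000, 1_000])
print(separate, combined)               # three trips versus one combined trip
print(worth_advancing(50_000, 8_000))   # residual life is worth less than a trip
```

With a high set-up cost the combined visit dominates, which is why components with remaining life may still be replaced during the same intervention.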

2.3.2 Maintenance Policies

As new maintenance techniques become available and the economic implications
of maintenance actions are understood, a direct impact on maintenance
policies is to be expected. Several types of maintenance policies can be considered to
trigger, in one way or another, either precautionary or corrective maintenance
interventions. As described in Table 2.1, those policies are mainly failure-based
maintenance (FBM), time/use-based maintenance (TBM/UBM), condition-based
maintenance (CBM), opportunity-based maintenance (OBM), design-out
maintenance (DOM), and e-maintenance.

Table 2.1. Generic maintenance policies

Policy       Description

FBM          Maintenance (CM) is carried out only after a breakdown. In case of CFR
             behaviour and/or low breakdown costs this may be a good policy.

TBM / UBM    PM is carried out after a specified amount of time (e.g. 1 month, 1000
             working hours, etc.). CM is applied when necessary. UBM assumes that the
             failure behaviour is predictable and of the IFR type. PM is assumed to be
             cheaper than CM.

CBM          PM is carried out each time the value of a given system parameter
             (condition) exceeds a predetermined value. PM is assumed to be cheaper
             than CM. CBM is gaining popularity due to the fact that the underlying
             techniques (e.g. vibration analysis, oil spectrometry, ...) become more
             widely available and at better prices. The traditional plant inspection
             rounds with a checklist are in fact a primitive type of CBM.

OBM          For some components one often waits to maintain them until the
             “opportunity” arises when repairing some other more critical components.
             The decision whether or not OBM is suited for a given component depends
             on the expectation of its residual life, which in turn depends on
             utilization.

DOM          The focus of DOM is to improve the design in order to make maintenance
             easier (or even eliminate it). Ergonomic and technical (reliability)
             aspects are important here.

CFR = constant failure rate, IFR = increasing failure rate

For the more common maintenance policies many models have been developed
to support tuning and optimization of the policy setting. It is not our intention to
explain the fundamental differences between these models, but rather to provide an
overview of the types of policies available and why these have been developed. Much
has to do with the discussion in the previous section regarding the acuity of maintenance
actions. Policy setting, and the understanding of its
efficiency and effectiveness, therefore continues to be fine-tuned, as in any other management
science. We refer readers who are particularly interested in the underlying principles
and types of models to McCall (1965), Geraerds (1972), Valdez-Flores and
Feldman (1989), Cho and Parlar (1991), Pintelon and Gelders (1992), Dekker (1996),
Dekker and Scarf (1998) and Wang (2002) for a full overview of the state of the art.
The whole evolution of maintenance was based not solely on technical but
rather on techno-economic considerations. FBM is still applied provided the cost
of PM is equal to or higher than the cost of CM. FBM is also typically suitable in
case of random failure behaviour, with a constant failure rate, as TBM or UBM are
then not able to reduce the failure probability. In some cases, if there exists a
measurable condition that can signal the probability of a failure, CBM can
also be feasible. Finally, an FBM policy is also applied for installations where frequent
PM is impracticable and expensive, as can be the case for glass ovens.
Either TBM or UBM is applied if the CM cost is higher than the PM cost, or if it is
necessary because of criticality due to bottleneck installations or
safety hazards. Also in case of an increasing failure rate, for example due to
wear-out phenomena, TBM and UBM policies are appropriate.
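
Why TBM/UBM pays off under an increasing failure rate but not under a constant one can be illustrated numerically. The sketch below uses the classic age-replacement cost-rate model with a Weibull lifetime; this model, the parameter values and the function names are assumptions introduced here for illustration and are not part of the chapter. The cost rate is the expected cost per renewal cycle divided by the expected cycle length, where a cycle ends with CM at failure or with PM at the chosen interval, whichever comes first.

```python
import math


def reliability(t, shape, scale):
    """Weibull survival probability R(t); shape = 1 gives a constant failure rate."""
    return math.exp(-((t / scale) ** shape))


def cost_rate(interval, shape, scale, cost_pm, cost_cm, steps=2000):
    """Expected cost per unit time when replacing at failure (CM) or at age
    `interval` (PM), whichever comes first."""
    h = interval / steps
    expected_cycle_length = 0.0
    previous = 1.0  # R(0)
    for i in range(1, steps + 1):
        current = reliability(i * h, shape, scale)
        expected_cycle_length += 0.5 * (previous + current) * h  # trapezoid rule
        previous = current
    survive = previous  # R(interval)
    expected_cycle_cost = cost_pm * survive + cost_cm * (1.0 - survive)
    return expected_cycle_cost / expected_cycle_length


def best_interval(candidates, shape, scale, cost_pm, cost_cm):
    """Pick the candidate PM interval with the lowest cost rate."""
    return min(candidates, key=lambda t: cost_rate(t, shape, scale, cost_pm, cost_cm))


candidates = range(100, 5001, 100)  # hypothetical PM intervals in operating hours
# Constant failure rate: the longest interval wins, i.e. PM does not pay off.
print(best_interval(candidates, shape=1.0, scale=1000, cost_pm=1, cost_cm=10))
# Increasing failure rate with CM ten times dearer than PM: a finite interval wins.
print(best_interval(candidates, shape=3.0, scale=1000, cost_pm=1, cost_cm=10))
```

Under these assumptions the constant-failure-rate case pushes the PM interval to the edge of the grid, which amounts to running to failure (FBM), whereas the wear-out case has an interior optimum.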
Originally, CBM was mainly applied in those situations where the investment in
condition-monitoring equipment was justified because of high risks, as in aviation
or nuclear power generation. Currently, CBM is beginning to be generally
accepted for maintaining all types of installations. Increasingly this is becoming common
practice in process industries. In some cases, however, technical feasibility is still a
hurdle to overcome. Another reason why CBM attracts the attention of practitioners
is the potential savings in spare parts replacements thanks to accurate and
timely forecasts of demand. In turn, this may enable better spare parts management
through coordinated logistics support.
Finding and applying a suitable CBM technique is not always easy. For example,
the analysis of the output of some measurement equipment, such as advanced
vibration monitoring equipment, requires a lot of experience and is often work for
experts. But there are also simpler techniques such as infrared measuring and oil
analysis suitable in other contexts. At the other extreme, predictive techniques can be
rather simple, as is the case with checklists. Although a fairly low-level activity, these
checklists, together with human senses (visual inspections, detection of “strange”
noises in rotating equipment, etc.) can detect a lot of potential problems and initiate
PM actions before the situation deteriorates to a breakdown.
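
The CBM trigger itself is simple: PM is initiated once a monitored parameter exceeds a predetermined value. A minimal sketch follows, in which the vibration readings, the threshold and the function name are hypothetical.

```python
def first_alarm(readings, threshold):
    """Return the index of the first reading above the threshold, or None."""
    for index, value in enumerate(readings):
        if value > threshold:
            return index
    return None


# Hypothetical weekly vibration levels (mm/s) of a rotating machine.
weekly_vibration = [2.1, 2.3, 2.8, 4.6, 5.0]
print(first_alarm(weekly_vibration, threshold=4.5))
```

In practice the difficulty lies less in the comparison than in choosing a parameter and threshold that signal a failure early enough to plan the PM action.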
At present FBM, TBM, UBM and CBM take the physical assets
which they intend to maintain as a given fact. In contrast, there are more proactive
maintenance actions and policies which, instead of considering the systems as “a
given”, look at the possible changes or safety measures needed to avoid maintenance
in the first place. This proactive policy is referred to as DOM. This policy implies
that maintenance is proactively involved at earlier stages of the product life cycle to
solve potential maintenance-related problems. Ideally, DOM policies intend to completely avoid
maintenance throughout the operating life of installations, though this may not be
realistic. This leads one to consider a diverse set of maintenance requirements at the
early stages of equipment design. As a consequence, equipment modifications are
geared either at increasing reliability by raising the mean-time-between-failures
(MTBF) or at increasing maintainability by decreasing the mean-time-to-repair
(MTTR). DOM per se aims to improve equipment availability and safety. Some
equipment modifications may merely require ergonomic considerations to reduce
MTTR; others may need totally new designs. Often DOM projects are combined
with efforts to increase occupational safety or production capacity, such as
set-up reduction programs.
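
The link between MTBF, MTTR and availability can be shown with the usual steady-state formula, availability = MTBF / (MTBF + MTTR). The chapter does not state this formula explicitly, so it is added here as a standard assumption, and the figures below are hypothetical.

```python
def availability(mtbf, mttr):
    """Steady-state availability: long-run fraction of time the equipment is up."""
    return mtbf / (mtbf + mttr)


baseline = availability(mtbf=200.0, mttr=10.0)
more_reliable = availability(mtbf=400.0, mttr=10.0)     # redesign doubles MTBF
more_maintainable = availability(mtbf=200.0, mttr=5.0)  # redesign halves MTTR
print(baseline, more_reliable, more_maintainable)
```

Under this formula, doubling MTBF and halving MTTR give the same availability gain, so the choice between the two DOM routes comes down to which modification is cheaper and safer to realize.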
A rather passive, but nonetheless important, maintenance policy that needs to
be mentioned is OBM. Typically OBM is applied for non-critical components with
a relatively long lifetime. For these components no separate maintenance programs
are scheduled; maintenance happens if an opportunity arises due to a maintenance
intervention for another component of that machine.
More recently, in the mid-1990s, with the emergence of the Internet as an
enabling technology and the growth of e-business as the standard for business
communication, e-maintenance also appeared on the radar of maintenance policies.
Rather than a policy, e-maintenance can also be considered a means or enabler for
some, if not all, of the previous policies. However, it is more than just an acronym; it
is a step towards fully integrated maintenance techniques without the boundaries
of place. It is in fact a maintenance policy in its own right that can support other
policies. In particular, academics and practitioners watch with anticipation the
great impact it may have on CBM. Conditions measured on site can be remotely
monitored, opening entirely new dimensions and opportunities for maintenance
services. E-maintenance has therefore captured much attention from maintenance
researchers, given its great impact on business practice. An example of this evolution
is telemaintenance, which allows installations to be diagnosed and
limited types of repair to be performed from a remote location using ICT and sophisticated control
and knowledge tools.

2.3.3 Maintenance Concepts

The idea of an “optimized” maintenance program suggests that an adequate mix of
maintenance actions and policies needs to be selected and fine-tuned in order to
improve uptime, extend the total life cycle of physical assets and assure safe
working conditions, while bearing in mind limited maintenance budgets and
environmental legislation. This is not straightforward, and may
require a holistic view. Therefore, a “maintenance concept” for each installation is
necessary to plan, control and improve the various maintenance actions and
policies applied. A maintenance concept may in the long term even become a
philosophy, tenet or attitude towards performing maintenance. In some cases advanced
maintenance concepts are almost considered strategies in their own right. What is certain is
that maintenance concepts determine the business philosophy concerning maintenance,
and that they are needed to manage the complexity of maintenance per se.
In practice, it is clear that more and more companies are spending time and effort
determining the right maintenance concept.
As a matter of fact, maintenance concepts need to be formulated considering the
physical characteristics and the context within which installations operate. Not
surprisingly, as system complexity is increasing and maintenance requirements are
becoming more complex, maintenance concepts will require different levels of
complexity. The literature provides us with various concepts that have been developed
through a combination of theoretical insights and practical experiences. Choosing
and implementing the best concept in a given context is hard. To the question “what
concept is best for us?”, no short and straightforward answer exists. The right
answer to the question is determined by the context, with its complex interaction of
technology, business, organization, and so forth. Designing and implementing a
good concept will take time and effort. Many companies establish teams with
members from different areas (engineering, production, maintenance, ...) to
accomplish this difficult task. On the market, many consultants offer their services
to assist in this process. This outside help may be very useful to get started and to
obtain a better insight into one’s own situation. However, it is useful to note that many
consultants have “their” concept (e.g. RCM) that they are used to implementing, which
may bias their judgment on what concept is “right”. Some outside
guidance can thus be useful, but in order to have a good concept that fits all of the
company’s needs, it should be built by in-house people, using all the knowledge
available.
Several times in this chapter, it has been suggested that, alongside increasing
system complexity, maintenance has also evolved over time. This has led to three
generations of maintenance concepts with their respective transition points. In the
following paragraphs an overview is offered, which is also summarized in Table 2.2.
In the past, equipment was generally much simpler; hence the need for
maintenance decision support was moderate. For truly simple systems, even a
single maintenance policy may possibly be considered a concept on its own. This is
considered the simplest form, the “first generation”, of maintenance concepts.
Here, only one maintenance policy or even type of action was applied to certain
equipment. For a state-of-the-art review of this type of maintenance concept see
Wang (2002). With the advent of automation, installations became highly
mechanized, the equipment became more complex and the interdependencies
of the multi-unit systems could no longer be ignored. To maintain
such installations efficiently a specific mixture of maintenance policies and actions
was required. The need for decision structures became crucial. These circumstances
prompted, in the first instance, the concept of simple quick and dirty (Q&D)
decision diagrams. Q&D charts help to select adequate maintenance policies,
as only ‘yes’ or ‘no’ answers can be given to a series of structured but simple
questions. The authors note that even though Q&D charts lack the holistic view
required for well-conceived and sophisticated maintenance concepts, they are still
widely used in practice in specific situations thanks to their simplicity. Examples
are reported in Pintelon et al. (2000) and Waeyenbergh and Pintelon (2002).
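
A Q&D chart is in essence a small binary decision tree. The sketch below shows one possible encoding; the questions, their order and the resulting recommendations are hypothetical, loosely inspired by the policy conditions discussed in Section 2.3.2, and do not reproduce any published chart.

```python
# A node is either a policy name (leaf) or a tuple (question key, yes branch, no branch).
QUESTIONS = {
    "costly_breakdown": "Is a breakdown clearly more expensive than PM, or hazardous?",
    "condition_measurable": "Is there a measurable condition that signals failure?",
    "wear_out": "Does the failure rate increase with age or usage?",
    "redesign_affordable": "Can the failure be designed out at acceptable cost?",
}

CHART = (
    "costly_breakdown",
    (
        "condition_measurable",
        "CBM",
        ("wear_out", "TBM/UBM", ("redesign_affordable", "DOM", "FBM")),
    ),
    "FBM",
)


def recommend(node, answers):
    """Follow yes/no answers through the chart until a policy is reached."""
    while isinstance(node, tuple):
        key, yes_branch, no_branch = node
        node = yes_branch if answers[key] else no_branch
    return node


answers = {"costly_breakdown": True, "condition_measurable": False, "wear_out": True}
print(QUESTIONS["wear_out"], "->", recommend(CHART, answers))
```

The strict yes/no format keeps the chart quick and consistent across installations, and it is also the source of the roughness the authors point out.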
Eventually, more advanced maintenance concepts were called for, as the complexity of
maintenance decisions increased. As a result, in the last 40 years a vast range of
maintenance concepts has been extensively documented in the literature. This group of
concepts is considered the “second generation” of maintenance concepts and
provides a pool of knowledge for maintenance practitioners and researchers.
Typical examples, and perhaps the most important ones, are total productive
maintenance (TPM), reliability-centred maintenance (RCM) and life cycle costing
(LCC) approaches.

Table 2.2. Description of the maintenance concepts generations

Generation  Concept     Description                        Main strengths             Main weaknesses

1st         Ad hoc      Implementing FBM and UBM           Simple                     Ad hoc decisions
                        policies; rarely CBM, DOM, ...

1st → 2nd   Q&D         Easy-to-use decision chart. It     Consistent, allows for     Rough questions
                        helps to decide on the “right”     priorities                 and answers
                        maintenance policy

2nd         LCC         Detailed cost breakdown over the   Sound basic philosophy     Resource and data
                        equipment’s lifetime, helping to                              intensive
                        plan the maintenance logistics

            TPM         Approach with an overall view on   Considers human/technical  Time-consuming
                        maintenance and production.        aspects, fits in kaizen    implementation
                        Especially successful in the       approach. Extensive tool
                        manufacturing industry             box

            RCM         Structured approach focused on     Powerful approach, step-   Resource intensive
                        reliability. Initially developed   by-step procedure
                        for high tech/high risk
                        environment

2nd → 3rd   RCM-based   Approaches focused on remediating  Improved performance       Sometimes an
                        some of the perceived RCM          through e.g. use of sound  oversimplification
                        shortcomings.                      statistical analysis
                        Example: streamlined RCM, ...

3rd         Customized  In-house developed; cherry-        Exploiting the company’s   Ensuring
                        picking from existing concepts.    strengths and considering  consistency and
                        Examples: CIBOCOF, VDM             the specific business      quality in the
                                                                                      concept developed

All these concepts, like many others, enjoy several advantages and suffer from
specific shortcomings. Correspondingly, new maintenance concepts are developed,
old ones are updated and methodologies to design customized maintenance
concepts are created. These concepts enjoy a lot of interest in their original form
and also give rise to many derived concepts, for example streamlined RCM, derived from
RCM. One may consider that customized maintenance concepts constitute the
“third generation” of this evolution. They have emerged fundamentally because it is
very difficult to claim a “one size fits all” concept in the complex and still constantly
changing world of maintenance. They are inspired by the former concepts while
trying to avoid previously experienced drawbacks. One way or
another, customized maintenance concepts mainly consist of a “cherry picking” of
useful techniques and ideas applied in other maintenance concepts. This important,
but relatively new, type of concept is expected to grow in importance both in practice and
among academics. Concepts that belong to this generation are, for example, value
driven maintenance (VDM) and CIBOCOF, which was developed at the Centre of
Industrial Management (CIB), K.U. Leuven, Belgium. Additionally, in-house
maintenance concepts, mostly developed in organizations with fairly high maintenance
maturity, also belong to this category of concepts. One example was implemented
in a petrochemical company that developed a customised concept, which
basically followed the RCM logic. However, by extending the RCM analysis
steps and introducing risk-based inspections (RBI), a more focused and better-conceived
maintenance plan could be developed. Moreover, the company borrowed
some elements from TPM and incorporated these in its maintenance
concept. For example, multi-skilled training programmes were implemented and
special tool kits were designed for a number of maintenance jobs using TPM principles.
Even before the third generation of maintenance concepts emerged,
customized concepts were perceived as necessary. In the literature, an intermediate step is
recognized that bridges the second and third generations; in this step, maintenance concepts such as
business-centred maintenance (BCM) and risk based centred maintenance (RBCM)
were developed. These concepts are mainly RCM-related and still widely applied
in many organizations. However, a slow but steady movement towards more
customized maintenance concepts is expected in the near future, as the maintenance
function matures.
Next, a straightforward description of the most important concepts is presented
and important references are provided for the interested reader.

Quick & Dirty Decision Charts (Q&D)

A Q&D decision chart is a decision diagram with questions on several aspects,
including failure patterns, repair behaviour of the equipment, business context,
maintenance capabilities, cost structure, etc. Answering the questions for a given
installation, the user proceeds through the branches of the diagram. The process
stops with the recommendation of the most appropriate policy for the specific
installation. The Q&D approach allows for a relatively quick determination of the
most advantageous maintenance policy. It ensures consistent decision making for
all installations. Although some Q&D decision charts are available from literature
(e.g. Pintelon et al. 2000), most companies adopting this approach prefer to draw
up their own charts, which incorporate their experience and knowledge in the
decision process. This can be done in several ways, for instance by
defining specific questions, adding or deleting maintenance policies, establishing a
preferred sequence in which the different policies should be considered, etc. This
approach, however, has the drawback of being rough (dirty). The questions are
usually put in the basic yes/no format, limiting the answering possibilities. Moreover,
answering the questions is usually done on a subjective basis; for example, the
question whether a given action or policy is feasible is answered based on experience
rather than on a sound feasibility study.

Life Cycle Costing (LCC) Approaches

LCC originated in the late 1960s and is now experiencing a revival. The basic principle of
LCC is sometimes summarised by “it is unwise to pay too much, but is foolish to
spend too little”. This refers to the two main underlying ideas of LCC. The first
concerns the cost iceberg structure presented by Blanchard (1992), by whom LCC
was revived. In essence, he proposes that when considering maintenance or equipment
purchasing alternatives, one should not be limited to what can momentarily be
seen, “the top of the iceberg”, such as direct maintenance costs (material, labour,
etc.) or the purchase price. The indirect long-run costs, such as operational
expenses, training costs, spares inventory costs, etc., are at least of the same order
of magnitude. The second refers to the principle that the further one gets in the
design or construction cycle of equipment, the more costly it will be to make
modifications (e.g. DOM). Maintenance should be taken into account from the
very first moment of designing a machine or system. LCC is a methodology for
calculating or estimating the total cost of a system during the entire course of its
life. This LCC approach implies a synthesis of costing analysis and engineering
design principles that must satisfy life cycle requirements at minimum cost. In turn,
design decisions are based on total cost of ownership (TCO) principles.
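
The iceberg idea can be illustrated by comparing two purchasing alternatives on total cost rather than purchase price. This is a minimal sketch: the function name, the cost figures and the optional discount rate are assumptions made for illustration, since the chapter does not prescribe a particular costing formula.

```python
def life_cycle_cost(purchase_price, yearly_costs, discount_rate=0.0):
    """Purchase price plus (optionally discounted) yearly operating,
    maintenance and support costs over the equipment's life."""
    return purchase_price + sum(
        cost / (1.0 + discount_rate) ** year
        for year, cost in enumerate(yearly_costs, start=1)
    )


# Hypothetical alternatives over a ten-year life (arbitrary currency units).
cheap_to_buy = life_cycle_cost(100_000, [30_000] * 10)
cheap_to_own = life_cycle_cost(150_000, [20_000] * 10)
print(cheap_to_buy, cheap_to_own)
```

In this example the alternative with the higher purchase price has the lower life cycle cost, because the recurring costs below the waterline outweigh the initial difference.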
In the literature, several LCC approaches can be distinguished. Among the more
important ones are Terotechnology, Integrated Logistic Support/Logistics Support
Analysis (ILS/LSA) and Capital asset management. During the 1970s, the Terotech-
nology concept originated in the UK and was the first formal attempt towards LCC
(Parkes 1970). It describes a total view of maintenance management that combines
management, technology, logistical support and financial control for industrial
systems. Terotechnology is concerned with the specification and design for reliability
and maintainability of physical assets. The application of Terotechnology also takes
into account the processes of installation, commissioning, operation, maintenance,
modification and replacement. Decisions are influenced by feedback of information
on design, performance and cost, throughout the life cycle of a project. Although
generally accepted as very useful, it was not until fairly recently that terotechnology
or similar LCC approaches were adopted by large-scale industry. This was largely due to the
developments in ICT that made LCC easier.
In the 1980s a different LCC approach, integrated logistic support/logistics
support analysis (ILS/LSA), originated in military logistics support. Maintenance
is regarded as an important issue within integral logistical support. ILS comprises
the spectrum of all activities related to the logistical support of a system during its entire life
cycle. These logistical support activities refer to maintenance concept development,
the spare parts provisioning, the technical information, the maintenance crew, the
training programs, etc. The goal of ILS may be summarized as achieving minimum
life cycle costs. Furthermore, LSA is an iterative analytical process to identify and
evaluate the logistic support for a new system. LSA constitutes the integration and
application of various techniques and methods to ensure that supportability requirements
are considered in the system design process. Finally, capital asset management,
an LCC approach with a real concern for the financial performance of assets, was
developed. Capital asset management provides information to make the financial and
operational decisions that optimize equipment performance, from deployment
through operations, maintenance and retirement. The key focus is not technical, but
financial. Asset management aims at maximizing the return on investment (ROI) in
capital assets so that they last longer, perform better and cost less to maintain.

Total Productive Maintenance (TPM)

TPM (Takahashi and Takeshi 1990) is much more than just a concept; it is actually
even considered a maintenance philosophy, which derives the greater part of its
substance from a variety of non-Japanese management structures and practices,
which were adapted by the Japanese to fit their culture. TPM involves total
participation, at all levels of the organization. It aims at maximizing equipment
effectiveness and establishing a thorough system of preventive maintenance. TPM
fits entirely with the TQM philosophy and the JIT approach. The latter makes sure
that problems of various kinds (material related, breakdown, training related, ...)
are tackled and solved one by one, instead of camouflaging them by using large
buffer stocks as was the case with MRP approaches. The TPM toolbox consists of
various techniques, some of which are universal ones such as Six Sigma, Pareto or
ABC analysis, Ishikawa or fishbone diagrams, etc. Other concepts and techniques
such as SMED, poka-yoke, jidoka, OEE and the 5S are specific to the TPM
philosophy. The last two are of extreme importance and worth explaining
further. The overall equipment effectiveness (OEE) is a powerful tool to measure
the effective use of production capacity. The strength of the concept is the integration
of production, maintenance and quality issues into what is called the “six big
losses” of useful capacity. Figure 2.5 illustrates this concept. On the other hand, the
5S form one of the basic principles of TPM: Seiri (or sorting out), Seiton (or
systematic arrangement), Seiso (or spick and span), Seiketsu (or standardizing) and
Shitsuke (or self-discipline).
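
The OEE time breakdown of Figure 2.5 can be turned into a small calculation. The sketch below assumes the widely used definition OEE = availability × performance × quality, with the six big losses grouped in pairs into downtime, speed and quality losses; this grouping, the function name and the figures are assumptions, as the chapter itself gives no formula.

```python
def oee(loading_time, downtime, speed_losses, quality_losses):
    """Derive OEE and its three factors from loading time and the time lost
    to downtime, reduced speed/stoppages and defects/reduced yield."""
    operating_time = loading_time - downtime
    net_operating_time = operating_time - speed_losses
    valuable_operating_time = net_operating_time - quality_losses
    availability = operating_time / loading_time
    performance = net_operating_time / operating_time
    quality = valuable_operating_time / net_operating_time
    return {
        "availability": availability,
        "performance": performance,
        "quality": quality,
        "oee": availability * performance * quality,
    }


# Hypothetical shift: 100 h loading time, 20 h down, 16 h speed and 16 h quality losses.
print(oee(loading_time=100.0, downtime=20.0, speed_losses=16.0, quality_losses=16.0))
```

Because the three factors multiply, the OEE equals the valuable operating time divided by the loading time, so moderate losses in each category combine into a substantial loss of useful capacity.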

[Figure 2.5 (image not reproduced). Time breakdown: total time, loading time, operating time, net operating time and valuable operating time. Losses labelled in the figure: planning delays, planned maintenance, set-up and adjustment, stoppages, reduced speed, process defects and reduced yields.]

Figure 2.5. The “big six losses” of overall equipment effectiveness

Reliability Centered Maintenance (RCM)

RCM originated in the 1960s in the North American aviation industry. Later on it
was adopted by military aviation, and afterwards it was implemented only at high-risk
industrial plants such as nuclear power plants. Now it can be found in industry
at large. Well known are the books by Nowlan and Heap (1978), Anderson and
Neri (1990) and Moubray (1997), who contributed to the adoption of RCM by industry.
Note that today many versions of RCM are around, streamlined RCM being
one of the more popular ones. However, the Society of Automotive Engineers
(SAE) holds the RCM definition that is generally accepted. SAE puts forward the
following seven basic questions to be answered by any RCM implementation; if any of
these is omitted, the method is incorrectly referred to as RCM. To answer
these seven questions a clear step-by-step procedure exists and decision charts and
forms are available:
• What are the functions and associated performance standards of the asset in its
present operating context?
• How can it fail to fulfil its functions? (functional failures)
• What causes each failure? (failure modes)
• What happens when each failure occurs? (failure effects)
• In what way does each failure matter? (failure consequences)
• What should be done to predict or prevent each failure? (proactive tasks and
task intervals)
• What should be done if a suitable proactive task cannot be found? (default
actions)
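
The seven questions lend themselves to a simple record structure with a completeness check. The sketch below is hypothetical: the class, the field names and the rule that either a proactive task or a default action must be recorded are one interpretation of the list above, not part of the SAE definition, and the example entries are invented.

```python
from dataclasses import dataclass, fields


@dataclass
class RcmAnalysisRow:
    function: str = ""              # question 1
    functional_failure: str = ""    # question 2
    failure_mode: str = ""          # question 3
    failure_effect: str = ""        # question 4
    failure_consequence: str = ""   # question 5
    proactive_task: str = ""        # question 6
    default_action: str = ""        # question 7


def open_questions(row):
    """List the questions that are still unanswered for this row."""
    missing = [f.name for f in fields(row)[:5] if not getattr(row, f.name).strip()]
    if not row.proactive_task.strip() and not row.default_action.strip():
        missing.append("proactive_task or default_action")
    return missing


row = RcmAnalysisRow(
    function="Pump 800 l/min of cooling water",
    functional_failure="Delivers less than 800 l/min",
    failure_mode="Impeller wear",
    failure_effect="Process temperature rises",
    failure_consequence="Production loss",
)
print(open_questions(row))
```

A check of this kind only guards against skipped questions; it says nothing about the quality of the answers, which depends on the data and expertise behind the analysis.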
RCM is undeniably a valuable maintenance concept. It takes into account
system functionality, and not just the equipment itself. The focus is on reliability.
Safety and environmental integrity are considered to be more important than cost.
Applying RCM helps to increase the asset’s lifetime and establish a more efficient
and effective maintenance. Its structured approach fits in the knowledge manage-
ment philosophy: reduced human error, more and better historical data and analy-
sis, exploitation of expert knowledge and so forth.
RCM is popular and many RCM implementations have started during the last
decade. Although RCM offers many benefits, there are also drawbacks. From the
conceptual point of view there are some weak points. For instance, the
original RCM does not offer a task packaging feature and thus does not automatically
offer a workable maintenance plan, and the standard decision
charts and forms offered are helpful but also far from perfect. A serious remark,
mainly from the academic side, is about the scientific basis of RCM: the FMEA
analysis, which is the heart of the RCM analysis, is often done on a rather ad hoc
basis. Often available statistical data are insufficient or inaccurate, there is a lack of
insight in the equipment degradation process (failure mechanisms) and the physical
environment (e.g. corrosive or dusty environment) is ignored. The balance between
valuable experience and equally valuable, objective statistical evidence is often
absent. Many companies call in the (expensive) help of consultants to implement
RCM; some of these consultants however are not capable of offering the help
wanted and this – in combination with the lack of in-house experience with RCM –
discredits this methodology. RCM is in fact an on-going process, which often
causes reluctance to engage in a RCM project. RCM is undoubtedly a very
resource consuming process, which also makes it difficult to apply RCM to all

RCM-Related Concepts

RCM as such has proven to be a very valuable concept, focussing on reliability and
paying attention to safety and environment. Its structured approach ensures asset
sustainability. However, there are some drawbacks that should be kept in mind
and, if possible, remedied. In the literature one can find many RCM-related con-
cepts such as Gits, Coetzee, BCM, RBCM, streamlined RCM, and so forth. All of
them adopt RCM principles with the intention of solving some of its shortcomings.
This group of concepts constitutes the bridging step to the third generation of
maintenance concepts.
Gits (1984) developed an RCM-like maintenance concept. The main difference
with the original RCM is the fact that the methodology delivers a workable main-
tenance plan. The focus of the concept is on technical and organizational aspects,
rather than on economic considerations. This three-phase approach establishes the
maintenance plan by quantifying and clustering basic maintenance rules. Those
rules are harmonised in operational entities that describe what exactly must be
done. Later on, Jones (1995) put forward risk based reliability centred maintenance
(RBCM), a new variant of basic RCM. Basically, RBCM can be described as RCM
with a strong statistical background. This tackles the drawback of the ad hoc
FMEA of the traditional RCM approach. Risk based inspections (RBI) are one of
the core concepts here. The RBI methodology enables the assessment of the
likelihood and potential consequences of pressure equipment failures. RBI
provides companies with the opportunity to prioritize equipment inspections and
optimize the inspection methods, frequencies and resources. Furthermore, RBI
helps to develop specific equipment inspection plans and enables the
implementation of RCM as such. This results in improved safety, lower failure
risks, fewer forced shutdowns, and reduced operational costs. The risk-based
approach requires a systematic and integrated use of expertise from the different
disciplines that affect plant integrity. These include design, materials selection,
operating parameters and scenarios, and understanding of the current and future
degradation mechanisms and of the risks involved. So far, all preceding
RCM-inspired concepts aimed at remedying technical drawbacks of RCM by
converting them into workable solutions.
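The prioritization idea behind RBI can be sketched in a few lines of Python. The text does not say how likelihood and consequence are combined; the sketch assumes the common convention of multiplying two ordinal scores, and the 1 to 5 scales, the equipment tags and the scores themselves are hypothetical.

```python
from typing import List, NamedTuple


class Equipment(NamedTuple):
    tag: str
    likelihood: int    # assumed scale: 1 (rare) to 5 (frequent)
    consequence: int   # assumed scale: 1 (minor) to 5 (catastrophic)


def risk_score(item: Equipment) -> int:
    """Risk as likelihood times consequence (an assumed convention)."""
    return item.likelihood * item.consequence


def prioritise(items: List[Equipment]) -> List[Equipment]:
    """Order equipment for inspection, highest risk first."""
    return sorted(items, key=risk_score, reverse=True)


# Hypothetical pressure equipment with illustrative scores.
fleet = [
    Equipment("V-101", likelihood=2, consequence=5),
    Equipment("P-201", likelihood=4, consequence=2),
    Equipment("E-301", likelihood=3, consequence=4),
    Equipment("T-401", likelihood=1, consequence=1),
]

inspection_order = [item.tag for item in prioritise(fleet)]
```

In practice the scores would come from the multidisciplinary assessment described above rather than from a fixed table.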
It was not until Kelly (1997), with his business-centred maintenance (BCM), a
full-fledged concept for determining a detailed maintenance plan, that the business
as such became the focal point. Kelly emphasised the importance of identifying,
mapping and auditing the maintenance function. The BCM concept also pays
attention to the necessary administrative support. Kelly calls his approach a
bottom-up/top-down (BUTD) approach. The first step is top-down: starting from
the business context, the exact objectives for maintenance are outlined,
considering all corporate levels. The second step is bottom-up: it aims at
establishing a life maintenance plan for each item of equipment. In a third and
last step, all item life plans are fitted into a maintenance strategy. Applying BCM
thus results in a detailed maintenance schedule, ready for use.
RCM implementation is complex, time-consuming and not straightforward.
Hence, it should be implemented in a controlled fashion with the full support of
all levels of the organization. Coetzee (2002) mentions that RCM is a core
methodology to ensure that the organization can achieve world-class results.
However, to
achieve this objective the traditional RCM should be enhanced. Coetzee proposes a
“new” RCM that blends concepts and related techniques from different RCM
authors. He also puts forward some innovations, like the funnelling approach, to
ensure that RCM efforts are concentrated on the most important failure modes in
the organization.
Finally, there is a vast range of so-called “streamlined RCM” concepts. These
concepts claim to be derivations of RCM. It is mainly consultants who promote
streamlined RCM as the solution for the resource-consuming character of RCM.
Although streamlining sounds attractive, it should be applied carefully in order to
keep the RCM benefits. Different streamlining approaches exist; however, very
few are acceptable as formal RCM methodologies. Based on Pintelon and Van
Puyvelde (2006), Table 2.3 provides a picture of popular streamlined RCM
approaches.

Table 2.3. Classification of streamlined RCM concepts

Retro-active approach
  Characteristics: Starts from the existing maintenance plan. Determines the
    failure mode for all maintenance tasks and implements the last RCM steps
    for these.
  Pitfalls: Quite time-consuming to find the failure modes for all tasks.
    “Functions” are detected on an ad hoc basis. It implies that the existing
    maintenance plan is good.

Generic approach
  Characteristics: Uses generic lists of failure modes, or even generic analyses
    of technical systems.
  Pitfalls: Ignores the operational context of the technical systems and the
    current maintenance practices. It assumes a standard level of analysis
    detail for all

Skipping approach
  Characteristics: Omits one or more steps. Typically, the first step (functions)
    is skipped and the analysis starts with listing the failure modes.
  Pitfalls: Omits the first and essential step of RCM, i.e. the functional
    analysis, and as such also does not allow for a sound performance standard
    setting.

Criticality approach
  Characteristics: Limits the implementation to critical functions and/or
    failures; for these a full RCM analysis is performed.
  Pitfalls: Often determines criticality on an ad hoc basis or uses criticality
    tools which are less reliable than the RCM

Troublemaker approach
  Characteristics: Carries out a full RCM analysis for critical equipment only.
    Critical equipment is defined here as bottleneck equipment, equipment which
    had a lot of maintenance problems in the past or equipment which is critical
    in terms of safety hazards.
  Pitfalls: As above, although here all RCM steps are followed, which guarantees
    a complete “picture”.

Customized Maintenance Concepts

The value driven maintenance (VDM) methodology proposed by Haarman and
Delahay (2004) builds a bridge between traditional maintenance philosophies and
the shareholders’ value. Not only does VDM simplify the boardroom discussion, it
also shows that far from being a cost center, maintenance is actually a major
economic value within the overall business performance. It is built on established
best maintenance practices and concepts such as TPM, RCM and RBI. It shows
where the added-value of maintenance lies and how an organisation can be best
structured to realise this value. One of the main contributions of VDM is that it
offers a common language to management and maintenance to discuss maintenance
matters. VDM identifies four value drivers in maintenance and provides concepts to
manage by those drivers. For all four value drivers, maintenance can help to in-
crease a company’s economic value. VDM makes a link between value drivers and
core competences. For each of the core competences, some managerial concepts are
Most recently, Waeyenbergh (2005) presents CIBOCOF as a framework to
develop customised maintenance concepts. CIBOCOF starts out from the idea
that although all maintenance concepts available from the literature contain
interesting ideas, none of them is suitable for implementation without further
customization. Companies have their own priorities in implementing a maintenance
concept and are likely to go for “cherry picking” from existing concepts. CIBOCOF
offers a framework to do this in an integrated and structured way. Figure 2.6
illustrates the steps that this concept structurally goes through. A particularly
interesting step is step 5, maintenance policy optimization, where a decision chart is
offered to determine which mathematical decision model can be used to optimize
the chosen policy (step 4). This decision chart guides the user through the vast
literature on the topic.


[Figure: cycle of CIBOCOF modules. Recoverable labels: Start-up; Maintenance
policy decision making (M4); Implementation & Evaluation; Continuous
improvement (M5)]

Figure 2.6. CIBOCOF logic

2.4 Maintenance Manager

As maintenance management evolved, so did the job of the maintenance manager.
Clearly maintenance management is no longer a pure technical function. Business
economics (cost-benefit considerations) and business context (how important are
the installations in question?, what are the functional requirements?, …) play an
important role. A good maintenance manager needs to have a technical background
in order to have an eye for the “big picture” and not lose sight of any aspect.
Nowadays, the decisions expected from the maintenance manager are complex and
can sometimes have far-reaching consequences. He/she is (partly) responsible for
operational, tactical and strategic aspects of the company’s maintenance
management. This involves the final responsibility for operational decisions like the
planning of the maintenance jobs and tactical decisions concerning the long-term
maintenance policy to be adopted. More recently, maintenance managers are also
consulted in strategic decisions, e.g. purchases of new installations, design choices,
personnel policy, …
The career path of today’s maintenance manager starts out from a rather technical
content, but evolves over time into more financial and strategic responsibilities. This
career path can be horizontal or vertical. It is also important that the maintenance
manager is a good communicator and people manager, as maintenance remains a
labor-intensive function. The maintenance manager needs to be able to attract and
retain highly skilled technicians. On-going training for technicians is needed to keep
track of the rapidly evolving technology. Motivation of maintenance technicians
often requires special attention. Job autonomy in maintenance is greater than in
production, instructions may be vague, immediate assessment of the quality of work
is mostly not possible, complaints are heard more often than compliments, etc.
Aspects like safety and ergonomics are an indispensable element in current
maintenance management. Besides people, materials are another important resource
for maintenance work. Maintenance material logistics mainly concerns spare parts
management and finding the optimum trade-off between high spare parts
availability and the corresponding stock investments.
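The trade-off between spare parts availability and stock investment can be made concrete with a small sketch. It assumes, purely for illustration, that demand for a part during the replenishment lead time follows a Poisson distribution; the function names, the demand rate and the unit price are hypothetical.

```python
import math


def poisson_cdf(k: int, mean: float) -> float:
    """Probability that Poisson demand with the given mean is at most k."""
    return sum(math.exp(-mean) * mean ** i / math.factorial(i)
               for i in range(k + 1))


def smallest_stock(mean_lead_time_demand: float, target: float) -> int:
    """Smallest stock level whose no-stock-out probability reaches the target."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    stock = 0
    while poisson_cdf(stock, mean_lead_time_demand) < target:
        stock += 1
    return stock


# Hypothetical part: on average 2 units demanded per lead time, 1500 per unit.
unit_price = 1500.0
for target in (0.90, 0.95, 0.99):
    level = smallest_stock(2.0, target)
    print(f"target {target:.2f}: stock {level}, investment {level * unit_price:.0f}")
```

Each step up in the availability target requires at least as much stock as the previous one, so the loop shows the investment that every additional point of availability costs.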
The evolution in maintenance management described above creates a sharp need
for decision support techniques of various kinds: statistical analysis tools for
predicting the failure behaviour of equipment, decision schemes for determining
the right maintenance concept, mathematical models to optimize the maintenance
policy parameters (e.g. PM frequency), decision criteria concerning e-maintenance,
decision aids for outsourcing decisions, etc. Table 2.4 illustrates the use of some
decision support techniques for maintenance management. These techniques are
available and have proven their usefulness for maintenance, but they are not yet
widely adopted.
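To give an idea of what optimizing a policy parameter such as the PM frequency involves, the sketch below uses the classical age replacement model: a unit is replaced preventively at age T or correctively at failure, and the long-run cost per unit time is the expected cost of a cycle divided by its expected length. The Weibull lifetime, the cost figures and the grid of candidate intervals are assumptions chosen for illustration only.

```python
import math
from typing import Iterable


def weibull_reliability(t: float, shape: float, scale: float) -> float:
    """Probability that a unit survives beyond age t (Weibull lifetime)."""
    return math.exp(-((t / scale) ** shape))


def cost_rate(interval: float, shape: float, scale: float,
              cost_preventive: float, cost_failure: float,
              steps: int = 1000) -> float:
    """Long-run cost per unit time when replacing at age `interval` or at failure."""
    h = interval / steps
    r_end = weibull_reliability(interval, shape, scale)
    # Expected cycle length: integral of R(t) over [0, interval], trapezoidal rule.
    inner = sum(weibull_reliability(i * h, shape, scale) for i in range(1, steps))
    expected_length = h * (0.5 * (1.0 + r_end) + inner)
    # Expected cycle cost: preventive if the unit survives, corrective otherwise.
    expected_cost = cost_preventive * r_end + cost_failure * (1.0 - r_end)
    return expected_cost / expected_length


def best_interval(candidates: Iterable[float], shape: float, scale: float,
                  cost_preventive: float, cost_failure: float) -> float:
    """Candidate replacement age with the lowest cost rate (simple grid search)."""
    return min(candidates,
               key=lambda t: cost_rate(t, shape, scale,
                                       cost_preventive, cost_failure))


# Hypothetical wear-out item: shape 3, scale 100 hours, a failure costs
# ten times as much as a preventive replacement.
grid = list(range(10, 301, 10))
optimum = best_interval(grid, shape=3.0, scale=100.0,
                        cost_preventive=1.0, cost_failure=10.0)
```

With a shape parameter of 1 (constant failure rate) the same search pushes the interval to the largest candidate, reflecting that preventive replacement does not pay when equipment does not wear out.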
In the 1960s most maintenance publications were very mathematically oriented
and mainly focussed on reliability. The 1970s and early 1980s publications were
more focused on maintenance policy optimization such as determination of opti-
mum preventive maintenance interval, planning of group replacements and inspec-
tion modelling. This was a step forward, although these models still often were too
focussed on mathematical tractability rather than on realistic assumptions and
hypotheses. This caused an unfortunate gap between academics and practitioners.
The former had the impression that industry and service sector were not “ready”
for their work, while the latter felt frustrated because the models were too
theoretical. Fortunately, this is changing. Academics pay more attention to the real-
life background of their subject and practitioners discover the usefulness of the
academic work. Moreover academic work gets broader and offers a more diverse
range of models and concepts, such as maintenance strategy design models,
e-maintenance concepts, service parts supply policies, and the like besides the
more traditional maintenance optimization models. With the introduction of main-
tenance software, the necessary data required for these models could be more
easily collected. There still is a big gap between practitioners and academics, but it
is already slowly closing.

Table 2.4. OR/OM techniques and their application in maintenance

Techniques               Application examples in maintenance management

Statistics               Describing failure behaviour
Reliability theory       Reliability prediction of complex systems
Markov theory            Availability studies of repairable systems
Renewal theory           Replacement decisions (group or individual)
Math programming         Maintenance policy parameter optimization
Decision theory          Decisions under uncertainty
Queueing theory          Trade-off personnel capacity - service level
Simulation               Comparison of alternative maintenance policies
Inventory control        MRO management: FMI, NMI, SMI and VSMI
Time and motion study    Estimation of maintenance intervention times
Scheduling – rostering   Daily planning of maintenance jobs
Project planning         Planning of turnaround, large renovation projects
MCDM                     Selecting the best outsourcing partner

MRO = maintenance, repair and operating supplies, FMI = fast moving items,
NMI = normal moving items, SMI = slow moving items, VSMI = very slow moving
items, MCDM = multi-criteria decision making, OR/OM = Operations Research /
Operations Management
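The first rows of the table can be illustrated with a short sketch that describes failure behaviour from a maintenance log and derives a steady-state availability. It assumes a simple two-state (up/down) model and treats MTBF as the mean operating time between failures; the log values and function names are hypothetical.

```python
from typing import List, Tuple


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


def availability_from_log(cycles: List[Tuple[float, float]]) -> Tuple[float, float, float]:
    """Return (MTBF, MTTR, availability) from (uptime, repair time) pairs."""
    mtbf = mean([up for up, _ in cycles])
    mttr = mean([down for _, down in cycles])
    return mtbf, mttr, mtbf / (mtbf + mttr)


def markov_availability(failure_rate: float, repair_rate: float) -> float:
    """Steady-state availability of a two-state Markov model."""
    return repair_rate / (failure_rate + repair_rate)


# Hypothetical log: hours in operation followed by hours under repair.
log = [(100.0, 5.0), (140.0, 7.0), (120.0, 6.0)]
mtbf, mttr, availability = availability_from_log(log)
```

With constant failure rate 1/MTBF and repair rate 1/MTTR, `markov_availability` gives the same figure as the ratio MTBF / (MTBF + MTTR).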

The help from information technology (IT) is of special interest when dis-
cussing decision support for maintenance managers. Computerized maintenance
management systems (CMMS), also called computer aided maintenance manage-
ment (CAMM), maintenance management information systems (MMIS) or even
enterprise asset management systems (EAM), nowadays offer substantial support
for the maintenance manager. These systems too have evolved over time (Table
2.5). IT of course also supports the e-maintenance applications and offers splendid
opportunities for knowledge management implementations. At the beginning of the
knowledge management hype, knowledge management was mainly aimed at fields
like R&D, innovation management, etc. Later on the potential benefits of
knowledge management were also recognized for most business functions. For
maintenance management, a knowledge management programme helps to capture
the implicit knowledge and expertise of maintenance workers and secure this
information in information systems, so making it accessible for other technicians.
The benefits of this in terms of consistency in problem solving approach and
knowledge retention are obvious. Other knowledge management applications can
be, for example, expert systems, assisting in the diagnosis of complex equipment
failures, or data mining on maintenance history records to learn about failure
causes. A knowledge management programme will also help to keep track of
individual skills and expertise and as such support personnel management over

Table 2.5. Evolution of CMMS

1st generation
  Characteristics: Mainly registration and data administration (EDP). Limited
    or no process support.
  Business IT: Low priority mainframe applications. Limited software market, a
    lot of in-house development.

2nd generation
  Characteristics: Cost control and work order management; MRO management most
    often included, ... Link with company’s financial information module.
    First MIS for maintenance.
  Business IT: Many stand-alone microcomputer applications. Dynamic, but not
    always reliable, software market.

3rd generation (1990s ...)
  Characteristics: Broader, e.g. also asset utilization and EHS module.
    External communication possible, e.g. e-MRO. Enhanced analytical
    capabilities.
  Business IT: Multimedia and web enabled features. Matured market for embedded
    (part of e.g. ERP) or BoB.

Clearly, the evolution in maintenance management offers a challenging job
environment for today’s maintenance manager. This maintenance manager needs
to be aware of “the big picture”, i.e. the business context and the maintenance
organization as a whole. Moreover, he/she needs to have a sound technological
background and be prepared to keep informed of technological evolutions. The
maintenance manager needs real management skills, to manage the resources –
personnel and materials – in an efficient and effective way, while keeping asset
utilization and asset life cycles in mind. Growing in the function of maintenance
manager will also mean acquiring new skills, e.g. in financial management. Last
but not least, today’s maintenance manager needs to be flexible, ready to face
threats and to grab opportunities in today’s dynamic business environment where
increasing globalisation, many mergers and acquisitions, growing outsourcing
markets and emerging e-maintenance technologies are part of daily life.

2.5 Conclusions and New Challenges of Maintenance

Maintenance management has undoubtedly undergone major changes during the
past decade. It has moved from being a low-profile, necessary but
difficult-to-manage problem to being regarded as a prominent business function,
an important element in business strategy. Not only practitioners have changed
their minds about maintenance; academics have as well. Maintenance nowadays is
a professional business
function and an area of intensive academic research. Efforts are aimed at ad-
vancing towards world class maintenance and providing methodologies to do so.
Pintelon et al. (2006) describe several maintenance maturity levels required to
achieve world class maintenance; these are illustrated in Figure 2.7.

Figure 2.7. Maturity levels of maintenance

Maintenance concept optimization has professionalized. Corrective and
precautionary actions are combined in different policies, from reactive to
preventive and from predictive to proactive policies. A sound insight into the
pros and cons of each of these policies is available in practice, and research
supports the selection and optimization of these policies. These policies are no
longer ad hoc, loose elements within maintenance management; they are embedded
in maintenance concepts, focussing on reliability and productivity. These concepts
ensure consistent decision making for all equipment and at the same time allow for
individualized installation maintenance concepts. Decision tools are available to
support this process.
Top management nowadays, at least in most companies, recognizes the im-
portance of maintenance as an element of their business strategy. Expectations for
maintenance are no longer formulated as “keep things running”, but are based upon
the overall business strategy. This strategy can be based on flexibility, quality and
low cost. The maintenance organization, with its structural and infrastructural
elements, is built accordingly.
The previous paragraph may give the impression that all problems for
maintenance management are already solved; this however is not the case. New
opportunities in terms of, for example, outsourcing and e-maintenance exist.
Moreover, there is a threatening gap between the top management level, where the
overall maintenance strategy is determined, and the tactical level, on which the
maintenance concepts are designed, detailed and implemented (Figure 2.8). The
gap thus lies in the alignment between the tactical and subsequent operational
phase on the one hand and the strategic phase on the other. While both aspects
are well studied, the link between the two is often not well established. This
leads to disappointment for top management as well as frustration for
maintenance managers. Research shows a similar gap. There is some — though
still not enough — research on the link between maintenance and business
strategy. The main focus of maintenance management research is still on the
tactical and operational planning. Links between the former and the latter part of
research however are still very rare. Closing this gap by linking maintenance and
business throughout all decision levels is one of the major challenges for the
future; every step taken brings us closer to real world-class maintenance.

Figure 2.8. Gap between maintenance and business strategy

2.6 List of Abbreviations

BCM: Business-centred maintenance
BoB: Best-of-breed
BUTD: Bottom-up/top-down analysis
CAMM: Computer aided maintenance management
CBM: Condition-based maintenance
CFR: Constant failure rate
CIBOCOF: Center Industrieel Beleid Onderhoudsontwikkelingsframework
CM: Corrective maintenance
CMMS: Computerized maintenance management systems
DOM: Design-out of maintenance
DSS: Decision support systems
EAM: Enterprise asset management (system)
EDP: Electronic data processing
EHS: Energy, health and safety
EUC: End user computing
FBM: Failure-based maintenance
FMEA: Failure modes and effect analysis
FMI: Fast moving items
FMS: Flexible manufacturing systems
GUI: Graphical user interface
ICT: Information communication technology
IFR: Increasing failure rate
ILS: Integrated logistics support
IT: Information technology
JIT: Just-in-time
LCC: Life-cycle costing
LSA: Logistics support analysis
MCDM: Multi-criteria decision-making
MIS: Management information systems
MMIS: Maintenance management information system
MRO: Maintenance, repair and operating supplies
MTBF: Mean-time-between-failures
MTTR: Mean-time-to-repair
NMI: Normal moving items
OBM: Opportunity-based maintenance
OEE: Overall equipment effectiveness
OM: Operations management
OR: Operations research
PM: Precautionary maintenance
Q&D: Quick & dirty decision charts
R&D: Research & development
RBCM: Risk-based reliability-centred maintenance
RBI: Risk-based inspections
RCM: Reliability-centred maintenance
ROI: Return on investment
SAE: Society of Automotive Engineers
SMED: Single minute exchange of dies
SMI: Slow moving items
TBM: Time-based maintenance
TCO: Total cost of ownership
TPM: Total productive maintenance
TQM: Total quality management
UBM: Use-based maintenance
VDM: Value-driven maintenance
VSMI: Very slow moving items
WIP: Work in progress

2.7 References
Anderson, R.T., Neri, L., (1990), Reliability Centred Maintenance: Management and
Engineering Methods, Elsevier Applied Sciences, London
Blanchard, B.S., (1992), Logistics Engineering and Management, Prentice Hall, Englewood
Cliffs, New Jersey
Cho, I.D, Parlar, M., (1991), A survey on maintenance models for multi-unit systems.
European Journal of Operational Research, 51:1–23
Coetzee, J.L., (2002), An Optimized Instrument for Designing a Maintenance Plan: A Sequel
to RCM. PhD thesis, University of Pretoria, South-Africa
Dekker, R., (1996) Applications of maintenance optimization models: A review and
analysis. Reliability Engineering and System Safety, 52(3):229–240
Dekker, R., and Scarf, P.A., (1998) On the impact of optimisation models in maintenance
decision making: the state of the art. Reliability Engineering and System Safety,
Geraerds, W.M.J., (1972), Towards a Theory of Maintenance. The English University Press.
Gits, C.W., (1984), On the Maintenance Concept for a Technical System: A Framework for
Design, PhD thesis, TU Eindhoven, The Netherlands
Haarman, M. and Delahay, G., (2004), Value Driven Maintenance – New Faith in
Maintenance, Mainnovation, Dordrecht, The Netherlands
Jones, R.B., (1995), Risk-Based Maintenance, Gulf Professional Publishing (Elsevier),
Kelly, A., (1997), Maintenance Organizations & Systems: Business-Centred Maintenance,
Butterworth-Heinemann, Oxford
McCall, J.J. (1965), Maintenance policies for stochastically failing equipment: A survey.
Management Science, 11 (5):493–524
Moubray, J., (1997), Reliability-Centred Maintenance. Second Edition. Butterworth-
Heinemann, Oxford
Nowlan, F.S., Heap, H.F., (1978), Reliability Centered Maintenance, United Airlines
Publications, San Francisco
Parkes, D. in Jardine, A.K.S., (1970), Operational Research in Maintenance, University of
Manchester Press, Manchester
Pintelon, L., Gelders, L., Van Puyvelde, F., (2000), Maintenance Management, Acco Leuven/
Pintelon, L., Gelders, L., (1992) Maintenance management decision making. European
Journal of Operational Research, 58:301–317
Pintelon, L., Pinjala, K., Vereecke, A., (2006), Evaluating the Effectiveness of Maintenance
Strategies, Journal of Quality in Maintenance Engineering (JQME), 12(1):214–229
Pintelon, L., Van Puyvelde, F., (2006), Maintenance Decision Making, Acco, Leuven,
Takahashi, Y. and Takashi, O., (1990) TPM: Total Productive Maintenance. Asian
Productivity Organization, Tokyo
Valdez-Flores, C., Feldman, R.M., (1989) A survey of preventive maintenance models for
stochastically deteriorating single-unit systems. Naval Research Logistics, 36:419–446
Waeyenbergh, G., (2005), CIBOCOF – A Framework for Industrial Maintenance Concept
Development, PhD thesis, Centre for Industrial Management – K.U.Leuven, Leuven,
Waeyenbergh, G., Pintelon, L., (2002) A framework for maintenance concept development.
International Journal of Production Economics, 77:299–313
Wang H., (2002), A survey of maintenance policies of deteriorating systems. European
Journal of Operational Research, 139:469–489

New Technologies for Maintenance

Jay Lee and Haixia Wang

3.1 Introduction
For years, maintenance has been treated as a dirty, boring and ad hoc job. It’s seen as
critical for maintaining productivity but has yet to be recognized as a key component
of revenue generation. The question most often asked is “Why do we need to main-
tain things regularly?” The answer is “To keep things as reliable as possible.” How-
ever, the question that should be asked is “How much change or degradation has
occurred since the last round of maintenance?” The answer to this question is “I
don’t know.” Today, most machine field services depend on sensor-driven manage-
ment systems that provide alerts, alarms and indicators. The moment the alarm
sounds, it’s already too late to prevent the failure. Therefore, most machine main-
tenance today is either purely reactive (fixing or replacing equipment after it fails) or
blindly proactive (assuming a certain level of performance degradation, with no input
from the machinery itself, and servicing equipment on a routine schedule whether
service is actually needed or not). Both scenarios are extremely wasteful.
Rather than reactive “fail-and-fix” maintenance, world-class companies are
moving towards “predict-and-prevent” maintenance. A maintenance scheme
referred to as condition based maintenance (CBM) was developed by
considering current degradation and its evolution. CBM methods and practices
have been continuously improved over the last decades; however, CBM is conducted
at equipment level, one piece of equipment at a time, and the developed
prognostics approaches are application or equipment specific.
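The idea of considering current degradation and its evolution can be sketched as follows: fit a trend to a monitored condition indicator and estimate when it will cross an alarm threshold. The sketch assumes a linear trend fitted by ordinary least squares, which is a simplification chosen for illustration and not a method prescribed in this chapter; the vibration readings and the threshold are hypothetical.

```python
from typing import List, Optional, Tuple


def fit_line(times: List[float], values: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares slope and intercept of values against times."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    return slope, mean_v - slope * mean_t


def remaining_time_to_threshold(times: List[float], values: List[float],
                                threshold: float) -> Optional[float]:
    """Estimated time from the last reading until the trend reaches the threshold.

    Returns None when the indicator is not increasing, because a linear trend
    then never reaches the threshold.
    """
    slope, intercept = fit_line(times, values)
    if slope <= 0:
        return None
    crossing_time = (threshold - intercept) / slope
    return max(0.0, crossing_time - times[-1])


# Hypothetical vibration readings (mm/s) taken every 10 days, alarm level 4.0 mm/s.
days = [0.0, 10.0, 20.0, 30.0]
vibration = [1.0, 1.5, 2.0, 2.5]
days_left = remaining_time_to_threshold(days, vibration, threshold=4.0)
```

A maintenance action can then be scheduled within the estimated remaining time instead of waiting for the alarm to sound.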
A holistic approach, real-time prognostics devices, and rapid implementation
environments are potential future research topics in product and system health
assessment and prognostics. With the level of integrated network systems develop-
ment in today’s global business environment, machines and factories are net-
worked, and information and decisions are synchronized in order to maximize a
company’s asset investments. This generates a critical need for a real-time remote
machinery prognostics and health management (R2M-PHM) system. The unmet
needs in maintenance can be categorized into the following:
1. Machine intelligence: intelligent monitoring, predict and prevent, and
compensation, reconfiguration for sustainability (self-maintenance).
2. Operations intelligence: prioritize, optimize, and responsive maintenance
scheduling for reconfiguration needs.
3. Synchronization intelligence: autonomous information flow from market
demand to factory asset utilization.

Based on the unmet needs in maintenance, many research and development
questions concerning next generation maintenance systems can be raised. Some of
them are the following:

1. How to adapt maintenance schedules to cope dynamically with shop-floor
2. How to feed back information and knowledge gathered in maintenance to
the designers of the process?
3. How to link maintenance policies to corporate strategy and objectives?
4. How to synchronize production scheduling based on maintenance performance?

The rest of this chapter is organized as follows. Section 2 gives a
state-of-the-art review on maintenance technologies, which includes a maintenance paradigm
overview and CBM prognostics approaches. Section 3 presents the newly de-
veloped platform of Watchdog Agent®-based real-time remote machinery prognos-
tics and health management (R2M-PHM) system, the Watchdog Agent® toolbox
method for multi-sensor performance assessment and prognostics, and real-life
industrial case studies. Section 4 summarizes new developments and discusses
future work.

3.2 State-of-the-art Reviews on Maintenance Technologies

3.2.1 Maintenance Paradigm Overview

Looking back on the development history and forecasting the development
tendency of maintenance technologies, the roadmap to excellence in maintenance
can be illustrated as in Figure 3.1.

No Maintenance

There are two kinds of situations in which no maintenance will occur:
• No way to fix it: the maintenance technique is not available for a special
application, or the maintenance technique is at too early a stage of development.
• Not worth fixing: some machines were designed to be used only once.
When compared to the maintenance cost, it may be more cost-effective just to
discard them.
Neither of the scenarios above is within the scope of the discussion here.

[Figure: machine performance and uptime across maintenance stages. Recoverable
labels: No maintenance; Reactive maintenance (fire fighting); Preventive
maintenance; Failure root causes analysis; Maintenance-free machine]

Figure 3.1. The development of maintenance technologies

Reactive Maintenance

The aim of reactive maintenance is just to “fix it after it’s broken”, since most of
the time a machine breaks down without warning and it is urgent for the
maintenance crew to put it back to work: this is also referred to as “fire-fighting”.
This fire-fighting mode of maintenance is still present in many maintenance
operations today because accurate knowledge of the equipment behavior is lacking.
Essentially, little to no maintenance is conducted and the machinery operates until a
failure occurs. At this time, appropriate personnel are contacted to assess the situation
and make the repairs as expeditiously as possible. In a situation where the damage to
equipment is not a critical factor, plenty of downtime is available, and the values of
the assets are not a concern, the fire-fighting mode may prove to be an acceptable
option. Of course, one must consider the additional cost of making repairs on an
emergency basis, since soliciting bids to obtain reasonable costs may not be
feasible in these situations. Due to market competition and environmental/safety
issues, the trend is toward adopting an organized and efficient maintenance
program as opposed to fire-fighting.

Preventive Maintenance

Preventive maintenance (PM) is an equipment maintenance strategy based on
replacing, overhauling or remanufacturing an item at fixed or adaptive intervals,
regardless of its condition at the time. These maintenance operations models can be
characterized as long term maintenance policies (Wang 2002) that do not take into
account instantaneous equipment status. Scheduled restoration tasks and scheduled
discard tasks are both examples of preventive maintenance tasks.
In preventive maintenance, breakdowns are tracked and recorded in a database,
and the information accumulated provides a base for general preventive actions.
The age-dependent PM policy can be considered as the most common maintenance
policy in which a unit’s PM times are based on the age of the unit. The basic idea
is to replace or repair a unit at its age T or failure whichever occurs first (Badia et
al., 2002; Mijailovic 2003). Commonly used equipment reliability indices such as
mean time between failure (MTBF) and mean time to repair (MTTR) are extracted
from the historical databases of equipment behavior over time. These two indices
provide a rough estimate of the time between two adjacent breakdowns and the
mean time needed to restore a system when such breakdowns happen. Although
equipment degradation processes vary from case to case, and the causes of failure
can be different as well, the information contained in MTBF and MTTR can still
be informative. Other indices can also be extracted and used, including the mean
lifetime, mean time to first failure, and mean operational life, as discussed by Pham
et al. (1997). With the introduction of minimal repair and imperfect maintenance,
various extensions and modifications to the age-dependent PM policy have been
proposed (Bruns 2002; Chen et al. 2003). Another preventive maintenance policy
that received much attention is the periodic PM policy, in which degraded
machines are repaired or replaced at fixed time intervals independent of the
equipment failures. Various modifications and enhancements to this maintenance
policy have also been proposed recently (Cavory et al. 2001).
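
As a rough illustration of how MTBF and MTTR might be extracted from a breakdown
history, the sketch below assumes a hypothetical log in which each breakdown is
recorded as a pair of times (failure and restoration), in hours since the start
of observation. The record layout, the names and the sample figures are
assumptions made for illustration; they are not taken from the cited works.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    """One recorded breakdown (hypothetical log format, times in hours)."""
    failed_at: float    # time at which the machine went down
    restored_at: float  # time at which the machine was back in service


def reliability_indices(outages, observation_hours):
    """Return (MTBF, MTTR, availability) for one machine.

    MTBF is taken here as total uptime divided by the number of failures,
    and MTTR as total downtime divided by the number of failures.
    """
    if not outages:
        raise ValueError("need at least one recorded breakdown")
    downtime = sum(o.restored_at - o.failed_at for o in outages)
    uptime = observation_hours - downtime
    count = len(outages)
    mtbf = uptime / count
    mttr = downtime / count
    availability = mtbf / (mtbf + mttr)
    return mtbf, mttr, availability


# Illustrative data only: three breakdowns in 1000 hours of observation.
log = [Outage(100.0, 104.0), Outage(350.0, 352.0), Outage(700.0, 706.0)]
mtbf, mttr, availability = reliability_indices(log, observation_hours=1000.0)
```

With these sample figures the downtime totals 12 hours, so MTTR is 4 hours and
MTBF is roughly 329 hours. Such averages say nothing about the condition of an
individual machine at a given moment, which is the limitation discussed next.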
The preventive maintenance schemes are time-based and do not consider the
current health state of the product; they are therefore inefficient and less
valuable for a customer whose individual asset is of the most concern. For the
case of helicopter gearboxes, it was found that almost half of the units were
removed for overhaul even though they were in a satisfactory operating
condition. Therefore techniques for more economical and reliable maintenance
are needed.

Predictive Maintenance

Predictive maintenance (PdM) is a right-on-time maintenance strategy. It is based on
the failure limit policy in which maintenance is performed only when the failure rate,
or other reliability indices, of a unit reaches a predetermined level. This maintenance
strategy has been implemented as condition based maintenance (CBM) in most
production systems, where certain performance indices are periodically (Barbera et
al. 1996; Chen and Trivedi 2002) or continuously monitored (Marseguerra et al.
2002). Whenever an index value crosses some predefined threshold, maintenance
actions are performed to restore the machine to its original state, or to a state where
the changed value is at a satisfactory level in comparison to the threshold.
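
The threshold rule described above can be sketched in a few lines. This is a
minimal illustration, not a published CBM algorithm: the function name, the
`consecutive` parameter (which requires several readings in a row above the
threshold so that a single noisy sample does not trigger maintenance) and the
sample readings are all assumptions.

```python
def first_alarm(readings, threshold, consecutive=1):
    """Return the index of the reading that would trigger maintenance, or None.

    `consecutive` is an illustrative assumption: that many readings in a row
    must exceed the threshold before the alarm is raised.
    """
    run = 0
    for index, value in enumerate(readings):
        run = run + 1 if value > threshold else 0
        if run >= consecutive:
            return index
    return None


# Illustrative vibration levels sampled periodically (arbitrary units).
vibration_rms = [1.1, 1.2, 1.2, 2.6, 1.3, 1.9, 2.4, 2.7, 3.0]

single_sample_alarm = first_alarm(vibration_rms, threshold=2.0)
debounced_alarm = first_alarm(vibration_rms, threshold=2.0, consecutive=2)
```

With a threshold of 2.0, the isolated spike at index 3 triggers the
single-sample rule, whereas the two-sample rule waits until index 7, when the
level has stayed above the threshold for two readings in a row.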
Predictive maintenance can be best described as a process that requires both
technology and human skills, while using a combination of all available diagnostic
and performance data, maintenance history, operator logs and design data to make
timely decisions about maintenance requirements of major/critical equipment. It is
this integration of various data, information and processes that leads to the success
of a PdM program. It analyzes the trend of measured physical parameters against
known engineering limits for the purpose of detecting, analyzing and correcting a
problem before a failure occurs. A maintenance plan is devised based on the
prediction results derived from condition based monitoring. This method can cost
more up front than PM because of the additional monitoring hardware and software
investment, cost of manning, tooling, and education that is required to establish a
PdM program. However, it provides a basis for failure diagnostics and maintenance
operations, and offers increased equipment reliability and a sufficient advance in
information to improve planning, thereby reducing unexpected downtime and
operating costs.

Proactive Maintenance

Proactive maintenance (PaM) is a new maintenance concept that is emerging along
with the development of business globalization. It encompasses any tasks that seek
to realize the seamless integration of diagnosis and prognosis information and
maintenance decision making via a wireless internet or satellite communication
network. Machine health information should represent a trend, not just a status, so
that a company’s productivity can be focused on asset-level utilization, not just
production rates. Moreover, through integrated life-cycle management, such
degradation information can be used to make improvements in every aspect of a
product’s life-cycle. The intelligent maintenance systems (IMS) concept presented
by Lee (1996) is representative of PaM. Specifically, it has three main working
directions, as follows:
• Develop intertwined embedded informatics and electronic intelligence in a
networked and tether-free environment and enable products and systems to
intelligently monitor, predict, and optimize their performance.
• Change “failure reactive” to “failure proactive” by avoiding the underlying
conditions that lead to machine faults and degradation. Focus on analyzing
the root cause, not just the symptoms. That is, seek to prevent or to fix
failure from its source.
• Feed the maintenance information back to the product, process and machine
design, and ultimately make improvements in every aspect of the product
life-cycle.

Self-maintenance

Self-maintenance is a new design and system methodology. Self-maintenance
machines are expected to be able to monitor, diagnose, and repair themselves in
order to increase their uptime.
One system approach to enabling self-maintenance is based on the concept of
functional maintenance (Umeda et al. 1995). Functional maintenance aims to
recover the required function of a degrading machine by trading off functions,
whereas traditional repair (physical maintenance) aims to recover the initial
physical state by replacing faulty components, cleaning, etc. The way to fulfil the
self-maintenance function is by adding intelligence to the machine, making it
clever enough for functional maintenance, so that the machine can monitor and
diagnose itself, and it can still maintain its functionality for a while if any kind of
failure or degradation occurs. In other words, self-maintainability would be
appended to an existing machine as an additional embedded reasoning system. The
required capabilities of a self-maintenance machine (SMM) are defined as follows
(Labib 2006):
• Monitoring capability: SMM must have the ability of on-line condition
monitoring using sensor fusion. The sensors send the raw data of machine
condition to a processing unit.
• Fault judging capability: from the sensory data, the SMM can judge
whether the machine condition is at normal or abnormal state. By judging
the condition of the machines, we can know the current condition and time
left to failure of the machines.
• Diagnosing capability: if the machine condition is at abnormal state, the
causes of faults must be diagnosed and identified to allow repair planning
action to be carried out.
• Repair planning capability: the machine is able to propose repair actions
based on the result of diagnosis and functional maintenance. The repair
planning action is performed using knowledge from the experts which is
stored in the data base system. There may be more than one repair action
proposed; however, the optimized one will be selected to be implemented.
• Repair executing capability: the maintenance is carried out by the machine
itself without any human intervention. This can be achieved through
computer control system and actuators in the machines.
• Self-learning and improvement: when faced with unfamiliar problems, the
machine is able to repair itself and it is expected that if such problems
occur again, the machine will take a shorter time for repairing itself and the
outcome of maintenance will be more effective and efficient.
Efforts towards realizing self-maintenance have been mainly in the form of
intelligent adaptive control, in which control has been investigated using
fuzzy logic. In order to realize self-maintenance, one needs to develop and
implement an adaptive artificial neuro-fuzzy inference system which allows the
fuzzy logic controller to learn from the data it is modeling and automatically
produce appropriate membership functions and the required rules. Such a
controller must be able to cater for sensor degradation, and this leads to
self-learning and improvement capabilities.
Another system approach to enabling self-maintenance is to add the self-service
trigger function to a machine. The machine self-monitors, self-prognoses and self-
triggers a service request before a failure actually occurs. The maintenance task
may still be conducted by a maintenance crew, but the seamless integration of
machine, maintenance schedule, dispatch system and inventory management
system will minimize maintenance costs and raise customer satisfaction.
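
A self-service trigger of this kind could be sketched as follows. The example
assumes, purely for illustration, that degradation is tracked by a single
health indicator that grows roughly linearly, that a straight-line fit is an
adequate predictor, and that a service request is a simple dictionary. None of
these choices comes from the text, and the function names and figures are
hypothetical.

```python
def fit_line(times, values):
    """Ordinary least-squares straight line; returns (slope, intercept)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    return slope, mean_v - slope * mean_t


def hours_to_threshold(times, values, threshold):
    """Extrapolate the fitted trend to the failure threshold.

    Returns None when the indicator is not trending towards the threshold.
    """
    slope, intercept = fit_line(times, values)
    if slope <= 0:
        return None
    crossing_time = (threshold - intercept) / slope
    return max(0.0, crossing_time - times[-1])


def service_request(machine_id, times, values, threshold, lead_time_hours):
    """Return a request when the predicted crossing falls within the lead time."""
    remaining = hours_to_threshold(times, values, threshold)
    if remaining is not None and remaining <= lead_time_hours:
        return {"machine": machine_id, "hours_to_threshold": remaining}
    return None


# Illustrative indicator sampled every 10 hours; assumed failure level is 4.0.
times = [0.0, 10.0, 20.0, 30.0, 40.0]
indicator = [1.0, 1.5, 2.0, 2.5, 3.0]
request = service_request("press-07", times, indicator, 4.0, lead_time_hours=24.0)
```

Here the fitted trend reaches the assumed failure level about 20 hours after
the last sample, which is inside the 24-hour lead time, so a request is raised.
The lead time stands in for whatever the dispatch and inventory systems need
in order to respond.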

3.2.2 Prognostics Approaches for Condition Based Maintenance

Condition based maintenance (CBM) was presented as a maintenance scheme to
provide sufficient warning of an impending failure on a particular piece of
equipment, allowing that equipment to be maintained only when there is objective
evidence of an impending failure. CBM methods and practices have been
continuously improved in recent decades. Sensor fusion techniques are now commonly
in use due to the inherent superiority in taking advantage of mutual information
from multiple sensors (Hansen et al. 1994; Reichard et al. 2000; Roemer et al.
2001). A variety of techniques in vibration, temperature, acoustic emissions,
ultrasonic, oil debris, lubricant condition, chip detectors, and time/stress analyses
have received considerable attention. For example, vibration signature analysis, oil
analysis and acoustic emissions, because of their excellent capability for describing
machine performance, have been successfully employed for prognostics for a long
time (Kemerait 1987; Wilson et al. 1999; Goodenow et al. 2002). Current
prognostic approaches can be classified into three basic groups: model-based
approach, data-driven approach, and hybrid approach. The model-based approach
requires detailed knowledge of the physical relationships between, and
characteristics of, all related components in a system. It is a quantitative model used to
identify and evaluate the difference between the actual operating state determined
from measurements, and the expected operating state derived from the values of
the characteristics obtained from the physical model. Bunday (1991) presented the
theory and methodology of obtaining reliability indices from historical data. In
direct implementation in maintenance, the reliability of the system is kept at a
defined level, and whenever the reliability falls below the defined level, main-
tenance actions should take place to restore it back to its proper level. However, it
is usually prohibitive to use the model-based approach since relationships and
characteristics of all related components in a system and its environment are often
too complicated to build a model with a reasonable amount of accuracy. In some
cases, values of some process parameters/factors are not readily available. A poor
model leads to poor judgment. The data-driven approach requires a large amount
of historical data representing both normal and “faulty” operations. It uses no a priori
knowledge of the process but, instead, derives behavioral models only from
measurement data from the process itself. Pattern recognition techniques are
widely used in this approach. General knowledge of the process can be used to
interpret data analysis results, based on which qualitative methods such as fuzzy
logic, and artificial intelligence methods can be used for decision making to realize
fault prevention. The hybrid approach fuses the model-based information and
sensor-based information and takes advantage of both model-driven and data-
driven approaches through which more reliable and accurate prognostic results can
be generated (Hansen et al. 1994). Garga et al. (2001) introduced a hybrid
reasoning method for prognostics, which integrated explicit domain knowledge and
machinery data. In this approach, a feed-forward neural network was trained using
explicit domain knowledge to get a parsimonious representation of the explicit
domain knowledge.
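
The comparison at the heart of the model-based approach, between the state
expected from a physical model and the state actually measured, can be sketched
as a residual check. The model below (a temperature rise proportional to load)
and all of its constants are hypothetical stand-ins chosen for brevity; a real
application would need a model of the actual equipment.

```python
def expected_temperature(ambient_c, load_fraction, rise_at_full_load_c=40.0):
    """Hypothetical physical model: temperature rise proportional to load."""
    return ambient_c + rise_at_full_load_c * load_fraction


def residuals(samples):
    """Measured minus expected temperature for (ambient, load, measured) samples."""
    return [
        measured - expected_temperature(ambient, load)
        for ambient, load, measured in samples
    ]


def is_abnormal(samples, tolerance_c=5.0):
    """Flag the machine when the mean residual exceeds an assumed tolerance."""
    values = residuals(samples)
    return abs(sum(values) / len(values)) > tolerance_c


# Illustrative samples: (ambient temperature, load fraction, measured temperature).
healthy = [(20.0, 0.5, 41.0), (22.0, 1.0, 61.0), (21.0, 0.25, 32.0)]
degraded = [(20.0, 0.5, 49.0), (22.0, 1.0, 70.0), (21.0, 0.25, 38.0)]
```

The healthy samples stay within about one degree of the model, while the
degraded samples run 7 to 9 degrees hot and are flagged. The quality of such a
check is bounded by the quality of the model, which is the drawback noted above.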
However, a major breakthrough has not been made since. Existing prognostic
methods are application or equipment specific. For instance, the development of
neural networks has added new dimensions to solving existing problems in con-
ducting prognostics of a centrifugal pump case (Liang et al. 1988). A comparison
of the results using the signal identification technique shows various merits of
employing neural nets including the ability to handle multivariate wear parameters
in a much shorter time. A polynomial neural network was applied to fault
detection, isolation, and estimation for a helicopter transmission prognostic
application (Parker et al. 1993). Ray and Tangirala (1996) built a stochastic model of fatigue
crack dynamics in mechanical structures to predict remaining service time. Fuzzy
logic-based neural networks have been used to predict paper web breakage in a
paper mill (Bonissone 1995) and the failure of a tensioned steel band with seeded
crack growth (Swanson 2001). Yet another prognostic application presented an
integrated system in which a dynamically linked ellipsoidal basis function neural
network was coupled with an automated rule extractor to develop a tree-structured
rule set which closely approximates the classification of the neural network
(Brotherton et al. 2000). That method allowed assessment of trending from the
nominal class to each of the identified fault classes, which means quantitative
prognostics were built into the network functionality. Vachtsevanos and Wang
(2001) gave an overview of different CBM algorithms and suggested a method to
compare their performance for a specific application.
Prognostic information, obtained through intelligence embedded into the
manufacturing process or equipment, can also be used to improve manufacturing
and maintenance operations in order to increase process reliability and improve
product quality. For instance, the ability to increase reliability of manufacturing
facilities using the awareness of the deterioration levels of manufacturing equipment
has been demonstrated through an example of improving robot reliability (Yamada
and Takata 2002). Moreover, a life cycle unit (LCU) (Seliger et al. 2002) was
proposed to collect usage information about key product components, enabling one
to assess product reusability and facilitating the reuse of products that have
significant remaining useful life.
In spite of the progress in CBM, many fundamental issues still remain. For
example:

1. Most research is conducted at the single equipment level, and no
infrastructure exists for employing a real-time remote machinery diagnosis and
prognosis system for maintenance.
2. Most of the developed prognostics approaches are application or equipment
specific. A generic and scalable prognostic methodology or toolbox doesn’t
exist.
3. Currently, methods are focused on solving the failure prediction problem.
The need for tools for system performance assessment and degradation
prediction has not been well addressed.
4. The maintenance world of tomorrow is an information world for feature-
based monitoring. Features used for prognostics need to be further
developed.
5. Many developed prediction algorithms have been demonstrated in a
laboratory environment, but are still without industry validation.

To address the afore-mentioned unmet needs, Watchdog Agent®-based intelligent
maintenance systems (IMS) have been presented by the IMS Center with a vision to
develop a systematic approach in advanced prognostics to enable products and
systems to achieve near-zero breakdown reliability and performance.

3.3 Watchdog Agent®-based Intelligent Maintenance Systems

Today most state-of-the-art manufacturing, mining, farming, and service machines
(e.g., elevators) are actually quite “smart” in themselves. Many sophisticated sen-
sors and computerized components are capable of delivering data concerning a
machine’s status and performance. The problem is that little or no practical use is
made of most of this data. We have the devices, but we do not have a continuous
and seamless flow of information throughout entire processes. Sometimes this is
because the available data is not rendered in a useable, or instantly understandable,
form. More often, no infrastructure exists for delivering the data over a network, or
for managing and analyzing the data, even if the devices were networked.
A Watchdog Agent®-based real-time remote machinery prognostics and health
management (R2M-PHM) system has recently been developed by the IMS Center.
It focuses on developing innovative prognostics algorithms and tools, as well as
remote and embedded predictive maintenance technologies to predict and prevent
machine failures, as illustrated in Figure 3.2.

Figure 3.2. Key focus and elements of the Intelligent Maintenance Systems

The rest of the section is organized as follows. Section 3.3.1 deals with the
platform of the Watchdog Agent®-based real-time remote machinery prognostics and
health management (R2M-PHM) system. Section 3.3.2 presents a generic and
scalable prognostic methodology or toolbox, i.e., the Watchdog Agent® toolbox;
and Section 3.3.3 illustrates the effectiveness and potential of this new development
using several real industry case studies.

3.3.1 Watchdog Agent®-based R2M-PHM Platform

A generic and scalable prognostics framework was presented by Su et al. (1999) to
integrate with embedded diagnostics to provide “total health management”
capability. A reconfigurable and scalable Watchdog Agent®-based R2M-PHM
platform is being developed by the IMS Center, which expands the well known
open system architecture for condition-based maintenance (OSA-CBM) standard
(Thurston and Lebold 2001) by including real-time remote machinery diagnosis
and prognosis systems and embedded Watchdog Agent® technology. As illustrated
in Figure 3.3, the Watchdog Agent® (hardware and software) is embedded onto
machines to convert multi-sensory data to machine health information. The
extracted information is managed and transferred through wireless internet or a
satellite communication network, and service is automatically triggered.

Figure 3.3. Illustration of IMS real-time remote machinery diagnosis and prognosis system

System Architecture

The system architecture of the Watchdog Agent®-based R2M-PHM platform is
shown in Figure 3.4. In most products or systems, different sensors measure
different aspects of the same physical phenomena. For example, sensor signals,
such as vibrations, temperature, pressure, etc. are collected. A “digital doctor”
inspired by biological perceptual systems and machine psychology theory, the
Watchdog Agent® consists of embedded computational prognostic algorithms and
a software toolbox for predicting degradation of devices and systems. It is being
built to be extensible and adaptable to most real-world machine situations. The
health related information is saved to the database. The diagnostic and prognostic
outputs of the Watchdog Agent®, which is mounted on all the machinery of
interest, can then be fed into the decision support tools. Decision support tools help
the operation personnel balance and optimize their resources, when one or more
machines are likely to fail, by constantly looking ahead. For example, if a
production line has three processes A, B and C, such that A has one machine, B
has three machines, and C has one machine, what would we do if we could
anticipate that one of the machines at station B is not behaving normally? Perhaps
we would arrange a staging area for output from A, or perhaps we would ramp up
production on the other two machines at station B. Whatever the case, we would be
making our decision before experiencing the impending breakdown. These tools
are critical to maintenance and process personnel, enabling them to stay ahead of
the game, balancing limited resources with constant change in demand. Decision
support tools also help minimize losses in productivity caused by downtime, and
help production and logistics managers optimize their maintenance schedule to
minimize downtime costs. The lean and necessary information for maintenance can
then be determined and published to the internet through an embedded web server.
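
The three-station example above can be made concrete with a small capacity
calculation. The sketch treats the line as strictly serial, so that its
throughput is set by the slowest station, and uses hourly rates invented for
illustration; it is not part of the decision support tools themselves.

```python
def line_throughput(stations):
    """Throughput of a serial line: the capacity of its slowest station.

    `stations` maps a station name to the rates (units per hour) of the
    machines currently available at that station.
    """
    return min(sum(rates) for rates in stations.values())


# Hypothetical rates: one machine at A, three at B, one at C.
nominal = {"A": [90.0], "B": [30.0, 30.0, 30.0], "C": [90.0]}

# One machine at station B is expected to fail.
one_b_down = {"A": [90.0], "B": [30.0, 30.0], "C": [90.0]}

# Option: ramp up the two remaining B machines (assumed possible up to 40 each).
ramped_up = {"A": [90.0], "B": [40.0, 40.0], "C": [90.0]}

# Output from A that would need a staging area if A keeps running at full rate.
staging_per_hour = sum(one_b_down["A"]) - sum(one_b_down["B"])
```

Under these assumed rates, losing one B machine cuts the line from 90 to 60
units per hour, ramping up the other two recovers it to 80, and a staging area
would have to absorb 30 units per hour if station A kept its full rate. Knowing
this before the breakdown is what allows the choice to be made calmly.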

Figure 3.4. System architecture of a reconfigurable Watchdog Agent® (labels
recovered from the figure: sensor signals such as temperature and pressure; I/O
cards; embedded computer; embedded operating system; embedded software;
Watchdog Agent® toolbox; database; decision support tools; web; remote client
software)

The rapid development of web-enabled and cyber-infrastructure technologies is
important in providing enablers for remote monitoring and prognostics. One of the
major barriers is that most manufacturers adopt proprietary communication
protocols which lead to difficulties in connecting diverse machines and products.
Currently, the IMS Center is developing a web-enabled remote monitoring Device-
to-Business (D2B)™ platform for remote monitoring and prognostics of diversified
products and systems. A system methodology and infotronics platform has been
developed that enables the transformation of product condition data into a more
useful health information format for remote and network-enabled prognostics
applications. The MIMOSA (maintenance information management open system
architecture) organization has adopted the IMS infotronics platform as one of its
standard platforms and will use an IMS testbed to demonstrate MIMOSA standards
in its future activities. As shown in Figure 3.5, the IMS infotronics platform
includes the Watchdog Agent® toolbox (which contains adaptive algorithms for
different situations and applications), decision support tools, data storage, and
D2B™ (device-to-business) system level connectivity. The Watchdog Agent® toolbox
includes signal processing, feature extraction, performance assessment,
autonomous learning, prediction and prognostics functions. The lean and necessary
information for maintenance from decision support tools can then be determined
and sent out through D2B™ system level connectivity to remote workstations or

Figure 3.5. Integrated infotronics platform

Hardware Requirements

For a given industry application, the selection of Watchdog Agent® hardware
depends on the characteristics of the input/output signals (for example, the type of
input/output signal and the number of channels needed), which tools or algorithms are
selected (for example, different algorithms require different hardware computation
and storage capacities), and the hardware’s working environment (which, for example,
determines the hardware’s storage type, temperature range, etc.). The hardware
prototype currently used in the IMS Center is based on PC104 architecture, as
shown in Figure 3.6a. PC104 architecture enables the hardware to be easily
expanded to a multi-board system, which includes multiple CPUs and a large
number of input channels. It has a powerful VIA Eden 400MHz CPU and 128MB
of memory since all of the tools are embedded into the hardware. It has 16 high
speed analog input channels to deal with highly dynamic signals. It also has
various peripherals that can acquire non-analog sensor signals such as RS-
232/485/432, parallel and USB. The prototype uses a compact flash card for
storage, so it can be placed on top of machine tools and is suitable for withstanding
vibrations in a working environment. Once a certain set of tools/algorithms is
determined for a certain industry application, commercially available hardware,
such as that from Advantech and National Instruments (NI), illustrated in
Figure 3.6b and c, respectively, will be further evaluated for customized
Watchdog Agent® applications.

Figure 3.6a–c. Options of hardware prototypes for Watchdog Agent® application

Software Development

The software system of the Watchdog Agent®-based IMS platform consists of two
parts: the embedded side software and the remote side software, as shown in
Figure 3.7. The embedded side software is the software running on the Watchdog
Agent® hardware, which includes a communication module, a command analysis
module, a task module, an algorithm module, a function module, and a DAQ
module. The communication module is responsible for communicating with the
remote side via TCP/IP protocol. The command analysis module is used to analyze
different commands coming from the remote side. The task module includes multi-
thread scheduling and management. The algorithm module contains specific
Watchdog Agent® tools. The function module has several auxiliary functions, such as
channel configuration, security configuration, and the email list. The DAQ
module performs A/D conversion using either interrupt or software trigger to get
data from different sensors. The remote side software is the software running on
the remote computers. It is implemented by ActiveX control technology and can be
used as a component of the Internet Explorer Browser. The remote side software is
mainly composed of a communication module and a user interface module. The
communication module is used for communicating with the embedded side via
TCP/IP protocol. The user interface has a health information display, an ATC
status display, and a discrete event display. It also possesses an algorithm module, as
well as an error log database and a data format interface.
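
To illustrate the role of the command analysis module on the embedded side,
the sketch below parses a text command received from the remote side and
dispatches it to a registered handler. The command names, the reply format and
the class itself are hypothetical; the text does not specify the actual
protocol carried over TCP/IP, and the network transport is left out here.

```python
class CommandAnalyzer:
    """Parse a text command and dispatch it to a registered handler."""

    def __init__(self):
        self._handlers = {}

    def register(self, name, handler):
        self._handlers[name.upper()] = handler

    def handle(self, line):
        parts = line.strip().split()
        if not parts:
            return "ERR empty command"
        name, args = parts[0].upper(), parts[1:]
        handler = self._handlers.get(name)
        if handler is None:
            return f"ERR unknown command {name}"
        try:
            return "OK " + str(handler(*args))
        except (TypeError, ValueError) as exc:
            # Wrong number of arguments or an argument that cannot be parsed.
            return f"ERR {exc}"


# Hypothetical auxiliary function: channel configuration.
channels = {}


def set_channel(index, state):
    enabled = state.lower() == "on"
    channels[int(index)] = enabled
    return f"channel {int(index)} {'on' if enabled else 'off'}"


analyzer = CommandAnalyzer()
analyzer.register("SET_CHANNEL", set_channel)
reply = analyzer.handle("set_channel 3 on")
```

In a full system the communication module would pass each received line to
`handle` and send the reply back to the remote side. Keeping parsing separate
from transport means the handlers can be exercised without a network
connection.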

Figure 3.7. Software structure of Watchdog Agent®

Remote Monitoring Architecture and Human Machine Interface Standards

A four-layer infrastructure for remote monitoring and human machine interface
standards is illustrated in Figure 3.8. The data acquisition layer consists of multiple
sensors which obtain raw data from the components of a machine or machines in
different locations. The Network layer will use either traditional Ethernet connec-
tions, or wireless connections for communication between the Watchdog Agent®s,
or for sending short messages (SM) to an engineer’s mobile phone via GPRS ser-
vices. The Application layer functions as a control server to save related information
and control the behavior of the Watchdog Agent®s in the network. The Enterprise
layer offers a user-friendly interface for maintenance-related engineers to access
information either via an Internet browser or a mobile phone.

Figure 3.8. Illustration of Watchdog Agent®-based remote monitoring architecture

3.3.2 Watchdog Agent® Toolbox for Multi-sensor Performance Assessment and Prognostics

The Watchdog Agent® toolbox, with autonomic computing capabilities, is able to
convert critical performance degradation data into health features and quantitatively
assess their confidence value to predict further trends so that proactive actions can
be taken before potential failures occur. Figure 3.9 illustrates one of the developed
enabling prognostics tools that can assess and predict the performance degradation
of products, machines and complex systems.

Figure 3.9. IMS innovation in advanced prognostics

The Watchdog Agent® toolbox enables one to assess and predict quantitatively
performance degradation levels of key product components, and to determine the
root causes of failure (Casoetto et al. 2003; Djurdjanovic et al. 2000; Lee 1995,
1996), thus making it possible to realize physically closed-loop product life cycle
monitoring and management. The Watchdog Agent® consists of embedded
computational prognostic algorithms and a software toolbox for predicting de-
gradation of devices and systems. Degradation assessment is conducted after the
critical properties of a process or machine are identified and measured by sensors. It
is expected that the degradation process will alter the sensor readings that are being
fed into the Watchdog Agent®, and thus enable it to assess and quantify the
degradation by quantitatively describing the corresponding change in sensor
signatures. In addition, a model of the process or piece of equipment that is being
considered, or available application specific knowledge can be used to aid the
degradation process description, provided that such a model and/or such knowledge
exist. The prognostic function is realized through trending and statistical modeling
of the observed process performance signatures and/or model parameters.
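
One simple way to quantify "the corresponding change in sensor signatures" is
to measure how far a recent feature value has drifted from its baseline
distribution and to map that distance to a value between 0 and 1. The Gaussian
mapping used below is an assumption made for illustration, not the confidence
value computation of the Watchdog Agent®, and the data are invented.

```python
import math
from statistics import mean, stdev


def confidence_value(baseline, recent):
    """Map the drift of a feature from its baseline to a value in (0, 1].

    Values near 1 indicate behaviour close to the baseline; values near 0
    indicate a large drift. The exp(-d^2 / 2) mapping is an assumption.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        raise ValueError("baseline feature has no spread")
    distance = (mean(recent) - mu) / sigma
    return math.exp(-0.5 * distance * distance)


# Illustrative feature values (for example, a band energy) during normal
# operation, followed by two recent windows.
baseline = [10.0, 10.2, 9.8, 10.1, 9.9]
recent_normal = [10.0, 10.1, 9.9]
recent_degraded = [10.6, 10.7, 10.8]

cv_normal = confidence_value(baseline, recent_normal)
cv_degraded = confidence_value(baseline, recent_degraded)
```

Tracking such a value over time gives a performance signature to which
trending or statistical models can then be applied for prediction.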
In order to facilitate the use of Watchdog Agent® in a wide variety of applications
(with various requirements and limitations regarding the character of signals,
available processing power, memory and storage capabilities, limited space, power
consumption, the user’s preference etc.) the performance assessment module of the
Watchdog Agent® has been realized in the form of a modular, open architecture
toolbox. The toolbox consists of different prognostics tools, including neural
network-based, time-series based, wavelet-based and hybrid joint time-frequency
methods, etc., for predicting the degradation or performance loss of devices, processes,
and systems. The open architecture of the toolbox allows one to easily add new
solutions to the performance assessment modules as well as to easily interchange
different tools, depending on the application needs. To enable rapid deployment, a
quality function deployment (QFD) based selection method has been developed to
provide a general suggestion to aid in tool selection; this is especially critical for
those industry users who have little knowledge about these algorithms. The current
tools employed in the signal processing and feature extraction, performance assess-
ment, diagnostics and prognostics modules of Watchdog Agent® functionality are
summarized in Figure 3.10.
Each of these modules is realized in several different ways to facilitate the use
of the Watchdog Agent® in a wide variety of products and applications.

Figure 3.10. Watchdog Agent® prognostics toolbox

Signal Processing and Feature Extraction Module

The signal processing module transforms multiple sensor signals into domains that
are the most informative of a product’s performance. Time-series analysis (Pandit
and Wu 1993) or frequency domain analysis (Marple 1987) can be used to process
stationary signals (signals with time invariant frequency content), while wavelet
(Burrus et al. 1998; Yen and Lin 2000), or joint time-frequency analysis (Cohen
1995; Djurdjanovic et al. 2002) could be used to describe non-stationary signals
(signals with time-varying frequency content). Most real life signals, such as
speech, music, machine tool vibration, acoustic emission etc. are non-stationary
New Technologies for Maintenance 65

signals, which places a strong emphasis on the need for development and utilization
of non-stationary signal analysis techniques, such as wavelets, or joint time-
frequency analysis. The feature extraction module extracts features most relevant
to describing a product’s performance. Those features are extracted from the time
domain into which the sensory processing module transforms sensory signals,
using expert knowledge about the application, or automatic feature selection
methods such as roots of the autoregressive time-series model, or time-frequency
moments and singular value decomposition.
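As a minimal sketch of frequency-domain feature extraction for a stationary signal, the following computes frequency-band energies with a plain discrete Fourier transform. The function names, the band edges, the sampling rate and the test signal are illustrative assumptions and are not taken from the Watchdog Agent® toolbox.

```python
import cmath
import math


def dft_power(signal):
    """Return the one-sided power spectrum |X[k]|^2 / N for k = 0 .. N//2."""
    n = len(signal)
    power = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(signal))
        power.append(abs(coeff) ** 2 / n)
    return power


def band_energies(signal, sample_rate, bands):
    """Sum the spectral power that falls inside each (low_hz, high_hz) band."""
    n = len(signal)
    power = dft_power(signal)
    energies = []
    for low, high in bands:
        total = 0.0
        for k, p in enumerate(power):
            freq = k * sample_rate / n  # frequency of DFT bin k in Hz
            if low <= freq < high:
                total += p
        energies.append(total)
    return energies


# Hypothetical stationary test signal: a 50 Hz tone plus a weaker 120 Hz tone.
SAMPLE_RATE = 1000  # Hz (assumed)
samples = [math.sin(2 * math.pi * 50 * t / SAMPLE_RATE)
           + 0.5 * math.sin(2 * math.pi * 120 * t / SAMPLE_RATE)
           for t in range(200)]

# One feature per band: 0-100 Hz, 100-200 Hz and 200-500 Hz.
features = band_energies(samples, SAMPLE_RATE, [(0, 100), (100, 200), (200, 500)])
```

Band energies of this kind only summarize a signal well when its frequency content does not change over the analysis window, which is why wavelet or joint time-frequency methods are preferred for non-stationary signals.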
Currently the following signal processing and feature extraction tools are used
in the Watchdog Agent® toolbox:
• The Fourier transformation method has been widely used in de-noising and
feature extraction. Noise components in the signal can be distinguished after
it is transformed, and feature components can be identified after the
removal of noise. However, the Fourier transformation is applicable to
stationary signals only, since the frequency-band energies it yields assume
time-invariant frequency content.
• The autoregressive modeling method calculates frequency peak locations
and intensities using autoregressive oscillation modes of sensor readings
and bears significant information about the process (usually, mechanical
systems are well described by the modes of oscillations).
• The wavelet/wavelet packet decomposition method enables the rapid
calculation of non-stationary signal energy distribution at the expense of
losing some of the desirable mathematical properties.
• The time-frequency analysis method provides both temporal and spectral
information with good resolution, and is applicable to highly non-stationary
signals (e.g. impacts or transient behaviors). However, it is not applicable if
a large amount of data has to be considered and calculation speed is a concern.
• The application specific features extraction method is applicable in cases
when one can directly extract performance-relevant features out of the
time-series of sensor readings.

Performance Assessment Module

The performance assessment module evaluates the overlap between the most
recently observed signatures and those observed during normal product operation.
This overlap is expressed through the so-called confidence value (CV), ranging
between zero and one, with higher CVs signifying a high overlap, and hence
performance closer to normal (Lee 1995, 1996). In case data associated with some
failure mode exist, most recent performance signatures obtained through the signal
processing and feature extraction module can be matched against signatures
extracted from faulty behavior data as well. The areas of overlap between the most
recent behavior and the nominal behavior, as well as the faulty behavior, are
continuously transformed into CV over time for evaluating the deviation of the
recent behavior from nominal to faulty.
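The idea of a confidence value can be sketched as follows, assuming a single feature that is approximately Gaussian. The overlap measure used here, the Bhattacharyya coefficient of two univariate Gaussians, is an assumption chosen for illustration; the text does not state which overlap formula the toolbox uses. The sample data and function names are hypothetical.

```python
import math
import statistics


def gaussian_overlap(mean_a, var_a, mean_b, var_b):
    """Bhattacharyya coefficient of two univariate Gaussians (1 = identical)."""
    avg_var = 0.5 * (var_a + var_b)
    distance = ((mean_a - mean_b) ** 2) / (8.0 * avg_var) \
        + 0.5 * math.log(avg_var / math.sqrt(var_a * var_b))
    return math.exp(-distance)


def confidence_value(normal_feature, recent_feature):
    """Fit a Gaussian to each sample set and return their overlap as a CV."""
    return gaussian_overlap(
        statistics.fmean(normal_feature), statistics.variance(normal_feature),
        statistics.fmean(recent_feature), statistics.variance(recent_feature),
    )


# Hypothetical samples of one feature (e.g. a vibration band energy).
normal = [1.00, 1.02, 0.98, 1.01, 0.99, 1.03, 0.97]
healthy_recent = [1.01, 0.99, 1.02, 0.98, 1.00, 1.03, 0.97]
degraded_recent = [1.30, 1.35, 1.28, 1.33, 1.31, 1.36, 1.29]

cv_healthy = confidence_value(normal, healthy_recent)    # close to one
cv_degraded = confidence_value(normal, degraded_recent)  # close to zero
```

The same function could be applied to signatures from a known failure mode, in which case a high value would indicate closeness to that fault rather than to normal operation.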
Realization of the performance evaluation module depends on the character of
the application and extracted performance signatures. If significant application
expert knowledge exists, simple but rapid performance assessment based on the
feature-level fused multi-sensor information can be made using the relative number
of activated cells in the neural network, or by using the logistic regression
approach. For products with open-control architecture, the match between the
current and nominal control inputs and the performance criteria can also be utilized
to assess the product’s performance. For more sophisticated applications with
intricate and complicated signals and performance signatures, statistical pattern
recognition methods, or the feature map based approach can be employed.
The following performance assessment tools are currently being used in the
Watchdog Agent® toolbox:
• The logistic regression method allows one to predict a discrete outcome,
such as group membership, from a set of variables that may be continuous,
discrete, dichotomous, or a mix of any of these. It can quantitatively
represent the proximity of current operating conditions to the region of
desirable or undesirable behavior. However, it is applicable only when a good
feature domain description of unacceptable behavior is available.
• The feature map method assesses the overlap between the normal and most
recent process behavior, and is applicable in cases when the Gaussianness
of extracted features cannot be guaranteed.
• The statistical pattern recognition method calculates overlap of feature
distributions based on the assumption of Gaussian distribution of the
features, and is applicable to a repeatable and stable process. However, it is
not applicable to highly dynamic systems in which the feature distribution
cannot be approximated as Gaussian.
• The hidden Markov model method is applicable to highly dynamic
phenomena when a sequence of process observations rather than a single
observation is needed to describe adequately the behavior of the process.
• The particle filter method is able to describe process performance
quantitatively, and is applicable in cases of complex systems that
display multiple regimes of operation (both normal and faulty). In this case a
hybrid description of the system is needed, incorporating both discrete and
continuous states.

Diagnostics Module

The diagnostics module tells not only the level of behavior degradation (the extent
to which the newly arrived signatures belong to the set of signatures describing
normal system behavior), but also how close the system behavior is to any of the
previously observed faults (overlap between signatures describing the most recent
system behavior with those characterizing each of the previously observed faults).
This matching allows the Watchdog Agent® to recognize and forecast a specific
fault behavior, once a high match with the failure associated signatures is assessed
for the current process signatures, or forecasted based on the current and past
product’s performance. Figure 3.11 illustrates this signature matching process for
performance evaluation.

Figure 3.11. Performance evaluation using Confidence Value (CV)

• The support vector machine method establishes a non-linear maximum
margin classifier that infers the machine condition from a new set of
measurements. It works by using a non-linear kernel to transform the input
vector space (which is a set of measurements believed to be correlated with
machine condition) to a much higher dimension feature space, and drawing
a linear hyper-plane classifier there. It is especially applicable to the situa-
tion when Gaussianity of the performance related features cannot be
guaranteed and when a process may display multiple normal and faulty
modes of behavior (multiple regimes of operation and/or multiple possible
faults in the process). The main drawback to using this method is that the
choice of a kernel in real applications is usually based on experience or
trial-and-error testing.
• The hidden Markov model method is especially applicable to a situation in
which multiple signals exist and the system may have multiple failure
modes. It is applicable to both stationary and non-stationary signals.
• The Bayesian belief network is a compact representation of cause-and-effect
for a complex system, and is especially applicable to situations where there
are multiple faults with multiple symptoms. The main drawback of this
method is that no standard procedure exists to determine network structure
and expert knowledge is needed to identify the node state.
• Condition diagnosis based on analytically calculated overlaps of Gaussians
that describe the signatures corresponding to the current process behavior
and the signatures corresponding to various modes of normal or faulty
equipment behavior, is applicable to the cases in which performance
related features approximately behave as Gaussians.

Prediction and Prognostics Module

The prediction and prognostics module is aimed at extrapolating the behavior of
process signatures over time and predicting their behavior in the future.
Currently, autoregressive moving average (ARMA) modeling (Pandit and Wu 1993)
and match matrix methods (Liu et al. 2004) are used to forecast the performance
behavior. Over time, as new
failure modes occur, performance signatures related to each specific failure mode
can be collected and used to teach the Watchdog Agent® to recognize and diagnose
those failure modes in the future. Thus, the Watchdog Agent® is envisioned as an
intelligent device that utilizes its experience and human supervisory inputs over
time to build its own expandable and adjustable world model.
Performance assessment, prediction and prognostics can be enhanced through
feature-level or decision-level sensor fusion, as defined by Hall and Llinas (2000)
(Chapter 2). Feature-level sensor fusion is accomplished through concatenation of
features extracted from different sensors, and the joint consideration of the con-
catenated feature vector in the performance assessment and prediction modules.
Decision-level sensor fusion is based on separately assessing and predicting pro-
cess performance from individual sensor readings and then merging these indi-
vidual sensor inferences into a multi-sensor assessment and prediction through
some averaging technique.
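The two fusion levels can be contrasted in a short sketch. The function names, the optional weights and the sample values below are hypothetical; a plain or weighted mean is only one possible choice for the "averaging technique" mentioned above.

```python
def feature_level_fusion(per_sensor_features):
    """Concatenate the feature lists of all sensors into one joint vector."""
    fused = []
    for features in per_sensor_features:
        fused.extend(features)
    return fused


def decision_level_fusion(per_sensor_cvs, weights=None):
    """Merge per-sensor confidence values with a (weighted) average."""
    if weights is None:
        weights = [1.0] * len(per_sensor_cvs)
    total_weight = sum(weights)
    return sum(w * cv for w, cv in zip(weights, per_sensor_cvs)) / total_weight


# Hypothetical features from three sensors (two features per sensor).
sensor_features = [[0.12, 3.4], [0.10, 3.1], [0.45, 5.9]]
joint_vector = feature_level_fusion(sensor_features)  # six-element vector

# Hypothetical confidence values assessed separately for each sensor.
sensor_cvs = [0.9, 0.8, 0.4]
fused_cv = decision_level_fusion(sensor_cvs)
weighted_cv = decision_level_fusion(sensor_cvs, weights=[1.0, 1.0, 2.0])
```

With feature-level fusion, the joint vector is passed once to the assessment or prediction module; with decision-level fusion, each sensor is assessed on its own and only the resulting inferences are combined.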
In summary, the following performance forecasting tools are currently used in
the Watchdog Agent®:
• The autoregressive moving average (ARMA) method is applicable to linear
time-invariant systems whose performance features display stationary
behavior. ARMA utilizes a small amount of historic data and can provide
good short term predictions.
• The compound match matrix/ARMA prediction method is applicable to
cases when abundant records of multiple maintenance cycles exist for non-
linear processes. It excels at dealing with high dimension data and can
provide good long term prediction by converting vector-based feature
prediction to scalar-based prediction.
• The fuzzy logic prediction method is applicable to complex systems whose
behavior is unknown and for which no model, function or numerical technique
to describe the system is readily available. It exploits linguistic vagueness
and allows imprecision, to some extent, in formulating approximations.
Fuzzy logic can give fast approximate solutions.
• The Elman recurrent neural network (ERNN) prediction method is appli-
cable to non-linear systems and can give long term predictions when given
a large amount of training data. However, no standard methodology exists
to determine ERNN structure, and trial-and-error is usually used in the
modeling process.
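To illustrate short-term forecasting of a stationary feature, the sketch below fits a first-order autoregressive model by least squares and iterates it forward. This is a deliberate simplification (a pure AR(1) model, not a full ARMA model), and the function names and the feature history are hypothetical.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] - mu = phi * (x[t-1] - mu) + noise.

    Assumes the series is not constant (otherwise the denominator is zero).
    """
    mu = sum(series) / len(series)
    centered = [x - mu for x in series]
    num = sum(a * b for a, b in zip(centered[1:], centered[:-1]))
    den = sum(a * a for a in centered[:-1])
    return mu, num / den


def forecast_ar1(series, steps):
    """Iterate the fitted AR(1) recursion 'steps' samples into the future."""
    mu, phi = fit_ar1(series)
    last = series[-1] - mu
    predictions = []
    for _ in range(steps):
        last = phi * last
        predictions.append(mu + last)
    return predictions


# Hypothetical history of a stationary performance feature.
history = [1.00, 1.04, 1.06, 1.03, 0.99, 0.96,
           0.95, 0.98, 1.01, 1.05, 1.04, 1.00]
forecast = forecast_ar1(history, steps=5)
```

Because the forecast of a stationary AR model decays toward the series mean, predictions of this kind are informative only a few steps ahead, which is consistent with the short-term role given to ARMA in the list above.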
New tools will be continuously developed and added to the modular, open
architecture Watchdog Agent® toolbox based on the development procedure as
shown in Figure 3.12.

[Flowchart: problem definition, tool selection, parameter and tool (label truncated), prototyping, evaluation, and an acceptance check with "No" branches.]

Figure 3.12. Flowchart for developing Watchdog Agent® tools

3.3.3 Case Studies

Several Watchdog Agent® tools for on-line performance assessment and prediction
have already been implemented as stand-alone applications in a number of
industrial and service facilities. Listed below are several examples to illustrate the
developed tools.

Example 1: Prognostics of an AS/RS Material Handling System

A time-frequency based method (Cohen 1995) has been implemented for per-
formance assessment of a gearbox in an AS/RS material handling system shown in
Figure 3.13. Four vibration sensor readings have been fused to evaluate auto-
nomously its performance while it is on-line. The vibration signals were processed
into joint time-frequency energy distributions (Cohen 1995) and a set of time-shift
invariant time-frequency moments (Zalubas et al. 1996; Djurdjanovic et al. 2000;
Tacer and Loughlin 1996) were extracted. Since those moments asymptotically
follow a Gaussian distribution (Zalubas et al. 1996), statistical reasoning was utilized
to evaluate the overlap between signatures describing normal process behavior (used
for training) and those describing the most recent process behavior. Figure 3.14
shows a screenshot of the software application housing this time-frequency based
Watchdog Agent® used for performance assessment of a material handling system.
The CV was generated by fusing multiple signal features for performance assessment.

Figure 3.13. Material handling system for mail staging

Figure 3.14. Screenshot of the time-frequency based Watchdog Agent®

Example 2: Roller Bearing Prognostics Testbed

Most bearing diagnostics research involves studying defective bearings recovered
from the field or from laboratory experiments where the bearings exhibit mature
faults. Experiments using defective bearings have a lower capability for discovering
natural defect propagation in its early stages. In order truly to reflect real defect
propagation processes, bearing run-to-failure tests were performed under normal load
conditions on a specially designed test rig sponsored by Rexnord Technical Service.
The bearing test rig hosts four test bearings on one shaft. Shaft rotation speed
was kept constant at 2000 rpm. A radial load of 6000 lb was added to the shaft and
bearing by a spring mechanism. A magnetic plug installed in the oil feedback pipe
collected debris from the oil as evidence of bearing degradation. The test stopped
when the accumulated debris that adhered to the magnetic plug exceeded a certain level.
Four double row bearings were installed on one shaft as shown in Figure 3.15.
A high sensitivity accelerometer was installed on each bearing housing. Four
thermocouples were attached to the outer race of each bearing to record bearing
temperature (which is relevant to the bearing lubrication condition). Several sets of tests ending
with various failure modes were carried out. The time domain feature shows that
most of the bearing fatigue time is consumed during the period of material
accumulative damage, while the period of crack propagation and development is
relatively short. This means that if the traditional threshold-based condition
monitoring approach is used, the response time available for the maintenance crew
to respond prior to catastrophic failure after a defect is detected in such bearings is
very short. A prognostic approach that can detect the defect at an early stage is
demanded so that enough buffer time is available for maintenance and logistical

Figure 3.15. The bearing test rig sponsored by Rexnord Technical Service

Figure 3.16 presents the vibration waveform collected from bearing 4 at the last
stage of the bearing test. The signal exhibits strong impulse periodicity because of
the impacts generated by a mature outer race defect. However, when examining the
historical data and observing the vibration signal three days before the bearing
failed, there is no sign of periodic impulses as shown in Figure 3.17a. The periodic
impulse feature is completely masked by the noise.

Figure 3.16. The vibration signal waveform of a faulty bearing

An adaptive wavelet filter is designed to de-noise the raw signal and enhance
degradation detection. The adaptive wavelet filter is obtained in two steps. First the
optimal wavelet shape factor is found by the minimal entropy method. Then an
optimal scale is identified by maximizing the signal periodicity. By applying the
designed wavelet filter to the noisy raw signal, the de-noised signal can be obtained
as shown in Figure 3.17b. The periodic impulse feature can then be clearly dis-
covered, which serves as strong evidence of bearing outer race degradation. The
wavelet filter-based de-noising method successfully enhanced the signal feature
and provided potent evidence for prognostic decision-making.

Figure 3.17a, b. The vibration waveform with early stage defect: (a) raw signal; (b) de-noised signal using the wavelet filter

Example 3: Bearing Risk of Failure and Remaining Useful Life Prediction
An important issue in prognostic technology is the estimation of the risk of failure,
and of the remaining useful life of a component, given the component’s age and its
past and current operating condition. In numerous cases, failures were attributed to
many correlated degradation processes, which could be reflected by multiple
degradation features extracted from sensor signals. These features are the major
information regarding the health of the component under monitoring; however, the
failure boundary is hard to define using these features. In reality, the same feature
vector could be attributed to totally different combinations of the underlying
degradation processes and their severity levels. There is only a probabilistic
relationship between component failure and a given level of the degradation
features. Bearing operation provides a typical example: two bearings
of the same type could fail at different levels of RMS and kurtosis of the vibration
signal. To capture the probabilistic relationship between the multiple degradation
features and the component failure as well as to predict the risk of failure and the
remaining useful life, IMS has developed a Proportional Hazards (PH) approach
(Liao et al. 2005) based on the PH model proposed by Cox (1972). The PH model
involving multiple degradation features is given as

$$\lambda(t; Z) = \lambda_0(t)\,\exp(\beta' Z) \qquad (3.1)$$

where $\lambda(t; Z)$ is the hazard rate of the component given the current age $t$ and the
degradation feature vector $Z$; $\lambda_0(t)$ is called the baseline hazard rate function; $\beta$
is the model parameter vector. This formulation relates the working age and
multiple degradation features to the hazard rate of the component. To estimate the
parameters, the maximum likelihood approach could be utilized using offline data,
including the degradation features over time of many components and their failure
times. Afterwards, the established model can be used for predicting the risk of
failure for the component by plugging in the working age and the degradation
features extracted from the on-line sensor signals. In addition, the remaining useful
life $L(t_{\mathrm{current}})$ given the current working age and the history of degradation features
can be estimated as

$$L(t_{\mathrm{current}}) \approx \int_{t_{\mathrm{current}}}^{\infty} \exp\left(-\int_{t_{\mathrm{current}}}^{\tau} \lambda(v; \hat{z}(v))\,dv\right) d\tau \qquad (3.2)$$

where $\hat{z}(v)$ is the predicted feature vector.
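Equations (3.1) and (3.2) can be evaluated numerically as in the following sketch. The Weibull baseline hazard, the coefficient values, the feature values and the rule of holding the last observed features constant are all assumptions made for illustration; they are not the fitted model of Liao et al. (2005).

```python
import math


def hazard(t, z, beta, baseline):
    """Equation (3.1): baseline hazard scaled by exp(beta' z)."""
    return baseline(t) * math.exp(sum(b * zi for b, zi in zip(beta, z)))


def remaining_useful_life(t_current, predict_z, beta, baseline,
                          horizon=200.0, step=0.01):
    """Approximate equation (3.2) with midpoint-rule sums.

    The outer integral is truncated at t_current + horizon, which assumes
    that the survival term is negligible beyond that point.
    """
    cumulative_hazard = 0.0  # inner integral from t_current up to the slice start
    rul = 0.0
    for i in range(round(horizon / step)):
        tau = t_current + (i + 0.5) * step  # midpoint of the current slice
        lam = hazard(tau, predict_z(tau), beta, baseline)
        # Survival at the midpoint: half of this slice's hazard is included.
        rul += math.exp(-(cumulative_hazard + 0.5 * lam * step)) * step
        cumulative_hazard += lam * step
    return rul


def weibull_baseline(t, shape=2.0, scale=40.0):
    """Hypothetical Weibull baseline hazard (time in days)."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)


def hold_last_features(v):
    """Hypothetical predictor: keep the last observed [RMS, kurtosis]."""
    return [0.8, 3.5]


beta = [1.2, 0.3]  # hypothetical model coefficients
risk_now = hazard(26.0, hold_last_features(26.0), beta, weibull_baseline)
rul = remaining_useful_life(26.0, hold_last_features, beta, weibull_baseline)
```

A useful sanity check is the constant-hazard case: with a constant baseline and zero coefficients the survival function is exponential, so the integral in (3.2) should return the reciprocal of the hazard rate.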

Consider the vibration data obtained from the test rig in Example 2. To
facilitate on-line implementation, root-mean-square (RMS) and kurtosis are
calculated and used as degradation features. Figure 3.18 shows the predicted
hazard rate over time based on these degradation features. This quantity can be
utilized to trigger maintenance when the risk level crosses a predetermined
threshold level. Table 3.1 provides the remaining useful life predictions given the
current bearing age and the feature observations. The predictions agree
with the actual life of the studied bearing (≈ 32 days), and the prediction error
shrinks as the degradation progresses.
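The two degradation features and the threshold trigger can be sketched as follows. The function names, the synthetic vibration windows and the threshold value are hypothetical and serve only to show how the quantities are computed.

```python
import math


def rms(window):
    """Root-mean-square amplitude of a window of samples."""
    return math.sqrt(sum(x * x for x in window) / len(window))


def kurtosis(window):
    """Fourth standardized moment (about 3 for Gaussian noise)."""
    n = len(window)
    mean = sum(window) / n
    m2 = sum((x - mean) ** 2 for x in window) / n
    m4 = sum((x - mean) ** 4 for x in window) / n
    return m4 / (m2 ** 2)


def maintenance_due(hazard_rate, threshold):
    """Trigger maintenance once the predicted risk reaches the threshold."""
    return hazard_rate >= threshold


# Hypothetical vibration windows: a smooth sine, and the same sine with
# an added impact every 16 samples, mimicking a localized bearing defect.
smooth = [math.sin(2 * math.pi * k / 64) for k in range(64)]
impulsive = [x + (5.0 if k % 16 == 0 else 0.0) for k, x in enumerate(smooth)]

features_smooth = (rms(smooth), kurtosis(smooth))
features_impulsive = (rms(impulsive), kurtosis(impulsive))
trigger = maintenance_due(hazard_rate=0.25, threshold=0.2)  # assumed values
```

Kurtosis reacts strongly to isolated impacts while RMS tracks the overall vibration energy, which is one reason the two are often used together as bearing degradation features.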

Figure 3.18. Hazard rate prediction of bearing 3 in Test 1

Table 3.1. Estimates of expected remaining useful life – Test 1, Bearing 3 (unit: day)

Time                                         26       29       31
Estimated expected remaining useful life     3.5549   3.3965   1.5295
True remaining useful life                   6.5278   3.5278   1.5278
Error                                        2.9729   0.1313   0.0017

3.4 Conclusions and Future Research

This chapter addresses the paradigm shift in modern maintenance systems from the
traditional “fail and fix” practices to a “predict and prevent” methodology. A re-
configurable and scalable Watchdog Agent®-based intelligent maintenance system
has been developed, which serves as a baseline system for researchers and
companies to develop next-generation e-maintenance systems. It enables machine
makers and users to predict machine health degradation conditions, diagnose fault
sources, and suggest maintenance decisions before a fault actually occurs. The
Watchdog Agent®-based R2M-PHM platform expands the OSA-CBM architecture
topology by including real-time remote machinery diagnosis and prognosis
systems and embedded Watchdog Agent® technology. The Watchdog Agent® is an
embedded algorithm toolbox which converts multi-sensory data to machine health
information. Innovative sensory processing and autonomous feature extraction
methods are developed to facilitate the plug-and-play approach in which the
Watchdog Agent® can be set up and run without any need for expert knowledge or
Future work will be the further development of the Watchdog Agent®-based
IMS platform. Smart software and NetWare will be further developed for proactive
maintenance capabilities such as performance degradation measurement, fault
recovery, self-maintenance and remote diagnostics. For the embedded Watchdog
Agent® application, we need to harvest the developed technologies and tools and to
accelerate their deployment in real-world applications through close collaboration
between industrial and academic researchers. Specifically, future work will include
the following aspects: (i) evaluate the existing Watchdog Agent® tools and identify
the application needs from the smart machine testbed; (ii) develop a configurable
prognostics tools platform for rotary machinery elements such as bearings, motors,
and gears, so that several of the most frequently used prognostics tools can be
pre-tested and deposited into a ready-to-use tool library; (iii) develop a user interface
system for tool selection, which allows users to use the right tools effectively for
the right applications and achieve “the first tool correct” accuracy; (iv) validate the
reconfiguration of these tools to a variety of similar applications (to be defined by
the company participants); and (v) explore research in a “peer-to-peer” (P2P)
paradigm in which Watchdog Agent®s embedded on identical products operating
under similar conditions could exchange information and thus assist each other in
machine health diagnosis and prognosis.
To predict, prioritize, and plan precision maintenance actions to achieve an
“every action correct” objective, the IMS Center is creating advanced maintenance
simulation software for maintenance schedule planning and service logistics cost
optimization for transparent decision making. At the same time, the Center is
exploring the integration of decision support tools and optimization techniques for
proactive maintenance; this integration will facilitate the functionalities of the
Watchdog Agent®-based R2M-PHM in which an intelligent maintenance system
can operate as a near-zero down-time, self-sustainable and self-aware artificially
intelligent system that learns from its own operation and experience.
Embedding is crucial for creating an enabling technology that can facilitate
proactive maintenance and life cycle assessment for mobile systems, transportation
devices and other products for which cost-effective realization of predictive perform-
ance assessment capabilities cannot be implemented on general purpose personal
computers. The main research challenge will be to accomplish sophisticated perform-
ance evaluation and prediction capabilities under the severe power consumption,
processing power and data storage limitations imposed by embedding. The Center
will develop a wireless sensor network made of self-powered wireless motes for
machine health monitoring and embedded prognostics. These networked smart motes
can be easily installed in products and machines with ad hoc communications. In
addition, the Center is investigating the feasibility of harvesting energy by using
vibration in an environment equipped with wireless motes for remote monitoring of
equipment and machinery. In conjunction with that investigation, the Center is
looking at ways of developing communication protocols that require less energy for
communication. Power converter circuitry has been designed to convert the
energy of vibration signals into useful electric energy. These technologies
are critical for monitoring equipment or systems in a complex environment
where the availability of power is the major constraint.
In the area of collaborative product life cycle design and management, the
Watchdog Agent® can serve as an infotronics agent to store product usage and end-
of-life (EOL) service data and to send feedback to designers and life cycle
management systems. Currently, an international intelligent manufacturing systems
consortium on product embedded information systems for service and EOL has
been proposed. The goal is to integrate Watchdog Agent® capabilities into products
and systems for closed-loop design and life cycle management, as illustrated in
Figure 3.19.

Figure 3.19. Embedded and tether-free product life cycle monitoring

The Center will continue advancing its research to develop technologies and tools
for closed-loop life cycle design for product reliability and serviceability, as well as
explore research in new frontier areas such as embedded and networked agents for
self-maintenance, self-healing, and self-recovery of products and systems. These
new frontier efforts will lead to a fundamental understanding of reconfigurability and
allow the closed-loop design of autonomously reconfigurable engineered systems
that integrate physical, information, and knowledge domains. These autonomously
reconfigurable engineered systems will be able to sense, perform self-prognosis, self-
diagnose, and reconfigure the system to function uninterruptedly when subject to
unplanned failure events, as illustrated in Figure 3.20.

[Figure 3.20 labels, partially recovered: closed-loop design for reliability and serviceability; product or system in use; health monitoring sensors and embedded intelligence; product redesign; degradation; Watchdog Agent®; web-enabled monitoring and prognostics; decision support tools for optimized maintenance; self-maintenance; tether-free communications (Bluetooth, Internet, TCP/IP); asset optimization; web-enabled D2B™ platform.]

Watchdog Agent and Device-to-Business (D2B) are Trademarks of IMS Center

Figure 3.20. Intelligent maintenance systems and its key elements

Reliability Centred Maintenance

Marvin Rausand and Jørn Vatn

4.1 Introduction
Reliability centred maintenance (RCM) is a method for maintenance planning that
was developed within the aircraft industry and later adapted to several other
industries and military branches. A large number of standards and guidelines have
been issued where the RCM methodology is tailored to different application areas,
e.g., IEC 60300-3-11, MIL-STD-217, NAVAIR 00-25-403 (NAVAIR 2005), SAE
JA 1012 (SAE 2002), USACERL TR 99/41 (USACERL 1999), ABS (2003, 2004),
NASA (2000) and DEF-STD 02-45 (DEF 2000). On a generic level, IEC 60300-3-11
(IEC 1999) defines RCM as a “systematic approach for identifying effective and
efficient preventive maintenance tasks for items in accordance with a specific set of
procedures and for establishing intervals between maintenance tasks.” A major
advantage of the RCM analysis process is that it provides a structured and traceable
approach for determining the optimal type of preventive maintenance (PM). This is achieved through a
detailed analysis of failure modes and failure causes. Although the main objective of
RCM is to determine the preventive maintenance, the results from the analysis may
also be used in relation to corrective maintenance strategies, spare part optimization,
and logistic considerations. In addition, RCM has an important role in overall
system safety management.
An RCM analysis process, when properly conducted, should answer the
following seven questions:

1. What are the system functions and the associated performance standards?
2. How can the system fail to fulfil these functions?
3. What can cause a functional failure?
4. What happens when a failure occurs?
5. What might the consequence be when the failure occurs?
6. What can be done to detect and prevent the failure?
7. What should be done when a suitable preventive task cannot be found?

The main objectives of an RCM analysis process are to:

• Identify effective maintenance tasks
• Evaluate these tasks by some cost–benefit analysis
• Prepare a plan for carrying out the identified maintenance tasks at optimal intervals
The RCM analysis process is carried out as a sequence of activities. Some of
these activities, or steps, overlap in time. The structuring of the RCM process is
slightly different in the various standards, guidelines, and textbooks. In this chapter
we split the RCM analysis process into the following 12 steps:

1. Study preparation
2. System selection and definition
3. Functional failure analysis (FFA)
4. Critical item selection
5. Data collection and analysis
6. Failure modes, effects, and criticality analysis (FMECA)
7. Selection of maintenance actions
8. Determination of maintenance intervals
9. Preventive maintenance comparison analysis
10. Treatment of non-critical items
11. Implementation
12. In-service data collection and updating

The rest of the chapter is structured as follows: In Section 4.2 we describe and
discuss the 12 steps of the RCM process. The concepts of generic and local RCM
analysis are introduced in Section 4.3. These concepts have been used in a novel
RCM approach to improve and speed up the analyses in a railway application.
Models and methods for optimization of maintenance intervals are discussed in
Section 4.4. Some main features of a new computer tool, OptiRCM, are briefly
introduced. Concluding remarks are given in Section 4.5. The RCM analysis
approach that is described in this chapter is mainly in accordance with accepted
standards, but also contains some novel issues, especially related to steps 6 and 8
and the approach chosen in OptiRCM. The RCM approach is illustrated with
examples from railway applications. Simple examples from the offshore oil and
gas industry are also mentioned.

4.2 Main Steps of the RCM Analysis Process

4.2.1 Step 1: Study Preparation

Before the actual RCM analysis process is initiated, an RCM project group must be
established. The group should include at least one person from the maintenance
function and one from the operations function, in addition to an RCM specialist.
In Step 1 the RCM project group should define and clarify the objectives and
the scope of the analysis. Requirements, policies, and acceptance criteria with
respect to safety and environmental protection should be made visible as boundary
conditions for the RCM analysis.
The part of the plant to be analyzed is selected in Step 2. The type of conse-
quences to be considered should, however, be discussed and settled on a general
basis in Step 1. Possible consequences to be evaluated may comprise:
• Human injuries and/or fatalities
• Negative health effects
• Environmental damage
• Loss of system effectiveness (e.g. delays, production loss)
• Material loss or equipment damage
• Loss of market shares
The consequence classes usually cannot all be measured in a common unit. It is
therefore necessary to prioritize between means affecting the various consequence
classes. Such a prioritization is not an easy task and will not be discussed in this
chapter. The trade-off problems can to some extent be solved within a decision
theoretical framework (Vatn et al. 1996).
RCM analyses have traditionally concentrated on PM strategies. It is, however,
possible to extend the scope of the analysis to cover topics like corrective
maintenance strategies, spare part inventories, logistic support problems, and input
to safety management. The RCM project group must decide what should be part of
the scope and what should be outside.
The resources that are available for the analysis are usually limited. The RCM
project group should therefore be realistic with respect to what to look into,
realizing that analysis cost should not dominate potential benefits.
In many RCM applications the plant already has effective maintenance
programs. The RCM project will therefore be an upgrade project to identify and
select the most effective PM tasks, to recommend new tasks or revisions, and to
eliminate ineffective tasks. The changes should then be applied within the existing
programs in a way that allows the most efficient allocation of resources.
When applying RCM to an existing PM program, it is best to utilize, to the
greatest extent possible, established plant administrative and control procedures in
order to maintain the structure and format of the current program. This approach
provides at least three additional benefits:
• It preserves the effectiveness and success of the current program
• It facilitates acceptance and implementation of the project’s recommendations
when they are processed
• It allows incorporation of improvements as soon as they are discovered,
without the necessity of waiting for major changes to the PM program or
analysis of every system

4.2.2 Step 2: System Selection and Definition

Before a decision to perform an RCM analysis is taken, two questions should be
considered:
• To which systems is an RCM analysis beneficial compared with more
traditional maintenance planning?
• At what level of assembly (plant, system, subsystem) should the analysis
be conducted?
All systems may in principle benefit from an RCM analysis. With limited
resources we must, however, set priorities, at least when introducing RCM in a
new plant. We should start with the systems we assume will benefit most from the
analysis. The following criteria may be used to prioritize systems for an RCM
analysis:
• The failure effects of potential system failures must be significant in terms of
safety, environmental consequences, production loss, or maintenance costs
• The system complexity must be above average
• Reliability data or operating experience from the actual system, or similar
systems, should be available
Most operating plants have developed an assembly hierarchy, i.e. an organization
of the system hardware elements into a structure that looks like the root system of a
tree. In the offshore oil and gas industry this hierarchy is usually referred to as the tag
number system. Several other names are also used. Moubray (1997) refers to the
assembly hierarchy as the plant register. In railway infrastructure maintenance it is
common to use the disciplinary areas as the next highest level in the plant register.
These are typically:
• Superstructure
• Substructure
• Signalling
• Telecommunications
• Power supply (overhead line with supporting systems)
• Low voltage systems
In this chapter, the following terms are used for the levels of the assembly
hierarchy:
Plant: A logical grouping of systems that function together to provide an output
or product by processing and manipulating various input raw materials and feed
stock. An offshore gas production platform may, e.g., be considered as a plant. For
railway application a plant might be a maintenance area, where the main function
of that “plant” is to ensure satisfactory infrastructure functionality in that area.
Moubray (1997) refers to the plant as a cost centre. In railway application a plant
corresponds to a train set (rolling stock), or a line (infrastructure).
System: A logical grouping of subsystems that will perform a series of key
functions, which often can be summarized as one main function, that is required of
a plant (e.g., feed water, steam supply, and water injection). The compression
system on an offshore gas production platform may, e.g., be considered as a
system. Note that the compression system may consist of several compressors with
a high degree of redundancy. Redundant units performing the same main function
should be included in the same system. It is usually easy to identify the systems in
a plant, since they are used as logical building blocks in the design process.
The system level is usually recommended as the starting point for the RCM
process. This is further discussed and justified, e.g., by Smith (1993) and in MIL-
STD 2173 (MIL-STD 1986). This means that on an offshore oil/gas platform the
starting point of the analysis should be the compression system, the water injection
system or the fire water system, and not the whole platform. In railway application
the systems were defined above as the next highest level in the plant hierarchy.
The systems may be further broken down into subsystems, and sub-subsystems,
and so on. For the purpose of the RCM analysis process the lowest level of the
hierarchy should be what we will call an RCM analysis item.
RCM analysis item: A grouping or collection of components, which together
form some identifiable package that will perform at least one significant function
as a stand-alone item (e.g., pumps, valves, and electric motors). For brevity, an
RCM analysis item will in the following be called an analysis item. By this
definition, a shutdown valve, e.g., is classified as an analysis item, while the valve
actuator is not. The actuator is supporting equipment to the shutdown valve, and
only has a function as a part of the valve. The importance of distinguishing the
analysis items from their supporting equipment is clearly seen in the FMECA in
Step 6. If an analysis item is found to have no significant failure modes, then none
of the failure modes or causes of the supporting equipment are important, and
therefore do not need to be addressed. Similarly, if an analysis item has only one
significant failure mode, then the supporting equipment only needs to be analyzed
to determine if there are failure causes that can affect that particular failure mode
(Paglia et al. 1991). Therefore, only the failure modes and effects of the analysis
items need to be analyzed in the FMECA in Step 6. An analysis item is usually
repairable, meaning that it can be repaired without replacing the whole item. In the
offshore reliability database OREDA (2002) the analysis item is called an
equipment unit. The various analysis items of a system may be at different levels
of assembly. On an offshore platform, for example, a huge pump may be defined
as an analysis item in the same way as a small gas detector. If we have redundant
items, e.g., two parallel pumps, each of them should be classified as an analysis item.
When in Step 6 we identify causes of analysis item failures, we often find it
suitable to attribute these failure causes to failures of items on an even lower level of
indenture. The lowest level is usually referred to as components.
Component: The lowest level at which equipment can be disassembled without
damage or destruction to the items involved. Smith (2005) refers to this lowest level
as least replaceable assembly, while OREDA (2002) uses the term maintainable
item.
It is very important that the analysis items are selected and defined in a clear
and unambiguous way in this initial phase of the RCM analysis process, since the
following analysis will be based on these analysis items. If the OREDA database is
to be used in later phases of the RCM process, it is recommended as far as possible
to define the analysis items in compliance with the “equipment units” in OREDA.
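The assembly hierarchy described in this step can be pictured as a small tree. The following Python sketch is only an illustration of that idea and is not part of the RCM standards. The class name `Node`, the level labels in `LEVELS`, and the example equipment are assumptions made for this sketch, loosely based on the offshore examples above. The valve actuator is modelled as a component purely for simplicity.

```python
from dataclasses import dataclass, field
from typing import List

# Level names follow the terms used in this section (ordering is an assumption).
LEVELS = ("plant", "system", "subsystem", "analysis item", "component")


@dataclass
class Node:
    """One element of the assembly hierarchy (hypothetical helper class)."""
    name: str
    level: str
    children: List["Node"] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.level not in LEVELS:
            raise ValueError(f"unknown level: {self.level}")

    def add(self, child: "Node") -> "Node":
        # A child must sit strictly below its parent in the hierarchy.
        if LEVELS.index(child.level) <= LEVELS.index(self.level):
            raise ValueError("child must be at a lower level than its parent")
        self.children.append(child)
        return child

    def analysis_items(self) -> List[str]:
        """Collect the names of all analysis items at or below this node."""
        found = [self.name] if self.level == "analysis item" else []
        for child in self.children:
            found.extend(child.analysis_items())
        return found


platform = Node("Gas production platform", "plant")
compression = platform.add(Node("Compression system", "system"))
# Redundant units with the same main function belong to the same system,
# and each of them is an analysis item in its own right.
compression.add(Node("Compressor A", "analysis item"))
compression.add(Node("Compressor B", "analysis item"))
valve = compression.add(Node("Shutdown valve", "analysis item"))
# The actuator is supporting equipment; it is stored below the valve
# (modelled as a component here, which is a simplifying assumption).
valve.add(Node("Valve actuator", "component"))

print(platform.analysis_items())
```

Starting the analysis at the system level then amounts to calling `analysis_items()` on a system node rather than on the whole plant.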

4.2.3 Step 3: Functional Failure Analysis (FFA)

The objectives of this step are to:
1. Identify and describe the systems’ required functions
2. Describe input interfaces required for the system to operate
3. Identify the ways in which the system might fail to function

Step 3(i): Identification of System Functions

The objective of this step is to identify and describe all the required functions of
the system.
According to ABS (2004) “each function should be documented as a function
statement that contains a verb describing the function, an object on which the
function acts, and performance standard(s)”. A function of a shutdown valve may
therefore be “close flow of oil within 5 s”.
A complex system will usually have a high number of different functions. It is
often difficult to identify all these functions without a checklist. The checklist or
classification scheme of the various functions presented below may help the
analyst in identifying the functions. The same scheme may be used in Step 6 to
identify functions of analysis items. The term item is therefore used in the
classification scheme to denote either a system or an analysis item:

1. Essential functions: These are the functions required to fulfil the intended
purpose of the item. The essential functions are simply the reasons for
installing the item. Often an essential function is reflected in the name of the
item. An essential function of a pump is, e.g., to pump a fluid.
2. Auxiliary functions: These are the functions that are required to support the
essential functions. The auxiliary functions are usually less obvious than the
essential functions, but may in many cases be as important as the essential
functions. Failure of an auxiliary function may in many cases be more
critical than a failure of an essential function. An auxiliary function of a
pump is, e.g., to “contain fluid.”
3. Protective functions: The functions intended to protect people, equipment,
and the environment from damage and injury. The protective functions may
be classified according to what they protect, as: (i) safety functions, (ii)
environment functions, and (iii) hygiene functions. An example of a protective
function is the protection provided by a rupture disk on a pressure vessel.
4. Information functions: These functions comprise condition monitoring,
various gauges and alarms, and so on.
5. Interface functions: These functions apply to the interfaces between the item
in question and other items. The interfaces may be active or passive. A passive
interface is, e.g., present when an item is a support or a base for another item.
6. Superfluous functions: According to Moubray (1997) “Items or components
are sometimes encountered which are completely superfluous. This usually
happens when equipment has been modified frequently over a period of years,
or when new equipment has been over-specified”. Superfluous functions are
sometimes present when the item has been designed for an operational context
that is different from the actual operational context. In some cases failures of a
superfluous function may cause failure of other functions.

For analysis purposes the various functions of an item may also be classified as:
• On-line functions: These are functions operated either continuously or so
often that the user has current knowledge about their state. The termination
of an on-line function is called an evident (or detectable) failure. In relation
to safety instrumented systems, on-line functions correspond to high
demand systems; see IEC 61508 (IEC 1997).
• Off-line functions: These are functions that are used intermittently or so
infrequently that their availability is not known by the user without some
special check or test. The protective functions are very often off-line
functions. An example of an off-line function is the essential function of an
emergency shutdown (ESD) system on an oil platform. The termination of
an off-line function is called a hidden (or undetectable) failure. In the IEC
61508 setting, off-line functions correspond to low demand systems.
Note that this classification of functions should only be used as a checklist to
ensure that all relevant functions are revealed. Discussions about whether to
classify a function as, e.g., “essential” or “auxiliary” should be avoided.
The item may in general have several operational modes (e.g., running, and
standby), and several functions related to each operating state.

Step 3(ii): Functional Block Diagrams

Various types of functional diagrams may represent the system functions identified
in Step 3(i). The most common diagram is the so-called functional block diagram.
A simple functional block diagram of a diesel engine is shown in Figure 4.1.
It is generally not required to establish functional block diagrams for all the
system functions. The diagrams are, however, efficient tools to illustrate the input
interfaces to a function.
In some cases we may want to split system functions into sub-functions on an
increasing level of detail, down to functions of analysis items. The functional block
diagrams may be used to establish this functional hierarchy in a pictorial manner,
illustrating series-parallel relationships, possible feedbacks, and functional interfaces
(e.g., see Blanchard and Fabrycky 1998; Rausand and Høyland 2004). Alternatives to
the functional block diagram are reliability block diagrams and fault trees.
Functional block diagrams are also useful as a basis for the FMECA in Step 6
in the RCM analysis process.

Step 3(iii): Functional Failures

The next step of the FFA is to identify and describe how the various system functions
may fail. A system function may be subject to a set of performance standards (or
functional requirements) that may be grouped as physical properties, operational
performance properties including output tolerances, and time requirements such as
continuous operation or required availability. An unacceptable deviation from one or
more of these performance standards is called a functional failure.
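The link between a function statement (verb, object, performance standard) and a functional failure can be made concrete with a short Python sketch. It uses the shutdown valve example "close flow of oil within 5 s" from Step 3(i). The class name `FunctionStatement` and its fields are hypothetical, and the sketch covers only a single time requirement, whereas real performance standards may combine several properties.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionStatement:
    """Verb + object + performance standard (field names are illustrative)."""
    verb: str
    obj: str
    max_response_time_s: float  # performance standard: upper time limit

    def describe(self) -> str:
        return f"{self.verb} {self.obj} within {self.max_response_time_s:g} s"

    def is_functional_failure(self, observed_time_s: float) -> bool:
        # A functional failure is an unacceptable deviation from the
        # performance standard; here that means responding too slowly.
        return observed_time_s > self.max_response_time_s


close_valve = FunctionStatement("close", "flow of oil", 5.0)
print(close_valve.describe())                   # close flow of oil within 5 s
print(close_valve.is_functional_failure(3.2))   # False
print(close_valve.is_functional_failure(7.5))   # True
```

A valve that never closes can be represented by passing `float("inf")` as the observed time, which corresponds to a total loss of function.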

Figure 4.1. Functional block diagram for a diesel engine

The term functional failure is mainly used in the RCM literature, and has the
same meaning as the more common term failure mode. In RCM we talk about
functional failures on equipment level, and use the term failure mode related to the
parts of the equipment. The failure modes will therefore be causes of a functional
failure. It is important to realize that a functional failure (and a failure mode) is a
manifestation of the failure as seen from the outside, i.e., a deviation from perform-
ance standards.
Functional failures and failure modes may be classified in three main groups
related to the function of the item:
• Total loss of function: In this case the function is not achieved at all, or the
quality of the function is far beyond what is considered as acceptable.
• Partial loss of function: This group may be very wide, and may range from
the nuisance category almost to the total loss of function.
• Erroneous function: This means that the item performs an action that was
not intended, often the opposite of the intended function.
A variety of classification schemes for functional failures (failure modes) have
been published. Some of these schemes, e.g., Blache and Shrivastava (1994), may
be used in combination with the function classification scheme in Step 3(i) to
ensure that all relevant functional failures are identified.
The system functional failures may be recorded on a specially designed FFA-
worksheet that is rather similar to a standard FMECA worksheet. An example of an
FFA-worksheet is presented in Figure 4.2.
In the first column of Figure 4.2 the various operational modes of the system
are recorded. For each operational mode, all the relevant functions of the system
are recorded in column 2.

System:                          Performed by:
Ref. drawing no.:                Date:             Page:     of:

Operational  Function  Function      Functional  Frequency  Criticality
mode                   requirements  failure                S  E  A  C

Figure 4.2. Example of an FFA-worksheet

The performance requirements for the functions, like target values and acceptable
deviations, are listed in column 3. For each function (in column 2) all the relevant
functional failures are listed in column 4. In column 5 the frequency/probability of
the functional failure is listed. A criticality ranking of each functional failure in that
particular operational mode is given in column 6. The reason for including
the criticality ranking is to be able to limit the extent of the further analysis by
disregarding insignificant functional failures. For complex systems such a screening
is often very important in order not to waste time and money.
The criticality ranking depends on both the frequency/probability of the
occurrence of the functional failure, and the severity of the failure. The severity must
be judged at plant level.
The severity ranking should be given in the four consequence classes: (S) safety
of personnel, (E) environmental impact, (A) production availability, and (C) eco-
nomic losses. For each of these consequence classes the severity should be ranked as
for example (H) high, (M) medium, or (L) low. How we should define the border-
lines between these classes will depend on the specific application.
If at least one of the four entries is (M) medium or (H) high, the severity of the
functional failure should be classified as significant, and the functional failure
should be subject to further analysis.
The frequency of the functional failure may also be classified in the same three
classes. (H) high may, e.g., be defined as more than once per 5 years, and (L) low
less than once per 50 years. As above, the specific borderlines will depend on the
application.
The frequency classes may be used to prioritize between the significant system
failure modes.
If all the four severity entries of a system failure mode are (L) low, and the
frequency is also (L) low, the criticality is classified as insignificant, and the
functional failure is disregarded in the further analysis. If, however, the frequency is
(M) medium or (H) high the functional failure should be included in the further
analysis even if all the severity ranks are (L) low, but with a lower priority than the
significant functional failures.
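The screening rules above can be summarized in a few lines of Python. This is a minimal sketch under stated assumptions: the function names `frequency_class` and `screen` and the three outcome labels are invented for illustration, and the frequency borderlines are the example values given in the text (once per 5 years and once per 50 years), which in practice depend on the application.

```python
from typing import Dict

HIGH, MEDIUM, LOW = "H", "M", "L"
# (S) safety, (E) environment, (A) production availability, (C) economic losses
CONSEQUENCE_CLASSES = ("S", "E", "A", "C")


def frequency_class(failures_per_year: float) -> str:
    """Example borderlines from the text: H above once per 5 years,
    L below once per 50 years. Real borderlines are application specific."""
    if failures_per_year > 1 / 5:
        return HIGH
    if failures_per_year < 1 / 50:
        return LOW
    return MEDIUM


def screen(severity: Dict[str, str], frequency: str) -> str:
    """Return the screening outcome for one functional failure."""
    missing = set(CONSEQUENCE_CLASSES) - set(severity)
    if missing:
        raise ValueError(f"missing severity entries: {sorted(missing)}")
    ranks = [severity[c] for c in CONSEQUENCE_CLASSES]
    # At least one medium or high severity entry: analyze further.
    if any(rank in (MEDIUM, HIGH) for rank in ranks):
        return "significant"
    # All severities low and frequency low: disregard.
    if frequency == LOW:
        return "insignificant"
    # All severities low but frequency medium or high: keep, lower priority.
    return "lower priority"


all_low = {"S": "L", "E": "L", "A": "L", "C": "L"}
print(screen({"S": "L", "E": "L", "A": "M", "C": "L"}, frequency_class(0.01)))
print(screen(all_low, frequency_class(0.01)))
print(screen(all_low, frequency_class(0.5)))
```

The three calls print `significant`, `insignificant`, and `lower priority`, one for each branch of the rule.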
The FFA may be rather time-consuming because, for all functional failures, we
have to list all the maintenance significant items (MSIs) (see Step 4). The MSI lists
will hence have to be repeated several times. To reduce the workload we often
conduct a simpler FFA where for each main function we list all functional failures in
one column, and all the related MSIs in another column. This is illustrated in Figure
4.3 for a railway application.
The function name reflects the functions to be carried out on a relatively high
level in the system. In principle, we should explicitly formulate the function(s) to
be carried out. Instead we often specify the equipment class performing the
function. For example, “departure light signal” is specified rather than the more
correct formulation “ensure correct departure light signal”. We observe that the last
functional failure in Figure 4.3 is not a failure mode for the “correct” functional
description (Ensure correct departure light signal), but is related to another function
of the “departure light signal”. Thus, if we use an equipment class description
rather than an explicit functional statement, the list of failure modes should cover
all (implicit) functions of the equipment class.
At the functional failure level, it is also convenient to specify whether the
failure mode is evident or hidden; see Figure 4.3 where we have introduced an
“EF/HF” column.
For each function we also list the relevant items that are required to perform the
function. These items will form “rows” in the FMECA worksheets; see Step 6.

4.2.4 Step 4: Critical Item Selection

The objective of this step is to identify the analysis items that are potentially
critical with respect to the functional failures identified in Step 3(iii). These
analysis items are denoted functional significant items (FSI). For simple systems
the FSIs may be identified without any formal analysis. In many cases it is obvious
which analysis items influence the functional failures. For complex
systems with an ample degree of redundancy or with buffers, we may need a
formal approach to identify the FSIs.
If failure rates and other necessary input data are available for the various
analysis items, it is usually a straightforward task to calculate the relative importance
of the various analysis items based on a fault tree model or a reliability block
diagram. A number of importance measures are discussed by Rausand and Høyland
(2004).
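As an illustration of how such a ranking might be computed, the sketch below uses Birnbaum's measure, one well-known importance measure. The text does not state which measure is preferred, so this choice is an assumption. The system structure (one valve in series with two parallel pumps), the item names, and the reliability values are also hypothetical.

```python
from typing import Callable, Dict

Reliabilities = Dict[str, float]


def system_reliability(p: Reliabilities) -> float:
    """Hypothetical structure: one valve in series with two parallel pumps."""
    pumps = 1.0 - (1.0 - p["pump_a"]) * (1.0 - p["pump_b"])
    return p["valve"] * pumps


def birnbaum(h: Callable[[Reliabilities], float], p: Reliabilities, item: str) -> float:
    """Birnbaum's measure: system reliability with the item working
    minus system reliability with the item failed."""
    works = dict(p, **{item: 1.0})
    failed = dict(p, **{item: 0.0})
    return h(works) - h(failed)


p = {"valve": 0.99, "pump_a": 0.90, "pump_b": 0.90}  # illustrative values only
ranking = sorted(p, key=lambda item: birnbaum(system_reliability, p, item), reverse=True)
for item in ranking:
    print(item, round(birnbaum(system_reliability, p, item), 4))
```

In this example the valve ranks far above either pump, because the redundancy between the pumps reduces the importance of each individual pump.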
In addition to the FSIs, we should also identify items with high failure rate,
high repair costs, low maintainability, long lead-time for spare parts, or items
requiring external maintenance personnel. These analysis items are denoted
maintenance cost significant items (MCSI).
Together, the functional significant items and the maintenance cost significant
items are denoted maintenance significant items (MSI).
In an RCM project for the Norwegian Railway Administration the use of
generic RCM analyses (see Section 4.3) made it possible to analyze all identified
MSIs. In this case this step could be omitted.

4.2.5 Step 5: Data Collection and Analysis

The purpose of this step is to establish a basis for both the qualitative analysis
(relevant failure modes and failure causes), and the quantitative analysis (reliability
parameters such as MTTF, PF-intervals, and so on). The data necessary for the
RCM analysis may be categorized into the following three groups:

Function: _______
Function: “Home signal”
Function: “Departure light signal”
Description: “Five lamp signals, with three main signals and two pre-signals”

Functional failure                    EF/HF   MSI
- Wrong signal picture                HF      - Signal mast
- Missing signal picture              HF      - Brands
- Unclear signal picture              HF      - Background shade
- Does not prevent contact            HF      - Earth conductor
  hazard in case of earth fault               - Lamp
- etc.                                        - Lens
                                              - Transformer
                                              - etc.

Figure 4.3. Structure of functional failure analysis

1. Design data: (i) System definition: a description of the system boundaries
including all subsystems and equipment to fulfil the main functions of the
system, (ii) system breakdown: the assembly hierarchy as described in Step
2, (iii) a technical description of each subsystem, such as the structure of
the subsystem, capacity and functions (e.g., input and output), (iv) system
performance requirements, e.g., desired system availability, environmental
requirements, (v) requirements related to maintenance/testing, e.g., accord-
ing to rules and regulations.
2. Operational and failure data: (i) Performance requirements, (ii) operating
profile (continuous or intermittent operation), (iii) control philosophy (re-
mote/local and automatic/manual), (iv) environmental conditions, (v) main-
tainability, (vi) calendar- and accumulated operating time for overhauls,
(vii) maintenance and downtime costs, (viii) recommended maintenance
for each analysis item based on manufacturer specification, general
guidelines or standards, or in-house recommended practice, and (ix) failure
information, what happens when a failure occurs.
3. Reliability data: Reliability data may be derived from the operational data
by statistical analysis. The reliability data is used to decide the criticality, to
describe the failure process mathematically and to optimize the time
between PM tasks.
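A very simple example of deriving reliability data from operational data is an MTTF estimate based on accumulated operating time. The sketch below assumes a constant failure rate (exponentially distributed lifetimes), which is a simplification that ignores aging. The function name `estimate_mttf` and the records are hypothetical.

```python
from typing import Sequence


def estimate_mttf(operating_times: Sequence[float], failed: Sequence[bool]) -> float:
    """Total time in service divided by the number of observed failures.

    Assumes a constant failure rate. Items still working at the end of the
    observation period contribute operating time but no failure.
    """
    if len(operating_times) != len(failed):
        raise ValueError("need one failed/not-failed flag per item")
    n_failures = sum(failed)
    if n_failures == 0:
        raise ValueError("no failures observed; MTTF cannot be estimated this way")
    return sum(operating_times) / n_failures


# Hypothetical records: hours in service per item, and whether it failed.
hours = [12000.0, 8000.0, 15000.0, 5000.0]
failed = [True, False, True, False]
print(estimate_mttf(hours, failed))  # 40000 h / 2 failures = 20000.0
```

The estimate is only meaningful if each failure report can be linked to a physical unit and its operating time, which is one reason the reporting problems listed below matter.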

During the initial phase of the RCM analysis process it often becomes evident that
the format and quality of the operational data are not sufficient to estimate the
relevant reliability parameters. Some of the main problems encountered are:
• The failure data is at too high a level in the assembly hierarchy, i.e., data is
not reported on the RCM analysis item level (MSI).
• Failure mode and failure causes are not reported, or the recorded information
does not correspond to definitions and code lists used in the
FMECA of Step 6.
• For systems being monitored by measurements or visual inspection, the
state information is often not reported, making it impossible to establish
models for the failure progression.
• For multiple copies of a component the failure reporting does not link each
failure report to a physical unit, but only states that “one of the components
has failed and has been replaced.”
When such problems are encountered, it is important to start a process to
improve the reporting of operational and failure data. However, there will always
be a cost associated with improved reporting because: (i) the maintenance personnel
need to spend more time on reporting, (ii) the maintenance personnel need to be
trained in failure reporting, and get insight into the structured FMECA thinking,
and (iii) the reporting systems (maintenance management systems) have to be
restructured to allow reporting in a format in accordance with the logical structure
of the FMECA worksheets.
Our experience is that improved reporting quality is unattainable unless
maintenance personnel executing the maintenance also participate in the RCM
process. This would give ownership to the process, but it is no guarantee that
reporting will improve.

4.2.6 Step 6: FMECA

The objective of this step is to identify the dominant failure modes of the MSIs
identified in Step 4. The information entered into the FMECA worksheet should be
sufficient both with respect to maintenance task selection in Step 7, and interval
optimization in Step 8. Our FMECA worksheet has more fields than the FMECAs
found in most RCM standards. The reason for this is that we use the FMECA as
the main database for the RCM analysis. Other RCM approaches often use a rather
simple FMECA worksheet, but then have to add an additional FMECA-like
worksheet with the data required for optimization of maintenance intervals.

TOP Events
Experience has shown that we can significantly reduce the workload of the
FMECA by introducing so-called TOP events as a basis for the analysis. The idea is
that for each failure mode in the FMECA, a so-called TOP event is specified as
consequence of the failure mode. A number of failure modes will typically lead to
the same TOP event. A consequence analysis is then carried out for each TOP event
to identify the end consequences of that particular TOP event, covering all con-
sequence classes (e.g., safety, availability/punctuality, environmental aspects). For
many plants, risk analyses (or safety cases) have been carried out as part of the
design process. These may sometimes be used as a basis for the consequence
analysis.
Figure 4.4 shows a conceptual model of this approach for a railway application
where the left part relative to the TOP event is treated in the FMECA, and the
right part is treated as generic, i.e., only once for each TOP event.

Initiating event: “Red bulb failure”
TOP event: “Train collision”

Failure cause:                   - Burn-out bulb
Maintenance barrier:             - Preventive replacement
Other barriers:                  - Directional setting “block”
                                 - Automatic train protection
                                 - Train control centre
Consequence reducing barriers:   - Rescue team
                                 - Train construction
                                 - Fire protection

Figure 4.4. Barrier model for safety

In the rectangle (dashed line) in the left-hand side of Figure 4.4 an “initiating
event” and a “barrier” are illustrated. To analyze this “rectangle” we need reliabil-
ity parameters, such as MTTF, aging parameter, and PF interval, that are included
in the FMECA worksheet (e.g., see Rausand and Høyland 2004). Three situations
are considered:

1. There is a failure or a fault situation that is not related to the component we
are analyzing with respect to maintenance. If, for example, we are analyzing
the automatic train protection (ATP) on the train, the initiating event may be
“locomotive driver does not comply with signaling”, and thus the ATP is a
barrier against this initiating event. In this situation the function of the ATP is
typically a hidden function.
2. There is a potential failure in the component that is being analyzed, and
maintenance is a barrier against this failure. An example is a crack that has
been initiated in the rail, or in an axle (initiating event); and ultrasonic in-
spection is a maintenance activity to reveal the crack, and prevent a serious
3. The initiating event is a component aging failure, and preventive main-
tenance is carried out to reduce the likelihood of this failure. In this
situation the initiating event and the first barrier in Figure 4.4 merge into
one element. An example is aging failure of a light bulb. The likelihood of
such a failure will, however, be reduced if the light bulb is periodically
replaced with a new one before the aging effect becomes dominant.
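The effect of periodic replacement in the third situation can be illustrated numerically. The sketch below assumes a Weibull lifetime, where a shape parameter above 1 plays the role of the aging parameter. This distributional choice, the function names, and the numbers are assumptions made for illustration and are not taken from the text.

```python
import math


def weibull_scale(mttf: float, shape: float) -> float:
    """Scale parameter that gives the requested MTTF for a Weibull lifetime."""
    return mttf / math.gamma(1.0 + 1.0 / shape)


def failure_prob_before_replacement(interval: float, mttf: float, shape: float) -> float:
    """Probability that a new item fails before it is replaced at `interval`."""
    scale = weibull_scale(mttf, shape)
    return 1.0 - math.exp(-((interval / scale) ** shape))


mttf_hours = 10000.0  # illustrative value, not from the text
shape = 3.0           # shape > 1 means the failure rate increases with age
for interval in (2000.0, 5000.0, 10000.0):
    prob = failure_prob_before_replacement(interval, mttf_hours, shape)
    print(f"replace every {interval:>7.0f} h: P(failure before replacement) = {prob:.4f}")
```

With an increasing failure rate, shortening the replacement interval lowers the probability of an aging failure between replacements much more than proportionally, which is the reason periodic replacement works as a maintenance barrier in this situation.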

“Other barriers” in Figure 4.4 can prevent the component failure from
developing into a critical TOP event. “Track circuit detection” may be a barrier
against rail breakage, because the track circuit can detect a broken rail. Typical
examples of TOP events in railway application are:
• Train derailment
• Collision train-train
• Collision train-object
• Fire
• Persons injured or killed in or at the track
• Persons injured or killed at level crossings
• Passengers injured or killed at platforms
Several consequence-reducing barriers may also be available. Guide rails may,
e.g., be installed to mitigate the consequences in case of derailment.
In Figure 4.4 we have indicated that the outcome of the TOP event may be one
out of six (end) consequence classes:
C1: Minor injury
C2: Medical treatment
C3: Permanent injury
C4: 1 fatality
C5: 2–10 fatalities
C6: >10 fatalities
Note that the consequence reducing barriers and the end consequences are not
analyzed explicitly during the FMECA, but treated as generic for each TOP event.
In the railway situation this means only six analyses of the safety consequences
related to human injuries/fatalities.
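The idea of treating the end consequences as generic for each TOP event can be sketched as a lookup table that is filled in once and then reused by every failure mode leading to that TOP event. In the Python sketch below, all probabilities and rates are placeholders invented for illustration, and the function name `consequence_frequencies` is hypothetical. The simple product form assumes that the failure, the barriers, and the outcome are independent, which the text does not state.

```python
from typing import Dict

CONSEQUENCE_CLASSES = ("C1", "C2", "C3", "C4", "C5", "C6")

# Generic analysis, done once per TOP event: P(consequence class | TOP event).
# Placeholder numbers; the remaining probability mass means "no personal injury".
GENERIC_CONSEQUENCES: Dict[str, Dict[str, float]] = {
    "Train derailment": {
        "C1": 0.10, "C2": 0.05, "C3": 0.02, "C4": 0.01, "C5": 0.005, "C6": 0.001,
    },
    "Collision train-train": {
        "C1": 0.15, "C2": 0.10, "C3": 0.05, "C4": 0.03, "C5": 0.02, "C6": 0.005,
    },
}


def consequence_frequencies(
    failure_rate: float, p_barriers_fail: float, top_event: str
) -> Dict[str, float]:
    """Expected frequency of each consequence class for one failure mode.

    failure_rate:     failures per year for the failure mode (FMECA input)
    p_barriers_fail:  probability that all other barriers fail (FMECA input)
    top_event:        key into the generic table, analyzed only once
    """
    if not 0.0 <= p_barriers_fail <= 1.0:
        raise ValueError("p_barriers_fail must be a probability")
    top_event_rate = failure_rate * p_barriers_fail
    generic = GENERIC_CONSEQUENCES[top_event]
    return {c: top_event_rate * generic[c] for c in CONSEQUENCE_CLASSES}


# Two different failure modes reuse the same generic derailment analysis.
rail_crack = consequence_frequencies(0.2, 0.01, "Train derailment")
switch_fault = consequence_frequencies(0.05, 0.02, "Train derailment")
print(rail_crack["C4"], switch_fault["C4"])
```

Only the first two arguments come from the FMECA row. The consequence table itself is not revisited for each failure mode.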
In the following, a list of fields (columns) for the FMECA worksheets is
proposed. The structure of the FMECA is hierarchical, but the information is
usually presented in a tabular worksheet. The starting point in the FMECA is the
functional failures from the FFA in Step 3. Each maintainable item is analyzed
with respect to any impact on the various functional failures. In the following we
describe the various columns:
• Failure mode (equipment class level). The first column in the FMECA
worksheet is the failure mode at the equipment class level identified in the
FFA in Step 3.
• Maintenance significant item (MSI). The relevant MSI were identified in
the FFA.
• MSI function. For each MSI, the functions of the MSI related to the current
equipment class failure mode are identified.
• Failure mode (MSI level). For the MSI functions we also identify the failure
modes at the MSI level.
• Detection method. The detection method column describes how the MSI
failure mode may be detected, e.g., by visual inspection, condition monitor-
ing, or by the central train control system (for railway applications).
• Hidden or evident. Specify whether the MSI function is hidden or evident.
• Demand rate for hidden function, fD. For MSI functions that are hidden, the
rate of demand of this function should be specified.
• Failure cause. For each failure mode there is/are one or more failure
causes. A failure mode will typically be caused by one or more component
failures at a lower level. Note that supporting equipment to the component
is considered for the first time at this step. In this context a failure cause
may therefore be a failure mode of supporting equipment.
• Failure mechanism. For each failure cause, there is one or several failure
mechanisms. Examples of failure mechanisms are fatigue, corrosion, and
wear. To simplify the analysis, the columns for failure cause and failure
mechanism are often merged into one column.
• Mean time to failure (MTTF). The MTTF when no maintenance is per-
formed should be specified. The MTTF is specified for one component if it
is a “point” object, and for a standardized distance if it is a “line” object
such as rails, sleepers, and so on.
• TOP event safety. The TOP event in this context is the accidental event that
might be the result of the failure mode. The TOP event is chosen from a
predefined list established in the generic analysis.
• Barrier against TOP event safety. This field is used to list barriers that are
designed to prevent a failure mode from resulting in the safety TOP event.
For example, brands on the signalling pole would help the locomotive
driver to recognize the signal in case of a dark lamp.
• PTE-S. This field is used to assess the probability that the other barriers
against the TOP event all fail; see Figure 4.4. PTE-S should account for all the
barriers listed under “Barrier against TOP event safety”.
• TOP event availability/punctuality. Also for this dimension a predefined list
of TOP events may be established in the generic analysis.
• Barrier against TOP event availability/punctuality. This field is used to list
barriers that are designed to prevent a failure mode from resulting in an
availability/punctuality TOP event. Since the fail safe principle is fundamental
in railway operation, there are usually no barriers against the punctuality TOP
event when a component fails. An example of a barrier is a two out of three
voting system on some critical components within the system.
• PTE-P. This field is used to assess the probability that the other barriers
against an availability/punctuality TOP event all fail. PTE-P should account for
all the barriers listed under “Barrier against TOP event availability/
punctuality”. Due to the fail safe principle, PTE-P will often be equal to one.
• Other consequences. Other consequences may also be listed. Some of these
are non-quantitative like noise effects, passenger comfort, and aesthetics.
Material damage to rolling stock or components in the infrastructure may
also be listed. Material damage may be categorized in terms of monetary
value, but this is not pursued here.
• Mean downtime (MDT). The MDT is the time from when a failure occurs until
the failure has been corrected and any traffic restrictions have been removed.
• Criticality indexes. Based on already entered information, different criticality
indexes can be calculated. These indexes are used to screen out non-
significant MSIs.
If a failure mode is considered significant with respect to safety or availability/
punctuality (or other dimensions) a preventive maintenance task should be
assigned. In order to do such an assignment, further information has to be
specified. This additional information will be completed during Steps 7 and 8. The
following fields are recommended:
• Failure progression. For each failure cause the failure progression should
be described in terms of one of the following categories: (i) gradual
observable failure progression, (ii) non-observable and fast observable
failure progression (PF model), (iii) non-observable failure progression but
with aging effects, and (iv) shock type failures.
• Gradual failure information. If there is a gradual failure progression,
information about what values of the measurable quantity represent a
fault state should be recorded. The expected time and standard
deviation to reach this state should also be recorded.
• PF-interval information. In case of observable failure progression the PF
model is often applied (e.g., see Rausand and Høyland 2004, p. 394). The
PF concept assumes that a potential failure (P) can be observed some time
before the failure (F) occurs. This time interval is denoted the PF interval
(e.g., see Rausand and Høyland 2004). We need information both on the
expected value and the standard deviation of the PF interval.
• Aging parameter. For non-observable failure progression aging effects
should be described. Relevant categories are strong, moderate or low aging
effects. The aging parameter can alternatively be described by a numeric
value, i.e., the shape parameter α in the Weibull distribution.
• Maintenance task. The maintenance task is determined by the RCM logic
discussed in Step 7.
• Maintenance interval. Often we start by describing the existing maintenance
interval, but after the formalized process of interval optimization in Step
8 we enter the optimized interval.
An example of an FMECA worksheet is shown in Table 4.1 for a departure
light signal.
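As an illustration only, one row of the worksheet could be held in a small record type. The sketch below is in Python. The class name `FmecaRow`, its field names, the `validate` helper, and the example values for detection method, hidden/evident status, and MTTF are assumptions made for this example; they are not taken from OptiRCM or from any RCM standard. The remaining values loosely follow the fouling row of Table 4.1.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FmecaRow:
    """One worksheet row; field names are illustrative, not a standard schema."""
    failure_mode_class: str              # failure mode at equipment class level
    msi: str                             # maintenance significant item
    msi_function: str
    failure_mode_msi: str
    detection_method: str
    hidden: bool                         # True if the MSI function is hidden
    failure_cause: str
    mttf: float                          # MTTF without maintenance (consistent unit)
    top_event_safety: str
    safety_barriers: List[str] = field(default_factory=list)
    p_te_s: float = 1.0                  # probability that all safety barriers fail
    top_event_punctuality: Optional[str] = None
    p_te_p: float = 1.0                  # often 1 because of the fail-safe principle
    mdt: float = 0.0                     # mean downtime
    demand_rate: Optional[float] = None  # f_D, only needed for hidden functions

    def validate(self) -> None:
        """Basic consistency checks on the entered values."""
        if self.mttf <= 0:
            raise ValueError("MTTF must be positive")
        for name in ("p_te_s", "p_te_p"):
            p = getattr(self, name)
            if not 0.0 <= p <= 1.0:
                raise ValueError(f"{name} must be a probability")
        if self.hidden and self.demand_rate is None:
            raise ValueError("hidden functions need a demand rate")


row = FmecaRow(
    failure_mode_class="No signal",
    msi="Lens",
    msi_function="Slip through light",
    failure_mode_msi="No light slipping through",
    detection_method="Visual inspection",  # assumed for the example
    hidden=False,                          # assumed for the example
    failure_cause="Fouling",
    mttf=5.0,                              # assumed value, e.g. in years
    top_event_safety="Collision train-train",
    safety_barriers=["Directional block", "ATP", "TCC"],
    p_te_s=2e-4,
)
row.validate()
```

Keeping the hidden/evident flag and the demand rate in the same record makes it possible to reject a hidden function that lacks a demand rate at data entry, before any interval calculation is attempted.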

4.2.7 Step 7: Selection of Maintenance Actions

This step is the most novel compared to other maintenance planning techniques. A
decision logic is used to guide the analyst through a question–and–answer process.
The input to the RCM decision logic is the dominant failure modes from the
FMECA in Step 6. The main idea is to decide, for each dominant failure mode,
whether a preventive maintenance task is suitable, or whether it is best to let the item
deliberately run to failure and afterwards carry out a corrective maintenance task.
There are generally three reasons for doing a preventive maintenance task:
• Prevent a failure
• Detect the onset of a failure
• Reveal a hidden failure
Only the dominant failure modes are subjected to preventive maintenance. To
obtain appropriate maintenance tasks, the failure causes or failure mechanisms
should be considered.

Table 4.1. Example of part of an FMECA worksheet

System function: Ensure correct departure light signal
Functional failure: No signal

MSI   Function            Failure mode               Failure cause       TOP event    Safety barriers              PTE-S     TOP event
Lamp  Give light          No light                   Burnt-out filament  Train–Train  Directional block, ATP, TCC  3 x 10    Manual train operation
Lens  Protect lamp        Broken lens                Rock fall           Train–Train  Directional block, ATP       2 x 10    None
      Slip through light  No light slipping through  Fouling             Train–Train  Directional block, ATP, TCC  2 x 10–4  None

The failure mechanisms behind each of the dominant failure modes should be
entered into the RCM decision logic to decide which of the following basic
maintenance tasks is most applicable:

1. Continuous on-condition task (CCT)
2. Scheduled on-condition task (SCT)
3. Scheduled overhaul (SOH)
4. Scheduled replacement (SRP)
5. Scheduled function test (SFT)
6. Run to failure (RTF)

Continuous on-condition task (CCT) is a continuous monitoring of an item to
find any potential failures. An on-condition task is applicable only if it is possible
to detect reduced failure resistance for a specific failure mode from the
measurement of some quantity.

Scheduled on-condition task (SCT) is a scheduled inspection of an item at
regular intervals to find any potential failures. There are three criteria that must be
met for an on-condition task to be applicable:

1. It must be possible to detect reduced failure resistance for a specific failure
mode.
2. It must be possible to define a potential failure condition that can be detected
by an explicit task.
3. There must be a reasonably consistent age interval between the time of
potential failure and the time of failure.

There are two disadvantages of a scheduled compared with a continuous on-condition task:

• The man-hour cost of inspection is often larger than the cost of installing a
sensor.
• Since the scheduled inspection is carried out at fixed points of time, one
might “miss” situations where the degradation is faster than anticipated.
An advantage of a scheduled on-condition task is that the human operator is
then able to “sense” information that a sensor will not be able to detect. This means
that traditional “walk around checks” should not be totally skipped even if sensors
are installed.

Scheduled overhaul (SOH) is a scheduled overhaul of an item at or before some
specified age limit, and is often called “hard time maintenance”.
An overhaul task can be considered applicable to an item only if the following
criteria are met:

1. There must be an identifiable age at which the item shows a rapid increase
in the item’s failure rate function.
2. A large proportion of the units must survive to that age.
3. It must be possible to restore the original failure resistance of the item by
reworking it.

Scheduled replacement (SRP) is scheduled discard of an item (or one of its parts)
at or before some specified age limit. A scheduled replacement task is applicable
only under the following circumstances:

1. The item must be subject to a critical failure.
2. Test data must show that no failures are expected to occur below the
specified life limit.
3. The item must be subject to a failure that has major economic (but not
safety) consequences.
4. There must be an identifiable age at which the item shows a rapid increase
in the failure rate function.
5. A large proportion of the units must survive to that age.

Scheduled function test (SFT) is a scheduled inspection of a hidden function to
identify any failure. A scheduled function test task is applicable to an item under
the following conditions:

1. The item must be subject to a functional failure that is not evident to the
operating crew during the performance of normal duties.
2. The item must be one for which no other type of task is applicable and
effective.

Run to failure (RTF) is a deliberate decision to run to failure because the other
tasks are not possible or the economics are less favourable.

Figure 4.5. Maintenance task assignment/decision logic. The flowchart asks, in order:
(1) Does a failure alerting measurable indicator exist? If yes: is continuous monitoring
feasible? Yes gives a continuous on-condition task (CCT); no gives a scheduled
on-condition task (SCT). (2) If no indicator exists: is the aging parameter α > 1? If yes:
is overhaul feasible? Yes gives scheduled overhaul (SOH); no gives scheduled
replacement (SRP). (3) If the aging parameter is not greater than 1: is the function
hidden? Yes gives a scheduled function test (SFT); no means that no PM activity is
found (RTF).

In many situations a maintenance task may prevent several failure mechanisms.
Hence in some situations it is better to enter failure modes rather than failure
mechanisms into the RCM decision logic.
Note also that if a failure cause for a dominant failure mode corresponds to
supporting equipment, the supporting equipment should be defined as the “item” to
be entered into the RCM decision logic.
The RCM decision logic is shown in Figure 4.5. Note that this logic is much
simpler than that found in most RCM standards and guidelines. It should be
emphasized that such logic can never cover all situations. For example, in the
situation of a hidden function with aging failures, a combination of scheduled
replacements and function tests is required.
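The decision logic of Figure 4.5 can be sketched as a small function. This is a minimal Python illustration, not the OptiRCM implementation; the names `Task` and `select_task` and the boolean arguments are hypothetical, and each answer (for example whether overhaul is feasible) is assumed to be supplied by the analyst.

```python
from enum import Enum


class Task(Enum):
    CCT = "Continuous on-condition task"
    SCT = "Scheduled on-condition task"
    SOH = "Scheduled overhaul"
    SRP = "Scheduled replacement"
    SFT = "Scheduled function test"
    RTF = "Run to failure"


def select_task(indicator_exists: bool,
                continuous_monitoring_feasible: bool,
                aging_alpha: float,
                overhaul_feasible: bool,
                function_hidden: bool) -> Task:
    """Walk the questions of Figure 4.5 from top to bottom."""
    # Question 1: does a failure alerting measurable indicator exist?
    if indicator_exists:
        return Task.CCT if continuous_monitoring_feasible else Task.SCT
    # Question 2: is there aging (Weibull shape parameter alpha > 1)?
    if aging_alpha > 1.0:
        return Task.SOH if overhaul_feasible else Task.SRP
    # Question 3: is the function hidden?
    if function_hidden:
        return Task.SFT
    # No PM activity found
    return Task.RTF


# Example: an aging item (alpha = 3.5) with no measurable indicator,
# assuming for the example that overhaul is not feasible.
task = select_task(False, False, 3.5, False, False)
```

Because the questions are asked strictly in sequence, a hidden function with aging failures never reaches the third question and is assigned SOH or SRP only. This is the limitation noted above, where a combination of scheduled replacements and function tests is required.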

4.2.8 Step 8: Determination of Maintenance Intervals

Usually, formalized methods for optimization of maintenance intervals are not a
part of the RCM analysis. In order to optimize maintenance intervals we need to
structure the analysis in such a way that it fits into the maintenance optimization
models that exist. See Section 4.4 for a discussion of determination of maintenance
intervals using optimization models.

4.2.9 Step 9: Preventive Maintenance-Comparison Analysis

Two overriding criteria for selecting maintenance tasks are used in RCM. Each
task selected must meet two requirements:
• It must be applicable
• It must be effective
Applicability: The task is applicable in relation to our reliability
knowledge and in relation to the consequences of failure. If a task is found based
on the preceding analysis, it should satisfy the applicability criterion.
A PM task is applicable if it can eliminate a failure, or at least reduce the
probability of occurrence to an acceptable level (Hoch 1990), or reduce the
impact of failures.
Cost-effectiveness: The task does not cost more than the failure(s)
it is going to prevent.
The PM task’s effectiveness is a measure of how well it accomplishes that
purpose and whether it is worth doing. Clearly, when evaluating the effectiveness of a
task, we are balancing the cost of performing the maintenance against the cost of not
performing it. In this context, we may refer to the cost as follows (Hoch 1990):
The cost of a PM task may include:
• The risk of maintenance personnel error, e.g., “maintenance introduced
failures”
• The risk of increasing the effect of a failure of another component while
one is out of service
• The use and cost of physical resources
• The unavailability of physical resources elsewhere while in use on this task
• Production unavailability during maintenance
• Unavailability of protective functions during maintenance of these
• “The more maintenance you do the more risk you expose your maintenance
personnel to”
On the other hand, the cost of a failure may include:
• The consequences of the failure should it occur (i.e., loss of production,
possible violation of laws or regulations, reduction in plant or personnel
safety, or damage to other equipment)
• The consequences of not performing the PM task even if a failure does not
occur (i.e., loss of warranty)
• Increased costs for emergency

4.2.10 Step 10: Treatment of Non-MSIs

In Step 4 critical items (MSIs) were selected for further analysis. A remaining
question is what to do with the items that are not analyzed. For plants already
having a maintenance program it is reasonable to continue this program for the non-
MSIs. If a maintenance program is not in effect, maintenance should be carried out
according to vendor specifications if they exist, else no maintenance should be per-
formed. See Paglia et al. (1991) for further discussion.

4.2.11 Step 11: Implementation

A necessary basis for implementing the result of the RCM analysis is that the
organizational and technical maintenance support functions are available. A major
issue is therefore to ensure the availability of the maintenance support functions.
The maintenance actions are typically grouped into maintenance packages, each
package describing what to do, and when to do it.
Many accidents are related to maintenance work. When implementing a
maintenance program it is therefore of vital importance to consider the risk
associated with the execution of the maintenance work. Checklists may be used to
identify potential risks involved with maintenance work:
• Can maintenance people be injured during the maintenance work?
• Is a work permit required for execution of the maintenance work?
• Are means taken to avoid problems related to re-routing, by-passing, etc.?
• Can failures be introduced during maintenance work?
Task analysis, e.g., see Kirwan and Ainsworth (1992), may be used to reveal
the risk involved with each maintenance job. See Hoch (1990) for further discus-
sion on implementing the RCM analysis results.

4.2.12 Step 12: In-service Data Collection and Updating

The reliability data we have access to at the outset of the analysis may be scarce, or
even almost non-existent. In our opinion, one of the most significant advantages of RCM
is that we systematically analyze and document the basis for our initial decisions
and, hence, can better utilize operating experience to adjust those decisions as
operating experience data is collected. The full benefit of RCM is therefore only
achieved when operation and maintenance experience is fed back into the analysis.
The updating process should be concentrated on three major time perspectives:
1. Short term interval adjustments
2. Medium term task evaluation
3. Long term revision of the initial strategy
For each significant failure that occurs in the system, the failure characteristics
should be compared with the FMECA. If the failure was not covered adequately in
the FMECA, the relevant part of the RCM analysis should, if necessary, be revised.
The short-term update can be considered as a revision of previous analysis
results. The input to such an analysis is updated reliability figures either due to more
data, or updated data because of reliability trends. This analysis should not require
excessive resources, since the framework for the analysis is already established. Only
Steps 5 and 8 in the RCM process will be affected by short-term updates.
The medium term update will also review the basis for the selection of
maintenance actions in Step 7. Analysis of maintenance experience may identify
significant failure causes not considered in the initial analysis, requiring an updated
FMECA in Step 6.
The long-term revision will consider all steps in the analysis. It is not sufficient
to consider only the system being analyzed; it is required to consider the entire
plant with its relations to the outside world, e.g., contractual considerations, new
laws regulating environmental protection, and so on.

4.3 Generic and Local RCM Analyses

An RCM analysis should be conducted for physical units in a stated operational
context. Assume that we are planning to carry out an RCM analysis of a specific
railway point¹ (turnout) at location X on line Y. For this railway point we identify all
functions, failure modes, and so on. We then propose a set of maintenance tasks, and
finally we choose the maintenance intervals based on reliability parameters for the
railway point, punctuality parameters, and personnel risk. Now, there might be
several hundreds of similar railway points, with slightly varying reliability per-
formance and risk profiles that would require different maintenance intervals. To
avoid repeating the entire RCM analysis for all these railway points, we propose to
conduct a generic RCM analysis, and then make local adjustments with regard to
reliability and risk parameters. The following steps are then required:

1. Conduct a generic RCM analysis for selected components. In this analysis
we use generic (average) values of reliability and risk parameters (regarding
punctuality and personnel risk).
2. Establish a generic RCM database. The results from generic RCM analyses
of selected equipment types are stored in a generic RCM database. In a first
phase we may restrict ourselves to consider broad classes of typical railway
points. In a later phase, we may want to refine our analysis to cover specific
types and brands of railway points (with different failure modes).
3. Select local analysis objects. In the local analysis we work with a subset of
the railway system. This can, for example, be a specific railway point,
railway points in the main track of a specific line, and so on.
4. Find an appropriate generic RCM template. For a local analysis object, we
now recall the corresponding generic RCM analysis from the RCM
database. We first verify that the generic RCM analysis object (template) is
appropriate in terms of qualitative properties, with respect to functions,
failure modes, and so on. At this point it might be necessary to add more
functions, failure modes, etc. In this case, we add the “new” RCM object to
the generic RCM database in order to make the generic RCM database
more comprehensive.
5. Adjust parameters. At the local level we identify differences from the
parameters used in the generic RCM database. A specific line may, for
example, have very old railway points that may cause the MTTF to be
smaller than the average MTTF. In this step of the procedure we have to
consider all parameters that are involved in the optimization model (see
Section 4.4).
6. Re-run the optimization procedure. Based on the new “local” parameters
we next re-run the optimization procedure to adjust maintenance intervals
taking local differences into account. To carry out this process we need a
computerized tool to streamline the work.
7. Document the results. The results from the local analysis are stored in a
local RCM database. This is a database where only the adjustment factors
are documented, for example, that for railway points A, B, C, and D on line Y
the MTTF is 30 % higher than the average. The maintenance interval
is then adjusted accordingly.

¹ A railway point is a railway “switch” that allows a train to go from one track to another.
A railway point is called a “turnout” in American English.

4.4 Modelling and Optimizing Maintenance Intervals

A wide range of general models and methods for maintenance optimization have
been proposed, e.g., see Rausand and Høyland (2004), Pierskalla and Voelker
(1979), Valdez-Florez and Feldman (1989), Cho and Parlar (1991), Gertsbakh
(2000) and Wang (2002). A large number of models and methods for specific applications
have also been developed, e.g., see Vatn and Svee (2002), Chang (2005),
Castanier and Rausand (2006), and Welte et al. (2006). In this section we present
the basic elements required to optimize the maintenance interval (τ), and a standard
procedure for setting up the cost function, C(τ), is proposed.
A computerized tool called OptiRCM has been developed by the authors to
support the RCM procedure presented in this chapter. OptiRCM is currently being
used by the Norwegian National Railway (NSB). The Norwegian National Rail
Administration (JBV) has also adopted the same procedure. OptiRCM imports the
FMECA results generated by Steps 6 and 7 of the RCM analysis process. Cost
information is usually not available in the FMECA; hence information about
preventive and corrective maintenance costs must be provided separately. A screen
presenting the information on the MSI level is shown in Figure 4.6.
OptiRCM uses a procedure with three steps to optimize maintenance intervals:
(i) the component performance is established (left-hand part of Figure 4.6), (ii) the
system model is established (centre part of Figure 4.6), and (iii) the total cost is
calculated (right-hand part of Figure 4.6).

4.4.1 Component Model

The aim of the component model is to establish the effective failure rate with
respect to a specific failure mode, λE (τ ) , as a function of the maintenance interval
τ . The effective failure rate is the unconditional expected number of failures per
time unit for a given maintenance level. Typically, the effective failure rate is an
increasing function of τ . A large number of models for determining the effective
failure rate as a function of the maintenance strategies, the degradation models, and
so on, have been proposed in the literature.

Figure 4.6. OptiRCM input and analysis screen

The interpretation of the effective failure rate is not straightforward for hidden
functions. For such functions we also need to specify the rate at which the hidden
function is demanded. In this situation we may approximate the effective failure
rate by the product of the demand rate and the probability of failure on demand
(PFD) for the hidden function.
In the following we indicate models that may be used for modelling the
effective failure rate, and we refer to the literature for details. The aim of OptiRCM
has been:
• To cover the standard situations, both with respect to evident/hidden
failures and with respect to the type of failure progression.
• To provide formulae that do not require too many reliability parameters to be
specified.
• To limit the number of probabilistic models as a basis for the optimization.
Only the Weibull distribution is used to model aging failures in OptiRCM. There
may, of course, be situations where another distribution would be more realistic,
but our experience is that the user of such a tool rarely has data or insight that helps
him to do better than applying the Weibull model.

Effective Failure Rate in the Situation of Aging

A standard block replacement policy is considered where an aging component is
periodically replaced after intervals of length τ . Upon a failure in one interval, the
component is replaced without affecting the next planned replacement. The
effective failure rate, i.e., the average number of failures per time unit is then given
by λE (τ ) = W (τ ) / τ , where W (τ ) is the renewal function (e.g., see Rausand and
Høyland 2004). Approximation formulas for the effective failure rate exist if we
assume Weibull distributed failure times (e.g., see Chang et al. 2006). OptiRCM
uses the renewal equation to establish an iterative scheme for the effective failure
rate based on an initial approximation.

Effective Failure Rate in the Situation of Gradual Observable Failure

The assumption behind this situation is that the failure progression, say Y(t), can
be observed as a function of time. In the simplest situation Y (t) is one-
dimensional, whereas in more complex situations Y (t) may be multidimensional.
We may also have situations where Y (t) denotes some kind of a signal where, for
example, the fast Fourier transform of the signal is available. In OptiRCM a very
simple situation is considered, where Y (t) is monotonically increasing. As Y (t)
increases, the probability of failure also increases, and at a predefined level
(maintenance limit), say l , the component is replaced, or overhauled. The effective
failure rate, λE (τ, l) , is now a function of both the inspection interval, and the
maintenance limit. In OptiRCM a Markov chain model is used to model the failure
progression (e.g., see Welte et al. 2006 for details of the Markov chain modelling,
and also an extension where it is possible to reduce the inspection intervals as we
approach the maintenance limit). In the Markov chain model it is easy to treat the
situation where Y (t) is a nonlinear function of time. If we restrict ourselves to
linear failure progression, continuous models such as the Wiener and gamma processes
may also be used.

Effective Failure Rate in the PF Model

The assumption behind the PF model is that failure progression is not observable for
a rather long time, and then at some point of time we have a rather fast failure
progression. This is the typical situation for cracks (potential failures) that can be
initiated after a large number of load cycles. The cracks may develop rather fast, and
it is important to detect the cracks before they develop into breakages. The time from
when a crack becomes observable until a failure (breakage) occurs is denoted the PF-interval. The
important reliability parameters are the rate of potential failures, the mean and
standard deviation of the PF-interval, and the coverage of the inspection method. The
model implemented in OptiRCM for the PF situation is described in Vatn and Svee
(2002). See also Castanier and Rausand (2006) for a similar approach, and the more
general application of delay time models (Christer and Waller 1984).

4.4.2 System Model

Figure 4.4 shows a simplified model of the risk picture related to the component
failure being analyzed. In order to quantify the risk related to safety, we need the
following input data:
• The effective failure rate, λE (τ )
• The probability that the other barriers against the TOP event with respect to
safety all fail, PTE−S
• The probability, PCj, that the TOP event results in consequence class Cj, for
each of the consequence classes j.

Table 4.2. PLL and cost contribution for each consequence class

Consequence             PLLj = PLL contribution   SCj = Cost (Euro)
C1: Minor injury        0.01                      2,000
C2: Medical treatment   0.05                      30,000
C3: Permanent injury    0.1                       300,000
C4: 1 fatality          0.7                       1,600,000
C5: 2–10 fatalities     4.5                       13,000,000
C6: >10 fatalities      30                        160,000,000

The frequency of the consequence class Cj is given by

$$F_j = \lambda_E(\tau) \cdot P_{\mathrm{TE-S}} \cdot P_{C_j} \qquad (4.1)$$

where PCj is the probability that the TOP event results in consequence class Cj.
We will later indicate how we can model Equation 4.1 as a function of the
maintenance interval, τ .
In some situations we also assign a cost and/or a PLL (potential loss of life)
contribution to the various consequence classes. PLL denotes the annual, statistically
expected number of fatalities in a specified population. Proposed values adopted by
the Norwegian National Rail Administration are given in Table 4.2. See the
discussion by Vatn (1998) regarding what it means to assign monetary values to
safety consequences.
The total PLL contribution related to the component failure being analyzed is

$$\mathrm{PLL} = P_{\mathrm{TE-S}} \cdot \sum_{j=1}^{6} \left(P_{C_j} \cdot \mathrm{PLL}_j\right) \cdot \lambda_E(\tau) \qquad (4.2)$$

and the total cost contribution related to the component is

$$C_S = P_{\mathrm{TE-S}} \cdot \sum_{j=1}^{6} \left(P_{C_j} \cdot SC_j\right) \cdot \lambda_E(\tau) \qquad (4.3)$$

where SCj is the safety cost of consequence class Cj.

Note that in the FMECA analysis we can have an automatic procedure that
calculates both the PLL contribution and the safety cost contribution based on the
reliability parameters, and the type of TOP event.
In the same way as we have done for safety consequences, we proceed with
punctuality or unavailability costs. Here we simplify, and assume that there exists a
fixed (expected) cost for each TOP event for punctuality, say PC(TOP). The
punctuality cost per time unit is then

$$C_P = P_{\mathrm{TE-P}} \cdot PC(\mathrm{TOP}) \cdot \lambda_E(\tau) \qquad (4.4)$$

This procedure may, if required, be repeated for other dimensions like environ-
ment, material damage, and so on.
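The system model above amounts to a few weighted sums, which the following Python sketch reproduces. The consequence data are copied from Table 4.2 and from two rows of Table 4.3 (given later in this section); the function names and the dictionary layout are assumptions made for illustration and do not describe how OptiRCM stores these values.

```python
# Table 4.2: PLL contribution and safety cost (Euro) for classes C1..C6
PLL_J = [0.01, 0.05, 0.1, 0.7, 4.5, 30.0]
SC_J = [2_000, 30_000, 300_000, 1_600_000, 13_000_000, 160_000_000]

# Table 4.3: generic probabilities P_Cj per TOP event (two rows shown)
P_C = {
    "Fire": [0.1, 0.2, 0.2, 0.1, 0.02, 0.005],
    "Derailment": [0.1, 0.1, 0.1, 0.1, 0.05, 0.01],
}


def consequence_frequencies(lambda_e, p_te_s, top_event):
    """Equation 4.1: F_j = lambda_E * P_TE-S * P_Cj for each class j."""
    return [lambda_e * p_te_s * p for p in P_C[top_event]]


def pll_contribution(lambda_e, p_te_s, top_event):
    """Total PLL contribution of the failure mode."""
    weighted = sum(p * pll for p, pll in zip(P_C[top_event], PLL_J))
    return p_te_s * weighted * lambda_e


def safety_cost(lambda_e, p_te_s, top_event):
    """Total safety cost contribution C_S of the failure mode."""
    weighted = sum(p * sc for p, sc in zip(P_C[top_event], SC_J))
    return p_te_s * weighted * lambda_e


def punctuality_cost(lambda_e, p_te_p, cost_per_top_event):
    """Equation 4.4: punctuality cost per time unit."""
    return p_te_p * cost_per_top_event * lambda_e


# Safety cost per failure (lambda_E = 1) for the TOP event FIRE
# with P_TE-S = 0.0005, the values used in the pump example.
fire_cost_per_failure = safety_cost(1.0, 0.0005, "Fire")
```

Setting the effective failure rate to one gives the cost per failure, which is the coefficient that multiplies λE(τ) in the cost function of Section 4.4.3.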

4.4.3 Total Cost and Interval Optimization

The approach to interval optimization is based on minimizing the total cost related
to safety, punctuality, availability, material damage, etc. Within an ALARP regime
(e.g., see Vatn 1998) this requires that the risk is not unacceptable. Assuming that
risk is acceptable, we proceed by calculating the total cost per time unit:

$$C(\tau) = C_S(\tau) + C_P(\tau) + C_{PM}(\tau) + C_{CM}(\tau) \qquad (4.5)$$

where CS(τ) and CP(τ) are given by Equations 4.3 and 4.4, respectively. Further,

$$C_{PM}(\tau) = \text{PM Cost} / \tau \qquad (4.6)$$

where PM Cost is the cost per preventive maintenance activity. Note that for
condition-based tasks we distinguish between the cost of monitoring the item, and
the cost of physically improving the item by some restoration or renewal activity.
This complicates Equation 4.6 slightly because we have to calculate the average
number of renewals.

Further, if CM Cost is the cost of a corrective maintenance activity, we have

$$C_{CM}(\tau) = \text{CM Cost} \cdot \lambda_E(\tau) \qquad (4.7)$$

Table 4.3. Generic probabilities, PCj, of consequence class Cj for the different TOP events

TOP event                                     PC1     PC2     PC3     PC4     PC5     PC6
Derailment                                    0.1     0.1     0.1     0.1     0.05    0.01
Collision train-train                         0.02    0.03    0.05    0.5     0.3     0.1
Collision train-object                        0.1     0.2     0.3     0.15    0.01    0.001
Fire                                          0.1     0.2     0.2     0.1     0.02    0.005
Passengers injured or killed at platforms     0.3     0.3     0.2     0.05    0.01    0.001
Persons injured or killed at level crossings  0.1     0.2     0.3     0.3     0.09    0.01
Persons injured or killed in or at the track  0.2     0.2     0.2     0.3     0.1     0.0001

To find the optimum maintenance interval we can now calculate C(τ) in
Equation 4.5 for various values of the maintenance interval, τ, and then choose
the τ value that minimizes C(τ).
As a numerical example we consider a pump used for oil cooling of the main
high voltage transformer in a locomotive. The relevant figures in the example are
assessed by experts in the Norwegian National Railway. Upon failure of the oil
pump, the TOP event for punctuality will most likely be a FULL STOP, with a
probability PTE−P = 0.75 for this punctuality consequence. It is considered that a
full stop gives an average delay of 15 min, and the cost of 1 min delay is set to 150
Euros. The potential TOP event for safety is a FIRE, but the likelihood is very
small, i.e., PTE−S = 0.0005. The reliability parameters of the pump are an aging
parameter α = 3.5 and a mean time to failure without any preventive
maintenance of MTTF = 10 million km. To calculate the safety cost we find
∑j (PCj ⋅ SCj) = 1.286 million Euros by combining Tables 4.2 and 4.3. Equation
4.3 thus reads CS(τ) = 643 ⋅ λE(τ).
Punctuality cost in Equation 4.4 is similarly given by CP(τ) = 2250 ⋅ λE(τ). For
PM and CM cost we have PM Cost = 3100 Euros, and CM Cost = 4400 Euros,
respectively. Chang et al. (2006) argue that a good approximation for the effective
failure rate is
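$$\lambda_E(\tau) = \left(\frac{\Gamma(1 + 1/\alpha)}{\mathrm{MTTF}}\right)^{\alpha} \tau^{\alpha-1} \left[1 - \frac{0.1\,\alpha\,\tau^{2}}{\mathrm{MTTF}^{2}} + \frac{(0.09\alpha - 0.2)\,\tau}{\mathrm{MTTF}}\right] \qquad (4.8)$$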

The total cost C(τ) in Equation 4.5 can now be found as a function of τ; see
Figure 4.7 for a graphical illustration. The optimum interval is found to be
7.5 million km. The maintenance action is scheduled replacement of the pump; see
Figure 4.5.
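
The calculation can be sketched numerically as follows. Equation 4.5 is not repeated in this part of the chapter, so the sketch assumes that the total cost per million km is the PM cost divided by τ plus the sum of the CM, safety and punctuality costs per failure multiplied by λE(τ). That cost form, the function names and the search grid are assumptions made for illustration only. λE(τ) is taken as the approximation (4.8) with MTTF in the denominators.

```python
import math

ALPHA = 3.5          # aging parameter from the example
MTTF = 10.0          # million km, without preventive maintenance
PM_COST = 3100.0     # Euros per preventive replacement
CM_COST = 4400.0     # Euros per corrective replacement
SAFETY_COST = 643.0  # Euros per failure (coefficient in Equation 4.3)
PUNCT_COST = 2250.0  # Euros per failure (coefficient in Equation 4.4)


def effective_failure_rate(tau):
    """Approximation (4.8) of the effective failure rate, per million km."""
    scale = (math.gamma(1.0 + 1.0 / ALPHA) / MTTF) ** ALPHA
    correction = (1.0
                  - 0.1 * ALPHA * tau ** 2 / MTTF ** 2
                  + (0.09 * ALPHA - 0.2) * tau / MTTF)
    return scale * tau ** (ALPHA - 1.0) * correction


def total_cost(tau):
    """ASSUMED form of Equation 4.5: PM cost rate plus failure-related cost rate."""
    cost_per_failure = CM_COST + SAFETY_COST + PUNCT_COST
    return PM_COST / tau + cost_per_failure * effective_failure_rate(tau)


def optimum_interval(lo=1.0, hi=10.0, step=0.05):
    """Grid search for the interval (million km) with the lowest total cost."""
    n = int(round((hi - lo) / step))
    grid = [lo + k * step for k in range(n + 1)]
    return min(grid, key=total_cost)


tau_star = optimum_interval()
print(f"optimum interval ~ {tau_star:.2f} million km, "
      f"cost ~ {total_cost(tau_star):.0f} Euros per million km")
```

Under these assumptions the cost curve is flat around its minimum, and the grid minimum falls between 6.5 and 8 million km, the same region as the interval reported above.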

Figure 4.7. Cost elements as a function of the maintenance interval

Reliability Centred Maintenance 107

4.5 Conclusions
The main parts of the RCM approach that we have described in this chapter are
compatible with common practice and with most of the RCM standards. We are,
however, using a more complex FMECA where we also record data that are
necessary during maintenance interval optimization. The novel parts of our approach
are related to the use of so-called generic RCM analysis and to maintenance interval
optimization. The use of generic RCM analysis will significantly reduce the
workload of a complete RCM analysis. Maintenance optimization is, generally, a
very complex task, and only a brief introduction is presented in this chapter. For
maintenance personnel to be able to use the proposed methods, they need to have
access to simple computerized tools where the mathematically complex methods are
hidden. This was our objective in developing the OptiRCM tool. Maintenance
optimization modules are, more or less, non-existent in the standard RCM tools.
OptiRCM is not a replacement for these tools, but rather a supplement. OptiRCM is
still in the development stage, and we are currently trying to implement several new
features into OptiRCM. Among these are additional methods related to maintenance
strategies, and grouping of maintenance tasks.

4.6 References
ABS, (2003) Guide for Survey Based on Reliability-Centered Maintenance. American
Bureau of Shipping, Houston.
ABS, (2004) Guidance Notes on Reliability-Centered Maintenance. American Bureau of
Shipping, Houston.
Blanchard BS, Fabrycky WJ, (1998) Systems Engineering and Analysis, 3rd ed. Prentice
Hall, Englewood Cliffs, NJ.
Blanche KM, Shrivastava AB, (1994) Defining failure of manufacturing machinery and
equipment. Proceedings from the Annual Reliability and Maintainability Symposium,
pp. 69–75.
Castanier B, Rausand M, (2006) Maintenance optimization for subsea oil pipelines.
Pressure Vessels and Piping 83:236–243.
Chang KP, (2005) Reliability-centered maintenance for LNG ships. ROSS report 200506,
NTNU, Trondheim, Norway.
Chang KP, Rausand M, Vatn J, (2006) Reliability Assessment of Reliquefaction Systems on
LNG Carriers. Submitted for publication in Reliability Engineering and System Safety.
Cho DI, Parlar M, (1991) A survey of maintenance models for multi-unit systems. European
Journal of Operational Research 51:1–23.
Christer AH, Waller WM, (1984) Delay time models of industrial inspection maintenance
problems, Journal of the Operational Research Society 35:401–406.
DEF-STD 02-45 (NES 45), (2000) Requirements for the application of reliability-centred
maintenance technique to HM ships, submarines, Royal fleet auxiliaries and other naval
auxiliary vessels. Defense Standard, U.K. Ministry of Defence, Bath, England.
Gertsbakh I, (2000) Reliability Theory with Applications to Preventive Maintenance.
Springer, New York.
Hoch R, (1990) A practical application of reliability centered maintenance. The American
Society of Mechanical Engineers, 90-JPGC/Pwr-51, Joint ASME/IEEE Power
Generation Conference, Boston, MA, 21–25 October.
IEC60300-3-11, (1999) Dependability management – Application guide – Reliability
centered maintenance. International Electrotechnical Commission, Geneva.
IEC61508, (1997) Functional safety of electrical/electronic/programmable electronic safety-
related systems, Part 1–7. International Electrotechnical Commission, Geneva.
Kirwan B, Ainsworth LK, (1992) A Guide to Task Analysis. Taylor and Francis, London.
MIL-STD-2173 (AS), (1986) Reliability-Centered Maintenance. Requirements for Naval
Aircraft, Weapon Systems and Support Equipment. U.S. Department of Defense,
Washington, DC.
Moubray J, (1997) Reliability-centered Maintenance II, 2nd ed. Industrial Press, New York.
NASA, (2000) Reliability Centered Maintenance Guide for Facilities and Collateral
Equipment. NASA Office of Safety and Mission Assurance, Washington DC.
NAVAIR 00-25-403, (2005) Guidelines for the naval aviation reliability-centered
maintenance process. Naval Air Systems Command, U.S.A.
OREDA, (2002) Offshore Reliability Data, 4th ed. OREDA Participants. Available from
Det Norske Veritas, NO-1322 Høvik, Norway.
Paglia A, Barnard D, Sonnett D, (1991) A case study of the RCM project at V.C. Summer
Nuclear Generating Station. 4th International Power Generation Exhibition and
Conference, Tampa, Florida, USA. 5:1003–1013.
Pierskalla WP, Voelker JA, (1979) A survey of maintenance models: The control and
surveillance of deteriorating systems. Naval Research Logistics Quarterly 23:353–388.
Rausand M, Høyland A, (2004) System Reliability Theory; Models, Statistical Methods, and
Applications. Wiley, New York.
SAE JA1012, (2002) A Guide to the Reliability-Centered Maintenance (RCM) Standard.
The Engineering Society for Advancing Mobility Land, Sea, Air, and Space, Warrendale,
Smith AM, (1993) Reliability-Centered Maintenance. McGraw-Hill, New York.
Smith DJ, (2005) Reliability, Maintainability and Risk; Practical Methods for Engineers
including Reliability Centred Maintenance and Safety-Related Systems. Elsevier,
Butterworth Heinemann, Amsterdam.
USACERL TR 99/41 (1999) Reliability centered maintenance (RCM) guide. Operating a
more effective maintenance program. U.S. Army Corps of Engineers.
Valdez-Flores C, Feldman RM, (1989) A survey of preventive maintenance models for
stochastically deteriorating single-unit systems. Naval Research Logistics 36:419–446.
Vatn J, Hokstad P, Bodsberg L, (1996) An overall model for maintenance optimization.
Reliability Engineering and System Safety 51:241–257.
Vatn J, (1998) A discussion of the acceptable risk problem. Reliability Engineering and
System Safety 61:11–19.
Vatn J, Svee H, (2002) A risk based approach to determine ultrasonic inspection frequencies
in railway applications. ESReDA Conference, Madrid, 27–28 May.
Wang H, (2002) A survey of maintenance policies of deteriorating systems. European
Journal of Operational Research 139:469–489.
Welte T, Vatn J, Heggset J, (2006) Markov state model for optimization of maintenance and
renewal of hydro power components. 9th International Conference on Probabilistic
Methods Applied to Power Systems, KTH, Stockholm, 11–15 June 2006.
Part C

Methods and Techniques


Condition-based Maintenance Modelling

Wenbin Wang

5.1 Introduction
The use of condition monitoring techniques in industry to direct maintenance
actions has increased rapidly over recent years to the extent that it has marked the
beginning of what is likely to prove a new generation in production and main-
tenance management practice. There are both economic and technological reasons
for this development driven by tight profit margins, high outage costs and an
increase in plant complexity and automation. Technical advances in condition
monitoring techniques have provided a means to achieve high availability and to
reduce scheduled and unscheduled production shutdowns. In all cases, the
measured condition information does, in addition to potentially improving decision
making, have a value added role for a manager in that there is now a more ob-
jective means of explaining actions if challenged.
In November 1979, the consultants Michael Neal & Associate Ltd published ‘A
Guide to Condition Monitoring of Machinery’ for the UK Department of Trade and
Industry (Neal et al. 1979). This groundbreaking report illustrated the difference in
maintenance strategies (e.g., breakdown, planned, etc.) and suggested that condition
based maintenance, using a range of techniques, would offer significant benefits to
industry. By the late 1990s condition based maintenance had become widely
accepted as one of the drivers to reduce maintenance costs and increase plant
availability. With the advent of e-procurement, business to business (B2B), customer
to business (C2B), business to customer (B2C) etc., industry is fast moving towards
enterprise wide information systems associated with the internet. Today, plant asset
management is the integration of computerised maintenance management systems
and condition monitoring in order to fulfil the business objectives. This enables
significant production benefits through objective maintenance prediction and
scheduling. This positions the manufacturer to remain competitive in a dynamic
Today there exists a large and growing variety of condition monitoring tech-
niques for machine condition monitoring and fault diagnosis. A particularly popular
one for rotating and reciprocating machinery is vibration analysis. However, irrespective
of the particular condition monitoring technique used, the working principle of
condition monitoring is the same, namely condition data become available which
need to be interpreted and appropriate actions taken accordingly. There are generally
two stages in condition based maintenance. The first stage is related to condition
monitoring data acquisition and their technical interpretations. There have been
numerous papers contributing to this stage, as evidenced by the proceedings of
COMADEM over recent years. This stage is characterised by engineering skill,
knowledge and experience. Much effort of the study at this stage has gone into
determining the appropriate variables to monitor, Chen et al. (1994), the design of
systems for condition monitoring data acquisition, Drake et al. (1995), signal
processing, Wong et al. (2006), Samanta et al. (2006), Harrison (1995), Li and Li
(1995), and how to implement computerised condition monitoring, Meher-Homji et
al. (1994). These are just a few examples and no modelling is explicitly entered into
the maintenance decision process based upon the results of condition monitoring. For
detailed technical aspects of condition monitoring and fault diagnosis, see Collacott
(1997). The second stage is maintenance decision making, namely what to do now
given that condition information data and their interpretations are available. The
decision at this stage can be complicated and entails consideration of cost, downtime,
production demand, preventive maintenance shutdown windows, and most im-
portantly, the likely survival time of the item monitored. Compared with the exten-
sive literature on condition monitoring techniques and their applications, relatively
little attention has been paid to the important problem of modelling appropriate
decision making in condition based maintenance.
This chapter focuses on the second stage of condition based maintenance, namely
condition based maintenance modelling as an aid to effective decision making. In
particular, we will highlight a modelling technique used recently in condition based
maintenance, namely residual life modelling via stochastic filtering (Wang and
Christer 2000). This is a key element in modelling the decision making aspect of
condition based maintenance. The chapter is organised as follows. Section 5.2
gives a brief introduction to condition monitoring techniques. Section 5.3 focuses
on condition based maintenance modelling and discusses the various modelling
techniques used. Section 5.4 presents the modelling of the residual life conditional on
observed monitoring information using stochastic filtering. Section 5.5 concludes
the chapter with a discussion of topics for future research.

5.2 Condition Monitoring Techniques

For many years condition monitoring has been defined as “The assessment on a
continuous or periodic basis of the mechanical and electrical condition of
machinery, equipment and systems from the observation and/or recordings of
selected measurement parameters” (Collacott 1997). One of the obvious analogies
is the temperature measurement of a human body where the observation is the
temperature and the system is the human body. Just as doctors strongly recommend
periodic checks of key health parameters such as blood pressure, pulse, weight
and/or temperature for an early indication of potential health problems, for
industrial equipment some measurements can be taken and the likely condition of
the plant assessed.
Today there exists a large and growing variety of condition monitoring
techniques for machine condition monitoring and fault diagnosis. Understanding the
nature of each monitoring technique and the type of information measured will certainly
help us when establishing a decision model. Here we briefly introduce five main
techniques; among them, vibration and oil analysis are the two most popular.

5.2.1 Vibration Based Monitoring

Vibration based monitoring is the mainstream application of condition
monitoring in industry. It is an on-line or off-line technique
used to detect system malfunction based on measured vibration signals.
Generally speaking, vibration is the variation with time of the magnitude of a
quantity that is descriptive of the motion or position of a mechanical system, when
the magnitude is alternately greater than and smaller than some average value or
reference.
Vibration monitoring consists essentially in identifying two quantities:
• The magnitude (overall level) of the vibration
• The frequency content (and/or time waveform)
The magnitude is basically used for establishing the severity of the vibration
and the frequency content for the cause or origin. Vibration velocity has been seen
as the most meaningful magnitude criterion for assessing machine condition,
though displacement or acceleration is also used. The magnitude of vibration is
usually measured in root mean square (rms). If T denotes the period of vibration
and V (t ) is the vibration (say, velocity) measured at time t, then

$$V_{\mathrm{rms}} = \sqrt{\frac{1}{T}\int_0^T \bigl(V(t)\bigr)^2\,dt}\,,$$

which is proportional to the energy of vibration (Reeves 1998).
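
For sampled data the integral is replaced by an average over the samples. The following minimal sketch assumes uniformly spaced samples covering a whole number of periods. The function name and the test signal (a 50 Hz sine of amplitude 2.0, in velocity units) are hypothetical.

```python
import math


def rms(samples):
    """Discrete approximation of V_rms = sqrt((1/T) * integral of V(t)^2 dt)."""
    if not samples:
        raise ValueError("samples must not be empty")
    return math.sqrt(sum(v * v for v in samples) / len(samples))


# Hypothetical test signal: one full period of a 50 Hz sine, amplitude 2.0.
amplitude, freq_hz, n = 2.0, 50.0, 1000
period = 1.0 / freq_hz
signal = [amplitude * math.sin(2.0 * math.pi * freq_hz * k * period / n)
          for k in range(n)]

v_rms = rms(signal)
print(v_rms, amplitude / math.sqrt(2.0))
```

For a pure sine wave the rms value equals the amplitude divided by √2, which gives a quick check on the implementation.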

However, since vibration signals from machines are, in general, periodic in
nature, a great deal of information is contained in their frequency spectrum. The
frequency spectrum is usually obtained digitally using a digital analyser or
computer via a mathematical algorithm known as the “fast Fourier transform” (FFT). The
spectrum analysis of vibration signals is commonly used in the fault diagnosis of
rotating machines. Potentially, all machines can benefit from vibration monitoring
except, perhaps, those running at very low speed (below about 20 rev/min), and
those where isolation (or damping) occurs between the source and the sensor.
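
The idea of reading the frequency content from a sampled signal can be illustrated with a small sketch. To stay self-contained it uses a naive discrete Fourier transform rather than an FFT. The FFT computes the same quantities far more efficiently and would be used in practice. The function names, sampling rate and two-component test signal are hypothetical.

```python
import cmath
import math


def amplitude_spectrum(samples, sample_rate_hz):
    """Naive O(n^2) DFT; returns (frequency_hz, amplitude) pairs up to Nyquist."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * m / n)
                    for m, x in enumerate(samples))
        # DC and Nyquist bins are not doubled; all other bins are one-sided.
        single = k == 0 or (n % 2 == 0 and k == n // 2)
        scale = 1.0 if single else 2.0
        spectrum.append((k * sample_rate_hz / n, scale * abs(coeff) / n))
    return spectrum


def dominant_frequency(samples, sample_rate_hz):
    """Frequency of the largest non-DC spectral line."""
    spectrum = amplitude_spectrum(samples, sample_rate_hz)
    return max(spectrum[1:], key=lambda pair: pair[1])[0]


# Hypothetical signal: a 25 Hz component of amplitude 1.0 plus a weaker
# 120 Hz component of amplitude 0.4, sampled at 1 kHz for 0.2 s.
sample_rate_hz, n = 1000.0, 200
signal = [1.0 * math.sin(2.0 * math.pi * 25.0 * m / sample_rate_hz)
          + 0.4 * math.sin(2.0 * math.pi * 120.0 * m / sample_rate_hz)
          for m in range(n)]

peak_hz = dominant_frequency(signal, sample_rate_hz)
print(peak_hz)
```

Both components fall exactly on spectral lines here because the record length is a whole number of periods of each. With real data, leakage between neighbouring lines has to be expected.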
From observed vibration signals we often see a typical two-stage process where
the signals may stay flat over the normal operation period and then display some
increasing trend when a defect has initiated (Wang 2002). Another factor coming
into play when establishing a vibration based maintenance model is the causal
relationship between the measured signals and the state of the plant. It is the defect
which causes the abnormal signals, but not vice versa (Wang 2002). This factor
plays an important role when selecting an appropriate model for describing such a
relationship.

5.2.2 Oil Based Monitoring

A detailed analysis of a sample of engine, transmission or hydraulic oil is a
valuable preventive maintenance tool for machines. In many cases it enables the
identification of potential problems before a major repair is necessary, has the
potential to reduce the frequency of oil changes, and can increase the resale value of
used equipment.
Oil based monitoring involves sampling and analyzing oil for various properties
and materials to monitor wear and contamination in an engine, transmission or
hydraulic system etc. Sampling and analyzing on a regular basis establishes a
baseline of normal wear and can help indicate when abnormal wear or contamination
is occurring. Oil analysis works as follows. Oil that has been inside any moving
mechanical apparatus for a period of time reflects the possible condition of that
assembly. The oil is in contact with the engine or mechanical components, so metallic
trace particles produced by wear enter the oil. These particles are so small that they
remain in suspension. Many products of the combustion process will also become
trapped in the circulating oil. The oil becomes a working history of the machine.
Particles caused by normal wear and operation will mix with the oil. Any externally
caused contamination also enters the oil. By identifying and measuring these
impurities, one can get an indication of the rate of wear and of any excessive
contamination. An oil analysis can also suggest methods to reduce accelerated wear
and contamination.
The typical oil analysis tests for the presence of a number of different materials
to determine sources of wear, find dirt and other contamination, and even check for
the use of appropriate lubricants. Today there exists a variety of oil based
condition monitoring methods and techniques to check the volume and nature of
foreign particles in oil for equipment health monitoring. These include spectrometric oil
analysis, scanning electron microscopy/energy dispersive X-ray analysis, energy
dispersive X-ray fluorescence, low powered optical microscopy, and ferrous debris
quantification. One purpose of the oil analysis is to provide a means of predicting
possible impending failure without dismantling the equipment. One can “look
inside” an engine, transmission or hydraulic system without taking it apart.
For oil based monitoring there is no such clear cut distinction between normal
and abnormal operation based on observed particle information in the oil samples.
The foreign particles that accumulate in the lubricant oil increase monotonically, so
that we may not be able to see a two-stage failure process as seen in vibration
based monitoring. The causal relationship between the measured amount of particles
in the oil and the state of the plant may also be bilateral in that, for example,
the wear may cause the increase of observed metals in the oil, but the metals and
other contaminants in the oil may also accelerate the wear. This marks a difference
when modelling the state of the plant in oil based monitoring compared to vibration
based monitoring.

5.2.3 Other Monitoring Techniques

The other popular condition monitoring techniques are infrared thermography,
acoustics and motor current analysis.
The basis of infrared thermography is quite simple. All objects emit heat or
infrared electro-magnetic energy, but only a very small proportion of this energy is
visible to the naked eye. At low temperatures in order to ‘see’ the heat being
emitted an infrared camera must be used. The camera detects the invisible thermal
energy and converts it to a visible image on a screen. The image can then be ana-
lyzed to identify any abnormality.
The acoustic emission (AE) based method is widely used for monitoring the
condition of rotating machinery. Compared to traditional vibration based methods,
the high frequency approach of AE has the advantage of a significant improvement
in signal to noise ratio. It can also be used for non-rotating machinery where defect
activities do not generate distinct repetition frequencies and hence FFT analysis
cannot be used. An item to note is that AE transducers need to have a relatively
narrow band to be able to detect high frequency faults.
Motor current signature analysis, which monitors the operating characteristics
of an electric motor-operated device such as a motor-operated valve, has
frequently been used for early detection of rotor
related faults in AC induction motors. Frequency domain signal analysis techniques
are applied to a conditioned motor current signal to identify distinctly various
operating parameters of the motor driven device from the motor current signature.
The signature may be recorded and compared with subsequent signatures to detect
operating abnormalities and degradation of the device. This diagnostic method does
not require special equipment to be installed on the motor-operated device, and the
current sensing may be performed at remote control locations, e.g., where the
motor-operated devices are used in inaccessible or hostile environments.
All the techniques briefly introduced above can offer some help for indicating
the current state or condition of the plant monitored. Based on the technical
analysis of the observed condition monitoring data, a maintenance decision has to
be made to maintain the plant in a cost effective way. We discuss in the next
section how modelling can be used to support such decision making using
available monitoring information.

5.3 Condition Based Maintenance Modelling

There is a basic but not always clearly answered question in condition monitoring:
what is the purpose of condition monitoring? Have we lost sight of the ultimate
need? Condition monitoring is not an end in itself; it involves an expenditure entered
into by managers in the belief that it will save them money. How is this saving
achieved? It can be obtained by using monitored condition information to optimise
maintenance to achieve minimum breakdown of the plant with maximum availability
for production, and to ensure that maintenance is only carried out when
necessary. This is what one calls condition based maintenance, in which maintenance
is carried out only when it becomes necessary on the basis of available condition
information, in contrast with the traditional breakdown or time based maintenance
policies. But in reality, all too often we see effort and money spent on monitoring
equipment for faults which rarely occur, and we also see planned maintenance
being carried out when the equipment is perfectly healthy though the monitored
information indicates something is “wrong”. A study of oil based condition
monitoring of gear boxes of locomotives used by Canadian Pacific Railway (Aghjagan
1989) indicated that, since condition monitoring was commissioned (entailing 3–4
samples per locomotive per week, 52 weeks per year), the incidence of gear box
failure while in use fell by 90 %. This is a significant achievement. However, when
the gear boxes were subsequently stripped down for reconditioning/overhaul, there
was nothing evidently wrong in 50 % of cases. Clearly, condition monitoring can be
highly effective, but may also be very inefficient at the same time. Modelling is
necessary to improve the cost effectiveness and efficiency of condition monitoring.

5.3.1 The Decision Model

The decision model is an extension of the age based replacement model in that the
replacement decision depends not only upon the age, but also upon the
monitored information, plus other cost or downtime parameters. If we take the cost
model as an example, then the decision model amounts to minimising the long run
expected cost per unit time. We use the following notation:

cf : The mean cost per failure
cp : The mean cost per preventive replacement
cm : The mean cost per condition monitoring
ti : The ith and the current monitoring point
Yi : Monitored information at ti, with yi its observed value
ℑi : History of observed condition variables to ti, ℑi = {y1, ..., yi}
Xi : The residual life at time ti
pi(xi | ℑi) : Pdf of Xi conditional on ℑi

The long term expected cost per unit time, C(t), given that a preventive
replacement is scheduled at time t > ti, is given by (Wang 2003)

$$C(t) = \frac{(c_f - c_p)\,P(t - t_i \mid \Im_i) + c_p + i\,c_m}{t_i + (t - t_i)\bigl(1 - P(t - t_i \mid \Im_i)\bigr) + \int_0^{t - t_i} x_i\,p_i(x_i \mid \Im_i)\,dx_i} \qquad (5.1)$$

where $P(t - t_i \mid \Im_i) = P(X_i < t - t_i \mid \Im_i) = \int_0^{t - t_i} p_i(x_i \mid \Im_i)\,dx_i$ is the probability
of a failure before t conditional on ℑi. The right hand side of Equation 5.1 is the
expected cost per unit time formulated as a renewal reward function, though the
lifetimes are independent but not identical.
The time point t is usually bounded within the time period from the current to
the next monitoring since a new decision shall be made once a new monitoring
reading becomes available at time ti +1 .
In general, if a minimum of C (t ) is found within the interval to the next
monitoring in terms of t , then this t should be the optimal replacement time. If no
minimum is found, then the recommendation would be to continue to use the plant
and evaluate Equation 5.1 at the next monitoring point when new information
becomes available. For a graphical illustration of the above principle see Figure 5.1.
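
The decision rule can be sketched numerically as follows. The conditional density pi(xi | ℑi) is the output of the modelling discussed later in the chapter. Here a Weibull density is substituted purely as a stand-in. The cost figures, the monitoring times and the function names are hypothetical, and "no minimum found" is interpreted as the cost still decreasing at the next monitoring point.

```python
import math


def weibull_pdf(x, shape, scale):
    """Stand-in for p_i(x_i | history); NOT the chapter's filtering result."""
    if x <= 0.0:
        return 0.0
    z = x / scale
    return (shape / scale) * z ** (shape - 1.0) * math.exp(-(z ** shape))


def integrate(func, a, b, n=400):
    """Composite trapezoidal rule on [a, b]."""
    if b <= a:
        return 0.0
    h = (b - a) / n
    total = 0.5 * (func(a) + func(b)) + sum(func(a + k * h) for k in range(1, n))
    return total * h


def expected_cost_rate(t, t_i, i, pdf, c_f, c_p, c_m):
    """Equation 5.1 for a preventive replacement planned at time t > t_i."""
    horizon = t - t_i
    p_fail = integrate(pdf, 0.0, horizon)
    partial_mean = integrate(lambda x: x * pdf(x), 0.0, horizon)
    numerator = (c_f - c_p) * p_fail + c_p + i * c_m
    denominator = t_i + horizon * (1.0 - p_fail) + partial_mean
    return numerator / denominator


def recommend(t_i, t_next, i, pdf, c_f, c_p, c_m, steps=100):
    """Return the best replacement time before t_next, or None to keep running."""
    grid = [t_i + (t_next - t_i) * k / steps for k in range(1, steps + 1)]
    costs = [expected_cost_rate(t, t_i, i, pdf, c_f, c_p, c_m) for t in grid]
    best = min(range(len(grid)), key=costs.__getitem__)
    if best == len(grid) - 1:
        return None  # cost still falling at the next monitoring point
    return grid[best]


# Hypothetical figures: 5th reading at t = 100, next reading due at t = 120.
costs = dict(c_f=6000.0, c_p=2000.0, c_m=30.0)
short_life = recommend(100.0, 120.0, 5, lambda x: weibull_pdf(x, 3.0, 10.0), **costs)
long_life = recommend(100.0, 120.0, 5, lambda x: weibull_pdf(x, 3.0, 200.0), **costs)
print(short_life, long_life)
```

With a short expected residual life the cost rate has a minimum shortly after the current monitoring point. With a long expected residual life the cost rate is still decreasing at the next monitoring point, so the sketch returns None and the decision is deferred.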


[Figure 5.1. A graph to show the optimal replacement time. Horizontal axis: time t,
marking the current time ti, the optimal replacement time t* and the next monitoring
time ti+1. Curve labels: "No replacement is recommended" and "Optimal replacement time".]

Obviously the key element in Equation 5.1 is the determination of pi(xi | ℑi),
which is the topic of the next two sections.

5.3.2 Modelling pi ( xi | ℑi )

Before we proceed to the discussion of the modelling of pi(xi | ℑi), there are a few
issues that need clarification.
The first relates to the concept of direct and indirect monitoring (Christer and
Wang 1995). In direct monitoring, the actual condition of the item, say the depth of
a brake pad, can be observed, and a critical level, say C, can be set up. In the
indirect monitoring case, by contrast, we can only collect measurements related to
the actual condition of the item monitored in a stochastic manner. For example, in the
vibration monitoring case, if a high vibration signal is observed we may suspect that the
item’s condition is bad, but we can neither know its exact condition nor
quantify it. For directly monitored systems, Markov models are popular;
see Black et al. (2005), Chen and Trivedi (2005), and Love (2000). Counting
processes have also been used for modelling the deterioration of directly monitored
plant; see Aven (1996) and Jensen (1992). Christer and Wang (1992) used a random
coefficient model for a directly monitored case. It is noted however that the
majority of condition monitoring applications are indirect monitoring, such as the
five popular monitoring techniques discussed earlier. This chapter therefore
concentrates on indirect monitoring cases.
The second issue is the appropriate definition of the plant state. This also
relates to the first issue of whether the monitoring is direct or indirect. In direct
monitoring, the actual observed condition of the item is clearly the plant state,
while in the indirect monitoring case we can only observe measures indirectly
related to the actual condition of the item monitored, as discussed earlier. The
simplest and most intuitive definition is a set of categorical states ranging, say, from 0
(new) to N (failed), as seen in Markov based models (Baruah and Chinnam
2005). Wang (2006a) also used a generic term of wear to represent the state of the
monitored plant, which is particularly useful in modelling wear related problems in
condition monitoring. Wang and Christer (2000) first used the residual life at the
time of checking as a measure of the state of the monitored unit of interest. This
definition provides an immediate modeling means to establish directly a link
between the measured information and the residual life of interest. It is noted how-
ever, that this residual life is usually not observable which increases modeling
complexity. A model of pi ( xi | ℑi ) introduced later will be based on this definition.
Various methods or models have been proposed in the literature to
formulate and calculate pi(xi | ℑi). Proportional Hazard Modelling (PHM, one
particular and natural form for modelling the hazard) is a popular one; see Kumar and
Westberg (1997), Love and Guo (1991), Makis and Jardine (1991), Jardine et al.
(1998), Banjevic et al. (2001). Accelerated life models (Kalbfleisch and Prentice
1980; Wang and Zhang 2005) could also be used here, and may be more appro-
priate since the analogy between accelerated life testing, where these models origi-
nate, and condition monitoring is a close one. It should be noted that accelerated
life models and proportional hazard models are identical when the time to failure
distribution is Weibull, that is when the hazard function is given by

h (t ) = α β t β −1 .
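
The equivalence can be checked directly. In the short derivation below the covariate z, the coefficient γ and the time-scaling factor ψ(z) are introduced only for illustration and are not notation used in the chapter. PHM multiplies the baseline hazard by a covariate factor, while an accelerated life model (ALM) rescales time:

```latex
\begin{align*}
\text{PHM:}\quad h(t \mid z) &= h(t)\,e^{\gamma z}
    = \alpha\beta\,t^{\beta-1}\,e^{\gamma z},\\
\text{ALM:}\quad h(t \mid z) &= \psi(z)\,h\bigl(\psi(z)\,t\bigr)
    = \psi(z)\,\alpha\beta\,\bigl(\psi(z)\,t\bigr)^{\beta-1}
    = \alpha\beta\,t^{\beta-1}\,\psi(z)^{\beta}.
\end{align*}
```

The two expressions coincide when ψ(z)^β = e^{γz}, that is, ψ(z) = e^{γz/β}, so for a Weibull baseline each model is a reparameterisation of the other.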

There are two problems with proportional hazards modelling or accelerated life
models in condition based maintenance. The first is that the current hazard is
determined partially by the current monitoring measurements and the full
monitoring history is not used. The second is the assumption that the hazard or the
life is a function of the observed monitoring data, which acts directly on the hazard
via a covariate function. Both problems relate to the modelling assumption rather
than the technique. The first can be overcome if some sort of transformation of the
observed data is used. The second problem remains unless the nature of the
monitoring justifies the assumption. It is noted however that, for most condition
monitoring techniques, the observed monitoring measurements are concomitant types of
information which are a function of the underlying plant state. A typical example is in
vibration monitoring, where a high level of vibration is usually caused by a hidden
defect but not vice versa, as we have discussed earlier. In this case the observed
vibration signals may be regarded as concomitant variables which are caused by the plant
state. Note that in oil based monitoring things are different, as the metal particles
and other contaminants observed in the oil can be regarded both as concomitant
variables and covariates, as we discussed earlier. In this case a model that considers
both types of variable might be appropriate.
The last decade has seen an increased use of stochastic filtering and Hidden
Markov Models (HMM) for modelling pi ( xi | ℑi ) in condition based maintenance;
see Hontelez et al. (1996), Christer et al. (1997), Wang and Christer (2000), Bunks
et al. (2000), Dong and He (2004), Lin and Makis (2003, 2004), Baruah and
Chinnam (2005), and Wang (2006a). These techniques overcome both problems of
PHM and provide a flexible way to model the relationship between the observed
signals and unobserved plant state. HMM can be seen as a specific type of sto-
chastic filtering models that are usually used for discrete state and observation
variables. If the noise factors in the model are not Gaussian, then a closed form for
pi ( xi | ℑi ) is generally not available and one has to resort to numerical approxi-
mations. A comparison study using both filtering (Wang 2002) and PHM (Makis
and Jardine 1991) based on vibration data revealed that the filtering based model
produced a better result in terms of prediction accuracy (Matthew and Wang 2006).
It should also be noted that if the monitored variables also influence the state to
some extent, then both HMM and PHM should be used to tackle the problem.
Alternatively an interactive HMM can also be formulated where a bilateral
relationship is assumed between the observed and unobserved. In the next section,
we shall discuss in detail a specific filtering model used for the derivation of
pi ( xi | ℑi ) . This model is simple to use and is analytically tractable.
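
To make the filtering idea concrete, the following minimal sketch runs the standard forward (filtering) recursion of a discrete hidden Markov model. It is not the residual life model of the next section. The two hidden states (normal, defective), the two observation levels (low, high) and all probabilities are hypothetical numbers chosen for illustration.

```python
def forward_filter(prior, transition, emission, observations):
    """Return P(state | observations so far) after each observation.

    prior[s]         : initial probability of hidden state s
    transition[r][s] : probability of moving from state r to state s
    emission[s][o]   : probability of observing symbol o in state s
    """
    n = len(prior)
    belief = list(prior)
    history = []
    for step, obs in enumerate(observations):
        if step > 0:
            # Prediction: propagate the belief through the state dynamics.
            belief = [sum(belief[r] * transition[r][s] for r in range(n))
                      for s in range(n)]
        # Update: weight by the likelihood of the new observation, normalise.
        weighted = [belief[s] * emission[s][obs] for s in range(n)]
        total = sum(weighted)
        if total == 0.0:
            raise ValueError("observation has zero probability under the model")
        belief = [w / total for w in weighted]
        history.append(belief)
    return history


# Hypothetical model: state 0 = normal, state 1 = defective (absorbing).
prior = [0.95, 0.05]
transition = [[0.9, 0.1],
              [0.0, 1.0]]
# Observation 0 = low signal, 1 = high signal.
emission = [[0.8, 0.2],
            [0.3, 0.7]]

history = forward_filter(prior, transition, emission, [0, 1, 1, 1])
for step, belief in enumerate(history):
    print(step, [round(p, 3) for p in belief])
```

The observations do not act on the hidden state. They only revise the belief about it, which is the concomitant-variable view described above. In this example, repeated high readings steadily raise the probability that the item is defective.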

5.4 Conditional Residual Life Prediction

First we define the true state of plant as the residual life conditional upon measured
condition related information to date, such as, vibration, temperature, etc.
Next we assume these conditional pieces of information are functions of the
residual life, that is, it is the residual life which controls the behavior of the
measured conditional information, but not vice versa (this assumption can be
relaxed). Generally we expect that a short residual life (depending on the severity
of the defect) will generate a high signal level in some of the measures of condition
variables, though in a typical stochastic fashion. In theory, we may have the
following relationship:

Defect → Short residual life → Higher than normal signal may be observed.

If the severity of the defect is represented by the length of the residual life, the
relationship between the residual life and observed condition related variables

5.4.1 Conditional Residual Life Prediction

The model is built based on the following assumptions:

1. Plant items are monitored regularly at discrete time points.

2. There are two periods in the plant life where the first period is the time
length from new to the point when the item was first identified to be faulty,
and the second period is the time interval from this point to failure if no
maintenance intervention is carried out. The second period is often called
the failure delay time. It is also assumed that these two periods are statisti-
cally independent from each other.
3. A threshold level is established to classify the item monitored to be in a
potentially faulty state if the condition information signal is above the level.
Such a threshold level is usually determined by engineering experience or by
a statistical analysis of measured condition related variables.
4. The conditional information obtained at time ti , yi , during the failure
delay time is a random variable which depends on xi .

Assumptions 1 and 2 can often be observed in condition monitoring practice.

Assumption 3 can be relaxed and a model which can both identify the starting point
of the second stage and residual life prediction can be established (Wang 2006b).
For now, to keep the model simple we still use assumption 3. Assumption 4 was
first proposed in Wang and Christer (2000), which states that the rapid increase in
the observed condition information is partly due to the shortened residual life
because of the hidden defect. However this relationship is contaminated with ran-
dom noise. Assumption 4 is the fundamental principle underpinning our model. For
a detailed discussion on assumption 4 see Wang and Christer (2000).
Because the interest in residual life prediction is over the failure delay time
(assuming it exists) and the information collected over the normal working period
may not be beneficial for residual life prediction, we revise our notation on ti as
the ith and the current monitoring time since the item was suspected to be faulty
but still operating (note that the order starts from the moment when the item was
first identified to be possibly faulty). This implies that t1 is the first monitoring
point which may indicate that the second stage has started. However, some moni-
toring may not be able to display a two-stage process such as oil based monitoring.
If this is the case, we can simply set the threshold level to be zero. Figure 5.2
shows a typical condition monitoring practice.
It is noted from Figure 5.2 that the conditional information obtained before t1
is not used since it is irrelevant to the decision making process. Note, however,
that the time to t1 is an important source of information for determining
the condition monitoring interval (Wang 2003).
Since the residual life at ti is the residual life at ti −1 minus the interval between
ti and ti −1 provided the item has survived to ti and no maintenance action has been
taken, it follows that

$$
X_i = \begin{cases}
X_{i-1} - (t_i - t_{i-1}) & \text{if } X_{i-1} > t_i - t_{i-1} \\
\text{not defined} & \text{otherwise}
\end{cases} \tag{5.2}
$$


Figure 5.2. Condition monitoring practice (time axis: 0, t1, t2, t3, failure; the plot marks the threshold level, the readings y1 and y2, and the residual lives x2 and x3)

The relationship between Yi and X i is yet to be identified. From assumption 4 we

know that it can be described by a distribution, say, p( yi | xi ) . We will discuss this
later when fitting the model to data.
We wish to establish the expression of pi ( xi | ℑi ) , and therefore a consequen-
tial decision model can be constructed on the basis of such a conditional probabil-
ity; see Equation 5.1. Since ℑi = { y1 , y2 ,..., yi } = { yi , ℑi −1} , then pi ( xi | ℑi ) can be
expressed as pi ( xi | ℑi ) = p( xi | yi , ℑi −1 ) . It follows that

$$
p_i(x_i \mid \Im_i) = p(x_i \mid y_i, \Im_{i-1}) = \frac{p(x_i, y_i \mid \Im_{i-1})}{p(y_i \mid \Im_{i-1})} \tag{5.3}
$$

By using the multiplicative rule, the joint distribution, p ( xi , yi | ℑi −1 ) is given as

p ( xi , yi | ℑi −1 ) = p( yi | xi , ℑi −1 ) p ( xi | ℑi −1 ) (5.4)

Given both xi and ℑi−1 , yi depends on xi only (assumption 4), so
Equation 5.4 reduces to

p ( xi , yi | ℑi −1 ) = p ( yi | xi , ℑi −1 ) p( xi | ℑi −1 ) = p( yi | xi ) p( xi | ℑi −1 ) (5.5)

Integrating out the xi term in Equation 5.5 we have

$$
p(y_i \mid \Im_{i-1}) = \int_0^{\infty} p(x_i, y_i \mid \Im_{i-1})\,dx_i = \int_0^{\infty} p(y_i \mid x_i)\,p(x_i \mid \Im_{i-1})\,dx_i \tag{5.6}
$$

We focus our attention on p ( xi | ℑi −1 ) , which appears in both Equation 5.4 and
Equation 5.6.
From Equation 5.2 we have xi −1 = g ( xi ) = xi + (ti − ti −1 ) conditional on
X i −1 > ti − ti −1 . Then the distribution of X i | ℑi −1 can be expressed by a transfor-
mation of variables from X i to X i −1 (Freund 2004) as

$$
p(x_i \mid \Im_{i-1}) = p_{i-1}\big(g(x_i) \mid \Im_{i-1},\, X_{i-1} > t_i - t_{i-1}\big)\,\frac{dg(x_i)}{dx_i} \tag{5.7}
$$

Since $dg(x_i)/dx_i = 1$ and

$$
p_{i-1}\big(g(x_i) \mid \Im_{i-1},\, X_{i-1} > t_i - t_{i-1}\big) = \frac{p_{i-1}(g(x_i) \mid \Im_{i-1})}{\int_{t_i - t_{i-1}}^{\infty} p_{i-1}(x_{i-1} \mid \Im_{i-1})\,dx_{i-1}} \tag{5.8}
$$

we finally have

$$
p(x_i \mid \Im_{i-1}) = \frac{p_{i-1}(x_i + t_i - t_{i-1} \mid \Im_{i-1})}{\int_{t_i - t_{i-1}}^{\infty} p_{i-1}(x_{i-1} \mid \Im_{i-1})\,dx_{i-1}} \tag{5.9}
$$

Using Equations 5.5, 5.6 and 5.9, Equation 5.3 becomes

$$
p_i(x_i \mid \Im_i) = \frac{p(y_i \mid x_i)\,p_{i-1}(x_i + t_i - t_{i-1} \mid \Im_{i-1})}{\int_0^{\infty} p(y_i \mid x_i)\,p_{i-1}(x_i + t_i - t_{i-1} \mid \Im_{i-1})\,dx_i} \tag{5.10}
$$

which is a recursive equation starting at time t1 . At time t1 , using Equation
5.10, we have

$$
p_1(x_1 \mid \Im_1) = \frac{p(y_1 \mid x_1)\,p_0(x_1 + t_1 - t_0 \mid \Im_0)}{\int_0^{\infty} p(y_1 \mid x_1)\,p_0(x_1 + t_1 - t_0 \mid \Im_0)\,dx_1} \tag{5.11}
$$

Since ℑ0 is usually empty or not available, p0 ( x1 + t1 − t0 | ℑ0 ) = p0 ( x1 + t1 − t0 ) . Then,
if p0 ( x0 ) and p( y1 | x1 ) can be specified, Equation 5.11 can be determined.
Similarly we can proceed to determining pi ( xi | ℑi ) if pi −1 ( xi −1 | ℑi −1 ) and
p ( yi | xi ) are available from the previous step calculation at time ti −1 .
Now the task is how to specify p0 ( x0 ) and p ( yi | xi ) .
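The recursion in Equation 5.10 can be sketched numerically on a grid of residual-life values. The following is a minimal illustration only, not the implementation used in the cited work: the function names `trapezoid` and `filter_step`, the grid, and the exponential prior with a flat likelihood in the sanity check are all assumptions introduced here.

```python
import math


def trapezoid(values, step):
    """Composite trapezoidal rule for samples on an equally spaced grid."""
    return step * (sum(values) - 0.5 * (values[0] + values[-1]))


def filter_step(prev_density, likelihood, y, dt, grid):
    """One pass of the recursion: tabulate p_i(x | history up to t_i) on `grid`.

    prev_density(x)  -- p_{i-1}(x | history up to t_{i-1}), a callable
    likelihood(y, x) -- p(y_i | x_i), a callable
    y                -- reading observed at t_i
    dt               -- t_i - t_{i-1}
    grid             -- equally spaced residual-life values starting at 0
    """
    step = grid[1] - grid[0]
    # Numerator of the recursion, evaluated pointwise on the grid.
    unnormalised = [likelihood(y, x) * prev_density(x + dt) for x in grid]
    # Denominator: the same expression integrated over x_i.
    norm = trapezoid(unnormalised, step)
    return [u / norm for u in unnormalised]


# Sanity check with hypothetical inputs: an exponential prior and a reading
# that carries no information (flat likelihood). The exponential law is
# memoryless, so the updated density should again be exponential.
RATE = 0.1
grid = [0.01 * k for k in range(20001)]   # residual life from 0 to 200
posterior = filter_step(
    prev_density=lambda x: RATE * math.exp(-RATE * x),
    likelihood=lambda y, x: 1.0,
    y=0.0,
    dt=5.0,
    grid=grid,
)
print(posterior[0])   # should be close to RATE
```

To chain several steps, the tabulated density would have to be turned back into a callable (for example by linear interpolation) before the next call; that part is left out to keep the sketch short.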

5.4.2 Specification of p0 ( x0 ) and p ( yi | xi )

p0 ( x0 ) is just the delay time distribution over the second stage of the plant life.
Here we use the Weibull distribution as an example. In practice, the density
function p0 ( x0 ) should be the one that best fits the data or that follows
from known theory.
The set-up of the p ( yi | xi ) term requires more attention. Here we follow the
one used in Wang (2002), where yi | xi is assumed to follow a Weibull distribution
with the scale parameter being equal to the inverse of $A + Be^{-cx_i}$. In this way we
establish a negative correlation between yi and xi as expected, that is
$E(Y_i \mid X_i = x_i) \propto A + Be^{-cx_i}$. The pdf is given below:

$$
p(y_i \mid x_i) = \frac{\eta}{A + Be^{-cx_i}} \left( \frac{y_i}{A + Be^{-cx_i}} \right)^{\eta - 1} e^{-\left( \frac{y_i}{A + Be^{-cx_i}} \right)^{\eta}} . \tag{5.12}
$$

This is a concept called floating scale parameter, which is particularly useful in this
case (Wang 2002). There are other choices to model the relationship between yi
and xi , but these will not be discussed here, and can be found in Wang (2006a).
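As a rough illustration of the floating scale parameter, the sketch below evaluates the density of Equation 5.12 and its conditional mean for two residual lives. The parameter values are hypothetical and chosen only to show the shape of the relationship; the function names are likewise assumptions. Note that the variable `scale` in the code is the conventional Weibull scale, A + B exp(−c x), whose inverse is what the text calls the scale parameter.

```python
import math


def obs_density(y, x, A, B, c, eta):
    """Weibull pdf of a reading y given residual life x (form of Equation 5.12)."""
    scale = A + B * math.exp(-c * x)   # grows as the residual life x shrinks
    u = y / scale
    return (eta / scale) * u ** (eta - 1.0) * math.exp(-(u ** eta))


def obs_mean(x, A, B, c, eta):
    """E[Y | X = x] for the density above: scale * Gamma(1 + 1/eta)."""
    return (A + B * math.exp(-c * x)) * math.gamma(1.0 + 1.0 / eta)


# Hypothetical parameter values (not estimates from any data set).
params = dict(A=5.0, B=20.0, c=0.05, eta=4.0)

mean_short_life = obs_mean(10.0, **params)    # little life left
mean_long_life = obs_mean(100.0, **params)    # plenty of life left

# Crude check that the pdf integrates to one over y for a fixed x.
step = 0.01
total = step * sum(obs_density(step * k, 10.0, **params) for k in range(1, 10001))
print(mean_short_life, mean_long_life, total)
```

The expected reading is larger when the residual life is short, which is the negative correlation the model is designed to capture.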

5.4.3 Estimating the Model Parameters Within pi ( xi | ℑi )

To calculate the actual pi ( xi | ℑi ) we need to know the values for the model
parameters. They are the parameters of p0 ( x0 ) and p ( yi | xi ) . The most popular
way to estimate them is using the method of maximum likelihood.
At each monitoring point, ti , two pieces of information are available, namely, yi
and X i −1 > ti − ti −1 , both conditional on ℑi−1 . The pdf of yi | ℑi −1 is given by
Equation 5.6 and the probability of X i −1 > ti − ti −1 | ℑi −1 is given by

$$
P(X_{i-1} > t_i - t_{i-1} \mid \Im_{i-1}) = \int_{t_i - t_{i-1}}^{\infty} p_{i-1}(x_{i-1} \mid \Im_{i-1})\,dx_{i-1} \tag{5.13}
$$

If the item monitored failed at time t f after the last monitoring at time t n , the
complete likelihood function is then given by

$$
L(\Theta) = \left( \prod_{i=1}^{n} p(y_i \mid \Im_{i-1}) \int_{t_i - t_{i-1}}^{\infty} p_{i-1}(x_{i-1} \mid \Im_{i-1})\,dx_{i-1} \right) p_n(t_f - t_n \mid \Im_n) \tag{5.14}
$$

where Θ is the set of parameters to be estimated. Taking logs on both sides of

Equation 5.14 and maximising it in terms of unknown parameters should give the
estimated values of those parameters. However, computationally it has to be solved
numerically since Equation 5.14 involves many integrals which may not have
analytical solutions.

5.4.4 A Case Study

Figure 5.3 shows the data of overall vibration level in rms of six bearings, which is
from a fatigue experiment (Wang 2002). It can be seen from Figure 5.3 that the
bearing lives vary from around 100 h to over 1000 h, which shows a typical sto-
chastic nature of the life distribution. The monitored vibration signals also indicate
an increasing trend with bearing ages in all cases, but with different paths. An
important observation is the pattern of vibration signals which stays relatively flat
in the early stage of the bearing life and then increases rapidly (a defect may have
been initiated). This indicates the existence of the two stage failure process as
defined earlier.

Figure 5.3. Vibration data of six bearings

The initial point of the second stage in these bearings is identified using a
control chart called the Shewhart average level chart and the threshold levels of the
bearings are shown in Table 5.1 (Zhang 2004).

Table 5.1. Threshold level for each bearing

Bearing Threshold level
1 5.06
2 5.62
3 4.15
4 5.14
5 3.92
6 4.9

Assuming both distributions for p0 ( x0 ) and p ( yi | xi ) are Weibull where

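$$
p_0(x_0) = \alpha\beta(\alpha x_0)^{\beta - 1} e^{-(\alpha x_0)^{\beta}}
$$

$$
p(y_i \mid x_i) = \frac{\eta}{A + Be^{-cx_i}} \left( \frac{y_i}{A + Be^{-cx_i}} \right)^{\eta - 1} e^{-\left( \frac{y_i}{A + Be^{-cx_i}} \right)^{\eta}}
$$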

then starting from t1 and after recursive filtering we have

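$$
p_i(x_i \mid \Im_i) = \frac{(x_i + t_i)^{\beta - 1} e^{-(\alpha(x_i + t_i))^{\beta}} \prod_{k=1}^{i} \psi_k(x_i, t_i)}{\int_0^{\infty} (z + t_i)^{\beta - 1} e^{-(\alpha(z + t_i))^{\beta}} \prod_{k=1}^{i} \psi_k(z, t_i)\,dz} \tag{5.15}
$$

where

$$
\psi_k(z, t_i) = \frac{e^{-\left( y_k \left( A + Be^{-C(z + t_i - t_k)} \right)^{-1} \right)^{\eta}}}{A + Be^{-C(z + t_i - t_k)}}
$$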

To estimate the parameters in p0 ( x0 ) and p ( yi | xi ) we need to write down the
likelihood function as in Equation 5.14. The actual process of estimating these unknown
parameters is complicated and involves heavy numerical manipulation, which we
omit; interested readers can find the details in Zhang (2004). The estimated
result is listed in Table 5.2.

Table 5.2. Estimated parameter values in p0 ( x0 ) and p ( yi | xi )

α̂ β̂ Â B̂ Ĉ η̂
0.011 1.873 7.069 27.089 0.053 4.559
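
To give a feel for how the estimates in Table 5.2 translate into a residual life distribution, the sketch below builds the posterior directly from Equations 5.10 to 5.12. Unrolling the recursion back to t1 gives a density proportional to p0(x + ti) multiplied by p(yk | x + ti − tk) for k = 1, ..., i, taking t0 = 0. The monitoring times and readings are hypothetical (the bearing data of Figure 5.3 are not reproduced here), and the function names and grid are assumptions.

```python
import math

# Estimates reported in Table 5.2.
ALPHA, BETA = 0.011, 1.873
A, B, C, ETA = 7.069, 27.089, 0.053, 4.559


def delay_time_density(x0):
    """Weibull p0(x0) in the parameterisation used in the text."""
    return ALPHA * BETA * (ALPHA * x0) ** (BETA - 1.0) * math.exp(-((ALPHA * x0) ** BETA))


def obs_density(y, x):
    """Weibull p(y | x) with floating scale A + B*exp(-C*x)."""
    scale = A + B * math.exp(-C * x)
    u = y / scale
    return (ETA / scale) * u ** (ETA - 1.0) * math.exp(-(u ** ETA))


def residual_life_posterior(times, readings, grid):
    """Tabulate p_i(x | y_1..y_i) on an equally spaced grid, taking t_0 = 0."""
    t_i = times[-1]
    weights = []
    for x in grid:
        w = delay_time_density(x + t_i)
        for t_k, y_k in zip(times, readings):
            # Residual life at t_k implied by residual life x at t_i.
            w *= obs_density(y_k, x + t_i - t_k)
        weights.append(w)
    step = grid[1] - grid[0]
    norm = step * (sum(weights) - 0.5 * (weights[0] + weights[-1]))
    return [w / norm for w in weights]


def mean_on_grid(density, grid):
    """Mean of a tabulated density by the trapezoidal rule."""
    step = grid[1] - grid[0]
    values = [x * d for x, d in zip(grid, density)]
    return step * (sum(values) - 0.5 * (values[0] + values[-1]))


grid = [0.25 * k for k in range(1601)]   # residual life 0 .. 400 h
times = [10.0, 20.0, 30.0]               # hypothetical monitoring times (h)
rising = [12.0, 16.0, 22.0]              # hypothetical readings, rising fast
steady = [8.0, 8.5, 9.0]                 # hypothetical readings, nearly flat

mean_rising = mean_on_grid(residual_life_posterior(times, rising, grid), grid)
mean_steady = mean_on_grid(residual_life_posterior(times, steady, grid), grid)
print(mean_rising, mean_steady)
```

With these inputs the rapidly rising readings lead to a much shorter expected residual life than the nearly flat ones, which is the qualitative behaviour the model is meant to reproduce.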

Based on the estimated parameter values in Table 5.2 and Equation 5.15 the
predicted residual life at some monitoring points given the history information of
bearing 6 in Figure 5.3 is plotted in Figure 5.4.
In Figure 5.4 the actual residual lives at those checking points are also plotted
with symbol *. It can be seen that actual residual lives are well within the predicted
residual life distribution as expected.
Given the estimated values for parameters and associated costs such as
cf = 6000 , cp = 2000 and cm = 30 (Wang and Jia 2001), we have the expected
cost per unit time for one of the bearings at various checking times t, shown in
Figure 5.5.

Figure 5.4. Predicted condition residual life of bearing 6

Figure 5.5. Expected cost per unit time vs. planned replacement time in hours from the
current time t (curves for t = 80.5, 92.5, 104, 116.5 and 129 h; planned replacement time from 0 to 30 h)

It can be seen from Figure 5.5 that at both t = 116.5 and t = 129 h a planned
replacement is recommended within the next 30 h.
To illustrate an alternative decision chart in terms of the actual condition
monitoring reading, we transformed the cost related decision into actual reading in
Figure 5.6 where the dark grey area indicates that if the reading falls within this area
a preventive replacement is required within the planning period of consideration.
The advantage of Figure 5.6 is that it can not only tell us whether a preventive
replacement is needed but also show us how far the reading is from the area of pre-
ventive replacement so that appropriate preparation can be done before the actual

Figure 5.6. Decision chart using observed CM reading (observed CM reading against the time (age in hours) at which the reading was taken: 80.5, 92.5, 104, 116.5 and 129 h; the chart is divided into a preventive replacement area and a no preventive replacement area)

The transformation is carried out in this way – at each monitoring point of ti ,

by gradually changing the value of yi in pi ( xi | ℑi ) used in Equation 5.1 until a
preventive replacement is recommended by the model within the planning period,
and then marking this value of yi as the threshold value at time ti . Connecting
these threshold values at those monitoring points forms the boundary between the
light and dark grey areas. Finally mark the actual reading of yi on the graph to see
which area it falls in.

5.5 Future Research Directions

5.5.1 Multi-component Systems

Previous condition based prognosis models developed in the literature mainly

focused on a single failure mode system subject to routine monitoring and replace-
ment such as bearings, pumps and motors, and various probability distributions are
used to describe the lifetime of the component. In the case of a high value and high
risk system with many components such as aircraft engines and gas turbines, how
to assess the health condition and make prognosis based on condition information
obtained from all components is still an open question. It is typical with a multi-
component system that many observed signal parameters are available and the
times between failures are neither independent nor identical.

5.5.2 Identification of the Initial Point of a Random Defect

With the delay time concept (see Chapter 14), system life is assumed to be
classified into two stages. The first is the normal working stage where no abnormal
condition parameters are to be expected. The second starts when a hidden defect is
first initiated with possible abnormal signals. The identification of the initial point
in the evolution of such a defect is important and has a direct impact on the
subsequent prediction model. Most research on fault diagnosis focuses on the

location of the fault, the possible cause of the fault and, of course, the type of fault.
This serves the engineering purpose of deciding what to repair, but does not aid the
decision of when to do the task. This initial point defect identification has received
very little attention in prognosis literature. Wang (2006b) addressed this problem to
some extent using a combination of the delay time concept and the HMM. Much
work still remains. It is possible that a multi-stage (>2) failure process could be
used, which might be more appropriate in some cases.

5.5.3 The Definition of Plant State

The definition of the underlying state and the relationship between the observed
monitoring parameters and the state of the system are issues which still need
attention. In the model presented in this chapter, the state of the system is defined
as the residual life, which is assumed to influence the observed signal parameters.
Whilst the modelling output appears to make sense, there are a few potential
problems with the approach. The first is the issue that the life of the plant is fixed
at birth (installation) but unknown. This is termed as playing God. Second, the
residual life is not the direct cause of the observed abnormal signals. These are
more likely caused by some hidden defects which are linked to the residual life in
this chapter. To correct the first problem we can introduce another equation
describing the relationship between X i and X i −1 deterministically or randomly.
This will allow X i to change during use, which is more appropriate. If the
relationship is deterministic, then a closed form of Equation 5.3 is still available,
but if it is random, HMM must be used and no closed form of Equation 5.3 exists
unless the noises associated are normally distributed. The second problem can be
overcome if we adopt a discrete or continuous state hidden Markov chain to de-
scribe the system deterioration process where the state space of the chain repre-
sents the system state in question.

5.5.4 Information Fusion

There is now a considerable amount of condition monitoring and process control

information available in industry, thanks to recent developments in condition
monitoring technology. It is noted that not all information is useful, or because of
correlation one may obtain similar information. There are two ways to deal with
this. One is to use some statistical methods to reduce the dimension of the original
data such as principal component analysis, and the other is to use multi-variate
distributions. The principal component analysis method has been used in Wang and
Zhang (2005), but unless the first principal component accounts for most of the
variation in the original data we still need to deal with a data set with more than
two dimensions. The use of multi-variate distributions in prognosis has not been
reported apart from the normal distribution which has the drawback of producing
negative values.
A final point worth mentioning is that, in practice, observed condition moni-
toring variables could be concomitant variables or covariates with respect to the
system state. A model which can handle both types of information is ideal, but very
few attempts have been made (Hussin and Wang 2006).

5.6 Summary and Conclusions

This chapter introduces the concept of condition monitoring, key condition moni-
toring techniques, condition based maintenance and associated modelling support
in aid of condition based maintenance. Particular attention is paid to the residual
time prediction based on available condition information to date. An important
development made here is the establishment of the relationship between the ob-
served information and underlying condition which is the residual life in this case.
This is achieved by letting the mean of the observed information at ti be a function
of the residual life at that point conditional on X i = xi . The mathematical develop-
ment is based on a recursive algorithm called filtering where all past information is
included. The example illustrated is based on real data which came from a fatigue
experiment. However, data from industry has shown the robustness of the approach
and the residual life predictions conducted so far are satisfactory.

5.7 References
Aghjagan, H.N., (1989) Lubeoil analysis expert system, Canadian Maintenance Engineering
Conference, Toronto.
Aven, T., (1996) Condition based replacement policies – a counting process approach, Rel.
Eng. & Sys. Safety, 51(3), 275–281.
Banjevic, D., Jardine, A.K.S., Makis, V. and Ennis, M., (2001) A control-limit policy and
software for condition based maintenance optimization, INFOR 39(1), 32–50.
Baruah, P. and Chinnam R.B., (2005) HMM for diagnostics and prognostics in machining
processes, I. J. Prod. Res., 43(6), 1275–1293.
Black, M., Brint, A.T. and Brailsford J.R., (2005) A semi-Markov approach for modelling
asset deterioration, J. Opl. Res. Soc. 56(11), 1241–1249.
Bunks C., McCarthy, D. and Al-Ani T., (2000) Condition based maintenance of machine
using hidden Markov models, Mech. Sys. & Sig. Pro., 14(4), 597–612.
Chen, D. and Trivedi, K.S., (2005) Optimization for condition based maintenance with
semi-Markov decision process, Rel. Eng. & Sys. Safety, 90(1), 25–29.
Chen, W., Meher-Homji, C.B. and Mistree, F., (1994) COMPROMISE: an effective
approach for condition-based maintenance management of gas turbines. Engineering
Optimization, 22, 185–201.
Christer, A.H., Wang, W. and Sharp, J.M., (1997) A state space condition monitoring
model for furnace erosion prediction and replacement, Euro. J. Opl. Res., 101, 1–14.
Christer, A.H. and Wang, W., (1992) A model of condition monitoring inspection of
production plant, I. J. Prod. Res., 30, 2199–2211.
Christer A.H and Wang, W., (1995) A simple condition monitoring model for a direct
monitoring process, E. J. Opl. Res., 82, 258–269.
Collacott, R.A., (1977) Mechanical fault diagnosis and condition monitoring, Chapman and
Hall Ltd., London.
Dong M. and He, D., (2004) Hidden semi-Markov models for machinery health diagnosis
and prognosis, Trans. North Amer. Manu. Res. Ins. of SME, 32, 199–206.
Drake, P.R., Jennings, A.D., Grosvenor, R.I. and Whittleton, D., (1995) acquisition system
for machine tool condition monitoring. Quality and Reliability Engineering Inter-
national 11, 15–26.
Freund, J.E., (2004) Mathematical statistics with applications, Pearson Prentice and Hall,
Harrison, N., (1995) Oil condition monitoring for the railway business. Insight 37, 278–283.
Hontelez, J.A.M., Burger, H.H. and Wijnmalen, D.J.D., (1996) Optimum condition based
maintenance policies for deteriorating systems with partial information, Rel. Eng. & Sys.
Safety, 51(3), 267–274.
Hussin, B., and Wang, W., (2006) Conditional residual time modelling using oil analysis: a
mixed condition information using accumulated metal concentration and lubricant
measurements, to appear in Proc. 1st Main. Eng. Conf, Chengdu, China.
Jardine, A.K.S., Makis, V., Banjevic, D., Braticevic, D. and Ennis, M., (1998) A decision
optimization model for condition based maintenance, J. Qua. Main. Eng., 4(2), 115–
Jensen, U., (1992) Optimal replacement rules based on different information level, Naval
Res. Log. 39, 937–955.
Kalbfleisch, J.D. and Prentice, R.L., (1980) The Statistical Analysis of Failure Time Data.
Wiley, New York.
Kumar, D. and Westberg, U., (1997) Maintenance scheduling under age replacement policy
using proportional hazard modelling and total-time-on-test plotting, Euro. J. Opl. Res.,
99, 507–515.
Li, C.J. and Li, S.Y., (1995) Acoustic emission analysis for bearing condition monitoring.
Wear 185, 67–74.
Lin, D. and Makis, V., (2003) Recursive filters for a partially observable system subject to
random failures, Adv. Appl. Prob., 35(1), 207–227.
Lin D. and Makis, V., (2004) Filters and parameter estimation for a partially observable
system subject to random failures with continuous-range observations, Adv. Appl. Prob.,
36(4), 1212–1230.
Love C.E., Zhang Z.G., Zitron M.A., and Guo R., (2000) A discrete semi-Markov decision
model to determine the optimal repair/replacement policy under general repairs, Euro. J.
Opl Res, 125, 2, 398–409
Love, C.E. and Guo, R., (1991) Using proportional hazard modelling in plant maintenance.
Quality and Reliability Engineering International, 7, 7–17.
Makis, V. and Jardine, A.K.S., (1991) Computation of optimal policies in replacement
models, IMA J. Maths. Appl. Business & Industry, 3, 169–176.
Matthew, C. and Wang, W., (2006) A comparison study of proportional hazard and
stochastic filtering when applied to vibration based condition monitoring, submitted to
Int. Tran OR.
Meher-Homji, C.B., Mistree, F. and Karandikar, S., (1994) An approach for the integration
of condition monitoring and multi-objective optimization for gas turbine maintenance
management. International Journal of Turbo and Jet Engines, 11, 43–51.
Neal, M., and Associates, (1979) Guide to the condition monitoring of machinery, DTI,
Reeves, C.W. (1998) The vibration monitoring handbook, Coxmoor Publishing Company,
Samanta, B., Al-Balushi, K.R., Al-Araimi, S.A. (2006) Artificial neural networks and
genetic algorithm for bearing fault detection Soft Computing, 10 (3), 264–271.
Wang, W., (2002) A model to predict the residual life of rolling element bearings given
monitored condition monitoring information to date, IMA. J. Management Mathematics,
13, 3–16.
Wang, W., (2003) Modelling condition monitoring intervals: A hybrid of simulation and
analytical approaches, J. Opl. Res Soc, 54, 273–282.
Wang, W., (2006a) A prognosis model for wear prediction based on oil based monitoring, to
appear in J. Opl. Res Soc,
Wang, W., (2006b) Modelling the probability assessment of the system state using available
condition information, to appear in IMA. J. Management Mathematics.
Wang, W. and Christer, A.H., (2000) Towards a general condition based maintenance model
for a stochastic dynamic system, J. Opl. Res. Soc. 51, 145–155.
Wang, W. and Jia, Y., (2001) A multiple condition information sources based maintenance
model and associated prototype software development, proceedings of COMADEM
2001, Eds. A. Starr and Raj B.K.N. Rao, Elsevier, 889–898.
Wang, W. and Zhang, W., (2005) A model to predict the residual life of aircraft engines
based on oil analysis data, Naval Research Logistics, 52, 276–284.
Wong, M.L.D., Jack, L.B., Nandi, A.K., (2006) Modified self-organising map for automated
novelty detection applied to vibration signal monitoring Mech. Sys. & Sig. Proc., 20(3),
Zhang, W., (2004) Stochastic modeling and applications in condition based maintenance,
PhD, thesis, University of Salford, UK.

Maintenance Based on Limited Data

David F. Percy

6.1 Introduction
Reliability applications often suffer from a lack of data with which to make in-
formed maintenance decisions. Indeed, the very nature of maintenance is to avoid
observed failure data from arising!
This effect is particularly noticeable for high reliability systems such as aircraft
engines and emergency vehicles, and when new production lines are established or
warranty schemes are planned. The evaluation of such systems is a learning pro-
cess and knowledge is continually updated as more information becomes available.
Such issues are of great importance when selecting and fitting mathematical
models to improve the accuracy and utility of these decisions.
This chapter investigates why reliability data are so limited, identifies the
problems that this causes and proposes statistical methods for dealing with these
difficulties. In particular, it considers graphical and numerical summaries, appropri-
ate methods for model development and validation, and the powerful approach of
subjective Bayesian analysis for including expert knowledge about the application
area, such as information pertaining to a particular manufacturing process and ex-
perience of similar operational systems.
Many reliability problems involve making strategic decisions under risk or un-
certainty. Stochastic models involving unknown parameters are often adopted for
this purpose and our concern is how to make inference about, and arising from,
these unknown parameters. The easiest approach involves skilfully guessing the
parameter values by subjective means, which is fine so long as there is sufficient
expert knowledge to perform this task well. More commonly, the parameters are
estimated from observed data and decisions are then made by assuming the
parameters equal to their estimates. This frequentist approach to inference is very
good if there are sufficient data to estimate the parameters well.
However, few data are available in many areas of maintenance and replace-
ment; see Percy et al. (1997) and Kobbacy et al. (1997) for example. There are
several reasons why data are scarce in these situations. New systems and processes
naturally offer scant historical data about their performance and reliability. Poor
and incomplete maintenance records are often kept, as the engineers and managers
do not always appreciate the potential benefits that can be achieved through
quantitative modelling and analysis. Of equal importance, many observations of
failure times tend to be censored due to maintenance interventions.
Typical applications take the form of reliability analysis, such as modelling a
critical system’s time to failure, and scheduling problems, such as determining
efficient policies for scheduling capital replacement and preventive maintenance,
all of which are considered elsewhere in this book. Other applications include
determining appropriate thresholds for condition monitoring and specifying
warranty schemes for new products. Under these circumstances, it is important to
allow for the uncertainty about the unknown model parameters. This is readily
achieved by adopting the Bayesian approach to inference, as described by
Bernardo and Smith (2000) and O’Hagan (2004).
The structure for the remainder of this chapter is as follows. Section 6.2
explains the need for Bayesian analysis and Section 6.3 introduces the concepts
beginning with Bayes’ theorem, which is of great importance in its own right.
Section 6.4 discusses the construction of prior and posterior distributions, whilst
Section 6.5 considers the role of predictive distributions and Section 6.6 considers
techniques for setting the hyperparameters of prior distributions. One of the great
strengths of the Bayesian approach, particularly in relation to practical problems in
reliability and maintenance, is its ability to improve the quality of decision
analysis, as described in Section 6.7. Section 6.8 presents a review of the Bayesian
approach to maintenance and Section 6.9 includes specific case studies that demon-
strate these methods. Finally, Section 6.10 suggests topics for future research and
possible new applications.
For convenience, there follows a list of symbols and acronyms that are used
throughout this chapter.
P(⋅) : Probability
E (⋅) : Expected value
p(⋅) : Probability mass function
f (⋅) : Probability density function
R(⋅) : Reliability function
L(⋅) : Likelihood function
g (⋅) : Prior or posterior probability density function
Be(θ ) : Bernoulli distribution
Po(µ ) : Poisson distribution
Ge(θ ) : Geometric distribution
Ex(λ ) : Exponential distribution
No(µ,ψ ) : Normal distribution
Ga (α , λ ) : Gamma distribution
We(α , λ ) : Weibull distribution

6.2 Need for Bayesian Approach

Figure 6.1 shows the links between equipment, maintenance, models, parameters and
data. Starting with the equipment, imperfect reliability necessitates some forms of
maintenance. These affect the performance of the equipment as implied by the arrow,
which represents a directional influence. In order to determine suitable maintenance
policies and strategies, we formulate appropriate mathematical models. These in-
volve unknown parameters that are modelled using expert knowledge and observed
data. Variations to the models arise due to modified reliability characteristics when
maintenance strategies are in place for particular equipment, forming the cycle at the
top of the chart.

Figure 6.1. The link between fundamental aspects of maintenance modelling and analysis

The conventional approach to model fitting is based upon frequentist methods

of estimation, as described in statistics books. One of the best such methods is that
of maximum likelihood. Essentially, all unknown model parameters are replaced
by estimates calculated from samples of data. For example, a parameter that re-
presents the mean lifetime of a rechargeable battery in a portable computer might
be replaced by the average lifetime calculated from a sample of such batteries that
were run from charged to flat.
However, this approximation can and does lead to substantial errors, inaccuracies
and poor decisions, particularly when the estimates are based on small samples of
data. When data are limited in this way, one starts with subjective estimates and
updates them as new data are observed. This is a very common scenario in reliability
and maintenance, where the samples typically contain few and censored failure data.
Example 6.1 Suppose we use a Weibull We (α , λ ) distribution to model the
random variable X , which represents the breaking strain of a steel cable. As
destructive testing can be very expensive and safety precautions can be crucial, it is
feasible that we might only collect right-censored observations of the form
D = { xi : X > xi ; i = 1, 2,… , n} . In order to make useful inference involving the
model parameters α > 0 and λ > 0 , we need to construct the likelihood function.
For this scenario, the likelihood involves the reliability function of X ,

\[ R(x \mid \alpha, \lambda) = \exp(-\lambda x^{\alpha}) \tag{6.1} \]

for x > 0 , and takes the form

\[ L(\alpha, \lambda; D) \propto \prod_{i=1}^{n} R(x_i \mid \alpha, \lambda) = \exp\left( -\lambda \sum_{i=1}^{n} x_i^{\alpha} \right). \tag{6.2} \]

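We typically maximize this function in order to evaluate the maximum likelihood
estimates of α and λ . Differentiating the log-likelihood with respect to λ and α
and setting the derivatives to zero gives the likelihood equations

\[ \sum_{i=1}^{n} x_i^{\alpha} = 0; \tag{6.3} \]

\[ \lambda \sum_{i=1}^{n} x_i^{\alpha} \log x_i = 0. \tag{6.4} \]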

These have no finite solutions for α̂ and λ̂ , so our analysis has been thwarted by
the lack of uncensored data. □
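
The difficulty can be checked numerically. The following minimal Python sketch evaluates the logarithm of the likelihood in Equation 6.2 for a few censoring points. The data values and the function name censored_weibull_loglik are illustrative assumptions and are not taken from the chapter.

```python
def censored_weibull_loglik(alpha, lam, xs):
    """Log-likelihood when every observation is right-censored at x_i."""
    return -lam * sum(x ** alpha for x in xs)

# Hypothetical censoring points (illustrative values only).
xs = [1.2, 0.8, 2.5, 1.9]

# For a fixed alpha, shrinking lambda towards zero keeps increasing the
# log-likelihood, so no finite positive maximiser exists.
lams = [1.0, 0.1, 0.01, 0.001]
logliks = [censored_weibull_loglik(1.5, lam, xs) for lam in lams]
for lam, ll in zip(lams, logliks):
    print(f"lambda = {lam:<6} log-likelihood = {ll:.6f}")
```

The log-likelihood approaches its supremum of zero only as λ tends to zero, which is why the likelihood equations have no finite solution.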

6.3 Bayesian Inference

In 1763, some research on probability theory by the Reverend Thomas Bayes was
posthumously published in Philosophical Transactions of the Royal Society. This
contained an incredibly important statement of what we now refer to as Bayes’
theorem. In its simplest form, Bayes’ theorem states that for two events A and B ,
the conditional probability of B given that A has occurred can be expressed as

\[ P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)} \tag{6.5} \]

where it is sometimes useful to evaluate the probability of A using the law of total
probability

\[ P(A) = P(A \mid B)\,P(B) + P(A \mid B')\,P(B') \tag{6.6} \]

where the event B′ is the complement of the event B ; that is, the event that B
does not occur. Bayes’ theorem can be interpreted as a way of transposing the
conditionality from P(A | B) to P(B | A), or as a way of updating the prior
probability P(B) to give the posterior probability P(B | A).
Example 6.2 An aircraft warning light comes on if the landing gear on either
side is faulty. Suppose we know that faults only occur 0.4% of the time, that they
are detected with 99.9% reliability and that false alarms only occur 0.5% of the
time when the landing gear is operational. Defining events W = “warning light
comes on” and L = “landing gear faulty”, this information can be summarized
concisely as P(L) = 0.004, P(W | L) = 0.999, P(W | L′) = 0.005. Our aim is to
calculate the probability that the landing gear is faulty if the warning light comes
on. Intuitively, one might suppose that this probability is very close to one, as the
alarm system appears to be very accurate. However, the law of total probability gives

\[ P(W) = P(W \mid L)\,P(L) + P(W \mid L')\,P(L') = 0.999 \times 0.004 + 0.005 \times 0.996 = 0.008976, \tag{6.7} \]

from which Bayes’ theorem gives

\[ P(L \mid W) = \frac{P(W \mid L)\,P(L)}{P(W)} = \frac{0.999 \times 0.004}{0.008976} = 0.45 \tag{6.8} \]

to two decimal places. This result implies that most (55%) of these warning lights are
false alarms, despite the apparent accuracy of the alarm system! The reason for this
paradoxical outcome is that the landing gear is operational for the vast majority of the
time. If we were to specify P(L) = 0.04 instead, we would obtain P(L | W) = 0.89 ,
which is far more acceptable. Similar patterns of behaviour apply to medical
screening procedures – in order to reduce the incidence of misdiagnoses, only
patients deemed to be at risk of an illness are routinely screened for it. □
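
The arithmetic of Example 6.2 can be reproduced with a few lines of Python. This is only a sketch of the two-event form of Bayes’ theorem; the function and argument names are illustrative assumptions.

```python
def posterior_fault_probability(p_fault, p_alarm_given_fault, p_alarm_given_ok):
    """P(L | W) from Bayes' theorem, using the law of total probability for P(W)."""
    p_alarm = p_alarm_given_fault * p_fault + p_alarm_given_ok * (1.0 - p_fault)
    return p_alarm_given_fault * p_fault / p_alarm

rare_fault = posterior_fault_probability(0.004, 0.999, 0.005)
common_fault = posterior_fault_probability(0.04, 0.999, 0.005)
print(round(rare_fault, 2))    # 0.45
print(round(common_fault, 2))  # 0.89
```

Changing only the prior probability of a fault moves the posterior from 0.45 to 0.89, which illustrates how strongly the base rate drives the result.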
Only in the mid-twentieth century were the real benefits of Bayes’ theorem
appreciated though. Not only does it apply to probabilities, but also to random
variables. For example, suppose X is a discrete random variable and Y is a con-
tinuous random variable. Then the conditional probability density function of Y
given X can be determined using Bayes’ theorem, if we know the marginal distri-
butions of X and Y , and the conditional distribution of X given Y :

\[ f(y \mid x) = \frac{p(x \mid y)\,f(y)}{p(x)}. \tag{6.9} \]

This rule for “transposing the conditionals” has proven to be crucial in a variety of
important applications, including quality control, fault diagnosis, image processing,
medical screening and criminal trials.
Even more importantly, we can apply Bayes’ theorem to unknown model
parameters. This is the foundation of the Bayesian approach to statistical inference
and has had an enormous and profound impact on the subject over the last few
decades. Suppose that a continuous random variable X has a probability distribution
that depends on an unknown parameter θ . For example, X might represent the fire-
breach time of a door in minutes and it might have an exponential distribution with
unknown mean µ = 1/θ .
A naïve approach to statistical inference would simply replace θ by a good
guess based on expert opinions. However, this is inherently inaccurate and can lead
to poor decisions. A better method is the frequentist approach to inference, whereby
we evaluate an estimate θ̂ for the unknown parameter θ based on a set of observed
data D = { x1 , x2 ,… , xn } , which might consist of a random sample of actual
fire-breach times for the above example. Subsequent analyses generally invoke the
approximation θ ≈ θ̂ , which can again lead to poor decisions.
In contrast, the Bayesian approach does not involve any guesses or estimates of
unknown parameters in the model. Rather, it uses Bayes’ theorem to update our
prior beliefs about θ in response to the observed data D thus:

\[ g(\theta \mid D) = \frac{f(D \mid \theta)\,g(\theta)}{f(D)}. \tag{6.10} \]

This enables us to make any inference we wish about θ . We can also use our
posterior beliefs about θ for any subsequent inference involving X . The price that
we pay for obtaining exact answers and avoiding approximations in this way
comes in two parts, the need to assume a prior distribution for θ and the increase
in algebraic complexity. This chapter shows how to resolve these issues.
Example 6.3 Suppose the unknown parameter θ represents the proportion of
car batteries that fail within two years and our prior beliefs about θ can be ex-
pressed in terms of the probability density function

g (θ ) = 2 (1 − θ ) ; 0 < θ < 1 . (6.11)

Suppose also that we observe three car batteries, one of which fails within two
years and two of which do not. Then we can express the likelihood of these data
using the binomial probability mass function

\[ p(D \mid \theta) = 3\theta(1-\theta)^2, \tag{6.12} \]

which is the discrete equivalent to the probability density function f(D | θ)
referred to above. As p(D), the discrete equivalent of f(D), does not depend on
θ , an application of Bayes’ theorem as stated above gives

\[ g(\theta \mid D) \propto p(D \mid \theta)\,g(\theta) \propto \theta(1-\theta)^3 \tag{6.13} \]

for 0 < θ < 1 , so our posterior beliefs about the unknown parameter θ can be
expressed as a beta distribution θ | D ~ Be(2, 4). We elaborate on this process
further in Section 6.4. □

6.4 Prior and Posterior Distributions

Section 6.3 concluded by deriving an equation for updating a prior probability
density function g(θ) for an unknown parameter θ , based on some observed data
D , to give a posterior probability density function g(θ | D). The term f(D | θ) is
proportional to the likelihood function. If the data set D consists of a random
sample of observations x1 , x2 ,… , xn of a continuous random variable X with
probability density function f(x | θ), then the likelihood function becomes

\[ L(\theta; D) \propto \prod_{i=1}^{n} f(x_i \mid \theta). \tag{6.14} \]

As the term f ( D ) does not depend on θ , we can therefore write

\[ g(\theta \mid D) \propto L(\theta; D)\,g(\theta) \tag{6.15} \]

or, in words,

“posterior is proportional to likelihood times prior”.

This is the fundamental rule for Bayesian inference.

Example 6.4 Previously, in Example 6.3, we considered the proportion of car
batteries that fail within two years. This involved the use of Bayes’ theorem for
this unknown model parameter θ and was an illustration of how the fundamental
rule “posterior is proportional to likelihood times prior” can be applied. To clarify
this demonstration, the likelihood function takes the form

\[ L(\theta; D) \propto \prod_{i=1}^{3} p(x_i \mid \theta) = \theta(1-\theta)^2 \tag{6.16} \]

where the probability mass function p(x_i | θ) corresponds to a Bernoulli distribution.
Consequently, the posterior probability density function of θ given the data
D has the form

\[ g(\theta \mid D) \propto L(\theta; D)\,g(\theta) \propto \theta(1-\theta)^3 \tag{6.17} \]

for 0 < θ < 1 , which agrees with the result we obtained previously. The
corresponding prior and posterior probability density functions are graphed for
comparison in Figure 6.2. □

Figure 6.2. Prior and posterior probability density functions for Example 6.4
(curves labelled prior(θ) and posterior(θ), plotted over 0 ≤ θ ≤ 1)

Having evaluated a posterior distribution using this rule, we can evaluate the
posterior mode θ̂ such that

\[ g(\hat{\theta} \mid D) \ge g(\theta \mid D) \quad \forall\,\theta, \tag{6.18} \]

by solving the equation

\[ \frac{d}{d\theta}\left\{ L(\theta; D)\,g(\theta) \right\} = 0. \tag{6.19} \]

However, to find the median or mean, and to use this posterior density to make any
further inference, we need to determine the constant of proportionality in the
fundamental rule above. In standard situations, we can recognise the functional
form of L (θ ; D ) g (θ ) and hence quote published work on probability distributions
to determine this constant of proportionality and so derive g(θ | D) explicitly. In
non-standard situations, we determine this constant of proportionality using
numerical quadrature or simulation, both of which we discuss later.
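
As a minimal sketch of the quadrature route, the Python fragment below normalises the unnormalised posterior θ(1 − θ)^3 from Example 6.4 with a midpoint rule and then evaluates the posterior mean and mode. The helper names, step count and grid resolution are illustrative assumptions; a production analysis would normally use a dedicated numerical library.

```python
def unnormalised_posterior(theta):
    """Likelihood times prior for Example 6.4, up to a constant."""
    return theta * (1.0 - theta) ** 3

def midpoint_integral(f, lower, upper, steps=20000):
    """Composite midpoint rule on [lower, upper]."""
    width = (upper - lower) / steps
    return width * sum(f(lower + (k + 0.5) * width) for k in range(steps))

# Constant of proportionality: the integral of likelihood times prior.
norm_const = midpoint_integral(unnormalised_posterior, 0.0, 1.0)

# Posterior mean needs the normalised density.
post_mean = midpoint_integral(
    lambda t: t * unnormalised_posterior(t), 0.0, 1.0) / norm_const

# Posterior mode does not need the constant: maximise over a grid.
grid = [k / 1000 for k in range(1001)]
post_mode = max(grid, key=unnormalised_posterior)

print(norm_const, post_mean, post_mode)
```

For this example the results can be compared with the known Be(2, 4) values: a normalising integral of 1/20, a mean of 1/3 and a mode of 1/4.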

6.4.1 Reference Priors

There are two main types of prior distribution, which loosely correspond with
objective priors and subjective priors. As objective priors strictly do not exist, this
category is generally known as reference priors. They are used if little prior
information is available and as a benchmark against which to compare the output
from using subjective priors. This offers a default Bayesian analysis that is not
dependent upon any personal prior knowledge. The simplest reference prior is
proposed by the Bayes-Laplace postulate and simply recommends the use of a
uniform or locally-uniform prior g(θ) ∝ 1 for all θ in the region of support R_θ .
However, different parameterisations can lead to different inferences with this
prior.
To avoid this inconsistency, the standard univariate reference prior that analysts
now adopt is the invariant prior of Jeffreys (1998), defined by

\[ g(\theta) \propto \sqrt{I(\theta)}; \quad \theta \in R_{\theta} \tag{6.20} \]

where

\[ I(\theta) = -E_{X \mid \theta}\left\{ \frac{d^{2} \log f(x \mid \theta)}{d\theta^{2}} \right\} \tag{6.21} \]

is Fisher’s expected information. An extension exists for the case of a parameter
vector θ , though we usually assume the components of θ are independent, so
g (θ ) is just the product of the univariate invariant priors. This invariant prior
distribution is occasionally improper, as its integral sometimes diverges. However,
this problem is generally unimportant because the corresponding posterior
distributions are usually proper. Books on Bayesian methods, such as Bernardo and
Smith (2000) and Lee (2004), present tables of invariant prior and posterior
distributions for common models.
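
As a worked illustration, consider an exponential sampling distribution with hazard λ, so that f(x | λ) = λ exp(−λx). This particular model is chosen here only as an example; the steps follow directly from the definition of Fisher’s expected information.

\[
\log f(x \mid \lambda) = \log\lambda - \lambda x, \qquad
\frac{d^{2}}{d\lambda^{2}} \log f(x \mid \lambda) = -\frac{1}{\lambda^{2}}, \qquad
I(\lambda) = \frac{1}{\lambda^{2}}, \qquad
g(\lambda) \propto \sqrt{I(\lambda)} = \frac{1}{\lambda}; \quad \lambda > 0.
\]

The second derivative does not depend on x, so the expectation is immediate. The resulting prior is improper, because the integral of 1/λ over (0, ∞) diverges.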

6.4.2 Subjective Priors

Subjective prior distributions should be used if prior information is available,
which is almost always. They represent the best available knowledge about
unknown parameters and can be specified using smoothed histograms, relative
likelihoods or parametric families. The first two of these are arbitrary and compu-
tationally awkward, so we now investigate the last of these. A family of priors C
is closed under sampling if

\[ g(\theta) \in C \;\Rightarrow\; g(\theta \mid D) \in C, \tag{6.22} \]

so that the posterior density has the same functional form as the prior density. This
property is particularly appealing, as our prior knowledge can be regarded as
posterior to some previous information. Again, we tend to suppose that com-
ponents in multi-parameter problems are independent, so that their joint prior
density is the product of corresponding univariate marginal priors.
Such closed priors exist, and are called natural conjugate priors, for sampling
distributions f(x | θ) that belong to the exponential family. This family includes
Bernoulli, binomial, geometric, negative binomial, Poisson, exponential, gamma,
normal and lognormal models. For a model in the exponential family with scalar
parameter θ , we can express the probability density or mass function in the form

\[ f(x \mid \theta) = \exp\{ a(x)\,b(\theta) + c(x) + d(\theta) \} \tag{6.23} \]

and the natural conjugate prior for θ is defined by

\[ g(\theta) \propto \exp\{ k_1 b(\theta) + k_2 d(\theta) \} \tag{6.24} \]

for suitable constants k1 and k2 . However, any conjugate prior of the form

\[ g(\theta) \propto h(\theta) \exp\{ k_1 b(\theta) + k_2 d(\theta) \} \tag{6.25} \]

is also closed under sampling for models in the exponential family.
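
To make the construction concrete, the exponential model with hazard λ can be written in the exponential family form. The identification of a, b, c and d below is one convenient choice made for illustration.

\[
f(x \mid \lambda) = \lambda \exp(-\lambda x) = \exp\{ x \cdot (-\lambda) + 0 + \log\lambda \},
\qquad a(x) = x, \quad b(\lambda) = -\lambda, \quad c(x) = 0, \quad d(\lambda) = \log\lambda .
\]

The natural conjugate prior is then

\[
g(\lambda) \propto \exp\{ -k_1 \lambda + k_2 \log\lambda \} = \lambda^{k_2} \exp(-k_1 \lambda); \quad \lambda > 0,
\]

which is a gamma density with shape k2 + 1 and rate k1. Multiplying by the likelihood of a random sample x1 , x2 ,… , xn gives a posterior proportional to \(\lambda^{k_2 + n} \exp\{-(k_1 + \sum x_i)\lambda\}\), which is again of gamma form, so the family is closed under sampling.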

Books on Bayesian methods, such as Bernardo and Smith (2000), present tables
of the conjugate prior and posterior distributions for common models. However,
many applications in reliability and maintenance are not amenable to such simple
analyses. For example, the Weibull distribution is not a member of the exponential
family. As a result of this, the constant of proportionality in the expression

\[ g(\theta \mid D) \propto L(\theta; D)\,g(\theta) \tag{6.26} \]

cannot always be evaluated algebraically, so analytical approximations or
numerical computation are usually required.
It is desirable to avoid the inconsistency of using natural conjugate priors when
they exist and other forms of subjective prior, such as location-scale forms, when
they do not. The following recommendations by Percy (2004) provide a simple,
consistent and comprehensive strategy that achieves this for general use:
• Infinite range −∞ < θ < ∞ , use a normal prior distribution for θ
• Semi-infinite range 0 < θ < ∞ , use a gamma prior distribution for θ
• Finite range 0 < θ < 1 , use a beta prior distribution for θ
If necessary, linear transformations of the parameters ensure that these priors are
sufficient for modelling all situations. They match with the natural conjugate priors
for simple models and extend to deal with more complicated models. Mixtures of
these priors can be used if multimodality is present and prior independence can be
assumed for multiparameter situations.

6.5 Predictive Distributions

The frequentist approach to inference involves estimating unknown parameters,
evaluating confidence intervals and performing significance tests. Such intervals
and tests are statements about the data rather than the parameters and so are of little
use. For example, null hypotheses are often strictly impossible, in which case a test
will be significant if, and only if, sufficient data are observed. In contrast, the
Bayesian approach to inference makes statements about the parameters given the
data, which are precisely what is required. O’Hagan (1994) commented that the
“Bayesian approach … is fundamentally sound, very flexible, produces clear and
direct inferences and makes use of all the available information.” In contrast, he
noted that the “Classical approach suffers from some philosophical flaws, has a
restrictive range of inferences with rather indirect meanings and ignores prior
information.”
One of the most important and useful features of the Bayesian approach arises
when we wish to make predictions about future values of the random variable X
where f(x | θ) is specified. If θ is unknown, the prior predictive probability
density function of X is

\[ f(x) = \int f(x \mid \theta)\,g(\theta)\,d\theta. \tag{6.27} \]

If data D are observed, the posterior predictive probability density function of X
is

\[ f(x \mid D) = \int f(x \mid \theta)\,g(\theta \mid D)\,d\theta \propto \int f(x \mid \theta)\,L(\theta; D)\,g(\theta)\,d\theta. \tag{6.28} \]

In contrast, a frequentist approach either uses the approximation f(x | D) ≈ f(x | θ̂)
or gives a point prediction x̂ = E(X | θ̂) with a prediction interval if available.
Example 6.5 Suppose that the time X to breakdown of a large pulper in a
paper mill has an exponential sampling distribution given some unknown hazard
parameter λ . With Jeffreys’ invariant prior, the prior predictive density is given by

\[ f(x) \propto \int_{0}^{\infty} \lambda \exp(-\lambda x)\,\frac{1}{\lambda}\,d\lambda \propto \frac{1}{x} \tag{6.29} \]

for x > 0 , which is improper. However, this does provide information about the
relative likelihoods for different values of X . For example, the ratio of probabilities
that X lies in the intervals (5,10) and (10,20) is given by

\[ \frac{P(5 < X < 10)}{P(10 < X < 20)} = \frac{\int_{5}^{10} \frac{1}{x}\,dx}{\int_{10}^{20} \frac{1}{x}\,dx} = \frac{\log 10 - \log 5}{\log 20 - \log 10} = 1 \tag{6.30} \]


so the time to breakdown of this pulper is equally likely to lie in these two intervals
without taking account of any subjective or empirical information that might
be available. Even if we subsequently observe a random sample of lifetimes
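D = { x1 , x2 ,… , xn } , the posterior predictive density

\[ f(x \mid D) \propto \int_{0}^{\infty} \lambda \exp(-\lambda x) \times \lambda^{n-1} \exp\left\{ -\left( \sum_{i=1}^{n} x_i \right) \lambda \right\} d\lambda = \frac{n!}{\left( x + \sum_{i=1}^{n} x_i \right)^{n+1}}; \quad x > 0 \tag{6.31} \]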

is still improper, though we can evaluate relative likelihoods as we did for the prior
predictive density. In contrast, a frequentist approach would merely generate the
approximation X | D ~ Ex(1/x̄) and could do no better than guess a value for X
before observing any data. □
Example 6.6 Reconsidering the time to breakdown of the pulper in Example
6.5, suppose we instead use a gamma prior to reflect the knowledge of experts on
site. The prior predictive density is now given by

\[ f(x) = \int_{0}^{\infty} \lambda \exp(-\lambda x)\,\frac{b^{a} \lambda^{a-1} \exp(-b\lambda)}{\Gamma(a)}\,d\lambda = \frac{a b^{a}}{(x+b)^{a+1}}; \quad x > 0 \tag{6.32} \]

which corresponds with a special form of gamma-gamma distribution. If we subsequently
observe a random sample of lifetimes D = { x1 , x2 ,… , xn } , the posterior
predictive density is given by

\[ f(x \mid D) \propto \int_{0}^{\infty} \lambda \exp(-\lambda x)\,\lambda^{a+n-1} \exp\left\{ -\left( b + \sum_{i=1}^{n} x_i \right) \lambda \right\} d\lambda = \frac{\Gamma(a+n+1)}{\left( x + b + \sum_{i=1}^{n} x_i \right)^{a+n+1}}; \quad x > 0 \tag{6.33} \]

which again corresponds to a gamma-gamma distribution. As before, a frequentist
approach would merely yield the approximation X | D ~ Ex(1/x̄). □
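
The prior predictive result of Example 6.6 can be checked by simulation. Integrating the density a b^a / (x + b)^(a+1) from 0 to x gives the cumulative probability 1 − (b / (x + b))^a, and the sketch below compares this with the proportion of simulated values that fall below a chosen point. The hyperparameter values, the evaluation point x0 and the function names are illustrative assumptions only.

```python
import random

def prior_predictive_cdf(x, a, b):
    """P(X <= x) when X | lam ~ Ex(lam) and lam ~ Ga(a, b) with rate b."""
    return 1.0 - (b / (x + b)) ** a

def simulate_prior_predictive(a, b, n, rng):
    """Draw lam from the gamma prior, then X from the exponential model."""
    draws = []
    for _ in range(n):
        lam = rng.gammavariate(a, 1.0 / b)  # shape a, scale 1/b (rate b)
        draws.append(rng.expovariate(lam))
    return draws

rng = random.Random(1)
a, b, x0 = 3.0, 10.0, 5.0  # hypothetical values for illustration
sample = simulate_prior_predictive(a, b, 100_000, rng)
empirical = sum(x <= x0 for x in sample) / len(sample)
exact = prior_predictive_cdf(x0, a, b)
print(f"simulated {empirical:.3f}, closed form {exact:.3f}")
```

Note that random.gammavariate takes a scale parameter, so the rate b used in the chapter’s Ga(a, b) notation is inverted before the call.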

6.6 Prior Specification

In Section 6.4 we discussed what objective and subjective prior distributions are
appropriate for practical applications. As some prior knowledge is always avail-
able, a conjugate prior should be used whenever possible. However, reference
priors are useful in these circumstances:
• For an objective analysis with no specific personal inputs
• For comparison with similar analyses by other investigators
• As baselines to assess the sensitivity of results to choice of prior
We now consider the difficult problem of assigning values to the hyperparameters
of subjective prior distributions.
Suppose we have a model for a continuous random variable X with probability
density function f ( x θ ) , which depends on a parameter θ with subjective prior
probability density function g (θ ) . Typically, this prior distribution consists of two
unknown hyperparameters, which we now label a and b . We set fixed values for
these hyperparameters, to reflect our prior knowledge about θ . For two hyperpara-
meters, we need two distinct pieces of information such as the upper and lower
tertiles (33 1/3 and 66 2/3 percentiles) of θ or the cumulative probabilities corresponding
to any two suitable values of θ .
Alternative information about θ could be provided, though quantiles and
cumulative probabilities are the easiest and best formulations. One obvious alterna-
tive is to specify the prior mode, but this is occasionally at an endpoint of the para-
meter’s range and so provides no useful information. Furthermore, there is no
suitable candidate for the second piece of information when the prior mode is used.
Another obvious alternative is to specify the prior mean and standard deviation.
However, we cannot make meaningful judgments about these purely mathematical
quantities.
Unfortunately, parameters are not observable and we cannot make accurate
statements about them directly. The sole exception is when our parameter repre-
sents the probability of an event associated with infinitely repeatable Bernoulli
trials. In this case, it is feasible to elicit information about an identical quantity, the
asymptotic proportion. In general, however, we can elicit hyperparameter values
by considering the prior predictive distribution introduced in Section 6.5, which is
also a function of a and b ; refer to Percy (2002) for further details. Research in
this area is still ongoing, particularly for models for which the prior predictive
cumulative distribution function cannot be determined analytically and for multi-
parameter models for which there are implicit and indeterminable constraints on
the prior predictive quantiles.
Example 6.7 We saw earlier that the prior predictive probability density function
for the exponential sampling distribution (perhaps representing the time X to
failure of a pulper, as before, or the downtime X incurred as a result of a computer
system failure) with a gamma prior is given by

\[ f(x) = \frac{a b^{a}}{(x+b)^{a+1}}; \quad x > 0. \tag{6.34} \]

Hence the prior predictive cumulative distribution function is

\[ F(x) = \int_{0}^{x} \frac{a b^{a}}{(u+b)^{a+1}}\,du = 1 - \left( \frac{b}{x+b} \right)^{a}; \quad x > 0. \tag{6.35} \]

If an expert specifies tertiles L and U , such that F_X(L) = 1/3 and F_X(U) = 2/3 ,
then we can solve these two nonlinear simultaneous equations numerically for a
and b . These can then be substituted into our prior density f ( x ) . □
Example 6.8 In Example 6.7, the exponentially distributed random variable X
might instead represent the lifetime of an energy efficient light bulb, in operating
hours. Suppose that, based on subjective knowledge of similar light bulbs, we
believe that one third of the new type will fail within 2500 operating hours and one
third will last for at least 7500 operating hours. This implies that we believe that
the remaining third will fail between these two values. Then L = 2500 and
U = 7500 , so we need to solve these two simultaneous nonlinear equations
numerically for a and b :

\[ \frac{1}{3} = 1 - \left( \frac{b}{2500 + b} \right)^{a}; \tag{6.36} \]

\[ \frac{2}{3} = 1 - \left( \frac{b}{7500 + b} \right)^{a}. \tag{6.37} \]

There are many algorithms for solving simultaneous nonlinear equations and
several computer packages that contain these algorithms. Mathcad gives the values
a = 3.5240 and b = 20,502 , so the prior distribution for the exponential parameter
λ is specified completely as λ ~ Ga ( 3.5240, 20,502 ) . □
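
One simple way to solve the two tertile equations without specialist software is sketched below in Python. Taking logarithms gives a log(1 + L/b) = log(3/2) and a log(1 + U/b) = log 3, so dividing one by the other eliminates a and leaves a single equation in b that can be solved by bisection. The function name, bracketing interval and iteration count are illustrative assumptions, and this is not the algorithm used by Mathcad.

```python
import math

def solve_gamma_hyperparameters(lower_tertile, upper_tertile):
    """Solve F(L) = 1/3 and F(U) = 2/3 for (a, b), F(x) = 1 - (b / (x + b))**a."""
    target = math.log(3.0) / math.log(1.5)
    if upper_tertile / lower_tertile <= target:
        raise ValueError("no solution: U/L must exceed log(3)/log(1.5)")

    def ratio(b):
        # Increases from 1 (b -> 0) towards U/L (b -> infinity).
        return math.log1p(upper_tertile / b) / math.log1p(lower_tertile / b)

    lo, hi = 1e-6, 1e12  # assumed bracket for b
    for _ in range(200):
        mid = math.sqrt(lo * hi)  # bisect on a logarithmic scale
        if ratio(mid) < target:
            lo = mid
        else:
            hi = mid
    b = math.sqrt(lo * hi)
    a = math.log(1.5) / math.log1p(lower_tertile / b)
    return a, b

a, b = solve_gamma_hyperparameters(2500.0, 7500.0)
print(f"a = {a:.4f}, b = {b:.0f}")
```

The solution should agree closely with the values quoted in Example 6.8, up to rounding and solver tolerance.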

6.7 Bayesian Decision Theory

Much research into maintenance modelling, as presented throughout this book,
involves making informed decisions in the presence of stochastic variability.
Sensitivity analyses are always advisable in such circumstances, to consider how
the conclusions are affected by misspecification of the model and its parameters;
see Kobbacy et al. (1995). Rather than replacing model parameters by guesses or
estimates, however, more accurate decisions can be made by adopting a Bayesian
analysis to allow for the uncertainty attached to these parameters. This effect is
particularly important when dealing with limited amounts of data, a common
problem in the area of reliability and maintenance and the subject of this chapter.
For example, the author recently acquired a set of data relating to the perform-
ance of an industrial valve subject to corrective and preventive maintenance. Only 12
uncensored lifetime observations were available, despite the fact that this represents
six years of data collection. From a frequentist point of view, it would be unwise to
fit any model involving more than three parameters to these data. However, the
Bayesian is not constrained in this manner, as prior knowledge gleaned from experience
of similar systems can be incorporated in the analysis. Of course, parsimony
still dictates that models with fewer parameters are more robust for predictive
purposes, even if more complex models provide better fits to the observed data. We
can resolve such issues by model comparison methods based on prior odds, Bayes
factors and posterior odds, which we do not discuss here.
Consider a set of possible decisions d ∈ ∆ with associated utility function
u (d ,θ ) , which depends on an unknown parameter θ . The best decision is that
which maximizes the prior expected utility

\[ E\{u(d, \theta)\} = \int u(d, \theta)\,g(\theta)\,d\theta \tag{6.38} \]

with respect to the prior probability density function g(θ). Alternatively, we can
minimize the prior expected loss

\[ E\{l(d, \theta)\} = \int l(d, \theta)\,g(\theta)\,d\theta \tag{6.39} \]

for some loss function l(d, θ). If we observe exchangeable data D = { x1 , x2 ,… , xn }
from the sampling density f(x | θ), the criterion to maximize (minimize) is the
posterior expected utility (loss) defined by

\[ E\{u(d, \theta) \mid D\} = \int u(d, \theta)\,g(\theta \mid D)\,d\theta \tag{6.40} \]

where

\[ g(\theta \mid D) \propto L(\theta; D)\,g(\theta) \tag{6.41} \]

is the posterior probability density function.

Example 6.9 Which of two alarm systems should we buy if they cost c_i units
and fail at times X_i where

\[ X_i \mid \lambda_i \sim \mathrm{Ex}(\lambda_i) \tag{6.42} \]

for i = 1, 2 respectively? Assuming replacements on failure for an infinite horizon,
the elementary renewal theorem gives the expected cost per unit time for action i
as the loss function

\[ l(i, \lambda_i) = c_i \lambda_i. \tag{6.43} \]

The prior expected loss is therefore

\[ E_{\lambda}\{l(i, \lambda_i)\} = c_i E(\lambda_i) \tag{6.44} \]

and we choose system i which minimizes this expected loss, where E(λi) is the
prior mean. □
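
The decision rule of Example 6.9 can be sketched in Python as follows. The example assumes, purely for illustration, that each hazard has a gamma prior Ga(shape, rate) with mean shape / rate; the costs and hyperparameters are hypothetical values and not data from the chapter.

```python
def expected_cost_rate(cost, shape, rate):
    """Prior expected loss c_i * E(lambda_i) for lambda_i ~ Ga(shape, rate)."""
    return cost * shape / rate

# Hypothetical purchase costs and gamma prior hyperparameters.
systems = {
    "system 1": {"cost": 100.0, "shape": 2.0, "rate": 10.0},
    "system 2": {"cost": 150.0, "shape": 3.0, "rate": 30.0},
}

losses = {name: expected_cost_rate(**spec) for name, spec in systems.items()}
best = min(losses, key=losses.get)
print(losses, "->", best)
```

With these numbers the dearer system is preferred, because its lower expected failure rate more than offsets its higher replacement cost.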

6.8 Review of Bayesian Approach to Maintenance

Whether we are interested in modelling the reliability of components or systems,
assessing the quality of manufactured products, determining optimal replacement
policies, deciding when to intervene with preventive maintenance, interpreting the
results from condition monitoring, resolving stock control problems or establishing
warranty schemes, mathematical models and statistical analysis offer many ad-
vantages over subjective expert knowledge alone. This book describes many tech-
niques related to the modelling aspects and generally advocates the frequentist
approach of estimating unknown model parameters based upon random samples of
observed data.
However, Chapter 6 has emphasised that this approach only provides approxi-
mate inference, decisions, predictions and solutions. When many data are available,
such as might arise when analysing the returns data from common household appli-
ances, these approximations are very accurate. However, these approximations can
be very inaccurate when few data are available. We often encounter this situation in
maintenance modelling, as the whole purpose of maintenance is to prevent failures
from occurring and so lifetime observations are typically censored. Moreover, some
applications in this general area relate to products or systems that are completely
new or modified versions and for which reliability data are simply not available.
By combining the observed data with expert knowledge, the Bayesian approach
to statistical analysis avoids the need for approximate inference and yields exact
answers under the assumed model and prior formulations. These enable us to make
the best maintenance decisions given all available information. This chapter began
by justifying this approach and then investigated suitable forms for the prior
distributions corresponding to common reliability models. After describing how to
calculate the related posterior and predictive distributions, it discussed how to use
this knowledge for decision making in practice.

Table 6.1. Common distributions for maintenance modelling

Model                                         Prior             Posterior
--------------------------------------------  ----------------  ---------------------------------------------------
BERNOULLI Be(θ), probability θ                beta Be(a, b)     beta Be(a + n x̄, b + n(1 − x̄))
POISSON Po(µ), mean µ                         gamma Ga(a, b)    gamma Ga(a + n x̄, b + n)
GEOMETRIC Ge(θ), probability θ                beta Be(a, b)     beta Be(a + n, b + n(x̄ − 1))
EXPONENTIAL Ex(λ), hazard λ                   gamma Ga(a, b)    gamma Ga(a + n, b + ∑ x)
NORMAL No(µ, ψ), mean µ, known precision ψ    normal No(a, b)   normal No((ab + n x̄ ψ) / (b + nψ), b + nψ)
NORMAL No(µ, ψ), known mean µ, precision ψ    gamma Ga(c, d)    gamma Ga(c + n/2, d + n(x̄ − µ)²/2 + (n − 1)s²/2)

Table 6.1 presents a summary of the probability distributions commonly encountered
in maintenance analysis, together with details of their natural conjugate
prior and posterior distributions. For models with two parameters, including the
unconstrained normal, gamma and Weibull sampling distributions, the analysis is
less straightforward and readers are referred to Section 6.4.2 for guidance.
Among the published research that applies this methodology to maintenance
modelling is the extensive book on Bayesian reliability analysis by Martz and
Waller (1982). Journal papers that address specific issues include those by Soland
(1969), Bury (1972) and Canavos and Tsokos (1973), who are concerned particu-
larly with analysis of the Weibull distribution. Singpurwalla (1988) and Percy
(2004) are concerned with prior elicitation for reliability analysis and O’Hagan
(1998) presents an accessible, general discussion of Bayesian methods.
There are many other academic publications dealing with Bayesian approaches
in maintenance and a representative sample of recent articles includes those by van
Noortwijk et al. (1992), Mazzuchi and Soyer (1996), Chen and Popova (2000),
Apeland and Aven (2000), Kallen and van Noortwijk (2005) and Celeux et al.
(2006). The general aim is to determine optimal policies for maintenance schedul-
ing and operation, by combining subjective prior knowledge with observed data
using Bayes’ theorem and employing belief networks for larger systems.

6.9 Case Studies

We now consider two case studies in which the techniques of this chapter can be
applied successfully.

6.9.1 Digital Set Top Boxes

The proportion of defective test versions of digital set top boxes θ in a large ship-
ment is unknown, but a beta prior probability density function of the form

\[ g(\theta) = \frac{\theta^{a-1} (1-\theta)^{b-1}}{B(a, b)}; \quad 0 < \theta < 1 \tag{6.45} \]

is appropriate. An expert believes that θ is equally likely to lie in each of the
intervals (0, 1/50), (1/50, 1/20) and (1/20, 1), which corresponds to hyperparameter
values of a = 1.112 and b = 24.03 as displayed in Figure 6.3.
Given that 100 boxes are selected at random from the shipment and 3 of these
are found to be defective, we can determine the posterior probability density func-
tion of θ from the first row of Table 6.1 as a Be ( 4.112,121.0 ) distribution, which
is also displayed in Figure 6.3. This enables us to evaluate numerically the
posterior probability that the proportion of defective boxes in the shipment exceeds
1/10 as P(θ > 1/10 | D) = 0.0013 , or 1 in 763.

As a final exercise, suppose we select a further box at random from the ship-
ment and consider the random variable X which takes the value 0 if the box is
functional, or 1 if it is defective. Then Equation 6.28 can be used to determine the
posterior predictive probability mass function for X given the data above as

\[ p(x \mid D) = \begin{cases} 0.967; & x = 0 \\ 0.033; & x = 1 \end{cases} \tag{6.46} \]

so the posterior probability that a randomly chosen box from the shipment is
defective is P(X = 1 | D) = 0.033 , or 1 in 30.
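
The calculations in this case study can be sketched with the Python standard library alone. The conjugate update follows the Bernoulli row of Table 6.1, the predictive probability of a defect is the posterior mean, and the tail probability is approximated by a midpoint rule. The helper names and the number of integration steps are illustrative assumptions.

```python
import math

def beta_pdf(theta, a, b):
    """Density of Be(a, b), evaluated on the log scale for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(theta)
                    + (b - 1) * math.log(1 - theta))

def beta_tail(a, b, threshold, steps=20000):
    """P(theta > threshold) by the midpoint rule on (threshold, 1)."""
    width = (1.0 - threshold) / steps
    return width * sum(beta_pdf(threshold + (k + 0.5) * width, a, b)
                       for k in range(steps))

a_prior, b_prior = 1.112, 24.03       # elicited hyperparameters from the text
n, defects = 100, 3                   # observed sample
a_post = a_prior + defects            # a + n * xbar
b_post = b_prior + (n - defects)      # b + n * (1 - xbar)

p_defective_next = a_post / (a_post + b_post)   # posterior predictive P(X = 1 | D)
p_exceeds_tenth = beta_tail(a_post, b_post, 0.1)
print(f"{p_defective_next:.3f} {p_exceeds_tenth:.4f}")
```

The two printed values can be compared with the figures of 0.033 and 0.0013 reported above.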


Figure 6.3. Prior and posterior probability density functions for digital set top boxes
(curves labelled prior(θ) and posterior(θ), plotted over 0 ≤ θ ≤ 0.2)

6.9.2 Rechargeable Tool Batteries

A manufacturer is interested in assessing the unknown hazard λ of rechargeable
tool batteries for inter-charge operational times measured in hours. Her prior beliefs
are represented by a Ga(10, 40) distribution with probability density function

\[ g(\lambda) = \frac{40^{10} \lambda^{9}}{\Gamma(10)} \exp(-40\lambda); \quad \lambda > 0. \tag{6.47} \]

She runs an experiment for one day, replacing each flat battery by an identical fully
charged battery after failure, so that the total number of failures X has a Poisson
distribution with probability mass function

\[ p(x \mid \lambda) = \frac{(24\lambda)^{x}}{x!} \exp(-24\lambda); \quad x = 0, 1, 2, \ldots. \tag{6.48} \]

In fact, she runs n = 10 such experiments in parallel, giving a sample mean of
x̄ = 6.7 . Referring to the second row of Table 6.1 and transforming to failures per
hour, we see that her posterior beliefs about λ correspond to a Ga ( 77, 280 )
distribution, as displayed in Figure 6.4. The posterior mode is 0.27 , which corre-
sponds to the most likely value of λ .
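
The update can be reproduced with a short Python sketch. Each one-day experiment contributes 24 hours of exposure, so the gamma shape increases by the total number of failures and the rate by the total number of hours. The variable names are illustrative.

```python
a_prior, b_prior = 10.0, 40.0          # Ga(10, 40) prior on the hourly hazard
n_runs, hours_per_run = 10, 24.0       # ten parallel one-day experiments
mean_failures = 6.7                    # sample mean of failures per experiment

a_post = a_prior + n_runs * mean_failures   # shape: prior + total failures
b_post = b_prior + n_runs * hours_per_run   # rate: prior + total hours observed
post_mode = (a_post - 1.0) / b_post         # mode of Ga(a, b) for a > 1

print(round(a_post), round(b_post), round(post_mode, 2))  # 77 280 0.27
```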


Figure 6.4. Prior and posterior probability density functions for rechargeable tool batteries
(curves labelled prior(λ) and posterior(λ), plotted over 0 ≤ λ ≤ 1)

6.10 Conclusions
Bayesian inference represents a methodology for mathematical modelling and
statistical analysis of random variables and unknown parameters. It provides an
excellent alternative to the frequentist approach which gained immense popularity
throughout the twentieth century. Whereas the frequentist approach is based upon
the restrictive inference of point estimates, confidence intervals, significance tests,
p-values and asymptotic approximations, the Bayesian approach is based upon
probability theory and provides complete solutions to practical problems.
Advocates of the Bayesian approach regard it as superior to the frequentist
approach in most circumstances and infinitely superior in some. However, it does
depend upon the existence and specification of subjective probability to represent
individual beliefs, whereas the frequentist approach is almost completely objective.
Partial resolution of these difficulties was addressed in Section 6.6 and continues to
be improved upon, particularly with regard to eliciting subjective prior knowledge
for multiparameter models. The approach advocated here also involves more
analytical and computational complexity, though this is not much of a hindrance with
modern computing power.
In particular, this approach often involves intractable integrals of the forms

$$ g(\theta \mid D) \propto \int L(\theta ; D)\, g(\theta)\, d\theta \quad \text{posterior densities;} \tag{6.49} $$

$$ f(x) = \int f(x \mid \theta)\, g(\theta)\, d\theta \quad \text{predictive densities;} \tag{6.50} $$

$$ E\{u(d, \theta)\} = \int u(d, \theta)\, g(\theta)\, d\theta \quad \text{expected utilities.} \tag{6.51} $$

Monte Carlo simulation can be used to approximate any integral of this form by
generating many pseudo-random numbers $\theta_1, \theta_2, \ldots, \theta_n$ from the prior or posterior
density in the integrand and evaluating the unbiased estimator

$$ \int_{-\infty}^{\infty} s(\theta)\, g(\theta)\, d\theta \approx \frac{1}{n} \sum_{i=1}^{n} s(\theta_i) , \tag{6.52} $$

though more efficient procedures exist.
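As a rough illustration of Equation 6.52, the sketch below draws values of λ from the Ga(77, 280) posterior of Section 6.9.2 and averages two choices of s(λ). The helper `monte_carlo_expectation`, the seed and the sample size are assumptions made for this example; both integrals happen to have closed forms for a gamma density, which gives a check on the simulation.

```python
import math
import random


def monte_carlo_expectation(s, sampler, n):
    """Approximate the integral of s(theta) g(theta) by a sample average (Eq. 6.52)."""
    return sum(s(sampler()) for _ in range(n)) / n


rng = random.Random(12345)   # fixed seed so that the run is repeatable
shape, rate = 77.0, 280.0    # Ga(77, 280) posterior from Section 6.9.2


def draw_lambda():
    # random.gammavariate takes a shape and a SCALE, so the scale is 1 / rate.
    return rng.gammavariate(shape, 1.0 / rate)


n_draws = 100_000

# s(lambda) = lambda gives the posterior mean of the hazard.
mc_mean = monte_carlo_expectation(lambda lam: lam, draw_lambda, n_draws)
exact_mean = shape / rate

# s(lambda) = exp(-lambda) gives the predictive probability of no failure in one hour.
mc_survive = monte_carlo_expectation(lambda lam: math.exp(-lam), draw_lambda, n_draws)
exact_survive = (rate / (rate + 1.0)) ** shape

print(f"posterior mean:     Monte Carlo {mc_mean:.4f}, exact {exact_mean:.4f}")
print(f"P(no failure, 1 h): Monte Carlo {mc_survive:.4f}, exact {exact_survive:.4f}")
```

Plain Monte Carlo of this kind only works when direct sampling from the density is possible; MCMC methods address the cases where it is not.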

Rejection methods are used to generate the pseudo-random numbers and the
most powerful such algorithms are referred to as Markov chain Monte Carlo
(MCMC) methods, the most common of which is Gibbs sampling. At the time of
writing, WinBUGS software is freely available for performing MCMC calculations
and may be downloaded from the internet. Further information about MCMC
techniques, and other analytical and numerical methods for Bayesian computation,
can be found in the textbooks mentioned in the introduction.
We have explained why the solution of many problems arising in maintenance
applications is often hampered by a lack of data, which makes these problems prime
candidates for applying the ideas presented in this chapter. In particular, we suggested and demon-
strated how this methodology might benefit decision making related to modelling
times to failure and scheduling problems, such as determining efficient policies for
scheduling capital replacement and preventive maintenance, determining appropriate
thresholds for condition monitoring and specifying warranty schemes for new pro-
ducts. There is considerable scope for developing these techniques for new appli-
cation areas within maintenance and extending them into related areas.
Potential future projects might consider original products, such as recent inven-
tions or modified lines, and items that are tailored to consumers’ specifications, such
as construction works, for which historical data are not available. Similarly, some
rare, expensive and safety critical systems will have limited failure data with which
to estimate model parameters. Enhancements to warranty analysis are also possible,
particularly in cases where returns data are not readily available, including natural
extensions of the basic concepts to the analysis of extended warranties. Finally,
broader definitions of reliability and maintenance would enable us to apply some of
the preceding ideas to non-industrial systems, such as information networks, social
communities and public services.

6.11 References
Apeland S, Aven T, (2000) Risk based maintenance optimization: foundational issues.
Reliability Engineering & System Safety 67:285–292
Bernardo JM, Smith AFM, (2000) Bayesian Theory. Chichester: Wiley
Bury KV, (1972) Bayesian decision analysis of the hazard rate for a two-parameter Weibull
process. IEEE Transactions on Reliability 21:159–169
Canavos GC, Tsokos CP, (1973) Bayesian estimation of life parameters in the Weibull
distribution. Operations Research 21:755–763
Celeux G, Corset F, Lannoy A, Ricard B, (2006) Designing a Bayesian network for
preventive maintenance from expert opinions in a rapid and reliable way. Reliability
Engineering & System Safety 91:849–856
Chen TM, Popova E, (2000) Bayesian maintenance policies during a warranty period.
Communications in Statistics 16:121–142
Jeffreys H, (1998) Theory of Probability. Oxford: University Press
Kallen MJ, van Noortwijk JM, (2005) Optimal maintenance decisions under imperfect
maintenance. Reliability Engineering & System Safety 90:177–185
Kobbacy KAH, Percy DF, Fawzi BB, (1995) Sensitivity analyses for preventive-maintenance
models. IMA Journal of Mathematics Applied in Business and Industry 6:53–66
Kobbacy KAH, Percy DF, Fawzi BB, (1997) Small data sets and preventive maintenance
modelling. Journal of Quality in Maintenance Engineering 3:136–142
Lee PM, (2004) Bayesian Statistics: an Introduction. London: Arnold
Martz HF, Waller RA, (1982) Bayesian Reliability Analysis. New York: Wiley
Mazzuchi TA, Soyer R, (1996) A Bayesian perspective on some replacement strategies.
Reliability Engineering & System Safety 51:295–303
O’Hagan A, (1998) Eliciting expert beliefs in substantial practical applications. The
Statistician 47:21–35
O’Hagan A, (1994) Kendall's Advanced Theory of Statistics Volume 2B: Bayesian Inference.
London: Arnold
Percy DF, (2002) Bayesian enhanced strategic decision making for reliability. European
Journal of Operational Research 139:133–145
Percy DF, (2004) Subjective priors for maintenance models. Journal of Quality in
Maintenance Engineering 10:221–227
Percy DF, Kobbacy KAH, Fawzi BB, (1997) Setting preventive maintenance schedules
when data are sparse. International Journal of Production Economics 51:223–234
Singpurwalla ND, (1988) An interactive PC-based procedure for reliability assessment
incorporating expert opinion and survival data. Journal of the American Statistical
Association 83:43–51
Soland RM, (1969) Bayesian analysis of the Weibull process with unknown scale and shape
parameters. IEEE Transactions on Reliability 18:181–184
van Noortwijk JM, Dekker A, Cooke RM, Mazzuchi TA, (1992) Expert judgement in
maintenance optimization. IEEE Transactions on Reliability 41:427–432

Reliability Prediction and Accelerated Testing

E. A. Elsayed

7.1 Introduction
Reliability is one of the key quality characteristics of components, products and
systems. It cannot be directly measured and assessed like other quality characteris-
tics but can only be predicted for given times and conditions. Its value depends on
the use conditions of the product as well as the time at which it is to be predicted.
Reliability prediction has a major impact on critical decisions such as the optimum
release time of the product, the type and length of warranty policy and associated
duration and cost, and the determination of the optimum maintenance and
replacement schedules. Therefore, it is important to provide accurate reliability predictions
over time in order to determine accurately the repair, inspection and replacement
strategies of products and systems.
Reliability predictions are based on testing a small number of samples or proto-
types of the product. The difficulty in predicting reliability is further complicated by
many limitations such as the available time to conduct the test and budget
constraints, among others. Testing products at design conditions requires extensive
time, a large number of units and considerable cost. Clearly some kind of reliability
testing, other than testing at normal design conditions, is needed. One of the most
commonly used approaches for testing products within the above stated constraints
is accelerated life testing (ALT), where units or products are subjected to more severe
stress conditions than normal operating conditions to accelerate their failure, and
the test results are then used to predict (extrapolate) the reliability at design
conditions. This chapter addresses the determination of the optimum maintenance
schedule at normal operating conditions while utilizing the results from accelerated testing.
We classify the ALT into two types: accelerated failure time testing (AFTT) and
accelerated degradation testing (ADT). The AFTT is conducted when accelerated
conditions result in the failure of test units without experiencing failure mechanisms
different from those occurring at normal operating conditions and when there are
“enough” units to be tested at different conditions. Moreover, the economics of
conducting AFTT need to be justified as the test is destructive and its duration is
directly related to the reliability of test units and the applied stresses. Finally, testing
at stresses far from normal makes it difficult to predict reliability accurately at
normal conditions as in some cases few or no failures are observed even under
accelerated conditions making reliability inference via failure time analysis highly
inaccurate, if not impossible. On the other hand ADT is a viable alternative to
AFTT when the product’s physical characteristics or performance indices leading to
failure (e.g. drift in resistance value of a resistor, change in light intensity of light
emitting diodes (LED) and loss of strength of a bridge structure) experience
degradation over time. Moreover, significant degradation data can be obtained by
observing degradation of a small number of units over time. Degradation testing
may also be conducted either at normal or accelerated conditions, and no actual
failure is required for reliability inference (Liao 2004).
In this chapter, we address the issues associated with conducting accelerated life
testing and describe how the reliability models obtained from ALT are used in the
determination of the optimum maintenance schedules at normal operating conditions.
This chapter is organized as follows. Section 7.1 provides an overview of the role of
reliability prediction and the importance of accelerated life testing. In Section 7.2 we
present the two most commonly used accelerated life testing types in reliability
engineering. The approaches and models for predicting reliability using accelerated
life testing are described in Section 7.3 while Section 7.4 focuses on mathematical
formulation and solution of the design of accelerated life testing plans. Section 7.5
shows how accelerated life testing is related to maintenance decisions at normal
operating conditions. Models to determine the optimum preventive maintenance
schedules for both failure time models and degradation models are presented in
Section 7.6. A summary of the chapter is presented in Section 7.7. We begin by
describing the ALT types.

7.2 ALT Types

7.2.1 Accelerated Failure Time Testing

It is known that the more reliable the device, the more difficult it is to measure its
reliability. In fact, many devices last so long that life testing at normal operating
conditions is impractical. Furthermore, testing devices or components at normal
operating conditions requires an extensive amount of time and a large number of
devices in order to obtain accurate measures of their reliabilities.
ALT is commonly used to obtain reliability and failure rate estimates of devices
and components in a much shorter time.
A simple way to accelerate the life of many components or products that are
used on a continuous time basis such as tires and light bulbs is to accelerate time
(i.e. run the product at a higher usage rate). It is typically assumed that the number
of cycles, hours, etc., to failure during testing is the same as would be observed at
the normal usage rate. For example, in evaluating the failure time distribution of
light bulbs which are used on the average about 6 h per day, one year of operating
experience can be compressed into three months by using the light bulb for 24 h
every day. The advantage of this type of testing is that no assumptions need to be
made about the relationship of the failure time distributions at both the accelerated
and the normal conditions. However, it is not always true that the number of cycles
to failure at high usage rate is the same as that of the normal usage rate. Moreover,
the effect of aging is ignored. Therefore, this type of testing must be run with
special care to ensure that product operation and stress remain normal in all regards
except usage rate and that the effect of aging is taken into account, if possible.
An alternative to the above accelerated failure time testing is to accelerate
stress (apply stresses more severe than that of the normal conditions) to shorten
product or component life. Typical accelerating stresses are temperature, voltage,
humidity, pressure, vibration, and fatigue cycling. It is important to recognize the
type of stress which indeed accelerates product or component life. Suitable
accelerating stresses need to be determined. One may also wish to know how
product life depends on several stresses operating simultaneously. In accelerated
life testing, the test stress levels should also be controlled. They cannot be so high
as to produce other failure modes that rarely occur, or are unlikely to occur, at normal
conditions. Yet levels should be high enough to yield enough failures similar to
those that exist at the design (operating) stress. The limited range of the stress
levels needs to be specified in the test plans to avoid invalid or biased estimates of
reliability. The stress application loading can be constant, increase (or decrease)
continuously or in steps, vary cyclically, or vary randomly or combinations of
these loadings. The choice of such stress loading depends on how the product is
loaded in service and on practical and theoretical limitations (Shyur 1996).

7.2.2 Accelerated Degradation Testing

In some cases, applying high stresses might not induce failures or result in
sufficient data and reliability inference via failure time analysis becomes highly
inaccurate, if not impossible. However, if a product’s physical characteristics or
performance indices leading to failure experience degradation over time then
degradation analysis could be a viable alternative to traditional failure time ana-
lysis. The advantages of degradation modeling over time-to-failure modeling are
significant. Indeed, degradation data may provide more reliability information than
would otherwise be available from time-to-failure data with censoring. Moreover,
degradation testing may be conducted either at normal or accelerated conditions,
and no actual failure is required for reliability inference.
Degradation data needed for reliability inference may be obtained from two
categories: the first is field application and the second is degradation testing
experiments. The first category requires an extensive data collection system over a
long time. Since the collected data are often subject to highly random stress environ-
ment and human errors, the data may exhibit significant volatility and sometimes its
accuracy is questionable, limiting its use for reliability inference and prediction. The
second category, prognostics, is a process of predicting the future state of a product
(or component). Degradation data analysis might be used in this process to mini-
mize field failure and reduce the life-cycle expenses by recommending condition-
based maintenance on observed components or systems. Moreover, degradation
testing is usually conducted to demonstrate products’ reliability and helps in
revealing the main failure mechanisms and the major failure-causing stress factors.
It may be conducted at or close to the normal operating conditions to provide more
accurate and precise information for reliability estimates. Yet, to save time and cost,
accelerated degradation testing (ADT) is commonly used to obtain immediate data
for extrapolating reliability under normal conditions. ADT is conducted by testing
units (products or components) at accelerated conditions and measuring their
degradation indicators over time. The test can be terminated once “enough” observations are
obtained without causing destruction of the test unit (nondestructive testing) if
possible. For general purposes, a degradation model along with inference procedure
that can utilize both field degradation data and degradation testing data is preferred,
and its potential ability to be embedded into the development of systems for
prognostics purposes is of additional value to the manufacturers.
Reliability assessment using ADT experiments requires an appropriate degrada-
tion model, a carefully designed test plan and insightful investigation of the field
operating environment in order to achieve high accuracy of the reliability estimates.
An appropriate degradation model is the one that accurately interprets the effects
of the stresses on the degradation process of a product based on its physical
properties and the related probability distributions. On the other hand, a carefully
designed test plan may improve the accuracy of the developed degradation model
and the efficiency of the experiments. The design of the test plan consists of
objective functions, several constraints and decision variables such as stress levels,
sample allocation ratios at stress levels, frequency of observing and measuring
degradation and test termination time. Inappropriate assignments of these decision
variables in practice result in inaccurate reliability estimates. Moreover, it is a
challenging and critical issue to consider the stochastic nature of the normal (field)
operating conditions in reliability inference from ADT to the normal operating
conditions. When field stresses are not deterministic, which is usually the case, their
uncertainty will potentially influence the degradation process of the product. If such
variations and extremes are ignored in a reliability model, an inaccurate estimate
will result, sometimes misleading the judgment for reliability requirements,
warranty decisions and the maintenance plans. Therefore, it is important to design
robust test plans subject to constraints. The plans should be robust to inaccurate
estimates of the model parameters, to misspecification of the underlying
distributions and to misspecification of the underlying stress-life relationship. Currently,
the literature relating ADT to field applications is sparse. Without scientific guidance
from the literature, it is hard to produce an appropriately robust design that tolerates the
extremes while avoiding “over-design” of the product.
In both accelerated failure time (usage or stress) and degradation testing
(normal or stress) robust models that relate the results of the test to the normal
operating conditions (or other conditions) are needed. In the following section, we
describe such models and discuss their assumptions and limitations.

7.3 ALT Reliability Estimation Models

The accuracy of reliability estimation depends on the models that relate the failure
data under severe conditions, or high stress, to that at normal operating conditions,
or design stress. Elsayed (1996) classifies these models into three groups:
statistics-based models, physics-statistics models, and physics-experimental models.
Furthermore, he classifies the statistics models into two sub-categories: parametric
and nonparametric models. We limit the models in this chapter to the statistics
models as they are more general while the physics-statistics and physics-experi-
mental models are usually developed for particular applications such as fatigue
testing, creep testing and electromigration models.

7.3.1 Statistics-based Models: Parametric

The failure times at each stress level are used to determine the most appropriate
failure time distribution along with its parameters. We refer to these models as
AFT (accelerated failure time). Parametric models assume that the failure times at
different stress levels are related to each other by a common failure time distribu-
tion with different parameters. Usually, the shape parameter of the failure time
distribution remains unchanged for all stress levels, but the scale parameter may
present a multiplicative relationship with the stress levels. For practical purposes,
we assume that the time scale transformation (also referred to as acceleration
factor, AF > 1 ) is constant, which implies that we have a true linear acceleration.
Thus the relationships between the accelerated and normal conditions are sum-
marized as follows (Tobias and Trindade 1986; Elsayed 1996). Let the subscripts o
and s refer to the operating conditions and stress conditions, respectively.
The relationship between the time to failure at operating conditions and stress
conditions is

$$ t_o = AF \times t_s . \tag{7.1} $$

The cumulative distribution functions are related as

$$ F_o(t) = F_s\!\left(\frac{t}{AF}\right) . \tag{7.2} $$

The probability density functions are related as

$$ f_o(t) = \left(\frac{1}{AF}\right) f_s\!\left(\frac{t}{AF}\right) . \tag{7.3} $$

The failure rates are given by

$$ h_o(t) = \left(\frac{1}{AF}\right) h_s\!\left(\frac{t}{AF}\right) . \tag{7.4} $$
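A minimal sketch of Equations 7.2 to 7.4 follows, assuming purely for illustration that failure times at the stress condition follow a Weibull distribution. The acceleration factor, the Weibull parameters and the function names are hypothetical values chosen for this example, not quantities from the chapter. Under linear acceleration the operating-condition distribution is again Weibull with the same shape and a scale multiplied by AF, which the code uses as a consistency check.

```python
import math


def weibull_cdf(t, shape, scale):
    return 1.0 - math.exp(-((t / scale) ** shape))


def weibull_pdf(t, shape, scale):
    return (shape / scale) * (t / scale) ** (shape - 1.0) * math.exp(-((t / scale) ** shape))


def operating_cdf(t, af, shape, scale_s):
    """Equation 7.2: F_o(t) = F_s(t / AF)."""
    return weibull_cdf(t / af, shape, scale_s)


def operating_pdf(t, af, shape, scale_s):
    """Equation 7.3: f_o(t) = (1 / AF) f_s(t / AF)."""
    return weibull_pdf(t / af, shape, scale_s) / af


def operating_hazard(t, af, shape, scale_s):
    """Equation 7.4: h_o(t) = (1 / AF) h_s(t / AF)."""
    t_s = t / af
    h_s = weibull_pdf(t_s, shape, scale_s) / (1.0 - weibull_cdf(t_s, shape, scale_s))
    return h_s / af


# Hypothetical values: AF = 4, Weibull shape 1.8 and scale 500 h at the stress condition.
AF, SHAPE, SCALE_S = 4.0, 1.8, 500.0
t = 1000.0  # hours at operating conditions

print("F_o(t) from Eq. 7.2:       ", operating_cdf(t, AF, SHAPE, SCALE_S))
print("Weibull with scale AF*eta: ", weibull_cdf(t, SHAPE, AF * SCALE_S))
print("h_o(t) from Eq. 7.4:       ", operating_hazard(t, AF, SHAPE, SCALE_S))
```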
The acceleration factor is obtained by determining the median lives of units
tested at two different accelerated stresses and extrapolating to the median life at
normal operating stress. It can also be estimated by replacing the medians with
some quartiles.
The accuracy of the reliability estimates suffers when small samples are tested
at the stress conditions since the determination of proper failure time distribution
that describes these failures becomes difficult. More importantly, the assumption of
having the same failure time distributions at different stress levels is difficult to
justify especially when small numbers of failures are observed. In these cases, it is
more appropriate to use nonparametric models as described next.

7.3.2 Statistics-based Models: Nonparametric

Nonparametric models relax the requirement of the common failure time distribution,
i.e., no common failure time distribution is required among all stress levels. Several
nonparametric models have been developed and validated in recent years. We
describe these models below.

Proportional Hazards Model

Cox’s Proportional Hazards (PH) model (Cox 1972, 1975) is the most popular non-
parametric model. It has become the standard nonparametric regression model for
accelerated life testing in the past few years. The PH model is distribution-free
requiring only the ratio of hazard rates between two stress levels to be constant
with time.
The proportional hazards model has the following form:

$$ \lambda(t; z) = \lambda_0(t) \exp(\beta z) \tag{7.5} $$

The baseline hazard function $\lambda_0(t)$ is an arbitrary function; it is modified
multiplicatively by the covariates (i.e. applied stresses).
Elsayed and Zhang (2006) assume λ 0 (t ) to be linear with time: λ 0 (t ) = γ 0 + γ 1t .
Substituting λ 0 (t ) into the PH model, we obtain: λ (t ; z ) = (γ 0 + γ 1t ) exp( β z ) ,
where z = ( z1 , z2 ,… z p )T is a column vector of the covariates (or applied stresses).
For ALT, the column vector represents the stresses used in the test and/or their
interactions. β = ( β1 , β 2 ,… β p ) is a row vector of the unknown coefficients corre-
sponding to the covariates z . These coefficients can be estimated using a partial
likelihood estimation procedure.
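To make the linear-baseline PH model concrete, the sketch below evaluates the hazard $(\gamma_0 + \gamma_1 t)\exp(\beta z)$ and the reliability obtained by integrating it, $R(t; z) = \exp[-(\gamma_0 t + \gamma_1 t^2/2)\exp(\beta z)]$. The parameter values and function names are hypothetical and are not estimates from any data set; in practice the coefficients would come from the estimation procedure described above.

```python
import math


def linear_predictor(beta, z):
    """Inner product of the coefficient vector and the covariate (stress) vector."""
    return sum(b * x for b, x in zip(beta, z))


def ph_hazard(t, z, beta, gamma0, gamma1):
    """Hazard of the PH model (Eq. 7.5) with linear baseline gamma0 + gamma1 * t."""
    return (gamma0 + gamma1 * t) * math.exp(linear_predictor(beta, z))


def ph_reliability(t, z, beta, gamma0, gamma1):
    """R(t; z) = exp(-cumulative hazard), integrating the linear baseline analytically."""
    cumulative_baseline = gamma0 * t + 0.5 * gamma1 * t * t
    return math.exp(-cumulative_baseline * math.exp(linear_predictor(beta, z)))


# Hypothetical values: one coded stress variable, design level 1.0 and test level 2.0.
BETA = (0.8,)
GAMMA0, GAMMA1 = 1.0e-4, 2.0e-6
Z_DESIGN, Z_TEST = (1.0,), (2.0,)

for t in (100.0, 500.0, 1000.0):
    ratio = ph_hazard(t, Z_TEST, BETA, GAMMA0, GAMMA1) / ph_hazard(t, Z_DESIGN, BETA, GAMMA0, GAMMA1)
    print(
        f"t = {t:6.0f}  hazard ratio = {ratio:.4f}  "
        f"R(design) = {ph_reliability(t, Z_DESIGN, BETA, GAMMA0, GAMMA1):.4f}  "
        f"R(test) = {ph_reliability(t, Z_TEST, BETA, GAMMA0, GAMMA1):.4f}"
    )
```

The hazard ratio printed in each row is the same, which is the proportional hazards assumption expressed numerically.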
This model usually produces “good” reliability estimation with failure data for
which the proportional hazards assumption holds and even when it does not exactly
hold.

Extended Linear Hazards Regression Model

The PH and AFT models have different assumptions. The only model that satisfies
both assumptions is the Weibull regression model (Kalbfleisch and Prentice 2002).
For generalization, the Extended Hazard Regression (EHR) model (Ciampi and
Etezadi-Amoli 1985; Etezadi-Amoli and Ciampi 1987; Shyur et al. 1999) is
proposed to combine the PH and AFT models into one form:

$$ \lambda(t; z) = \lambda_0\!\left(e^{z'\beta} t\right) \exp(z'\alpha) \tag{7.6} $$

The unknowns of this model are the regression coefficients α , β and the
unspecified baseline hazard function λ 0 (t ) . The model reflects that the covariate z
has both the time scale changing effect and hazard multiplicative effect. It becomes
the PH model when β = 0 and the AFT model when α = β .
Elsayed et al. (2006) propose a new model called Extended Linear Hazard
Regression (ELHR) model. The ELHR model (e.g., with one covariate) assumes
those coefficients to be changing linearly with time:

$$ \lambda(t; z) = \lambda_0\!\left(t e^{(\beta_0 + \beta_1 t) z}\right) \exp\big((\alpha_0 + \alpha_1 t) z\big) \tag{7.7} $$

The model considers the proportional hazards effect, time scale changing effect
as well as time-varying coefficients effect. It encompasses all previously developed
models as special cases. It may provide a refined model fit to failure time data and
a better representation regarding complex failure processes.
Since the covariate coefficients and the unspecified baseline hazard cannot be
expressed separately, the partial likelihood method is not suitable for estimating the
unknown parameters. Elsayed et al. (2006) propose the maximum likelihood
method which requires the baseline hazard function to be specified in a parametric
form. In the EHR model, the baseline hazards function has two specific forms; one
is a quadratic function and the other is a quadratic spline. In the proposed ELHR
model, we assume the baseline hazard function λ 0 (t ) to be a quadratic function:

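$$ \lambda_0(t) = \gamma_0 + \gamma_1 t + \gamma_2 t^2 \tag{7.8} $$

Substituting $\lambda_0(t)$ into the ELHR model yields

$$ \lambda(t; z) = \gamma_0 e^{\alpha_0 z + \alpha_1 z t} + \gamma_1 t e^{\theta_0 z + \theta_1 z t} + \gamma_2 t^2 e^{\omega_0 z + \omega_1 z t} \tag{7.9} $$

where

$$ \theta_0 = \alpha_0 + \beta_0 , \quad \theta_1 = \alpha_1 + \beta_1 , \quad \omega_0 = \alpha_0 + 2\beta_0 , \quad \omega_1 = \alpha_1 + 2\beta_1 . $$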

The cumulative hazard rate function is obtained as

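$$
\begin{aligned}
\Lambda(t; z) &= \int_0^t \lambda(u; z)\, du
 = \int_0^t \gamma_0 e^{\alpha_0 z + \alpha_1 z u}\, du
 + \int_0^t \gamma_1 u e^{\theta_0 z + \theta_1 z u}\, du
 + \int_0^t \gamma_2 u^2 e^{\omega_0 z + \omega_1 z u}\, du \\
&= \frac{\gamma_0}{\alpha_1 z} e^{\alpha_0 z + \alpha_1 z t} - \frac{\gamma_0}{\alpha_1 z} e^{\alpha_0 z}
 + \frac{\gamma_1 t}{\theta_1 z} e^{\theta_0 z + \theta_1 z t}
 - \frac{\gamma_1}{(\theta_1 z)^2} e^{\theta_0 z + \theta_1 z t}
 + \frac{\gamma_1}{(\theta_1 z)^2} e^{\theta_0 z} \\
&\quad + \frac{\gamma_2 t^2}{\omega_1 z} e^{\omega_0 z + \omega_1 z t}
 - \frac{2\gamma_2 t}{(\omega_1 z)^2} e^{\omega_0 z + \omega_1 z t}
 + \frac{2\gamma_2}{(\omega_1 z)^3} e^{\omega_0 z + \omega_1 z t}
 - \frac{2\gamma_2}{(\omega_1 z)^3} e^{\omega_0 z}
\end{aligned}
$$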

The reliability function $R(t; z)$ and the probability density function $f(t; z)$
are obtained as

$$ R(t; z) = \exp(-\Lambda(t; z)) $$

$$ f(t; z) = \lambda(t; z) \exp(-\Lambda(t; z)) $$
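The closed-form cumulative hazard is easy to mistype, so it can be worth checking it against a numerical integral of the hazard in Equation 7.7. The sketch below does this with Simpson's rule for a single covariate. All parameter values and function names are hypothetical, and the closed form assumes that $\alpha_1 z$, $\theta_1 z$ and $\omega_1 z$ are all nonzero.

```python
import math


def elhr_hazard(t, z, gamma, alpha, beta):
    """Equation 7.7 with the quadratic baseline hazard of Equation 7.8."""
    g0, g1, g2 = gamma
    a0, a1 = alpha
    b0, b1 = beta
    scaled_t = t * math.exp((b0 + b1 * t) * z)
    baseline = g0 + g1 * scaled_t + g2 * scaled_t ** 2
    return baseline * math.exp((a0 + a1 * t) * z)


def elhr_cum_hazard(t, z, gamma, alpha, beta):
    """Closed-form cumulative hazard; requires nonzero a1*z, theta1*z and omega1*z."""
    g0, g1, g2 = gamma
    a0, a1 = alpha
    b0, b1 = beta
    th0, th1 = a0 + b0, a1 + b1
    w0, w1 = a0 + 2.0 * b0, a1 + 2.0 * b1
    ka, kb, kc = a1 * z, th1 * z, w1 * z
    term0 = g0 * math.exp(a0 * z) * (math.exp(ka * t) - 1.0) / ka
    term1 = g1 * math.exp(th0 * z) * ((t / kb - 1.0 / kb**2) * math.exp(kb * t) + 1.0 / kb**2)
    term2 = g2 * math.exp(w0 * z) * (
        (t * t / kc - 2.0 * t / kc**2 + 2.0 / kc**3) * math.exp(kc * t) - 2.0 / kc**3
    )
    return term0 + term1 + term2


def elhr_reliability(t, z, gamma, alpha, beta):
    return math.exp(-elhr_cum_hazard(t, z, gamma, alpha, beta))


def simpson(f, lo, hi, n=2000):
    """Composite Simpson's rule with an even number n of subintervals."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * f(lo + i * h)
    return total * h / 3.0


# Hypothetical parameter values for illustration only.
GAMMA = (0.01, 0.002, 0.0003)
ALPHA = (0.5, 0.05)
BETA = (0.2, 0.02)
Z, T = 1.5, 3.0

closed_form = elhr_cum_hazard(T, Z, GAMMA, ALPHA, BETA)
numerical = simpson(lambda u: elhr_hazard(u, Z, GAMMA, ALPHA, BETA), 0.0, T)
print("closed form:", closed_form)
print("numerical:  ", numerical)
print("reliability:", elhr_reliability(T, Z, GAMMA, ALPHA, BETA))
```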

Although the ELHR model is developed based on the distribution-free concept,
a close investigation of the model reveals its capability of capturing the features of
commonly used failure time distributions. The main limitation of this model is that
“good” estimates of the many parameters of the model require a large number of
test units.

Proportional Mean Residual Life Model

Oakes and Dasu (1990) originally propose the concept of the Proportional Mean
Residual Life (PMRL) by analogy with PH model. Two survivor distributions
F (t ) and F0 (t ) are said to have PMRL if

$$ e(x) = \theta e_0(x) \tag{7.10} $$

where e( x) is the mean residual life at time x .
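For readers who want to compute a mean residual life directly, the sketch below uses the standard definition $e(x) = \frac{1}{R(x)} \int_x^\infty R(u)\, du$ with a truncated trapezoidal integral, and then applies the proportionality of Equation 7.10. The Weibull baseline, the value of θ, the truncation point and the function names are assumptions made for this illustration.

```python
import math


def mean_residual_life(reliability, x, upper, steps=20000):
    """e(x) = integral of R(u) from x to infinity, divided by R(x).

    The infinite upper limit is truncated at `upper`, which must be large
    enough for R(upper) to be negligible; the integral uses the trapezoid rule.
    """
    h = (upper - x) / steps
    area = 0.5 * (reliability(x) + reliability(upper))
    for i in range(1, steps):
        area += reliability(x + i * h)
    return area * h / reliability(x)


def baseline_reliability(t):
    # Hypothetical baseline: Weibull with shape 2 and scale 100 hours.
    return math.exp(-((t / 100.0) ** 2))


def pmrl(x, theta):
    """Equation 7.10: the mean residual life is theta times the baseline value."""
    return theta * mean_residual_life(baseline_reliability, x, upper=1000.0)


THETA = 0.4  # hypothetical proportionality constant for a harsher condition

for x in (0.0, 50.0, 100.0):
    e0 = mean_residual_life(baseline_reliability, x, upper=1000.0)
    print(f"x = {x:5.0f}  baseline MRL = {e0:7.2f}  scaled MRL = {pmrl(x, THETA):7.2f}")
```

At x = 0 the baseline value equals the mean life of the Weibull distribution, which provides a quick check on the numerical integration.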

We extend the model to a more general framework with a covariate vector Z
(applied stress)

$$ e(t \mid z) = \exp(\beta^T z)\, e_0(t) \tag{7.11} $$

We refer to this model as the proportional mean residual life regression model,
which is used to model accelerated life testing. Clearly $e_0(t)$ serves as the MRL
corresponding to a baseline reliability function $R_0(t)$ and is called the baseline
mean residual function; $e(t \mid z)$ is the conditional mean residual life function of
$T - t$ given $T > t$ and $Z = z$, where $z^T = (z_1, z_2, \ldots, z_p)$ is the vector of covariates,
$\beta^T = (\beta_1, \beta_2, \ldots, \beta_p)$ is the vector of coefficients associated with the covariates,
and p is the number of covariates. Typically, we can experimentally obtain
$\{(t_i, z_i),\ i = 1, 2, \ldots, n\}$, the set of failure times and vectors of covariates for each
unit (Zhao and Elsayed, 2005). The main assumption of this model is the
proportionality of mean residual lives with applied stresses. In other words, the mean
residual life of a unit subjected to high stress is proportional to the mean residual
life of a unit subjected to low stress.

Proportional Odds Model

In many applications, however, it is often unreasonable to assume the effects of
covariates on the hazard rates remain fixed over time. Brass (1971) observes that
the ratio of the death rates, or hazard rates, of two populations under different
stress levels (for example, one population for smokers and the other for non-
smokers) is not constant with age, or time, but follows a more complicated course,
in particular converging closer to unity for older people. So the PH model is not
suitable for this case. Brass (1974) proposes a more realistic model: the pro-
portional odds (PO) model. The proportional odds model has been successfully
used in categorical data analysis (McCullagh 1980; Agresti and Lang 1993) and
survival analysis (Hannerz 2001) in the medical fields. The PO model makes a distinctly
different assumption on proportionality, and is complementary to the PH model. It
has not been used in reliability analysis of accelerated life testing so far. Zhang and
Elsayed (2005) extend this model for reliability estimates using ALT data.
We describe the PO model as follows. Let T > 0 be a failure time associated
with stress level z with cumulative distribution $F(t; z)$, and let the ratio
$F(t; z) / [1 - F(t; z)]$, or equivalently $[1 - R(t; z)] / R(t; z)$, be the odds on failure by
time t. The PO model is then expressed as

$$ \frac{F(t; z)}{1 - F(t; z)} = \exp(\beta z)\, \frac{F_0(t)}{1 - F_0(t)} \tag{7.12} $$

where $F_0(t) \equiv F(t; z = 0)$ is the baseline cumulative distribution function and β is
the unknown regression parameter. Let $\theta(t; z)$ denote the odds function; then the
above PO model is transformed to

$$ \theta(t; z) = \exp(\beta z)\, \theta_0(t) \tag{7.13} $$

where $\theta_0(t) \equiv \theta(t; z = 0)$ is the baseline odds function.

For two failure time samples with stress levels z1 and z2 , the difference
between the respective log odds functions is

log[θ (t ; z1 )] − log[θ (t ; z2 )] = β ( z1 − z2 ) ,

which is independent of the baseline odds function $\theta_0(t)$ and the time t. Hence,
the odds functions are constantly proportional to each other. The baseline odds
function could be any monotone increasing function of time t with the property
$\theta_0(0) = 0$. When $\theta_0(t) = t^{\varphi}$, the PO model presented by Equation 7.13 becomes the
log-logistic accelerated failure time model (Bennett 1983), which is a special case
of the general PO models.
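The constancy of the log odds difference can be seen numerically. The sketch below uses the log-logistic special case $\theta_0(t) = t^{\varphi}$ in Equation 7.13; the values of β, φ and the two stress levels, as well as the function names, are hypothetical and chosen only for illustration.

```python
import math


def odds(t, z, beta, phi):
    """Equation 7.13 with the log-logistic baseline odds theta_0(t) = t ** phi."""
    return math.exp(beta * z) * t ** phi


def po_cdf(t, z, beta, phi):
    """Invert odds = F / (1 - F) to recover the failure probability F(t; z)."""
    th = odds(t, z, beta, phi)
    return th / (1.0 + th)


def log_odds_difference(t, z1, z2, beta, phi):
    return math.log(odds(t, z1, beta, phi)) - math.log(odds(t, z2, beta, phi))


# Hypothetical values for illustration only.
BETA, PHI = 1.2, 0.9
Z_HIGH, Z_LOW = 2.0, 0.5

for t in (5.0, 50.0, 500.0):
    print(
        f"t = {t:5.0f}  F(high) = {po_cdf(t, Z_HIGH, BETA, PHI):.4f}  "
        f"F(low) = {po_cdf(t, Z_LOW, BETA, PHI):.4f}  "
        f"log odds difference = {log_odds_difference(t, Z_HIGH, Z_LOW, BETA, PHI):.4f}"
    )
```

The last column is the same at every time and equals β(z1 − z2), even though the two failure probabilities themselves converge towards one.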
In order to utilize the PO model in predicting reliability at normal operating
conditions, it is important that both the baseline function and the covariate
parameter, β , be estimated accurately. Since the baseline odds function of the
general PO models could be any monotone increasing function, it is important to
define a viable baseline odds function structure to approximate most, if not all, of
the possible odds function. In order to find such a “universal” baseline odds
function, we investigate the properties of odds function and its relation to the
hazard rate function.
The odds function $\theta(t)$ is defined by

$$ \theta(t) = \frac{F(t)}{1 - F(t)} = \frac{1 - R(t)}{R(t)} = \frac{1}{R(t)} - 1 \tag{7.14} $$

From the properties of the reliability function and its relation to the odds function
shown in Equation 7.14, we can easily derive the following properties of the odds
function $\theta(t)$:

1. $\theta(0) = 0$, $\theta(\infty) = \infty$
2. $\theta(t)$ is a monotonically increasing function of time
3. $\theta(t) = \dfrac{1 - \exp[-\Lambda(t)]}{\exp[-\Lambda(t)]} = \exp[\Lambda(t)] - 1$, and $\Lambda(t) = \ln[\theta(t) + 1]$
4. $\lambda(t) = \dfrac{\theta'(t)}{\theta(t) + 1}$
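These properties can be verified numerically for any particular distribution. The sketch below does so for a Weibull cumulative hazard, using $\theta(t) = \exp[\Lambda(t)] - 1$, which follows from Equation 7.14 with $R(t) = \exp[-\Lambda(t)]$. The Weibull parameters, the finite-difference step and the function names are assumptions made for this example.

```python
import math

ETA, K = 100.0, 1.5  # hypothetical Weibull scale and shape


def cum_hazard(t):
    return (t / ETA) ** K


def hazard(t):
    return (K / ETA) * (t / ETA) ** (K - 1.0)


def odds(t):
    """Equation 7.14: theta(t) = (1 - R(t)) / R(t) with R(t) = exp(-cum_hazard(t))."""
    reliability = math.exp(-cum_hazard(t))
    return (1.0 - reliability) / reliability


def odds_derivative(t, h=1e-4):
    """Central finite-difference approximation to theta'(t)."""
    return (odds(t + h) - odds(t - h)) / (2.0 * h)


t = 80.0
print("property 1: theta(0) =", odds(0.0))
print("property 3: theta(t) =", odds(t), " exp(Lambda) - 1 =", math.exp(cum_hazard(t)) - 1.0)
print("property 3: ln(theta + 1) =", math.log(odds(t) + 1.0), " Lambda(t) =", cum_hazard(t))
print("property 4: theta'/(theta + 1) =", odds_derivative(t) / (odds(t) + 1.0), " lambda(t) =", hazard(t))
```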

Further investigation of such a “universal” odds function shows that it can be
approximated by a polynomial function.
An appropriate ALT model is important since it explains the influences of the
stresses on the expected life of a product based on its physical properties and the
related statistical properties. On the other hand, a carefully designed test plan
improves the accuracy and efficiency of the reliability estimation. The design of an
accelerated life testing plan consists of the formulation of objective function, the
determination of constraints and the definition of the decision variables such as
stress levels, sample size, allocation of test units to each stress level, stress level
changing time and test termination time, and others. Inappropriate values of the
decision variables result in inaccurate reliability estimates and/or unnecessary use
of test resources. Thus it is important to design test plans to minimize the objective
function under specific time and cost constraints.

7.4 Design of Accelerated Life Testing Plans

Conducting an accelerated life test (ALT) requires the determination or development
of a reliability inference model that relates the failure data at stress conditions
to those at design or operating conditions. Moreover, an accelerated test plan needs to be
developed to obtain appropriate and sufficient information in order to estimate
reliability performance accurately at operating conditions. A test plan requires the
identification of the type of stresses to be applied, stress levels, methods of stress
application (constant, ramp, cyclic), number of units at every stress level, minimum
number of failures at every stress level, optimum test duration, frequency of test data
collection and other test parameters. Indeed, without an optimum test plan, it is likely
that a long sequence of expensive and time-consuming tests will be conducted, which might
cause delays in product release or, in some cases, the termination of the entire product.
In this section, we describe the procedure for designing an optimum test plan
based on the proportional hazards model, followed by a numerical example. Optimum
test plans based on other ALT models can be developed in a similar fashion.

7.4.1 Design of ALT Plans

An ALT plan requires the determination of the type of stress, method of applying
stress, stress levels, the number of units to be tested at each stress level and an
applicable accelerated life testing model that relates the failure times at accelerated
conditions to those at normal conditions.
When designing an ALT, we need to address the following issues: (a) select the
stress types to use in the experiment, (b) determine the stress levels for each stress
type selected, (c) determine the proportion of devices to be allocated to each stress
level (Elsayed and Jiao 2002). We refer the reader to Meeker and Escobar (1998)
and Nelson (2004) for other approaches for the design of ALT plans.
We consider the selection of the stress level zi and the proportion of devices pi to
allocate for each zi such that the most accurate reliability estimate at use conditions
zD can be obtained. We consider two types of censoring. Type I censoring involves
running each test unit until a prespecified time; the censoring times are fixed and
the number of failures is random. Type II censoring involves simultaneously testing
units until a prespecified number of them fail; the censoring time is random while
the number of failures is fixed. We use the following notation:

ln               Natural logarithm
ML               Maximum likelihood
n                Total number of test units
zH, zM, zL       High, medium and low stress levels, respectively
zD               Specified design stress
p1, p2, p3       Proportion of test units allocated to zL, zM and zH, respectively
T                Prespecified period of time over which the reliability estimate is of
                 interest
R(t; z)          Reliability at time t, for given z
f(t; z)          Pdf at time t, for given z
F(t; z)          Cdf at time t, for given z
Λ(t; z)          Cumulative hazard function at time t, for given z
λ0(t)            Unspecified baseline hazard function at time t
We assume the baseline hazard function λ0(t) to be linear with time:

$$\lambda_0(t) = \gamma_0 + \gamma_1 t$$

Substituting λ0(t) into the PH model given by Equation 7.5, we obtain

$$\lambda(t; z) = (\gamma_0 + \gamma_1 t)\exp(\beta z)$$

We obtain the corresponding cumulative hazard function Λ(t; z) and the
variance of the hazard function as

$$\Lambda(t; z) = \left(\gamma_0 t + \frac{\gamma_1 t^2}{2}\right) e^{\beta z}$$

$$\mathrm{Var}\left[(\hat\gamma_0 + \hat\gamma_1 t)\, e^{\hat\beta z_D}\right] = \left(\mathrm{Var}[\hat\gamma_0] + \mathrm{Var}[\hat\gamma_1]\, t^2\right) e^{2(\beta z_D + \mathrm{Var}[\hat\beta] z_D^2)} + e^{2\beta z_D + \mathrm{Var}[\hat\beta] z_D^2}\left(e^{\mathrm{Var}[\hat\beta] z_D^2} - 1\right)(\gamma_0 + \gamma_1 t)^2$$

Formulation of the Test Plan
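As a numerical aid for the formulation that follows, the cumulative hazard and the variance expression given above can be coded directly. This is a minimal sketch: the function and argument names are hypothetical, and the parameter variances are simply passed in rather than derived from a Fisher information matrix. The two terms have the form one would obtain by treating the estimate of β as approximately normal and independent of the estimates of γ0 and γ1; that reading is an assumption, not something stated in the text.

```python
import math


def cumulative_hazard(t, z, g0, g1, beta):
    """Lambda(t; z) = (g0*t + g1*t**2/2) * exp(beta*z) for the linear baseline hazard."""
    return (g0 * t + 0.5 * g1 * t * t) * math.exp(beta * z)


def hazard_variance(t, z, g0, g1, beta, var_g0, var_g1, var_beta):
    """Variance of (g0_hat + g1_hat*t) * exp(beta_hat*z), term by term as written above."""
    s = var_beta * z * z  # Var[beta_hat] * z**2 appears in every exponent
    first = (var_g0 + var_g1 * t * t) * math.exp(2.0 * (beta * z + s))
    second = math.exp(2.0 * beta * z + s) * math.expm1(s) * (g0 + g1 * t) ** 2
    return first + second


if __name__ == "__main__":
    # Hypothetical planning values, for illustration only.
    print(cumulative_hazard(2.0, 1.0, 0.1, 0.5, 0.3))
    print(hazard_variance(2.0, 1.0, 0.1, 0.5, 0.3, 0.01, 0.02, 0.005))
```

When the variance of the β estimate is zero, the second term vanishes and only the scaled variance of the baseline estimate remains, which is a quick sanity check on the coding.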

Under the constraints of available test units, test time and specification of the minimum
number of failures at each stress level, the problem is to allocate stress levels and test
units optimally so that the asymptotic variance of the hazard rate estimate at normal
conditions is minimized over a prespecified period of time T. If we consider three
stress levels, then the optimal decision variables (zL*, zM*, p1*, p2*, p3*) are obtained by
solving the following optimization problem with a nonlinear objective function and
both linear and nonlinear constraints:

$$\min \int_0^T \mathrm{Var}\left[(\hat\gamma_0 + \hat\gamma_1 t)\, e^{\hat\beta z_D}\right] dt$$

subject to

$$\Sigma = F^{-1}$$

$$0 < p_i < 1, \quad i = 1, 2, 3$$

$$\sum_{i=1}^{3} p_i = 1$$

$$z_D < z_L < z_M < z_H$$

$$n p_i \Pr[t \le \tau \mid z_i] \ge \mathrm{MNF}, \quad i = 1, 2, 3$$

where MNF is the minimum number of failures and Σ is the inverse of
Fisher's information matrix F.
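A candidate plan can be screened against the constraints before any optimization is attempted. The sketch below is illustrative only: it assumes the PH model with the linear baseline hazard, reads τ as the test (censoring) time, assumes the proportions sum to one, and uses hypothetical function names. It checks feasibility and does not compute the Fisher information matrix or the objective function.

```python
import math


def failure_probability(tau, z, g0, g1, beta):
    """Pr[t <= tau | z] = 1 - exp(-Lambda(tau; z)) with Lambda = (g0*tau + g1*tau**2/2)*exp(beta*z)."""
    cum_hazard = (g0 * tau + 0.5 * g1 * tau ** 2) * math.exp(beta * z)
    return -math.expm1(-cum_hazard)


def plan_is_feasible(n, tau, z_design, levels, proportions, mnf, g0, g1, beta):
    """Check the ordering, proportion and minimum-failure constraints for one plan.

    levels is (z_low, z_mid, z_high); proportions is (p1, p2, p3) in the same order.
    """
    z_low, z_mid, z_high = levels
    if not (z_design < z_low < z_mid < z_high):
        return False
    if any(not (0.0 < p < 1.0) for p in proportions):
        return False
    if abs(sum(proportions) - 1.0) > 1e-9:
        return False
    # Expected number of failures at each level must reach the required minimum.
    return all(
        n * p * failure_probability(tau, z, g0, g1, beta) >= mnf
        for p, z in zip(proportions, levels)
    )


if __name__ == "__main__":
    # Hypothetical planning values, for illustration only.
    print(plan_is_feasible(200, 10.0, 0.5, (1.0, 2.0, 3.0), (0.5, 0.3, 0.2),
                           3, 0.001, 0.0001, 1.0))
```

In an optimization routine, a check of this kind would typically be expressed as inequality constraints rather than a yes/no test; the boolean form is used here only to keep the sketch short.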
Other objective functions can be formulated which result in different designs of
the test plans. These functions include the D-optimal design that provides efficient
estimates of the parameters of the distribution. It allows relatively efficient
determination of all quantiles of the population, but the estimates are distribution
dependent.

Numerical Example

An accelerated life test is to be conducted at three temperature levels for MOS
capacitors in order to estimate their life distribution at a design temperature of 50°C.
The test needs to be completed in 300 h. The total number of items to be placed
under test is 200 units. To avoid the introduction of failure mechanisms other than
those expected at the design temperature, it has been decided, through engineering
judgment, that the testing temperature should not exceed 250°C. The minimum
number of failures for each of the three temperatures is specified as 25. Furthermore,
the experiment should provide the most accurate reliability estimate over a
10-year period of time.
Consider three stress levels; then the formulation of the objective function and
the test constraints follows the same formulation given in the above section. The
optimum plan derived by Elsayed and Jiao (2002), which optimizes the objective
function and meets the constraints, is as follows:
zL = 160°C, zM = 190°C, zH = 250°C
The corresponding allocations of units to each temperature level are:
p1 = 0.5, p2 = 0.4, p3 = 0.1

Concluding Remarks

Design of ALT plans plays a major role in providing accurate estimates of
reliability, mean time to failure and the variance of failure time at normal operating
conditions. These estimates have a major impact on many decisions during the
product life cycle such as maintenance schedules, warranty and repair policies and
replacement times. Therefore, the test plans should be robust (Pascual 2006), i.e., they
should be:
1. Robust to planning values of the model parameters. This implies that ALTs
conducted at three or more stress levels are more robust than those conducted at
two stress levels. Allocating more units at the low stress level will also improve
the robustness of the plan.
2. Robust to the type of the underlying distribution. In other words, misspeci-
fication of the underlying distribution should not result in significant errors
in calculating reliability characteristics.
3. Robust to the underlying stress-life relationship. The commonly used
concept that higher stresses result in more failures might result in the
“wrong” stress-life relationship. For example, testing circuit packs at higher
temperature reduces humidity which in turn results in fewer failures than
those at field conditions. In essence, this is a deceleration test (higher stresses
show fewer failures).

7.5 Relating ALT Results to Maintenance Decisions at Normal Operating Conditions

It is important to note that it is not necessary to conduct destructive ALT when the
product’s characteristics can be monitored through degradation with time. For
example, light emitting diodes (LED) are likely to experience degradation in the
light intensity before they are deemed completely unsuitable for use. In such cases it
is important to conduct accelerated degradation testing (ADT). The threshold level
where a unit is considered unacceptable might be considered the same threshold
level for replacement or maintenance (if possible). In a typical experiment the
threshold level is set as the level at which the light intensity drops to 50% of its
original value. This threshold level is set based on engineering and users’ experience.
Of course, an optimum level can be determined based on other factors such as
economics, maintenance strategy, availability of maintenance crew and others. It
should be noted that this level is set for accelerated conditions. Clearly the
determination of the optimum maintenance schedule at normal operating conditions
depends on many factors, as follows. (1) The variance of the time to failure at
normal conditions is much larger than that of the ADT as shown in Figure 7.1. (2)
The failure time distribution or the degradation paths at accelerated conditions are
directly related to the failure rate or degradation rate. Higher accelerated stresses
result in higher rates as shown in Figure 7.2. Thus the failure rate at normal con-
ditions requires careful evaluation as it directly affects the maintenance schedule.
(3) Since there are no universal normal operating conditions but a distribution is
likely to describe these conditions, the maintenance threshold level will then be
greatly affected by such a distribution. (4) The repair rate in field conditions is
likely to be different from that of the ALT. (5) The effect of aging at stress con-
ditions is not captured. (6) When a unit is repaired it is not considered as good as
new; consequently the time to next failure is shorter.
Therefore, the maintenance threshold level needs to be optimally determined so
that the total maintenance cost is minimized or the system availability is maxi-
mized as discussed in Section 7.7.

Figure 7.1. Distributions of the time to failure at stress and normal conditions

Figure 7.2. Distributions of degradation paths with time at different stress levels (degradation paths shown for 80°C, 60°C and 40°C)

In order to determine the optimum maintenance schedule at normal operating
conditions using accelerated testing results, one needs to perform the following two
steps:

1. Relate the reliability function at stress conditions to that at normal conditions
by developing an appropriate model using the approaches discussed earlier in
this chapter or using an ADT model as described in Eghbali and Elsayed
(2001), Liao (2004) and Meeker and Escobar (1998).
2. Relate the maintenance threshold level to the operating conditions. For
example, when the stress at operating conditions is higher than the mean of
the normal operating conditions, a lower threshold level is used. Similarly, when
the stress at operating conditions is lower than the mean of the normal
operating conditions, a higher threshold level is used.

The first step has been discussed in Section 7.3 and the second step will be
discussed in Section 7.6.

7.6 Determination of the Optimum Preventive Maintenance Schedule and Optimum Threshold Degradation Level at Normal Conditions

The optimum preventive maintenance schedule at operating conditions can be
determined by relating the reliability function at accelerated conditions to that
at normal conditions and then utilizing an optimization function that relates reliability to
the preventive maintenance schedule. In Section 7.6.1 we demonstrate these steps
through an example. Another approach for determining the optimum preventive
maintenance for degrading systems is to determine the optimum threshold
degradation level at which maintenance actions are taken by minimizing the overall
cost of maintenance or by ensuring a minimum acceptable system availability
level (Liao et al. 2005). This will be illustrated in Section 7.6.2.

7.6.1 Optimum Preventive Maintenance Schedule at Operating Conditions

The first step is to relate the accelerated testing results to stress conditions and
obtain a reliability expression which is a function of the applied stresses. We then
substitute the normal operating conditions in the expression to obtain a reliability
function at normal conditions. We illustrate this by designing an optimum test plan
and then using its results to obtain the reliability expression.
Suppose we develop an accelerated life test plan for a certain type of electronic
device using two stresses: temperature and electric voltage. The reliability estimate
at the design condition over a 10-year period of time is of interest. The design
condition is characterized by 50 °C and 5 V. From engineering judgment, the highest
levels (upper bounds) of temperature and voltage are prespecified as 250 °C and
10 V, respectively. The allowed test duration is 200 h, and the total number of
devices placed under test is 200. The minimum number of failures at any test
combination is specified as 10. The test plan is determined through the following
steps:

1. According to the Arrhenius model, we use 1/(absolute temperature) as the
first covariate z1 and 1/(voltage) as the second covariate z2 in the ALT.
2. The PH model is used in conducting reliability data analysis and designing
the optimal ALT plan using the approach described in Section 7.4.1. The
model is given by

$$\lambda(t; z) = \lambda_0(t)\exp(\beta_1 z_1 + \beta_2 z_2)$$

where λ0(t) = γ0 + γ1 t + γ2 t².
3. A baseline experiment is conducted to obtain initial estimates for the model
parameters. These values are: γ̂0 = 0.0001, γ̂1 = 0.5, γ̂2 = 0, β̂1 = −3800,
and β̂2 = −10.

Approximating γ̂0 to zero, we write the hazard rate function as

$$\lambda(t; T, V) = 0.5\, t\, e^{-\left(\frac{3800}{T} + \frac{10}{V}\right)}$$

The reliability and the probability density function (pdf) expressions are
respectively given as

$$R(t; T, V) = \exp\left[-\left(e^{-0.25((3800/T) + 10/V)}\, t\right)\right] \qquad (7.16)$$

$$f(t; T, V) = 0.5\, t \exp\left[-\left(e^{-0.25((3800/T) + 10/V)}\, t\right)\right] \qquad (7.17)$$

Assume that the normal operating temperature is 30 °C and the normal
operating voltage is 5 V. Substituting in Equations 7.16 and 7.17 yields

$$R_n(t) = R(t; 30\,^{\circ}\mathrm{C}, 5\,\mathrm{V}) = \exp\left[-\left(e^{-3.6336}\, t\right)\right] \qquad (7.18)$$

$$f_n(t) = f(t; 30\,^{\circ}\mathrm{C}, 5\,\mathrm{V}) = 0.5\, t \exp\left[-\left(e^{-3.6336}\, t\right)\right] \qquad (7.19)$$
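The exponent in Equations 7.18 and 7.19 comes from substituting the operating conditions into the term 0.25(3800/T + 10/V). The short check below assumes that T is the absolute temperature in kelvin (consistent with the Arrhenius covariate) and that the conversion offset is 273.15; the function name is hypothetical.

```python
KELVIN_OFFSET = 273.15  # assumption: T in the model is absolute temperature in kelvin


def exponent(temp_celsius, voltage):
    """Return 0.25 * (3800/T + 10/V), the exponent term of Equation 7.16."""
    temp_kelvin = temp_celsius + KELVIN_OFFSET
    return 0.25 * (3800.0 / temp_kelvin + 10.0 / voltage)


if __name__ == "__main__":
    print(exponent(30.0, 5.0))   # normal operating conditions, 30 degrees C and 5 V
    print(exponent(250.0, 10.0)) # upper test bounds, 250 degrees C and 10 V
```

Under these assumptions the value at 30 °C and 5 V agrees with the printed 3.6336 to about three decimal places; the small difference depends on the kelvin offset used.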

In the second step, we choose an appropriate preventive maintenance (PM) model
and determine the optimum PM schedule.
Consider a simple preventive maintenance and replacement policy. Under this
policy, two types of actions are performed. The first type is the preventive re-
placement that occurs at fixed intervals of time. Components or parts are replaced
at predetermined times regardless of the age of the component or the part being
replaced. The second type of action is the failure replacement where components
or parts are replaced upon failure. This policy is illustrated in Figure 7.3.
The most widely used criterion of maintenance models is to minimize the total
expected maintenance and replacement cost per unit time. This can be accomplished
by developing a total expected cost function per unit time as follows.


Figure 7.3. Constant interval replacement policy (time axis from 0 to tp)

Let c(tp) be the total replacement cost per unit time as a function of tp. Then

$$c(t_p) = \frac{\text{Total expected cost in the interval } (0, t_p]}{\text{Expected length of the interval}} \qquad (7.20)$$

The total expected cost in the interval (0, tp] is the sum of the expected cost of
failure replacements and the cost of the preventive replacement. During the interval
(0, tp], one preventive replacement is performed at a cost of cp and M(tp) failure
replacements at a cost of cf each, where M(tp) is the expected number of
replacements (or renewals) during the interval (0, tp]. The expected length of the
interval is tp. Equation 7.20 can be rewritten as

$$c(t_p) = \frac{c_p + c_f M(t_p)}{t_p} \qquad (7.21)$$
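Equation 7.21 can be minimized by a simple search over candidate intervals. The sketch below is illustrative: the function names are hypothetical, and M(tp) is replaced by a hypothetical power law a·tp^k, which is a stand-in chosen because its optimum has a closed form, not the renewal function of the example in this chapter.

```python
def cost_rate(tp, cp, cf, expected_failures):
    """c(tp) = (cp + cf * M(tp)) / tp, Equation 7.21."""
    return (cp + cf * expected_failures(tp)) / tp


def best_interval(cp, cf, expected_failures, candidates):
    """Return the candidate tp with the smallest cost per unit time."""
    return min(candidates, key=lambda tp: cost_rate(tp, cp, cf, expected_failures))


# Hypothetical stand-in for M(tp): a power law A * tp**K with K > 1.
A, K = 0.02, 2.0


def power_law_failures(tp):
    return A * tp ** K


if __name__ == "__main__":
    grid = [i / 100 for i in range(1, 1001)]  # candidate intervals 0.01 .. 10.00
    tp_star = best_interval(100.0, 1200.0, power_law_failures, grid)
    print(tp_star, cost_rate(tp_star, 100.0, 1200.0, power_law_failures))
```

For this stand-in, setting the derivative of c(tp) to zero gives tp^K = cp / (cf·A·(K − 1)), so the grid result can be compared with the closed-form value. Any other function for M(tp) can be passed in the same way.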

We apply the above model to determine the optimum preventive maintenance
schedule for the example of the electronic devices whose reliability and pdf
functions, obtained from accelerated conditions, are given in
Equations 7.18 and 7.19, respectively. Assuming cp = 100 and cf = 1200, we rewrite
Equation 7.21 as:

$$c(t_p) = \frac{100 + 1200\int_0^{t_p} t f_n(t)\, dt}{t_p} \qquad (7.22)$$

Calculated values of the cost per unit time are shown in Table 7.1 and plotted in
Figure 7.4. The optimum preventive maintenance schedule at normal operating
conditions is 0.18 time units.

Table 7.1. Time vs. cost per unit time values (the optimum, at time 0.18, is the minimum cost of 836)

Time             0.13  0.14  0.15  0.16  0.17  0.18  0.19  0.20
Cost/unit time    918   885   862   847   839   836   840   848

Figure 7.4. Optimum preventive maintenance schedule (cost per unit time plotted against time)

7.6.2 Optimum Preventive Maintenance Schedule Based on Accelerated or Normal Degradation

Determining the optimum maintenance schedule for systems subject to degradation
follows the same procedure described in Section 7.6.1. It begins by developing a
degradation model (at normal operating conditions, or at accelerated conditions and then
extrapolated to normal conditions as shown above). Liao et al. (2005) assume that the
degradation is described by a gamma process and obtain the optimum degradation
level accordingly. Ettouney and Elsayed (1999) obtain the reliability function for
different threshold degradation levels. We demonstrate the determination of the
degradation threshold level at normal stress levels using the results of Ettouney and
Elsayed (1999); then we utilize the optimum degradation level to determine the corresponding
optimum preventive maintenance schedule as follows.
Consider the case of corrosion in reinforced concrete bridges, which is a major
concern to professional engineers because of both public safety and the cost
associated with needed repairs and replacement. Prediction of bridge functional
degradation due to corrosion conditions is investigated below.
The two main corrosion parameters which affect the reinforcing bars in
reinforced concrete bridges are the corrosion rate, rcorr, and the time it takes to
initiate corrosion, T1. Enright and Frangopol (1998) present several mean and
variance test measurements for both rcorr and T1. In a typical case, they show that
the mean and variance of rcorr are 0.005 in/year and 3 × 10^−6 in/year,
respectively. The mean and standard deviation of T1 are 10 years and 0.4 years,
respectively.
In order to estimate the time-variant strength of a corroded reinforced concrete
beam, the effect of corrosion on the diameter of the reinforcing bars is evaluated
first. After the corrosion initiation time, T1, the diameter of a reinforcing bar, D(t), can
be evaluated as

$$D(t) = D_i - r_{corr}(t - T_1) \qquad (7.23)$$

where Di = 1.41 in. is the initial reinforcing bar diameter and t is the elapsed time.
Note that t ≥ T1 and D(t) ≥ 0. For more details of Equation 7.23 the reader is
referred to Enright and Frangopol (1998).
The time-variant reinforced concrete strength, Mp(t), can now be evaluated
using the conventional design equations in Enright and Frangopol (1998):

$$M_p = n A_s f_y \left(d - \frac{a}{2}\right) \qquad (7.24)$$

$$a = \frac{n A_s f_y}{0.85 f_c' b} \qquad (7.25)$$

Note that As = πD(t)²/4. The reinforcing steel and the concrete strengths are
fy and fc′, respectively. The number of reinforcing bars is n. The effective depth
and the width of the beam are d and b, respectively. For the current example, the
values of the different parameters are chosen as fy = 40 ksi, fc′ = 3 ksi, d = 27 in. and
b = 16 in. Using Equations 7.23 through 7.25, the random time-variant
strength, Mp(t), can be estimated. Using the previously mentioned values of rcorr
and T1 and a Monte Carlo simulation technique, different strength values for
different reinforced concrete beams can be simulated. Thus, a discrete time-variant
reinforced concrete strength, xij, can be evaluated from the Monte Carlo simulation
of the continuous strength Mp(t).
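The simulation described above can be sketched as follows. The material and geometric values are those quoted in the text; the number of bars (N_BARS = 6) and the use of normal sampling distributions truncated at zero are assumptions made only for illustration, since neither is stated here, and the function names are hypothetical.

```python
import math
import random

# Values stated in the text.
D_INITIAL = 1.41             # initial bar diameter, in.
F_Y, F_C = 40.0, 3.0         # steel and concrete strengths, ksi
DEPTH, WIDTH = 27.0, 16.0    # effective depth d and width b, in.
R_MEAN, R_VAR = 0.005, 3e-6  # corrosion rate mean and variance
T1_MEAN, T1_SD = 10.0, 0.4   # corrosion initiation time, years

# Assumption (not given in the text): number of reinforcing bars.
N_BARS = 6


def bar_diameter(t, r_corr, t1):
    """Equation 7.23, with no loss before initiation and a floor at zero."""
    if t <= t1:
        return D_INITIAL
    return max(D_INITIAL - r_corr * (t - t1), 0.0)


def strength(t, r_corr, t1, n_bars=N_BARS):
    """Equations 7.24 and 7.25; result in kip-in for the units above."""
    area = math.pi * bar_diameter(t, r_corr, t1) ** 2 / 4.0
    force = n_bars * area * F_Y
    a = force / (0.85 * F_C * WIDTH)
    return force * (DEPTH - a / 2.0)


def simulate(years, units_per_year, seed=0):
    """Return {year: [strength of each simulated beam]}, the x_ij of the text."""
    rng = random.Random(seed)
    data = {}
    for t in years:
        row = []
        for _ in range(units_per_year):
            # Assumption: normal draws, truncated at zero.
            r = max(rng.gauss(R_MEAN, math.sqrt(R_VAR)), 0.0)
            t1 = max(rng.gauss(T1_MEAN, T1_SD), 0.0)
            row.append(strength(t, r, t1))
        data[t] = row
    return data


if __name__ == "__main__":
    sample = simulate([20, 40], 5)
    print(sample)
```

Each simulated beam draws its own corrosion rate and initiation time, so the spread of strengths widens with time as the accumulated corrosion differs more between beams.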
Eghbali and Elsayed (2001) show that the reliability function for a specified
failure threshold degradation x is expressed as

$$R_x(t) = P(X > x; t) = \exp\left[\frac{-x^{\gamma}}{b\exp(-at)}\right] \qquad (7.26)$$

where X is a random variable representing the degradation measure, and a, b and γ are
constants. The maximum likelihood method was utilized to estimate the parameters
of Equation 7.26:

$$L(\gamma, a, b, t) = \prod_{i=1}^{m}\left(\frac{\gamma}{b\exp(-a t_i)}\right)^{n_i} \prod_{i=1}^{m}\prod_{j=1}^{n_i} x_{ij}^{\gamma-1} \exp\left(\frac{-x_{ij}^{\gamma}}{b\exp(-a t_i)}\right) \qquad (7.27)$$

where m is the number of years, ni is the total number of degradation data in
year i and xij is the strength of unit j in year i. Taking the logarithm of Equation
7.27 we obtain

$$\ln L = \sum_{i=1}^{m} n_i \ln\gamma - \sum_{i=1}^{m} n_i \ln b + \sum_{i=1}^{m} n_i a t_i + \sum_{i=1}^{m}\sum_{j=1}^{n_i} (\gamma - 1)\ln x_{ij} - \sum_{i=1}^{m}\sum_{j=1}^{n_i} \frac{x_{ij}^{\gamma}}{b\exp(-a t_i)} \qquad (7.28)$$
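The log-likelihood of Equation 7.28 is straightforward to evaluate for a given data set. The sketch below is illustrative only: the function names and the small data set are hypothetical, and it evaluates the log-likelihood rather than solving the full system of likelihood equations. It also includes the closed-form value of b that sets the partial derivative with respect to b to zero for fixed γ and a, which follows directly from Equation 7.28.

```python
import math


def log_pdf(x, t, gamma, a, b):
    """ln of (gamma/theta) * x**(gamma-1) * exp(-x**gamma/theta), theta = b*exp(-a*t)."""
    log_theta = math.log(b) - a * t
    return (math.log(gamma) - log_theta + (gamma - 1.0) * math.log(x)
            - x ** gamma / math.exp(log_theta))


def log_likelihood(data, gamma, a, b):
    """Equation 7.28; data maps each time t_i to the list of observations x_ij."""
    total = 0.0
    for t, xs in data.items():
        n = len(xs)
        total += n * math.log(gamma) - n * math.log(b) + n * a * t
        total += sum((gamma - 1.0) * math.log(x) for x in xs)
        total -= sum(x ** gamma for x in xs) / (b * math.exp(-a * t))
    return total


def b_given_gamma_a(data, gamma, a):
    """Value of b that zeroes d(ln L)/db for fixed gamma and a."""
    n_total = sum(len(xs) for xs in data.values())
    weighted = sum(math.exp(a * t) * sum(x ** gamma for x in xs)
                   for t, xs in data.items())
    return weighted / n_total


if __name__ == "__main__":
    # Hypothetical observations, for illustration only.
    toy = {1.0: [2.0, 3.0], 2.0: [1.5, 2.5, 4.0]}
    b_hat = b_given_gamma_a(toy, 1.5, 0.1)
    print(b_hat, log_likelihood(toy, 1.5, 0.1, b_hat))
```

Substituting this expression for b into the log-likelihood reduces the search to the two parameters γ and a, which can make a numerical solver easier to start.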

Equating the partial derivatives of Equation 7.28 with respect to γ, a and b to
zero and solving the resulting equations using a modified Powell hybrid algorithm
and a finite difference approximation to the Jacobian yields a = 0.12, b = 1.1346 × 10^7
and γ = 1.49. The resulting reliability function is

$$R_x(t) = P(X > x; t) = \exp\left[\frac{-x^{\gamma}}{b\exp(-at)}\right] = \exp\left[\frac{-x^{1.49}}{1.1346598 \times 10^{7} \exp(-0.12t)}\right]$$

The reliability for different threshold values of the strength is shown in Figure
7.5. The times to failure for threshold values of 4800, 4000, 3500, 3000 and 2500
are 25.04, 27.25, 28.88, 30.76 and 33.0 years, respectively.
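The fitted reliability function can be evaluated for any threshold and time, and inverted in closed form to find the time at which reliability falls to a chosen level. The sketch below uses the rounded estimates reported above; the function names are hypothetical. It does not attempt to reproduce the quoted times to failure, because the text does not state which summary of life they represent.

```python
import math

A_HAT, B_HAT, GAMMA_HAT = 0.12, 1.1346e7, 1.49  # rounded estimates reported in the text


def reliability(x, t, a=A_HAT, b=B_HAT, gamma=GAMMA_HAT):
    """R_x(t) = exp[-x**gamma / (b * exp(-a*t))], Equation 7.26."""
    return math.exp(-(x ** gamma) / (b * math.exp(-a * t)))


def time_to_reliability(x, level, a=A_HAT, b=B_HAT, gamma=GAMMA_HAT):
    """Solve R_x(t) = level for t: t = ln(-b * ln(level) / x**gamma) / a."""
    return math.log(-b * math.log(level) / x ** gamma) / a


if __name__ == "__main__":
    for threshold in (4800, 4000, 3500, 3000, 2500):
        print(threshold, reliability(threshold, 20.0), time_to_reliability(threshold, 0.5))
```

A lower strength threshold is crossed later, so for a fixed time the reliability increases as the threshold decreases, in line with the ordering of the curves in Figure 7.5.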

Figure 7.5. Reliability for different threshold levels (curves labelled s = 3500, 4000 and 4800; horizontal axis: time in years, 0 to 60)

The next step is to determine the optimum preventive maintenance schedule for
every threshold level and select the schedule corresponding to the smallest cost
among all optimum cost values. This will represent both the optimum threshold
level and the corresponding optimum preventive maintenance schedule.
We demonstrate this for two threshold levels (S = 4800 and S = 2500) assuming
cp = 10 and cf = 1200; we utilize Equation 7.21 as follows:

$$c(t_p) = \frac{10 + 1200\int_0^{t_p} t f(x; t)\, dt}{t_p} \qquad (7.29)$$

where

$$f(x; t) = \frac{\gamma}{\theta(t)}\, x^{\gamma - 1} \exp\left(\frac{-x^{\gamma}}{\theta(t)}\right), \quad t > 0, \quad \theta(t) = b e^{-at} \qquad (7.30)$$

As shown in Figure 7.6, the optimum tp values for S = 4800 and S = 2500 are 17
and 16 years, respectively. The smaller of the two minimum costs is the one corresponding to
S = 2500. Therefore, the optimum threshold is 2500 and the corresponding optimum
maintenance schedule is 16 years.

Figure 7.6. Total cost per unit time vs. time (curve labelled S = 2500)

7.7 Summary
In this chapter we present the common approaches for predicting reliability using
accelerated life testing. The models are classified as accelerated life testing models
(ALT) and accelerated degradation models (ADT). The ALT models are also
classified as accelerated failure time models with assumed failure time distributions
and “distribution free” models. We also modify the proportional odds model
to be used for reliability prediction with multiple stresses. Most of the research in
the literature does not extend the use of accelerated life testing beyond reliability
predictions at normal conditions. This is the first work that links the ALT to
maintenance theory and maintenance scheduling. We develop optimum preventive
maintenance schedules for both ALT models and degradation models. We
demonstrate how the reliability prediction models obtained from ALT can be used
in obtaining the optimum maintenance schedules. We also demonstrate the link
between the optimum degradation threshold level and the optimum maintenance
schedule. This work can be further extended to include other maintenance costs or
assurance of a minimum availability level of a system. Further work is needed to
investigate the relationship between threshold levels at accelerated conditions and
those at normal conditions. Moreover, the models need to include the repair rate as
well as spares availability.

7.8 References
Agresti, A. and Lang, J.B., (1993) Proportional odds model with subject-specific effects for
repeated ordered categorical responses, Biometrika, 80, pp. 527–534
Bennett, S. (1983) Log-logistic regression models for survival data, Applied Statistics, 32,
Brass, W., (1971) On the scale of mortality, In: Brass, W., editor. Biological aspects of
Mortality, Symposia of the society for the study of human biology. Volume X. London:
Taylor & Francis Ltd.: 69–110
Brass, W., (1974) Mortality models and their uses in demography, Transactions of the
Faculty of Actuaries, Vol. 33, 122–133.
Ciampi, A. and Etezadi-Amoli, J., (1985) A general model for testing the proportional
hazards and the accelerated failure time hypotheses in the analysis of censored survival
data with covariates, Commun. Statist. - Theor. Meth., Vol. 14, pp. 651–667.
Cox, D.R., (1972) Regression models and life tables (with discussion), Journal of the Royal
Statistical Society B, Vol. 34, pp. 187–208
Cox, D.R., (1975) Partial likelihood, Biometrika, Vol. 62, pp. 269–276
Eghbali, G. and Elsayed, E.A., (2001) Reliability estimate using degradation data, in
Advances in Systems Science: Measurement, Circuits and Control, Mastorakis, N. E.
and Pecorelli-Peres, L. A. (Editors), Electrical and Computer Engineering Series, WSES
Press, pp. 425–430
Elsayed, E.A., (1996) Reliability engineering, Addison-Wesley Longman, Inc., New York,
Elsayed, E.A. and Jiao, L., (2002) Optimal design of proportional hazards based accelerated
life testing plans, International Journal of Materials & Product Technology, Vol. 17,
Nos. 5/6, 411–424
Elsayed, E.A. and Zhang, H., (2006) Design of PH-based accelerated life testing plans under
multiple-stress-type, to appear in the Reliability Engineering and Systems Safety
Elsayed, E.A., Liao, H., and Wang, X., (2006) An extended linear hazard regression model
with application to time-dependent-dielectric-breakdown of thermal oxides, IIE Trans-
actions on Quality and Reliability Engineering, Vol. 38, No. 4, 329–340
Elsayed, E.A. and Zhang, H., (2005) Design of optimum simple step-stress accelerated life
testing plans, Proceedings of 2005 International Workshop on Recent Advances in
Stochastic Operations Research. Canmore, Canada.
Enright, M.P. and Frangopol, D.M., (1998) Probabilistic analysis of resistance degradation
of reinforced concrete bridge beams under corrosion, Engineering Structures, Vol. 20
No. 11, pp. 960–971
Etezadi-Amoli, J. and Ciampi, A., (1987) Extended hazard regression for censored survival
data with covariates: a spline approximation for the baseline hazard function,
Biometrics, Vol. 43, pp. 181–192
Ettouney, M. and Elsayed, E.A., (1999) Reliability estimation of degraded structural
components subject to corrosion, Fifth ISSAT International Conference, Las Vegas,
Nevada, August 11–13
Hannerz, H., (2001) An extension of relational methods in mortality estimation,
Demographic Research, Vol. 4, p. 337–368
Kalbfleisch, J.D. and Prentice, R.L., (2002) The statistical analysis of failure time data, John
Wiley & Sons, New York, New York
Liao, H., Elsayed, E.A., and Ling-Yau Chan, (2005) Maintenance of continuously monitored
degrading systems, European Journal of Operational Research, Vol. 75, No. 2, 821–835
Liao, H., (2004) Degradation models and design of accelerated degradation testing plans,
Ph.D. Dissertation, Department of Industrial and Systems Engineering, Rutgers
University
McCullagh, P., (1980) Regression models for ordinal data, Journal of the Royal Statistical
Society. Series B, Vol. 42, No. 2, 109–142
Meeker, W.Q. and Escobar, L.A., (1998) Statistical methods for reliability data, John Wiley
& Sons, New York, New York
Nelson, W., (2004) Accelerated testing: statistical models, test plans, and data analyses,
John Wiley & Sons, New York, New York
Oakes, D. and Dasu, T. (1990) A note on residual life, Biometrika, 77, pp. 409–410.
Pascual, F.G., (2006) Accelerated life test plans robust to misspecification of the stress-life
relation, Technometrics, Vol. 48, No. 1, 11–25
Shyur, H-J., (1996) A General nonparametric model for accelerated life testing with time-
dependent covariates, Ph.D. Dissertation, Department of Industrial and Systems
Engineering, Rutgers University
Shyur, H-J., Elsayed, E.A. and Luxhoj, J.T., (1999) A General model for accelerated life
testing with time-dependent covariates, Naval Research Logistics, Vol. 46, 303–321
Tobias, P. and Trindade, D., (1986) Applied reliability, Van Nostrand Reinhold Company,
New York, New York
Zhang, H. and Elsayed, E.A., (2005) Nonparametric accelerated life testing based on
proportional odds model, Proceedings of the 11th ISSAT International Conference on
Reliability and Quality in Design, St. Louis, Missouri, USA, August 4–6
Zhao, W. and Elsayed, E.A., (2005) Optimum accelerated life testing plans based on
proportional mean residual life, Quality and Reliability Engineering International

Preventive Maintenance Models for Complex Systems

David F. Percy

8.1 Introduction
Preventive maintenance (PM) of repairable systems can be very beneficial in
reducing repair and replacement costs, and in improving system availability, by
reducing the need for corrective maintenance (CM). Strategies for scheduling PM
are often based on intuition and experience, though considerable improvements in
performance can be achieved by fitting mathematical models to observed data; see
Handlarski (1980), Dagpunar and Jack (1993) and Percy and Kobbacy (2000) for
For systems comprising few components, and systems comprising many iden-
tical components, modelling and analysis using compound renewal processes
might be possible. Such situations are considered by Dekker et al. (1996) and Van
der Duyn Schouten (1996). However, many systems comprise a large variety of
different components and are too complicated for applying this methodology. We
refer to these as complex repairable systems.
This chapter reviews basic models for complex repairable systems, explaining
their use for determining optimal PM intervals. Then it describes advanced
methods, concentrating on generalized proportional intensities models, which have
proven to be particularly useful for scheduling PM. Computational difficulties are
addressed and practical illustrations are presented, based on sub-systems of oil
platforms and refineries.
The motivation is that for complex systems, one needs to build models for
failures based on the history of maintenance (PM and CM) available. Once a model
is built, one can evaluate different PM strategies to determine the best one. The
focus is to look at different models and how to determine the best model based on
historical data.
Section 8.2 presents some real examples of complex systems with historical
data sets. In each case, it discusses current maintenance policies and any problems
with collection or accuracy of the data. Section 8.3 considers the effects of PM and
CM actions upon system reliability and availability, so justifying the need for
modelling the operating situations in order to determine suitable scheduling
strategies. In Section 8.4, we review the models that can be used for this purpose. We
also assess the relevance, strengths and weaknesses of each model and provide
references where readers can find more details.
The remainder of the chapter presents general recommendations for modelling
of complex systems in order to schedule PM in practice. Section 8.5 describes the
generalized proportional intensities model, Section 8.6 reviews the method of
maximum likelihood for estimating unknown model parameters, Section 8.7
addresses the problem of model selection, and considers statistical tests for this
purpose, and Section 8.8 looks at the scheduling problem. Finally, Section 8.9
applies these methods to some of the data of Section 8.2 and Section 8.10 presents
some concluding remarks.
For convenience, we now present a list of symbols and acronyms that are used
throughout this chapter.
PM             Preventive maintenance
CM             Corrective maintenance
ROCOF          Rate of occurrence of failures
NHPP           Nonhomogeneous Poisson process
T_1, T_2, …    Failure times of a system
X_1, X_2, …    Inter-failure times of a system
N(t)           Number of failures up to time t
H(t)           History of process up to time t
ι(t)           Intensity function
ι_0(t)         Baseline intensity function
Po(µ)          Poisson distribution
F(x)           Cumulative distribution function
f(x)           Probability density function
R(x)           Reliability or survivor function
h(x)           Hazard function
h_0(x)         Baseline hazard function
DRP            Delayed renewal process
DARP           Delayed alternating renewal process
VAM            Virtual age model
PHM            Proportional hazards model
IRM            Intensity reduction model
PIM            Proportional intensities model
GPIM           Generalized proportional intensities model
MLE            Maximum likelihood estimate
AIC            Akaike information criterion
BIC            Bayes information criterion
Preventive Maintenance Models for Complex Systems 181

8.2 Examples with Historical Data Sets

Example 8.1 Ascher and Feingold (1984) presented three hypothetical sets of
reliability data to illustrate the forms of historical failure information that are
typically observed for complex systems. The numbers represent inter-failure times
corresponding to happy, sad and noncommittal systems respectively and are
displayed in Table 8.1. The inter-failure times are increasing for the happy system,
as the system settles down and fewer failures occur later on.
This phenomenon can arise with prototype systems, such as a new aircraft,
items subject to a burn-in phase of operation, such as a piston engine, and debug-
ging of computer programs. Conversely, the inter-failure times are decreasing for
the sad system, as the system ages and wears over time. This situation is very
common and applies to most systems, such as television sets, music centres and
motor vehicles. The noncommittal system displays no clear trend in inter-failure
times.

Table 8.1. Hypothetical reliability data from Ascher and Feingold (1984)

Happy system    Sad system    Noncommittal system

15              177           51
27              65            43
32              51            27
43              43            177
51              32            15
65              27            65
177             15            32
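
As a rough numerical check of these descriptions, the sketch below accumulates the inter-failure times of Table 8.1 into failure times and computes the Laplace trend statistic. This is a standard trend test that is not part of the presentation above and is used here only as an illustrative assumption. The function name `laplace_statistic` is hypothetical, and observation is assumed to stop at the last failure.

```python
from itertools import accumulate
from math import sqrt

# Inter-failure times from Table 8.1
happy = [15, 27, 32, 43, 51, 65, 177]
sad = [177, 65, 51, 43, 32, 27, 15]
noncommittal = [51, 43, 27, 177, 15, 65, 32]

def laplace_statistic(inter_failure_times):
    """Laplace trend statistic, assuming observation ends at the last failure."""
    failure_times = list(accumulate(inter_failure_times))  # T_1, ..., T_n
    n = len(failure_times)
    end = failure_times[-1]                                 # T_n
    mean_earlier = sum(failure_times[:-1]) / (n - 1)        # mean of T_1, ..., T_{n-1}
    return (mean_earlier - end / 2) / (end * sqrt(1 / (12 * (n - 1))))

for name, data in [("happy", happy), ("sad", sad), ("noncommittal", noncommittal)]:
    print(f"{name:>13}: U = {laplace_statistic(data):+.2f}")
```

Under these assumptions, a clearly negative value points to lengthening inter-failure times (happy), a clearly positive value to shortening ones (sad), and a value near zero to no evident trend (noncommittal).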

Example 8.2 Percy et al. (1998) published a set of data relating to the reliability
and maintenance history of a valve in a petroleum refinery, as displayed in Table 8.2.
The two columns successively represent the times in days between maintenance
actions and the types of actions, where 0 indicates no failure (PM) and 1 indicates
failure (CM).
At first glance, this would appear to be a noncommittal system. However, on
further inspection, there appear to be fewer failures later on and more preventive
actions. Whether the PM is proving to be effective or the system is generally happy
is not easy to determine. Modelling can provide these answers though. Based on
these data, our ultimate goal is to decide how often to perform PM in future or on
similar systems.
When collecting such data, it is very important to record all PM and CM events
accurately, as errors of omission or commission can result in wrong decisions. For
example, if the first failure were not recorded, the average time until system failure
over the first 94 days would appear to be twice its actual value, perhaps suggesting
that PM is not required.

Table 8.2. Reliability and maintenance history of a petroleum refinery valve

Time since     Type of    Time since     Type of
last action    action     last action    action

71             1          186            0
23             1          14             1
64             1          8              1
207            0          112            1
136            1          57             0
66             1          28             1
37             0          4              1
119            0          139            0
2              1          250            0
5              1          206            0
250            0          144            0
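
The remarks in Example 8.2 can be checked with a short sketch. It assumes that Table 8.2 is read down the left pair of columns and then down the right pair, which the text does not state explicitly, and the helper name `mean_time_to_failure` is hypothetical. It counts failures among the first and last eleven recorded actions and reproduces the effect of omitting the first failure over the first 94 days.

```python
from itertools import accumulate

# Table 8.2 as (days since last action, type): 1 = failure (CM), 0 = PM.
# Assumption: the left pair of columns is read first, then the right pair.
valve = [
    (71, 1), (23, 1), (64, 1), (207, 0), (136, 1), (66, 1),
    (37, 0), (119, 0), (2, 1), (5, 1), (250, 0),
    (186, 0), (14, 1), (8, 1), (112, 1), (57, 0), (28, 1),
    (4, 1), (139, 0), (250, 0), (206, 0), (144, 0),
]

half = len(valve) // 2
failures_first_half = sum(kind for _, kind in valve[:half])
failures_second_half = sum(kind for _, kind in valve[half:])

def mean_time_to_failure(records, horizon):
    """Length of the window (0, horizon] divided by the failures recorded in it."""
    days = accumulate(d for d, _ in records)
    failures = sum(kind for day, (_, kind) in zip(days, records) if day <= horizon)
    return horizon / failures

complete = mean_time_to_failure(valve, 94)
# If the first failure had not been recorded, the first record would read 94 days.
first_failure_missed = mean_time_to_failure([(94, 1)] + valve[2:], 94)

print(failures_first_half, failures_second_half)
print(complete, first_failure_missed)
```

With this reading of the table, the later records contain fewer failures than the earlier ones, and the apparent mean time to failure doubles when the first failure is omitted.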

Example 8.3 Kobbacy et al. (1997) published a set of historical reliability and
maintenance data collected from a main pump at an oil refinery over a period of
nearly seven years. These data are reproduced in Table 8.3, with consecutive
observations reading down the columns successively from left to right.

Table 8.3. Reliability and maintenance history of a main oil refinery pump

Times since last actions

34* 1 37 22 3
14 4 28 51 21
81* 13 38 51 6
86* 27 20 15 26
156* 8 28* 18 15*
20* 148* 44 1 35
96* 92 3 26* 44*
47* 13 56 37 61
45* 13 64 36 84*
97 67* 8* 2 12
88* 29 62* 12* 65*
30 12 8 27 43*
4 1 46 102 4

Right-censored observations corresponding to preventive maintenance are
marked by asterisks, whereas unmarked observations correspond to failures and
corrective maintenance. To clarify this point, consider the pump’s performance
from the time when data collection commenced. After 34 days without failure, PM
was performed. The pump then continued to operate for 14 more days and then
failed. Following CM, the pump worked for 81 more days without failure and then
PM was performed. Following 6 further PM actions, the next failure occurred 676
(=34+14+…+97) days after data collection began.
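
The reading of Table 8.3 described above can be reproduced with a few lines of code. The sketch below encodes only the first column of the table, with a boolean flag for the asterisked (PM) observations; the variable names are illustrative.

```python
from itertools import accumulate

# First column of Table 8.3 as (days since last action, is_pm);
# is_pm=True marks an asterisked, right-censored observation (PM).
first_column = [
    (34, True), (14, False), (81, True), (86, True), (156, True),
    (20, True), (96, True), (47, True), (45, True), (97, False),
    (88, True), (30, False), (4, False),
]

# Days elapsed since data collection began, at each maintenance action
elapsed = list(accumulate(days for days, _ in first_column))

# Positions and calendar days of the failures (unmarked observations)
failure_idx = [i for i, (_, is_pm) in enumerate(first_column) if not is_pm]
failure_days = [elapsed[i] for i in failure_idx]

# Number of PM actions between the first and second failures
pm_between = failure_idx[1] - failure_idx[0] - 1

print(failure_days[:2], pm_between)
```

The first two failure days are 48 and 676, with seven PM actions between them: the one after 81 days and the 6 further actions mentioned in the text.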
By scanning the inter-event times in Table 8.3, it is clear that preventive main-
tenance was not performed at regular intervals or according to any other simple
pattern. Such irregularity can arise because of opportunistic PM, such as when a
maintenance team is on site or has idle time, or because of condition monitoring
warnings, such as vibration and noise indicators. In many applications including
this, however, PM is simply not modelled and monitored effectively. This can
result in excessive repair costs and unacceptable levels of downtime.

8.3 Effects of Preventive and Corrective Maintenance

Before considering suitable models for the reliability and maintenance of complex
repairable systems, we must consider what is meant by these terms. A complex
system consists of any structure of more than one component, which performs a
particular function. Typical systems include industrial and domestic machinery,
such as production lines, utility supplies, railway operations, motor vehicles, cen-
tral heating systems and washing machines. We concentrate on industrial systems,
which benefit greatly from reliability and maintenance modelling. Such complex
systems are often subject to failures, upon which we either discard the systems or
repair them.
Failures can be total or catastrophic, in which case the system stops working,
such as when an exhaust pipe drops off a car or a microchip short circuits in a
refrigerator. Alternatively, they can be partial or debilitating, such as when a car
headlight bulb blows or a refrigerator clogs up with ice. Total failures incur
immediate repair costs. Repairs usually consist of replacing broken components
and we incur the costs of replacement parts, labour associated with repair and
system downtime. For expensive systems, the cost of replacement parts might
contribute most. For dangerous situations, the cost of labour might be most
influential. For continuous process industries, the cost of downtime will dominate.
As these costs can be very large, management will seek to avoid catastrophic
failures by intervening with preventive maintenance at a much smaller cost.
Debilitating failures are of less importance, as they do not incur direct costs.
However, when observable, they can serve as indicators of when to perform
preventive maintenance or capital replacement. Consequently, the failures that this
chapter generally refers to are catastrophic in nature.
Preventive maintenance can be specific, as identified by condition monitoring
indicators, or opportunistic, when such actions are convenient because of other
environmental factors. These possibilities are very much application dependent and
require in-depth analyses, though the models we consider here do extend to include
such information. Much preventive maintenance is less specific in terms of particular
systems but not in terms of the work involved, and applies more generally. For
example, motor vehicles might be serviced annually according to a strict checklist
procedure. The actual work conducted during PM can involve many tasks, such as
cleaning surfaces, lubricating joints, sharpening blades, replacing fluids, removing
waste, cooling down and redecorating. As for CM, we incur costs of PM due to
parts, labour and downtime, though these tend to be substantially less than for CM.
The challenge is to balance the costs of preventive maintenance with the
supposed improvements in system reliability. Too few PM actions mean that we incur
large CM costs and small PM costs, whereas too many PM actions mean that we incur
small CM costs and large PM costs. Unfortunately, there is no simple explanation of
how CM and PM affect system reliability. By modelling the failure patterns of
these systems mathematically, we can gain valuable insights about cost-effective
strategies for maintenance and replacement.

8.4 Review of Suitable Models

Many mathematical models have been proposed for statistical analysis of complex
repairable systems. Table 8.4 presents a summary of the main types. In order to
discuss the strengths and weaknesses of each model in more depth, we first intro-
duce some standard notation. Suppose that each time a system fails, we repair it
and thereby return it to operational condition. For a preliminary analysis, we also
assume that repair times are negligible. Let T_1, T_2, T_3, … be the times to successive
failures of the system and let X_i = T_i − T_{i−1} be the time between failure i − 1 and
failure i, where T_0 = 0. The T_i and X_i are random variables and we define t_i and
x_i to be their corresponding realized values. Figure 8.1 illustrates this situation.
We also define N(t) as the number of failures in the interval (0, t].

Figure 8.1. Notation for a repairable system

We generally model the time to first failure using a familiar lifetime probability
distribution or hazard function. However, this approach is inadequate for modelling
other times to failure, as the inter-failure times are neither independent nor
identically distributed in general (Ascher and Feingold 1984). Stochastic processes
form the appropriate basis for models to use under these circumstances. We are
interested in the probability that a system fails in the interval (t, t + ε ] given the
history of the process up to time t . We describe the behaviour of the failure
process by the intensity function (identified here by the Greek letter iota):

\[ \iota(t) = \lim_{\varepsilon \to 0} \frac{P\{N(t+\varepsilon) - N(t) \ge 1 \mid H(t)\}}{\varepsilon}. \tag{8.1} \]

For an orderly process, where simultaneous failures are impossible, the intensity
function is equal to the derivative of the conditional expected number of failures:

\[ \iota(t) = \frac{d}{dt} E\{N(t) \mid H(t)\}, \tag{8.2} \]

which is referred to as the rate of occurrence of failures (ROCOF).

Table 8.4. Summary of models for complex repairable systems

Models                      CM   PM   Comments                         References

Renewal process                       Repair back to (or replace by)   Taylor and Karlin (1994)
                                      new item
Nonhomogeneous                        Only CM actions,                 Ascher and Feingold (1984);
Poisson process                       zero repair times                Crowder et al. (1991);
                                                                       Lindqvist et al. (2003)
Delayed renewal                       Distributions for failures       Watson (1970)
process                               after PM and CM actions,
                                      zero downtimes
Delayed alternating                   Fixed or random downtimes        Percy et al. (1998a)
renewal process
Virtual age model                     CM minimal repair,               Jack (1998);
                                      PM reduction in age              Doyen and Gaudoin (2004)
Proportional hazards                  Different hazard functions       Cox (1972a);
model                                 for failures after PM and        Jardine et al. (1987);
                                      CM actions                       Newby (1994);
                                                                       Lutigheid et al. (2004)
Intensity reduction                   CM minimal repair,               Doyen and Gaudoin (2004)
model                                 PM reduction in intensity
                                      function
Proportional intensities              Takes account of covariates,     Cox (1972b);
model                                 CM as minimal repairs            Percy et al. (1998b)
Generalized proportional              Both CM and PM affect the        Percy and Alkali (2006)
intensities model                     intensity

For more details on statistical inference in this context, we refer readers to
Crowder et al. (1991). Our fundamental model is the nonhomogeneous Poisson
process (NHPP), which effectively implies that a repair restores a system to the
state it was in immediately before failure. Such corrective maintenance effects are
commonly referred to as minimal repairs. The NHPP satisfies these conditions for
0<s<t :

(i) N(0) = 0 [system initialisation at time t = 0]

(ii) \{N(t) - N(s)\} \perp N(s) [independence of increments]

(iii) \{N(t) - N(s)\} \sim \mathrm{Po}\left\{ \int_s^t \iota(u)\,du \right\} [Poisson-distributed increments]

If we were to model the time to first failure of a complex system as a random
variable X, we could describe its probability distribution by a cumulative distribution
function F(x) = P(X ≤ x) with corresponding probability density function
f(x) = F′(x), reliability or survivor function R(x) = 1 − F(x) and hazard function
h(x) = f(x) / R(x). Typical examples of probability density functions are:

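(i) f(x \mid \lambda) = \lambda \exp(-\lambda x); x > 0 [exponential]

(ii) f(x \mid \alpha, \lambda) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} \exp(-\lambda x); x > 0 [gamma]

(iii) f(x \mid \alpha, \beta) = \alpha\beta (\alpha x)^{\beta - 1} \exp\{-(\alpha x)^{\beta}\}; x > 0 [Weibull]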
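
To make the link between density, reliability and hazard concrete, the following sketch evaluates the hazard as the ratio of density to reliability for the exponential and Weibull forms in the parametrization listed above. The parameter values and function names are illustrative assumptions and do not come from any data set in this chapter.

```python
from math import exp

def exponential_pdf(x, lam):
    return lam * exp(-lam * x)

def exponential_reliability(x, lam):
    return exp(-lam * x)

def weibull_pdf(x, alpha, beta):
    return alpha * beta * (alpha * x) ** (beta - 1) * exp(-((alpha * x) ** beta))

def weibull_reliability(x, alpha, beta):
    # R(x) = 1 - F(x) = exp{-(alpha x)^beta} for this parametrization
    return exp(-((alpha * x) ** beta))

def hazard(pdf, reliability, x, *params):
    """Hazard function as the ratio of density to reliability."""
    return pdf(x, *params) / reliability(x, *params)

lam, alpha, beta = 0.02, 0.01, 1.5   # illustrative values only
for x in (10.0, 50.0, 100.0):
    h_exp = hazard(exponential_pdf, exponential_reliability, x, lam)
    h_wei = hazard(weibull_pdf, weibull_reliability, x, alpha, beta)
    print(x, h_exp, h_wei)
```

The exponential hazard is constant at the rate parameter, whereas the Weibull hazard reduces to αβ(αx)^(β−1), which increases with x when β > 1.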

The form of the hazard function is precisely the same as the form of the intensity
function if we were to use a stochastic process to model the complex system. For a
nonhomogeneous Poisson process, this intensity function applies beyond the first
failure. However, successive hazard functions for inter-failure times have different
forms, which correspond to shifted and truncated versions of the distribution for
time to first failure.
Imperfect maintenance models must allow for the dynamic evolution of a system
and take account of hypothesized and observed knowledge about the effectiveness
of repairs. As mentioned above, this section reviews a variety of existing models for
repairable systems and describes suitable adaptations for systems that are subject to
preventive maintenance. In passing, we remark that time is used as the only scale of
measurement here. Some applications use running time instead, or both, such as the
flight time of an aircraft or the mileage and age of a car. Further details of such
variations are described by Baik et al. (2004) and Jiang and Jardine (2006).

8.4.1 Renewal Process (Maximal Repair)

This model assumes that repairs renew a system to its condition as new. A renewal
process is a counting process that registers the successive occurrence of events
during a given time interval ( 0,t ] where the time durations between consecutive
events X 1 , X 2 , X 3 ,… form a sequence of independent and identically distributed
non-negative random variables. The special case where their distribution is
exponential corresponds to the homogeneous Poisson process. We can characterize
the intensity function of a renewal process by

\[ \iota(t) = \iota_0\left( t - t_{N(t)} \right) \tag{8.3} \]

where ι0 ( t ) is the baseline intensity function, which would prevail if there were no
system failures. As this is a renewal process, the baseline intensity function is
equal to the hazard function for the inter-failure times: ι0 ( x ) = h ( x ) . The baseline
intensity function can take many forms, including:

(i) \iota_0(t) = \alpha [constant]

(ii) \iota_0(t) = \alpha \beta^{t} [loglinear]

(iii) \iota_0(t) = \alpha t^{\beta} [power-law]
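
A minimal sketch of Equation (8.3) follows, assuming the loglinear and power-law baselines take the forms αβ^t and αt^β. The failure times and parameter values are hypothetical, and the function names are illustrative.

```python
from bisect import bisect_right

def constant(t, alpha):
    return alpha

def loglinear(t, alpha, beta):
    return alpha * beta ** t

def power_law(t, alpha, beta):
    return alpha * t ** beta

def renewal_intensity(t, failure_times, baseline, *params):
    """Equation (8.3): baseline intensity evaluated at the time since the last failure."""
    n = bisect_right(failure_times, t)           # N(t), the number of failures in (0, t]
    last = failure_times[n - 1] if n else 0.0    # t_{N(t)}, with t_0 = 0
    return baseline(t - last, *params)

failure_times = [15, 42, 74]   # hypothetical observed failure times
print(renewal_intensity(50, failure_times, power_law, 0.001, 1.2))
print(renewal_intensity(10, failure_times, constant, 0.3))
```

At t = 50 the last failure occurred at time 42, so the baseline is evaluated at 8 time units, reflecting the renewal assumption that each repair resets the system clock.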

The renewal process is a plausible first order model for components or parts
when the repair time is negligible, since complete replacement of a component
after failure implies renewal instead of repair. Conversely, the renewal process is a
poor model for complex systems, where repairs involve replacing or restoring just
a fraction of the system’s components. If a large portion of a system needs to be
restored, it is often more economical to replace the entire system. Even if a repair
restores the system’s performance to its original specification, the presence of
predominantly aged components implies that system reliability is not renewed.

8.4.2 Nonhomogeneous Poisson Process (Minimal Repair)

The assumptions underlying this model imply that, when a repair is carried out, a
system assumes the same condition that it was in immediately before failure. The
nonhomogeneous Poisson process (NHPP) differs from the homogeneous Poisson
process only in that the rate of occurrence of failures varies with time rather than
being constant. As mentioned earlier in this section, it is the fundamental model for
repairable systems. The NHPP is also the most appropriate model for the reliability
of a complex system comprising infinitely many components. However, for a finite
number of components, this model can only serve as an approximation, often poor,
as the intensity function changes following each repair. In this model, the inter-
arrival times X 1 , X 2 , X 3 ,… are neither independent nor identically distributed.
An important characteristic of the NHPP is that the intensity function depends
on the system’s global operating time, measured from the instant the system is put
into operation. A simple NHPP model can be expressed as

ι ( t ) = ι0 ( t ) (8.4)

where ι_0(t) is the baseline intensity function introduced earlier. In modelling the
reliability of repairable systems under the nonhomogeneous Poisson process
assumptions, the numbers of events in non-overlapping intervals are independent
random variables and the intensity becomes the rate of occurrence of failures or
peril rate of a repairable system. This model corresponds to minimal repair,
whereby system reliability returns to the condition immediately before failure. If
repair times are small relative to times between failures, so that they can be
ignored, then we have ι ( t ) = h ( t ) .
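
As an illustration of the NHPP conditions, the sketch below assumes a power-law baseline ι_0(t) = αt^β, for which the cumulative intensity has a closed form. It computes the expected number of failures in an interval and the probability of no failure, and simulates failure times by inverting the cumulative intensity (a standard time-transformation technique, used here as an assumption rather than as the chapter's own algorithm). All parameter values are illustrative.

```python
import random
from math import exp

def cumulative_intensity(t, alpha, beta):
    """Integral of alpha * u**beta over (0, t]."""
    return alpha * t ** (beta + 1) / (beta + 1)

def expected_failures(s, t, alpha, beta):
    """Poisson mean of N(t) - N(s)."""
    return cumulative_intensity(t, alpha, beta) - cumulative_intensity(s, alpha, beta)

def prob_no_failure(s, t, alpha, beta):
    return exp(-expected_failures(s, t, alpha, beta))

def simulate_failure_times(horizon, alpha, beta, rng):
    """Simulate one NHPP path on (0, horizon] by inverting the cumulative intensity."""
    times, cumulative = [], 0.0
    while True:
        cumulative += rng.expovariate(1.0)   # unit-rate exponential increments
        t = ((beta + 1) * cumulative / alpha) ** (1 / (beta + 1))
        if t > horizon:
            return times
        times.append(t)

alpha, beta = 0.0005, 1.0     # illustrative values only
rng = random.Random(1)
counts = [len(simulate_failure_times(100.0, alpha, beta, rng)) for _ in range(2000)]
mean_count = sum(counts) / len(counts)
print(expected_failures(0.0, 100.0, alpha, beta), mean_count)
print(prob_no_failure(0.0, 100.0, alpha, beta))
```

The simulated mean number of failures should lie close to the theoretical value of 2.5 for these parameters.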

8.4.3 Delayed Renewal Process

We refer to a repairable system as stationary if there is no long-term improvement
or deterioration of its performance. For many applications, the assumptions of
renewal and minimal repair are too restrictive. We have encountered the need for
an alternative scenario that allows for minor repairs, as follows:
• Corrective maintenance is performed upon failure, to restore the system to
a reasonable operating state
• Preventive maintenance takes place at regular intervals, to reset the system
to a good operating state
Corrective maintenance (CM) corresponds to major or minor repair work and may
involve replacing the damaged components, whereas preventive maintenance (PM)
usually corresponds to minor interventions such as lubrication, cleaning and
Given this structure, we assume that failure times after corrective operations are
independent and identically distributed, as are failure times after preventive
operations. However, we allow for different probability distributions in the two
cases and this defines the delayed renewal process (DRP). This is not a simple
renewal process, because of the different lifetime distributions following the two
types of action. However, the simple renewal process could be regarded as a
limiting case of the DRP, if corrective operations were to repair the system to the
same state as preventive operations. Maximal repairs involve restoring the system
upon failure to its condition as new. Similarly, if corrective operations were to
restore the system to the state immediately before failure, minimal repairs would
result. This is not strictly a special case of the delayed renewal process, but a
computer program could easily allow for this assumption if required. However, we
believe that minimal repairs are convenient for mathematical modelling but are not
always valid in practice.

Figure 8.2. Delayed renewal process

As shown in Figure 8.2, define the random variables U and V to be the lifetimes
after PM and CM respectively. Their probability density functions, conditional
upon known parameters, are f_U(u) and f_V(v) respectively. These distributions
might take the exponential, gamma or Weibull forms defined earlier, to achieve the
required flexibility. Note that the exponential distribution is a limiting case of the
gamma as α → 1 and of the Weibull as β → 1. The DRP assumes that downtimes are
negligible and that their costs are dominated by the costs of parts and labour. We
now consider the effects of non-ignorable downtimes.

8.4.4 Delayed Alternating Renewal Process

The delayed renewal process described above assumes that the downtimes for
preventive and corrective maintenance are negligible when compared with the
lifetimes. It also assumes that the costs associated with these downtimes are
dominated by the costs of parts and labour. The model and analysis are further
complicated when we allow for periods of downtime, when maintenance actions
take place. In many applications involving continuous-process industries, the
principal costs are not due to parts and labour, but are due to lost production whilst
the system is down. Consequently, we must consider downtime costs and durations
when determining cost-effective strategies for scheduling PM.
This extension results in the delayed alternating renewal process (DARP), for
which an analytical solution is not feasible in practice. The downtimes following
preventive and corrective maintenance can be fixed or random. Since analytical
solution of the optimisation problems is not possible and we are adopting a
simulation approach here, either of these can be included in the calculations with
ease. In the following work, we consider them fixed to avoid confusion. Another
benefit of simulation over numerical solution of the renewal equations is that
anomalies are readily catered for, such as switching from CM to PM if the system
is in the failed state when PM is due. The DARP is illustrated in Figure 8.3.

Figure 8.3. Delayed alternating renewal process

The delayed alternating renewal process is appropriate when the time to replace
(or repair back to new) a failed item is non-zero. In this case, we have working and
failed states and these alternate. So far, we have only allowed for systems that
display no long-term trends, corresponding to improvement or deterioration. We
now discuss age-based models that allow for such trends. These models can also be
used for stationary and non-stationary systems when concomitant information is
available. We discuss these benefits later, as the need for including such extra
sources of information is described.

8.4.5 Virtual Age Model (Rejuvenation)

The virtual age model (VAM) modifies the hazard function for a system’s inter-
failure times at each corrective maintenance action. For these repairs, the system’s
virtual age at any given time is determined by a variety of additive or multi-
plicative age-reduction factors. This resets the system to a younger state, which is
only an approximation for reasons mentioned earlier. The intensity function of a
point process under the age reduction model may be additive

\[ \iota(t) = \iota_0\left( t - \sum_{i=1}^{N(t)} s_i \right) \tag{8.5} \]

or multiplicative

\[ \iota(t) = \iota_0\left( t \prod_{i=1}^{N(t)} s_i \right) \tag{8.6} \]

where the s_i are constants, representing the age reduction factors, and ι_0(t) is
the baseline intensity function again.
In order to evaluate the intensity function for a sequence of failures under age
reduction, the renewal function governs the system failure pattern. The additive
model can generate negative intensities but the multiplicative model is suitable if
replacement components are infallible. The age-reduction model has been applied
to systems under a block replacement policy. A critical defect of the age-reduction
model and its many variants is that they do not provide a realistic description of the
failure processes. For example, replacing a corroded exhaust pipe does not reduce a
car’s age, as very many other components are no less likely to fail.
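
The two age-reduction forms can be compared with a short sketch. It assumes a power-law baseline and hypothetical age reduction terms for the repairs carried out so far; the function names are illustrative.

```python
from math import prod

def power_law(t, alpha, beta):
    return alpha * t ** beta

def additive_virtual_age(t, reductions):
    """Virtual age t - sum(s_i) over the repairs carried out by time t."""
    return t - sum(reductions)

def multiplicative_virtual_age(t, factors):
    """Virtual age t * prod(s_i) over the repairs carried out by time t."""
    return t * prod(factors)

alpha, beta = 0.001, 1.2             # illustrative baseline parameters
t = 100.0

age_add = additive_virtual_age(t, [30.0, 30.0])        # two repairs, 30 units each
age_mult = multiplicative_virtual_age(t, [0.8, 0.8])   # two repairs, factor 0.8 each

print(age_add, power_law(age_add, alpha, beta))
print(age_mult, power_law(age_mult, alpha, beta))

# With the same additive reductions at an earlier time, the virtual age is
# negative, so the baseline would be evaluated outside its natural range.
print(additive_virtual_age(50.0, [30.0, 30.0]))
```

The last line shows why the additive form needs care: fixed reductions can exceed the elapsed time, whereas multiplicative factors between zero and one always leave a non-negative virtual age.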

8.4.6 Proportional Hazards Model

The proportional hazards model (PHM) is more flexible than the renewal process,
DRP and DARP, as it allows for non-stationarity. It is also more flexible than the
virtual age model because it allows for concomitant information. In principle, this
model appears to be inappropriate for representing a complex system, because
hazards naturally relate to lifetimes of components rather than inter-failure times of
processes. We cannot physically justify this model as readily as the proportional
intensities model described later. However, this does not invalidate its use in this
context as a statistical model rather than a mathematical model and considerable
success in applying the proportional hazards model to real reliability and PM
scheduling problems has been achieved.
In formulating the PHM for a repairable system, we adopt different hazard
functions after PM

\[ \kappa(u) = \kappa_0(u) \exp(y_t^{\prime} \gamma) \tag{8.7} \]

and after CM

\[ \lambda(v) = \lambda_0(v) \exp(z_t^{\prime} \delta) \tag{8.8} \]

where u and v represent the lifetimes following PM and CM respectively. The
baseline hazard functions can take any suitable forms, including exponential,
Gumbel and Weibull.
The covariates that might be contained in the vectors y_t and z_t include cumulative
observations of:
• Time since last PM
• Time since last CM
• Total number or total downtime of PMs
• Total number or total downtime of CMs
• Average PM interval duration
We might consider other factors and covariates for inclusion here, representing the
concomitant information mentioned earlier. These could include:
• Severity measures of failures
• Quality measures of maintenance
• Condition-monitoring measurements
Temporal, or continuously time varying covariates (time since last PM and time
since last CM) cause substantial computational difficulties. These may be avoided
by choosing baseline hazard functions that are sufficiently flexible. The vectors γ
and δ contain the regression coefficients, which generally take the form of un-
known parameters. The results from extensive analyses demonstrate that this pro-
portional hazards model is flexible, easy to use and of considerable practical value,
despite its doubtful mathematical suitability for modelling repairable systems.
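
A minimal sketch of the hazard after PM in Equation (8.7) is given below, assuming a Weibull baseline hazard in the parametrization of Section 8.4. The covariates, regression coefficients and parameter values are hypothetical and serve only to show how the linear predictor scales the baseline.

```python
from math import exp

def weibull_baseline_hazard(u, alpha, beta):
    return alpha * beta * (alpha * u) ** (beta - 1)

def proportional_hazard(u, covariates, coefficients, alpha, beta):
    """Baseline hazard scaled by exp(y' gamma), as in Equation (8.7)."""
    linear_predictor = sum(y * g for y, g in zip(covariates, coefficients))
    return weibull_baseline_hazard(u, alpha, beta) * exp(linear_predictor)

# Hypothetical covariates: total number of PMs and total number of CMs so far
y_t = [3, 5]
gamma = [-0.10, 0.05]          # illustrative regression coefficients

baseline = weibull_baseline_hazard(30.0, 0.01, 1.5)
scaled = proportional_hazard(30.0, y_t, gamma, alpha=0.01, beta=1.5)
print(baseline, scaled, scaled / baseline)
```

The hazard after CM in Equation (8.8) has the same structure, with its own baseline, covariate vector and coefficients.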

8.4.7 Intensity Reduction Model (Correction)

Improvement factors feature in additive and multiplicative intensity reduction
models (IRM) for imperfect maintenance. Perhaps the most suitable of these is an
intensity reduction model that involves a multiplicative scaling of the intensity
function upon each failure and repair. This is the natural model for systems that are
improving or deteriorating with time and provides a perfect description of the
physical situation. This model can be expressed as an NHPP with intensity function

\[ \iota(t) = \iota_0(t) \prod_{i=1}^{N(t)} s_i \tag{8.9} \]

where the si are constants representing the intensity reduction factors and ι0 ( t ) is
the baseline intensity function again. We later generalize this model by supposing
si are simple functions of i , or are random variables that are independent of the
failure and repair process. Having concluded that this model is ideally suited to
modelling complex repairable systems, this chapter later considers how to extend it
to allow for preventive maintenance and concomitant information.

8.4.8 Proportional Intensities Model

Whilst the proportional hazards model offered a valuable generalization of the
delayed renewal process and delayed alternating renewal process to allow for non-
stationarity and concomitant information, it is not the natural model for repairable
systems. The natural model takes the form of a nonhomogeneous Poisson process
and is the essence of the proportional intensities model (PIM), which is the subject
of this subsection and is a generalization of the intensity reduction model described
above.
Define the random variable N(t) as the number of system failures by time t.
Then the NHPP is characterised by conditionally independent increments,
corresponding with conditionally independent times between failures that occur with
intensity

\[ \iota(t) = \lim_{\varepsilon \to 0} \frac{P\{N(t+\varepsilon) - N(t) \ge 1 \mid H(t)\}}{\varepsilon} \tag{8.10} \]

at system age t units, where H ( t ) is the history of the process. However, the NHPP
corresponds with minimal repair as in Section 8.4.2 and makes no allowances for
system improvement, or even deterioration, arising from maintenance actions.
Hence, we modify the intensity function by introducing a multiplicative factor,
so that we can express the intensity function as

\[ \iota(t) = \iota_0(t) \exp\left( x_t^{T} \beta \right), \tag{8.11} \]

where the baseline intensity ι0 ( t ) has a standard form such as constant, loglinear
and power-law. Furthermore, the parameter vector β represents the regression
coefficients and the observation vector x t contains factors and covariates relating
to the system, such as the cumulative observations and concomitant information
mentioned in Section 8.4.6.
An alternative option arises when using the PIM to model a complex repairable
system subject to PM. Rather than adopting a global time scale for the baseline
intensity function as implied above, we could reset the time scale of the baseline
intensity function to zero upon each PM action. This introduces an element of
renewal that might be applicable if PM involves major reworking. System age
could then be included amongst the covariates if necessary. However, this
intervention results in a hybrid model, which suffers from the same difficulty of
interpretation as does the PHM.
As for predictor variables, the process simulation calculations for scheduling PM
simplify greatly if we hold factors and covariates at fixed values throughout each
PM interval. However, this essentially treats all CM as minimal repair work, an
assumption that we earlier claimed is often unreasonable. To avoid this constraint,
we need to consider variables that change during a PM interval, such as the
cumulative number of failures. The computational effort required to incorporate
such temporal covariates in our simulation is immense, but this relates to computer
power rather than manpower and so is quite acceptable.

8.5 Generalized Proportional Intensities Model (GPIM)

The GPIM is this chapter’s main model of interest, as it allows for covariates and
offers much potential for decision making related to scheduling preventive
maintenance. Special cases of the GPIM are the intensity reduction model and the
proportional intensities model investigated in Section 8.4. An algebraic represen-
tation of the GPIM in terms of the intensity function is given by

\[ \iota(t) = \iota_0(t) \left\{ \prod_{i=1}^{M(t)} r_i \right\} \left\{ \prod_{j=1}^{N(t)} s_j \right\} \exp\left( x_t^{T} \beta \right). \tag{8.12} \]

Here, ι_0(t) is the baseline intensity function, whilst r_i > 0 and s_j > 0 are the
intensity scaling factors for preventive maintenance (PM) and corrective maintenance
(CM) actions respectively. Furthermore, M(t) and N(t) are the total
numbers of PM and CM actions by time t, whilst x_t is a vector of predictor variables
and β is an unknown parameter vector of regression coefficients. One might expect the
r_i and s_j to be less than one for a deteriorating system and greater than one for an
improving system, though replacing failed components with used parts and accidentally
introducing faults during maintenance can produce the opposite effects.
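
Equation (8.12) can be evaluated directly, as in the sketch below. It assumes a power-law baseline, and all numerical values are hypothetical; the names `exponent` and `coefs` are chosen to avoid confusion between the power-law parameter and the regression vector, both of which are written β in the text. The two checks illustrate the special cases: with no PM factors and no covariates the expression reduces to Equation (8.9), and with constant factors ρ and σ the scaling equals exp{M(t) log ρ + N(t) log σ}, which has the proportional intensities form with M(t) and N(t) as covariates.

```python
from math import exp, log, prod

def power_law(t, alpha, exponent):
    return alpha * t ** exponent

def gpim_intensity(t, pm_factors, cm_factors, x_t, coefs, alpha, exponent):
    """Equation (8.12) with a power-law baseline intensity (an assumption).

    pm_factors: r_1, ..., r_M(t) for the PM actions completed by time t
    cm_factors: s_1, ..., s_N(t) for the CM actions completed by time t
    """
    scaling = prod(pm_factors) * prod(cm_factors)
    linear_predictor = sum(x * b for x, b in zip(x_t, coefs))
    return power_law(t, alpha, exponent) * scaling * exp(linear_predictor)

t, alpha, exponent = 200.0, 0.001, 1.2      # illustrative values only

# Special case: no PM effect and no covariates gives Equation (8.9)
irm = gpim_intensity(t, [], [0.9, 0.9, 0.9], [], [], alpha, exponent)

# Special case: constant factors rho and sigma for M(t) = 2 and N(t) = 3
rho, sigma, m, n = 0.8, 1.1, 2, 3
constant_factors = gpim_intensity(t, [rho] * m, [sigma] * n, [], [], alpha, exponent)
pim_form = power_law(t, alpha, exponent) * exp(m * log(rho) + n * log(sigma))

print(irm, constant_factors, pim_form)
```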
System copies can have different forms of baseline intensity function. For
reduction of intensity, the scaling factors can take the forms of positive constants,
random variables, deterministic functions of time ( t ) and events ( i and j ) or
stochastic functions of time and events. As for the intensity reduction model de-
scribed in Section 8.4.7, a reasonable assumption for initial analysis is that ri = ρ
for i = 1, 2,… , M ( t ) and s j = σ for j = 1, 2,… , N ( t ) , in which case the GPIM
corresponds with the PIM of Section 8.4.8. The vector of predictor variables might
include:
• Quality of last maintenance action
• Time since last maintenance action
• Condition indicators
The quality of maintenance affects the functionality of a system and its future
performance. Our justification for including the time since last maintenance here is
to allow for the possibility that maintenance interventions can introduce problems
similar to the burn-in of new components. The first of these is a discrete function
of time, whereas the second is a continuous function of time. Condition indicators,
when available, give direct and very strong guidance on the likely occurrence of
failures. They are typically discrete functions of time that vary at, and between,
maintenance actions.
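As a minimal sketch of Equation 8.12, assuming constant scaling factors (r_i = ρ and s_j = σ), the intensity at time t could be evaluated as follows. The function name, the argument names, the convention that an action at exactly time t is counted, and the loglinear baseline in the usage lines are all illustrative assumptions rather than part of the model specification.

```python
import math
from bisect import bisect_right

def gpim_intensity(t, baseline, pm_times, cm_times, rho, sigma, x, beta):
    """Evaluate the GPIM intensity (8.12) with constant scaling factors.

    baseline           : callable giving the baseline intensity iota_0(t)
    pm_times, cm_times : sorted times of past PM and CM actions
    rho, sigma         : constant scaling factors (assumed r_i = rho, s_j = sigma)
    x, beta            : predictor values at time t and regression coefficients
    """
    m = bisect_right(pm_times, t)   # M(t): number of PM actions up to time t
    n = bisect_right(cm_times, t)   # N(t): number of CM actions up to time t
    linear_predictor = sum(xi * bi for xi, bi in zip(x, beta))
    return baseline(t) * rho**m * sigma**n * math.exp(linear_predictor)

# Hypothetical loglinear baseline and maintenance history, for illustration only
baseline = lambda t: math.exp(-3.0 + 0.01 * t)
value = gpim_intensity(50.0, baseline, [20.0, 40.0], [10.0], 0.7, 0.9, [1.0], [0.1])
```

Passing the baseline as a callable keeps the sketch independent of any particular baseline form, so that constant, loglinear or power-law baselines can be swapped in without changing the function.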

8.6 Parameter Estimation

All of the preceding models contain unknown parameters. In order to make any
decisions based on these models, such as determining when to schedule the next PM
activity, we need to quantify our knowledge about these parameters subjectively and
empirically. Three forms of inference are applicable here. In increasing order of
accuracy and precision, but also of algebraic complexity, they are naïve (fully
subjective), frequentist (fully objective) and Bayesian (both subjective and ob-
jective). The first of these is trivial, whereas the others both require us to specify the
likelihood function.
Firstly, consider the delayed renewal process of Section 8.4.1. In practical
applications, the model parameters for each of the PM and CM lifetime distribu-
tions are unknown and we need some subjective or objective information about
these parameters. Subjective information typically represents the expert views of
maintenance engineers about a system’s repair and failure process, and can take
many forms such as simply specifying values for the unknown parameters.
Objective information typically takes the form of historically observed failure and
repair data for the system under consideration, of the form

$$D = \left\{ (u_i, v_{ij}) ;\ i = 1, \ldots, n ;\ j = 1, \ldots, n_i \right\} \tag{8.13}$$

which covers n complete PM intervals, where interval i contains n_i failures. Note
that the u_i are right censored if n_i = 0 and the v_ij are right censored when j = n_i.
Otherwise, the observations represent actual failure times. We introduce the indica-
tor variables

$$c_i = \begin{cases} 0 ; & u_i \text{ right censored} \\ 1 ; & u_i \text{ observed lifetime} \end{cases} \tag{8.14}$$

$$d_{ij} = \begin{cases} 0 ; & v_{ij} \text{ right censored} \\ 1 ; & v_{ij} \text{ observed lifetime} \end{cases} \tag{8.15}$$

to identify when observations are right censored.

The likelihood function for this delayed renewal process then becomes

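$$L(\theta, \phi ; D) \propto \prod_{i=1}^{n} \left[ \{ f(u_i \mid \theta) \}^{c_i} \{ R(u_i \mid \theta) \}^{1-c_i} \prod_{j=1}^{n_i} \{ f(v_{ij} \mid \phi) \}^{d_{ij}} \{ R(v_{ij} \mid \phi) \}^{1-d_{ij}} \right] \tag{8.16}$$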

where R(·) represents the corresponding reliability function. Due to the nature of
the DRP model, this likelihood function can be written as the product of a function
of θ and a function of φ, so that

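$$L(\theta, \phi ; D) = L(\theta ; D)\, L(\phi ; D) \tag{8.17}$$

where

$$L(\theta ; D) \propto \prod_{i=1}^{n} \{ f(u_i \mid \theta) \}^{c_i} \{ R(u_i \mid \theta) \}^{1-c_i} \tag{8.18}$$

and

$$L(\phi ; D) \propto \prod_{i=1}^{n} \prod_{j=1}^{n_i} \{ f(v_{ij} \mid \phi) \}^{d_{ij}} \{ R(v_{ij} \mid \phi) \}^{1-d_{ij}}. \tag{8.19}$$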

For a frequentist analysis, we evaluate the maximum likelihood estimates of θ and
φ by maximising the natural logarithm of this function with respect to these
parameters. Subsequent inference generally assumes that the parameters are equal to
these values. To avoid the errors that arise through adopting a naïve or frequentist
approach, we can instead adopt a Bayesian approach. This leads naturally to a
decision-theoretic solution to the problem of PM scheduling and we refer interested
readers to the article by Percy et al. (1998a) for details.
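To illustrate the frequentist step, suppose purely for the sake of example that lifetimes following PM are exponential with rate θ. The likelihood (8.18) then has a closed-form maximiser, namely the number of observed failures divided by the total time at risk. The exponential choice, the function names and the data below are assumptions for illustration; other lifetime distributions generally require numerical maximisation.

```python
import math

def exp_log_likelihood(rate, times, observed):
    """Log of (8.18) for exponential lifetimes.

    Uses f(u) = rate * exp(-rate * u) and R(u) = exp(-rate * u), so an observed
    lifetime contributes log(rate) - rate * u and a censored one contributes -rate * u.
    """
    return sum(c * math.log(rate) - rate * u for u, c in zip(times, observed))

def exp_mle(times, observed):
    """Closed-form MLE of the exponential rate under right censoring."""
    failures = sum(observed)
    if failures == 0:
        raise ValueError("no observed lifetimes: the maximum lies on the boundary rate = 0")
    return failures / sum(times)

# Hypothetical data: u_i in days, c_i = 1 for an observed lifetime, 0 for right censored
u = [30.0, 12.5, 30.0, 21.0, 30.0]
c = [0, 1, 0, 1, 0]
rate_hat = exp_mle(u, c)
```

Because the likelihood factorises as in (8.17), the same two functions could be applied separately to the v_ij and d_ij to estimate the parameter of the lifetime distribution following CM.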
We now turn our attention to parameter estimation for the nonhomogeneous
Poisson process. For failure times T_1, T_2, …, T_{N(T)} with observed values t_1, t_2, …, t_{N(T)}
in the interval (0, T], the likelihood function corresponding to a NHPP with intensity
function ι(t) is given by

$$L\{\iota ; H(t)\} \propto \left\{ \prod_{i=1}^{N(T)} \iota\left( t_i^- \right) \right\} \exp\left\{ -\int_0^T \iota(t)\, dt \right\} \tag{8.20}$$

and so the log-likelihood function becomes

$$l\{\iota ; H(t)\} = \text{const.} + \sum_{i=1}^{N(T)} \log \iota\left( t_i^- \right) - \int_0^T \iota(t)\, dt. \tag{8.21}$$


Therefore, once we specify the formulation of ι ( t ) , we can obtain estimates for its
unknown parameters via likelihood-based methods.
Example 8.4 Assuming T = t_{N(T)} so that observation ceases at a failure, the
maximum likelihood estimates (MLEs) can be determined analytically for the
power-law process (NHPP with power-law intensity). With ι(t) = α t^β and
n = N(T), the MLEs are

$$\hat{\beta} = \frac{n}{\sum_{i=1}^{n} \log\left( T / t_i \right)} - 1 \tag{8.22}$$

$$\hat{\alpha} = \frac{n \left( \hat{\beta} + 1 \right)}{T^{\hat{\beta} + 1}}. \tag{8.23}$$

For a particular system, successive arrival times (not inter-arrival times) were
observed to be 15, 42, 74, 117, 168, 233 and 410 days. With n = 7 , T = 410 and
t1 = 15,… , t7 = 410 , we have βˆ ≈ −0.3007 and then αˆ ≈ 0.07288 . As βˆ < 0 , the
intensity is a strictly decreasing function of time; this is a happy system that seems
to improve with age.
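The calculation in Example 8.4 can be checked with a few lines of Python. This is only a sketch of Equations 8.22 and 8.23; the function name is an assumption, and it presumes that observation stops at the last failure so that T equals the final arrival time.

```python
import math

def power_law_mle(times):
    """MLEs for iota(t) = alpha * t**beta when observation stops at the last failure."""
    n = len(times)
    T = times[-1]
    log_sum = sum(math.log(T / t) for t in times)  # the final term is log(1) = 0
    beta_hat = n / log_sum - 1.0
    alpha_hat = n * (beta_hat + 1.0) / T ** (beta_hat + 1.0)
    return alpha_hat, beta_hat

arrivals = [15, 42, 74, 117, 168, 233, 410]  # arrival times in days, from Example 8.4
alpha_hat, beta_hat = power_law_mle(arrivals)
print(f"beta_hat = {beta_hat:.4f}, alpha_hat = {alpha_hat:.5f}")
```

Up to rounding in the final digit, the printed values should agree with the estimates quoted in the example.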
Analysis of the intensity based models follows by extending this likelihood
function corresponding to the NHPP. Consider the generalized proportional inten-
sities model of Section 8.5. The choice of which predictor variables to include
depends upon the sample size (history of failures) and the results of standard
selection procedures based on analyses of deviance for nested models. Only im-
portant predictors should be included in order to produce a robust model. We can
estimate the parameters in the model by maximum likelihood, on extending the
NHPP likelihood presented above, whereby the log-likelihood is given by

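$$l\{\iota ; H(T)\} = \text{const.} + \sum_{k=1}^{n} c_k \left\{ \log \iota_0(t_k) + M(t_k) \log \rho + N(t_k) \log \sigma + x_{t_k}^{\mathrm{T}} \gamma \right\} - \sum_{k=0}^{n} \left\{ \rho^{M(t_k)} \sigma^{N(t_k)} \int_{t_k}^{t_{k+1}} \iota_0(t) \exp\left( x_t^{\mathrm{T}} \gamma \right) dt \right\}. \tag{8.24}$$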

This corresponds to the simple case where the scaling factors are constant: minor
changes are needed for the more general cases.

8.7 Model Selection

In this section, we consider how to choose among the many approaches described
above, namely the renewal process (RP), delayed renewal process (DRP), delayed
alternating renewal process (DARP), virtual age model (VAM), proportional haz-
ards model (PHM), nonhomogeneous Poisson process (NHPP), intensity reduction
model (IRM), proportional intensities model (PIM) and generalized proportional
intensities model (GPIM). The main distinguishing features are process stationarity,
goodness of fit, mathematical robustness, consistency and ease of implementation.
All of these models are concerned with describing the failure and repair process of
complex repairable systems subject to preventive maintenance. We subsequently
use the fitted models to forecast the system behaviour under different PM strategies
by simulation. This enables us to determine the optimal strategy by minimising the
expected cost per unit time over a suitable horizon, finite or infinite, with respect to
suitable loss or utility functions.
The RP only applies to individual components, for which corrective and
preventive maintenance effectively amount to replacement. The DRP and DARP
apply to stationary systems, whereas all of the later models allow for non-
stationarity. The DRP is easier to fit than the DARP, but ignores the influence of
downtimes, so we use the latter if these are significant. Ascher and Feingold (1984)
discuss several methods for assessing stationarity. Perhaps the simplest of these is
a graph that plots the observed cumulative number of failures against the observed
cumulative operating hours. Consistent departures from linearity might suggest that
some trends are present. Naturally, we must exercise care to avoid distorting the
results when allowing for PM interventions. Sometimes though, we seek a formal
hypothesis test to assess whether the assumption of stationarity is reasonable.
One of these is Laplace's trend test, which is simple and sufficient for most
needs. Suppose we observe the system history from time 0 until time t and sup-
pose that we observe n failures at times t1 , t2 ,… , tn . Then Laplace's trend test
compares the test statistic

$$U = \frac{\sum_{i=1}^{n} t_i - \frac{nt}{2}}{t \sqrt{\frac{n}{12}}} \tag{8.25}$$

with standard normal critical values, rejecting the null hypothesis of no trend if
U ∉ (−z_{p/2}, z_{p/2}) for a hypothesis test at the 100p% level of significance, where
the proportion p represents the size of the test. For a 5% significance test, the
critical values are given by z_{p/2} = 1.960.
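A short sketch of the Laplace trend test in Equation 8.25 follows. The function names are assumptions; the summary figures in the usage line are those quoted later in Example 8.7, which lets the statistic be checked against the value reported there.

```python
import math

def laplace_statistic(sum_of_failure_times, n, t):
    """Laplace trend statistic (8.25) from the sum of n failure times observed on (0, t]."""
    return (sum_of_failure_times - n * t / 2.0) / (t * math.sqrt(n / 12.0))

def laplace_trend_test(failure_times, t, z_crit=1.960):
    """Return (U, reject) for a two-sided test; z_crit = 1.960 gives the 5% level."""
    u = laplace_statistic(sum(failure_times), len(failure_times), t)
    return u, abs(u) >= z_crit

# Summary figures quoted in Example 8.7: n = 22, t = 2128, sum of failure times = 21901
u_example = laplace_statistic(21901, 22, 2128)
```

Separating the statistic from the decision rule means that the same function serves whether the raw failure times or only their sum is available.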
If we decide that a system is nonstationary, we could use the VAM or PHM,
which are easier to fit to data than the stochastic processes considered next, but are
less robust because of their statistical rather than mathematical derivation.
However, all of these models require numerical computation to some extent. The
VAM and PHM might provide a better fit to the observed data on occasions,
though the mathematical justification and consequent robustness of the stochastic
processes are most appealing properties. The NHPP corresponds to minimal repairs
and only applies to systems containing very many similar components. However, it
is the fundamental model for complex repairable systems and its simplicity appeals
to many practitioners. The IRM and PIM improve upon the NHPP by allowing for
partial repairs and preventive maintenance. The GPIM combines the best features
of both models and perhaps offers the most potential for PM scheduling problems,
despite the extra computational burden it attracts.
Whichever model we choose to fit to our data, some degree of model compari-
son is necessary. For the RP, DRP and DARP, we need to decide which lifetime
distributions to fit following PM and CM. For the VAM and PHM, we must select
suitable baseline hazard functions, scaling factors and explanatory variables for our
linear predictors. Similar choices are necessary for the NHPP, IRM, PIM and
GPIM. We can assess the goodness of fit of a model using its likelihood function
and can compare different models using likelihood ratios or Bayes factors. Con-
sider two nested models M 1 ⊃ M 2 with p1 > p2 parameters and likelihood func-
tions L1 > L2 respectively. Then under general conditions, asymptotic sampling
distribution theory states that

$$2 \log\left( \frac{L_1}{L_2} \right) \sim \chi^2 (p_1 - p_2) \tag{8.26}$$

and so we can test whether the extra parameters are significant. This is particularly
beneficial when choosing which elements to include in a linear predictor.
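The test in Equation 8.26 can be sketched as below. The function names are assumptions. To stay within the Python standard library, the tail probability uses the closed form that holds when the number of extra parameters is even; an odd difference would need a statistical library instead. The usage line takes the log-likelihoods from the first and last rows of Table 8.6 in Section 8.9, treating the model without predictors as nested within the model with both, and relying on the asymptotic approximation.

```python
import math

def chi2_sf_even(x, df):
    """Upper tail probability of a chi-squared variable with an even number of degrees of freedom."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2.0
    return math.exp(-half) * sum(half**k / math.factorial(k) for k in range(df // 2))

def likelihood_ratio_test(loglik_big, loglik_small, p_big, p_small):
    """Return the statistic 2 * log(L1 / L2) of (8.26) and its asymptotic p-value."""
    stat = 2.0 * (loglik_big - loglik_small)
    return stat, chi2_sf_even(stat, p_big - p_small)

# Log-likelihoods from the last and first rows of Table 8.6 (six and four parameters)
stat, p_value = likelihood_ratio_test(-209.5, -211.7, 6, 4)
significant_at_5_percent = p_value < 0.05
```

Comparing the p-value with the chosen size of the test indicates whether the extra elements of the linear predictor are worth retaining.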
If the models M_1 and M_2 are not nested, we cannot use this formal test and
instead simply compare the log-likelihood functions log L_1 and log L_2, choosing the model
with the larger log-likelihood. This is appropriate for choosing between gamma and
Weibull baseline hazard functions, for example. However, it is only valid if p1 = p2 ,
as a model with more parameters often fits better than a model with fewer para-
meters, by definition. To compare non-nested models with different numbers of para-
meters, we usually apply a correction factor to the log-likelihood functions.
Two common modified forms are the Akaike information criterion (AIC),
which suggests that we compare log L_1 − p_1 with log L_2 − p_2, and the Schwarz
criterion, or Bayes information criterion (BIC), which suggests that we compare
log L_1 − (p_1 log n)/2 with log L_2 − (p_2 log n)/2, where n is the number of
observations in the data set. The latter arises as the limiting case of the posterior odds
resulting from a Bayesian analysis with reference priors. In each case, the best
model to choose is the one that maximizes the information criterion.
Example 8.5 Suppose we fit two non-nested models to a set of lifetime data,
based on n = 31 observed failures. The first model contains three parameters and
has a likelihood of L_1 = 8.742 × 10^−18. The second model contains five parameters
and has a likelihood of L_2 = 3.110 × 10^−17. The Bayes information criterion for the
first model is log L_1 − (p_1 log n)/2 ≈ −44.43 and for the second model it is
log L_2 − (p_2 log n)/2 ≈ −46.59, so we prefer the first, simpler model here.
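The arithmetic of Example 8.5 can be reproduced with the following sketch, which uses the sign convention of this section (larger values are better). The function and variable names are assumptions made for illustration.

```python
import math

def aic_score(log_lik, p):
    """Penalised log-likelihood in the form used in the text: log L - p."""
    return log_lik - p

def bic_score(log_lik, p, n):
    """Penalised log-likelihood in the form used in the text: log L - (p log n) / 2."""
    return log_lik - p * math.log(n) / 2.0

n_obs = 31
# (log-likelihood, number of parameters) for the two models of Example 8.5
models = {"model 1": (math.log(8.742e-18), 3), "model 2": (math.log(3.110e-17), 5)}
bic = {name: bic_score(ll, p, n_obs) for name, (ll, p) in models.items()}
aic = {name: aic_score(ll, p) for name, (ll, p) in models.items()}
best_by_bic = max(bic, key=bic.get)  # larger is better under this sign convention
```

Note that many software packages report these criteria multiplied by −2, in which case the preferred model is the one with the smallest value.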

8.8 Preventive Maintenance Scheduling

The objective of model fitting is to determine the optimal PM period for minimising
the expected cost per unit time. We will see that analytical solution of this problem
is not possible and that simulation of the failure and repair process over a given
horizon provides the best approach for resolving this difficulty. Sometimes, no
particular horizon is specified and we can do no more than assume an infinite
horizon. However, this problem simplifies for stationary systems involving the
models based on the renewal process, as we only need to simulate the process over a
single PM interval.
On other occasions, a finite horizon is clearly defined. Perhaps a factory or
machine is owned on a 20-year lease. Alternatively, the equipment might be re-
tained until cost efficiencies on a larger scale recommend replacement or scrap-
ping. For a pre-determined finite horizon such as these, we base decisions on
simulating the process for the whole horizon. Further analysis could be performed
for the situation where a finite horizon is not pre-specified and must be regarded as
random. The extra complexity introduced is a current research issue.
We begin by considering the delayed renewal process again. Suppose that the
costs associated with PM and CM are k PM and kCM units respectively. Assuming
an infinite horizon, we now simulate a PM interval of length t . This involves
generating a pseudo-random observation u from fU ( u ) , to represent a typical
lifetime following PM. If u ≥ t , the interval is complete and the total cost incurred
is k PM . However, if u < t we generate a pseudo-random observation v1 from
fV ( v ) , to represent a typical lifetime following CM, and add a cost kCM for the
repair. We continue this process, generating CM lifetimes v1 , v2 , v3 ,… and adding
a further cost kCM each time until this interval is complete, and then calculate the
total cost for this interval. Call this total cost K1 .
This procedure has completely simulated a PM interval of length t . We next
repeat the procedure, until we have m repetitions in total, and determine the total
costs for these simulated intervals, K1 , K 2 ,… , K m . Then their sample mean

$$\bar{K} = \frac{1}{m} \sum_{i=1}^{m} K_i \tag{8.27}$$

represents an unbiased estimator for the total cost per PM interval. This enables us
to estimate the expected cost per unit time as K̄/t.
Now we must repeat the whole simulation for different values of t , using an
efficient search algorithm, to determine the value of t that minimises this expected
cost per unit time. This is the recommended PM interval duration. We advocate
direct search algorithms for practical implementation, such as golden-section search.
For practical purposes, t is unlikely to vary continuously and discrete values will
dominate. Convenient multiples of days, weeks or months provide suitable units of
measurement for practical implementation.
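The simulation procedure for the delayed renewal process with an infinite horizon can be sketched as follows. This is an illustration under stated assumptions rather than a definitive implementation: the Weibull lifetime distributions, their parameters, the costs and all function names are hypothetical. In place of golden-section search, the sketch simply evaluates every value in a short list of convenient candidate intervals, which suits the discrete setting described above.

```python
import random

def simulate_pm_interval_cost(t, draw_u, draw_v, k_pm, k_cm):
    """Simulate one PM interval of length t and return its total cost."""
    cost = k_pm
    elapsed = draw_u()          # lifetime following PM
    while elapsed < t:          # a failure occurs before the next PM
        cost += k_cm            # pay for the corrective repair
        elapsed += draw_v()     # lifetime following CM
    return cost

def cost_rate(t, draw_u, draw_v, k_pm, k_cm, m=2000):
    """Estimate the expected cost per unit time as (sample mean of K_1..K_m) / t."""
    total = sum(simulate_pm_interval_cost(t, draw_u, draw_v, k_pm, k_cm) for _ in range(m))
    return total / m / t

def best_pm_interval(candidates, draw_u, draw_v, k_pm, k_cm, m=2000):
    """Pick the candidate interval with the smallest estimated cost rate."""
    return min(candidates, key=lambda t: cost_rate(t, draw_u, draw_v, k_pm, k_cm, m))

# Hypothetical inputs: Weibull lifetimes (scale, shape) in days, costs in arbitrary units
rng = random.Random(1)
draw_u = lambda: rng.weibullvariate(60.0, 2.5)
draw_v = lambda: rng.weibullvariate(40.0, 2.0)
t_star = best_pm_interval([7, 14, 30, 60, 90], draw_u, draw_v, k_pm=1.0, k_cm=5.0)
```

Because each estimate is subject to simulation noise, a larger number of replications m may be needed when neighbouring candidates have similar cost rates.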
To deal with scenarios involving finite horizons, we modify this simulation
procedure. Instead of generating one simulated PM interval on many occasions, we
simulate the process over the whole horizon h and accumulate the costs of PM and
CM over this period. If we redefine K1 as the total cost over this horizon, then
successive replications of this simulated process generate total costs K1 , K 2 ,… , K m
as before. This time however, the expected cost per unit time is given by K̄/h,
where K̄ is the sample mean defined earlier.
We now shift our attention to the delayed alternating renewal process. For most
applications, it is reasonable to suppose that all PM activities have downtimes of
similar durations, at an average cost of k PM units, and that all CM activities have
downtimes of similar durations, at an average cost of kCM units. The analysis of this
DARP model then proceeds exactly as for the DRP model, except that the simulation
of successive PM intervals must also take account of these downtimes. Our model
assumptions could be extended to consider different levels of maintenance activity, if
these are evident in practice. For example, PM and CM might each be performed as
minor or major activities, with corresponding downtimes. Such possibilities are
application specific and can readily be incorporated as required, by adapting the basic
simulation program. Indeed, simulation is the only feasible method of analysis and
optimisation in this case.
To investigate PM scheduling for the nonhomogeneous Poisson process, we
condition only upon the history at time t to avoid the problems associated with
doubly stochastic processes and obtain

$$P\{ N(t+\varepsilon) - N(t) = n \mid H(t) \} = \frac{\{ \mu_t(\varepsilon) \}^n}{n!} \exp\{ -\mu_t(\varepsilon) \} \tag{8.28}$$

for n = 0, 1, 2, … where

$$\mu_t(\varepsilon) = \int_t^{t+\varepsilon} \iota(t)\, dt \tag{8.29}$$

is the mean number of failures in the interval (t, t + ε).

Consequently, the reliability function for the next failure from time t is

$$R_t(\varepsilon) = P\{ N(t+\varepsilon) - N(t) = 0 \mid H(t) \} = \exp\{ -\mu_t(\varepsilon) \}, \tag{8.30}$$

from which we can determine the lifetime distribution following a particular
maintenance action at time t as

$$f_t(\varepsilon) = -R_t'(\varepsilon) = \iota(t+\varepsilon) \exp\{ -\mu_t(\varepsilon) \}. \tag{8.31}$$

This allows us to simulate the process as before, evaluate expected costs over a
finite horizon, and so deduce the most economical time for the next preventive
maintenance. This decision can be made at any specific event, such as during PM
or CM, or even between events, so long as the intensity function is known.
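As a sketch of how Equations 8.29 to 8.31 feed a simulation, the time to the next failure can be drawn by inverse-transform sampling, that is by solving R_t(ε) = U for a uniform random number U. The example assumes the power-law intensity ι(t) = α t^β of Example 8.4 with β > −1, for which the integral has a closed form; other intensities would generally need numerical inversion. The function names are assumptions.

```python
import math
import random

def mean_failures(t, eps, alpha, beta):
    """mu_t(eps) of (8.29) for the power-law intensity alpha * t**beta, with beta > -1."""
    b = beta + 1.0
    return alpha * ((t + eps) ** b - t ** b) / b

def reliability(t, eps, alpha, beta):
    """R_t(eps) of (8.30): probability of no failure in (t, t + eps)."""
    return math.exp(-mean_failures(t, eps, alpha, beta))

def sample_time_to_next_failure(t, alpha, beta, uniform):
    """Solve R_t(eps) = uniform for eps, giving a draw from the density (8.31)."""
    b = beta + 1.0
    return (t ** b - b * math.log(uniform) / alpha) ** (1.0 / b) - t

rng = random.Random(7)
# Estimates from Example 8.4, used here only as plausible inputs
alpha_hat, beta_hat = 0.07288, -0.3007
u_draw = 1.0 - rng.random()  # lies in (0, 1], which avoids log(0)
gap = sample_time_to_next_failure(410.0, alpha_hat, beta_hat, u_draw)
```

Repeating the draw after each simulated failure, and updating t accordingly, generates a complete failure history over the chosen horizon.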
Next we consider the proportional hazards model. To avoid referring separately
to the hazard functions κ ( u ) and λ ( v ) , consider a general hazard function h ( x ) .
For the purposes of simulation in order to schedule PM in the future, the reliability
function can be determined as

$$R(x) = \exp\left\{ -\int_0^x h(x)\, dx \right\}, \tag{8.32}$$

from which the probability density function is

$$f(x) = -R'(x) = h(x) \exp\left\{ -\int_0^x h(x)\, dx \right\}, \tag{8.33}$$

allowing us to simulate the system’s failure process for PM optimisation as before.

8.9 Applications
We now apply some of these models to the data sets in Section 8.2.

Example 8.6 For each system, we fitted the intensity reduction model using
constant, loglinear and power-law baseline intensities with constant reduction
factors. Its goodness of fit is measured by the log-likelihoods in Table 8.5, obtained
using Mathcad software. For comparison, we also display the log-likelihoods for
the extremes of renewal process (maximal repairs) and nonhomogeneous Poisson
process (minimal repairs).

Table 8.5. Log-likelihoods for analyses of hypothetical reliability data

Model                 Baseline     Happy     Sad       Noncommittal
                      intensity    system    system    system
intensity reduction   constant     −33.7     −33.7     −35.5
                      loglinear    −32.4     −28.5     −33.4
                      power-law    −29.4     −32.0     −34.7
maximal repair        constant     −35.5     −35.5     −35.5
                      loglinear    −34.8     −34.8     −34.8
                      power-law    −35.1     −35.1     −35.1
minimal repair        constant     −35.5     −35.5     −35.5
                      loglinear    −34.8     −32.0     −35.2
                      power-law    −35.0     −31.8     −35.3

As expected, the intensity reduction model provides a good fit to all three
systems, preferring the power-law baseline intensity for the happy system and the
loglinear baseline intensity for the sad and noncommittal systems. Figure 8.4
shows that these baseline intensities are all increasing functions and any apparent
happiness is due to the high quality of repairs rather than a self-improving system.

[Three panels, each plotting the intensity function λ(t, a, b, s) against time t from 0 to 410]

Figure 8.4. Best fitting models for happy, sad and noncommittal systems, respectively

Example 8.7 Regarding all PM actions as CM actions for demonstration
purposes, we apply the Laplace trend test to determine whether there is any
evidence of non-stationarity at the 5% level of significance. Our test statistic is

$$U = \frac{\sum_{i=1}^{n} t_i - \frac{nt}{2}}{t \sqrt{\frac{n}{12}}} = \frac{21{,}901 - \frac{22 \times 2{,}128}{2}}{2{,}128 \times \sqrt{\frac{22}{12}}} \approx -0.5230. \tag{8.34}$$

As −1.960 < U < 1.960 , the test is not significant at the 5% level and we conclude
that this test provides no evidence of non-stationarity for these data. Consequently,
the delayed renewal process might provide an adequate fit to these data, without
the need for a more complicated model. However, we might consider using the
DARP if downtime is important or one of the later models if concomitant
information is also available.

Example 8.8 Here the data comprise 65 event observations collected over seven
years. In the first half of this period, there were 15 CM and 11 PM actions. In the
second half of this period, there were 29 CM actions and 10 PM actions. Hence,
this is a sad system, which might benefit from preventive maintenance. We fit the
generalized proportional intensities model to these data with explanatory variables
representing quality of last maintenance and time since last maintenance. A
loglinear baseline with constant reduction factors generates the results in Table 8.6.

Table 8.6. Log-likelihoods and parameter estimates for GPIM analyses of oil pump data

Predictor                 Log-          Parameter estimates
variables                 likelihood    α̂            β̂       ρ̂        σ̂        γ̂
─                         −211.7        5 × 10^−4    1.01    0.719    0.740    ─
Quality of last action    −210.2        6 × 10^−4    1.01    0.699    0.745    6 × 10^−3
Time since last action    −210.8        7 × 10^−4    1.01    0.666    0.728    −8 × 10^−3
Quality of last action,   −209.5        8 × 10^−4    1.01    0.653    0.734    6 × 10^−3
time since last action                                                         −7 × 10^−3

The best model includes both “quality of last maintenance action” and “time
since last maintenance action” as predictor variables. This is not surprising, as it
contains six parameters whereas the model with no predictor variables has only
four. As the associated PM reduction factor ρ̂ is about two-thirds, preventive
maintenance reduces the intensity of critical failures for this system and so
improves its reliability. Although slightly less impressive, corrective maintenance
reduces the intensity function too. Hence, the maintenance workforce appears to be
very effective for this application! A graph of the intensity function for the GPIM
with both covariates follows in Figure 8.5, based on the corresponding parameter
estimates in the last row of Table 8.6.

[Plot of the intensity function λ(t, a, b, r, s, c1, c2) against time t from 0 to 2487]

Figure 8.5. Intensity function for GPIM analysis of oil pump data with two covariates

We now perform a simulation analysis for this last model based on the methods
described in Section 8.8, in order to determine an optimal strategy for scheduling
preventive maintenance. Several convenient PM intervals are considered for our
calculations, including weekly, monthly, two-monthly, quarterly, biannually, annu-
ally and biennially. The minimum cost per unit time over a ten-year fixed horizon
is achieved with monthly PM and generates a projected 80% saving over annual
PM, though this estimated reduction in costs is sensitive to the choice of model.
The previous policy implemented averages about three PM actions per year, which
our simulation estimates would cost about four times as much in preventive main-
tenance when compared with the optimal policy of monthly PM.

8.10 Conclusions
This chapter discussed the ideas of modelling complex repairable systems, with the
intention of scheduling preventive maintenance to improve operational efficiency
and reduce running costs. It started by emphasising the importance of improved,
accurate and complete data collection in practice. It then presented the renewal
process, delayed renewal process and delayed alternating renewal process as
reasonable models for systems that exhibit stationary failure patterns.
The virtual age model and proportional hazards model were described as
suitable for systems that do not exhibit stationarity and for systems where predictor
variables such as condition monitoring observations are also measured. The non-
homogeneous Poisson process, intensity reduction model and proportional inten-
sities model, with a promising generalization, were described next. We claim that
these models offer natural interpretations of the physical underlying reliability and
maintenance processes.
Finally, this chapter demonstrated some applications of these ideas using
reliability and maintenance data taken from the oil industry and reviewed several
methods for model selection and goodness-of-fit testing, including graphs, Laplace
trend test, likelihood ratios and the Akaike and Bayes information criteria. The use
of mathematical modelling and statistical analysis in this fashion can improve, and
has improved, the quality of PM scheduling. This can then result in considerable
cost savings and help to improve system availability.

8.11 References
Ascher HE, Feingold H, (1984) Repairable Systems Reliability: Modeling, Inference,
Misconceptions and their Causes. New York: Marcel Dekker
Baik J, Murthy DNP, Jack N, (2004) Two-dimensional failure modeling with minimal
repair. Naval Research Logistics 51:345–362
Cox DR, (1972a) Regression models and life tables (with discussion). Journal of the Royal
Statistical Society Series B 34:187–220
Cox DR, (1972b) The statistical analysis of dependencies in point processes. In Stochastic
Point Processes (Lewis PAW). New York: Wiley
Crowder MJ, Kimber AC, Smith RL, Sweeting TJ, (1991) Statistical Analysis of Reliability
Data. London: Chapman and Hall
Dagpunar JS, Jack N, (1993) Optimizing system availability under minimal repair with non-
negligible repair and replacement times. Journal of the Operational Research Society
Dekker R, Frenk H, Wildeman RE, (1996) How to determine maintenance frequencies for
multi-component systems? A general approach. In Reliability and Maintenance of
Complex Systems (Ozekici S). Berlin: Springer
Doyen L, Gaudoin O, (2004) Classes of imperfect repair models based on reduction of
failure intensity or virtual age. Reliability Engineering and System Safety 84:45–56
Handlarski J, (1980) Mathematical analysis of preventive maintenance schemes. Journal of
the Operational Research Society 31:227–237
Jack N, (1998) Age-reduction model for imperfect maintenance. IMA Journal of
Mathematics Applied in Business and Industry 9:347–354
Jardine AKS, Anderson PM, Mann DS, (1987) Application of the Weibull proportional
hazards model to aircraft and marine engine failure data. Quality and Reliability
Engineering International 3:77–82
Jiang R, Jardine AKS, (2006) Composite scale modeling in the presence of censored data.
Reliability Engineering and System Safety 91:756–764
Kobbacy KAH, Fawzi BB, Percy DF, Ascher HE, (1997) A full history proportional hazards
model for preventive maintenance scheduling. Quality and Reliability Engineering
International 13:187–198
Lindqvist BH, Elvebakk G, Heggland K, (2003) The trend-renewal process for statistical
analysis of repairable systems. Technometrics 45:31–44
Lugtigheid D, Banjevic D, Jardine AKS, (2004) Modelling repairable systems reliability
with explanatory variables and repair and maintenance actions. IMA Journal of
Management Mathematics 15:89–110
Newby M, (1994) Perspective on Weibull proportional hazards models. IEEE Reliability
Transactions 43:217–223
Percy DF, Alkali BM, (2006) Generalized proportional intensities models for repairable
systems. IMA Journal of Management Mathematics 17:171–185
Percy DF, Kobbacy KAH, (2000) Determining economical maintenance intervals.
International Journal of Production Economics 67:87–94
Percy DF, Bouamra O, Kobbacy KAH, (1998a) Bayesian analysis of fixed-interval
preventive-maintenance models. IMA Journal of Mathematics Applied in Business and
Industry 9:157–175
Percy DF, Kobbacy KAH, Ascher HE, (1998b) Using proportional-intensities models to
schedule preventive-maintenance intervals. IMA Journal of Mathematics Applied in
Business and Industry 9:289–302
Taylor HM, Karlin S, (1994) An Introduction to Stochastic Modelling. London: Academic
Van der Duyn Schouten F, (1996) Maintenance policies for multicomponent systems: an
overview. In Reliability and Maintenance of Complex Systems (Ozekici S). Berlin:
Watson C, (1970) Is preventive maintenance worthwhile? In Operational Research in
Maintenance (Jardine AKS). Manchester: University Press

Artificial Intelligence in Maintenance

Khairy A. H. Kobbacy

9.1 Introduction
Over the past two decades there has been substantial research and development in
operations management including maintenance. Kobbacy et al. (2007) argue that the
continuous research in these areas implies that solutions were not found to many
problems. This was attributed to the fact that many of the solutions proposed were
for well-defined problems, that the solutions assumed accurate data were available
and that the solutions were too computationally expensive to be practical. Artificial
intelligence (AI) was recognised by many researchers as a potentially powerful tool,
especially when combined with OR techniques, to tackle such problems. Indeed,
there has been vast interest in the applications of AI in the maintenance area as
witnessed by the large number of publications in the area. This chapter reviews the
application of AI in maintenance management and planning and introduces the
concept of developing an intelligent maintenance optimisation system.
The outline of the chapter is as follows. Section 9.2 deals with various main-
tenance issues including maintenance management, planning and scheduling. Sec-
tion 9.3 introduces a brief definition of AI, some of its techniques that have appli-
cations in maintenance and Decision Support Systems. A review of the literature is
then presented in Section 9.4 covering the applications of AI in maintenance. We
have focused on five AI techniques, namely knowledge based systems, case based
reasoning, genetic algorithms, neural networks and fuzzy logic. This review also
covers “hybrid” systems where two or more of the above mentioned AI techniques
are used in an application. Other AI techniques seem to have very few applications
in maintenance to date. A discussion of the development of the prototype hybrid
intelligent maintenance optimisation system (HIMOS), which was developed to
evaluate and enhance preventive maintenance (PM) routines of complex engineering
systems, follows in Section 9.5. HIMOS uses a knowledge based system to
identify suitable models to schedule PM activities and case based reasoning to add
the capability to utilise past experience in model selection. Future developments and
the outline design of an Adaptive Maintenance Measurement and Control Model are
covered in Section 9.6. Concluding remarks are presented in Section 9.7.

The following abbreviations are used throughout this chapter.

AHP: Analytic hierarchy process
AI: Artificial intelligence
CBR: Case based reasoning
CO: Corrective action
DMG: Decision making grid
DSS: Decision support system
FL: Fuzzy logic
GAs: Genetic algorithms
HIMOS: Hybrid intelligent maintenance optimisation system
IDSS: Intelligent decision support system
IMOS: Intelligent maintenance optimisation system
KBS: Knowledge based systems
NHPP: Non-homogeneous Poisson process
NNs: Neural networks
OR: Operational research
PHM: Proportional hazards model
PIM: Proportional intensities model
PM: Preventive maintenance
RBR: Rule based reasoning

9.2 Maintenance Management, Planning and Scheduling

Most industrial organisations have maintenance departments which deal with many
issues regarding operations. For example, they can be involved in process design,
inventory, scheduling and staffing. However, the ultimate objective of maintenance
is to keep equipment at an acceptable standard. To achieve this objective a variety of
maintenance actions are employed, including inspection, repair, planned maintenance
and replacement. Adequate planning of the type, content and timing of maintenance
actions is essential for the success of the maintenance function (Kobbacy
A survey of some 34 companies was carried out in the UK (Kobbacy et al.
2005). It indicated that around half of the work that was carried out by maintenance
departments was on repair; around a quarter was on preventive maintenance and 5%
on inspection. The remaining effort was on other types of maintenance actions
including opportunistic maintenance, condition monitoring and design-out maintenance.
Repairs represent the largest proportion of maintenance actions carried out
by maintenance departments and indeed all departments surveyed carried out repairs.
Repair is the maintenance action that restores the equipment to operating
condition. Some repair actions restore equipment to as new condition while others
are classed as minimal repair, i.e. restore equipment to the condition prior to
failure. In reality, equipment is likely to be restored to a condition between these
two states. Occasionally, repairs may introduce faults to the equipment.
Preventive maintenance is the maintenance action that is undertaken in the
belief that it reduces the occurrence of failures as compared with the alternative of
repairing components only upon failures (Kobbacy et al. 1995a). PM is perhaps the
most intractable of maintenance actions in terms of mathematical modelling. The
main reason is that only one point is usually known on the curve representing cost/
availability against PM interval, and the analyst attempts to predict failure rate at a
range of PM intervals in order to select the optimal interval.
Inspection is the action taken to establish the condition of equipment at some
point in time. It can be triggered by observing unusual performance of equipment,
e.g. noise, or else the inspection can be carried out at regular predetermined
intervals. A major difference between PM and inspection is that PM routines usually
involve planned maintenance action, e.g. replace component, make adjustment, etc.
while inspection involves checking the condition of equipment and carrying out
maintenance action based on the outcome of inspection. In other words, inspection
routines, unlike PM, do not contain predetermined restoration of equipment condition.
Fault diagnosis is an integral part of maintenance actions and it follows from
realizing that a fault has occurred. This is essentially required before repairs are
carried out following failure, preventive maintenance, inspection or condition
monitoring.
There are two approaches for maintenance planning/management – the en-
gineering approach and the mathematical approach (Gits 1984). The engineering
approach has a broad view of the maintenance problem as the maintenance concept
is determined through consideration of the operations plan, maintenance constraints
and item behaviour. Thus it emphasises the development of rules or guidelines
for planning maintenance action. The mathematical approach places more emphasis
on developing optimal maintenance policies, e.g. optimal PM interval. A
major challenge in this field is how to integrate these approaches. Many software
packages have been developed over the years to help in the analysis and modelling
of maintenance situations, though they have their limitations, including the need
for intervention by an analyst, which can slow the process or make the analysis
almost intractable for large systems.
Scheduling of maintenance actions is a part of maintenance planning. Not all
maintenance actions require scheduling, e.g. repair upon failure and design-out
maintenance. Opportunistic maintenance, by definition, is carried out taking ad-
vantage of the time when equipment is not in use, but planning for spare parts can
be required. Condition monitoring can be a continuous process but often requires
planning of monitoring interval and the subsequent replacement. The other two
major maintenance actions that require scheduling are preventive maintenance and
“planned” inspections. Typically, one needs first to establish a model for failure
pattern, i.e. times between failures. A non-homogeneous Poisson process is usually
the model of first choice for deteriorating repairable systems (Ascher and Kobbacy
1995). There have been many attempts to schedule PM routines, i.e., to decide on
the frequency of PM actions per year. Ascher and Kobbacy (1995) present models
for scheduling PM by minimising cost or maximising availability using an NHPP.
Other attempts use Cox’s proportional hazards model (Kobbacy et al. 1997) and
proportional intensities model (Percy et al. 1998). The latter has shown
great promise and is being investigated for application in more complex PM
situations, e.g., multiple PM routines.
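
As a rough illustration of how an NHPP can feed a PM schedule, the sketch below assumes a power-law intensity, minimal repair on failure, and a PM every T days that renews the item. These assumptions, the parameter values and the function names are illustrative only and are not taken from the cited models; they simply show the trade-off between PM cost and failure cost.

```python
def expected_failures(alpha: float, beta: float, t: float) -> float:
    """Expected failures in (0, t] for a power-law NHPP (assumed form)."""
    return alpha * t ** beta


def cost_rate(T: float, alpha: float, beta: float,
              c_pm: float, c_co: float) -> float:
    """Long-run cost per unit time when PM is carried out every T time units."""
    return (c_pm + c_co * expected_failures(alpha, beta, T)) / T


def optimal_pm_interval(alpha: float, beta: float,
                        c_pm: float, c_co: float) -> float:
    """Closed-form minimiser of cost_rate; it only exists for beta > 1."""
    if beta <= 1:
        raise ValueError("no finite optimum: the item is not deteriorating")
    return (c_pm / (c_co * alpha * (beta - 1))) ** (1 / beta)


# Made-up parameters: time in days, costs in arbitrary units.
ALPHA, BETA, C_PM, C_CO = 0.001, 2.0, 100.0, 1000.0
T_STAR = optimal_pm_interval(ALPHA, BETA, C_PM, C_CO)
```

Under these assumptions the cost rate is c_pm/T + c_co·α·T^(β−1), so a shorter interval pays for more PM while a longer one pays for more failures; the closed form is the point where the two effects balance.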

9.3 AI Techniques
AI is a branch of computer science that develops programmes to allow machines to
perform functions normally requiring human intelligence (Microsoft ENCARTA
College Dictionary 2001). The goal of AI is to teach machines to “think” to a
certain extent under special conditions (Firebaugh 1988). There are many AI
techniques; those most used in maintenance decision support are as follows.
Knowledge based systems (KBS): use of domain specific rules of thumb or
heuristics (production rules) to identify a potential outcome or suitable course of
action.
Case based reasoning (CBR): utilises past experiences to solve new problems. It
uses case index schemes, similarity functions and adaptation. It provides machine
learning through updating of the case base.
Genetic algorithms (GAs): these are based on the principle that solutions can
evolve. Potentially promising solutions evolve through mutation and weaker solutions
become extinct.
Neural networks (NNs): use the back propagation algorithm to emulate the behaviour
of the human brain. Both NNs and GAs are capable of learning how to classify,
cluster and optimise.
Fuzzy logic (FL): allows the representation of information of uncertain nature.
It provides a framework in which membership of a category is graded and hence
quantifies such information for mathematical modelling, etc.
There are several other AI techniques and these include Data Mining, Robotics
and Intelligent Agents. However, to date very few publications are available about
their applications in maintenance.
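
To make the idea of graded membership concrete, the following minimal sketch maps a crisp reading to a degree of membership in several overlapping categories. The triangular shape, the category names and the break points are assumptions chosen for illustration; they do not come from any of the cited applications.

```python
def triangular(x: float, low: float, peak: float, high: float) -> float:
    """Degree of membership (0 to 1) of x in a triangular fuzzy set."""
    if x <= low or x >= high:
        return 0.0
    if x <= peak:
        return (x - low) / (peak - low)
    return (high - x) / (high - peak)


# Hypothetical fuzzy sets for a vibration reading (arbitrary units).
VIBRATION_SETS = {
    "normal": (-1.0, 0.0, 4.0),
    "elevated": (2.0, 5.0, 8.0),
    "severe": (6.0, 10.0, 14.0),
}


def fuzzify(reading: float) -> dict:
    """Map a crisp reading to its membership grade in each category."""
    return {name: triangular(reading, *abc)
            for name, abc in VIBRATION_SETS.items()}


grades = fuzzify(3.0)
```

A reading of 3.0 belongs partly to "normal" and partly to "elevated", which is the kind of uncertain information a crisp threshold cannot express.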

9.3.1 Intelligent Decision Support Systems

A useful definition of DSS is as follows. It is a computer based system that helps
decision makers confront ill-structured problems through direct interaction with
data and analysis models (Sprague and Watson 1986).
The result of integrating an AI technique within a DSS is referred to in this
chapter as an Intelligent DSS. This is essentially a DSS as defined above, but has
the additional capabilities to “understand”, “suggest” and “learn” in dealing with
managerial tasks and problems. The method of integration and the features of the
end product depend very much on the area of application.

9.4 AI in Maintenance
AI techniques have been used successfully in the past two decades to model and
optimise maintenance problems. Since the resurgence of AI in the mid-1980s
researchers have considered the applications of AI in this field. The article by
Dhaliwal (1986) is one of the early ones that argued for the appropriateness of
using AI techniques for addressing the issues of operating and maintaining large
and complex engineering systems. Kobbacy (1992) discusses the useful role of
knowledge based systems in the enhancement of maintenance routines. Over the
years the applications of AI in maintenance grew to cover a very wide area of
applications using a variety of AI techniques. This can be explained by the individual
nature of each technique. For example GAs and NNs have the advantage of being
useful in optimising complex and nonlinear problems and overcome the limitations
of the classic “black box” approaches, where an attempt is made to identify the system
by relating system outputs to inputs without understanding and modelling the
underlying process. Hence their widespread application in the scheduling area and
also in fault diagnosis.
In this section, an up to date survey is presented covering the area of appli-
cation of AI techniques in maintenance including fault diagnosis. This chapter will
only refer to some of the references in the vast applications of AI in fault diagnosis.
Interested readers can refer to the recent comprehensive review by Kobbacy et al.
(2007) on applications of AI in Operations.

9.4.1 Case Based Reasoning (CBR)

CBR is an interesting AI technique which adds learning capabilities to DSSs.
This may explain the lack of publications on using CBR on its own in
maintenance. Instead there are a few hybrid applications which utilise CBR together
with other AI techniques. Details of the CBR technique are discussed while presenting
the case study in Section 9.5.3.
Yu et al. (2003) present a problem-oriented multi-agent-based E-service system
(POMAESS). The system uses a CBR-based decision support function. The case
study, which is discussed later in this chapter, deals with a hybrid KBS/CBR
maintenance optimisation system (HIMOS).
More publications are found in fault diagnosis including papers on its
application in locomotive diagnostics, e.g. Varma and Roddy (1999). Xia and Rao
(1999) argue for the need to develop dynamic CBR, which introduces new mechanisms
such as time-tagged indexes and dynamic and multiple indexing to help accurate
solving of problems taking into account system dynamics and fault propagation
phenomena. Cunningham et al. (1998) describe an incremental CBR mechanism
that can initiate the fault diagnosis process with only a few features.
There are also papers on hybrid CBR systems in fault diagnosis including the
use of CBR with Petri nets for induction motor fault diagnosis (Tang et al. 2004),
CBR with FL in fault diagnosis of modern commercial aircraft (Wu et al. 2004),
CBR with NN in web-based intelligent fault diagnosis system (Hui et al. 2001),
CBR with heuristic reasoning and hypermedia for incident monitoring (Rao et al.
1998) and CBR with KBS in pattern search problem in fault diagnosis (Kohno et
al. 1997).

9.4.2 Genetic Algorithms (GAs)

GAs are popular in maintenance applications because of their robust search capabili-
ties that help reduce the computational complexity of large optimisation problems
(Morcous and Lounis 2005), such as large scale maintenance scheduling models.
GAs have applications in infrastructure networks including programming the main-
tenance of concrete bridge decks (Morcous and Lounis 2005; Lee and Kim 2007),
pavement maintenance programme (Chootinan et al. 2006), and optimising highway
life-cycle by considering maintenance of roadside appurtenances (Jha and Abdullah
2006). GAs also have applications in maintenance activities in nuclear power plants
including optimising the technical specification of a nuclear safety system by
coupling GAs and Monte Carlo simulation in an attempt to minimise the expected value
of system unavailability and its associated variance (Marseguerra et al. 2004).
Another important area of application is in manufacturing. Ruiz et al. (2006)
present an approach for scheduling of PM in a flowshop problem with the aim of
maximising availability. Sortrakul et al. (2005) present a heuristic based on genetic
algorithms to solve an integrated optimisation model for production scheduling and
preventive maintenance planning. Chan et al. (2006) propose a GA approach to
deal with the distributed flexible manufacturing system scheduling problem subject to
machine maintenance constraints. Other popular application areas for GAs include
preventive maintenance scheduling optimisation. Application areas in PM include
chemical process operations (Tan and Kramer 1997), power systems (Huang
1998), single product manufacturing production line (Cavory et al. 2001) and me-
chanical components (Tsai et al. 2001). GAs are also used in deciding on oppor-
tunistic maintenance policies (Saranga 2004; Dragan et al. 1995).
GAs have attracted moderate but constant interest over the past decade in the
area of fault diagnosis. Applications range from manufacturing systems (Khoo et
al. 2000), nuclear power plants (Yangping et al. 2000), electrical distribution net-
works (Wen and Chang 1998) to a new area of application in automotive fuel cell
power generators (Hissel et al. 2004).
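
The evolutionary principle behind these applications can be sketched in a few lines. The example below is a deliberately simplified mutation-and-selection GA (no crossover) that searches for a PM interval minimising a made-up cost function; the cost model, parameter values and function names are all assumptions for illustration and do not reproduce any of the cited studies.

```python
import random


def cost_rate(interval: float) -> float:
    """Hypothetical cost per day of carrying out PM every `interval` days."""
    pm_cost, failure_cost, failure_growth = 100.0, 1000.0, 0.001
    return pm_cost / interval + failure_cost * failure_growth * interval


def evolve_pm_interval(generations: int = 60, pop_size: int = 40,
                       low: float = 1.0, high: float = 365.0,
                       sigma: float = 10.0, seed: int = 0) -> float:
    """Mutation-and-selection GA; returns the best interval found."""
    rng = random.Random(seed)
    population = [rng.uniform(low, high) for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: the weaker half of the population becomes extinct.
        population.sort(key=cost_rate)
        survivors = population[: pop_size // 2]
        # Mutation: each survivor produces one perturbed offspring,
        # clamped to the allowed range of intervals.
        children = [min(high, max(low, s + rng.gauss(0.0, sigma)))
                    for s in survivors]
        population = survivors + children
    return min(population, key=cost_rate)


best = evolve_pm_interval()
```

For this toy cost function the exact optimum is an interval of 10 days, so the GA is unnecessary here; the approach earns its keep when the search space is large or the objective has no convenient closed form, as in the scheduling problems cited above.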

9.4.3 Neural Networks (NNs)

NNs are a popular AI technique applied in the areas of maintenance and in particular
in fault diagnosis. NNs are the primary information processing structure used in
neurocomputing i.e. systems that learn the relationship between data through a
process of training (Dendronic Decisions Ltd 2003).
NNs have many applications in the areas of predictive maintenance and
condition monitoring. Gilabert and Arnaiz (2006) present a case study for non-
critical machinery, where NN is used for elevator monitoring and diagnosis as no
previous experience existed. Al-Garni et al. (2006) also use NN for predicting the
failure rate of airplane tyres. Gromann de Araujo Goes et al. (2005) have
developed a computerised online reliability monitoring system for nuclear power
plant applications. An interesting application, developed by Garcia et al. (2004),
uses NNs to aid tele-maintenance, where staff can carry out the work remotely and
in collaboration with other experts. Other applications of NNs in condition
monitoring include the work of Bansal et al. (2004) on machine systems, Booth
and McDonald (1998) on electrical power transformers and Spoerre (1997) on
bearings. Shyur et al. (1996) use NNs to predict component inspection require-
ments for ageing aircraft and Eldin and Senouci (1995) use NNs for the condition
rating of joint concrete pavements.
Lin and Wang (1996) developed an approach combining NNs and advanced
vibration monitoring methods for online predictive maintenance of rotating ma-
chinery. Luxhoj and Williams (1996) present a hybrid NN/KBS DSS for aircraft
safety inspection.
NNs suit model based fault detection and isolation when analytical models are
not available. Frank and Koppen-Seliger (1997) define three steps for fault detec-
tion: residual generation, i.e. generation of a signal that reflects the fault, residual
evaluation, i.e. the logical decision making on the time of occurrence and location
of the fault and fault analysis, i.e., determination of the type of fault, its size and
cause. NNs have to be trained for both residual generation and evaluation, using
collected or simulated data for the former and residuals for the latter.
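
The first two steps can be sketched as follows. In an NN-based scheme the predicted signal would come from a trained network and the evaluation would be learned from residuals; here a constant prediction and a fixed threshold stand in for both, purely as an assumption to keep the example self-contained. The signal values are made up.

```python
def generate_residuals(measured, predicted):
    """Residual generation: difference between plant output and model output."""
    return [m - p for m, p in zip(measured, predicted)]


def evaluate_residuals(residuals, threshold):
    """Residual evaluation: index of the first sample whose residual
    magnitude exceeds the threshold, or None if no fault is detected."""
    for index, r in enumerate(residuals):
        if abs(r) > threshold:
            return index
    return None


# Made-up signals: the stand-in model predicts a constant 50.0 and the
# measured value drifts upwards from sample 4 onwards.
predicted = [50.0] * 8
measured = [50.1, 49.9, 50.0, 50.2, 51.5, 52.4, 53.0, 53.8]
residuals = generate_residuals(measured, predicted)
fault_onset = evaluate_residuals(residuals, threshold=1.0)
```

The third step, fault analysis, would then examine the residual pattern after the onset to classify the type and size of the fault, which is beyond this sketch.
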
There is a large number of papers published on the use of NNs in fault diagnosis
covering a wide range of applications. These include diagnosis in induction motors
(Yang and Kim 2006), marine propulsion systems (Kuo and Chang 2004),
supervision of desalination plant during dynamic states, e.g. start up (Tarifa et al.
2003), engineering structures (Chen et al. 2003), navigation systems (Zhang et al.
2001), power plants (Simani and Fantuzzi 2000) and automotive engine
management (Shayler et al. 2000). There are various applications in the chemical
process industry for using NNs in fault diagnosis, e.g., packed towers (Sharma et
al. 2004) and batch processes (Scenna 2000).
There are studies that make use of hybrid NN systems for fault diagnosis.
Yang et al. (2004) integrate CBR with an ART-KNN to enhance fault diagnosis
when solving a new problem with NN used to make hypotheses and to guide CBR
to search for similar previous cases. Jota et al. (1998) use neuro-fuzzy, neuro-
expert and fuzzy expert algorithms for fault detection in a range of electrical power
system equipment.

9.4.4 Knowledge Based Systems (KBSs)

The use of KBS in maintenance management represents one of the early
applications of AI in maintenance. Martland et al. (1990) developed a knowledge-based
expert system to guide the rail scheduling process, i.e. in developing a plan
for rail relay or replacement. Ahmed et al. (1991) developed an expert system for
offshore structure inspection and maintenance. Kobbacy (1992) argued for the use of
KBS in evaluation and enhancement of maintenance routines. Batanov et al.
(1993) developed EXPERT-MM, an expert system that supports maintenance
policy suggestions, machine diagnosis and maintenance scheduling. Feldman et al.
(1992) designed a rule-based expert system to investigate maintenance policies
with regard to replacement, minimal repair or no action in continuous manufacturing
environments. Srinivasan et al. (1993) present an intelligent scheduling
system using KBS for application on a power-distributed system. Drury and
Prabhu (1996) provide a framework for information design that captures the
interaction between the inspection task and its information requirements in the
operation of commercial aircraft. The framework is used together with the
cognitive control categories of skill-rule-knowledge-based behaviour to analyse
information needs of aircraft inspectors. de Brito et al. (1997) developed a prototype
system for optimising the inspection, maintenance and repair strategies for
bridges. A fuzzy knowledge based method for maintenance planning in power
systems is demonstrated by Sergaki and Kalaitzakis (2002). In addition the work of
Kobbacy and Jeon (2001) is discussed later (Section 9.5.3).
KBSs also have a wide range of applications in fault diagnosis, which, unlike
applications in maintenance planning, are showing an increasing trend. KBSs can be used
in all three phases of fault diagnosis (see Section 9.4.3). In the case of complex
systems where there is insufficient information to formulate a mathematical model,
KBSs have been particularly successful.
Examples of applications of KBS in fault diagnosis include diagnosing electri-
cal failures in induction motors (Acosta et al. 2006), fault diagnosis of rotating
machinery (Yang et al. 2005), CNC machine-tools (Leung and Romagnoli 2002),
industrial gas turbines (Milne et al. 2001), research reactors (Varde et al. 1998),
power transmission networks (Baroni et al. 1997), real-time fault detection of
greenhouse sensors (Beaulah and Chalabi 1997), monitoring, diagnosis and
optimisation of a coal washing plant (Villanueva and Lamba 1997), continuous and
semi-continuous chemical processes (Nam et al. 1996) and in diagnosis and
maintenance of robotic systems (Patel et al. 1995). Miller et al. (1990) developed a
vehicle trouble-shooting expert system which has integrated imaging capability.
The system is used to diagnose maintenance problems in the electrical/ hydraulic
Hybrid KBS applications in fault diagnosis include the KBS/NNs
application in batch chemical plants (Ruiz et al. 2001). Frank and Ding (1997)
outline advances in the theory of observer-based fault diagnosis in dynamic
systems covering the use of AI including KBSs and NNs.

9.4.5 Fuzzy Logic (FL)

FL has been used in various applications in the maintenance area to deal with
uncertainty. Oke and Charles-Owaba (2006) apply an FL control model to Gantt
charting preventive maintenance scheduling. Al-Najjar and Alsyouf (2003) use
fuzzy multiple criteria decision making to select in advance the most informative
(efficient) maintenance approach, i.e. strategies, policies or philosophies. Braglia et
al. (2003) adopt FL to help analysts efficiently formulate assessments of the
possible causes of failure in failure mode, effects and criticality analysis.
Sudiarso and Labib (2002) investigated an FL approach to an integrated maintenance/
production scheduling algorithm. Jeffries et al. (2001) develop an efficient hybrid
method for capturing machine information in a packaging plant using FL, fuzzy
condition monitoring, in order to reduce wastage and maintenance overheads.
Examples of FL hybrid applications include the use of a KBS for bridge
damage diagnosis which aims at providing information about the impact of design
factors on bridge deterioration with FL used to handle uncertainties (Zhao and
Chen 2001). Sinha and Fieguth (2006) propose a neuro-fuzzy classifier that combines
FL and NNs for the classification of defects by extracting features in segmented
buried pipe images.
Applications for FL in fault diagnosis include fault diagnosis of railway wheels
(Skarlatos et al. 2004), thrusters for an open-frame underwater vehicle (Omerdic
and Roberts 2004), chemical processes (Dash et al. 2003) and rolling element
bearings in machinery (Mechefske 1998).

9.5 The Hybrid Intelligent Maintenance Optimisation System

In this section we discuss the Hybrid Intelligent Maintenance Optimisation System
(HIMOS).

9.5.1 Why Intelligent Maintenance DSSs are Needed

Optimisation of the maintenance policies of complex technical systems, such as
telecommunication systems and complex manufacturing plants, can prove to be
difficult. With the developments in information technology over the past two
decades, many organisations with complex technical systems have developed
maintenance databases. Though the stored history data is potentially very useful to
the maintenance engineer aiming to improve maintenance policies, in many cases
the data are mainly used to produce simple statistics for management reporting.
This is not due to the lack of interest on the part of maintenance practitioners, but
to the challenging nature of these systems. The following difficulties are likely to
be encountered while attempting to optimise the maintenance routines of complex

1. The system contains a large number of sub-systems and components. This
gives rise to a wide variety of maintenance situations that can be handled
using different models and methods.
2. For a maintenance engineer, optimising the maintenance routines using
available software packages, a familiarity with maintenance modelling in
addition to engineering expertise is required.
3. Even if engineers with such experience were available, the time required to
examine a large number of components using this type of software can be
4. The changeable nature of large technical systems, e.g. replacement of
components with different types or modification of design, will present
constant challenges.

All these difficulties accentuate the need to develop special computerised
systems that can cope with the management of complex engineering systems.
Intelligent DSSs are a candidate.

9.5.2 The Required Functional Features of An Intelligent Maintenance DSS

The main functional features which would be expected of such a system, to cope
with the above situation are (Kobbacy 2004):

1. To access the history data from a maintenance database.
2. To check the quality of data.
3. To recognise characteristic data patterns.
4. To query the user for additional information, judgements and criteria.
5. To select the most appropriate PM scheduling model for the decision analysis.
6. To optimise the selected model, evaluate the current policy and propose
an optimal maintenance policy.
7. To present the results of the analysis in a flexible format.
8. To respond to user enquiries, perform ‘What if?’ decision modelling and
provide explanations of the recommended decisions.
9. To have learning capabilities.
10. To have a user friendly Windows interface.

In the following section we will present one specific application of intelligent
systems in maintenance, namely the development of an intelligent system to
schedule PM for complex technical systems. This section is based on the work of
the author with others (see references below).

9.5.3 The Hybrid Intelligent Maintenance Optimisation System (HIMOS)

HIMOS aims at deciding the optimal PM cycle interval for a repairable system by
selecting and applying the most appropriate optimisation model automatically and
without the need for expert intervention (Kobbacy and Jeon 2001). HIMOS is the
result of developing its predecessor IMOS (Kobbacy et al. 1995b), the intelligent
maintenance optimisation system, which used rule based reasoning to select an
appropriate model for analysis. HIMOS employs hybrid reasoning by combining
rule-based reasoning (RBR) and case-based reasoning (CBR) to choose a model
from a model base for a given data set.
Analysis of a typical large data file by IMOS showed that about two thirds of
components cannot be modelled, mostly because of insufficient history data
needed for model selection (Kobbacy 2004). However, some of the cases which
could not be modelled may have parameters with values close to those of a model’s
acceptance level as stated in the rulebase. By introducing case based reasoning, the
system can model cases which are not identified by the rule base but which resemble
cases it has analysed in the past. Thus, such a hybrid (KBS and CBR) system is
expected to increase the previously low percentage of cases where the system
is able to identify a suitable model.

Figure 9.1. Outline Design of HIMOS (Kobbacy and Jeon 2001)

Figure 9.1 illustrates the conceptual structure of HIMOS which is divided into
two areas. The DSS contribution area contains a database to store maintenance
historical data, a model base for data analysis models and optimisation models, and
a user interface to communicate with the user. In the AI contribution area, there are
two bases which contain experts’ knowledge: the knowledge base and the case base.

HIMOS Procedure

Figure 9.2 illustrates the model selection for a data set consisting of a sequence of
preventive maintenance (PM) and corrective action (CO) events to enable calcu-
lating the optimal PM interval. HIMOS has the ability to use a set of production
rules to select and then optimise a suitable model in order to provide an evaluation
of the current maintenance routine and to propose an optimal policy. These rules
are acquired from experts’ knowledge and may require subjective judgements to be
made. The processor of HIMOS identifies data patterns through a data analysis
procedure and then selects the most appropriate model for a given data set by
consulting the rule base. If a data set cannot be matched by any of the KBS rules,
then the system attempts to use CBR to identify a suitable model.

Figure 9.2. Model Selection In HIMOS (Kobbacy and Jeon 2001)

Data Formatting and Analysis After reading data from the input data file, the
system formats and checks the data to create a suitable data set for the next step of
analysis. Suspect or missing items of data are flagged in order to be sorted out by
the system or investigated by the user.
The analysis consists of five steps: recognition of PM and CO patterns, calcu-
lation of current availability, Weibull distribution fitting to failure times, trend test
of frequency and severity to establish data stationarity with respect to frequency
and severity or otherwise, and if applicable analysis of Multi-PM cases. In the first
step a basic analysis is carried out to identify the features of the data set such as the
numbers of PM and CO events and the mean lives to failure, so that the data set
can be compared with characteristic data patterns in the model selection process.
The data produced in this process are referred to as ‘metadata’.
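
A minimal sketch of the first two analysis steps is given below. The event format, the field names and the use of the mean time between successive CO events as a stand-in for mean life to failure are all assumptions made for illustration; the actual HIMOS input format and calculations are not described here.

```python
from statistics import mean

# Hypothetical event history: (day of event, kind, downtime in days).
events = [
    (30, "PM", 0.5), (55, "CO", 2.0), (60, "PM", 0.5), (90, "PM", 0.5),
    (104, "CO", 1.5), (120, "PM", 0.5), (150, "PM", 0.5), (171, "CO", 3.0),
]
OBSERVATION_DAYS = 180


def summarise(events, observation_days):
    """Return basic 'metadata' for one component's PM/CO history."""
    n_pm = sum(1 for _, kind, _ in events if kind == "PM")
    co_days = [day for day, kind, _ in events if kind == "CO"]
    downtime = sum(d for _, _, d in events)
    # Gaps between successive failures (CO events), in days.
    gaps = [later - earlier for earlier, later in zip(co_days, co_days[1:])]
    return {
        "n_pm": n_pm,
        "n_co": len(co_days),
        "availability": 1.0 - downtime / observation_days,
        "mean_days_between_co": mean(gaps) if gaps else None,
    }


metadata = summarise(events, OBSERVATION_DAYS)
```

A summary of this kind is what the model selection step would compare against the characteristic data patterns.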

Model Base The model base contains two sets of models: the data analysis models
and the PM scheduling optimisation models. The data analysis models identify a
data pattern which together with the RBR/CBR help to select an optimisation
model. The optimisation models are a set of mathematical models of maintenance
policies which evaluate current policies and, in certain circumstances, recommend
optimal policy. These models deal with components rather than systems and they
assume independence of components, i.e. the failure of one component does not
affect the performance of another. The models in the model base are classified
into single-PM and multi-PM models and, for the former case, into stationary and
nonstationary models. Stationary models deal with the data sets in which no trend
is found. If there are frequency or severity trends, then a nonstationary model can be
used. Multi-PM models are structured to deal with components subject to more
than one PM routine. The IMOS model base includes 21 different models. The description
of the models used in HIMOS can be found elsewhere (Kobbacy and Jeon

Model Selection Using the Rule-Base In HIMOS the rule base (or knowledge base)
consists of a list of rules capturing some of the knowledge of experts in maintenance
modelling concerning mathematical modelling techniques and their applicability to
various situations. The rules match data sets to the models by searching for patterns
in the data set for each component such as relative numbers of CO and PM events,
component life distribution, range of PM intervals, etc. The approach used to develop
the rule base is described in Kobbacy (2004). The knowledge base implemented in
HIMOS consists of a set of 15 rules, examples of which are shown below. If the
rule base fails to identify a suitable model, CBR is invoked.

RULE 1: If    Not matched
        and   There are multi-PMs
        Apply Multi-PM Model

RULE 2: If    Not matched
        and   Trend test statistic of frequency is significantly large
        and   Trend test statistic of severity of CO is significantly large
        and   Trend test statistic of severity of PM is significantly large
        Apply NHPPScoSpm Model

RULE 3: If    Not matched
        and   Trend test statistic of frequency is significantly large
        and   Trend test statistic of severity of CO is significantly large
        Apply NHPPSco Model
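
The three rules above can be expressed as an ordered chain of checks in which the first match wins, which is one reading of the "If Not matched" condition. In the sketch below the `Metadata` fields, the function name and the critical value of 1.96 used for "significantly large" are hypothetical; the chapter does not state the thresholds used in HIMOS.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metadata:
    """Hypothetical summary of one component's data set."""
    multi_pm: bool
    freq_trend: float          # trend test statistic for failure frequency
    co_severity_trend: float   # trend test statistic for severity of CO
    pm_severity_trend: float   # trend test statistic for severity of PM


def select_model(m: Metadata, critical: float = 1.96) -> Optional[str]:
    """Try the rules in order; the first rule that matches decides."""
    if m.multi_pm:                              # RULE 1
        return "Multi-PM"
    freq = m.freq_trend > critical
    co = m.co_severity_trend > critical
    pm = m.pm_severity_trend > critical
    if freq and co and pm:                      # RULE 2
        return "NHPPScoSpm"
    if freq and co:                             # RULE 3
        return "NHPPSco"
    return None  # no rule matched: this is where CBR would be invoked
```

Returning `None` rather than raising an error keeps the fallback to CBR explicit for the caller.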

Model Selection Using CBR CBR is an approach to problem solving that utilises
past experiences to solve new problems. The first step in the operation of a CBR
system is the retrieval in which the inputs are analysed to determine the critical
features to use in retrieving past cases from the case database. Among the well
known methods for case retrieval is the nearest neighbour which is used in
HIMOS. To find the nearest neighbour matching the case being considered, the
case with the largest weighted average of similarity functions for selected features
is selected. In HIMOS four features were selected and all given equal weights.
These features are: number of PM, number of CO, trend value and variability of
PM cycle length. The reason for selecting these features is that they were found to
be the main causes for failure to select a suitable model using the rule based
system. The similarity function was selected as the difference between the values
of a feature in the current and retrieved cases divided by the standard deviation of the
feature.
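
The retrieval step can be sketched as follows. The text describes the per-feature measure as a difference scaled by a standard deviation; this sketch treats that quantity as a distance and retrieves the case with the smallest equally weighted average, which is one possible reading rather than the documented HIMOS implementation. The case base contents, feature names and model labels attached to the cases are made up.

```python
from statistics import pstdev

FEATURES = ("n_pm", "n_co", "trend", "pm_cycle_variability")

# Hypothetical case base: feature values plus the model chosen for each case.
CASE_BASE = [
    {"n_pm": 12, "n_co": 3, "trend": 0.4, "pm_cycle_variability": 0.10,
     "model": "RP"},
    {"n_pm": 5, "n_co": 9, "trend": 2.1, "pm_cycle_variability": 0.35,
     "model": "NHPP"},
    {"n_pm": 20, "n_co": 1, "trend": 0.1, "pm_cycle_variability": 0.05,
     "model": "Geometric I"},
]


def retrieve(current, case_base, weights=None):
    """Return the stored case with the smallest weighted mean scaled difference."""
    weights = weights or {f: 1.0 for f in FEATURES}  # equal weights
    # Spread of each feature over the case base; fall back to 1.0 if constant.
    spread = {f: pstdev([c[f] for c in case_base]) or 1.0 for f in FEATURES}

    def distance(case):
        total = sum(weights[f] * abs(current[f] - case[f]) / spread[f]
                    for f in FEATURES)
        return total / sum(weights.values())

    return min(case_base, key=distance)


query = {"n_pm": 11, "n_co": 4, "trend": 0.6, "pm_cycle_variability": 0.12}
nearest = retrieve(query, CASE_BASE)
```

Scaling by the spread of each feature stops a feature with large numeric values, such as the number of PM events, from dominating the comparison.
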
Once the best matching case has been retrieved, adaptation is carried out to
reduce any prominent difference between the retrieved case and the current case
through the derivational replay method. Thus in the CBR phase, the system uses
rules similar to those used in the KBS phase to find a solution. However, some
critical values in the adaptation rules are more relaxed compared with the original
rules.
In the evaluation step the system displays multiple candidate models (possible
solutions) with their critical features for the current case (adaptation results). The
user can then evaluate these alternatives and select one using their expertise.
For the non-expert user, the system itself provides the ‘Recommended Model’
as a result of evaluation. Here the system compares the results of adaptation with
the results of retrieval. If there is no matching model then no recommendation is
made, otherwise the system recommends the matching model. If there is more than
one matching model, the system merely recommends the first ranked (nearest
neighbour) model.

Results and Validation of HIMOS

HIMOS results for a component include some basic statistics, e.g. the numbers of
PM and CO events and the current availability. The most important result from the
decision-maker’s point of view is the recommended PM interval. The optimal availability
gives an estimate of availability which might be achieved if the recommended PM
policy is implemented.
Table 9.1 shows the percentage success rate of HIMOS in modelling a large
number of components. As can be seen, around two thirds of the components could
not be modelled because no rule matched the data to a specific model. The
introduction of case-based reasoning can add to the success rate of modelling
components. The table also shows that the introduction of CBR reduces the percentage
of cases where no suitable model was identified from 68.6 % to 52.7 %. Given the
self-learning nature of CBR where the case base expands with use, it is possible to
improve the success rate with the extended use of the system in certain

Table 9.1. Percentage use of maintenance models for HIMOS when applied to large
systems, 1633 components in three data files (Kobbacy 2004)

Model                   Before CBR (%)   With CBR, HIMOS* (%)

Stochastic
  RP                          6.6               12.8
  NHPP                        1.6                1.6
  NRP                         2.3                2.3
  Total stochastic           10.5               16.4
Geometric I                  15.7               23.5
Geometric II                  1.7                1.8
Weibull                       1.7                3.7
Deterministic                 1.8                1.9
No model suitable            68.6               52.7
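As a quick arithmetic check of Table 9.1, the sketch below recomputes the share of components that could be modelled. It assumes, following the accompanying text, that the first numeric column is the result before CBR was introduced and the second is the result with CBR; the dictionary layout is only illustrative.

```python
# Percentages copied from Table 9.1: (before CBR, with CBR).
table_9_1 = {
    "Total stochastic": (10.5, 16.4),
    "Geometric I": (15.7, 23.5),
    "Geometric II": (1.7, 1.8),
    "Weibull": (1.7, 3.7),
    "Deterministic": (1.8, 1.9),
    "No model suitable": (68.6, 52.7),
}

# Each column should account for all components.
total_before = round(sum(v[0] for v in table_9_1.values()), 1)
total_with_cbr = round(sum(v[1] for v in table_9_1.values()), 1)

# Share of components for which some model was identified.
no_model_before, no_model_with_cbr = table_9_1["No model suitable"]
modelled_before = round(100 - no_model_before, 1)
modelled_with_cbr = round(100 - no_model_with_cbr, 1)
gain = round(modelled_with_cbr - modelled_before, 1)

print(total_before, total_with_cbr)
print(modelled_before, modelled_with_cbr, gain)
```

Read this way, the share of components with an identified model rises from just under a third to just under a half.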

HIMOS was validated using test cases by comparing the results of analysis of
selected cases by HIMOS with the recommendations of an expert panel.
For the validation of HIMOS, eight data sets were used and a panel of five experts
was involved. In general there was agreement between HIMOS and the experts.
The experts had a measure of disagreement in their advice as a result of making
different assumptions in their analysis. Experts also made useful suggestions for the
operation of the system. Table 9.2 is a typical example of HIMOS and the experts’

Table 9.2. Example of validation of IMOS

Data Set 3

HIMOS      Increase PM interval from 177 to 403 days (CBR-RPOW model)

Expert A   There is no evidence of trend. Increase PM interval but should
           not be allowed to approach 600 days

Expert B   Unless failure has substantive safety or risk association,
           PM could be extended.

Expert C   Optimal PM interval is found to be 404 days

Expert D   Increase PM interval

Expert E   Increase PM interval to 250 days

9.6 Future Developments

In approaching the problem of maintenance management of complex engineering
systems one can identify two broad levels for tackling the maintenance issues
(Kobbacy and Labib 2003). At a higher decision-making level, one is usually con-
cerned with effectiveness issues such as prioritising machines, modes of failure and
types of maintenance actions that will lead to improving systems operations. At the
lower decision level, one is concerned with maintenance efficiency issues, e.g. PM
interval. Researchers tend to address either the higher-level issues of effectiveness
or the lower-level issues of efficiency.
Labib (1998) proposed two techniques to identify effective maintenance poli-
cies at higher levels; namely the rule based decision making grid (DMG) and the
analytic hierarchy process (AHP). The AHP is a technique for prioritisation that
relies on modelling a problem into a hierarchical structure of a goal, at the apex,
and levels of criteria and alternatives at the bottom. The DMG acts as a map where
the machines with the worst performances are placed, based on selected multiple
criteria. These criteria, such as downtime and frequency of breakdowns, are
determined through prioritisation based on the AHP approach. The objective is to
take maintenance actions to improve the machines’ performance as measured by
the selected multiple criteria. This approach is discussed in Chapter 17.
In order to tackle the issue of efficient maintenance management for complex
engineering systems, Kobbacy (1992) proposed integrating artificial intelligence
techniques such as rule based reasoning with mathematical modelling. Such an
approach allows automated modelling of large amounts of maintenance data to carry
out analysis and propose an optimal maintenance schedule, i.e. frequency of PM. This
approach has been explained in Section 9.5.
Kobbacy and Labib (see Section 9.8) propose merging their approaches of
DMG and HIMOS in order to develop an integrated approach to
‘effective’ and ‘efficient’ maintenance management. Figure 9.3 outlines
the design of such a futuristic system. The proposed concept emphasises the sharing
of data and tools between the two models while maintaining their distinct features
and allowing flow of information between them.

Figure 9.3. Outline design of AMMCM (Adaptive Maintenance Measurement and Control

9.7 Concluding Remarks

There have been many developments in the use of AI in the maintenance area, and
hundreds of papers have been published. Kobbacy et al. (2007) have
shown that the number of publications using NNs and GAs in maintenance has
increased in the past few years, which can be explained by their use in
optimising complex and nonlinear problems (see Section 9.4). There is an apparent
increase in the use of hybrid approaches that exploit the combined strengths of their
components. There is enormous potential for developments in many applications of AI
in maintenance by combining two or more AI techniques. Multiple hybrid intelligent
management systems (MHIMS) are potentially powerful tools that can help in making the
right decisions right, i.e. making effective and efficient decisions. The author has a vision
that such MHIMS may be assembled in the future from off-the-shelf modules,
resulting in a reduction in the time and cost of development.

9.8 Acknowledgments
The author wishes to acknowledge the contributions of those who collaborated at
the various stages of the development of IMOS and HIMOS. In particular I wish to
acknowledge the significant contribution of A.L. Labib in developing the proposal
for the AMMCM presented in Section 9.6.

9.9 References
Acosta, G.G., Verucchi, C.J. and Gelso, E.R. (2006), A current monitoring system for
diagnosing electrical failures in induction motors, Mechanical Systems and Signal
Processing, 20, 953–965.
Ahmed, K., Langdon, A. and Frieze, P.A. (1991), An expert system for offshore structure
inspection and maintenance, Computers and Structures, 40, 143–159.
Al-Garni, A.Z., Jamal, A., Ahmad, A.M., Al-Garni, A.M. and Tozan, M. (2006), Neural
network-based failure rate prediction for De Havilland Dash-8 tires, Engineering
Applications of Artificial Intelligence, 19, 681–691.
Al-Najjar, B. and Alsyouf, I. (2003), Selecting the most efficient maintenance approach
using fuzzy multiple criteria decision making, International Journal of Production
Economics, 84, 85–100.
Ascher, H.E. and Kobbacy, K.A.H. (1995), Modelling preventive maintenance for
deteriorating repairable systems, IMA Journal of Mathematics Applied in Business &
Industry, 6, 85–99.
Bansal, D., Evans, D.J. and Jones, B. (2004), A real-time predictive maintenance system for
machine systems, International Journal of Machine Tools and Manufacture, 44,
Baroni, P., Canzi, U. and Guida, G. (1997), Fault diagnosis through history reconstruction:
an application to power transmission networks, Expert Systems with Applications, 12,
Batanov, D., Nagarue, N. and Nitikhunkasem, P. (1993), EXPERT-MM: A knowledge-based
system for maintenance management, Artificial Intelligence in Engineering, 8, 283–291.
Beaulah, S.A. and Chalabi, Z.C. (1997), Intelligent real-time fault diagnosis of greenhouse
sensors, Control Engineering Practice, 5, 1573–1580.
Booth, C. and McDonald, J.R. (1998), The use of artificial neural networks for condition
monitoring of electrical power transformers, Neurocomputing, 23, 97–109.
Braglia, M., Frosolini, M. and Montanari, R. (2003), Fuzzy criticality assessment model for
failure modes and effects analysis, International Journal of Quality & Reliability
Management, 20, 503–524.
Cavory, G., Dupas, R. and Goncalves, G. (2001), A genetic approach to the scheduling of
preventive maintenance tasks on a single product manufacturing production line,
International Journal of Production Economics, 74, 135–146.
Chan, F.T.S., Chung, S.H., Chan, L.Y., Finke, G. and Tiwari, M.K. (2006), Solving
distributed FMS scheduling problems subject to maintenance: Genetic algorithms
approaches, Robotics and Computer-Integrated Manufacturing, 22, 493–504.
Chen, Q., Chan, Y.W. and Worden, K. (2003), Structural fault diagnosis and isolation using
neural networks based on response-only data, Computers & Structures, 81, 2165–2172.
Chootinan, P., Chen, A., Horrocks, M.R. and Bolling, D. (2006), A multi-year pavement
maintenance program using a stochastic simulation-based genetic algorithm approach,
Transportation Research Part A: Policy and Practice, 40, 725–743.
Cunningham, P., Smyth, B. and Bonzano, A. (1998), An incremental retrieval mechanism
for case-based electronic fault diagnosis, Knowledge-Based Systems, 11, 239–248.
Dash, S., Rengaswamy, R. and Venkatasubramanian, V. (2003), Fuzzy-logic based trend
classification for fault diagnosis of chemical processes, Computers & Chemical
Engineering, 27, 347–362.
de Brito, J., Branco, F.A., Thoft-Christensen, P. and Sorensen, J.D. (1997), An expert
system for concrete bridge management, Engineering Structures, 19, 519–526.
Dendronic Decisions Ltd (2003),
Dhaliwal, D.S. (1986), The use of AI in maintaining and operating complex engineering
systems, in Expert Systems and Optimisation in Process Control, A. Mamdani and J.
Efstathiou, eds, 28–33. Gower Technical Press, Aldershot.
Dragan, A.S., Walters, G.A. and Knezevic, J. (1995), Optimal opportunistic maintenance
policy using genetic algorithms, 1 formulation, Journal of Quality in Maintenance
Engineering, 1, 34–49.
Drury, C.G. and Prabhu, P. (1996), Information requirements of aircraft inspection:
framework and analysis, International Journal of Human-Computer Studies, 45, 679–695.
Eldin, N.N. and Senouci, A.B. (1995), Use of neural networks for condition rating of joint
concrete pavements, Advances in Engineering Software, 23, 133–141.
Feldman, R.M., William, M.L., Slade, T., McKee, L.G. and Talbert, A. (1992), The
development of an integrated mathematical and knowledge-based maintenance delivery
system, Computers & Operations Research, 19, 425–434.
Firebaugh, M.W. (1988), Artificial Intelligence: A Knowledge-based Approach, Boyd &
Fraser Publishing Co. Danvers, MA, USA.
Frank, P.M. and Ding, X. (1997), Survey of robust residual generation and evaluation
methods in observer-based fault detection systems, Journal of Process Control, 7,
Frank, P.M. and Koppen-Seliger, B. (1997), New developments using AI in fault diagnosis,
Engineering Applications of Artificial Intelligence, 10, 3–14.
Garcia, E., Guyennet, H., Lapayre, J.C. and Zerhouni, N. (2004), A new industrial
cooperative tele-maintenance platform, Computers & Industrial Engineering, 46,
Gilabert, E. and Arnaiz, A. (2006), Intelligent automation systems for predictive
maintenance: A case study, Robotics and Computer Integrated Manufacturing, 22, 543–549.
Gits, C.W. (1984), On the maintenance concept for a technical system, PhD Thesis,
Eindhoven Technische Hogeschool, Eindhoven.
Gromann de Araujo Goes, A., Alvarenga, M.A.B. and Frutuoso e Melo, P.F. (2005),
NAROAS: a neural network-based advanced operator support system for the assessment
of systems reliability, Reliability Engineering & System Safety, 87, 149–161.
Hissel, D., Pera, M.C. and Kauffmann, J.M. (2004), Diagnosis of automotive fuel cell power
generators, Journal of Power Sources, 128, 239–246.
Huang, S.J. (1998), Hydroelectric generation scheduling – an application of
genetic-embedded fuzzy system approach, Electric Power Systems Research, 48, 65–72.
Hui, S.C., Fong, A.C.M. and Jha, G. (2001), A web-based intelligent fault diagnosis system
for customer service support, Engineering Applications of Artificial Intelligence, 14,
Jeffries, M., Lai, E., Plantenberg, D.H. and Hull, J.B. (2001), A fuzzy approach to the
condition monitoring of a packaging plant, Journal of Materials Processing Technology,
109, 83–89.
Jha, M.K. and Abdullah, J. (2006), A Markovian approach for optimising highway life-cycle
with genetic algorithms by considering maintenance of roadside appurtenances, Journal
of the Franklin Institute, 343, 404–419.
Jota, P.R.S., Islam, S.M., Wu, T. and Ledwich, G. (1998), A class of hybrid intelligent
system for fault diagnosis in electric power systems, Neurocomputing, 23, 207–224.
Khoo, L.P., Ang, C.L. and Zhang, J. (2000), A fuzzy-based genetic approach to the
diagnosis of manufacturing systems, Engineering Applications of Artificial Intelligence,
13, 303–310.
Kobbacy, K.A.H. (1992), The use of knowledge-based systems in evaluation and
enhancement of maintenance routines, International Journal of Production Economics, 24, 243–
Kobbacy, K.A.H. (2004), On the evolution of an intelligent maintenance optimisation
system, Journal of the Operational Research Society, 55, 139–146.
Kobbacy, K.A.H. and Jeon, J. (2001), The development of a hybrid intelligent maintenance
optimisation system (HIMOS), Journal of the Operational Research Society, 52,
Kobbacy, K.A.H., Percy, D.F. and Fawzi, B.B. (1995a), Sensitivity analysis for preventive
maintenance models, IMA Journal of Mathematics Applied in Business & Industry, 6, 53–
Kobbacy, K.A.H., Proudlove, N.L. and Harper, M.A. (1995b), Towards an intelligent
maintenance optimisation system, Journal of the Operational Research Society, 46,
Kobbacy, K.A.H., Fawzi, B.B., Percy, D.F. and Ascher, H.E. (1997), A full history
proportional hazards model for preventive maintenance modelling, Journal of Quality
and Reliability Engineering International, 13, 187–198.
Kobbacy, K.A.H., Percy, D.F. and Sharp, J.M. (2005), Results of preventive maintenance
survey, unpublished report, University of Salford.
Kobbacy, K.A.H., Vadera, S. and Rasmy, M.H. (2007), AI and OR in management of
operations: history and trends, Journal of the Operational Research Society, 58, 10–28.
Kohno, T., Hamada, S., Araki, D., Kojima, S. and Tanaka, T. (1997), Error repair and
knowledge acquisition via case-based reasoning, Artificial Intelligence, 91, 85–101.
Kuo, H-C. and Chang, H-K. (2004), A new symbiotic evolution-based fuzzy-neural approach
to fault diagnosis of marine propulsion systems, Engineering Applications of Artificial
Intelligence, 17, 919–930.
Labib, A.W. (1998), World class maintenance using a computerised maintenance
management system, Journal of Quality in Maintenance Engineering, 4, 66–75.
Lee, C-K. and Kim, S-K. (2007), GA-based algorithm for selecting optimal repair and
rehabilitation methods for reinforced concrete (RC) bridge decks, Automation in
Construction, 16, 153–164.
Leung, D. and Romagnoli, J. (2002), An integration mechanism for multivariate
knowledge-based fault diagnosis, Journal of Process Control, 12, 15–26.
Lin, C-C. and Wang, H-P. (1996), Performance analysis of rotating machinery using
enhanced cerebellar model articulation controller (E-CMAC) neural networks,
Computers and Industrial Engineering, 30, 227–242.
Luxhoj, J.T. and Williams, T.P. (1996), Integrated decision support for aviation safety
inspectors, Finite Elements in Analysis and Design, 23, 381–403.
Marseguerra, M., Zio, E. and Podofillini, L. (2004), A multiobjective genetic algorithm
approach to optimisation of the technical specifications of a nuclear safety system,
Reliability Engineering & System Safety, 84, 87–99.
Martland, C.D., McNeil, S., Acharya, D., Mishalani, R. and Eshelby, J. (1990), Applications
of expert systems in railroad maintenance: Scheduling rail relays, Transportation
Research Part A: General, 24, 39–52.
Mechefske, C.K. (1998), Objective machinery fault diagnosis using fuzzy logic, Mechanical
Systems and Signal Processing, 12, 855–862.
Microsoft ENCARTA College Dictionary (2001), St Martin’s Press, N.Y.
Miller, D., Mellichamp, J.M. and Wang, J. (1990), An image enhanced knowledge based
expert system for maintenance trouble shooting, Computers in Industry, 15, 187–202.
Milne, R., Nicole, C. and Trave-Massuyes, L. (2001), TIGER with model based diagnosis:
initial deployment, Knowledge-Based Systems, 14, 213–222.
Morcous, G. and Lounis, Z. (2005), Maintenance optimisation of infrastructure networks
using genetic algorithms, Automation in Construction, 14, 129–142.
Nam, D.S., Jeong, C.W., Choe, Y.J. and Yoon, E.S. (1996), Operation-aided system for fault
diagnosis of continuous and semi-continuous processes, Computers & Chemical
Engineering, 20, 793–803.
Oke, S.A. and Charles-Owaba, O.E. (2006), Application of fuzzy logic control model to
Gantt charting preventive maintenance scheduling, International Journal of Quality &
Reliability Management, 23, 441–459.
Omerdic, E. and Roberts, G. (2004), Thruster fault diagnosis and accommodation for
open-frame underwater vehicles, Control Engineering Practice, 12, 1575–1598.
Patel, S.A., Kamrani, A.K. and Orady, E. (1995), A knowledge-based system for fault
diagnosis and maintenance of advanced automated systems, Computers & Industrial
Engineering, 29, 147–151.
Percy, D.F., Kobbacy, K.A.H. and Ascher, H.E. (1998), Using proportional intensities
models to schedule preventive maintenance intervals, IMA Journal of Mathematics
Applied in Business & Industry, 9, 289–302.
Rao, M., Yang, H. and Yang, H. (1998), Integrated distributed intelligent system
architecture for incidents monitoring and diagnosis, Computers in Industry, 37, 143–151.
Ruiz, D., Canton, J., Nougues, J.M., Espuna, A. and Puigjaner, L. (2001), On-line fault
diagnosis system support for reactive scheduling in multipurpose batch chemical plants,
Computers & Chemical Engineering, 25, 829–837.
Ruiz, R., Garcia-Diaz, C. and Maroto, C. (2006), Considering scheduling and preventive
maintenance in the flowshop sequencing problem, Computers & Operations
Research, 34, 3314–3330.
Saranga, H. (2004), Opportunistic maintenance using genetic algorithms, Journal of Quality
in Maintenance Engineering, 10, 66–74.
Scenna, N.J. (2000), Some aspects of fault diagnosis in batch processes, Reliability
Engineering & System Safety, 70, 95–110.
Sergaki, A. and Kalaitzakis, K. (2002), Reliability Engineering & System Safety, 77, 19–30.
Sharma, R., Singh, K., Singhal, D. and Ghosh, R. (2004), Neural network applications for
detecting process faults in packed towers, Chemical Engineering and Processing, 43,
Shayler, P.J., Goodman, M. and Ma, T. (2000), The exploitation of neural networks in
automotive engine management systems, Engineering Applications of Artificial
Intelligence, 13, 147–157.
Shyur, H.J., Luxhoj, J.T. and Williams, T.P. (1996), Using neural networks to predict
component inspection requirements for aging aircraft, Computers & Industrial
Engineering, 30, 257–267.
Simani, S. and Fantuzzi, C. (2000), Fault diagnosis in power plant using neural networks,
Information Sciences, 127, 125–136.
Sinha, S.K. and Fieguth, P.W. (2006), Neuro-fuzzy network for the classification of buried
pipe defects, Automation in Construction, 15, 73–83.
Skarlatos, D., Karakasis, K. and Trochidis, A. (2004), Railway wheel fault diagnosis using a
fuzzy-logic method, Applied Acoustics, 65, 951–966.
Sortrakul, N., Nachtmann, H.L. and Cassady, C.R. (2005), Genetic algorithms for integrated
preventive maintenance planning and production scheduling for a single machine,
Computers in Industry, 56, 161–168.
Spoerre, J.K. (1997), Application of the cascade correlation algorithm (CCA) to bearing
fault classification problems, Computers in Industry, 32, 295–304.
Sprague, R.H. and Watson, H.J. (1986), Decision support systems – putting theory into
practice, Prentice Hall, Englewood Cliffs, New Jersey.
Srinivasan, D., Liew, A.C., Chen, J.S.P. and Chang, C.S. (1993), Intelligent maintenance
scheduling of distributed system components with operating constraints, Electric Power
Systems Research, 26, 203–209.
Sudiaros, A. and Labib, A.W. (2002), A fuzzy logic approach to an integrated maintenance/
production scheduling algorithm, International Journal of Production Research, 40,
Tan, J.S. and Kramer, M.A. (1997), A general framework for preventive maintenance
optimization in chemical process operations, Computers & Chemical Engineering, 21,
Tang, B-S., Jeong, S.K., Oh, Y-M. and Tan, A.C.C. (2004), Case-based reasoning system
with Petri nets for induction motor fault diagnosis, Expert Systems with Applications, 27,
Tarifa, E.E., Humana, D., Franco, S., Martinez, S.L., Nunez, A.F. and Scenna, N.J. (2003),
Fault diagnosis for MSF using neural networks, Desalination, 152, 215–222.
Tsai, Y-T., Wang, K-S. and Teng, H-Y. (2001), Optimizing preventive maintenance for
mechanical components using genetic algorithms, Reliability Engineering & System
Safety, 74, 89–97.
Varde, P.V., Sankar, S. and Verma, A.K. (1998), An operator support system for research
reactor operations and fault diagnosis through a connectionist framework and PSA based
knowledge based system, Reliability Engineering and System Safety, 60, 53–69.
Varma, A. and Roddy, N. (1999), ICARUS: design and deployment of a case-based
reasoning system for locomotive diagnostics, Engineering Applications of Artificial
Intelligence, 12, 681–690.
Villanueva, H. and Lamba, H. (1997), Operator guidance system for industrial plant
supervision, Expert Systems with Applications, 12, 441–454.
Wen, F. and Chang, C.S. (1998), A new approach to fault diagnosis in electrical distribution
networks using a genetic algorithm, Artificial Intelligence in Engineering, 12, 69–80.
Wu, H., Liu, Y., Ding, Y. and Qiu, Y. (2004), Fault diagnosis expert system for modern
commercial aircraft, Aircraft Engineering and Aerospace Technology, 76, 398–403.
Xia, Q. and Rao, M. (1999), Dynamic case-based reasoning for process operation support
systems, Engineering Applications of Artificial Intelligence, 12, 343–361.
Yang, B-S. and Kim, K.J. (2006), Applications of Dempster-Shafer theory in fault diagnosis
of induction motors, Mechanical Systems and Signal Processing, 20, 403–420.
Yang, B-S., Han, T. and Kim, Y-S. (2004), Integration of ART-Kohonen neural network and
case-based reasoning for intelligent fault diagnosis, Expert Systems with Applications,
26, 387–395.
Yang, B-S., Lim, D-S. and Tan, A.C.C. (2005), VIBEX: an expert system for vibration fault
diagnosis of rotating machinery using decision tree and decision table, Expert Systems
with Applications, 28, 735–742.
Yangping, Z., Bingquan, Z. and DongXin, W. (2000), Application of genetic algorithms to
fault diagnosis in nuclear power plants, Reliability Engineering & System Safety, 67,
Yu, R., Iung, B. and Panetto, H. (2003), A Multi-Agents based E-maintenance system with
case-based reasoning decision support, Engineering Applications of Artificial Intelli-
gence, 16, 321–333.
Zhang, H.Y., Chan, C.W., Cheung, K.C. and Ye, Y.J. (2001), Fuzzy artmap neural network
and its application to fault diagnosis of navigation systems, Automatica, 37, 1065–1070.
Zhao, Z. and Chen, C. (2001), Concrete bridge deterioration diagnosis using fuzzy inference
system, Advances in Engineering Software, 32, 317–325.
Part D

Problem Specific Models


Maintenance of Repairable Systems

Bo Henry Lindqvist

10.1 Introduction
A commonly used definition of a repairable system (Ascher and Feingold 1984)
states that this is a system which, after failing to perform one or more of its
functions satisfactorily, can be restored to fully satisfactory performance by any
method other than replacement of the entire system. In order to cover more realistic
applications, and to cover much recent literature on the subject, we need to extend
this definition to include the possibility of additional maintenance actions which
aim at servicing the system for better performance. This is referred to as preventive
maintenance (PM), where one may further distinguish between condition based
PM and planned PM. The former type of maintenance is due when the system exhibits
inferior performance while the latter is performed at predetermined points in
time.
Traditionally, the literature on repairable systems is concerned with modeling of
the failure times only, using point process theory. A classical reference here is
Ascher and Feingold (1984). The most commonly used models for the failure pro-
cess of a repairable system are renewal processes (RP), including the homogeneous
Poisson processes (HPP), and nonhomogeneous Poisson processes (NHPP). While
such models are often sufficient for simple reliability studies, the need for more
complex models is clear. In this chapter we consider some generalizations and
extensions of the basic models, with the aim to arrive at more realistic models which
give better fit to data. First we consider the trend renewal process (TRP) introduced
and studied in Lindqvist et al. (2003). The TRP includes NHPP and RP as special
cases, and the main new feature is to allow a trend in processes of non-Poisson
(renewal) type.
As exemplified by some real data, in the case where several systems of the
same kind are considered, there may be unobserved heterogeneity between the
systems which, if overlooked, may lead to non-optimal or possibly completely
wrong decisions. We will consider this in the framework of the TRP process,
which in Lindqvist et al. (2003) is extended to the so-called HTRP model which
includes the possibility of heterogeneity. Heterogeneity can be thought of as an
effect of an unobserved covariate.
Another extension of the basic models is to allow the systems to be preventive-
ly maintained. We review some recent research in this direction, where this situ-
ation is modeled as a competing risks problem between failure and PM. This leads
to a need for combining the theory of competing risks with repair models and point
process theory. Relevant statistical data for such analyses are found in most
modern reliability databases. The book by Bedford and Cooke (2001) contains a
chapter related to this. A general reference to competing risks is the book by Crowder

Figure 10.1. Event times ($T_i$) and sojourn times ($X_i$) of a repairable system

The last extension of the basic models to be considered in the present chapter
consists of using Markov models to model the behavior of periodically inspected
systems in between inspections, with the use of separate Markov models for the
maintenance tasks at inspections.
Recent review articles concerning repairable systems and maintenance include
Peña (2006) and Lindqvist (2006). A review of methods for analysis of recurrent
events with a medical bias is given by Cook and Lawless (2002). General books on
statistical models and methods in reliability, covering much of the topics con-
sidered here, are Meeker and Escobar (1998) and Rausand and Høyland (2004).

10.2 Point Process Approach

10.2.1 Notation and Basic Definitions

Consider a repairable system where time usually runs from $t = 0$ and events occur
at ordered times $T_1, T_2, \ldots$. Here time is not necessarily calendar time, but can be for
example operation time, number of cycles, number of kilometers run, length of a
crack, etc. In the present treatment we shall disregard time durations of repair and
maintenance, and assume that the system is always restarted immediately after
failure or maintenance action. The inter-event, or inter-failure, times will be
denoted $X_1, X_2, \ldots$. Here $X_i = T_i - T_{i-1}$, $i = 1, 2, \ldots$, where for convenience we
define $T_0 \equiv 0$. Figure 10.1 illustrates the notation. We also make use of the counting
process representation $N(t)$ = number of events in $(0, t]$.
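The notation can be made concrete with a small sketch. The function names below are hypothetical and the event times are assumed to be given as a sorted list; the code simply computes the inter-event times $X_i = T_i - T_{i-1}$ with $T_0 = 0$ and the counting process $N(t)$.

```python
from bisect import bisect_right


def inter_event_times(event_times):
    """Return X_1, X_2, ... where X_i = T_i - T_{i-1} and T_0 = 0."""
    previous = 0.0
    xs = []
    for t in event_times:
        xs.append(t - previous)
        previous = t
    return xs


def count_events(event_times, t):
    """N(t): number of events in the interval (0, t]; event_times must be sorted."""
    return bisect_right(event_times, t)
```

For example, with events at 2.0, 5.0 and 9.5 the inter-event times are 2.0, 3.0 and 4.5, and an event occurring exactly at time $t$ is counted in $N(t)$ because the interval is closed on the right.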
In order to describe probability models for repairable systems we use some nota-
tion from the theory of point processes. A key reference is Andersen et al. (1993).
$\mathcal{H}_t$ denotes the history of the failure process up to, but not including, time $t$.
The conditional intensity of the process at time $t$ is defined as

$$\gamma(t) = \lim_{\Delta t \downarrow 0} \frac{\Pr(\text{event of type } j \text{ in } [t, t + \Delta t) \mid \mathcal{H}_t)}{\Delta t}. \tag{10.1}$$

From this we obtain an expression for the likelihood function, which is needed for
statistical inference. Suppose that a single system as described above is observed
from time 0 to time $\tau$, resulting in observations $T_1, T_2, \ldots, T_{N(\tau)}$. The likelihood
function is then given by (Andersen et al. 1993, Section II.7)

$$L = \left\{ \prod_{i=1}^{N(\tau)} \gamma(T_i) \right\} \exp\left\{ -\int_0^{\tau} \gamma(u)\,du \right\}. \tag{10.2}$$

10.2.2 Perfect and Minimal Repair Models

Consider a system with failure rate $z(t)$. Suppose first that after each failure, the
system is repaired to a condition as good as new, called a perfect repair. In this
case the failure process can be modeled by a renewal process (RP) with inter-event
time distribution $F$, denoted RP($F$). Clearly, the conditional intensity defined in
Equation 10.1 is given by

$$\gamma(t) = z(t - T_{N(t-)}),$$

where $t - T_{N(t-)}$ is the time since the last failure strictly before time $t$.
Suppose instead that after a failure, the system is repaired only to the state it
had immediately before the failure, called a minimal repair. This means that the
conditional intensity of the failure process immediately after the failure is the same
as it was immediately before the failure, and hence is exactly as it would be if no
failure had ever occurred. Thus we must have

$$\gamma(t) = z(t),$$

and the process is a nonhomogeneous Poisson process (NHPP) with intensity $z(t)$,
denoted NHPP($z(\cdot)$). In practice a minimal repair usually corresponds to repairing
or replacing only a minor part of the system.
If $z(t) = \lambda$ does not depend on $t$, then NHPP($z(\cdot)$) is a homogeneous Poisson
process which we denote by HPP($\lambda$). Note that an HPP is at the same time an RP
with exponential inter-failure times.
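The difference between the two repair models can be illustrated with a short sketch of the conditional intensity in each case. The function names are hypothetical, and the Weibull hazard is only an assumed example of a failure rate $z(t)$; any nonnegative function could be passed instead.

```python
from bisect import bisect_left


def last_failure_before(event_times, t):
    """T_{N(t-)}: last event strictly before t, or 0 if there is none."""
    k = bisect_left(event_times, t)  # number of events strictly before t
    return event_times[k - 1] if k > 0 else 0.0


def intensity_perfect_repair(z, event_times, t):
    """Renewal process: the hazard restarts from zero age after every failure."""
    return z(t - last_failure_before(event_times, t))


def intensity_minimal_repair(z, event_times, t):
    """NHPP: the intensity depends on the system age t only, not on past failures."""
    return z(t)


def weibull_hazard(shape, scale):
    """Assumed example of z(t): hazard rate of a Weibull distribution."""
    return lambda t: (shape / scale) * (t / scale) ** (shape - 1)
```

With an increasing hazard the two intensities diverge after the first failure, whereas a constant hazard (Weibull shape 1 in this sketch) makes them coincide, in line with the remark that an HPP is both an NHPP and an RP.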

10.2.3 The Trend-Renewal Process

The idea behind the trend-renewal process is to generalize the following well known
property of the NHPP. First let the cumulative intensity function corresponding to
an intensity function $\lambda(\cdot)$ be defined by $\Lambda(t) = \int_0^t \lambda(u)\,du$. Then if $T_1, T_2, \ldots$ is an
NHPP($\lambda(\cdot)$), the time-transformed stochastic process $\Lambda(T_1), \Lambda(T_2), \ldots$ is HPP(1).
The trend-renewal process (TRP) is defined simply by allowing the above
HPP(1) to be any renewal process RP($F$). Thus, in addition to the intensity function
$\lambda(t)$, for a TRP we need to specify a distribution function $F$ of the inter-arrival
times of this renewal process. Formally we can define the process TRP($F, \lambda(\cdot)$) as
follows.
Let $\lambda(t)$ be a nonnegative function defined for $t \ge 0$, and let $\Lambda(t) = \int_0^t \lambda(u)\,du$.
The process $T_1, T_2, \ldots$ is called TRP($F, \lambda(\cdot)$) if the process $\Lambda(T_1), \Lambda(T_2), \ldots$ is
RP($F$), that is if the $\Lambda(T_i) - \Lambda(T_{i-1})$; $i = 1, 2, \ldots$ are i.i.d. with distribution
function $F$. The function $\lambda(\cdot)$ is called the trend function, while $F$ is called the
renewal distribution. In order to have uniqueness of the model it is usually
assumed that $F$ has expected value 1.
Figure 10.2 illustrates the definition. For the cited property of the NHPP, the
lower axis would be an HPP with unit intensity, HPP(1). For the TRP, this process
is instead taken to be any renewal process, RP($F$), where $F$ has expectation 1. This
shows that the TRP includes the NHPP as a special case. Further, if $\lambda(t) \equiv 1$, then
$\Lambda(T_i) = T_i$, and so $T_1, T_2, \ldots$ is RP($F$).
For an NHPP($\lambda(\cdot)$), the RP($F$) would be HPP(1). Thus TRP($1 - e^{-x}, \lambda(\cdot)$) =
NHPP($\lambda(\cdot)$). Also, TRP($F, 1$) = RP($F$), which shows that the TRP class includes
both the RP and NHPP classes.
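The defining property suggests a direct way to simulate a TRP: generate a renewal process on the transformed time scale and map it back through the inverse of $\Lambda$. The sketch below is only an illustration under assumed parametric choices that are not prescribed by the text: a power-law trend $\lambda(t) = a b t^{b-1}$, so that $\Lambda(t) = a t^b$, and a Weibull renewal distribution scaled to have expected value 1. The function name and parameters are hypothetical.

```python
import math
import random


def simulate_trp(a, b, shape, tau, rng):
    """Simulate event times of a TRP(F, lambda) on (0, tau].

    Assumed trend: lambda(t) = a * b * t**(b - 1), so Lambda(t) = a * t**b.
    Assumed renewal distribution F: Weibull with the given shape and mean 1.
    """
    scale = 1.0 / math.gamma(1.0 + 1.0 / shape)  # gives the Weibull mean 1
    times = []
    s = 0.0  # current position on the transformed scale, s = Lambda(T_i)
    while True:
        s += rng.weibullvariate(scale, shape)  # i.i.d. increment drawn from F
        t = (s / a) ** (1.0 / b)               # T_i = Lambda^{-1}(s)
        if t > tau:
            return times
        times.append(t)


# Usage: one simulated history on (0, 10] with an increasing trend.
events = simulate_trp(a=1.0, b=1.5, shape=2.0, tau=10.0, rng=random.Random(1))
```

Setting `shape=1.0` in this sketch makes $F$ the unit exponential distribution, which corresponds to the NHPP special case, while `a=1.0, b=1.0` gives $\lambda(t) \equiv 1$ and hence an ordinary renewal process.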

Figure 10.2. The defining property of the trend-renewal process

It can be shown (Lindqvist et al. 2003) that the conditional intensity function,
given the history H t , for the TRP( F , λ (⋅)) is

γ (t ) = z (Λ (t ) − Λ (TN (t − ) ))λ (t ) (10.3)

where z (⋅) is the hazard rate corresponding to F . This is a product of one factor,
λ (t ) , which depends on the age t of the system and one factor which depends on
the transformed time elapsed since the most recent failure.
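
As an illustration of Equation 10.3, the sketch below evaluates the conditional intensity for one assumed parametric choice: a Weibull renewal distribution with mean 1 and a power-law trend with Λ(t) = ct^b. These forms and the function names are assumptions made for the example, and t is taken to be strictly positive and strictly later than the last failure before it.

```python
import math

def weibull_hazard(x, shape):
    """Hazard rate z(x) of a Weibull distribution scaled to have mean 1."""
    scale = 1.0 / math.gamma(1.0 + 1.0 / shape)
    return (shape / scale) * (x / scale) ** (shape - 1.0)

def conditional_intensity(t, failures, shape, c, b):
    """gamma(t) = z(Lambda(t) - Lambda(T_{N(t-)})) * lambda(t), Equation 10.3.

    Assumes Lambda(u) = c*u**b, so lambda(u) = c*b*u**(b-1).
    """
    cumulative = lambda u: c * u ** b
    trend = c * b * t ** (b - 1.0)
    # Time of the last failure strictly before t (0 if there is none).
    last = max((f for f in failures if f < t), default=0.0)
    return weibull_hazard(cumulative(t) - cumulative(last), shape) * trend
```

With `shape=1.0` the hazard z is identically 1 and the function returns λ(t), the NHPP intensity. With `shape < 1` the value just after a failure exceeds the value just before it.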
Suppose now that a single system has been observed in [0,τ ] , with failures at
T1 , T2 ,…, TN (τ ) . If a TRP( F , λ (⋅) ) is used as a model, then substitution of Equation
10.3 into Equation 10.2 gives the likelihood
Maintenance of Repairable Systems 239

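\[ L = \Big\{ \prod_{i=1}^{N(\tau)} z[\Lambda(T_i) - \Lambda(T_{i-1})]\, \lambda(T_i) \Big\} \exp\Big\{ -\int_0^{\tau} z[\Lambda(u) - \Lambda(T_{N(u-)})]\, \lambda(u)\, du \Big\}. \qquad (10.4) \]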

For the NHPP (λ (⋅)) we have z (t ) ≡ 1 , so the likelihood simplifies to the well
known expression (Crowder et al. 1991, p 166)

\[ L = \Big\{ \prod_{i=1}^{N(\tau)} \lambda(T_i) \Big\} \exp\Big\{ -\int_0^{\tau} \lambda(u)\, du \Big\}. \]

Returning to the general case, if f is the density function corresponding to F,
then we can write the likelihood at Equation 10.4 as

\[ L = \Big\{ \prod_{i=1}^{N(\tau)} f[\Lambda(T_i) - \Lambda(T_{i-1})]\, \lambda(T_i) \Big\} \{ 1 - F[\Lambda(\tau) - \Lambda(T_{N(\tau)})] \}. \qquad (10.5) \]

This latter form of the likelihood of the TRP follows directly from the
definition, since the conditional density of Ti given T1 = t1 ,…, Ti −1 = ti −1 is
f [Λ (ti ) − Λ (ti −1 )]λ (ti ) , and the probability of no failures in the time interval
(TN (τ ) ,τ ] , given T1 ,…, TN (τ ) , is 1 − F [Λ (τ ) − Λ (TN (τ ) )] .
This again simplifies if λ (t ) ≡ 1 in which case it gives the likelihood of an
RP(F) observed on [0,τ ] .
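
The logarithm of Equation 10.5 is straightforward to evaluate once parametric forms are chosen. The sketch below assumes, for illustration only, a Weibull renewal distribution with mean 1 and a power-law trend with Λ(t) = ct^b; the function name `trp_loglik` is hypothetical.

```python
import math

def trp_loglik(times, tau, shape, c, b):
    """Log of Equation 10.5 for a single system observed on [0, tau].

    Assumed forms: F is Weibull with mean 1 and the given shape;
    Lambda(t) = c*t**b, so lambda(t) = c*b*t**(b-1).
    """
    scale = 1.0 / math.gamma(1.0 + 1.0 / shape)
    cumulative = lambda t: c * t ** b
    loglik, prev = 0.0, 0.0
    for t in times:
        x = cumulative(t) - cumulative(prev)   # transformed inter-arrival time
        log_f = (math.log(shape / scale)
                 + (shape - 1.0) * math.log(x / scale)
                 - (x / scale) ** shape)        # log Weibull density at x
        loglik += log_f + math.log(c * b * t ** (b - 1.0))
        prev = t
    x_last = cumulative(tau) - cumulative(prev)
    return loglik - (x_last / scale) ** shape   # adds log{1 - F(x_last)}
```

For `shape=1.0` the result reduces to the NHPP log likelihood, the sum of log λ(T_i) minus Λ(τ), which gives a convenient check of the implementation.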

10.2.4 Observations from Several Similar Systems

Suppose that m systems of the same kind are observed, where the j-th system
( j = 1, 2,…, m ) is observed in the time interval [0,τ j ] . For the j-th system, let N j
denote the number of failures that occur during the observation period, and let the
specific failure times be denoted T_{1j} < T_{2j} < ⋯ < T_{N_j j}. Figure 10.3 illustrates the
notation and explains the information given in a so-called event plot which is
provided by computer packages for analysis of this kind of data (see examples

Example 1 Nelson (1995) presented data for times of valve-seat replacements in a

fleet of m = 41 diesel engines. Figure 10.4 shows an event plot for the complete

Example 2 Bhattacharjee et al. (2003) presented failure data for motor operated
closing valves in safety systems at two boiling water reactor plants in Finland.
Failures of the type “external leakage” were considered for 104 valves with a
follow-up time of nine years. An event plot for the 16 valves which experienced at
least one failure is given in Figure 10.5. The remaining 88 valves had no failures.

Figure 10.3. Observation of failure times of m systems. The j-th system is observed over
the time interval [0,τ j ] , with N j ≥ 0 observed failures

Figure 10.4. Event plot for times of valve seat replacements for 41 diesel engines, taken
from Nelson (1995)

When data are available for m systems as described above, one will typically
assume that the systems behave independently but with the same probability laws
(“i.i.d. rules”). The total likelihood for the data will then be the product of the
likelihoods at Equations 10.4 or 10.5, one factor for each of the m systems.

Figure 10.5. Event plot for times of external leakage from nuclear plant valves, taken from
Bhattacharjee et al. (2003). In addition, 88 valves had no failures in 3286 days (9 years)

However, even if the m systems are considered to be of the same type, they
may well exhibit different probabilistic failure mechanisms. For example, systems
may be used under varying environmental or operational conditions.
To cover such cases we shall assume that failures of the j-th system follow the
process TRP(F, λ_j(·)), j = 1, …, m, where the renewal distribution F is fixed and
differences between systems are modeled by letting the trend functions λ j (t ) vary
from system to system. The assumption of a fixed F parallels the NHPP case,
where F is the unit exponential distribution.
Assuming that systems work independently of each other, we obtain from
Equation 10.5 the full likelihood L = ∏_{j=1}^{m} L_j, where

\[ L_j = \Big\{ \prod_{i=1}^{N_j} f[\Lambda_j(T_{ij}) - \Lambda_j(T_{i-1,j})]\, \lambda_j(T_{ij}) \Big\} \{ 1 - F[\Lambda_j(\tau_j) - \Lambda_j(T_{N_j j})] \}. \qquad (10.6) \]

As an example of the use of Equation 10.6, assume that differences between
system performances can be attributed to an observable covariate vector x, and
that the trend λ_j(t) for system j is represented by a proportional trend model

λ_j(t) = g(x_j) λ(t),   j = 1, …, m.   (10.7)

Here λ(·) is a basic trend function common to all systems, while g is a
function of the covariate vector x_j of system j. The special cases of this model
corresponding to NHPP and RP are studied, respectively, by Lawless (1987) and
Follmann and Goldberg (1988).

10.2.5 The Heterogeneous Trend Renewal Process

As noted in the introduction, in addition to observable differences there may be an
unobserved heterogeneity between systems. A common way of incorporating such
heterogeneity is to modify Equation 10.7 to λ_j(t) = a_j g(x_j) λ(t), where the a_j are
unobservable (positive) random variables taking values independently across
systems (Andersen et al. 1993, Chapter IX).
For simplicity we shall in this chapter restrict attention to the case with no
observed covariates, and instead concentrate on unobserved heterogeneity. In the
following we thus assume the model

λ_j(t) = a_j λ(t)   (10.8)

where the a_j are independently distributed according to a common probability
distribution H, say, and where for convenience we assume that the expected value
of a_j equals 1. Thus in Equation 10.8, λ(·) is regarded as a basic trend function,
while the a_j represent a possibly different failure intensity “level” for each system,
averaging to 1. The special case when a_j = 1 with probability 1 will be referred to
as the “no heterogeneity” case.
For given values of the a_j the likelihood for the j-th system is, by Equation
10.6 with Λ_j(t) = a_j Λ(t),

\[ L_j(a_j) = \Big\{ \prod_{i=1}^{N_j} f[a_j(\Lambda(T_{ij}) - \Lambda(T_{i-1,j}))]\, a_j \lambda(T_{ij}) \Big\} \{ 1 - F[a_j(\Lambda(\tau_j) - \Lambda(T_{N_j j}))] \}. \]

However, since the a_j are unobservable, we need to take the expectation with
respect to the a_j, giving

\[ L_j = E[L_j(a_j)] = \int L_j(a_j)\, dH(a_j) \]

as the contribution to the likelihood from the j-th system. The total likelihood is
then the product

\[ L = \prod_{j=1}^{m} L_j. \qquad (10.9) \]

We shall use the notation HTRP(F, λ(·), H) for the model with the likelihood at
Equation 10.9. Here the renewal distribution F and the heterogeneity distribution
H are distributions corresponding to positive random variables with expected value
1, while the basic trend function λ(t) is a positive function defined for t ≥ 0.
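
To make the integration over H concrete, consider the simplest heterogeneous case, in which F is the unit exponential distribution and the trend is a constant ν. For a given a_j the likelihood of system j then collapses to (a_j ν)^{N_j} exp(−a_j ν τ_j). The sketch below assumes, purely for illustration, that H is a discrete distribution with finitely many support points, so that the integral becomes a finite sum. The function name and the data are hypothetical.

```python
import math

def hhpp_loglik(counts, taus, nu, support, probs):
    """Log of Equation 10.9 when F is unit exponential and lambda(t) = nu.

    H is assumed discrete: it puts probability probs[k] on support[k].
    counts[j] and taus[j] are the failure count and observation length
    of system j.
    """
    total = 0.0
    for n, tau in zip(counts, taus):
        # L_j = sum over the support of H of p * L_j(a)
        l_j = sum(p * (a * nu) ** n * math.exp(-a * nu * tau)
                  for a, p in zip(support, probs))
        total += math.log(l_j)
    return total

# Hypothetical data: four systems, each observed for 100 time units.
counts, taus = [0, 0, 0, 8], [100.0] * 4
no_heterogeneity = hhpp_loglik(counts, taus, 0.02, [1.0], [1.0])
two_point = hhpp_loglik(counts, taus, 0.02, [0.25, 4.0], [0.8, 0.2])
```

The two-point distribution used here has mean 0.8 × 0.25 + 0.2 × 4.0 = 1, in line with the normalization of H.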

A useful feature of the HTRP model is that several important models for repair-
able systems are easily represented as submodels. With the notation HPP, NHPP,
RP and TRP used as before, we define corresponding models with heterogeneity as
at Equation 10.8 by putting an H in front of the abbreviations. Specifically, from
a full model, HTRP ( F , λ (⋅), H ) , we can identify the seven submodels described in
Table 10.1.

Table 10.1. The seven submodels of HTRP ( F , λ (⋅), H ) . ’exp’ means the unit exponential
distribution, ’1’ means the distribution degenerate at 1 . The third column contains referen-
ces to work on the corresponding models or special cases of them.

Submodel            HTRP formulation

HPP(ν)              HTRP(exp, ν, 1)
RP(F, ν)            HTRP(F, ν, 1)
NHPP(λ(·))          HTRP(exp, λ(·), 1)
TRP(F, λ(·))        HTRP(F, λ(·), 1)
HHPP(ν, H)          HTRP(exp, ν, H)
HRP(F, ν, H)        HTRP(F, ν, H)
HNHPP(λ(·), H)      HTRP(exp, λ(·), H)

The HTRP and the seven submodels may also be represented in a cube, as
illustrated in Figures 10.6 and 10.7. Each vertex of the cube represents a model,
and the lines connecting them correspond to changing one of the three “coordi-
nates” in the HTRP notation. Going to the right corresponds to introducing a time
trend, going upwards corresponds to entering a non-Poisson case, and going back-
wards (inwards) corresponds to introducing heterogeneity. In analyzing data by
parametric HTRP models we shall see below how we use the cube to facilitate the
presentation of maximum log-likelihood values for the different models in a con-
venient, visual manner. The log-likelihood cube was introduced in Lindqvist et al.

Example 1 (continued) Figure 10.6 shows the log-likelihood cube of the valve-seat
data. It should be noted that each arrow points in a direction where exactly one
parameter is added (see text of Figure 10.6 for definitions of parameters). Using
standard asymptotic likelihood theory we know that if this parameter has no
influence in the model, then twice the difference in log likelihood is approximately
chi-square distributed with one degree of freedom. For example, if twice the
difference is larger than 3.84, then the p-value for the hypothesis that the extra
parameter has no influence is less than 5%, and we have an indication that the
parameter in fact has some relevance. Note that adding an extra parameter can
never decrease the maximum log likelihood, but from what we just argued, the
increase needs to be more than, say, 3.84/2 = 1.92 to be of real interest.

Figure 10.6. The log-likelihood cube for the valve-seat data of Nelson (1995), fitted
with a parametric HTRP(F, λ(·), H) model and its sub-models. Here F is a Weibull
distribution with expected value 1 and shape parameter s, λ(t) = cbt^{b−1} is a power
function of t, and H is a gamma distribution with expected value 1 and variance v. The
maximum value of the log likelihood is denoted by l

Looking at the valve-seat data cube in Figure 10.6 we note first that going from
a vertex of the front face to the corresponding vertex of the back face (adding “H”
in front of the model acronym) there is never much to gain (1.17 at most from HPP
to HHPP). This indicates no apparent heterogeneity between the various engines.
By comparing the left and right faces we conclude, however, that there seems to
be a gain in including a time trend. Having already excluded heterogeneity we are
thus faced with the possibilities of either NHPP or TRP. Here the latter model
“wins”, since the difference in log-likelihood is as large as (−343.66) − (−346.49) =
2.83, so that twice the difference equals 5.66, corresponding to an approximate p-value
of 0.017.
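
The p-value quoted for the NHPP versus TRP comparison can be reproduced with a few lines of code. The sketch below uses the fact that a chi-square variable with one degree of freedom is the square of a standard normal variable, so its upper tail probability at x equals erfc(√(x/2)). The function name is illustrative.

```python
import math

def lr_test_one_parameter(loglik_small, loglik_big):
    """Likelihood ratio statistic and approximate p-value (1 degree of freedom).

    loglik_small and loglik_big are the maximized log likelihoods of the
    smaller model and of the model with one extra parameter.
    """
    stat = max(2.0 * (loglik_big - loglik_small), 0.0)
    p_value = math.erfc(math.sqrt(stat / 2.0))  # P(chi-square_1 > stat)
    return stat, p_value

# NHPP versus TRP for the valve-seat data (values read from the text).
stat, p_value = lr_test_one_parameter(-346.49, -343.66)
```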
The resulting estimated TRP is seen to have a renewal distribution which is
Weibull with shape parameter 0.6806 which implies a decreasing failure rate. This
means that the conditional intensity function will jump upward at each failure,
which may be explained by burn-in problems at each valve-seat replacement.
Further, there will be an estimated time trend of the form
λ̂(t) = 3.26 × 10^{−6} × 1.929 × t^{0.929} = 6.29 × 10^{−6} × t^{0.929}, which increases with t, so
that replacements are becoming more and more frequent.

Example 2 (continued) For the closing valve failures considered by Bhattacharjee
et al. (2003), previous studies had shown significant variations in the number of
failures of each valve, suggesting a heterogeneity between valves. Bhattacharjee et
al. (2003) thus stressed the importance of taking heterogeneity into consideration
and concluded that even very simple models may describe the heterogeneous
behavior successfully. In particular they considered a model where heterogeneity is
represented by assuming that each valve is either “good” or “bad”.
While Bhattacharjee et al. (2003) used hierarchical Bayes models, we fitted an
HTRP model and its sub-models, with a trend function of power law type as for the
valve-seat data, λ(t) = cbt^{b−1}, but now with a heterogeneity distribution H being a
two-point distribution taking the value a_1 for a “good” valve and a_2 for a “bad” valve
(so a_1 ≤ a_2 by assumption), with P(“good”) = p. In order to have uniqueness of parameters we
imposed the restriction of expected value 1 for the distribution H, leading to
p a_1 + (1 − p) a_2 = 1. The results are given in the log-likelihood cube of Figure 10.7.
By comparing the front and back faces of Figure 10.7 it is clear that there is a
considerable heterogeneity present, leaving us with the back face. Thus we con-
tinue by investigating whether we have Poisson-behavior or renewal-behavior at
failures. This is done by comparing the bottom and top faces, in other words
(HHPP, HNHPP) vs. (HRP, HTRP). The difference from HHPP to HRP happens to
be 1.92 so the p-value is 5%. Thus we might prefer the HRP model. However, in
order to obtain a simple model with a simple interpretation we might go for the
HHPP which gives that the closing valve is a “good” one with probability 0.9524,
with failures following an HPP with rate

1.083 × 10^{−4} × 0.35 = 3.79 × 10^{−5} (per day),

or a “bad” one with probability 0.0476 and rate

1.083 × 10^{−4} × 14.0 = 1.52 × 10^{−3} (per day).

The expected numbers of failures in 3286 days are hence 0.125 and 4.99, respectively,
for the “good” and “bad” valves.
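
The arithmetic behind the fitted HHPP model is easy to retrace. The short sketch below recomputes the two failure rates, the expected numbers of failures over the follow-up period and the mean of the two-point distribution, using only the estimates quoted in the text; the variable names are illustrative.

```python
# Estimates quoted in the text for the fitted HHPP model.
nu = 1.083e-4                  # basic failure rate (per day)
a_good, a_bad = 0.35, 14.0     # the two values of the heterogeneity factor
p_good = 0.9524                # probability that a valve is "good"
days = 3286                    # follow-up time

rate_good = nu * a_good
rate_bad = nu * a_bad
expected_good = rate_good * days
expected_bad = rate_bad * days
# The two-point distribution H is constrained to have mean 1.
mean_h = p_good * a_good + (1.0 - p_good) * a_bad
```

Working with the unrounded rate for the “bad” valves gives an expected count slightly below the quoted 4.99, which is obtained from the rounded rate 1.52 × 10^{−3}.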

10.3 A Competing Risks Model for Failure vs. Preventive Maintenance

10.3.1 A General Setup

Consider again the situation illustrated in Figure 10.1, where the sojourns
X 1 , X 2 ,… are times to failure of a system which is repaired immediately before the
start of the sojourn. In the present section we consider the case when the failure
which we expect at the end of the sojourn X i , may be avoided by a preventive
maintenance (PM) after a time Z i in the sojourn. The experienced sojourn time
will in this case be Yi = min( X i , Z i ) , and it will result in either a failure or a PM
according to whether Yi = X i or Yi = Z i . We thus have a competing risks situation
with two risks, corresponding to failure and PM.

Figure 10.7. The log-likelihood cube for the data of Bhattacharjee et al. (2003) concerning
failures of motor operated closing valves in nuclear reactor plants in Finland, fitted with a
parametric HTRP(F, λ(·), H) model and its sub-models. Here F is a Weibull distribution
with expected value 1 and shape parameter s, λ(t) = cbt^{b−1} is a power function of t, and
H is a two-point distribution with unit expectation, giving probability p for the value
“low” and 1 − p for the value “high”. The maximum value of the log likelihood is denoted
by l

Doyen and Gaudoin (2006) recently presented a point process approach for
modeling of such competing risks situations between failure and PM. A general
setup for this kind of process is furthermore suggested in the review paper
Lindqvist (2006).
For simplicity we shall in this chapter consider only the case where the
component or system is perfectly repaired or maintained at the end of each sojourn.
This will lead to the observation of independent copies of the competing risks
situation in the same way as for a renewal process. We will therefore in the follow-
ing consider only a single sojourn and hence suppress the subscripts of the ob-
served times. Thus we let X and Z be, respectively, the potential times to failure
and time to PM of a single sojourn. Then Y = min( X , Z ) is the observed sojourn,
and in addition we observe the indicator variable δ which we define to be 1 if
there is a PM ( Y = Z ) and 0 if there is a failure ( Y = X ). This situation has been
extensively studied by Cooke (1993, 1996), Bedford and Cooke (2001), Langseth
and Lindqvist (2003, 2006), Lindqvist et al. (2006) and Lindqvist and Langseth
Thus note that the observable result is the pair (Y , δ ) , rather than the underlying
times X and Z, which may often be the times of interest. For example, knowing
the distribution of X would be important as a basis for maintenance optimization. It
is well known (see Crowder 2001, Chapter 7), however, that in a competing risks
case as described here, the marginal distributions of X and Z are not identifiable
from observation of (Y , δ ) alone unless specific assumptions are made on the
dependence between X and Z . The most frequently used assumption of this kind
is to let X and Z be independent, in which case identifiability follows. This as-
sumption is not reasonable in our application, however, since the maintenance crew
is likely to have some information regarding the system’s state during operation.
This insight is used to perform maintenance in order to avoid failures. We are thus
in practice usually faced with a situation of dependent competing risks between X
and Z .

10.3.2 Random Signs Censoring

Cooke (1993, 1996) suggested that the competing risks situation between failure
and PM will often satisfy what he called the random signs censoring property. The
important features of random signs censoring are that the marginal distribution of
X is always identifiable, and that an indication of the validity of this type of cen-
soring could be found from data plotting.
A lifetime Z is said to be a random signs censoring of X if the event {Z < X }
is stochastically independent of X , i.e. if the event of having a PM before failure is
not influenced by the time X at which the system fails or would have failed
without PM. The idea is that the system emits some kind of signal before failure,
and that this signal is discovered with a probability which does not depend on the
age of the system.
We now introduce some notation. Below we assume without further mention
that X , Z are positive and continuous random variables, with P( X = Z ) = 0 . We
let FX (t ) = P( X ≤ t ) and FZ (t ) = P( Z ≤ t ) be the cumulative distribution func-
tions of X and Z , respectively. The subdistribution functions of X and Z are
defined as, respectively, FX∗ (t ) = P( X ≤ t , X < Z ) and FZ∗ (t ) = P( Z ≤ t , Z < X ) .
Note that the functions F*_X and F*_Z are nondecreasing with F*_X(0) = 0 and
F*_Z(0) = 0. Moreover, we have F*_X(∞) + F*_Z(∞) = 1.

We will also use the notion of conditional distribution functions, defined by
F̃_X(t) = P(X ≤ t | X < Z) and F̃_Z(t) = P(Z ≤ t | Z < X). Note then that

F̃_X(t) = F*_X(t)/F*_X(∞),   F̃_Z(t) = F*_Z(t)/F*_Z(∞).
It is important to note that the functions F*_X, F*_Z, F̃_X, F̃_Z are identifiable from
data of the form (Y , δ ) , since they are given in terms of probabilities of events that
can be expressed by (Y , δ ) . For example, FX∗ (t ) = P(Y ≤ t , δ = 0) and can hence be
estimated consistently from a sample of values of (Y , δ ) .
On the other hand, as already mentioned, the marginal distribution functions
FX , FZ are not identifiable in general since they are not probabilities of events that
can be expressed by (Y , δ ) .
We now show that the marginal distribution of X is identifiable under random
signs censoring. In fact this follows directly from the definition, since we must have

F̃_X(t) = P(X ≤ t | X < Z) = P(X ≤ t) = F_X(t)   (10.10)

by independence of X and the event X < Z. As verified above, F̃_X(t) can always
be estimated consistently from data, and thus this holds for F_X(t) as well by
Equation 10.10. Hence we have the somewhat surprising result under random signs
censoring that the marginal distribution of X is the same as the distribution of the
observed occurrences of X .
Cooke (1993) showed that under random signs censoring we have

F̃_X(t) < F̃_Z(t) for all t > 0.   (10.11)

Moreover, he showed a converse of sorts: whenever Equation 10.11
holds, there exists a joint distribution of (X, Z) satisfying the requirements of
random signs censoring and giving the same sub-distribution functions.
On the other hand, if F̃_X(t) ≥ F̃_Z(t) for some t, then there is no joint distribution
of (X, Z) for which the random signs requirement holds. For more discussion
on random signs censoring and its applications we refer to Cooke (1993, 1996) and
Bedford and Cooke (2001, Chapter 9). One idea is to estimate the functions F̃_X(t)
and F̃_Z(t) from data to check whether Equation 10.11 may possibly hold and,
when this is the case, to suggest a model that satisfies the random signs property.
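
The data check just described can be sketched with empirical distribution functions. The code below is an informal illustration rather than a formal test: it estimates the two conditional distribution functions from a sample of (Y, δ) pairs and compares them on a grid of time points. It uses a non-strict inequality because empirical distribution functions can tie, and the function names are hypothetical.

```python
def conditional_ecdfs(sample):
    """Empirical conditional distribution functions from (y, delta) pairs.

    delta = 1 marks a PM (Y = Z), delta = 0 marks a failure (Y = X).
    Returns estimates of P(X <= t | X < Z) and P(Z <= t | Z < X).
    """
    failures = [y for y, d in sample if d == 0]
    pms = [y for y, d in sample if d == 1]

    def ecdf(values):
        return lambda t: sum(v <= t for v in values) / len(values)

    return ecdf(failures), ecdf(pms)

def random_signs_plausible(sample, grid):
    """True if the estimated curves are ordered as in Equation 10.11 on the grid."""
    fx, fz = conditional_ecdfs(sample)
    return all(fx(t) <= fz(t) for t in grid)

# Hypothetical data in which PMs tend to occur earlier than failures.
sample = [(1.0, 1), (2.0, 1), (3.0, 0), (4.0, 0)]
plausible = random_signs_plausible(sample, [0.5, 1.5, 2.5, 3.5, 4.5])
```

In practice one would plot the two curves rather than rely on a yes/no answer, since sampling variation can cause crossings even when Equation 10.11 holds.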

10.3.3 The Repair Alert Model

Lindqvist et al. (2006) introduced the so-called repair alert model which extends
the idea of random signs censoring by defining an additional repair alert function
which describes the “alertness” of the maintenance crew as a function of time. The
definition can be given as follows:
The pair ( X , Z ) of life variables satisfies the requirements of the repair alert
model provided the following two conditions both hold:

(i) Z is a random signs censoring of X.

(ii) There exists an increasing function G defined on [0, ∞) with G(0) = 0, such
that for all x > 0,

P(Z ≤ z | Z < X, X = x) = G(z)/G(x),   0 < z ≤ x.

The function G is called the cumulative repair alert function. Its derivative g
(when it exists) is called the repair alert function. The repair alert model is hence a
specialization of random signs censoring, obtained by introducing the repair alert
function G .
Part (ii) of the above definition means that, given that there would be a failure
at time X = x, and given that the maintenance crew will perform a PM before that
time (i.e. given that Z < X), the conditional density of the time Z of this PM is
proportional to the repair alert function g.
Lindqvist et al. (2006) showed that whenever Equation 10.11 holds there is a
unique repair alert model giving the same sub-distribution functions. Thus, restrict-
ing to repair alert models we are able to strengthen the corresponding result for
random signs censoring which does not guarantee uniqueness.
The repair alert function is meant to reflect the reaction of the maintenance
crew. More precisely, g (t ) ought to be high at times t for which failures are ex-
pected and the alert therefore should be high. Langseth and Lindqvist (2003) sim-
ply put g (t ) = λ (t ) where λ (t ) is the failure rate of the marginal distribution of
X . This property of g (t ) of course simplifies analyses since it reduces the number
of parameters, but at the same time it seems fairly reasonable given a competent
maintenance crew. In a subsequent paper, Langseth and Lindqvist (2006) present
ways to test whether g (t ) can be assumed equal to the hazard function λ (t ) .
It follows from the construction in Lindqvist et al. (2006) that the repair alert
model is completely determined by the marginal distribution function FX of X ,
the cumulative repair alert function G , the probability q ≡ P( Z < X ) , and the
assumption that X is independent of the event {Z < X } (i.e. random signs cen-
soring). Thus, given statistical data, the inference problem consists of estimating
FX (t ) (possibly on parametric form), the repair alert function g (or G ), and the
probability q of PM. We refer to Lindqvist et al. (2006) and Lindqvist and
Langseth (2005) for details on such statistical inferences.
The following is a simple example of a repair alert model.

Example 3 Let (X, Z) be a pair of life variables with joint density parameterized
by λ > 0 and 0 < q < 1,

f_XZ(x, z; λ, q) = (q/x) λ e^{−λx}   for x > 0, 0 < z < x/q.

The marginal distribution of X is the exponential distribution with density
f_X(x) = λ e^{−λx}, while the conditional distribution of Z given X = x is the
uniform distribution on (0, x/q). From this we obtain P(Z < X | X = x) = q for all
x > 0. Thus the event Z < X is independent of X and condition (i) of the
definition is satisfied. The following computation shows that condition (ii) holds as
well. Let 0 < z < x. Then

\[ P(Z \le z \mid Z < X, X = x) = \frac{P(Z \le z, Z < X \mid X = x)}{P(Z < X \mid X = x)} = \frac{P(Z \le z \mid X = x)}{q} = \frac{z(q/x)}{q} = \frac{z}{x}, \]

which implies condition (ii) of the definition with G(t) = t.

The practical interpretation of this example is as follows. We consider a component
or system with lifetime X which is exponentially distributed with failure
rate λ. With probability q a PM is performed before X, at a time which for
given X = x is uniformly distributed on the interval from 0 to x.
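
The example can be explored by simulation directly from the joint density: draw X from the exponential distribution and then Z uniformly on (0, x/q). The sketch below is an illustration with arbitrary parameter values and a hypothetical function name; it returns the observable pairs (Y, δ).

```python
import random

def simulate_example3(n, lam, q, rng):
    """Draw n observations (y, delta) from the joint density of Example 3.

    X is exponential with rate lam; given X = x, Z is uniform on (0, x/q).
    delta = 1 marks a PM (Z < X), delta = 0 marks a failure.
    """
    pairs = []
    for _ in range(n):
        x = rng.expovariate(lam)
        z = rng.uniform(0.0, x / q)
        pairs.append((min(x, z), 1 if z < x else 0))
    return pairs

rng = random.Random(42)
sample = simulate_example3(200_000, lam=1.0, q=0.3, rng=rng)
pm_times = [y for y, d in sample if d == 1]
failure_times = [y for y, d in sample if d == 0]
pm_fraction = len(pm_times) / len(sample)
mean_pm = sum(pm_times) / len(pm_times)
mean_failure = sum(failure_times) / len(failure_times)
```

Under the model, the fraction of PMs should be close to q, the observed failure times should average close to E(X) = 1/λ, and the observed PM times should average close to half of that, since Z is uniform on (0, x) given Z < X and X = x.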

10.3.4 Further Properties of The Repair Alert Model

The following formula (taken from Lindqvist et al. 2006) shows in particular why
Equation 10.11 holds under the repair alert model:

\[ \tilde F_Z(t) = F_X(t) + G(t) \int_t^{\infty} \frac{f_X(y)}{G(y)}\, dy. \qquad (10.12) \]

Note that for random signs and hence for the repair alert model we have
F̃_X(t) = F_X(t).
We next discuss some implications of the repair alert model, in particular how
the parameters q and G influence the observed performance of PM and failures.
In order to help intuition, we sometimes consider the power version G(t) = t^β,
where β > 0 is a parameter. Then g(t) = βt^{β−1}, so β = 1 means a constant repair
alert function, while β < 1 and β > 1 correspond to, respectively, a decreasing and
increasing repair alert function.
Under the random signs assumption, the parameter q = P( Z < X ) is connected
to the ability to discover “signals” regarding a possibly approaching failure. More
precisely, q is understood as the probability that a failure is avoided by a pre-
ceding PM.
Given that there will be a PM, one should ideally have the time of PM immedi-
ately before the failure. It is seen that this issue is connected to the function G . For
example, large values of β will correspond to distributions of Z with most of their mass
near x.
Moreover, it follows from Equation 10.12 that

\[ E(Z \mid Z < X) = \int_0^{\infty} (1 - \tilde F_Z(z))\, dz = E(X) - E\!\left[ \frac{M(X)}{G(X)} \right], \]

where M(x) = ∫_0^x G(t) dt. For the special case when G(t) = t^β, we obtain the
simple result

\[ E(Z \mid Z < X) = \frac{\beta}{\beta + 1}\, E(X), \qquad (10.13) \]

which clearly indicates that good PM performance corresponds to large values of
β. An interesting observation is, furthermore, that Equation 10.13 can be used to
estimate β from a sample of (Y, δ). In fact, E(Z | Z < X) can be estimated simply
by the average of the observed Z, and since E(X) = E(X | X < Z) for random
signs censoring, we can estimate E(X) similarly by the average of the observed
X. An estimate of the quotient β/(β + 1), and hence of β, follows.
Instead of merely considering the conditional expectation E ( Z | Z < X ) one
may more generally study the conditional distribution of Z given Z < X , or the
conditional distribution of X − Z given Z = z, Z < X . A good PM performance
would then mean that the former distribution is stochastically as large as possible,
while the latter distribution should be small (stochastically). For precise results in
this direction we refer to Lindqvist et al. (2006).
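
The moment estimator of β based on Equation 10.13 takes only a few lines. In the sketch below, which assumes G(t) = t^β and random signs censoring, the ratio r of the average observed PM time to the average observed failure time estimates β/(β + 1), so that β is estimated by r/(1 − r). The function name and the data are hypothetical.

```python
def estimate_beta(sample):
    """Moment estimate of beta from (y, delta) pairs, based on Equation 10.13.

    delta = 1 marks a PM (Y = Z), delta = 0 marks a failure (Y = X).
    """
    pm_times = [y for y, d in sample if d == 1]
    failure_times = [y for y, d in sample if d == 0]
    ratio = (sum(pm_times) / len(pm_times)) / (sum(failure_times) / len(failure_times))
    if ratio >= 1.0:
        # The sample contradicts E(Z | Z < X) < E(X); no finite estimate exists.
        raise ValueError("average PM time is not below average failure time")
    return ratio / (1.0 - ratio)

# Hypothetical sample: PM times average 2.0 and failure times average 3.0.
beta_hat = estimate_beta([(1.5, 1), (2.5, 1), (2.0, 0), (4.0, 0)])
```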
Consider next Y = min( X , Z ) , which is the actual sojourn time. The following
results are hence of practical interest, and may in addition shed light on the
influence of the parameters of the repair alert model:

\[ P(Y \le t) = F_X(t) + q\,G(t) \int_t^{\infty} \frac{f_X(y)}{G(y)}\, dy, \]

\[ E(Y) = E(X) - q\,E\!\left[ \frac{M(X)}{G(X)} \right], \quad \text{where } M(x) = \int_0^x G(t)\, dt. \]

Furthermore, if G(t) = t^β, then

\[ E(Y) = E(X) \left( 1 - \frac{q}{\beta + 1} \right). \qquad (10.14) \]
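
Equation 10.14 can be checked by simulating the repair alert model with G(t) = t^β. In the sketch below, which assumes an exponential lifetime purely for illustration, a PM occurs with probability q independently of X, and given a PM and X = x the PM time has distribution function (z/x)^β on (0, x), so it can be drawn as x·U^{1/β} with U uniform on (0, 1). The function name and parameter values are hypothetical.

```python
import random

def mean_sojourn(n, lam, q, beta, rng):
    """Monte Carlo estimate of E(Y) under the repair alert model.

    Assumptions: X is exponential with rate lam, G(t) = t**beta,
    and P(Z < X) = q independently of X (random signs censoring).
    """
    total = 0.0
    for _ in range(n):
        x = rng.expovariate(lam)
        if rng.random() < q:
            # PM before failure: Z given X = x has cdf (z/x)**beta on (0, x).
            total += x * rng.random() ** (1.0 / beta)
        else:
            total += x  # the sojourn ends with a failure at time x
    return total / n

rng = random.Random(7)
estimate = mean_sojourn(200_000, lam=2.0, q=0.4, beta=3.0, rng=rng)
theory = (1.0 / 2.0) * (1.0 - 0.4 / (3.0 + 1.0))  # Equation 10.14
```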

We finally give a simple illustration of how the parameters q and β (assuming
G(t) = t^β for simplicity) influence the long run cost per time unit under the
repair alert model. Let C_PM, C_F be costs of PM and failure, respectively, for a
single sojourn. Assume now that following an event (PM or failure), the operation
is restarted with a system assumed to be as good as new, and that this process
continues. This leads to a sequence of observations of (Y, δ), which we shall
assume are independent and identically distributed. The theory of renewal reward
processes (e.g. Ross 1983, p 78) implies that the expected cost per unit time in the
long run equals the expected cost per sojourn divided by the expected length of a
sojourn, i.e.

\[ \frac{q\,C_{PM} + (1 - q)\,C_F}{E(X)\left(1 - \frac{q}{\beta + 1}\right)}, \]

where we used Equation 10.14.
This is a decreasing function of β , which seems reasonable. On the other
hand, it is a decreasing function of q provided β > CPM / (CF − CPM ) . This last
inequality is likely to hold in many practical cases since the right hand side will
usually be much less than 1, while β should for a competent maintenance crew be
larger than 1. Thus a high value of q is usually preferable.
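
The behavior of the long run cost rate is easy to explore numerically. The sketch below evaluates the expression for hypothetical costs C_PM = 1 and C_F = 10, for which the threshold C_PM/(C_F − C_PM) equals 1/9; all names and values are illustrative.

```python
def cost_rate(q, beta, c_pm, c_f, mean_x):
    """Long run expected cost per unit time under the repair alert model.

    Assumes G(t) = t**beta, so that E(Y) = E(X) * (1 - q / (beta + 1)).
    """
    expected_cost = q * c_pm + (1.0 - q) * c_f
    expected_length = mean_x * (1.0 - q / (beta + 1.0))
    return expected_cost / expected_length

c_pm, c_f, mean_x = 1.0, 10.0, 1.0
threshold = c_pm / (c_f - c_pm)                       # equals 1/9 here

# beta above the threshold: a higher q lowers the cost rate.
low_q = cost_rate(0.2, 2.0, c_pm, c_f, mean_x)
high_q = cost_rate(0.8, 2.0, c_pm, c_f, mean_x)

# beta below the threshold: a higher q raises the cost rate.
low_q_small_beta = cost_rate(0.2, 0.05, c_pm, c_f, mean_x)
high_q_small_beta = cost_rate(0.8, 0.05, c_pm, c_f, mean_x)
```

When β is below the threshold, PMs are performed so early in the sojourn that the shortened sojourns outweigh the saving from avoided failures.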

10.4 Periodically Tested Systems

Certain systems, for example alarm systems, are tested only at fixed times which
are usually periodic. If the system is found in a failed state, then it is repaired or
replaced. Thus repair is usually not done at the same time as the failure, and the
situation is hence not covered by the methods considered earlier in this chapter. A
simple model of this situation was suggested by Hokstad and Frøvig (1996) and
further studied and extended by Lindqvist and Amundrustad (1998) which is the
main source for the present section.
The approach of Lindqvist and Amundrustad (1998) involves a continuous time
Markov model for the system state when time runs between testing epochs, and in
addition two discrete time Markov chains for the states of the system reported im-
mediately before and after each test, respectively. As will be seen, the given frame-
work also allows in an easy manner the potentially useful extension to modeling of
incomplete repairs or maintenance actions.
We consider a standby system observed from time 0 , with testing, repair and
PM performed periodically at times

τ , 2τ , 3τ ,…,

called PM epochs. Here τ > 0 is the length of what we shall call the PM interval.

10.4.1 The Markov Model

Let X (t ) ∈ S denote the state of the system at time t , where the set S of possible
states is finite. It is assumed that X (t ) behaves like a time homogeneous Markov
chain as long as time runs inside PM intervals, i.e. inside time intervals
nτ ≤ t < (n + 1)τ for n = 0,1,…. This Markov chain is governed by an infinitesimal
intensity matrix A, where the entry a_jk of A for j ≠ k is the transition intensity
from state j to state k ; see for example Taylor and Karlin (1984, p 254). An
example of an intensity matrix A is given by Equation 10.15, an illustration of
which is provided by the state diagram in Figure 10.9. Let

Pjk (t ) = P( X (t ) = k | X (0) = j ); j, k ∈ S , t > 0

denote transition probabilities for the Markov chain governed by A and let

P(t ) = ( Pjk (t ); j, k ∈ S )

be the corresponding transition matrix.

In order to specify the effect of maintenance and repair at PM epochs, we next
introduce for n = 1, 2,…,

Y_n = X(nτ−) ≡ lim_{t↑nτ} X(t),

which is the state of the system immediately before the n -th PM epoch. The effect
of PM at time nτ is to change the state of the system from Yn to Z n according to
a transition matrix R = ( R jk ) , where

P( Z n = k | Yn = j ) = R jk ; j, k ∈ S .

Moreover, given Y_n it is assumed that Z_n is independent of all transitions of the
system state before time nτ.
The definitions of the Yn and Z n are illustrated in Figure 10.8.

Figure 10.8. The definition of Yn and Z n

The model description is completed by defining the initial state of the Markov
chain X (t ) running inside the PM interval [nτ , (n + 1)τ ) to be X (nτ ) ≡ Z n
( n = 0,1,…), where Z 0 is the initial state of the system, usually the perfect state in
S . It is furthermore assumed that the Markov chain X (t ) on [nτ , (n + 1)τ ) , given
its initial state Z n , is independent of all transitions occurring before time nτ .
Let the distribution of Z 0 ≡ X (0) be denoted ρ = ( ρ j ; j ∈ S ) , where
ρ j = P( Z 0 = j ) . Then for any k ∈ S ,

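\[ \begin{aligned} P(Y_1 = k) &= P(X(\tau-) = k) \\ &= \sum_{j \in S} P(X(\tau-) = k \mid X(0) = j)\, P(X(0) = j) \\ &= \sum_{j \in S} \rho_j P_{jk}(\tau) = [\rho P(\tau)]_k. \end{aligned} \]

Thus the distribution of Y_1 is given by the vector-matrix product ρP(τ). Further,
for n ≥ 1,

\[ \begin{aligned} P(Y_{n+1} = k \mid Y_n = j) &= \sum_{\ell \in S} P(Y_{n+1} = k \mid Z_n = \ell, Y_n = j)\, P(Z_n = \ell \mid Y_n = j) \\ &= \sum_{\ell \in S} P_{\ell k}(\tau) R_{j\ell} = [R P(\tau)]_{jk}. \end{aligned} \]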

It follows that Y_1, Y_2, … is a discrete time Markov chain on S with transition matrix

Q = RP(τ).

On the other hand,

P( Z n +1 = k | Z n = j ) = ∑ P( Z
n +1 = k | Yn +1 = , Z n = j )

×P(Yn +1 =  | Z n = j )
= ∑P
j (τ ) Rk = [ P(τ ) R] jk .

Thus, our assumptions imply that Z0, Z1,… is a discrete time Markov chain on S
with transition matrix

T = P(τ ) R.
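
As a minimal numerical sketch of the two embedded chains, the following Python fragment computes P(τ) as the matrix exponential of τA and then forms Q = RP(τ) and T = P(τ)R. The two-state generator A, the repair matrix R and the values of λ and τ are hypothetical and chosen only for illustration. The helper names mat_mul and mat_exp are our own, and the matrix exponential is approximated by a truncated Taylor series with scaling and squaring rather than by any closed form.

```python
import math

def mat_mul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][l] * b[l][j] for l in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def mat_exp(a, t, terms=25, squarings=10):
    """Approximate exp(t*a) by a truncated Taylor series with scaling and squaring."""
    n = len(a)
    scaled = [[a[i][j] * t / 2 ** squarings for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term becomes scaled**k / k!
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

# Hypothetical two-state chain: state 0 = working, state 1 = failed (absorbing).
lam, tau = 0.1, 1.0
A = [[-lam, lam], [0.0, 0.0]]
R = [[1.0, 0.0], [1.0, 0.0]]   # assumption: PM always returns the system to state 0

P_tau = mat_exp(A, tau)        # P(tau) = exp(tau * A)
Q = mat_mul(R, P_tau)          # transition matrix of Y_1, Y_2, ...
T = mat_mul(P_tau, R)          # transition matrix of Z_0, Z_1, ...
```

In this toy case both rows of Q equal the first row of P(τ), while both rows of T equal (1, 0), which reflects that the state after PM is always the working state.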

10.4.2 Reliability Measures

The approach may now be used to compute interesting reliability measures.

Average Rate of Critical Failures

Let π = (π j , j ∈ S ) be the stationary distribution of the Markov chain Y1 , Y2 ,…, i.e.
π is the unique probability vector satisfying the equation

π Q ≡ π RP(τ ) = π .

For any subset G ⊂ S, define $\pi_G = \sum_{j \in G} \pi_j$. This is the expected relative number
of PM epochs, in the long run, where the system is found to be in G . Moreover,
1/π G is the mean time, in the long run, between visits to G (measured with time
unit τ ). These facts are well known from the theory of Markov chains (Taylor and
Karlin 1984, Chapter 4).
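
The stationary vector π and the quantity π_G can be approximated numerically. The sketch below uses plain power iteration, which assumes that the chain is irreducible and aperiodic; the 3 × 3 matrix Q and the choice G = {2} are hypothetical and serve only as an illustration, and the function name stationary is our own.

```python
def stationary(q, tol=1e-13, max_iter=100_000):
    """Row vector pi with pi Q = pi, found by power iteration.

    Assumes an irreducible, aperiodic chain so that the iteration converges.
    """
    n = len(q)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        new = [sum(pi[i] * q[i][j] for i in range(n)) for j in range(n)]
        if max(abs(x - y) for x, y in zip(new, pi)) < tol:
            return new
        pi = new
    raise RuntimeError("power iteration did not converge")

# Hypothetical transition matrix Q on three states; state 2 plays the role of G.
Q = [[0.90, 0.08, 0.02],
     [0.50, 0.40, 0.10],
     [1.00, 0.00, 0.00]]
G = {2}

pi = stationary(Q)
pi_G = sum(pi[j] for j in G)               # long-run fraction of PM epochs in G
mean_epochs_between_visits = 1.0 / pi_G    # measured in units of tau
```

For larger state spaces one would more likely solve the linear system πQ = π together with the normalization Σ π_j = 1 directly; power iteration is used here only because it needs no linear algebra library.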
Let in the following G be the subset of S defining the critical failure states of
the system. Then as in Hokstad and Frøvig (1996) we define the mean time between
critical failures to be

MTBFcrit = τ/π G

and the average rate of critical failures to be

λcrit = 1/MTBFcrit = π G /τ .

Critical Safety Unavailability

Consider a PM interval [nτ , (n + 1)τ ) . The expected relative amount of time in this
interval that the system is in a critical state, i.e. in G , is

$$U_n = \frac{1}{\tau} \int_{n\tau}^{(n+1)\tau} P(X(t) \in G)\, dt .$$

By our assumptions, X(t) behaves in the interval [nτ, (n + 1)τ) in the same manner
as if it were run in the interval [0, τ) and started in state Zn. Thus

$$U_n = \frac{1}{\tau} \int_0^{\tau} \sum_{j \in S} P_{jG}(t)\, P(Z_n = j)\, dt ,$$

where $P_{jG}(t) = \sum_{k \in G} P_{jk}(t)$.

Letting n tend to infinity, the probabilities P(Zn = j) tend to the limiting values γj
given by the stationary distribution γ = (γj) of the Markov chain Z0, Z1,… .
This distribution is found by solving the equations

γ T ≡ γ P(τ ) R = γ .

Following Hokstad and Frøvig (1996) we shall define the critical safety unavail-
ability (CSU) of the system by

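$$\mathrm{CSU} = \lim_{n \to \infty} U_n = \frac{1}{\tau} \int_0^{\tau} \sum_{j \in S} P_{jG}(t)\, \gamma_j\, dt = \sum_{j \in S} \gamma_j Q_j ,$$

where

$$Q_j = \frac{1}{\tau} \int_0^{\tau} P_{jG}(t)\, dt$$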

is the critical safety unavailability given that the system state is j at the beginning
of the PM interval.
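
The following sketch evaluates the CSU numerically for a hypothetical two-state chain with perfect repair at PM epochs, for which the answer is known in closed form: γ = (1, 0) and CSU = 1 − (1 − e^{−λτ})/(λτ). The conditional unavailabilities Q_j are approximated by composite Simpson's rule. All names (mat_exp, stationary, conditional_unavailability) and parameter values are our own assumptions and not part of the model description.

```python
import math

def mat_mul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][l] * b[l][j] for l in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def mat_exp(a, t, terms=25, squarings=10):
    """Approximate exp(t*a) by a truncated Taylor series with scaling and squaring."""
    n = len(a)
    scaled = [[a[i][j] * t / 2 ** squarings for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def stationary(m, tol=1e-13, max_iter=100_000):
    """Row vector g with g M = g, by power iteration (assumes convergence)."""
    n = len(m)
    g = [1.0 / n] * n
    for _ in range(max_iter):
        new = [sum(g[i] * m[i][j] for i in range(n)) for j in range(n)]
        if max(abs(x - y) for x, y in zip(new, g)) < tol:
            return new
        g = new
    raise RuntimeError("power iteration did not converge")

def conditional_unavailability(a, crit, tau, j, steps=200):
    """Q_j = (1/tau) * integral_0^tau P_jG(t) dt by composite Simpson's rule.

    `steps` must be even.
    """
    h = tau / steps
    total = 0.0
    for s in range(steps + 1):
        p_t = mat_exp(a, s * h)
        p_jg = sum(p_t[j][k] for k in crit)
        weight = 1 if s in (0, steps) else (4 if s % 2 else 2)
        total += weight * p_jg
    return total * h / 3.0 / tau

# Hypothetical example: state 0 = working, state 1 = critical failure.
lam, tau = 0.2, 5.0
A = [[-lam, lam], [0.0, 0.0]]
R = [[1.0, 0.0], [1.0, 0.0]]    # assumption: perfect repair at every PM epoch
G = (1,)

T = mat_mul(mat_exp(A, tau), R)
gamma = stationary(T)
q_cond = [conditional_unavailability(A, G, tau, j) for j in range(len(A))]
csu = sum(g * q for g, q in zip(gamma, q_cond))
```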

10.4.3 The Failure Model of Hokstad and Frøvig

As an illustration we shall reconsider the most general failure model of Hokstad
and Frøvig (1996), namely their Failure Mechanism III. Here the state space is

S = {O, D, K I , K II },

where O = the system is as good as new, D = the system has a failure classified as
degraded (noncritical), K I = the system has a failure classified as critical, caused
by a sudden shock, K II = the system has a failure classified as critical, caused by
the degradation process.
It is assumed that the Markov chain X (t ) is defined by the state diagram of
Figure 10.9, and thus has infinitesimal transition matrix

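$$A = \begin{bmatrix} -\lambda_d - \lambda_k & \lambda_d & \lambda_k & 0 \\ 0 & -\lambda_k - \lambda_{dk} & \lambda_k & \lambda_{dk} \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} \tag{10.15}$$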

Note that both K I and K II are absorbing states.

Figure 10.9. State diagram for the failure mechanism of Hokstad and Frøvig (1996)

The model assumes that no repairs are done in the time intervals between PM
epochs. Moreover, since A is upper triangular, we can obtain P(t) = e^{tA} rather
easily. It is clear that P(t ) can be written

$$P(t) = \begin{bmatrix} P_{OO}(t) & P_{OD}(t) & P_{OK_I}(t) & P_{OK_{II}}(t) \\ 0 & P_{DD}(t) & P_{DK_I}(t) & P_{DK_{II}}(t) \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

where expressions for the entries are found in Lindqvist and Amundrustad (1998).
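
Some of the entries are easy to check numerically. The sketch below compares a numerical matrix exponential of (10.15) with closed forms for P_OO, P_DD, P_OD, P_DK_I and P_DK_II. These closed forms are derived here directly from the generator by elementary integration, under the assumption λd ≠ λdk, and are not quoted from Lindqvist and Amundrustad (1998). The rate values and the helper names are hypothetical.

```python
import math

def mat_mul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][l] * b[l][j] for l in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def mat_exp(a, t, terms=25, squarings=10):
    """Approximate exp(t*a) by a truncated Taylor series with scaling and squaring."""
    n = len(a)
    scaled = [[a[i][j] * t / 2 ** squarings for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

# Hypothetical rates (per unit time); not taken from the source.
lam_d, lam_k, lam_dk = 0.02, 0.005, 0.05
t = 10.0

# Generator (10.15) with states ordered O, D, K_I, K_II.
A = [[-lam_d - lam_k, lam_d, lam_k, 0.0],
     [0.0, -lam_k - lam_dk, lam_k, lam_dk],
     [0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]
P = mat_exp(A, t)

alpha = lam_d + lam_k     # total rate out of O
beta = lam_k + lam_dk     # total rate out of D
p_oo = math.exp(-alpha * t)                      # no transition out of O by time t
p_dd = math.exp(-beta * t)                       # no transition out of D by time t
p_od = lam_d * (p_oo - p_dd) / (beta - alpha)    # requires lam_d != lam_dk
p_dk1 = lam_k / beta * (1.0 - p_dd)              # D absorbed in K_I by time t
p_dk2 = lam_dk / beta * (1.0 - p_dd)             # D absorbed in K_II by time t
```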
In practice it is of interest to quantify the effect of various forms of preventive
maintenance. This can be done in the presented framework by means of the repair
matrix R . Some examples are given below.
If all failures are repaired at PM epochs, then the PM always returns the system
back to state O , and we have

$$R = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix}$$

Next, if only critical failures are repaired at PM epochs, then the appropriate R
matrix is

$$R = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix}$$

More generally one may consider an extension of this by assuming that all
critical failures are repaired, while degraded failures are repaired with probability
1 − r and remain unrepaired with probability r , 0 ≤ r ≤ 1 . The repair strategy is
thus determined by the parameter r .
This clearly leads to the matrix

$$R = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 - r & r & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix}$$

A more general imperfect repair model can be defined by

$$R = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 - r & r & 0 & 0 \\ 1 - r_{k1} & 0 & r_{k1} & 0 \\ 1 - r_{k2} & 0 & 0 & r_{k2} \end{bmatrix}$$

Here r has the same meaning as before, while 1 − rk1 is the probability of success-
ful repair of a K I failure and 1 − rk2 is the corresponding probability for K II.
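
To illustrate how the repair parameter r enters the reliability measures, the sketch below computes the average rate of critical failures λcrit = π_G/τ for the one-parameter repair matrix (critical failures always repaired, degraded failures repaired with probability 1 − r). The rates, the PM interval τ and the grid of r values are hypothetical, and the helper functions are our own. The stationary distribution of Q = RP(τ) is approximated by power iteration.

```python
import math

def mat_mul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][l] * b[l][j] for l in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def mat_exp(a, t, terms=25, squarings=10):
    """Approximate exp(t*a) by a truncated Taylor series with scaling and squaring."""
    n = len(a)
    scaled = [[a[i][j] * t / 2 ** squarings for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def stationary(q, tol=1e-13, max_iter=100_000):
    """Row vector pi with pi Q = pi, by power iteration (assumes convergence)."""
    n = len(q)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        new = [sum(pi[i] * q[i][j] for i in range(n)) for j in range(n)]
        if max(abs(x - y) for x, y in zip(new, pi)) < tol:
            return new
        pi = new
    raise RuntimeError("power iteration did not converge")

def repair_matrix(r):
    """R when critical failures are always repaired and degraded ones w.p. 1 - r."""
    return [[1.0, 0.0, 0.0, 0.0],
            [1.0 - r, r, 0.0, 0.0],
            [1.0, 0.0, 0.0, 0.0],
            [1.0, 0.0, 0.0, 0.0]]

# Hypothetical parameters; states ordered O, D, K_I, K_II.
lam_d, lam_k, lam_dk, tau = 0.02, 0.005, 0.05, 10.0
A = [[-lam_d - lam_k, lam_d, lam_k, 0.0],
     [0.0, -lam_k - lam_dk, lam_k, lam_dk],
     [0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]
P_tau = mat_exp(A, tau)
G = (2, 3)   # critical states K_I and K_II

def critical_failure_rate(r):
    """lambda_crit = pi_G / tau for the chain Y_n with Q = R P(tau)."""
    pi = stationary(mat_mul(repair_matrix(r), P_tau))
    return sum(pi[j] for j in G) / tau

rates = {r: critical_failure_rate(r) for r in (0.0, 0.25, 0.5, 0.75, 1.0)}
```

With these inputs the rate increases with r, as one would expect: the less often degraded failures are repaired, the more often the system is found in a critical state at a PM epoch.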

10.5 Concluding Remarks

In the present chapter we have considered some aspects of the modeling and
analysis of repaired and maintained systems. Rather than giving a comprehensive
review of the field we have concentrated on a few points, partly chosen by the
interest of the author. It is believed, however, that the chapter touches some topics
that have to a certain degree been overlooked in much of reliability practice.
The first point concerns the use of the NHPP as the single model for repairable
systems with trend. Although this is appropriate in perhaps most cases, there are
cases where renewal effects caused by repair or maintenance destroy the random-
ness associated with Poisson processes. One way of checking NHPP models is to
embed them in larger models, and here the TRP can serve as a means of model
checking (see for example the consideration of maximum log likelihoods in the
examples of Section 10.2.5). Another way of extending the NHPP processes is via
the large class of imperfect repair models. The classical model is here the one
suggested by Brown and Proschan (1983) (see the review paper Lindqvist 2006 for
an introduction to the subsequent literature). Imperfect repair models combine two
basic ingredients, a hazard rate z (t ) of a new system together with a particular
repair strategy which governs a so called virtual age process. The idea is that the
virtual age of the system is reduced at repairs by a certain amount which depends
on the repair strategy. The extreme cases are the perfect repair (renewal) models
where the virtual age is set to 0 after each repair, and the minimal repair (NHPP)
models where the virtual age is not reduced at repairs and hence always equals the
actual age.
Second, we have put some emphasis on the consideration of possible hetero-
geneity between systems of the same kind. Recall our Example 2 based on data
from Bhattacharjee et al. (2003). The authors write in their conclusion: “The
heterogeneity of failure behaviour of safety related components, such as valves in
our case study, may have important implications for reliability analysis of safety
systems. If such heterogeneity is not identified and taken into account, the deci-
sions made to maintain or to enhance safety can be non-optimal or even erroneous.
This non-optimality is more serious if the safety related decisions are made on the
basis of failure histories of the components”. Still it is believed that heterogeneity
has been neglected in many reliability applications. In fact, analyses of reliability
data will often lead to an apparent decreasing failure rate which is counterintuitive
in view of wear and ageing effects. Proschan (1963) pointed out that such observed
decreasing rates could be caused by unobserved heterogeneity. Proschan presented
failure data from 17 air conditioner systems on Boeing 720 airplanes, concluding
that an HPP model was appropriate for each plane, but that the rates differed from
plane to plane. This is a classical example of heterogeneity in reliability. If times
between failures had been treated as independent and identically distributed across
planes, the conclusion would have been that these times between failures had a
decreasing failure rate.
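
The mechanism can be illustrated with a small calculation. If each system has exponential times between failures but the rate differs between systems, the pooled times follow a mixture of exponential distributions, whose hazard is decreasing even though each individual hazard is constant. The sketch below evaluates this mixture hazard; the three rates and the equal weights are hypothetical and are not Proschan's data.

```python
import math

def mixture_hazard(t, rates, weights):
    """Hazard at time t of a mixture of exponential lifetimes.

    h(t) = sum_i w_i * lam_i * exp(-lam_i t) / sum_i w_i * exp(-lam_i t)
    """
    density = sum(w * lam * math.exp(-lam * t) for lam, w in zip(rates, weights))
    survival = sum(w * math.exp(-lam * t) for lam, w in zip(rates, weights))
    return density / survival

rates = [0.5, 1.0, 4.0]            # hypothetical per-system failure rates
weights = [1 / 3, 1 / 3, 1 / 3]    # hypothetical mixing weights
times = [0.0, 0.5, 1.0, 2.0, 5.0, 10.0]
hazards = [mixture_hazard(t, rates, weights) for t in times]
```

At t = 0 the hazard equals the weighted mean of the rates, and for large t it approaches the smallest rate, since the survivors are increasingly dominated by the most reliable systems.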
It has long been known in biostatistics that neglecting individual heterogeneity
may lead to severe bias in estimates of lifetime distributions. The idea is that
individuals have different “frailties”, and that those who are most “frail” will die or
fail earlier than the others. This in turn leads to a decreasing population hazard,
which has often been misinterpreted in the same manner as mentioned for the
reliability applications. Important references on heterogeneity in the biostatistics
literature are Vaupel et al. (1979), Hougaard (1984) and Aalen (1988). It should be
noted that heterogeneity is in general unidentifiable if being considered an indi-
vidual quantity. For identifiability it is necessary that frailty is common to several
individuals, for example in family studies in biostatistics, or if several events are
observed for each individual, such as for the repairable systems considered in this
chapter. The presence of heterogeneity is often apparent for data from repairable
systems if there is a large variation in the number of events per system. However, it
is not really possible to distinguish between heterogeneity and dependence of the
intensity on past events for a single process.

The third point to be mentioned regards the use, or lack of use, of methods for
competing risks in reliability applications. The following is a citation from
Crowder (2004) appearing in the article on Competing Risks in the Encyclopedia of
Actuarial Science: “If something can fail, it can often fail in one of several ways
and sometimes in more than one way at a time. In the real world, the cause, mode,
or type of failure is usually just as important as the time to failure. It is therefore
remarkable that in most of the published work to date in reliability and survival
analysis there is no mention of competing risks. The situation hitherto might be
referred to as a lost case”. Fortunately, some work has been done recently in order
to include competing risks in the study of repaired and maintained systems. Much
of this work, partly reviewed in Section 10.3, has been motivated by the work of
Cooke (1996) and his collaborators. His point of departure was formulated in the
conclusion of Cooke (1996): “The main themes of Parts I and II of this article are
that current RDB (Reliability Data Bank) designs: 1. are not giving RDB users
what they need; 2. are not doing a good job of analyzing competing risk data; 3. are
not doing a good job in handling uncertainty. Improvements in all these areas are
possible. However, it must be acknowledged that the models and methods pre-
sented here merely scratch the surface. It is therefore appropriate to conclude with
a summary of open issues...”
The final section of the present chapter considers an example of an approach
which in some sense generalizes the competing risks issue, namely using Markov
chains to model failure mechanisms of various equipment.
The chapter has mostly considered the modeling of repairable systems, with
less mention of statistical methods. It is believed that much of future research on
maintenance of repairable systems will still be centered around modeling, possibly
with an increased emphasis on point process models including multiple types of
events (see for example Doyen and Gaudoin 2006). More detailed models of the
underlying failure and maintenance mechanisms may indeed be of great value for
planning and optimization of maintenance actions. On the other hand, the new
advances in modeling certainly lead to considerable statistical challenges. This
point was touched on by Cooke (1996) as cited above, and it is clear that the in-
formation in reliability databases could and should be handled by more sophisti-
cated methods than the ones that are traditionally used. Here there is much to learn
from the biostatistics literature where there has for a long time been an emphasis
on nonparametric methods and on regression methods using covariate information.

10.6 References
Aalen OO, (1988) Heterogeneity in survival analysis. Statistics in Medicine 7:1121–1137.
Andersen P, Borgan O, Gill R, Keiding N, (1993) Statistical Models Based on Counting
Processes. Springer, New York.
Ascher H, Feingold H, (1984) Repairable Systems – Modeling, inference, misconceptions
and their causes. Marcel Dekker, New York.
Bedford T, Cooke RM, (2001) Probabilistic Risk Analysis: Foundations and Methods.
Cambridge University Press, Cambridge.
Bhattacharjee M, Arjas E, Pulkkinen U, (2003) Modeling heterogeneity in nuclear power
plant valve failure data. In: Mathematical and Statistical Methods in Reliability
(Lindqvist BH, Doksum KA, eds.) World Scientific Publishing, Singapore, pp 341–353.
Brown M, Proschan F, (1983) Imperfect repair. Journal of Applied Probability 20:851–859.
Cook RJ, Lawless JF, (2002) Analysis of repeated events. Statistical Methods in Medical
Research 11:141–166.
Cooke RM, (1993) The total time on test statistics and age-dependent censoring. Statistics
and Probability Letters 18:307–312.
Cooke RM, (1996) The design of reliability databases, Part I and II. Reliability Engineering
and System Safety 51:137–146 and 209–223.
Crowder MJ, (2001) Classical competing risks. Chapman & Hall/CRC, Boca Raton.
Crowder MJ, (2004) Competing risks. In: Encyclopedia of actuarial science (Teugels JL,
Sundt B, eds.) Wiley, Chichester, pp 305–313.
Crowder MJ, Kimber AC, Smith RL, Sweeting TJ, (1991) Statistical Analysis of Reliability
Data. Chapman & Hall, Great Britain.
Doyen L, Gaudoin O, (2006) Imperfect maintenance in a generalized competing risks
framework. Journal of Applied Probability 43:825–839.
Follmann DA, Goldberg MS, (1988) Distinguishing heterogeneity from decreasing hazard
rate. Technometrics 30:389–396.
Hokstad P, Frøvig AT, (1996) The modelling of degraded and critical failures for compo-
nents with dormant failures. Reliability Engineering and System Safety 51:189–199.
Hougaard P, (1984) Life table methods for heterogeneous populations: Distributions
describing the heterogeneity. Biometrika 71:75–83.
Langseth H, Lindqvist BH, (2003) A maintenance model for components exposed to several
failure mechanisms and imperfect repair. In: Mathematical and Statistical Methods in
Reliability (Lindqvist BH, Doksum KA, eds.) World Scientific Publishing, Singapore,
pp 415–430.
Langseth H, Lindqvist BH, (2006) Competing risks for repairable systems: A data study.
Journal of Statistical Planning and Inference 136:1687–1700.
Lawless JF, (1987) Regression methods for Poisson process data. Journal of the American
Statistical Association 82:808–815.
Lindqvist BH, (2006) On the statistical modelling and analysis of repairable systems.
Statistical Science 21:532–551.
Lindqvist BH, Amundrustad H, (1998) Markov models for periodically tested components.
In: Safety and Reliability. Proceedings of the European Conference on Safety and
Reliability - ESREL ’98 (Lydersen S, Hansen GK, Sandtorv HA, eds.) AA Balkema,
Rotterdam, pp 191–197.
Lindqvist BH, Langseth H, (2005) Statistical modelling and inference for component failure
times under preventive maintenance and independent censoring. In: Modern Statistical
and Mathematical Methods in Reliability (Wilson A, Limnios N, Keller-McNulty S,
Armijo Y, eds.) World Scientific Publishing, Singapore, pp 323–337.
Lindqvist BH, Elvebakk G, Heggland K, (2003) The trend-renewal process for statistical
analysis of repairable systems. Technometrics 45:31–44.
Lindqvist BH, Støve B, Langseth H, (2006) Modelling of dependence between critical
failure and preventive maintenance: The repair alert model. Journal of Statistical
Planning and Inference 136:1701–1717.
Meeker WQ, Escobar LA, (1998) Statistical methods for reliability data. Wiley, New York.
Nelson W, (1995) Confidence limits for recurrence data – applied to cost or number of
product repair. Technometrics 37:147–157.
Peña EA, (2006) Dynamic modelling and statistical analysis of event times. Statistical
Science 21:487–500.
Proschan F, (1963) Theoretical explanation of observed decreasing failure rates.
Technometrics 5:375–383.
Rausand M, Høyland A, (2004) System reliability theory: Models, statistical methods, and
applications. 2nd ed. Wiley-Interscience, Hoboken, N.J.
Ross SM, (1983) Stochastic Processes. Wiley, New York.
Taylor HM, Karlin S, (1984) An introduction to stochastic modeling. Academic Press,
Vaupel JW, Manton KG, Stallard E, (1979) The impact of heterogeneity in individual frailty
on the dynamics of mortality. Demography 16:439–454.

Optimal Maintenance of Multi-component Systems: A Review

Robin P. Nicolai and Rommert Dekker

11.1 Introduction
Over the last few decades the maintenance of systems has become more and more
complex. One reason for this is that systems consist of many components which
depend on each other. On the one hand, interactions between components compli-
cate the modelling and optimization of maintenance. On the other hand, interactions
also offer the opportunity to group maintenance which may save costs. It follows
that planning maintenance actions is a big challenge and it is not surprising that
many scholars have studied maintenance optimization problems for multi-compo-
nent systems. In some articles new solution methods for existing problems are
proposed, in other articles new maintenance policies for multi-component systems
are studied. Moreover, the number of papers with practical applications of optimal
maintenance of multi-component systems is still growing.
Cho and Parlar (1991) give the following definition of multi-component
maintenance models: “Multi-component maintenance models are concerned with
optimal maintenance policies for a system consisting of several units of machines
or many pieces of equipment, which may or may not depend on each other
(economically/stochastically/structurally).” These models thus aim to find an
optimal maintenance plan for systems consisting of components that interact
with each other. We will come back later to the concepts of optimality and interaction.
For now it is important to remember that the condition of the systems depends
on (the state of) the components which will only function if adequate maintenance
actions are performed.
In this chapter we will give an up-to-date review of the literature on multi-
component maintenance optimization. Let us start with a brief summary of the
overview articles that have appeared in the past. Cho and Parlar (1991) review
articles from 1976 to 1991. The authors divide the literature into five topical
categories: machine-interference/repair models, group/block/cannibalization/oppor-
tunistic models, inventory/maintenance models, other maintenance/replacement
models and inspection/maintenance models. Dekker et al. (1996) deal exclusively
with multi-component maintenance models that are based on economic dependence.
Emphasis is put on articles that have been published after 1991, but there is an over-
lap with the review of Cho and Parlar (1991). The classification scheme of Dekker
et al. (1996) differs from that of Cho and Parlar (1991). First, models are classified
based on the planning aspect of the model: stationary (long-term) and dynamic
(short-term). Second, the stationary-grouping models are divided into the categories
grouping corrective maintenance, grouping preventive maintenance and opportunis-
tic grouping maintenance. Here, opportunistic grouping is grouping both preventive
and corrective maintenance. The dynamic grouping models are divided into two
categories: those with a finite horizon and those with a rolling horizon. In a recent
article Wang (2002) gives an overview of maintenance policies of deteriorating
systems. The emphasis is on policies for single component systems. One section is
devoted to opportunistic maintenance policies for multi-component systems. The
author primarily considers models with economic dependence.
The existing review articles indicate that there are several ways to categorize
articles and models. In Section 11.2 of this chapter we structure the field and
present our comprehensive classification scheme. It differs from the schemes used
in the review articles discussed earlier. First of all, we distinguish between models
with economic, structural and stochastic dependence. Economic dependence
implies that grouping maintenance actions either saves costs (economies of scale) or
results in higher costs (because of, e.g. high down-time costs), as compared to individual
maintenance. Stochastic dependence occurs if the condition of components
influences the lifetime distribution of other components. Structural dependence
applies if components structurally form a part, so that maintenance of a failed
component implies maintenance of working components. In Sections 11.3–11.5,
we discuss papers concerning economic, stochastic and structural dependence be-
tween components.
In Section 11.6 we classify articles according to the planning aspect of the
maintenance model and the method used to optimize the model. Following the
review of Dekker et al. (1996) we distinguish between models with finite and
infinite planning horizons. Models with an infinite planning horizon are called
stationary, since they usually provide static rules for maintenance, which do not
change over the planning horizon. Finite horizon models are called dynamic, since
these models can generate dynamic decisions that may change over the planning
horizon. In these models short-term information can be taken into account. With
respect to the optimization methods, we divide the papers into three categories:
exact, heuristic and policy optimization.
Section 11.7 covers trends and open research areas in multi-component main-
tenance. Conclusions are drawn in Section 11.8.

11.2 Structuring the Field

In Section 11.2.1 we give a short review of the terminology used in multi-compo-
nent maintenance optimization models and explain how we searched the literature.
In Section 11.2.2 we present our comprehensive classification scheme.

11.2.1 Search Strategy and Terminology

Presenting a scientific review on a certain topic implies that one tries to discuss all
relevant articles. Finding these articles, however, is very difficult. It depends on the
search engines and databases used, electronic availability of articles and the search
strategy. We used Google Scholar, Scirus and Scopus as search engines, and used
ScienceDirect, JStor and MathSciNet as (online) databases. We primarily searched
on key words, abstracts and titles, but we also searched within the papers for
relevant references. Note that papers published in books or proceedings that are not
electronically available are likely to have been missed.
Terminology is another important issue, as the use of other terms can hide a
very interesting paper. The field has been delineated by maintenance, replacement
or inspection on one hand and optimization on the other. This combination, how-
ever, provides almost 5000 hits in Google Scholar.
Next, the term multi-component has been used in conjunction with related terms such as
opportunistic maintenance (policies), piggyback(ing), joint replacement, joint
overhaul, combining maintenance, grouping maintenance, economies of scale and
economic dependence. With respect to the term stochastic dependence, we have
also searched for synonyms and related terms such as failure interaction,
probabilistic dependence and shock damage interaction. This yields approximately
500 hits. Relevant articles have been selected from this set by scanning the articles.
The vast literature on maintenance of multi-component systems has been
reviewed earlier by others. Therefore, we have also consulted existing reviews and
overview articles in this field. Moreover, we have applied a citation search
(looking both backwards and forwards in time for citations) to all articles found.
This citation search is an indirect search method, whereas the above methods are
direct methods. The advantage of this method is that one can easily distinguish
clusters of related articles.

11.2.2 Classification Scheme

First of all, we classify the multi-component maintenance models on the basis of
the dependence/interaction between components in the system considered. Thomas
(1986) defines three different types of interactions: economic, structural and sto-
chastic dependence.
Simply put, economic dependence implies that the cost of joint maintenance of
a group of components does not equal the total cost of individual maintenance of
these components. The effect of this dependence comes to the fore in the execution
of maintenance activities. On the one hand, the joint execution of maintenance
activities can save costs in some cases (e.g. due to economies of scale). On the
other hand, grouping maintenance may also lead to higher costs (e.g. due to
manpower restrictions) or may not be allowed. For this reason, we will subdivide
the models with economic dependence into two categories: positive and negative
economic dependence. That is, we refine the definition of economic dependence as
compared to the definition used in the review article of Dekker et al. (1996). Note
that in many systems both positive and negative economic dependence between
components are present. We give special attention to the modelling of maintenance
optimization of these systems, in particular the k-out-of-n system.
Stochastic dependence occurs if the condition of components influences the
lifetime distribution of other components. Synonyms of stochastic dependence are
failure interaction or probabilistic dependence. This kind of dependence defines a
relationship between components upon failure of a component. For example, it
may be the case that the failure of one component induces the failure of other
components or causes a shock to other components.
Structural dependence applies if components structurally form a part, so that
maintenance of a failed component implies maintenance of working components,
or at least dismantling them. So, structural dependence restricts the maintenance
manager in his decision on the grouping of maintenance activities.
A second classification of the models is based on the planning aspect:
stationary or dynamic. That is, do we make a short-term/operational or a long-term/
strategic planning for the maintenance activities? Is the planning horizon finite or
infinite? In stationary models, a long-term stable situation is assumed and mostly
these models assume an infinite planning horizon. Models of this kind provide
static rules for maintenance, which do not change over the planning horizon. They
generate for example long-term maintenance frequencies for groups of related
activities or control-limits for carrying out maintenance depending on the state of
components. In dynamic grouping models, short-term information such as a
varying deterioration of components or unexpected opportunities can be taken into
account. These models generate dynamic decisions that may change over the
planning horizon.
The last classification we consider is based on the type of optimization method
used. This can be an exact method, a heuristic or a search within classes of
policies. Exact optimization methods are designed to find the real optimal solution
of a problem. However, if the computing time of the optimization method increases
exponentially with the number of components, then exact methods are only
desirable to a certain extent. In that case solving problems with many components
is impossible and heuristics should be used. Heuristics are local optimization
methods that do not pretend to find the global optimum, but can be applied to find
a solution to the problem in reasonable time. The quality of such a solution
depends on the problem instance. In some cases it is possible to give an upper
bound on the gap between the optimal solution and the solution found by the
heuristic.
In many papers a maintenance planning is made by optimizing a certain type of
policy. Well known maintenance policies are the age and block replacement
policies and their extensions. The advantage of policy optimization over other
optimization methods is that it gives more insight into the solution of the problem.
Note that policy optimization will not always result in the global optimal solution,
since there may be another policy that results in a better solution. In some cases
however, it can be proved that applying a certain maintenance policy results in the
exact (global) optimal solution.

11.3 Economic Dependence

In this section we review articles on multi-component systems with economic
dependence. We focus on articles appearing since the review of Dekker et al.
(1996). In Sections 11.3.1 and 11.3.2 we discuss models with positive and negative
dependence, respectively. In Section 11.3.3 we discuss articles on k-out-of-n
systems, in which both positive and negative dependence between components are
present.

11.3.1 Positive Economic Dependence

Positive economic dependence implies that costs can be saved when several
components are jointly instead of separately maintained. Compared with the review
of Dekker et al. (1996) we refine the concept of (positive) economic dependence
and distinguish the following forms:
• Economies of scale
– General
– Single set-up
– Multiple set-ups
o Hierarchy of set-ups
• Downtime opportunity
The term economies of scale is often used to indicate that combining mainten-
ance activities is cheaper than performing maintenance on components separately.
The term economies of scale is very general and it seems to be similar to positive
economic dependence. In this chapter we will speak of economies of scale when the
maintenance cost per component decreases with the number of maintained com-
ponents. Economies of scale can result from preparatory or set-up activities that can
be shared when several components are maintained simultaneously. The cost of this
set-up work is often called the set-up cost. Set-up costs can be saved when main-
tenance activities on different components are executed simultaneously, since exe-
cution of a group of activities requires only one set-up.
In this overview we distinguish between single set-ups and multiple set-ups. In
the latter case there usually is a hierarchy of set-ups. For instance, consider a
system consisting of two components, which both consist of two subcomponents.
Maintenance of the subcomponents of the components may require a set-up at
system level and component level. First, this means that the set-up cost at com-
ponent level is paid only once when the maintenance of two subcomponents of a
component is combined. Second, the set-up cost at system level is paid only once
when all subcomponents are maintained at the same time. Set-up costs usually
come back in the objective function of the maintenance problem. If economies of
scale are not explicitly modelled by including set-up costs in the objective func-
tion, then we classify the model in the category ‘general’.
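
The effect of a hierarchy of set-ups can be made concrete with the two-component example above. In the sketch below, a joint maintenance action pays the system-level set-up cost once, the component-level set-up cost once per component involved, and the individual cost of each subcomponent maintained. All cost figures and names (SYSTEM_SETUP, COMPONENT_SETUP, SUBCOMPONENT_COST, group_cost) are hypothetical and serve only as an illustration.

```python
# Hypothetical costs; the structure follows the two-component example in the text.
SYSTEM_SETUP = 100.0
COMPONENT_SETUP = {"A": 30.0, "B": 40.0}
SUBCOMPONENT_COST = {("A", 1): 10.0, ("A", 2): 12.0,
                     ("B", 1): 8.0, ("B", 2): 15.0}

def group_cost(group):
    """Cost of maintaining all subcomponents in `group` in one joint action."""
    if not group:
        return 0.0
    components = {component for component, _ in group}
    return (SYSTEM_SETUP                                      # paid once per action
            + sum(COMPONENT_SETUP[c] for c in components)     # once per component
            + sum(SUBCOMPONENT_COST[s] for s in group))       # individual work

all_subcomponents = list(SUBCOMPONENT_COST)

# Every subcomponent maintained in its own action: both set-ups paid each time.
separate = sum(group_cost([s]) for s in all_subcomponents)

# One action per component: component set-up shared, system set-up paid twice.
per_component = sum(group_cost([s for s in all_subcomponents if s[0] == c])
                    for c in COMPONENT_SETUP)

# One action for the whole system: every set-up cost paid exactly once.
joint = group_cost(all_subcomponents)
```

With these figures the total cost drops from 585 (separate) to 315 (grouped per component) to 215 (fully joint), so the cost per maintained subcomponent decreases with the size of the group.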
Another form of positive dependence is the downtime opportunity. Component
failures can often be regarded as opportunities for preventive maintenance of non-
failed components. In a series system a component failure results in a non-operating
system. In that case it may be worthwhile to replace other components preventively
at the same time. This way the system downtime results in cost savings since more
components can be replaced at the same time. Moreover, by grouping corrective and
preventive maintenance the downtime can be regulated and in some cases it can
even be reduced. Note that if the downtime cost is included in the set-up cost in a
certain paper, then we will not classify the paper in the category ‘downtime oppor-
tunity’, but in the category ‘set-up cost’. In general, however, it is difficult to assess
the cost associated with the downtime (see, e.g. Smith and Dekker (1997), who
approximate the availability and the cost of downtime for a 1-out-of-n system).
Therefore, the downtime cost is usually not included in the set-up cost.
In the paragraphs below we discuss articles dealing with positive economic
dependence. Our main focus is on the modelling of this dependence.

Economies of Scale
In comparison with Dekker et al. (1996) the category ‘general economies of scale’ is
new. The papers in this category deal with multi-component systems for which joint
maintenance of components is cheaper than individual maintenance of components.
This form of economies of scale cannot be modelled by introducing a single set-up
cost. The cost associated with the maintenance of components is often concave in the
number of components that are maintained simultaneously.
Dekker et al. (1998a) evaluate a new maintenance concept for the preservation
of highways. In road maintenance cost savings can be realized by maintaining
larger sections instead of small patches. The road is divided into sectors of 100-m
length. Set-up costs are present in the form of the direct costs associated with the
maintenance of different parts of the road. The set-up cost is a function of the
number of these parts in a maintenance group. A heuristic search procedure is pro-
posed to find the optimal maintenance planning.
Papadakis and Kleindorfer (2005) introduce the concept of network topology
dependencies (NTD) for infrastructure networks. In these networks two types of
NTD can be distinguished: contiguity and set-up discounts. Both types define
positive economic dependence between components. In the former case savings are
realized when costs are paid once when contiguous sections are maintained at the
same time. In the latter case savings are realized when costs are paid once for a
neighbourhood of the infrastructure network, independently of how much work is
carried out on it. For both types of dependencies a non-linear discount function is
defined. The authors consider the problem of maintaining an infrastructure network.
It is modelled as an undirected network. Risk measures or failure probabilities for
the segments of this network are assumed to be known. A maximum flow minimum
cut formulation of the problem is developed. This formulation makes it easier to
solve the problem exactly and efficiently.

Single Set-up
Nearly all articles reviewed by Dekker et al. (1996) fall into this category. The
objective function of the maintenance optimization model usually consists of a
fixed cost (the set-up cost) and variable costs. In the articles discussed below, this
will not be different.
Castanier et al. (2005) consider a two-component series system. Economic
dependence between the two components is present in the following way. The set-
up cost for inspecting or replacing a component is charged only once if the actions
on both components are combined. That is, joint maintenance of components saves
costs. In this article the condition of the components is modelled by a stochastic
process and it is monitored by non-periodic inspections. In the opportunistic
maintenance policy several thresholds are defined for doing inspections, corrective
and preventive replacements, and opportunistic maintenance. These thresholds are
decision variables. Many articles on this type of model have appeared, but most of
these articles only consider single-component models.
The articles of Scarf and Deara (1998, 2003) consider both economic and
stochastic dependence between components in a series system. This combination is
scarce in the literature. Positive economic dependence is modelled on the basis that
the cost of replacement of one or more components includes a one-off set-up cost
whose magnitude does not depend on the number of components replaced. We will
discuss these articles in more detail in Section 11.4.
In one of the few case studies found in the literature, Van der Duyn Schouten et
al. (1998) investigate the problem of replacing light bulbs in traffic control signals.
Each installation consists of three compartments for the green, red, and yellow
lights. Maintenance of light bulbs means replacement, either correctively or
preventively. First, positive economic dependence is present in the form of set-up
cost, because each replacement action requires a fixed cost in the form of
transportation of manpower and equipment. Second, the failure of individual bulbs
is an opportunity for doing preventive maintenance on other bulbs. The authors
propose two types of maintenance policies. In the first policy, also known as the
standard indirect-grouping strategy (introduced in maintenance by Goyal and Kusy
1985; for a review of this strategy we refer to Dekker et al. 1996), corrective and
preventive replacements are strictly separated. Economies of scale can thus only be
achieved by combining preventive replacements of the bulbs. The authors also
propose the following opportunistic age-based grouping policy. Upon failure of a
light bulb, the failed bulbs and all other bulbs older than a certain age are replaced.
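
The opportunistic age-based grouping rule can be sketched as a simple selection step. The function name, the representation of bulbs by index, and the threshold value are assumptions made for illustration; they are not taken from Van der Duyn Schouten et al. (1998).

```python
def bulbs_to_replace(ages, failed, age_threshold):
    """Indices of the bulbs replaced during a failure-triggered visit.

    ages: current age of each bulb (any consistent time unit).
    failed: set of indices of failed bulbs.
    age_threshold: hypothetical critical age; non-failed bulbs older
    than this are replaced opportunistically.
    """
    return sorted(i for i, age in enumerate(ages)
                  if i in failed or age > age_threshold)


# Hypothetical installation with three bulbs; bulb 0 has just failed.
print(bulbs_to_replace([10, 400, 900], failed={0}, age_threshold=500))
```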
Budai et al. (2006) consider a preventive maintenance scheduling problem
(PMSP) for a railway system. In this problem (short) routine activities and (long)
unique projects for one track have to be scheduled in a certain period. To reduce
costs and inconvenience for the travellers and operators, these activities should be
scheduled together as much as possible. With respect to the latter, maintenance of
different components of one track simultaneously requires only one track possession.
Time is discretized and the PMSP is written as a mixed-integer linear programming
model. Positive dependence is taken into account by the objective function, which is
the sum of the total track possession cost and the maintenance cost over a finite
horizon. To reduce possible end-of-horizon effects an end-of-horizon valuation is
also incorporated in the objective function. Note that the possession cost can be seen
as a downtime cost. The cost is modelled as a fixed set-up cost, which is the reason
that the article is classified in this category. Besides this positive dependence there also exists
negative dependence between components, since some activities exclude each other.
The advantage of a discrete time model is that negative dependence can be
incorporated in the model by adding restrictions. It appears that the PMSP
is an NP-hard problem. Heuristics are proposed to find near-optimal solutions in
reasonable time.

Multiple Set-ups
This is also a new category. The maintenance of different components may require
different set-up activities. These set-up activities may be combined when several
components are maintained at the same time. We have found one article in this
category; it assumes a complex hierarchical set-up structure.
Van Dijkhuizen (2000) studies the problem of clustering preventive main-
tenance jobs in a multiple set-up multi-component production system. As far as the
authors know, this is the first attempt to model a maintenance problem with a
hierarchical (tree-like) set-up structure. Different set-up activities have to be done
at different levels in the production system before maintenance can be done. Each
component is maintained preventively at an integer multiple of a certain basic
interval, which is the same for all components, and corrective maintenance is
carried out in between whenever necessary. So, every component has its own
maintenance frequency — the frequencies are based on the optimal maintenance
planning for single components. Obviously, set-up activities may be combined
when several components are maintained at the same time. The problem is to find
the maintenance frequencies that minimize the average cost per unit of time. This
problem is an extension of the standard indirect-grouping problem (for an
overview of this problem see Dekker et al. 1996).

Downtime Opportunity
As we stated earlier, the downtime of a system is often an opportunity to combine
preventive and corrective maintenance. This is especially true for series systems,
where a single failure results in a system breakdown. Of course, non-failed compo-
nents should not be replaced when they are in a good condition, because useful
lifetime can be wasted. The maintenance policies proposed in the articles discussed
below use this idea.
Gürler and Kaya (2002) propose a new opportunistic maintenance policy for a
series system with identical items. The article is an extension of the work by Van
der Duyn Schouten and Vanneste (1993), who also propose an opportunistic policy
for such a system. In their model, the lifetime of the components is described by
several stages, which are classified as good, doubtful, preventive maintenance due
and failed. Gürler and Kaya (2002) classify the stages in the same way, but the
stages good and doubtful are subdivided into a number of states. The proposed
policy is of the control-limit type. Components which are PM due (failed) are
preventively (correctively) replaced immediately. The entire system is replaced
when a component is PM due or down and the number of components in doubtful
states is at least N. Here, N is a decision variable. It appears that this policy
achieves significant savings over a policy where the components are maintained
individually without any system replacement.
Popova and Wilson (1999) consider m-failure, T-age and (m,T) failure group
policies for a system of identical components operating in parallel. According to
these policies the system is replaced at the time of the m-th failure, every T time
units, and at the minimum time of these events, respectively. These policies were
first introduced by Assaf and Shanthikumar (1987), Okumoto and Elsayed (1983)
and Ritchken and Wilson (1990), respectively. Popova and Wilson (1999) assume
that downtime costs are incurred when failed components are not repaired or
replaced. So, when the system operates there is also negative dependence between
the components. After all, when the components are left in a failed condition, with
the intention to group corrective maintenance, then downtime costs are incurred. In
the maintenance policies a trade-off between the downtime costs and the advan-
tages of grouping (corrective) maintenance is made.
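
The three group policies differ only in the event that triggers system replacement, which the following sketch makes explicit. It assumes the failure times of one replacement cycle are given as a list; the function name and interface are hypothetical.

```python
import math


def replacement_time(failure_times, m=None, T=None):
    """Time at which the whole system is replaced in one cycle.

    m only     -> m-failure policy (replace at the m-th failure)
    T only     -> T-age policy (replace every T time units)
    m and T    -> (m, T) policy (whichever event occurs first)
    """
    if m is None and T is None:
        raise ValueError("specify m, T, or both")
    ordered = sorted(failure_times)
    # Time of the m-th failure, or infinity if fewer than m failures occur.
    t_m = ordered[m - 1] if m is not None and len(ordered) >= m else math.inf
    t_age = T if T is not None else math.inf
    return min(t_m, t_age)


failures = [3.0, 7.5, 9.0, 12.0]  # hypothetical failure epochs
print(replacement_time(failures, m=2))
print(replacement_time(failures, T=8.0))
print(replacement_time(failures, m=3, T=8.0))
```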
Sheu and Jhang (1996) propose a new two-phase opportunistic maintenance
policy for a group of independent identical repairable units. Their model takes into
account downtime costs and the maintenance policy includes minimal repair,
overhaul, and replacement. In the first phase, (0,T], minor failures are removed by
minimal repairs and ‘catastrophic’ failures by replacements. In the second phase,
(T,T+W], minor failures are also removed by minimal repairs, but ‘catastrophic’
failures are left idle. Group maintenance is conducted at time T+W or when the k-th
unit is left idle, whichever comes first. The generalized group maintenance policy requires
inspection at either the fixed time T+W or the time when exactly k units are left
idle, whichever comes first. At an inspection, all idle components are replaced with
new ones and all operating components are overhauled so that they become as
good as new.
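
The two-phase rule described above can be sketched as two small functions: one choosing the action for an individual failure, and one giving the time of group maintenance. The function names and string labels are hypothetical, and costs are deliberately left out.

```python
def failure_action(t, kind, T, W):
    """Action taken for a failure of the given kind at time t."""
    if not 0 < t <= T + W:
        raise ValueError("t must lie in (0, T + W]")
    if kind == "minor":
        return "minimal repair"          # same in both phases
    if kind == "catastrophic":
        # First phase (0, T]: replace; second phase (T, T + W]: leave idle.
        return "replace" if t <= T else "leave idle"
    raise ValueError("kind must be 'minor' or 'catastrophic'")


def group_maintenance_time(catastrophic_times, T, W, k):
    """Time of group maintenance: T + W or the k-th idle unit, if earlier."""
    idle = sorted(t for t in catastrophic_times if T < t <= T + W)
    return idle[k - 1] if len(idle) >= k else T + W


print(failure_action(12, "catastrophic", T=10, W=5))
print(group_maintenance_time([4, 11, 13.5, 14], T=10, W=5, k=2))
```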
Higgins (1998) studies the problem of scheduling railway track maintenance
activities and crews. In this problem positive economic dependence is present in the
following way. The occupancy of track segments due to maintenance prevents all
train movements on those segments. The costs associated with this can be regarded
as downtime costs. The maintenance scheduling problem is modelled as a large scale
0-1 programming problem with many (non-linear) restrictions. The objective is to
minimize expected interference delay with the train schedule and prioritized finishing
time. The downtime costs are modelled by including downtime probabilities in the
objective function. The author proposes tabu search to solve the problem. The
neighbourhood, which plays a prominent role in local search techniques, is easily de-
fined by swapping the order of activities or maintenance crews.
The article of Sriskandarajah et al. (1998) discusses the maintenance scheduling
of rolling stock. Multiple train units have to be overhauled before a certain due date.
The aim is to find a suitable common due date for each train so that the due dates of
individual units do not deviate too much from the common due date. Maintenance
carried out too early or too late is costly since this may cause loss of use of a train.
A genetic algorithm is proposed to solve this scheduling problem.

11.3.2 Negative Economic Dependence

Negative economic dependence between components occurs when maintaining
components simultaneously is more expensive than maintaining components indi-
vidually. There can be several reasons for this:
• Manpower restrictions
• Safety requirements
• Redundancy/production-loss
First, grouping maintenance results in a peak in manpower needs. Manpower
restrictions may even be violated and additional labour needs to be hired, which is
costly. The problem here is to find the balance between workload fluctuation and
grouping maintenance.
Second, there are often restrictions on the use of equipment when executing
maintenance activities simultaneously. For instance, use of equipment may hamper
use of other equipment and cause unsafe operations. Legal and/or safety require-
ments often prohibit joint operation.
Third, joint (corrective) maintenance of components in systems in which some
kind of redundancy is available may not be beneficial. Although there may exist
economies of scale through simultaneous repair of a number of (identical) com-
ponents, leaving components in a failed condition for some time increases the risk
of costly production losses. We will come back to this in Section 11.3.3. Produc-
tion loss may increase more than linearly with the number of components out of
operation. For an example of this type of economic dependence we refer to Stengos
and Thomas (1980). The authors give an example of the maintenance of blast
furnaces. The disturbance due to maintenance grows substantially with the number
of furnaces that are out of operation. That is, the cost of overhauling the furnaces
increases more than linearly with the number of furnaces out of action.
It appears that maintenance of systems with negative dependence is often
modelled in discrete time. The models can be regarded as scheduling problems with
many restrictions. These restrictions can easily be incorporated in discrete time
models such as (mixed) integer programming models. With respect to these models,
there is always the question whether the exact solution can be found efficiently. In
other words, the question arises whether the problem is NP-hard. An example of
discrete time modelling is given by the article of Grigoriev et al. (2006). In this
article the so-called periodic maintenance problem (PMP) is studied. In this problem
machines have to be serviced regularly to prevent costly production losses. The
failures causing these production losses are not modelled. Time is discretized into
unit-length periods. In each period at most one machine can be serviced. Apparently,
negative economic dependence in the form of manpower restrictions or safety
measures plays a role in the maintenance of the machines. The problem is to find a
cyclic maintenance schedule of a given length T that minimizes total service and
operating costs. The operating costs of a machine increase linearly with the number
of periods elapsed since last servicing that machine. PMP appears to be an NP-hard
problem and the authors propose a number of solution methods. This leads to the
first exact solutions for larger sized problems.
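
To make the PMP objective concrete, the sketch below evaluates the cost of one cycle of a given cyclic schedule. It assumes a particular cost convention that is not spelled out above: the operating cost of a machine in a period equals a machine-specific rate times the number of periods elapsed since its last service, and is zero in the service period itself. All names and numbers are hypothetical.

```python
def cyclic_schedule_cost(schedule, service_cost, operating_rate):
    """Service plus operating cost of one cycle of a repeating schedule.

    schedule: list of length T; entry t is the machine serviced in
    period t, or None. One entry per period enforces the restriction
    that at most one machine is serviced at a time.
    """
    T = len(schedule)
    total = 0
    for machine, rate in operating_rate.items():
        service_periods = [t for t, m in enumerate(schedule) if m == machine]
        if not service_periods:
            raise ValueError(f"machine {machine!r} is never serviced")
        total += service_cost[machine] * len(service_periods)
        for t in range(T):
            # Periods since the most recent service, wrapping around the cycle.
            elapsed = min((t - s) % T for s in service_periods)
            total += rate * elapsed
    return total


schedule = ["A", "B", "A", None]  # hypothetical cycle of length T = 4
cost = cyclic_schedule_cost(schedule, {"A": 3, "B": 2}, {"A": 2, "B": 1})
print(cost)
```

Evaluating a schedule is easy; the hard part of the PMP is searching over all feasible cyclic schedules, which this sketch does not attempt.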
In Stengos and Thomas (1980) time is also discretized but the maintenance
problem, scheduling the overhaul of two pieces of equipment, is set up as a
Markov decision process. The pieces can be in different states and the probability
of failure increases with the time since the last overhaul. So in comparison with the
problem of Grigoriev et al. (2006), pieces can fail during operation. Negative
economic dependence is modelled as follows. The cost of overhauling the pieces
increases more than linearly with the number of pieces out of action. The objective
is to minimize the ‘loss of production’ cost, which is incurred when a piece is
overhauled. The optimal policy is found by a relative value successive approxima-
tion algorithm.
In Langdon and Treleaven (1997) the problem of scheduling maintenance for
electrical power transmission networks is studied. There is negative economic
dependence in the network due to redundancy/production-loss. Grouping certain
maintenance activities in the network may prevent a cheap electricity generator
from running, so requiring a more expensive generator to be run in its place. That
is, some parts of the network should not be maintained simultaneously. These
exclusions are modelled by adding restrictions to the MIP formulation of the prob-
lem. The authors propose several genetic algorithms and other heuristics to solve
the problem.

11.3.3 k-out-of-n Systems

In this section we discuss the different dependencies in the k-out-of-n system in
more detail. This system is a typical example of a system with both positive and
negative economic dependence between components. A k-out-of-n system func-
tions if at least k components function. If k = 1, then it is a parallel system; if k = n,
then it is a series system. Let us for the moment distinguish between the cases k = n
and k < n.
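
For illustration, the effect of redundancy can be quantified under the simplifying assumption that the n components are identical and function independently, each with probability p. The system then functions with the binomial tail probability computed below. This independence assumption is made only for the sketch; it does not hold for the models with stochastic dependence discussed in Section 11.4.

```python
from math import comb


def k_out_of_n_reliability(k, n, p):
    """Probability that at least k of n independent components function,
    each functioning with probability p."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


p = 0.9  # hypothetical component reliability
series = k_out_of_n_reliability(3, 3, p)      # k = n
two_of_three = k_out_of_n_reliability(2, 3, p)
parallel = k_out_of_n_reliability(1, 3, p)    # k = 1
print(series, two_of_three, parallel)
```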
In the series system (k = n), there is positive economic dependence due to
downtime opportunities. The failure of one component results in an expensive
downtime of the system and this time can be used to group preventive and correc-
tive maintenance. Negative economic dependence is not explicitly present in the
series system.
If k < n, then there is redundancy in the system and it fails less often than its
individual components. This way a specified reliability can be guaranteed. Typi-
cally, the components of this system are identical which allows for economies of
scale in the execution of maintenance activities. It is not only possible to obtain
savings by grouping preventive maintenance, but also by grouping corrective
maintenance. Note that the latter form of grouping is not advantageous in series
systems. In other words, the redundant components introduce additional positive
dependence in the system. Whereas positive economic dependence is present upon
failure of a component, negative economic dependence plays a role as long as the
system operates. A single failure of a component may not always be an opportunity
to combine maintenance activities. First, grouping corrective and preventive main-
tenance upon the failure of the component increases the probability of system
failure and costly production losses. Second, leaving components in a failed
condition for some time, with the intention to group corrective maintenance at a
later stage, has the same effect. So, there is a trade-off between the potential loss
resulting from a system failure and the benefit of joint maintenance.
One problem of optimizing (age-based) maintenance in k-out-of-n systems is
the determination of downtime costs, as a failure does not directly result in system
failure. Smith and Dekker (1997) derive the uptime, downtime and costs of main-
tenance in a 1-out-of-n system (with cold standby), but in general it is very difficult
to assess the availability and the downtime costs of a k-out-of-n system. In their
article, Smith and Dekker (1997) optimize the following age-replacement policy. A
component is taken out for preventive maintenance and replaced by a stand-by one,
if its age has reached a certain value Tpm. Moreover, they determine the number of
redundant components needed in the system.
In the maintenance policies considered in the articles below, an attempt is made
to balance the negative aspects of downtime costs and the positive aspects of
grouping (corrective) maintenance. The opportunistic maintenance policies proposed
in these articles are age-based and also contain a threshold for the number of failures
(except for the policy introduced by Sheu and Kuo 1994).
In Dekker et al. (1998b) the maintenance of light standards is studied. A light
standard consists of n independent and identical lamps screwed on a lamp assem-
bly. To guarantee a minimum luminance, the lamps are replaced if the number of
failed lamps reaches a pre-specified number m. In order to replace the lamps the
assembly has to be lowered. This set-up activity is an opportunity to combine
corrective and preventive maintenance. Several opportunistic age-based variants of
the m-failure group replacement policy (in its original form only corrective main-
tenance is grouped) are considered in this paper. Simulation optimization is used to
determine the optimal opportunistic age threshold.
Pham and Wang (2000) introduce imperfect PM and partial failure in a k-out-
of-n system. They propose a two-stage opportunistic maintenance policy for the
system. In the first stage failures are removed by minimal repair; in the second
stage failed components are jointly replaced with operating components when m
components have failed, or the entire system is replaced at time T, whichever
occurs first. Positive economic dependence is of an opportunistic nature. Joint
maintenance requires less time than individual maintenance.
Sheu and Kuo (1994) introduce a general age replacement policy for a k-out-of-
n system. Their model includes minimal repair, planned and unplanned replace-
ments, and general random repair costs. The system is replaced when it reaches age
T. The long-run expected cost rate is obtained. The aim of the paper is to find the
optimal age replacement time T that minimizes the long-run expected cost per unit
time of the policy.
The article of Sheu and Liou (1992) will be discussed in Section 11.4, because
they assume stochastic dependence between the components of a k-out-of-n system.

11.4 Stochastic Dependence

In the survey of Thomas (1986) multi-component maintenance models with sto-
chastic dependence are considered as a separate class of models. In the more recent
review articles this is not the case. In Cho and Parlar (1991) some articles dealing
with failure interaction are discussed, but the modelling of failure interaction
between components is not. In Wang (2002) nothing is said about systems with
failure interaction; articles on this kind of systems only appear in the references.
In fact, this chapter is the first publication since the survey of Thomas (1986) to give a
comprehensive review of multi-component maintenance models with stochastic
dependence. We do not aim to give solely a list of papers that have appeared.
Instead, we want to give insight into the different ways of modelling failure
interaction between components and explain the implications of certain approaches
and assumptions with respect to practical applicability.
Stochastic dependence, also referred to as failure interaction or probabilistic
dependence, implies that the state of components can influence the state of the
other components. Here, the state can be given by the age, the failure rate, state of
failure or any other condition measure. In their seminal work on stochastic de-
pendence, Murthy and Nguyen (1985b) introduce three different types of failure
interaction in a two-component system.
Type I failure interaction implies that the failure of a component can induce a
failure of the other component with probability p (q), and has no effect on the other
component with probability 1 – p (1 – q). It follows that there are two types of
failures: natural and induced. The natural failures are modelled by random
variables and the induced failures are characterized by the probabilities p and q. In
Murthy and Nguyen (1985a) the authors extend type I failure interaction to systems
with multiple components. It is assumed that whenever a component fails it
induces a total failure of the system with probability p and has no effect on the
other components with probability (1 – p). In this chapter we will consider this to
be the definition of type I failure interaction.
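
The multi-component definition of type I failure interaction can be sketched as a single random step: a natural failure of one component either brings down the whole system (with probability p) or leaves the other components unaffected. The function name and the representation of components by index are hypothetical.

```python
import random


def failed_set(component, n_components, p, rng):
    """Components that are failed right after a natural failure.

    With probability p the natural failure induces a total system
    failure; otherwise only the component itself has failed.
    """
    if rng.random() < p:
        return set(range(n_components))
    return {component}


rng = random.Random(0)  # seeded only to make the sketch reproducible
print(failed_set(1, n_components=4, p=0.3, rng=rng))
```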
Type II failure interaction in a two-component system is defined as follows.
The failure of component 2 can induce a failure of component 1 with probability q,
whereas every failure of component 1 acts as a shock to component 2, without
inducing an instantaneous failure, but affecting its failure rate.
Type III failure interaction implies that the failure of each component affects
the failure rate of the other component. That is, every failure of one of the compo-
nents acts as a shock to the other component.
A potential problem of the failure rate interaction defined by the last two types,
is determining the size of the shock. In practice it is very difficult to assess the
effect of a failure of one component on the failure rate of another component.
Usually there is not much data on the course of the failure rate of a component
after the occurrence of a shock. Shocks can also be modelled by adding a (random)
amount of damage to the state of another component. Natural failures then occur if
the state of a component (measured by the cumulative damage) exceeds a certain
level. In this paper we will bring this modelling of type II and III failure interaction
together in one definition. That is, we renew the definition of type II failure
interaction for multi-component systems. It reads as follows. The system consists of
several components and the failure of a component affects either the failure rate of
or causes a (random) amount of damage to the state of one or more of the
remaining components. It follows that we regard a mixture of induced failures and
shock damage as type II failure interaction. Models with type II failure interaction
will also be called shock damage models.
In general, the maintenance policies considered in the literature on stochastic
dependence are mainly of an opportunistic nature, since the failure of one compo-
nent is potentially harmful to the other component(s). Modelling failure interaction
appears to be quite elaborate. Therefore, most articles only consider two-compo-
nent systems. Below we review the articles on failure interaction in the following
order. First, we will discuss the type I interaction models. For this type of inter-
action different opportunistic versions of the well known age and block replace-
ment policies have been proposed. Second, the articles on type II interaction will
be reviewed. We will see that in most of these articles the occurrence of shocks is
modelled as a non-homogeneous Poisson process (NHPP) or that the failure rate of
components is adjusted upon failure of other components. Third, we pay attention
to articles that consider both types of failure interaction. Finally, we discuss other
forms of modelling failure interaction.

11.4.1 Type I Failure Interaction

Murthy and Nguyen (1985a) consider two maintenance policies in a multi-
component system with type I failure interaction. Under the first policy all failed
components are replaced by new ones. When there is no total system failure, then
only the single failed component is replaced. Under the second policy all com-
ponents, also the functioning component(s), are replaced. When there is no total
system failure, then the single failed component is subjected to minimal repair and
made operational. The failure rate of the failed component after repair is the same
as that just before failure. The authors derive both the expected cost of keeping the
system operational for a finite time period and the expected cost per unit
time of keeping the system operational for an infinite time period.
Sheu and Liou (1992) consider an optimal replacement policy for a k-out-of-n
system subject to shocks. Shocks arrive according to an NHPP. The system is
replaced preventively whenever it reaches age $T > 0$ at a fixed cost $c_0$. If the m-th
shock arrives at age $S_m < T$, it can cause the simultaneous failure of i components
with probability $p_i(S_m)$ for $i = 0, 1, \ldots, n$, where $\sum_{i=0}^{n} p_i(S_m) = 1$. If
$i \geq k$, then the k-out-of-n system is replaced by a new one at a cost $c_\infty$ (unplanned
failure replacement). So, the downtime is used to replace all components. If $0 \leq i < k$,
then the system is minimally repaired with cost $c_i(S_m)$. After a complete
replacement (either a planned or a failure replacement), the shock process is set to
zero. All failures subject to shocks are assumed to be instantly detected and
repaired. The aim of the paper is to find the optimal T that minimizes the long run
expected cost per unit time of the maintenance policy.
The articles of Scarf and Deara (1998, 2003) consider failure-based, (oppor-
tunistic) age and (opportunistic) block replacement policies for a labelled two-
component series system with type I failure interaction. The articles can be seen as
an extension of the article of Murthy and Nguyen (1985b) on failure-based replace-
ment for such systems. Note that since we deal with a series system, the failure of
either component causes a system downtime. So, if the system is down, this does
not necessarily mean that both components have failed. Economic dependence is
modelled on the basis that the cost of replacement of one or more components
includes a one-off set-up cost whose magnitude does not depend on the number of
components replaced.
The maintenance policies considered in Scarf and Deara (1998) are of the age-
based replacement type: replace a component on failure or at age T, whichever is
sooner. Failure-based maintenance is viewed as the limiting case (T → ∞) of age-
based replacement. As there is also economic dependence between components,
the authors consider opportunistic age-based replacement policies: replace a
component on failure or at age T or at age T' < T if an opportunity exists.
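
The opportunistic age-based rule amounts to a three-way test on a single component, sketched below. Here T_opp stands for the opportunistic threshold T' and the flag `opportunity` is assumed to be supplied by the caller (for example, set when the other component is being replaced); both names are hypothetical.

```python
def should_replace(failed, age, T, T_opp, opportunity):
    """Opportunistic age-based replacement test for one component.

    Replace on failure, at age T, or at age T_opp < T if an
    opportunity exists.
    """
    if not 0 < T_opp < T:
        raise ValueError("require 0 < T_opp < T")
    return bool(failed or age >= T or (opportunity and age >= T_opp))


# Hypothetical thresholds T = 10 and T' = 5.
print(should_replace(False, 6, 10, 5, opportunity=True))
print(should_replace(False, 6, 10, 5, opportunity=False))
```

Letting T grow without bound in this sketch recovers an opportunistic failure-based policy.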
The policies considered in Scarf and Deara (2003) are of the block replacement
type and are extended for two-component systems. The independent block
replacement policy is a single component policy and it is of the following form:
replace all failed components, replace component 1 at times k∆1, k = 1, 2,... and
replace component 2 at times k∆2, k = 1, 2,... . Block replacement can be grouped:
replace failed components and replace the system at times k∆, k = 1, 2,... . It can
also be combined: replace both components (whether failed or not) on failure of
the system and replace the system at times k∆, k = 1, 2,... . In modified block
replacement policies for a two-component system, a component is only replaced at
the block replacement times if its age is greater than some critical value. The block
replacement times may be independent or grouped, or the components may be
combined. Opportunistic modified block replacement policies are of the form: on
failure of component 1, if the age of component 2, τ2, is greater than b2′, then
replace both components; otherwise just replace component 1. On failure of
component 2, if the age of component 1, τ1, is greater than b1′, then replace both
components; otherwise just replace component 2. At block replacement times for
component 1, k∆1, k = 1, 2,..., replace component 1 if τ1 > b1 and replace
component 2 if τ2 > b2′; at block replacement times for component 2, k∆2, k = 1,
2,..., replace component 2 if τ2 > b2 and replace component 1 if τ1 > b1′ (for suitably
chosen thresholds b1, b2, b1′ and b2′).
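The opportunistic modified block replacement rule described above is essentially a small decision table, and it can be written down directly. The following is a minimal sketch of that rule only; the function name, the event labels and the argument names (b1_opp and b2_opp standing for b1′ and b2′) are hypothetical and do not come from Scarf and Deara (2003).

```python
def opportunistic_modified_block_action(event, age1, age2, b1, b2, b1_opp, b2_opp):
    """Return the set of components (1 and/or 2) to replace.

    event  : 'fail1' or 'fail2' (failure of a component), or
             'block1' or 'block2' (a block replacement time of a component)
    age1/2 : current ages tau1 and tau2 of the two components
    b1, b2 : own-block thresholds; b1_opp, b2_opp : opportunistic thresholds
    """
    replace = set()
    if event == "fail1":
        replace.add(1)                      # failed component is always replaced
        if age2 > b2_opp:
            replace.add(2)                  # take the opportunity for component 2
    elif event == "fail2":
        replace.add(2)
        if age1 > b1_opp:
            replace.add(1)
    elif event == "block1":
        if age1 > b1:
            replace.add(1)
        if age2 > b2_opp:
            replace.add(2)
    elif event == "block2":
        if age2 > b2:
            replace.add(2)
        if age1 > b1_opp:
            replace.add(1)
    else:
        raise ValueError(f"unknown event: {event!r}")
    return replace


if __name__ == "__main__":
    # Illustrative thresholds only.
    print(opportunistic_modified_block_action("fail1", 3.0, 7.0, 5, 5, 6, 6))
```

Writing the rule as a pure function of the event and the two ages makes it easy to plug into a simulation when the thresholds are to be tuned.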
In both articles the maintenance policies are considered in the context of the
clutch system used in a bus fleet. This system consists of the clutch assembly
(component 2) and the clutch controller (component 1). Actually, the failure of the
controller causes a failure of the assembly with probability 1 and the failure of the
assembly causes a failure of the controller with probability 0. It is important to
mention that the maintenance policies are not only compared on the basis of cost,
but also on ease of implementation and system reliability. It is found that an age-
based policy is best, but since this implies that component ages have to be
monitored, the authors propose to implement a block or modified block policy.
Combined modified block replacement seems to be the best alternative for the
clutch system under consideration. Combining maintenance of components has the
advantage that the system is in general more reliable, although the long run costs
per unit time are higher. The economic gains from using a complex policy have to
be weighed up against the additional investment required to implement such a policy.
Jhang and Sheu (2000) address the problem of analyzing preventive maintenance
policies in a multi-component system with type I failure interaction. The i-th
component (1 ≤ i ≤ N) has two types of failures. Type 1 failures are minor failures
and are rectified through minimal repair. Type 2 failures are catastrophic failures
and induce a total failure of the system (i.e. failure of all other components in the
system). Type 2 failures are removed by an unplanned/unscheduled replacement of
the system. The model takes into account costs for minimal repairs, replacements
and preventive maintenance. Generalized age and block replacement policies are
proposed. The age replacement policy implies preventive replacement of all
components whenever an operating system reaches age T. In the case of a block
replacement policy the system is preventively replaced every T years. The expected
long-run cost per unit time for each policy is derived and it is discussed how the
optimal T can be determined. Various special cases are discussed in detail. Finally,
the authors mention the application of their model to the maintenance of mining
cables used in hoisting load.

11.4.2 Type II Failure Interaction

Satow and Osaki (2003) consider a two-component parallel system. Component 1
is repairable and at failure minimal repair is done. Failures of component 1 occur
according to a NHPP. Whenever the component fails it induces a random amount
of damage to component 2. The damage is additive and component 2 fails
whenever the total damage exceeds a certain failure level. A system failure always
occurs whenever component 2 fails, because both components fail simultaneously.
By assumption component 2 is not repairable. This means that a failed system
needs to be replaced by a new one. Since preventive replacement is cheaper than
failure replacement, a two-parameter preventive replacement policy is analyzed.
The policy takes into account both system age and the total damage of component
2. The system is replaced preventively whenever the total damage of component 2
exceeds k or at time T and it is replaced correctively at system failures. An
expression for the expected cost per unit time for long run operation is derived and
the policy is optimized analytically for two special cases (the one-parameter
policies). Numerical examples show that the policy imposing a limit on the total
damage (k) of component 2 outperforms the age T policy. It appears that the two-
parameter preventive maintenance policy does not necessarily lead to lower
expected costs. This is because in this model the state of component 2 is best
indicated by the total damage and its age does not provide any additional information.
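A two-parameter (T, k) policy of this kind is straightforward to explore by simulation. The sketch below is only loosely modelled on the setting described above and does not reproduce the analysis of Satow and Osaki (2003). It assumes a power-law NHPP with cumulative intensity a·t^b for the failures of component 1, exponentially distributed damage increments, a fixed failure level Z, and that no minimal repair cost is charged at the failure that triggers a replacement. All names and parameter values are hypothetical.

```python
import random


def simulate_cycle(rng, T, k, Z, a, b, mean_damage, c_min, c_p, c_f):
    """Simulate one replacement cycle; return (cycle length, cycle cost)."""
    t, damage, cost = 0.0, 0.0, 0.0
    while True:
        # Next failure of component 1: invert the cumulative intensity a * t**b.
        e = rng.expovariate(1.0)
        t_next = ((a * t ** b + e) / a) ** (1.0 / b)
        if t_next >= T:
            return T, cost + c_p            # preventive replacement at age T
        t = t_next
        damage += rng.expovariate(1.0 / mean_damage)
        if damage > Z:
            return t, cost + c_f            # component 2 fails: system failure
        if damage > k:
            return t, cost + c_p            # damage limit reached: preventive
        cost += c_min                       # otherwise minimal repair of component 1


def cost_rate(T, k, Z, a, b, mean_damage, c_min, c_p, c_f, cycles=20000, seed=1):
    """Estimate the long-run cost per unit time as total cost / total time."""
    rng = random.Random(seed)
    total_cost = total_time = 0.0
    for _ in range(cycles):
        length, cost = simulate_cycle(rng, T, k, Z, a, b, mean_damage, c_min, c_p, c_f)
        total_time += length
        total_cost += cost
    return total_cost / total_time


if __name__ == "__main__":
    common = dict(Z=10.0, a=1.0, b=1.5, mean_damage=1.0, c_min=1.0, c_p=5.0, c_f=20.0)
    print("age only   :", cost_rate(T=4.0, k=10.0, **common))
    print("damage only:", cost_rate(T=float("inf"), k=7.0, **common))
    print("both       :", cost_rate(T=4.0, k=7.0, **common))
```

Setting k equal to Z switches off the damage limit and setting T to infinity switches off the age limit, so the same routine covers both one-parameter policies.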
Zequeira and Bérenguer (2005) study inspection policies for a two-component
parallel standby system. The system operates successfully if at least one com-
ponent functions. Failures can be detected only by periodic inspections. The failure
times are modelled as independent random variables. Type II failure interaction is
modelled as follows. The failure of one component modifies the (conditional)
failure probability of the other component with probability p and does not
influence the failure time with probability 1 – p. In this respect, the model
extends the failure rate interaction models proposed by Murthy and Nguyen
(1985b). Inspections are either staggered, i.e. the components are inspected one at a
time, or non-staggered, i.e. the components are inspected simultaneously. It is
assumed that there are no economies of scale by doing non-staggered inspections.
Numerical experiments show that for the case of constant hazard rates, staggered
inspections outperform non-staggered inspections on the expected average cost per
unit time criterion. The authors explain this counter-intuitive result as follows.
When inspections are staggered, at least one component is in an operating
condition more frequently than when inspections are not staggered.

Lai and Chen (2006) consider a two-component system with failure rate
interaction. The lifetimes of the components are modelled by random variables
with increasing failure rates. Component 1 is repairable and it undergoes minimal
repair at failures. That is, component 1 failures occur according to a NHPP. Upon
failure of component 1 the failure rate of component 2 is modified (increased).
Failures of component 2 induce the failure of component 1 and consequently the
failure of the system. The authors propose the following maintenance policy. The
system is completely replaced upon failure, or preventively replaced at age T,
whichever occurs first. The expected average cost per unit time is derived and the
policy is optimized with respect to parameter T. The optimum turns out to be
Barros et al. (2006) introduce imperfect monitoring in a two-component
parallel system. It is assumed that the failure of component i is detected with
probability 1 – pi and is not detected with probability pi. The components have
exponential lifetimes and when a component fails extra stress is placed on the
surviving one, whose failure rate is increased. Moreover, independent shocks
occur according to a Poisson process. These shocks correspond to common cause
failures and induce a system failure. The following maintenance policy is proposed.
Replace the system upon failure (either due to a shock or failure of the components
separately), or preventively at time T, whichever occurs first. Assuming that
preventive replacement is cheaper, the total expected discounted cost over an
unbounded horizon is minimized. Numerical examples show the relevance of taking
into account monitoring problems in the maintenance model. The model is applied
to a parallel system of electronic components. When one fails, the surviving one is
overworked so as to keep the delivery rate unaffected.

11.4.3 Types I and II Failure Interaction

Murthy and Nguyen (1985b) derive the expected cost of operating a two-component
system with type I or type II failure interaction for both a finite and an
infinite time period. They consider a simple, non-opportunistic maintenance
policy: always replace failed components immediately. This means that the system
is only renewed if a natural failure induces a failure of the other component.
Nakagawa and Murthy (1993) elaborate on the ideas of Murthy and Nguyen
(1985b). They consider two types of failure interaction between two components.
In the first case the failure of component 1 induces a failure of component 2 with a
certain probability. In the second case the failure of component 1 causes a random
amount of damage to the other component. In the latter case the damage
accumulates and the system fails when the total damage exceeds a specified level.
Failures of component 1 are modelled as an NHPP with increasing intensity
function. The following maintenance policy is examined. The system is replaced at
failure of component 2 or at the N-th failure of component 1, whichever occurs
first. For both models the optimal number of failures before replacing the system, so
as to minimize the expected cost per unit time over an infinite horizon, is derived. The
maintenance policy for the shock damage model is extended as follows: the system
is also replaced at time T. This results in a two-parameter maintenance policy,
which is also optimized. The authors give an application of their models to the
chemical industry; component 1 is a pneumatic pump and component 2 is a metal
container. The failure of the pneumatic pump may either lead to an explosion,
causing system failure (model 1), or lead to a reduction in the wall thickness of the
container (model 2). The extension of model 2 captures the introduction of
preventive maintenance of the container at time T.

11.4.4 Other Types of Failure Interaction

Özekici (1988) considers a reliability system of n components. The state of the
system is given by the random vector X_t of the ages of the components at time t,
that is X_t = (X_t^1, ..., X_t^n). It is assumed that X_t^i ≥ 0 for all t > 0 and i = 1,...,n,
where X_t^i = ∞ implies that component i is in a failed state at time t. The stochastic
structure of the system is that the stochastic process with state-space [0, ∞) is a
positive, increasing, right-continuous, and quasi-left continuous, strong Markov
process. Stochastic dependence between the components is modelled by making
the age (state) of a component at time t dependent on the age of the system up to
time t. The failure interaction considered here differs from type I and II failure
interaction defined above. It is worth mentioning that this paper was written
independently of the work of Murthy and Nguyen (1985a,b).
Maintenance is modelled as follows. There are periodic overhauls at which the
state of the system is inspected and a replacement decision is made on the
components based on the observation of the system. Here the cost structure of the
maintenance decision is very general and consists of two types: costs which only
depend on the number of replaced components and costs which depend on the state
of the system at the time of inspection. Economic dependence between components
is ‘hidden’ in the former costs. Replacing a group of components together is
cheaper than replacing the components separately or in a smaller subgroup. The
optimal replacement problem is formulated as a Markov decision process. The
author proposes a very general class of replacement policies, for which the
decision to replace a component depends on the age of all components. It appears
to be possible to characterize the optimal solution to the replacement problem.
Unfortunately, it cannot be proved that there exists a single critical age for the
system, which describes the optimal replacement problem. The author provides
some intuitive results, e.g. it is not always optimal to replace new components and
if the age of components that have to be replaced is increased, then the optimal
policy does not change. He also gives an important counter-intuitive result: it is not
true that more components are replaced as the system gets older.

11.5 Structural Dependence

Structural dependence means that some operating components have to be replaced,
or at least dismantled, before failed components can be replaced or repaired. In
other words, structural dependence between components indicates that they cannot
be maintained independently. This is not failure dependence, but maintenance
dependence. Since the failure of a component offers an opportunity to replace other
components, opportunistic policies are expected to perform well on systems with
structural dependence between components. Obviously, preventive maintenance
may also be advantageous, since maintenance of structurally dependent components
can be grouped.
There may be several reasons for structural dependence. For example, a bicycle
chain and a cassette form a union, which should always be replaced together rather
than individually. Another example is from Dekker et al. (1998a), which considers
road maintenance. Several deterioration mechanisms affect roads, e.g. longitudinal
and transversal unevenness, cracking and ravelling. For each mechanism one may
define a virtual component, but if one applies a maintenance action to such a
component it also affects the state with respect to the other failure mechanisms.
The seminal paper in this category is from Sasieni (1956). He considers the
production of rubber tyres. The machine that produces the tyres consists of two
“bladders”; one tyre is produced on each bladder simultaneously. Upon failure of a
bladder, the machine must be stripped down before replacement can be done. This
means that the other bladder can be replaced at the same time. Note that immediate
replacement is not mandatory, but a failed bladder will produce faulty tyres. Two
maintenance policies are analyzed and optimized. The first is a preventive main-
tenance policy. Bladders which have made a predetermined number of tyres (m)
without failure are replaced. The second is an opportunistic version of the first
policy. When a machine is stripped to replace one bladder, replace the other bladder
if it has produced more than n ≤ m tyres.
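The two bladder policies can be compared with a small discrete-time simulation. The sketch below is not Sasieni's Markov chain analysis. It assumes that each production cycle makes one tyre on each bladder, that a bladder of a given age fails during its next tyre with a user-supplied probability hazard(age), that every strip-down costs c_strip and every new bladder costs c_bladder, and that faulty tyres carry no extra cost. These assumptions and all names are illustrative only.

```python
import random


def simulate_bladders(m, n, hazard, cycles, c_strip, c_bladder, seed=0):
    """Simulate the (m, n) policy; return (cost per cycle, strips, bladders used).

    m : a bladder is replaced preventively once it has made m tyres
    n : when the machine is stripped anyway, the other bladder is also
        replaced if it has made more than n tyres (n = m gives the
        purely preventive policy, since ages never exceed m)
    """
    rng = random.Random(seed)
    ages = [0, 0]
    strips = bladders = 0
    for _ in range(cycles):
        must_replace = [False, False]
        for i in (0, 1):
            failed = rng.random() < hazard(ages[i])
            ages[i] += 1
            must_replace[i] = failed or ages[i] >= m
        if any(must_replace):
            strips += 1                      # machine is stripped once
            for i in (0, 1):
                if must_replace[i] or ages[i] > n:
                    ages[i] = 0              # fit a new bladder
                    bladders += 1
    cost = strips * c_strip + bladders * c_bladder
    return cost / cycles, strips, bladders


if __name__ == "__main__":
    wear = lambda age: min(1.0, 0.002 * age)     # hypothetical increasing hazard
    for n in (40, 25, 10):
        print(n, simulate_bladders(m=40, n=n, hazard=wear, cycles=100000,
                                   c_strip=5.0, c_bladder=1.0))
```

Lowering n trades extra bladders for fewer strip-downs, which is the effect the opportunistic policy is meant to exploit.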

11.6 Planning Horizon and Optimization Methods

In this section we will classify articles on the basis of the planning horizon of the
maintenance model and the optimization methods used to solve this model.
Actually, these two concepts are related. The majority of the articles reviewed here
assume an infinite horizon. This assumption facilitates the mathematical analysis;
it is often possible to derive analytical expressions for optimal control parameters
and the corresponding optimal costs. So, in the category of infinite horizon (stationary
grouping) models, policy optimization is the most popular optimization method.
For convenience we will not review the articles in this category.
Finite-horizon models consider the system in this horizon only, and hence
assume implicitly that the system is not used afterwards, unless a so-called residual
value is incorporated to estimate the industrial value of the system at the end of the
horizon. In the article of Budai et al. (2006) the so-called end-of-horizon effect is
eliminated by adding an additional term to the objective function. This term values
the last interval.
The optimization methods applied to finite horizon models are either exact
methods or heuristics¹. Exact methods always find the global optimum solution of
a problem. If the complexity of an optimization problem is high and the computing
time of the exact method increases exponentially with the size of the problem, then
heuristics can be used to find a near-optimal solution in reasonable time.
The scheduling problem studied by Grigoriev et al. (2006) appears to be NP-
hard. Instead of defining heuristics, the authors choose to work on a relatively fast
exact method. Column-generation and a branch-and-price technique are utilized to
find the exact solution of larger-sized problems. The problem considered by
Papadakis and Kleindorfer (2005) is first modelled as a mixed integer linear pro-
gramming problem, but it appears that it can also be formulated as a max-flow
min-cut problem in an undirected network. For this problem efficient algorithms
exist and thus, an exact method is applicable.
Langdon and Treleaven (1997), Sriskandarajah et al. (1998), Higgins (1998)
and Budai et al. (2006) propose heuristics to solve complex scheduling problems.
The first two articles utilize genetic algorithms. Higgins (1998) applies tabu search
and Budai et al. (2006) define different heuristics that are based on intuitive
arguments. In all four articles the heuristics perform well; a good solution is found
within reasonable time.
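The trade-off between exact methods and heuristics can be seen on a toy finite-horizon grouping problem. The sketch below is a hypothetical example and is not the model of any of the articles cited above. Each component must not go more than max_gap periods without maintenance, every period in which any maintenance takes place incurs a set-up cost, and each maintenance action has its own unit cost. The exact method enumerates all feasible plans, which grows exponentially with the horizon, while the heuristic simply maintains each component as late as possible and ignores grouping.

```python
from itertools import combinations, product


def feasible_schedules(horizon, max_gap):
    """All sets of maintenance periods in 1..horizon with no gap above max_gap."""
    periods = range(1, horizon + 1)
    result = []
    for r in range(horizon + 1):
        for times in combinations(periods, r):
            points = (0,) + times + (horizon,)
            if all(later - earlier <= max_gap
                   for earlier, later in zip(points, points[1:])):
                result.append(times)
    return result


def plan_cost(plan, setup_cost, unit_costs):
    """Set-up cost per busy period plus unit cost per maintenance action."""
    busy = set().union(*map(set, plan))
    actions = sum(c * len(times) for c, times in zip(unit_costs, plan))
    return setup_cost * len(busy) + actions


def exact_plan(horizon, max_gaps, setup_cost, unit_costs):
    """Exhaustive enumeration over all combinations of feasible schedules."""
    options = [feasible_schedules(horizon, g) for g in max_gaps]
    return min(product(*options), key=lambda p: plan_cost(p, setup_cost, unit_costs))


def latest_possible_plan(horizon, max_gaps):
    """Heuristic: maintain each component every max_gap periods, ignoring grouping."""
    plan = []
    for g in max_gaps:
        times, last = [], 0
        while horizon - last > g:
            last += g
            times.append(last)
        plan.append(tuple(times))
    return tuple(plan)


if __name__ == "__main__":
    horizon, gaps, setup, units = 6, [2, 3], 5.0, [1.0, 1.0]
    best = exact_plan(horizon, gaps, setup, units)
    late = latest_possible_plan(horizon, gaps)
    print("exact    :", best, plan_cost(best, setup, units))
    print("heuristic:", late, plan_cost(late, setup, units))
```

In this small instance the exact plan maintains component 2 earlier than necessary so that it can share set-ups with component 1, which the naive heuristic misses.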

11.7 Trends and Open Areas

In this section we comment on future research on the optimal maintenance of multi-
component systems. We first analyze the trends in modelling multi-component
maintenance and then discuss the future research areas in this field.

11.7.1 Trends
In the last few years several articles have appeared on optimal maintenance of
systems with stochastic dependence. In particular, the shock-damage models have
received much attention. One explanation for this is that type II failure interaction
can be modelled in several ways, whereas there is not much room for extensions in
the type I failure model. Another reason is that since the field of stochastic de-
pendence is not very broad yet, it is easy to add a new feature such as minimal
repair or imperfect monitoring to an existing model. Third, many existing oppor-
tunistic maintenance policies for systems with economic dependence have not yet
been applied to systems with (type II) failure interaction.
Another upcoming field in multi-component maintenance modelling is the class
of finite horizon maintenance scheduling problems. Finite horizon models can be
regarded as dynamic models, because short-term information can be taken into
account. Maintenance scheduling problems are often modelled in discrete time as
mixed integer linear programming problems. These problems can be NP-hard and in
that case heuristics or local search methods have to be developed in order to solve the
problems to near-optimality efficiently. In the last decade tabu search, genetic
algorithms and problem specific heuristics have already been applied to maintenance
scheduling problems (see Langdon and Treleaven 1997, Sriskandarajah et al. 1998,
Higgins 1998 and Budai et al. 2006). However, there is still need for better local
search algorithms.

¹ Actually, if the maintenance policy is relatively easy, it is sometimes possible to determine
the expected maintenance costs over a finite period of time. For instance, Murthy and
Nguyen (1985a,b) consider failure-based policies in a system with stochastic dependence
and derive an expression for the expected cost of operating the system for a finite time.

11.7.2 Open Areas

There is scope for more work in the following areas.

Finite Horizon Models

On one hand, the class of infinite horizon models has been studied extensively in
the literature. Based on the renewal-reward theory many maintenance policies for
stationary grouping models have been analyzed. On the other hand, the class of
finite horizon models, which includes many maintenance scheduling problems, has
received far less attention. However, maintenance of multi-component systems
has to be made operational. Therefore, finite horizon and especially rolling horizon
models, which also take short-term information into account, have to be developed.
In order to solve these models heuristics/local search methods should be further
developed. Exact algorithms also need more attention. The article of Grigoriev et al.
(2006) shows that some scheduling problems of reasonable size can be optimized
exactly in a reasonable time.

Case-studies

This review shows that case-studies are not represented very well in the field. This
is surprising, since maintenance is an applied topic. In our opinion many models
are just (mathematical) extensions of existing models and most of the time models
are not validated empirically. Case-studies can lead to new models, both in the
context of cost structures and dependencies between components.

Modelling Multiple Set-up Activities

In this article we have subdivided the category “economic dependence” into a
number of subcategories. It appears that examples of modelling maintenance of
systems with multiple set-up activities are scarce. Therefore, this seems to be a
promising field for further research. After all, in many production systems complex
set-up structures exist.

Structural Dependence

The field of structural dependence is wide open. In our opinion there have only
been a few articles published on this topic.

Stochastic Dependence

Two decades ago Murthy and Nguyen published two articles on the maintenance of
systems with stochastic dependence. Although this topic has had much attention
since then, most articles still deal with two-component systems. So, there is still a
lot of work to do on modelling maintenance of systems with failure interaction
consisting of more than two components.

Combination of Dependencies

In this article we have seen one example of the combination of structural and
economic dependence (Scarf and Deara 1998, 2003). We have also reviewed some
papers with both positive and negative economic dependence. Obviously, the
combination of different types of interaction results in difficult optimization
models. So, this is also an opportunity for researchers to come up with some new
models.

Simulation Optimization

We have already said that much work has been done on maintenance policies for
the class of infinite horizon models. Many maintenance policies are not analyti-
cally tractable and simulation is needed to analyze these policies. We observe that
the optimization of policies via simulation is often done by using algorithms for
deterministic optimization problems. Methods such as simulated annealing and
response surface methodology may be more efficient. This should be investigated.

11.8 Conclusions
In this chapter we have reviewed the literature on optimal maintenance of multi-
component systems. We first classified articles on the basis of the type of
dependence between components: economic, stochastic and structural dependence.
Subsequently, we subdivided these classes into new categories. For example, we
have introduced the categories positive and negative economic dependence. We
have paid attention to articles with both forms of interaction. Moreover, we have
defined several subcategories in the class of models with positive economic de-
pendence. With respect to articles in the class of stochastic dependence, we are the
first to review these articles systematically.
Another classification has been made on the basis of the planning horizon
models and optimization methods. We have focussed our attention on the use of
heuristics and exact methods in finite horizon models. We have concluded that this
is a promising open research area.
We have discussed the trends and the open areas of research reported in the
literature on multi-component maintenance. We have observed a shift from infinite
horizon models to finite horizon models and from economic to stochastic depen-
dence. This immediately defines the open research areas, which also include topics
such as case studies, modelling combinations of dependencies between compo-
nents and modelling multiple set-up activities.

11.9 References
Assaf D, Shanthikumar J, (1987) Optimal group maintenance policies with continuous and
periodic inspections. Management Science 33:1440–1452
Barros A, Bérenguer C, Grall A, (2006) A maintenance policy for two-unit parallel systems
based on imperfect monitoring information. Reliability Engineering and System Safety
Budai G, Huisman D, Dekker R, (2006) Scheduling preventive railway maintenance
activities. Journal of the Operational Research Society 57:1035–1044
Castanier B, Grall A, Bérenguer C, (2005) A condition-based maintenance policy with non-
periodic inspections for a two-unit series system. Reliability Engineering & System Safety
Cho D, Parlar M, (1991) A survey of maintenance models for multi-unit systems. European
Journal of Operational Research 51:1–23
Dekker R, Plasmeijer R, Swart J, (1998a) Evaluation of a new maintenance concept for the
preservation of highways. IMA Journal of Mathematics applied in Business and Industry
Dekker R, van der Duyn Schouten F, Wildeman R, (1996) A review of multi-component
maintenance models with economic dependence. Mathematical Methods of Operations
Research 45:411–435
Dekker R, van der Meer J, Plasmeijer R, Wildeman R, (1998b) Maintenance of light-
standards: a case-study. Journal of the Operational Research Society 49:132–143
Goyal S, Kusy M, (1985) Determining economic maintenance frequency for a family of
machines. Journal of the Operational Research Society 36:1125–1128
Grigoriev A, van de Klundert J, Spieksma F, (2006) Modeling and solving the periodic
maintenance problem. European Journal of Operational Research 172:783–797
Gürler Ü, Kaya A, (2002) A maintenance policy for a system with multi-state components: an
approximate solution. Reliability Engineering & System Safety 76:117–127
Higgins A, (1998) Scheduling of railway track maintenance activities and crews. Journal of
the Operational Research Society 49:1026–1033
Jhang J, Sheu S, (2000) Optimal age and block replacement policies for a multi-component
system with failure interaction. International Journal of Systems Science 31:593–603
Lai M, Chen Y, (2006) Optimal periodic replacement policy for a two-unit system with
failure rate interaction. The International Journal of Advanced Manufacturing and
Technology 29:367–371
Langdon W, Treleaven P, (1997) Scheduling maintenance of electrical power transmission
networks using genetic programming. In Warwick K, Ekwue A, Aggarwal R, (eds.)
Artificial intelligence techniques in power systems, Institution of Electrical Engineers,
Stevenage, UK, 220–237
Murthy D, Nguyen D, (1985a) Study of a multi-component system with failure interaction.
European Journal of Operational Research 21:330–338
Murthy D, Nguyen D, (1985b) Study of two-component system with failure interaction.
Naval Research Logistics Quarterly 32:239–247
Nakagawa T, Murthy D, (1993) Optimal replacement policies for a two-unit system with
failure interactions. RAIRO Recherche operationelle / Operations Research 27:427–438
Okumoto K, Elsayed E, (1983) An optimum group maintenance policy. Naval Research
Logistics Quarterly 30:667–674
Özekici S, (1988) Optimal periodic replacement of multicomponent reliability systems.
Operations Research 36:542–552
Papadakis I, Kleindorfer P, (2005) Optimizing infrastructure network maintenance when
benefits are interdependent. OR Spectrum 27:63–84
Pham H, Wang H, (2000) Optimal (τ, T) opportunistic maintenance of a k-out-of-n:G system
with imperfect PM and partial failure. Naval Research Logistics 47:223–239
Popova E, Wilson J, (1999) Group replacement policies for parallel systems whose
components have phase distributed failure times. Annals of Operations Research 91:
Ritchken P, Wilson J, (1990) (m; T) group maintenance policies. Management Science
Sasieni M, (1956) A Markov chain process in industrial replacement. Operational Research
Quarterly 7:148–155
Satow T, Osaki S, (2003) Optimal replacement policies for a two-unit system with shock
damage interaction. Computers and Mathematics with Applications 46:1129–1138
Scarf P, Deara M, (1998) On the development and application of maintenance policies for a
two-component system with failure dependence. IMA Journal of Mathematics Applied in
Business & Industry 9:91–107
Scarf P, Deara M, (2003) Block replacement policies for a two-component system with
failure dependence. Naval Research Logistics 50:70–87
Sheu S, Jhang J, (1996) A generalized group maintenance policy. European Journal of
Operational Research 96:232–247
Sheu S, Kuo C, (1994) Optimal age replacement policy with minimal repair and general
random repair costs for a multi-unit system. RAIRO Recherche Operationelle/Operations
Research 28:85–95
Sheu S, Liou C, (1992) Optimal replacement of a k-out-of-n system subject to shocks.
Microelectronics Reliability 32:649–655
Smith M, Dekker R, (1997) Preventive maintenance in a 1 out of n system: the uptime,
downtime and costs. European Journal of Operational Research 99:565–583
Sriskandarajah C, Jardine A, Chan C, (1998) Maintenance scheduling of rolling stock using a
genetic algorithm. Journal of the Operational Research Society 49:1130–1145
Stengos D, Thomas L, (1980) The blast furnaces problem. European Journal of Operational
Research 4:330–336
Thomas L, (1986) A survey of maintenance and replacement models for maintainability and
reliability of multi-item systems. Reliability Engineering 16:297–309
Van der Duyn Schouten F, Vanneste S, (1993) Two simple control policies for a
multicomponent maintenance system. Operations Research 41:1125–1136
Van der Duyn Schouten F, van Vlijmen B, Vos de Wael S, (1998) Replacement policies for
traffic control signals. IMA Journal of Mathematics Applied in Business & Industry
Van Dijkhuizen G, (2000) Maintenance grouping in multi-setup multi-component production
systems. In Ben-Daya M, Duffuaa S, Raouf A, (eds.) Maintenance, Modeling and
Optimization, Kluwer Academic Publishers, Boston, 283–306
Wang H, (2002) A survey of maintenance policies of deteriorating systems. European
Journal of Operational Research 139:469–489
Zequeira R, Bérenguer C, (2005) On the inspection policy of a two-component parallel
system with failure interaction. Reliability Engineering and System Safety 88:99–107

Replacement of Capital Equipment

P.A. Scarf and J.C. Hartman

12.1 Introduction
Businesses require equipment in order to function and deliver their outputs. In the
global, competitive environment, this equipment is critical to success. However,
equipment generally degrades with age and usage, and investment is required to
maintain the functional performance of equipment. For example, in mass urban
transportation, annual expenditure on equipment replacement for the Hong Kong
underground is of the order of $50 million, and further, the Hong Kong underground
network is a fraction of the size of that in London, Paris or New York. Where
equipment replacement impacts significantly on the bottom line of a corporation and
decision-making about such expenditure is under the control of the company
executive, the modelling of such decision making is within the scope of this chapter.
Capital equipment investment projects are typically driven by operating cost
control, technical obsolescence, requirements for performance and functionality
improvements, and safety. That is, rational decision-making about capital equip-
ment replacement will take account of engineering, economic, and safety require-
ments. In this chapter we will assume that the engineering requirements concerning
replacement will define certain choices for equipment replacement. For example,
engineers would normally propose a number of options for providing the continuity
of equipment function: retain the current equipment as is, refurbish the equipment in
order to improve operation and functionality, or replace the equipment with new im-
proved technology. We will further assume that safety requirements are addressed
when these options are analysed by engineers. Consequently, we argue that rational
choice between the defined replacement options is an economic question. Thus, a
logistics corporation may be considering replacement of certain assets in its road
transportation fleet. The organisation may have to raise capital to fund such
replacement. There is the expectation that engineers for the corporation will offer a
number of choices for replacement (e.g. buy tractors from company X or Y, buy
tractors now or in N years' time, or scrap or retain existing tractors as spares) that
meet future functional and safety requirements. In this way, decision making about
replacement then necessarily considers the costs of the replacement options over
some suitable planning horizon. As capital equipment replacement potentially in-
curs significant costs, the cost of capital is a factor in the decision problem and
models to support decision making typically take account of the time value of
capital through discounting.
Capital equipment is a significant asset of a business. It consists of necessarily
complex systems and a business would typically own or operate a fleet of equipment:
the Mass Transit Railway Corporation Limited of Hong Kong operates hundreds of
escalators; FedEx Express, the cargo airline corporation, operates more than 600
aircraft; electricity distribution systems comprise thousands of kilometres of cable
and hundreds of thousands of items such as transformers and switches; water supply
networks are on a similar scale. We can appeal to the law of large numbers and
assume with some justification that the economic costs that enter capital equipment
replacement decisions are deterministic. Consequently, we consider deterministic
models in this chapter and model rational decision making throughout using net
present value techniques (e.g. see Arnold 2006; Northcott 1985).
When considering optimal equipment replacement in an uncertain environment,
authors have argued the case for using real options (Dixit and Pindyck 1994; Bowe
and Lee 2004). Whenever replacement decisions may be exercised continuously, it
is argued that the choice to replace an existing asset with a new asset at a specified
time is characteristic of an American call option—this approach seeks to value the
opportunity to replace the asset. Such a modelling approach would be valuable
when considering expansion of assets, for example, through the building of a new
transportation link for which the likely return on investment would be highly
uncertain. However, we do not consider this approach in this chapter.
We do not consider problems of component replacement in which the func-
tionality of repairable systems is optimized either on a cost basis or a required
reliability basis. Such maintenance does not typically involve capital expenditure,
and the models used are often stochastic in nature—times to failure are considered
to be random. For a recent review of such models, see Wang (2002).
The outline of the chapter is as follows. In Section 12.2 we describe the frame-
work for the classification of models that are discussed in this chapter. This frame-
work considers the nature of capital equipment replacement problems in general
and presents further detail regarding the nature of cost factors that contribute to
replacement decisions. Section 12.3 looks at economic life models and discusses
several models and an application of one of the models. Section 12.4 deals with
replacement of a network system. Dynamic programming models are discussed in
Section 12.5 and the chapter concludes with a discussion of topics for future
research in Section 12.6.

12.2 Framework for Replacement Modelling of Capital Equipment

The composition of a fleet may be classified as singular (one operating plant),
multiple identical (homogeneous), or multiple non-identical (inhomogeneous). Re-
placement policies may be classified as single plant replacement, sub-fleet replace-
Replacement of Capital Equipment 289

ment, or entire fleet replacement (Scarf and Christer 1997). The capital replace-
ment models that are considered in this chapter may be classified as economic life
models or dynamic programming models. The former are concerned with deter-
mining the optimal lifetime of an item of equipment, taking account of costs over
some planning horizon. The latter consider replacement decisions dynamically,
determining whether plant should be retained or replaced after each period. Eco-
nomic life models may be further classified according to the length of the planning
horizon: infinite, variable finite (with length of the horizon a function of decision
variables), or fixed (with a variable number of replacement cycles). Dynamic
programming models generally require a finite horizon, but may be used to identify
the optimal time zero decision for an infinite horizon.
Early models (e.g. Eilon et al. 1966) were formulated in continuous time with
optimum policy obtained using calculus. More complex models are simpler to
implement under a discrete time formulation. In the case of economic life models,
optimization may be performed using a crude search when there exists a small
number of decision variables. For fleets with many items, the discrete time
formulation naturally gives rise to mathematical programming problems. Dynamic
programming models necessarily require a discrete time formulation. Real options
models are formulated in continuous time.
We begin by looking at simple economic life models. These are applied in a
case study on escalator replacement. Economic life models are then extended to
consider first an inhomogeneous fleet and second a network system viewed as an
inhomogeneous fleet with interacting items. A number of different dynamic pro-
gramming models are introduced for singular systems and then expanded to homo-
geneous and inhomogeneous fleets and networks of assets.
It is assumed that data relating to maintenance are available and sufficient for
modelling purposes. Data on other “age” related operating costs, such as fuel costs
and failures (breakdowns), would also ideally be available. Where usage of plant is
non-uniform, particularly if decreasing with age, usage data are also required for
replacement policy to be meaningful. This is because, for example, maintenance
costs for older plant may be artificially low due to under utilization or neglect of
good maintenance practice for plant near the end of its useful life. Some plant
may even be retired as occasional spares. Under-reporting, and thus bias, of maintenance
cost data may also be significant (Scarf 1994). Replacement models have
also been considered when cost information is obtained subjectively (Apeland and
Scarf 2003).
Penalty costs play a role in all replacement decisions (Christer and Scarf 1994).
It is only the extent to which penalty cost is quantified in the modelling process
that varies. Rather than attempt to estimate the values of “difficult to quantify”
parameters such as penalty cost and then determine optimal policy, the influence of
these parameters on the decision should be quantified. In this latter approach,
threshold values that lead to a step-change in optimum policy can be investigated
and presented and the decision makers can then consider whether they believe that
such values are realistic within the context of the problem. Thus, the penalty cost
can be used to measure in part the subjective component of a replacement choice.
All costs considered in the modelling will be discounted to net present value
through the use of a constant discount factor. We refer the reader to Kobbacy and
Nicol (1994) for a detailed discussion of the role of discounting in capital replace-
ment. Appropriate functions describing resale values are assumed to be known, as
are purchase costs. Tax considerations in particular contexts should be taken into
account and modelled.

12.3 Economic Life Models

12.3.1 A Simple Model for Individual Plant

Early economic life models such as Eilon et al. (1966) considered an idealised item of
equipment replaced at age T, that is, replacement every T time units, in perpetuity.
In this idealised framework, for T small, frequent replacement leads to high
replacement or capital costs. Infrequent replacement (large T), on the other hand,
results in high operating or revenue costs (assuming that operating costs increase
with the age of equipment). Trading-off capital costs against revenue costs leads to
an optimum age at replacement, T*, the so-called economic life. The decision
criterion is typically the total cost per unit time or the annuity—this latter term has
been called the rent by Christer (1984). In the case without discounting, the total
cost per unit time, c(T), and the annuity are equivalent and

$$c(T) = \left\{\int_0^T m_0(t)\,dt + R\right\}\Big/\,T, \tag{12.1}$$

where $m_0(t)$ is the operating cost rate and $R$ is the replacement cost, and assuming
no residual value. From Equation 12.1, it follows that $T^*$ is the solution of

$$\int_0^{T^*} m_0(t)\,dt + R = T^* m_0(T^*),$$

provided it exists. In its discrete time form the total cost per unit time is
$c(T) = \{\sum_{i=1}^{T} m_{0i} + R\}/T$, where $m_{0i}$ is the operating cost in time period $i$. With a
discount factor $\nu$, discounting to year end, and a residual value function $S(T)$, the
net present value (NPV) of all future costs in perpetuity is

$$c_{NPV}(T) = (1 + \nu^T + \nu^{2T} + \dots)\left\{\sum_{i=1}^{T} m_{0i}\nu^i + \nu^T[R - S(T)]\right\} = (1 - \nu^T)^{-1}\left\{\sum_{i=1}^{T} m_{0i}\nu^i + \nu^T[R - S(T)]\right\}.$$


An objection to this criterion is that as $\nu \to 1$, $c_{NPV}(T) \to \infty$. Consequently, we
recommend the annuity or rent (the amount paid annually and in perpetuity that is
necessary to meet the total discounted cost) given by

$$(1 + \nu + \nu^2 + \dots)\,c_{rent}(T) = (1 + \nu^T + \nu^{2T} + \dots)\left\{\sum_{i=1}^{T} m_{0i}\nu^i + \nu^T[R - S(T)]\right\},$$

that is,

$$c_{rent}(T) = \frac{1 - \nu}{1 - \nu^T}\left\{\sum_{i=1}^{T} m_{0i}\nu^i + \nu^T[R - S(T)]\right\}.$$

Notice that as ν → 1, crent (T ) → c(T ), the total cost per unit time. The eco-
nomic life can be obtained by minimising crent (T ) , typically using a spreadsheet
by considering a range of values of T.
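
The search over a range of values of T can equally be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the linear operating cost profile and the replacement cost of 100 are all hypothetical, and the residual value is assumed to be zero.

```python
def rent(T, op_costs, R, resale, nu):
    """Annuity c_rent(T) for replacement every T periods, in perpetuity.

    op_costs[i - 1] is m_0i, the operating cost in period i of an item's life;
    R is the replacement cost; resale(T) is the residual value S(T);
    nu is the discount factor per period (0 < nu <= 1).
    """
    cycle_cost = sum(op_costs[i - 1] * nu**i for i in range(1, T + 1))
    cycle_cost += nu**T * (R - resale(T))
    if nu == 1.0:
        # Limit nu -> 1: (1 - nu)/(1 - nu^T) -> 1/T, the cost per unit time.
        return cycle_cost / T
    return (1 - nu) / (1 - nu**T) * cycle_cost


def economic_life(op_costs, R, resale, nu):
    """Crude search over T = 1..len(op_costs) for the T minimising the rent."""
    candidates = range(1, len(op_costs) + 1)
    return min(candidates, key=lambda T: rent(T, op_costs, R, resale, nu))


# Hypothetical data: operating cost 10 + 2i in period i, no resale value.
op_costs = [10 + 2 * i for i in range(1, 31)]


def no_resale(T):
    return 0.0


T_star = economic_life(op_costs, R=100.0, resale=no_resale, nu=1.0)
print(T_star, rent(T_star, op_costs, 100.0, no_resale, 1.0))
```

Without discounting, the cost per unit time for these made-up figures is T + 11 + 100/T, so the search settles on T* = 10. Rerunning with a discount factor below one shows how discounting shifts the economic life.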

12.3.2 Analysing Technological Change Using a Two-cycle Model

The economic life model can be adapted to consider technological change in a
number of ways. One can consider economic factors for new models of equipment
(future operating costs) in a parametric fashion, specifying a model for technologi-
cal change which then implies operating cost functions, replacement cost and resid-
ual values for each replacement cycle into the future (Elton and Gruber 1976).
Alternatively, one can model replacement over a limited time scale, either by
fixing the time horizon, or by fixing the number of replacement cycles. Christer
(1984) did the latter and described a two-cycle model which models the immediate
replacement decision problem by considering existing plant as having age τ and
age-related operating cost $m_{0i}$, and new plant as having operating cost $m_{1i}$. In its
discrete form, the annuity for this model is

$$c^2_{rent}(K, L) = \frac{\sum_{i=1}^{K} m_{0(i+\tau)}\nu^i + \nu^K\left\{R_1 - S_0(K + \tau) + \sum_{i=1}^{L} m_{1i}\nu^i + \nu^L[R_1 - S_1(L)]\right\}}{\sum_{i=1}^{K+L}\nu^i}. \tag{12.2}$$

Here $K$ and $L$ are decision variables, with $K$ modelling the time (from now) to
replacement of the existing asset; $K + L$ is the time to second replacement. The
advantage of this model is that one only need estimate the operating cost of the
existing and new assets (as functions of age), the capital cost for the new asset, $R_1$,
and the age-related resale or residual value of new and existing assets, $S_0$, $S_1$.
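
A direct transcription of Equation 12.2, with a crude grid search over the two decision variables, might look as follows. This is a sketch under stated assumptions: the function names, the cost functions and every numerical value in `params` are hypothetical and serve only to exercise the formula.

```python
def two_cycle_rent(K, L, m0, m1, R1, S0, S1, tau, nu):
    """Annuity of Equation 12.2: replace after K periods, then run L more.

    m0(a), m1(a): operating cost of existing / new plant in its a-th period;
    S0(a), S1(a): residual value of existing / new plant at age a;
    R1: purchase cost of new plant; tau: current age of existing plant;
    nu: discount factor per period.
    """
    first_cycle = sum(m0(i + tau) * nu**i for i in range(1, K + 1))
    second_cycle = sum(m1(i) * nu**i for i in range(1, L + 1))
    numerator = first_cycle + nu**K * (
        R1 - S0(K + tau) + second_cycle + nu**L * (R1 - S1(L))
    )
    return numerator / sum(nu**i for i in range(1, K + L + 1))


def best_two_cycle_policy(k_max, l_max, **params):
    """Crude grid search for the (K, L) pair minimising the annuity."""
    grid = [(K, L) for K in range(k_max + 1) for L in range(1, l_max + 1)]
    return min(grid, key=lambda kl: two_cycle_rent(kl[0], kl[1], **params))


# Hypothetical inputs, chosen only to exercise the functions.
params = dict(
    m0=lambda a: 5.0 + 1.0 * a,   # existing plant: cost rises with age
    m1=lambda a: 3.0 + 0.5 * a,   # new plant: cheaper to run
    R1=60.0,
    S0=lambda a: 0.0,             # no residual value assumed
    S1=lambda a: 0.0,
    tau=8,
    nu=0.95,
)
K_star, L_star = best_two_cycle_policy(k_max=20, l_max=20, **params)
print(K_star, L_star, two_cycle_rent(K_star, L_star, **params))
```

Allowing K = 0 in the grid represents immediate replacement of the existing asset.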

12.3.3 A Fixed Planning Horizon Model

In the financial appraisal of projects, a standard approach fixes the time horizon
and determines the NPV of future costs over this horizon (e.g. Northcott 1985).
This fixed horizon model has been studied by Scarf and Hashem (2003) and its
simplicity lends itself to application in complex contexts (e.g. Scarf and Martin
2001). The annuity for this model can be derived from Equation 12.2 above simply
by setting X = K and K + L = h , the length of the planning horizon, and then
considering h as fixed. Whence, there is only one decision variable, X, the time to
replacement. Given the possibility that X = h , that is, no replacement over the
planning horizon whence we retain the current asset, the annuity function has a
discontinuity at X = h , and X * = h implies that it is not optimal to undertake the
(replacement) project. Furthermore, since the replacement at the end of the horizon
has a fixed cost (with respect to the decision variable X) its inclusion or exclusion
has no effect on the optimal time to replacement. It is natural not to include the
replacement cost at the horizon-end since a standard financial appraisal approach
would only account for revenue costs up to project execution, capital costs at
project execution, subsequent revenue costs up to the horizon-end, and residual
values. Including the replacement at h on the other hand allows cost comparisons
with the two-cycle model and the associated rent, Equation 12.2. We take the
former approach here however and the annuity is

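$$c^h_{rent}(X) = \begin{cases} \left\{\sum_{i=1}^{X} m_{0(i+\tau)}\nu^i + \nu^X[R_1 - S_0(X + \tau)] + \sum_{i=X+1}^{h} m_{1(i-X)}\nu^i - \nu^h S_1(h - X)\right\}\Big/\sum_{i=1}^{h}\nu^i, & X < h, \\[1ex] \left\{\sum_{i=1}^{h} m_{0(i+\tau)}\nu^i - \nu^h S_0(h + \tau)\right\}\Big/\sum_{i=1}^{h}\nu^i, & X = h. \end{cases} \tag{12.3}$$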
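
Equation 12.3 can be evaluated in the same way, with the single decision variable X searched over 0, 1, ..., h. The sketch below is illustrative only: the function names and all cost inputs are hypothetical, and residual values are assumed to be zero.

```python
def fixed_horizon_rent(X, h, m0, m1, R1, S0, S1, tau, nu):
    """Annuity of Equation 12.3 for replacement at time X within horizon h.

    X == h means the existing asset is retained over the whole horizon.
    """
    annuity_factor = sum(nu**i for i in range(1, h + 1))
    if X == h:
        cost = sum(m0(i + tau) * nu**i for i in range(1, h + 1))
        cost -= nu**h * S0(h + tau)
    else:
        cost = sum(m0(i + tau) * nu**i for i in range(1, X + 1))
        cost += nu**X * (R1 - S0(X + tau))
        cost += sum(m1(i - X) * nu**i for i in range(X + 1, h + 1))
        cost -= nu**h * S1(h - X)
    return cost / annuity_factor


def best_replacement_time(h, **params):
    """Search X = 0..h; a result equal to h means 'do not replace'."""
    return min(range(h + 1), key=lambda X: fixed_horizon_rent(X, h, **params))


# Hypothetical inputs: ageing existing plant, cheaper new plant.
params = dict(
    m0=lambda a: 5.0 + 1.0 * a,
    m1=lambda a: 3.0 + 0.5 * a,
    R1=60.0,
    S0=lambda a: 0.0,
    S1=lambda a: 0.0,
    tau=8,
    nu=0.95,
)
X_star = best_replacement_time(22, **params)
print(X_star, fixed_horizon_rent(X_star, 22, **params))
```

Treating X = h as a separate branch mirrors the discontinuity noted in the text: no purchase cost is incurred when the current asset is retained throughout.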

12.3.4 A Modified Two-cycle Model

It is interesting to consider the behaviour of these models at Equations 12.2 and
12.3 when the operating costs are constant (or increasing only slowly), since it is
not unusual for plant to age only slowly. Of course, replacement of an existing
asset in these circumstances would only be contemplated if the operating cost (or
functionality) of the new asset is significantly lower (or functionality higher), e.g.
electricity supply network components; see Brint et al. (1998). The behaviour is
simplest to follow for the continuous time formulation when the discount factor is
unity (no discounting) and residual values are zero. Under these circumstances, the
costs per unit time (annuity) for the two-cycle model and the fixed horizon model are

$$c^2_{rent}(K, L) = (Km_0 + R + Lm_1 + R)/(K + L), \tag{12.4}$$

$$c^h_{rent}(X) = \begin{cases} [Xm_0 + R + (h - X)m_1]/h, & X < h, \\ m_0, & X = h, \end{cases}$$

respectively. From Equation 12.4 we get
$dc^2_{rent}(K, L)/dK = [L(m_0 - m_1) - 2R]/(K + L)^2$. Thus there is in general no $K$ such that
$dc^2_{rent}(K, L)/dK = 0$; rather, $dc^2_{rent}(K, L)/dK > 0$ for all $K$, and hence $K^* = 0$, if $L(m_0 - m_1) > 2R$
for any fixed $L$. Furthermore, $dc^2_{rent}(K, L)/dL = [K(m_1 - m_0) - 2R]/(K + L)^2$, and
so if $m_0 > m_1$ (which is a necessary condition for $dc^2_{rent}(K, L)/dK > 0$) then
$dc^2_{rent}(K, L)/dL < 0$ and so $L^*$ does not exist. However, in any practical implementation
of the two-cycle model, one would bound $L$ with some upper value $l_{max}$,
say. Then the optimal policy would be $K^* = 0$ ($L^* = l_{max}$) if

$$l_{max}(m_0 - m_1) > 2R. \tag{12.5}$$

We can consider a similar argument for the fixed horizon model. Thus,
$dc^h_{rent}(X)/dX = (m_0 - m_1)/h$ for $X < h$, and so $dc^h_{rent}(X)/dX > 0$ if $m_0 > m_1$. However,
since $c^h_{rent}(X)$ has a discontinuity at $X = h$, $X^* = 0$ is optimal only if $m_0 > m_1$
and $c^h_{rent}(0) < c^h_{rent}(h)$. That is, if $(R + hm_1)/h < m_0$, that is, if

$$h(m_0 - m_1) > R. \tag{12.6}$$

Thus, comparison of inequalities at Equations 12.5 and 12.6 shows that the two
models have different properties in terms of the behaviour of optimal policy as a
function of cost parameters. Thus the two-cycle model is inconsistent with standard
financial models. However, a simple modification to the model will correct this
inconsistency. Scarf et al. (2006) suggest simply omitting the replacement at the end
of the second cycle. For the constant revenue case above, the rent becomes
$c^2_{rent}(K, L) = (Km_0 + R + Lm_1)/(K + L)$ and optimal policy would be $K^* = 0$
($L^* = l_{max}$) if $l_{max}(m_0 - m_1) > R$, which is consistent with the fixed horizon model
and hence with standard financial appraisal models.
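
The different cost hurdles can be seen numerically. The sketch below evaluates the constant-cost, undiscounted forms of the three models for a replacement cost lying between the two hurdles, that is, with R < l_max(m0 - m1) < 2R. The function names and the values m0 = 9, m1 = 7, R = 30 and l_max = h = 22 are hypothetical, chosen only to place R in that range.

```python
def two_cycle(K, L, m0, m1, R):
    """Equation 12.4: two replacements, at K and at K + L."""
    return (K * m0 + R + L * m1 + R) / (K + L)


def modified_two_cycle(K, L, m0, m1, R):
    """Two-cycle rent with the replacement at K + L omitted."""
    return (K * m0 + R + L * m1) / (K + L)


def fixed_horizon(X, h, m0, m1, R):
    """Constant-cost fixed horizon rent; X == h means no replacement."""
    return m0 if X == h else (X * m0 + R + (h - X) * m1) / h


m0, m1, R = 9.0, 7.0, 30.0   # hypothetical: 22 * (9 - 7) = 44, so R < 44 < 2R
l_max = k_max = h = 22

k_two = min(range(k_max + 1), key=lambda K: two_cycle(K, l_max, m0, m1, R))
k_mod = min(range(k_max + 1),
            key=lambda K: modified_two_cycle(K, l_max, m0, m1, R))
x_fix = min(range(h + 1), key=lambda X: fixed_horizon(X, h, m0, m1, R))

print("two-cycle K*:", k_two)            # pushed to the upper bound k_max
print("modified two-cycle K*:", k_mod)   # immediate replacement
print("fixed horizon X*:", x_fix)        # immediate replacement
```

With these figures the unmodified two-cycle model defers replacement as long as the search space allows, whereas the modified model and the fixed horizon model both replace immediately, in line with inequalities 12.5 and 12.6.
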
It would then appear that the two-cycle model with its two replacements
(at $t = K$ and at $t = K + L$) is applicable for the case of increasing operating costs, and
that a modified two-cycle model with one replacement (at $t = K$ only) is applicable for operating
costs that are constant or increasing only slowly. However, this issue can be
resolved. When operating costs are increasing only slowly, typically $L^*$ does not
exist and, in practice, $L$ must be constrained such that $L \le l_{max}$ (as pointed out
above) since numerically we can only search for $L^*$ over a finite space. In
constraining $L \le l_{max}$ under the two replacements formulation, we impose a
replacement at $l_{max}$ when in fact there should not be a second replacement since
$L^*$ does not exist. This then suggests that the two-cycle replacement model should
be modified in the following subtle way: if there does not exist an $L$ such that
$c^2_{rent}(K, L)$ has a minimum strictly within the search space, that is, within
$\{(K, L) : 0 < K < K_{max},\ 0 < L < l_{max}\}$, then, when determining that $K$ which
minimises $c^2_{rent}(K, l_{max})$, no replacement cost should be incurred at $t = K + l_{max}$.
Thus the model should be modified so that there is only one replacement.
Otherwise the “cost hurdle” for replacement of the current asset will be set
artificially high (inequality at Equation 12.5). Thus, in all practical situations for
which operating costs are increasing only slowly, one should use this modified
two-cycle model or the fixed horizon model as a special case.

12.3.5 Discussion of Finite Horizon Replacement Models

Using the fixed horizon model or equivalently using the modified two-cycle model
with a finite search space may lead to significant end-of-horizon effects (since
costs beyond the horizon-end are ignored). Thus time to first replacement will
depend on h (or equivalently lmax ). Choice of h (or lmax ) will need to be con-
sidered carefully; in practice the horizon may be specified by company policy on
accounting methods and discounting may reduce those costs incurred in the distant
future to a small or insignificant level. Furthermore, specification of the residual
value may be problematic, particularly for non-movable assets with either constant
or slowly changing operating revenue. This is because the market resale value of
the asset is arguably zero. However, the residual value, as measured by the benefit
of the function the asset performs rather than its value if sold, may be non-zero. In
this case company policy may prescribe a “straight line” depreciation so that the
residual value is proportional to the estimated asset life fraction remaining at
replacement or horizon end. However, such an approach may be difficult to justify
since the asset life is unknown and linearity is a strong assumption. One possible
approach here would be to look at sensitivity to the parameters in a residual value
model such as this but there would be a number of parameters and this may
become over-complex. An alternative would be to equate the residual value at the
horizon-end to the cost-benefit of the replacement (whenever it took place) over
the next m years. But this then amounts to extending the planning horizon from
K+L to K+L+m or from h to h+m. This of course will lead to models the same as
those considered at present but with longer horizons (or to a three cycle model if a
subsequent refurbishment is also considered). Thus if one accepts that a two-cycle
model is sufficient for modelling purposes, then, logically, consideration of resid-
ual values for a non-movable asset amounts to considering sensitivity to horizon
length (either h or K+L whichever model is used).
Restricting replacement models to (at most) two cycles may lead to sub-optimal
replacement policies. However, for typical discount rates and planning horizons the
modelling of operation and replacement beyond the end of the second cycle may
have only a small effect on the time to first replacement, the issue of principal
interest in practice. Replacement models that are not restricted to (at most) two
replacement cycles are considered in Section 12.3.

12.3.6 An Application of Finite Horizon Replacement Models

Example 12.1
Decision making regarding the replacement of escalators on a mass transit rail
system in a particular city has been considered over a number of years by the
corporation that owns and operates the system (Scarf et al. 2006). Maintenance of
escalators is generally outsourced to equipment suppliers due to the difficulty that
alternative contractors have in obtaining proprietary spares. The original manu-
facturers can keep costs down as a result of the economy of scale that is achievable
through maintaining equipment over a large number of client organisations. Cur-
rently, the corporation operates of the order of 600 escalators and the annual
maintenance contract price is over $10 million. Escalator replacement is therefore a
significant issue within the organisation.
Studies by the corporation suggest that the economic life of escalators is of the
order of 25 years but that, based on overseas experience, escalator life can be
extended to up to 40 years. However, given the size of the fleet, a strategy has to be
set to manage escalator maintenance and to deal with the replacement or refurbish-
ment of older escalator assets. A key factor in this strategy is the approach of the
organisation to the re-negotiation of maintenance contracts and in particular to
determine the scale of refurbishment of older assets and the level of major parts
replacement and supply within the negotiated contract.
For the presentation of the modelling work in this example, it is necessary to
consider the asset management options open to the corporation in a simple manner,
and a homogeneous sub-fleet of the escalators is considered, with modelling carried
out for a typical escalator—this is a reasonable simplification since all escalators in
the sub-fleet were installed at approximately the same time. For this group,
replacement, although crudely costed, was not really a viable option—economic
costs were too high and disruption unacceptable given the duration of replacement
work. Refurbishment by the original manufacturer, replacing worn parts, upgrading
the control system and maintenance access was being carefully considered by the
corporation as a viable strategy for managing the asset life. Cost savings could be
achieved through a reduction in the annual maintenance contract price subsequent
to refurbishment. Thus, put simply, for the escalator group, the corporation was
faced with the decision: continue with the current relatively higher-price main-
tenance contract or refurbish and benefit from a new relatively lower-priced
maintenance contract. Other benefits would also accrue from refurbishment for
both contractor and the corporation. For the contractor, improved access and safety
for maintenance was part of the refurbishment package. For the corporation, up-
grade of the control system would result in fewer unplanned escalator stoppages.
We consider some four asset management options: “do nothing”—continue
with high-price maintenance contract; “refurb”—renew worn parts, retro-fit new
control system and proceed with lower-price maintenance contract; “delay
refurb”—delay refurbishment for up to n years; “replace”—a full replacement
option with nominal costs included for comparison purposes. The costs of refur-
bishment (per escalator) in the present study were obtained from initial quotations
from the respective manufacturers: these are $63K for refurbishment. On-going
annual maintenance contract costs (per escalator) are: $9K pre-refurbishment; $7K
post-refurbishment. Prior to refurbishment the cost of replacement of major parts is
in addition to the annual maintenance contract and major parts are replaced on the
basis of condition. Post-refurbishment, the annual maintenance contract includes
replacement of major parts at no extra cost. Given that we might expect major parts
to be replaced somewhat less frequently than dictated by their recommended lives,
we introduce a cost parameter to model such life-extension—this is called the
effective life factor, ρ . ρ = 1 implies that major parts are replaced at a frequency
corresponding to their recommended life (for example, once every 25 years for the
steps at a cost of $48K), and the replacement frequency ∝ 1 / ρ ( ρ = 2 implies
replacement of steps once every 50 years). The cost of a replacement ($170K) is a
nominal figure and used mainly for crude comparison with refurbishment. In
practice, replacement may cost significantly more than this.
The corporation recommends a discount rate of r = 0.11 and a projected inflation
rate of i = 0.05. This corresponds to an effective discount rate of approximately 0.057 and hence
to a discount factor, ν, of approximately 0.946 (ν = (1 + i)/(1 + r)). Integral to the refurbishment option is the up-grading of
the escalator control system to allow “power-dip ride-through”—this facility
prevents unnecessary emergency stops caused by momentary power loss that can
cause injuries to passengers. However, the effectiveness of the “ride-through”
facility is uncertain; hence we introduce another cost parameter, control system
retro-fit effectiveness, which represents the percentage of passenger injuries due to
power dips that would be prevented by up-grading of the control system at
refurbishment. Also, for the purposes of sensitivity analysis it is necessary to place
a cost on an emergency stop due to a power dip. We call this the penalty cost of
failure. Historic records of the number of such stops (approximately 0.5 stops per
escalator per year) and the retro-fit effectiveness are used to calculate a total
penalty cost (saving) per escalator per year post-refurbishment. Another parameter
that is difficult to quantify was also considered, the passenger delay cost due to
refurbishment, but the results are omitted here.
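
The discount arithmetic and the penalty cost saving described above can be checked with a few lines of Python. The product used for the annual saving (stops per year × retro-fit effectiveness × penalty cost per stop) is an assumption about how the three quantities are combined; the text does not spell out the formula. The effectiveness of 75% and the penalty cost of $5K are the intermediate values quoted with Table 12.1.

```python
r, i = 0.11, 0.05                       # nominal discount rate, inflation rate

# Effective (inflation-adjusted) rate d solves 1/(1 + d) = (1 + i)/(1 + r).
effective_rate = (1 + r) / (1 + i) - 1
discount_factor = 1 / (1 + effective_rate)   # equals (1 + i)/(1 + r)

stops_per_year = 0.5                    # emergency stops per escalator per year
effectiveness = 0.75                    # share prevented by the retro-fit
penalty_cost = 5.0                      # $K per stop

# Assumed form of the post-refurbishment penalty cost saving, $K per year.
annual_saving = stops_per_year * effectiveness * penalty_cost

print(round(effective_rate, 3), round(discount_factor, 3), annual_saving)
```
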
In Table 12.1, we look at annuities for the modified two-cycle model and the
fixed horizon model for a range of values of the respective decision variables. Note
that the annuities for the fixed horizon model lie on the diagonal indicated. This is
because the fixed horizon model is equivalent to the modified two-cycle model
with the additional constraint that K+L= h. These annuities are also presented in
Figure 12.1a. Figure 12.1b shows the annuities for the two models in the case of no
discounting—discounting has the effect of slightly extending the economic life
(since the NPV of future costs is reduced) and this accounts for the small
difference in optimum policy between the fixed horizon model and the modified
two-cycle model in Figure 12.1(a), X* = 1, K* = 4 (years).

Table 12.1. Annuities ($000s per escalator per year) for modified two-cycle model
with refurbishment at K years from now and again after a further L years. Annuities for
fixed horizon model with h = 22 years (the entries with K + L = 22) are marked *, except for
X* = 22 (no replacement) for which annuity = $139.4K. Cost parameters as follows:
refurbishment cost, $62.9K; effective discount rate, 0.06 (equivalent to inflation rate of
approximately 0.05 and discount rate of 0.11); penalty cost of failure, $5K; effective life
parameter, 1.5; control system retro-fit effectiveness, 75%; cost of refurbishment delay,
$10K; annual maintenance contract pre-refurbishment, $8.8K (per escalator); annual
maintenance contract post-refurbishment, $6.9K (per escalator).

Rows: K, length of the first cycle, years. Columns: L, length of the second cycle, years.

K \ L      1       3       5       7       9      11      13      15      17      19      21
    1  491.8   298.0   233.0   200.3   180.7   167.8   158.6   151.8   146.6   142.5   139.2*
    3  302.4   236.1   202.7   182.8   169.6   160.2   153.3   148.0   143.8   140.5*  137.8
    5  241.1   206.8   186.2   172.5   162.9   155.8   150.3   146.0   142.5*  139.7   137.4
    7  211.6   190.2   176.1   166.1   158.7   153.1   148.6   145.0*  142.1   139.7   137.7
    9  193.9   179.3   169.0   161.4   155.5   150.9   147.2*  144.2   141.7   139.6   137.9
   11  182.2   171.6   163.7   157.7   153.0   149.2*  146.1   143.6   141.4   139.6   138.1
   13  173.9   165.9   159.7   154.9   151.0*  147.8   145.2   143.0   141.2   139.6   138.3
   15  167.8   161.5   156.5   152.6*  149.3   146.7   144.4   142.5   140.9   139.6   138.4
   17  163.1   158.0   154.0*  150.7   148.0   145.7   143.8   142.1   140.8   139.6   138.5
   19  159.4   155.3*  151.9   149.1   146.8   144.9   143.2   141.8   140.6   139.5   138.6
   21  156.4*  153.0   150.2   147.8   145.9   144.2   142.8   141.5   140.4   139.5   138.7
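
The way the two optima are read off Table 12.1 can be sketched as a lookup over the tabulated values. The data below are copied from the table; the variable names are illustrative. The table only covers odd values of K and L, so the grid minimum found here need not coincide with an optimum obtained from a finer search.

```python
YEARS = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]   # values of K (rows), L (columns)
ANNUITY = [  # $000s per escalator per year, from Table 12.1
    [491.8, 298.0, 233.0, 200.3, 180.7, 167.8, 158.6, 151.8, 146.6, 142.5, 139.2],
    [302.4, 236.1, 202.7, 182.8, 169.6, 160.2, 153.3, 148.0, 143.8, 140.5, 137.8],
    [241.1, 206.8, 186.2, 172.5, 162.9, 155.8, 150.3, 146.0, 142.5, 139.7, 137.4],
    [211.6, 190.2, 176.1, 166.1, 158.7, 153.1, 148.6, 145.0, 142.1, 139.7, 137.7],
    [193.9, 179.3, 169.0, 161.4, 155.5, 150.9, 147.2, 144.2, 141.7, 139.6, 137.9],
    [182.2, 171.6, 163.7, 157.7, 153.0, 149.2, 146.1, 143.6, 141.4, 139.6, 138.1],
    [173.9, 165.9, 159.7, 154.9, 151.0, 147.8, 145.2, 143.0, 141.2, 139.6, 138.3],
    [167.8, 161.5, 156.5, 152.6, 149.3, 146.7, 144.4, 142.5, 140.9, 139.6, 138.4],
    [163.1, 158.0, 154.0, 150.7, 148.0, 145.7, 143.8, 142.1, 140.8, 139.6, 138.5],
    [159.4, 155.3, 151.9, 149.1, 146.8, 144.9, 143.2, 141.8, 140.6, 139.5, 138.6],
    [156.4, 153.0, 150.2, 147.8, 145.9, 144.2, 142.8, 141.5, 140.4, 139.5, 138.7],
]
H = 22                  # fixed horizon, years
NO_REPLACEMENT = 139.4  # annuity for X = 22, from the table caption

cells = [
    (ANNUITY[r][c], K, L)
    for r, K in enumerate(YEARS)
    for c, L in enumerate(YEARS)
]

# Modified two-cycle model: minimum over the whole tabulated grid.
best_two_cycle = min(cells)

# Fixed horizon model: restrict to K + L = H, then compare with no replacement.
best_fixed = min((a, K) for a, K, L in cells if K + L == H)
replace_is_optimal = best_fixed[0] < NO_REPLACEMENT

print(best_two_cycle)                   # (137.4, 5, 21)
print(best_fixed, replace_is_optimal)   # (139.2, 1) True
```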

The cost parameters in Table 12.1 and Figure 12.1 are held at intermediate
values. In Figure 12.2, we present annuities for a number of “replacement” options
as a function of each of the cost parameters. These replacement options correspond
to those considered by the corporation, with “refurb” referring to immediate refur-
bishment (in year 1), and “delay refurb” referring to refurbishment in year 10 (from
time of study). Given the size of the fleet, a constraint on the number of escalators
that can be refurbished at any one time and the duration of refurbishment, we
would expect the refurbishment programme to last some 15 years and therefore a
significant proportion of the fleet would experience this kind of delay prior to
refurbishment. Therefore we include it as a particular policy for indicative pur-
poses. We use the fixed horizon model here in order to make comparisons between
annuities—this is because one would wish to compare the cost of different options
over the same horizon. Equivalently, we could use the modified two-cycle model
with the additional constraint K+L= h= 22 (years), say.

[Figure 12.1: two panels, a and b, each plotting annuity per escalator (HK$K) against K (years), with curves for L = 10, 12, 14, 16, 18, 21 and for the fixed horizon model.]

Figure 12.1a,b. Annuities ($000s per escalator per year) for modified two-cycle model with
refurbishment at K years from now and operation for a further L years. Annuities for fixed
horizon model with h = 22 years also shown (X < 22: bold, solid curve; X = 22: ■). Cost
parameters as Table 12.1 except: a effective discount rate equals 0.06 (equivalent to
inflation rate of approximately 0.05 and discount rate of 0.11); b no discounting.

From Figure 12.2 we can see that optimum policy is certainly sensitive to these
cost factors with the influence of cost parameters as expected. Threshold values that
lead to a step-change in the optimum policy (option) can be observed from these
figures. Thus while estimation of the penalty cost of failure, for example, may be
difficult and contentious, the importance of its effect can be observed. This may
then provide an incentive for further investigation of this parameter or discussion
about whether its true value is above or below the threshold of policy change.
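
The idea of a threshold value can be illustrated with a deliberately simplified comparison of "do nothing" against immediate refurbishment. This toy calculation is not the model behind Figure 12.2: it ignores discounting, major parts replacement and delay costs, and it assumes that the annual penalty is the number of stops multiplied by the penalty cost, reduced after refurbishment by the retro-fit effectiveness. The function names are hypothetical; the input figures are those quoted in the example.

```python
def annuity_do_nothing(penalty, m0, stops):
    """Annual cost of continuing as is: contract price plus stop penalties."""
    return m0 + stops * penalty


def annuity_refurb_now(penalty, m1, R, h, stops, effectiveness):
    """Annual cost of refurbishing now, spread over a horizon of h years."""
    return R / h + m1 + stops * (1.0 - effectiveness) * penalty


def threshold_penalty(m0, m1, R, h, stops, effectiveness):
    """Penalty cost at which the two options cost the same per year."""
    return (R / h + m1 - m0) / (stops * effectiveness)


# Figures quoted in the example ($K): contract prices 9 and 7, refurbishment 63.
m0, m1, R, h = 9.0, 7.0, 63.0, 22
stops, effectiveness = 0.5, 0.75

p_star = threshold_penalty(m0, m1, R, h, stops, effectiveness)
print("threshold penalty cost ($K):", round(p_star, 2))

for p in (0.0, 2.0, 5.0, 10.0):
    nothing = annuity_do_nothing(p, m0, stops)
    refurb = annuity_refurb_now(p, m1, R, h, stops, effectiveness)
    print(p, "refurb" if refurb < nothing else "do nothing")
```

Under these simplifying assumptions the preferred option switches once the penalty cost passes the threshold, which is the kind of step-change a decision maker would be asked to judge as realistic or not.
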
As a final note for the escalator replacement problem in particular, one could
argue that the cost of differing options or policies will reflect the maintenance
contractor’s profit requirement, whatever the details of the arrangement, and therefore
the total costs of options would be expected to vary very little. What can differ,
however, is that some options may lead to lower risk (for example where the
contractor bears the cost of major parts’ wear-out which may be subject to significant
uncertainty) and lower risk is certainly desirable from the point of view of the
corporation.

12.3.7 A Model for an Inhomogeneous Fleet

Consider now a fleet consisting of sub-fleets classified on the basis of class (e.g.
vehicle-type) and age (or condition) so that the operator of the fleet is concerned with
the replacement of sub-fleets, and not with replacement of individual equipment or
with replacement of the entire fleet. For this fleet, it is natural to focus on the replace-
ment of particular sub-fleet(s). The economic life models of the previous section
must be extended given that the replacement of particular sub-fleets has cost implica-
tions for the rest of the fleet.

[Figure 12.2: four panels, a to d, each plotting annuity per escalator (HK$000s) for the options refurb, delay refurb, do nothing and replace, against: a effective life factor; b penalty cost of failure (HK$K); c discount rate; d horizon (years).]

Figure 12.2a–d. Annuities (per escalator) as a function of cost parameters for fixed horizon
model, with h = 22 years for various refurbishment/replacement options: a annuity vs.
effective life parameter; b annuity vs. penalty cost of failure; c annuity vs. nominal discount
rate; d annuity vs. horizon length h. Cost parameter values when not varying set at: effective
life, 1.5; penalty cost of failure, $5K; nominal discount rate, 0.11; control system retro-fit
effectiveness, 75%; refurbishment delay cost, $10K.
Replacement of Capital Equipment 299

Considering a “rolling schedule” of replacements, the questions of interest are
then: which sub-fleet should be replaced first (second, third)?; when should they be
replaced?; and what model (equipment specification) should be purchased at
replacements? We call the order in which the sub-fleets are replaced the replacement
schedule, and we call a particular replacement schedule, along with the time scale
for replacements and the choice of model for purchase, a replacement policy. It is
expected that the operator would have significant input into the choice of sub-fleets
to be replaced, based on experience with the fleet. The choice of model for
purchase at replacements is also likely to be decided in advance by the operator.
This would give the operator a level of control over the modelling process which is
highly desirable in practice (Russell 1982). The purpose of the modelling is there-
fore to provide decision support for the operator on: (i) the cost implications of
alternative replacement schedules; (ii) the time scale for replacements and budget
requirements; (iii) the cost implications of particular sub-optimal policies necessitated
by technological obsolescence or changing economic considerations (e.g.
Suzuki and Pautsch 2005).
The most important considerations are the choice of sub-fleet for first replace-
ment and the time to first replacement. It is expected that the optimal policy will be
updated periodically, as the fleet evolves, and when new information about main-
tenance costs and new (equipment) models becomes available.
To model the replacement scenario described, we consider an extension of the
simple fixed horizon model (Equation 12.3) in which we have a variable number,
N, of replacement cycles. Let the inhomogeneous fleet comprise r sub-fleets,
with the current sub-fleets indexed by k = 1,..,r. New replacement sub-fleets are
indexed by k = r+1,..,r+N. For a fixed planning horizon of length h, and given
replacement schedule and choice of class for the replacement sub-fleets, the
decision variables are then: number of cycles, N(>1); and time from beginning of
i-th cycle to the replacement of sub-fleet i, $L_i$ (i=1,...,N). The whole fleet is
operated over cycle i, which ends with the resale of sub-fleet i of size $n_i$ and
purchase of sub-fleet r+i of size $n_{r+i}$. Sub-fleets need not be homogeneous and
the current ages of plant are denoted by $\tau_{ij}$ (i = 1,..., r+N). The fleet size may be
constant ($n_i = n_{r+i}$ for all i) or variable. For a given replacement schedule, the total
discounted cost over the horizon can be expressed as

$$
c_{tdc}(N, L; h) = \sum_{i=1}^{N} \nu^{t_i} \left\{ \sum_{s=t_{i-1}+1}^{t_i} m_i(s)\, \nu^{s-t_i} + n_{r+i} R_{r+i} - S_i(t_i) \right\} \tag{12.7}
$$

where $t_i = \sum_{j=0}^{i-1} L_j$ with $L_0 = 0$. Here $m_i(\cdot)$ is the age related operating cost of
the whole fleet in cycle i; $S_i(\cdot)$ is the age related resale value of plant in sub-fleet
i; $R_{r+i}$ is the cost of each replacement plant in sub-fleet r+i; and ν is the discount
rate. The costs $m_i(\cdot)$ and $S_i(\cdot)$ may be expressed as

$$
m_i(s) = \sum_{k=i}^{r+i-1} \sum_{j=1}^{n_k} M_k(\tau_{kj} + s), \quad (i = 1,\ldots,N),
$$

$$
S_i(L_i) = \sum_{j=1}^{n_i} S_{i1}(\tau_{ij} + L_i), \quad (i = 1,\ldots,N),
$$

where $M_k(\cdot)$ is the age related operating cost per unit time for an individual plant
in sub-fleet k (k = 1,..., r+N), and $S_{i1}(\cdot)$ is the age related resale value for an
individual plant in sub-fleet i. (Also, $\tau_{kj} = 0$ for k>r.) Appropriate penalty costs,
associated with failures, may be incorporated into the operating costs.
The annuity, $c_{tdc}(N, L; h) / \sum_{i=1}^{h} \nu^i$, or other suitable objective function may
then be minimized subject to the constraint $\sum_{i=1}^{N} L_i = h$. Technological change is
allowed for in that costs relating to proposed plant for cycles 2,..,N may be
assigned as appropriate. The optimum replacement schedule may be obtained by
minimizing the objective function over all possible schedules. In practice the range
of possible schedules would be narrowed greatly by the experience of the operator.
However, as the decision-maker will not have a firm value for the horizon length,
the optimum policy must be robust to variation in h. Furthermore, because the fleet
is mixed, both different replacement schedules and different planning horizon
lengths will give rise to different age compositions of the fleet at the end of the
horizon. Thus replacement policies may need to be compared not just on the basis
of cost but also on the basis of the age composition of the fleet at the end of the
planning horizon. This final age composition can be considered as quantifying the
end-of-horizon effect.
Non-uniform usage, particularly between sub-fleets, may be allowed for by
varying the fleet size at replacements. For example, if older plant are under-
utilized, a smaller number of new plant would be required to meet the demand
currently placed on an older sub-fleet. This effectively reduces the replacement
cost for that sub-fleet by a factor equal to the ratio of the utilization of the old to the
new sub-fleet. Of course, other more complex methods of accounting for differing
usage may be considered. Given sufficient data, operating costs could be quantified
in terms of usage and optimum policy may be obtained given forecasts for usage of
sub-fleets over the planning horizon.
The models may be extended to the case in which sub-fleets are retired as
spares. The number of sub-fleets would simply increase by one at each replacement,
with the costs associated with the retired sub-fleet added. Predicting operating
costs for a retired sub-fleet would be difficult, however, as it is likely that no data
would be available for this. Also it is assumed that equipment is bought new: in
principle it is a simple matter to extend Equation 12.7 to the case in which used
equipment may be purchased.
Note that the formulation as presented allows for the possibility for a sub-fleet
to be composed of a single unit of equipment. This may be appropriate if the fleet
is small. The complexity of the computational problem increases rapidly as the
number of sub-fleets increases. However we do not consider efficient algorithms
for determining optimum policy here.

12.3.8 Application of a Replacement Model for an Inhomogeneous Fleet

Example 12.2
Scarf and Hashem (1997) consider the inter-city coach fleet operated by Express
National Berhad in Malaysia. The fleet comprised 160 vehicles of 5 vehicle-types
of varying ages, with maintenance cost modelled as $M(\tau) = a\tau^b$ and resale
values $S(\tau) = 0.6R(0.81)^\tau$, for replacement cost R (Table 12.2). The data available
were not sufficient for obtaining the maintenance cost model for all vehicle-
types—for example, for the MAN, only data relating to their first year of operation
were available. Furthermore, for older vehicles the costs appeared to be decreasing.
This could perhaps be put down to under-utilization (partial retirement) and also
neglect of vehicles reaching the end of their useful life. It was therefore necessary
to pool the data to obtain reasonable cost models. The fitted maintenance cost
models for the Cummins, Isuzu CJR and MAN were obtained by first fitting an
overall cost model to data on vehicles up to eight years old, and then scaling this
model to the costs of the individual vehicle-types in the manner described in
Christer (1988). The costs for the older sub-fleets, the Mitsubishi and Isuzu CSA,
were taken as constant. Penalty costs for breakdowns on the road were also
modelled—see Scarf and Hashem (1997) for a full discussion of this. It was known
that the Mitsubishi and Isuzu sub-fleets were in partial retirement and candidates
for immediate replacement, capital expenditure permitting. The usage of sub-fleets
was unknown, although with a daily requirement for 125 vehicles, it was reason-
able to suppose that the usage level for the Mitsubishi and Isuzu sub-fleets was
about half that of the other newer sub-fleets. This assumption led to the null
“optimal policy”—replace the Mitsubishi and Isuzu CSA sub-fleets as soon as
possible—which is uninteresting from a model validation point of view. Therefore
in order to illustrate the replacement model, we consider the following sub-problem
in detail: investigate replacement policy for the “fleet” comprising the
Cummins, Isuzu CJR and MAN, assuming a fixed fleet size (93 vehicles) and
uniform usage.

Table 12.2. Fleet composition by vehicle type showing purchase cost, R, maintenance cost
parameters and age distribution at time of replacement study
Vehicle type   R, M$000s   Maintenance cost model M(τ) = aτ^b   Age distribution: number of vehicles
                           a          b                          in each age group (2 year intervals)

IsuzuCSA       750         55.6       0                          9, 30
Mitsubishi     800         57.8       0                          28
Cummins        500         24.7       0.72                       18
IsuzuCJR       300         11.1       0.72                       8, 22
MAN            450         18.4       0.72                       45

For the three sub-fleet problem, optimal policy is presented for each of the six
replacement schedules in Table 12.3. Horizon lengths, h, of 120, 150 and 180
months (10, 12.5 and 15 years) are considered. For the fleet as a whole it is difficult
to determine optimal replacement policy as two sub-fleets are partially retired, and usage
levels are unknown. The problem is made more difficult because it is likely that
the maintenance of these sub-fleets is less thorough than that for the newer sub-
fleets. Under simple usage assumptions, the optimum policy is to replace the
Mitsubishi and Isuzu CSA sub-fleets immediately. For the particular sub-problem
relating to the Cummins, Isuzu CJR and MAN, it appears that the optimum
replacement schedule depends on the length of the horizon. The end-of-horizon
effect, as represented by the mean age of the fleet, also varies with the
replacement schedule. The choice of optimal policy is therefore not straightforward.
Over a fifteen year planning horizon, there is little to choose between the
three schedules Cummins-IsuzuCJR-MAN, Cummins-MAN-IsuzuCJR and MAN-
Cummins-IsuzuCJR, both in terms of cost and age. Sensitivity to model para-
meters is considered more fully in Scarf and Hashem (1997).

Table 12.3. Optimum policy for each schedule for various horizon lengths, h=120,150,180
months; penalty cost, M$2000, annual discount rate 0.97. Cost of equivalent rent (M$000s
per month for whole fleet), average age of fleet at end of horizon, and optimum cycle
lengths. Replacement schedules: CIM – Cummins-IsuzuCJR-MAN, etc.

h     Schedule   Cost/month   Age/years   L1    L2    L3    L4

120   CIM        745.8        7.0         6     114
      CMI        763.5        9.9         120
      ICM        816.7        8.4         120
      IMC        816.7        8.4         120
      MCI        782.1        5.1         24    6     90
      MIC        838.1        6.5         42    78
150   CIM        779.4        8.6         6     144
      CMI        767.3        5.6         6     54    90
      ICM        844.7        4.6         54    6     6     84
      IMC        848.1        4.4         60    6     6     78
      MCI        782.8        5.7         42    6     102
      MIC        850.6        7.7         60    90
180   CIM        787.9        6.0         18    72    6     84
      CMI        778.1        6.0         18    60    36    66
      ICM        841.0        5.2         72    6     6     96
      IMC        843.9        5.4         72    6     6     96
      MCI        794.6        6.7         54    6     120
      MIC        859.9        4.4         72    6     6     96

12.4 Capital Replacement for a Network System

Consider now a network viewed as a fleet of connected or dependent assets. For
such a network, replacements can be considered as capital projects identified using
engineering considerations. A simple way to proceed is then to consider a fixed
horizon of length h over which we will evaluate the consequences of such projects,
P. For a network, a potential project may be the replacement or refurbishment of a
component or group of components which comprise part of the network, or a
network expansion, or a network re-design. Define $f_t^1$ to be the operating cash
flow (or performance) relating to a project, P, in year t after the release of project
P. Define $f_t^0$ to be the baseline operating cash flow (or performance) relating to P
in year t. For network expansion projects $f_t^0 = 0$. Let C (>0) be the capital cost of
project P. Assume income cashflows are negative and expenditure cashflows are
positive, and that all cashflows are incurred at the year end and discounted at rate
ν. If project P is released in year x from now then the total cashflow over h years
from now will be

$$
\sum_{t=1}^{x-1} f_t^0 \nu^t + \nu^x \left( \sum_{t=0}^{h-x} f_t^1 \nu^t + C \right). \tag{12.8}
$$

If project P is not released then the cashflow over the horizon will be $\sum_{t=1}^{h} f_t^0 \nu^t$.

Define the “gain” from releasing project P in year x to be the difference between
these cashflows:

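$$
\begin{aligned}
g_P(x; h) &= \sum_{t=1}^{h} f_t^0 \nu^t - \left[ \sum_{t=1}^{x-1} f_t^0 \nu^t + \nu^x \left\{ \sum_{t=0}^{h-x} f_t^1 \nu^t + C \right\} \right] \\
&= \sum_{t=x}^{h} \left( f_t^0 - f_{t-x}^1 \right) \nu^t - \nu^x C.
\end{aligned}
$$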

Release of project P in year x (x<h) will be economic if $g_P(x; h) > 0$. We can
optimize project release by choosing that x for which the gain is a maximum and
positive. If $g_P(x; h) \le 0$ for all x = 1,..., h then release of project P will not be
recommended (over the horizon). The consequences of project release may also be
measured in terms of performance (Scarf and Martin, 2001).
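
As an illustration of the gain calculation, the following Python sketch evaluates the gain directly from the two cashflow expressions (project released in year x, and project not released) and then searches for the release year with the largest positive gain. The function names, the list-based layout of the cashflows and the numerical values are assumptions made for this example only, and ν is treated as a discount factor per year.

```python
def gain(f0, f1, C, nu, x, h):
    """Gain from releasing a project in year x (1 <= x <= h).

    f0[t] : baseline cash flow in year t, t = 1..h (f0[0] is unused)
    f1[t] : cash flow t years after release, t = 0..h-1
    C     : capital cost of the project
    nu    : discount factor per year
    """
    # Discounted cash flow over the horizon if the project is never released.
    no_release = sum(f0[t] * nu**t for t in range(1, h + 1))
    # Discounted cash flow if the project is released in year x (Equation 12.8).
    release = sum(f0[t] * nu**t for t in range(1, x)) + nu**x * (
        sum(f1[t] * nu**t for t in range(0, h - x + 1)) + C
    )
    return no_release - release


def best_release_year(f0, f1, C, nu, h):
    """Return (x, gain) for the year with the largest gain, or None if no gain is positive."""
    gains = {x: gain(f0, f1, C, nu, x, h) for x in range(1, h + 1)}
    x_best = max(gains, key=gains.get)
    return (x_best, gains[x_best]) if gains[x_best] > 0 else None


# Hypothetical data: rising baseline cost, flat cost after release.
h, nu, C = 10, 0.9, 300.0
f0 = [0.0] + [100.0 + 20.0 * t for t in range(1, h + 1)]
f1 = [40.0] * h
result = best_release_year(f0, f1, C, nu, h)
```

A project whose gain is never positive returns `None`, which corresponds to the case in which release is not recommended over the horizon.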
For a large system comprising many potential projects, the outcome of this
modelling approach will typically be a list of projects that should be released
immediately, and a list of those that should not. Of course, the release of projects
will, in both cases, be limited by the budget for capital expenditure. For prioritizing
project release, we can consider all projects over the horizon (0,h). Let $g_{ij}(h)$ be
the gain when project i is released in year j (1 ≤ j ≤ h), and suppose that, in year j,
n projects have positive gains. These projects might be listed in order of magnitude
of their gains. They might also be listed in order of magnitude of the “profitability”
index, $g_{ij}(h) / C_i$, where $C_i$ is the capital cost of project i. If rational decision
criteria are to be used to determine policy then projects at the top of the list should
be given priority, since they would be associated with the largest expected gain
over the planning horizon. Under capital rationing with a fixed budget, this project
priority list would indicate which projects can be released in the current year. For
consequences considered in cashflow terms, appropriate discount rates may be
chosen to reflect the investment risks of projects. A higher required rate of return
(smaller discount rate) might be imposed on network expansion projects than on
replacement of existing assets. Also, factors other than cost may impact on
decisions: safety-related projects or projects with high customer benefit may take priority.
A capital rationing model to prioritize project release over k years (k ≤ h) may
be formulated as a linear program (LP), similarly to that proposed in vehicle fleet
management by Karabakal et al. (1994). Suppose the capital investment budget for
year j is Bj (j = 1,...,h). Introduce the indicator variable xij which takes the value 1 if
project i is released in year j and 0 otherwise. Then seek those values of xij
(i = 1,..,n; j =1,...,k) which maximize the total gain over (0,h) of all projects released
subject to the constraints that the capital investment budget is not exceeded
in each year. That is, maximize

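$$
\sum_{i=1}^{n} \sum_{j=1}^{k} x_{ij}\, g_{ij}(h)
$$

subject to

$$
\sum_{i=1}^{n} x_{ij} C_i \le B_j \quad \text{for all } j = 1,\ldots,k; \tag{12.9}
$$

$$
\sum_{j=1}^{k} x_{ij} \le 1 \quad \text{for all } i = 1,\ldots,n; \tag{12.10}
$$

$$
x_{ij} \in \{0, 1\}.
$$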

Constraint set 12.9 ensures that the budget for year j is not exceeded.
Constraint set 12.10 ensures that project i is released at most once over
the planning horizon. Note that if an individual project has negative gain whatever
its execution time, then the contribution to the objective function from this project
will be greatest when this project is not released over (0, k). Typically such plan-
ning may be informative over the planning horizon, but only decisions relating to
the immediate future (one to two years) would be acted on. Therefore policy would
be continually updated, implying a “rolling horizon” approach. Where a network
consists of many identical components, the modelling of project planning may be
extended to the case in which a proportion of “similar” projects are released in a
given year. This could be done by formulating the capital rationing model (CRM)
as a mixed programming problem.
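
For a handful of projects, the capital rationing model can be checked by simple enumeration. The sketch below is a brute-force illustration rather than a linear or integer programming solver: it tries every assignment of each project to "not released" or to one release year, discards assignments that break a yearly budget, and keeps the one with the largest total gain. The function name `solve_crm`, the 0-based year indices and the data are assumptions made for this example.

```python
from itertools import product


def solve_crm(gains, costs, budgets):
    """Enumerate the 0-1 capital rationing model for a small problem.

    gains[i][j] : gain if project i is released in year index j (0-based)
    costs[i]    : capital cost of project i
    budgets[j]  : capital budget for year index j
    Returns (best_total_gain, schedule); schedule[i] is the release year
    index of project i, or None if project i is not released.
    """
    n, k = len(costs), len(budgets)
    options = [None] + list(range(k))  # each project: not released, or one year
    best_gain, best_schedule = 0.0, tuple([None] * n)
    for schedule in product(options, repeat=n):
        # One option per project means each project is released at most once.
        spend = [0.0] * k
        for i, j in enumerate(schedule):
            if j is not None:
                spend[j] += costs[i]
        if any(spend[j] > budgets[j] for j in range(k)):
            continue  # a yearly budget constraint is violated
        total = sum(gains[i][j] for i, j in enumerate(schedule) if j is not None)
        if total > best_gain:
            best_gain, best_schedule = total, schedule
    return best_gain, best_schedule


# Hypothetical data: three projects, two years, budget of 70 in each year.
gains = [[50, 40], [30, 25], [-5, -2]]
costs = [60, 50, 20]
budgets = [70, 70]
best_gain, schedule = solve_crm(gains, costs, budgets)
```

In this example the third project has negative gain in every year and is therefore left unreleased, in line with the remark above about projects with negative gain. The enumeration grows as (k+1)^n, so it is only a checking device for very small instances.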
Consider now dependence between projects. For example, a major expansion
project, while not replacing existing assets, may have significant operating cost or
performance implications for particular assets: the building of a large ring-main in
a water supply network is one such example. Essentially, if two projects $P_1$ and $P_2$
interact in this way, then new projects $P_1' = (P_1, \text{not } P_2)$, $P_2' = (\text{not } P_1, P_2)$, and
$P_{12}' = (P_1, P_2)$ would have to be introduced, along with the constraint to ensure that
at most one of $P_1'$, $P_2'$, and $P_{12}'$ is released over the planning horizon. While this
approach may lead to a significant increase in the number of “projects” in the
model, in principle the solution procedure would remain unchanged. The existence
of future-cost dependencies between projects would have to be identified by the
network owner. This may be extremely difficult in practice. However such depend-
ency would very much characterize the network replacement problem, and there-
fore the approach described is an advance over current methods. A similar ap-
proach has been taken by Santhanam and Kyparisis (1996) in modelling depend-
ency in the project release of information systems. Capital costs may be considered
simply using the concept of shared set-up. It is possible that it may be optimal to
release both P1 and P2 during the planning horizon, but not simultaneously. This
presents a more difficult modelling task, without introducing many pseudo-pro-
jects, that is. For example, we could consider: release P1 at time s and P2 at time
t; however for k=10, say, this would mean the introduction of 25 variables,
$x_{(P_1 P_2)(s,t)}$, for the $P_1$, $P_2$ decision alone!

For reasons of budgeting constraints or technical delays, the release of some
recommended project at some optimal execution moment x* may not be possible.
In such cases, it would be informative to have an indication of the extra cost to be
incurred in revenue expenditure because of lack of capital expenditure; this is the
marginal increased revenue expenditure due to delayed release. Given the capital
rationing model, and focusing on cashflows, the operating cost consequences of
capital rationing can be determined by calculating the delay associated with each
project as a result of capital rationing. The revenue cost implications due to this
delay would, from Equation 12.8, be

$$
\delta_P(x^*, x'; h) = \sum_{t=x^*}^{x'-1} f_t^0 \nu^t + \sum_{t=0}^{h-x'} f_t^1 \nu^{t+x'} - \sum_{t=0}^{h-x^*} f_t^1 \nu^{t+x^*}
$$

where $x'$ is the execution time for the project under capital rationing. The marginal
increase in revenue expenditure would be found by summing over all projects. In a
similar manner, the marginal increase in revenue expenditure due to projects delayed
in year j could be found by summing over all projects with $x^* = j$, and this measure
indicates how much more capital investment would be required to reduce revenue
expenditure to the optimum level.
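
The delay cost can be evaluated in the same way as the gain. The following Python sketch computes the extra discounted operating cost incurred when a project is released in a later year than its optimal year; the function name, argument layout and numbers are assumptions for illustration, and capital cost is deliberately excluded because the expression concerns revenue expenditure only.

```python
def delay_cost(f0, f1, nu, x_opt, x_delayed, h):
    """Extra discounted operating cost from releasing at x_delayed instead of x_opt.

    f0[t] : baseline cash flow in year t, t = 1..h (f0[0] is unused)
    f1[t] : cash flow t years after release, t = 0..h-1
    nu    : discount factor per year
    """
    # Baseline operating cost paid during the delay, years x_opt .. x_delayed - 1.
    baseline = sum(f0[t] * nu**t for t in range(x_opt, x_delayed))
    # Post-release cash flows when the project is released late.
    late = sum(f1[t] * nu ** (t + x_delayed) for t in range(0, h - x_delayed + 1))
    # Post-release cash flows when the project is released at the optimal time.
    on_time = sum(f1[t] * nu ** (t + x_opt) for t in range(0, h - x_opt + 1))
    return baseline + late - on_time


# Hypothetical data: flat baseline cost of 100, flat post-release cost of 40.
h, nu = 10, 0.9
f0 = [0.0] + [100.0] * h
f1 = [40.0] * h
extra = delay_cost(f0, f1, nu, x_opt=2, x_delayed=4, h=h)
```

With flat cashflows the result reduces to the saving forgone in the two delayed years, 60(ν² + ν³), which gives a quick sanity check on the indexing.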
Uncertainty in the cashflow/performance model parameter estimates, reflecting
the extent of currently available information about particular components and
potential projects, and the extent of technological developments (new materials and
techniques), may be propagated through into uncertainty in the gain function, g (.) .
This would be most easily done using the delta method; see Baker and Scarf (1995)
for an example of this in maintenance. The variance of the gain, as well as the
expected gain, may then be used to produce the project priority list and those
projects for which the expected gain is high and the uncertainty in the gain
(variance of the gain) is low are candidates for release; these projects would be
viewed as sound investments. Markowitz (1952) is the classic reference here; for a
more recent discussion see Booth and King (1998). Also, a real options approach
might be taken (e.g. Bowe and Lee 2004). Where there are no data regarding a
potential project, there will be no objective basis for determining if and where the
project lies on the project priority list. One possible approach to this problem
would be to use data relating to other projects that are similar in design. Also
subjective data may be collected, and used to update component data for the whole
network in the manner described in O’Hagan (1994) and Goldstein and O’Hagan
(1996) in the context of sewer networks. These methods are particularly useful for
multi-component systems in which there are only limited data for a limited number
of individual components. On the other hand, it may be that the income cashflow
may be deterministic in some situations. For example, expansion of the network
may be initiated by legislation, and the compensation for the investment costs is
fixed and predetermined per customer connection.

12.5 Dynamic Programming Models

Dynamic programming (DP) is a versatile technique for modelling and solving opti-
mization problems which are sequential in nature. Thus, it is ideally suited to solve
capital equipment replacement problems that consider whether plant should be kept
or replaced after each period.
The use of dynamic programming in equipment replacement analysis is
significant as the methodology allows for keep/replace decisions to be evaluated
after every period. This relaxes the assumption of economic life models that as-
sume an asset and its replacements are retained for the same length of time over the
horizon. Furthermore, dynamic programming allows for general modelling of costs
and technological change.
A dynamic program evaluates the transition from an initial state to possible
eventual states, determining the optimal path of decisions over time. In our appli-
cation, this entails evaluating an asset and the periodic decisions to keep or replace
that asset over some horizon. This requires the definition of a state for the asset. As
there are many possibilities for the definition of a state space, a number of methods
have been developed. We highlight these different models in the following sections.

12.5.1 Age Based Model

Bellman (1955) introduced the first dynamic programming model to analyze the
equipment replacement problem. In this model, the state of the system is defined as
the age of the asset and the decision to be evaluated at each stage is whether to
keep or replace the asset. Thus, a solution consists of keep and replace decisions in
each period of the horizon.
The dynamic program can be described by the network in Figure 12.3. Each
node in the network represents the age of the asset, which is the state of the system,
at the end of the period. The states are labelled according to the age of the asset
along the y-axis, increasing from 1 to N, the maximum allowable age of the asset
(N = 5 in the figure), at the end of the time period which is labelled on the x-axis
from 0 to T, the horizon time (T = 4 in the figure). The arcs connecting the nodes
represent keep and replace decisions. An arc representing a “keep” decision (K)
connects a state (age) of n to n+1 in consecutive stages (periods), as the asset ages
one period. A “replace” decision (R) connects a state of n to a state of 1, as the n-
period old asset is salvaged and a new asset is purchased and used for one period.
The initial decision is made at time zero, with n = 4 in the figure, and the asset is
salvaged at the end of the horizon.
Define ft(n) as the minimum net present value cost of making optimal keep and
replace decisions for an asset of age n at time period t through time period T.
Mathematically, we evaluate ft(n) with the following recursion:

$$
f_t(n) = \min \begin{cases} \text{K}: & \alpha \left( C_{t+1}(n+1) + f_{t+1}(n+1) \right) \\ \text{R}: & P_t - S_t(n) + \alpha \left( C_{t+1}(1) + f_{t+1}(1) \right) \end{cases}, \quad n \le N,\ t \le T-1 \tag{12.11}
$$

Figure 12.3. Dynamic programming network for an age-based model

If the n-period old asset is kept (K), the operating and maintenance (O&M) cost
Ct+1(n+1) is incurred for the asset in the following period. As the asset is age n+1
at the end of the period, ft+1(n+1) defines the costs going forward. (This is why ft
is often referred to as the “cost to go function” in dynamic programming.) If the
asset is replaced (R), then a salvage value St(n) is received and a purchase price Pt
is paid for a new asset. The new asset is utilized for the period as the state
transitions to an age of 1, defined by costs ft+1(1) going forward. If the asset
reaches the maximum age of N, then only the replace decision is feasible.
When the horizon time T is reached, the asset is salvaged such that

fT (n) = − ST (n) (12.12)

Traditionally, the recursion is solved backwards, such that Equation 12.12 is
evaluated for each feasible age n. These values are substituted into Equation 12.11
when determining fT–1(n) for each feasible n. This process continues until the value
and decision at stage zero (t = 0) are computed, signaling the initial decision in the
optimal sequence of decisions over time.
Note that O&M costs are paid at the end of the period and thus are discounted
by the periodic factor α, along with the ensuing state cost ft+1(n). Salvage values
and purchase costs are assumed to occur at the beginning of the period. As the
recursion works from the horizon time T to time zero, the net present value cost is

Example 12.3
Assume a four-period old asset is owned at time zero, its maximum age is 5, and an
asset is required to be in service in each of the next four periods (such that the
decisions are represented in Figure 12.3). The purchase price is $50,000 with first
year O&M costs $10,000, increasing 20% per period of use. The salvage value is
expected to decline 30% (from the purchase price) after the first year of use and an
additional 10% each year thereafter. For simplicity, we assume no technological
change and the interest rate is 12% per period.

Table 12.4. Dynamic programming state values ft(n) for the example problem
t\n 1 2 3 4 5
0 -- -- -- $47,996 --
1 $16,332 -- -- -- $35,602
2 –$407 $6,292 -- -- --
3 –$17,411 –$12,455 –$7,353 -- --
4 –$35,000 –$31,500 –$28,350 –$25,515 --

Table 12.4 shows the results of solving the dynamic programming algorithm.
The values in the final row (f4 (n)) are the negative salvage values received for a
given asset of age n at that time. To illustrate a calculation at t = 3, consider n = 1.
Substituting into Equation 12.11:

$$
f_3(1) = \min \begin{cases} \text{K}: & 0.893\,(\$12{,}000 - \$31{,}500) \\ \text{R}: & \$50{,}000 - \$35{,}000 + 0.893\,(\$10{,}000 - \$35{,}000) \end{cases} = -\$17{,}411
$$

The recursion continues in this fashion until f0 (4) is evaluated, with the decision
to replace the asset immediately with a new asset. This new asset is retained through
the horizon. The net present value cost of this sequence of decisions is $47,996.
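
The backward recursion of Equations 12.11 and 12.12 is short enough to check by computer. The sketch below is one possible Python rendering for the data of Example 12.3, assuming stationary costs (no technological change) and reading the salvage schedule as a 30% drop in the first year followed by a 10% drop in the remaining value each year, which matches the final row of Table 12.4. The function and constant names are choices made for this illustration.

```python
from functools import lru_cache

P = 50_000.0       # purchase price
N, T = 5, 4        # maximum allowable age, horizon
ALPHA = 1 / 1.12   # one-period discount factor at 12% interest


def om_cost(n):
    """O&M cost in the n-th period of use: $10,000, growing 20% per period."""
    return 10_000.0 * 1.2 ** (n - 1)


def salvage(n):
    """Salvage value of an asset of age n."""
    return 0.7 * P * 0.9 ** (n - 1)


@lru_cache(maxsize=None)
def f(t, n):
    """Return (minimum NPV cost, decision) for an asset of age n at time t."""
    if t == T:
        return -salvage(n), "salvage"          # Equation 12.12
    replace = P - salvage(n) + ALPHA * (om_cost(1) + f(t + 1, 1)[0])
    if n >= N:
        return replace, "R"                    # only replacement is feasible
    keep = ALPHA * (om_cost(n + 1) + f(t + 1, n + 1)[0])
    return (keep, "K") if keep <= replace else (replace, "R")


cost, decision = f(0, 4)
print(round(cost), decision)   # prints: 47996 R
```

Memoizing `f` with `lru_cache` means each state (t, n) is evaluated once, which is the same work as filling Table 12.4 backwards from t = 4.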
The benefit of using this model, in addition to allowing for replacements after
each period, is that periodic costs are explicitly modeled on each arc in the network.
This allows for detailed cost modelling of technological change, as in Regnier et al.
(2004) or those costs associated with after-tax analysis, as in Hartman and Hartman
A similar line of models has also been developed such that the condition of
the asset, not its age, is tracked (e.g. Derman 1963). As opposed to moving from
state to state by increasing the age of the asset, there is some probability that the
asset will degrade to a lower condition during a period. The work assuming sto-
chastic deterioration has been extended to include technological change (Hopp and
Nair 1994) or consider probabilistic utilization (Hartman 2001).

12.5.2 Period Based Model

Wagner (1975) offered an alternative dynamic programming formulation for the
equipment replacement problem in which the state of the system is the time period
and the decision at each period is the length of time to retain an asset. This model
is described in the network in Figure 12.4. The nodes represent the state of the
system (time period) and the arcs connecting two nodes represent the decision to
keep an asset in service between those time periods.

Figure 12.4. Dynamic programming network for a period-based model

The objective is to find the sequence of service lives that minimizes costs from
time 0 through time T. (As previously, T = 4 in the figure.) Assuming costs along
an arc connecting node t to node t+n are defined as net present value costs at time
t, the optimal sequence of decisions can be determined by solving the following
recursion:

$$
f(t) = \min_{n \le N,\; t+n \le T} \left\{ c_{tn} + \alpha^n f(t+n) \right\}, \quad t = 0, 1, \ldots, T-1 \tag{12.13}
$$

where ctn represents the cost of retaining the asset for n periods from period t.
Using our previous notation, ctn is defined as

$$
c_{tn} = P_t + \sum_{j=1}^{n} \alpha^j C_{t+j}(j) - \alpha^n S_{t+n}(n) \tag{12.14}
$$

This model can be solved similarly to the age-based model, assuming that
f(T) = 0 is substituted into Equation 12.13. Note that the network in Figure 12.4
assumes that a new asset is purchased at time 0. To include the option to keep or
replace an asset owned at time zero, another set of arcs must be drawn, emanating
from node 0, representing the length of time to retain the owned asset with its
associated costs. As these arcs parallel those illustrated in Figure 12.4, the higher
cost parallel arcs can be deleted, as they will not reside on the optimal path. This
can be completed in a pre-processing step, with the recursion ensuing as defined.

Example 12.4
Utilizing the same data from Example 12.3, the network in Figure 12.4 represents
the options associated with purchasing a new asset in each period. We would add
an arc from node 0 to node 1 to represent the decision to retain the four-period old
asset for one additional period (to its maximum feasible age of 5).
Table 12.5 provides the net present value costs (at time t) on the arcs from node
t to node t+n. The arc from node 0 to 1 represents the cost of retaining the four-
period old asset for one period, as this is cheaper than salvaging the used asset and
purchasing a new asset for one period of use. The values of c02, c03, and c04 include
the revenue received for salvaging the four-period old asset at time zero. With the
values in Table 12.5, the dynamic programming recursion in Equation 12.13 can be
solved.

Table 12.5. Arc costs for Figure 12.4 using the example data
t \ t+n    1          2          3          4
0          -$1,989    $17,868    $33,051    $47,996
1                     $27,679    $43,383    $58,566
2                                $27,679    $43,383
3                                           $27,679

To illustrate a calculation, note that f(4) = 0, f(3) = $27,679, and consider t = 2.
Substituting into Equation 12.13:

$$
f(2) = \min \begin{Bmatrix} \$43{,}383 + \$0 \\ \$27{,}679 + 0.893\,(\$27{,}679) \end{Bmatrix} = \$43{,}383,
$$

defining that it is cheaper to keep the asset for two periods (from the end of period
2 to the end of the horizon) rather than replacing it after one period of use. Con-
tinuing in this manner, it is found that f(0) = $47,996, signaling that the four-period
old asset should be sold and the new asset should be retained through the horizon.
This is the same solution found with Bellman’s model.
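
The arc costs of Table 12.5 and the recursion of Equation 12.13 can be reproduced with a few lines of Python. The sketch below assumes the same stationary cost data as Example 12.3 and handles the asset owned at time zero as described above: arcs leaving node 0 either sell the owned asset and buy a new one, or (where the maximum age allows) keep the owned asset, and only the cheaper parallel arc is retained. All function and variable names are choices made for this illustration.

```python
P, N, T = 50_000.0, 5, 4   # purchase price, maximum age, horizon
ALPHA = 1 / 1.12           # one-period discount factor at 12% interest
OWNED_AGE = 4              # age of the asset owned at time zero


def om_cost(n):
    """O&M cost in the n-th period of use: $10,000, growing 20% per period."""
    return 10_000.0 * 1.2 ** (n - 1)


def salvage(n):
    """Salvage value of an asset of age n."""
    return 0.7 * P * 0.9 ** (n - 1)


def new_asset_arc(n):
    """Equation 12.14 for a new asset kept n periods (stationary costs)."""
    om = sum(ALPHA ** j * om_cost(j) for j in range(1, n + 1))
    return P + om - ALPHA ** n * salvage(n)


def owned_asset_arc(n):
    """Cost of keeping the asset owned at time zero for n more periods."""
    om = sum(ALPHA ** j * om_cost(OWNED_AGE + j) for j in range(1, n + 1))
    return om - ALPHA ** n * salvage(OWNED_AGE + n)


def arc_cost(t, n):
    """Cheapest arc from node t to node t + n."""
    cost = new_asset_arc(n)
    if t == 0:
        cost -= salvage(OWNED_AGE)               # sell the owned asset first
        if OWNED_AGE + n <= N:                   # parallel arc: keep owned asset
            cost = min(cost, owned_asset_arc(n))
    return cost


f = {T: 0.0}
best_n = {}
for t in range(T - 1, -1, -1):                   # backward recursion (12.13)
    choices = {n: arc_cost(t, n) + ALPHA ** n * f[t + n]
               for n in range(1, min(N, T - t) + 1)}
    best_n[t] = min(choices, key=choices.get)
    f[t] = choices[best_n[t]]

print(round(f[0]), best_n[0])   # prints: 47996 4
```

The value `best_n[0] = 4` corresponds to the arc from node 0 to node 4, that is, selling the owned asset and retaining the new one through the horizon.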
While this model can be shown to be more computationally efficient than the
age-based model, it is the ease with which multiple challengers (as parallel arcs) or
technological change is modelled that has led to numerous extensions in the
literature. See Oakford et al. (1984), Bean et al. (1985, 1994), and Hartman and
Rogers (2006).

12.5.3 Cumulative-usage Based Model

Recently, Hartman and Murphy (2006) offered a third dynamic programming
formulation for the equipment replacement problem following the form of the
classical knapsack model. The model determines the number of times an asset is
used for a given length of time over some horizon.
The dynamic program is described by the network in Figure 12.5. The y-axis
defines the periods, 1 through T, while the x-axis identifies the stage in which
retaining an asset for a given length of time, 1 through N, is evaluated. In the
figure, the order is ages 4, 3, 2, and then 1. Thus, in the first stage of the dynamic
program, the number of times to retain an asset for four consecutive periods is
analyzed. (For this small example with T = 4, the asset can only be retained for
four periods once.) In the second stage, it is evaluated whether an asset should be
retained for three periods. In the third stage, it is evaluated whether an asset should
be retained for two periods either once (for two periods of total service) or twice
(for four periods of total service).

Figure 12.5. Dynamic programming network for a cumulative usage-based model (x-axis stages: 0, 4, 3, 2, 1)

A node in the network represents the cumulative service that has been accrued
through a given stage. For example, after the first stage in Figure 12.5, either 0 or 4
periods of service have been reached. As the horizon is 4, a solution must ultimately
result in 4 periods of service. As with the other dynamic programming models, the
goal is to find the minimum cost path from the initial node, representing no service at
time zero, to the final node, representing an entire horizon’s worth of service after the
final stage.
To determine an optimal solution it is assumed that the costs are stationary and
the stages (lengths of service) are ordered according to increasing annualized costs.
Thus, before the recursion can be solved, the annualized costs of keeping an asset
for each possible service life must be computed so that the stages can be ordered.

Example 12.5
We revisit the previous examples again. From the given costs, the annual equiva-
lent costs are computed as given in Table 12.6. For example, to retain the asset for
two years costs $25,670 per year, equivalently, assuming a 12 percent interest rate.
The net present value (NPV) costs are also given. We restrict the set of decisions to
those of a new asset – namely how many to purchase and how long to retain them
over the finite horizon.

Table 12.6. Annual equivalent costs of keeping the asset for up to five years

Age   O&M cost   Salvage value   Annual equivalent cost   NPV cost
0                $50,000
1     $10,000    $35,000         $31,000                  $27,679
2     $12,000    $31,500         $25,670                  $43,383
3     $14,400    $28,350         $24,384                  $58,566
4     $17,280    $25,515         $24,202                  $73,511
5     $20,736    $22,964         $24,540                  $88,462

Given the information in Table 12.6, the stages are ordered according to ages 4,
3, 5, 2, and 1, as the annual equivalent costs increase accordingly. As an asset is
only required for four periods, the age 5 cost can be ignored.
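
The annual equivalent and NPV costs in Table 12.6 can be checked numerically. The sketch below rests on assumptions that are not stated in this section: the table columns are read as O&M cost, salvage value, annual equivalent cost and NPV cost; the $50,000 entry at age 0 is treated as the purchase price; and all cash flows are taken to occur at the end of each period. The function names are illustrative.

```python
PURCHASE_PRICE = 50_000                              # age-0 entry of Table 12.6 (assumed)
OM_COST = [10_000, 12_000, 14_400, 17_280, 20_736]   # O&M cost in years 1..5
SALVAGE = [35_000, 31_500, 28_350, 25_515, 22_964]   # salvage value at ages 1..5
RATE = 0.12                                          # interest rate per period


def npv_cost(n, rate=RATE):
    """NPV cost of buying a new asset and keeping it for n periods."""
    om = sum(OM_COST[k - 1] / (1 + rate) ** k for k in range(1, n + 1))
    return PURCHASE_PRICE + om - SALVAGE[n - 1] / (1 + rate) ** n


def annual_equivalent_cost(n, rate=RATE):
    """Spread the NPV cost over n periods with the capital recovery factor."""
    growth = (1 + rate) ** n
    capital_recovery = rate * growth / (growth - 1)
    return npv_cost(n, rate) * capital_recovery


aec = {n: annual_equivalent_cost(n) for n in range(1, 6)}
stage_order = sorted(aec, key=aec.get)   # service lives by increasing annual cost

for n in range(1, 6):
    print(n, round(aec[n]), round(npv_cost(n)))
print("stage order:", stage_order)
```

Sorting the service lives by annual equivalent cost gives the stage ordering used by the recursion.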

According to Figure 12.5, an asset can be retained a maximum of one time for
four years, at a cost of $73,511. Thus, the states in the first stage and their values are
f1(0) = 0,
f1(4) = $73,511.

Similar reasoning defines f2(0)=0, f2(3)=$58,566, and f2(4)=$73,511. For the third
stage, the decisions are more interesting because an asset can be retained for two
years twice in the sequence. Thus

\[
\begin{aligned}
f_3(0) &= 0, \\
f_3(2) &= \$43{,}383, \\
f_3(3) &= \$58{,}566, \\
f_3(4) &= \min \left\{ \$73{,}511,\ \$43{,}383 + (0.893)^2\, \$43{,}383 \right\} = \$73{,}511.
\end{aligned}
\]

The final stage evaluates using assets for a single period with previous
combinations (three-period and two-period aged assets). It can be shown that the
optimal decision is to retain the asset for all four periods at a net present value cost of
$73,511. Note that this is the same decision found with the two previous
formulations, as $73,511 less the salvage value of the four-period old asset ($25,515) is
$47,996.
This recursion was not developed in order to provide another computational
approach to the equipment replacement problem. Rather, it was developed to illus-
trate the relationship between the infinite and finite horizon solutions under station-
ary costs. Specifically, as the optimal solution to the infinite horizon problem is to
repeatedly replace an asset at its economic life (age which minimizes equivalent
annualized costs), the question being investigated was whether the solution (re-
placing at the economic life) translates to the finite horizon case.
It was shown that using the infinite horizon solution provides a good answer
when O&M costs increase over the life of an asset more drastically than salvage
values decline. In the case when the salvage value declines are more drastic than
the O&M cost increases, it is generally better to retain the final asset in the se-
quence for a period longer than the economic life of the asset. For the cases when
O&M cost increases and salvage value declines are similar, then it is beneficial to
solve a dynamic programming recursion to find the optimal policy.
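
The stage-by-stage calculations of Example 12.5 can be sketched as a knapsack-style recursion. This is an illustrative sketch, not the formulation of Hartman and Murphy (2006) itself: it assumes stationary costs, takes the NPV costs from Table 12.6, and discounts an asset that starts after s periods of accumulated service by α^s. The function name cumulative_usage_dp is hypothetical.

```python
ALPHA = 1 / 1.12                                    # discount factor, about 0.893
NPV = {1: 27679, 2: 43383, 3: 58566, 4: 73511}      # NPV cost of an n-period service life
HORIZON = 4
STAGE_LENGTHS = [4, 3, 2, 1]                        # ordered by increasing annual equivalent cost


def cumulative_usage_dp(npv, stage_lengths, horizon, alpha):
    """Return, per stage, the minimum cost of reaching each cumulative service level."""
    inf = float("inf")
    value = [0.0] + [inf] * horizon     # before any stage only zero service is reachable
    history = []
    for length in stage_lengths:
        new_value = value[:]            # option: do not use this service life at all
        for served in range(horizon + 1):
            if value[served] == inf:
                continue
            cost, reached = value[served], served
            # Use assets of this service life once, twice, ... while they fit.
            while reached + length <= horizon:
                cost += alpha ** reached * npv[length]
                reached += length
                new_value[reached] = min(new_value[reached], cost)
        value = new_value
        history.append(value[:])
    return history


stages = cumulative_usage_dp(NPV, STAGE_LENGTHS, HORIZON, ALPHA)
for length, values in zip(STAGE_LENGTHS, stages):
    print(length, [None if v == float("inf") else round(v) for v in values])
```

The entry for four periods of service after the last stage is the cost of the best sequence of new assets over the horizon; subtracting the salvage value of the existing asset links it to the earlier formulations.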

12.5.4 Infinite Horizon Considerations

The solution of a dynamic programming algorithm assumes that the horizon is
finite. In the case of an infinite horizon in which an asset is expected to remain in
service indefinitely, it may be possible to identify an optimal time zero decision.
Bean et al. (1985) show that if the time zero decision for an equipment replacement
problem does not change for N consecutive horizons, where N is the maximum
age of an asset, then the decision is optimal for any length of horizon, including
an infinite horizon. Unfortunately, this does not guarantee the existence of an
optimal time zero decision.
For the age or period based dynamic programming recursions, the models must
be solved over T, T+1, T+2, …, T+N horizons. If the time zero decision does not
change for these problems, then the optimal time-zero decision is found. If this is
not the case, the progression must continue until N consecutive time zero decisions
are identified. This may be more easily facilitated using a forward recursion. In the
period-based model, this requires defining f(t) as a function of f(t–1), f(t–2), etc.,
with f(0) = 0. We illustrate by revisiting Example 12.4.

Example 12.6
We illustrate the first few stages of the forward recursion, as its implementation is
better suited for infinite horizon analysis. As noted earlier, the recursion is
initialized with f(0) = 0. Stepping forward in time, it is assumed that T = 1. Using
the values from Table 12.2, it is clear that the only feasible decision is to retain the
four-period old asset for one period such that f(1) = -$1,989. For the second stage,
there are two feasible decisions to evaluate, such that

\[
f(2) = \min \left\{ \begin{array}{l} 0.893\,(\$27{,}679) - \$1{,}989 \\ \$17{,}868 + \$0 \end{array} \right\} = \$17{,}868.
\]

The first decision evaluates using the new asset for one period, assuming (from
stage 1) that the four-period old asset is retained for one period. The second
decision assumes the four-period old asset is retired immediately and a new asset is
used for two periods. This process moves forward in time, increasing the value of T
in each step. The process stops when, in this case, five consecutive solutions (with
increasing T) result in the same time zero decision.
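
The forward recursion and the stopping rule can be sketched as follows. Several elements are assumptions made only for illustration: the maximum age is taken as N = 5, so the four-period old asset can be kept for at most one more period; the cost of replacing it at time zero and keeping the new asset for n periods is taken as the NPV cost from Table 12.6 less the $25,515 salvage value, which reproduces the first row of Table 12.5 and extends it to n = 5; and costs are stationary. The function names are hypothetical.

```python
ALPHA = 1 / 1.12
N = 5                                   # assumed maximum asset age
NPV_NEW = {1: 27679, 2: 43383, 3: 58566, 4: 73511, 5: 88462}
DEFENDER_SALVAGE = 25515                # salvage value of the four-period old asset
KEEP_ONE_PERIOD = -1989                 # arc cost c01 from Table 12.5


def arc_cost(start, end):
    """Return (cost at time `start`, time zero decision label or None)."""
    n = end - start
    if start == 0:
        if n == 1:
            return KEEP_ONE_PERIOD, "keep"
        return NPV_NEW[n] - DEFENDER_SALVAGE, "replace"
    return NPV_NEW[n], None


def solve_forward(max_horizon):
    """f[t] is the minimum cost for horizon t; first[t] is its time zero decision."""
    f, first = [0.0], [None]
    for t in range(1, max_horizon + 1):
        best_cost, best_first = float("inf"), None
        for s in range(max(0, t - N), t):
            cost, label = arc_cost(s, t)
            total = f[s] + ALPHA ** s * cost
            if total < best_cost:
                best_cost = total
                best_first = label if s == 0 else first[s]
        f.append(best_cost)
        first.append(best_first)
    return f, first


def stable_time_zero_decision(n_consecutive=N, max_horizon=12):
    """Stop at the first horizon whose last n_consecutive decisions agree."""
    f, first = solve_forward(max_horizon)
    for horizon in range(n_consecutive, max_horizon + 1):
        window = first[horizon - n_consecutive + 1: horizon + 1]
        if len(set(window)) == 1:
            return window[0], horizon
    return None, None


f, first = solve_forward(12)
decision, horizon = stable_time_zero_decision()
print(decision, horizon)
```

Because each f(t) depends only on earlier values, extending the horizon by one period adds a single stage rather than requiring the whole problem to be solved again.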

12.5.5 Modeling Complex Systems

The presented dynamic programming algorithms are designed for single asset
systems. More complex systems are defined by multiple assets which are
not independent; otherwise the presented models would be sufficient. The most
straightforward case is where all assets operate in parallel, such as a fleet.
Jones et al. (1991) offered the first dynamic programming recursion for the
parallel machine replacement model, which can be used to analyze fleet replace-
ment decisions. Machines are assumed to operate in parallel and thus the capacity
of the system is equal to the sum of the individual asset capacities. In addition to
defining the capacity of the system, the assets are often linked economically. Jones
et al. focused on the assumption that a fixed cost would be charged in any period in
which a replacement occurs (in addition to the typical per unit charges for each
asset replaced). This provides an incentive to replace multiple assets together over
time so as to reduce the number of times the fixed charge is incurred over some horizon.
To model replacement decisions for this system, the state of the system is
defined as the number of assets aged 1 through N, represented as a vector, [m1, m2,
…, mN]. In general, this would seem to be an intractable model, as the number of
feasible combinations of replacements to evaluate in each period is exponential,
since one could replace any combination of the m1+m2+…+mN assets. However, Jones
et al. presented two key theorems that drastically reduce the computational burden.
First, it was shown that clusters, or assets of the same age in the same time
period, do not split. That is, a group of same aged assets are kept or replaced in
their entirety at the end of each period. Second, under some mild cost assumptions,
they showed that older clusters of assets are replaced before younger clusters. With
these two theorems, the number of possible replacements in a given period is
drastically reduced. In each period, either the oldest cluster is replaced, or the two
oldest clusters are replaced, etc., for a given state of the system.
Consider the network in Figure 12.6. At time zero, a system is defined by six
assets; three of age one, two of age two, and one of age three, and the maximum
feasible age of an asset is three. Replacement decisions, assuming the no-splitting
rule and oldest cluster replacement rule, are illustrated for three periods. Note that
the maximum number of decisions for a given state is N = 3.
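
The effect of the two rules on the decision space can be sketched in code. The function below is an illustration, not taken from Jones et al. (1991): a state is a tuple (m1, ..., mN), and decision n means that all assets of age n or older are replaced, so assets at the maximum age are always replaced.

```python
def successors(state):
    """Next states reachable under the no-splitting and oldest-first rules."""
    n_max = len(state)
    result = set()
    for n in range(1, n_max + 1):            # replace every asset of age n or older
        replaced = sum(state[n - 1:])        # these return as new (age 1) assets
        kept = tuple(state[:n - 1])          # these age by one period
        result.add((replaced,) + kept + (0,) * (n_max - n))
    return result


def reachable_layers(initial_state, periods):
    """States reachable after 0, 1, ..., `periods` replacement decisions."""
    layers = [{tuple(initial_state)}]
    for _ in range(periods):
        layers.append({nxt for state in layers[-1] for nxt in successors(state)})
    return layers


layers = reachable_layers((3, 2, 1), 3)
for t, layer in enumerate(layers):
    print(t, sorted(layer, reverse=True))
```

Each state has at most N successors, compared with up to 2^6 = 64 replacement subsets for six assets if clusters could be split arbitrarily.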
Determining the optimal sequence of replacement decisions for the system is
similar to our previous dynamic programming recursions. A value is assigned to
f3(S), where S refers to the state vector. This would merely be the sum of the
salvage values for each asset owned at time T = 3. Then, the value of each state S at
time 2 would be determined by summing the costs of the decision and discounting
the value of the resulting state. For example, moving from state [3,0,3] to state
[3,3,0] would entail selling the three three-period old assets and purchasing three
new assets. The new assets would be utilized for one period (incurring O&M costs)
while the three one-period old assets (at the end of time period two) would be
utilized for a second year, also incurring O&M costs. These costs (discounted
accordingly) would be added to the discounted value of f3(3,3,0). This value would
be compared to the decision of replacing all six assets (leading to state [6,0,0]) to
determine the value of state f2(3,0,3).

Figure 12.6. Dynamic programming network for parallel-machine replacement problem (nodes are state vectors [m1, m2, m3] at times 0 through T = 3, starting from state [3,2,1])


Define n as the minimum age of the assets to be replaced for a given state at
time t. That is, all assets of age n and older are replaced while the
remaining assets are retained. We can model the recursion in general as follows:

\[
\begin{aligned}
f_t(m_1, m_2, \ldots, m_N) = \min_n \Bigg\{ & K_t \cdot \mathbf{1}_{n>1} + \Big( \sum_{j=n}^{N} m_j \Big) P_t - \sum_{j=n}^{N} m_j S_t(j) \\
& + \alpha \Big( \Big( \sum_{j=n}^{N} m_j \Big) C_{t+1}(1) + \sum_{j=1}^{n-1} m_j C_{t+1}(j+1) \\
& \quad + f_{t+1}(m_n + m_{n+1} + \cdots + m_N,\ m_1, m_2, \ldots, m_{n-1}, 0, 0, \ldots, 0) \Big) \Bigg\}
\end{aligned}
\tag{12.15}
\]

Examining the recursion, a purchase price is paid and a salvage value is received
for all assets that are replaced. All of the newly purchased assets (the total number
of assets is the sum of mn+mn+1+…+ mN) incur the O&M cost of a new asset while
the O&M costs of the retained assets are incurred according to their age. A fixed
charge Kt is paid if at least one group of assets is replaced (n>1), captured by the
indicator function. The resulting state is a group of new assets (age 1) with all other
assets incrementing one period in age.
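
A memoized version of this recursion can be sketched as below. It is a simplified illustration under stated assumptions: costs are stationary (K, P, S(j) and C(j) do not depend on t); the fixed charge is applied whenever at least one asset is actually replaced, which is one reading of the indicator described in the text; and the terminal value is the negative of the total salvage value, so that salvage revenue reduces cost. All parameter values in the usage lines are hypothetical.

```python
from functools import lru_cache


def parallel_replacement_cost(initial_state, horizon, price, fixed_charge,
                              salvage, om_cost, alpha):
    """Minimum discounted cost; salvage and om_cost are dicts keyed by age 1..N."""
    n_max = len(initial_state)

    @lru_cache(maxsize=None)
    def f(t, state):
        if t == horizon:
            # Assumed terminal condition: all remaining assets are sold.
            return -sum(m * salvage[age] for age, m in enumerate(state, start=1))
        best = float("inf")
        for n in range(1, n_max + 1):        # replace all assets of age n or older
            replaced = sum(state[n - 1:])
            kept = state[:n - 1]
            salvage_revenue = sum(
                m * salvage[age] for age, m in enumerate(state[n - 1:], start=n)
            )
            cost_now = replaced * price - salvage_revenue
            if replaced > 0:                 # assumed reading of the indicator
                cost_now += fixed_charge
            om_next = replaced * om_cost[1] + sum(
                m * om_cost[age + 1] for age, m in enumerate(kept, start=1)
            )
            next_state = (replaced,) + kept + (0,) * (n_max - n)
            best = min(best, cost_now + alpha * (om_next + f(t + 1, next_state)))
        return best

    return f(0, tuple(initial_state))


# Hypothetical data for the six-asset system with maximum age three.
cost = parallel_replacement_cost(
    (3, 2, 1), horizon=3, price=100.0, fixed_charge=40.0,
    salvage={1: 70.0, 2: 50.0, 3: 35.0},
    om_cost={1: 10.0, 2: 15.0, 3: 25.0},
    alpha=1 / 1.12,
)
print(round(cost, 2))
```

Memoization matters here because different decision sequences lead to the same state at the same time, so each state value needs to be computed only once.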
A number of extensions to this model have been published in the literature,
although many utilize integer programming modeling techniques to deal with the
large-state space. Chand et al. (2000) focus on the use of dynamic programming and
include capacity expansion decisions with the replacement decisions. Unfortunately,
capital budgeting constraints greatly complicate the problem as it cannot be assumed
that groups of assets must be kept or replaced together.
While the theorems presented in Jones et al. (1991) greatly reduce the compu-
tational difficulties of solving the dynamic program for the parallel replacement
problem, it should be clear that using dynamic programming to address replacement
decisions for more complex systems may be difficult because of the computational
complexity created by the number of combinations of replacement alternatives.
(See Hartman and Ban (2002) and the references therein for a discussion of these
issues.)
Consider a more complex system in which a number of machines are used in
series (such as a production line) and there are a number of lines in parallel, such
as the one given in Figure 12.7. The lines are labeled 1, 2 and 3 while the machines
are labeled a, b, c, and d.

Figure 12.7. System with assets in series and lines in parallel (three parallel lines, each with machines a, b, c and d in series)

The capacity of a line is now defined by the machine in the line with the mini-
mum capacity. However, the capacity of the system is raised due to the parallel
design. The capacity of the system is defined by the sum of the capacity of each
line. Therefore, it is defined by the sum of the minimum capacity asset in each line.
Reliability is measured similarly to capacity, in that it is reduced by the series
structure but increased by the parallel (redundant) structure. For a given line, the
reliability of the line is equal to the product of the reliabilities of the individual
assets, because if one asset is down, the line is down. The reliability of the
system, assuming only one line must be up and running, is increased because the
system continues operating even if two of the three lines are down.
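
These definitions translate directly into code. The sketch below assumes that asset failures are independent and that the system is up when at least one line is up; the capacities and reliabilities in the usage lines are hypothetical.

```python
from math import prod


def system_capacity(line_capacities):
    """Sum over lines of the minimum asset capacity in each line."""
    return sum(min(line) for line in line_capacities)


def line_reliability(asset_reliabilities):
    """A series line is up only if every asset in it is up."""
    return prod(asset_reliabilities)


def system_reliability(line_reliabilities):
    """Parallel lines: the system is down only if every line is down."""
    all_lines_down = prod(1 - line_reliability(line) for line in line_reliabilities)
    return 1 - all_lines_down


# Hypothetical data for three lines with machines a, b, c and d.
capacities = [[10, 8, 12, 9], [7, 7, 7, 7], [5, 6, 4, 8]]
reliabilities = [[0.95, 0.90, 0.99, 0.97]] * 3
print(system_capacity(capacities))
print(round(system_reliability(reliabilities), 4))
```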
If one defines minimum system capacity or reliability constraints, these can be
incorporated into a dynamic programming recursion that evaluates the possibility of
replacing any combination of assets in each period over some horizon. Presumably,
newer assets would have higher capacity or reliability, either due to technological
change or due to the fact that they are new (and have not deteriorated), and thus
would increase the respective capacity or reliability of the system (in order to meet
the defined constraints).
The difficulty with using a dynamic programming recursion to evaluate these
decisions is not in capturing the capacity or reliability constraints. Rather, the
difficulty is in the exponential growth in the number of possible combinations of
replacements in each period. Consider the 12 assets shown in Figure 12.7. In the
most general problem, each asset and each combination of assets can be replaced in
each period, totaling 2^12 = 4,096 combinations each period for each state of the system.
This system could easily become more complicated, merely by defining a, b, c, and
d as processes, each of which may have a number of assets in parallel (or in series).
In the parallel machine replacement problem, a similar problem was encoun-
tered, but the number of possible decisions was reduced to N (the maximum allow-
able age for an asset) for each possible state in each period with the two theorems
introduced by Jones et al., without sacrificing optimality. Unfortunately, the inter-
action of the assets may prohibit the application of these theorems to other systems.
In fact, even how to define the state of the system is not entirely clear.
For the system described in Figure 12.7, we could define the state of the system as a matrix
of asset ages. Each row would be defined by the age of each machine in a given
line, with a row defined for each line. If an asset is replaced, then the age would
translate to 1 in the next stage while it would merely increment 1 period if the
machine is retained. This modeling approach could be expanded to the case of
multiple machines in a given process – by expanding the size of the matrix.
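
The matrix state and its transition can be sketched as follows, together with an enumeration of the unrestricted decisions. This is an illustrative sketch only: the state is a tuple of rows (one per line) of asset ages, a decision is a matrix of booleans of the same shape, and the function names are hypothetical.

```python
from itertools import product


def next_ages(ages, replace):
    """Replaced assets restart at age 1; retained assets age by one period."""
    return tuple(
        tuple(1 if is_replaced else age + 1
              for age, is_replaced in zip(age_row, replace_row))
        for age_row, replace_row in zip(ages, replace)
    )


def all_decisions(n_lines, n_machines):
    """Yield every combination of replace/retain choices, one per asset."""
    for flat in product((False, True), repeat=n_lines * n_machines):
        yield tuple(
            tuple(flat[i * n_machines:(i + 1) * n_machines])
            for i in range(n_lines)
        )


ages = ((2, 3, 1, 4), (1, 1, 2, 2), (5, 3, 3, 1))   # hypothetical ages, 3 lines x 4 machines
number_of_decisions = sum(1 for _ in all_decisions(3, 4))
print(number_of_decisions)
```

Any restriction of the decision set, such as replacing only assets above a given age, would be applied as a filter on this enumeration.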
Again, the difficulty would be in restricting the number of decisions to evaluate
for each state in a given period. Following the approach of Jones et al. (1991),
older assets would be replaced first (and even further restricted to have to be above
a certain age for consideration) and similarly aged assets of the same type would be
replaced in the same time period. Another approach would be to only consider
replacing assets that increase the system capacity or reliability. Thus, replacements
could be examined in the order of either increasing capacity or increasing reli-
ability. Whether these heuristic approaches provide a good solution for a given
problem instance would require extensive numerical testing.

12.6 Discussion and Further Topics for Research

In this chapter we have reviewed both economic life and dynamic programming
models to address capital replacement problems which can arise in various settings,
including manufacturing, transportation, and utility industries. It should be clear that
the trend in developing solutions to these problems has migrated from single assets to
those of complex systems. While good solutions exist for systems with homogeneous
assets in parallel, computational difficulties exist for those with inhomogeneous
assets both in series and parallel and opportunities exist to develop optimal solution
methods with advanced computational techniques or good solution rules developed
from simpler, tractable models.
In addition to the investigation of systems with multiple assets, further savings
can be achieved by considering operational and replacement decisions simul-
taneously (Hartman 1999, 2004). It should be clear that the usage of an asset over
time impacts its replacement schedule. In the context of multiple assets, it may be
possible to allocate usage to assets over time, thereby influencing replacement
schedules. Thus, in order to minimize total system costs over time, both replace-
ment and operating decisions should be considered simultaneously. There are
numerous application areas where this analysis is warranted, including transpor-
tation networks, water distribution networks, and production systems.
A final area of future research must center on technological change. While
assets are often replaced due to deterioration, newer assets are often purchased
because they are technologically advanced—providing similar capabilities at lower
cost or additional capabilities for additional revenue. Numerous studies focus on
the continuous evolution of technological change; however, more detailed research
must focus on appropriate models for various applications, as it is clear that
technology advances in different ways in different industrial sectors.

12.7 References
Apeland, S. and Scarf, P.A. (2003) A fully subjective approach to capital equipment
replacement. Journal of the Operational Research Society 54, 371–378.
Arnold, G. (2006) Essentials of Corporate Financial Management. Pearson, London.
Baker, R.D. and Scarf, P.A. (1995) Can models fitted to small data samples lead to maintenance
policies with near-optimal cost? IMA Journal of Mathematics Applied in Business and
Industry 6, 3–12.
Bean, J.C., Lohmann, J.R. and Smith, R.L. (1985) A dynamic infinite horizon replacement
economy decision model. The Engineering Economist 30, 99–120.
Bean, J.C., Lohmann, J.R. and Smith, R.L. (1994) Equipment replacement under
technological change, Naval Research Logistics, 41, 117–128.
Bellman, R.E. (1955) Equipment replacement policy. Journal of the Society for the
Industrial Applications of Mathematics 3, 133–136.
Booth, P. and King, P. (1998) The relationship between finance and actuarial science. In
Hand, D.J., Jacka, S.D. (Eds), Statistics in Finance, Arnold, London, pp.7–40.
Bowe, M. and Lee, D.L. (2004), Project evaluation in the presence of multiple embedded
real options: evidence from the Taiwan High-Speed Rail Project, Journal of Asian
Economics 15, 71–98.
Brint, A.T., Hodgkins, W.R., Rigler, D.M and Smith, S.A. (1998) Evaluating strategies for
reliable distribution. IEEE Comput.Applns. in Power 11, 43–47.
Chand, S., McClurg, T. and J. Ward (2000) A model for parallel machine replacement with
capacity expansion. European Journal of Operational Research, 121. 519–531.
Christer, A.H. (1984) Operational research applied to industrial maintenance and
replacement. In Eglese, R.W. and Rand, G.K. (Eds) Developments in Operational
Research (pp.31–58). Pergamon Press, Oxford.
Christer, A.H. (1988) Determining economic replacement ages of equipment incorporating
technological developments. In Rand, G.K. (Eds) Operational Research ’87 (pp.343–
354). Elsevier, Amsterdam.
Christer, A.H. and Scarf, P.A. (1994) A robust replacement model with applications to
medical equipment. J.Opl.Res.Soc. 45:261–275.
Derman, C. (1963) Inspection-maintenance-replacement schedules under markovian
deterioration. In Mathematical Optimization Techniques, University of California Press,
Berkeley, CA, pp. 201–210.
Dixit, A.K. and Pindyck R.S. (1994) Investment Under Uncertainty Princeton University
Press, New Jersey.
Eilon, S., King, J.R. and Hutchinson, D.E. (1966). A study in equipment replacement.
Opl.Res.Quart. 17:59–71.
Elton, D.J. and Gruber, M.J. (1976) On the optimality of an equal life policy for equipment
subject to technological change. Opl.Res.Quart. 22:93–99.
Goldstein, M. and O’Hagan, A. (1996) Bayes linear sufficiency and systems of expert
posterior assessments. Journal of the Royal Statistical Society Series B 58, 301–316.
Hartman, J.C. (1999) A General Procedure for Incorporating Asset Utilization Decisions
into Replacement Analysis. Eng. Econ., 44(3):217–238.
Hartman, J.C. (2001) An Economic Replacement Model with Probabilistic Asset Utilization.
IIE Transactions, 33, 717–729.
Hartman, J.C. (2004) Multiple asset replacement analysis under variable utilization and
stochastic demand. European Journal of Operational Research 59, 145–165.
Hartman, J.C. and J. Ban (2002) The series-parallel replacement problem. Robotics and
Computer Integrated Manufacturing, 18, 215–221.
Hartman, J.C. and R.V. Hartman (2001) After-Tax Replacement Analysis. The Engineering
Economist, 46, 181–204.
Hartman, J.C. and Murphy, A. (2006) Finite Horizon Equipment Replacement Analysis. IIE
Transactions 38, 409–419.
Hartman, J.C. and Rogers, J.L. (2006) Dynamic Programming Approaches for Equipment
Replacement Problems with Continuous and Discontinuous Technological Change. IMA
Journal of Management Mathematics, 17, 143–158.
Hopp, W.J. and Nair, S.K. (1991) Timing replacement decisions under discontinuous
technological change. Naval Research Logistics 38, 203–220.
Hopp, W.J. and Nair, S.K. (1994) Markovian deterioration and technological change. IIE
Transactions, 26, 74–82.
Jones, P.C., Zydiak, J.L. and Hopp, W.J. (1991) Parallel machine replacement. Naval
Research Logistics, 38, 351–365.
Karabakal, N., Lohmann, J.R. and Bean, J.C. (1994) Parallel replacement under capital
rationing constraints. Management Science 40, 305–319.
Kobbacy, K. and Nicol, D. (1994) Sensitivity of rent replacement models. Int.J.Prod.Econ.
36, 267–279.
Markowitz, H.M. (1952) Portfolio selection. Journal of Finance 7, 77–91.
Northcott, D. (1985) Capital Investment Decision Making. Dryden Press, London.
Oakford, R.V., Lohmann, J.R. and Salazar, A. (1984) A dynamic replacement economy
decision model. IIE Transactions, 16, 65–72.
O’Hagan, A. (1994) Robust modelling for asset management. Journal of Statistical
Planning and Inference 40, 245–259.
Regnier, E., Sharp, G., and Tovey, C. (2004) Replacement under ongoing technological
progress. IIE Transactions, 36, 497–508.
Russell, J.C. (1982) Vehicle replacement: a case study in adapting a standard approach for a
large organisation. Journal of the Operational Research Society 33, 899–911.
Santhanam, R. and Kyparisis, G.J. (1996) A decision model for interdependent information
system project selection. European Journal of Operational Research 89, 380–399.
Scarf, P.A. (1994) Optimal buying, running and selling policy for the private motorist: an
application of capital replacement modelling, IMA Bulletin 30, 181–186.
Scarf, P.A. and Christer, A.H. (1997) Applications of capital replacement models with finite
planning horizons, International Journal of Technology Management 13, 25–36.
Scarf, P.A. and Hashem, M. (1997) On the application of an economic life model with a
fixed planning horizon, International Transactions in Operations Research 4, 139–150.
Scarf, P.A. and Hashem, H. (2003) Characterization of optimal policies for capital
replacement models. IMA Journal of Management Mathematics 13, 261–271.
Scarf, P.A. and Martin, H. (2001) A framework for maintenance and replacement of a
network structured system. Int. J. Prod.Econ. 69, 287–296.
Scarf, P.A., Dwight, R., McCusker, A. and Chan, A. (2006) Asset replacement for an urban
railway using a modified two-cycle replacement model. Journal of the Operational
Research Society (doi: 10.1057/palgrave.jors.2602288).
Suzuki, Y. and Pautsch, G.R. (2005) A vehicle replacement policy for motor carriers in an
unsteady economy. Transportation Research Part A: Policy and Practice 39, 463–480.
Wagner, H.M. (1975) Principles of Operations Research. Prentice-Hall Inc., Englewood
Cliffs, NJ.
Wang, H. (2002) A survey of maintenance policies of deteriorating systems. European
Journal of Operational Research 139, 469–489.

Maintenance and Production: A Review of Planning Models

Gabriella Budai, Rommert Dekker and Robin P. Nicolai

13.1 Introduction
Maintenance is the set of activities carried out to keep a system in a condition
where it can perform its function. Quite often these systems are production systems
where the outputs are products and/or services. Some maintenance can be done
during production and some can be done during regular production stops in
evenings, weekends and on holidays. However, in many cases production units
need to be shut down for maintenance. This may lead to tension between the
production and maintenance departments of a company. On the one hand, the production
department needs maintenance for the long-term well-being of its equipment; on
the other hand, maintenance leads to shutting down operations and loss of production. It
will be clear that both can benefit from decision support based on mathematical
models.
In this chapter we give an overview of mathematical models that consider the
relation between maintenance and production. The relation exists in several ways.
First of all, when planning maintenance one needs to take production into account.
Second, maintenance can also be seen as a production process which needs to be
planned. Finally, one can develop integrated models for maintenance and production.
Apart from giving a general overview of models we will also discuss some
sectors in which the interactions between maintenance and production have been
studied.
Many review articles have been written on maintenance, e.g. Cho and Parlar
(1991), but to our knowledge only one on the combination of maintenance
and production, Ben-Daya and Rahim (2001). This review differs from that in
several aspects. First of all, we also consider models which take production
restrictions into account, rather than only integrated models. Second, we discuss some specific
sectors. Finally, we discuss the more recent articles since that review.
Maintenance is related to production in several ways. First of all, maintenance is
intended to allow production, yet to execute maintenance, production often has to be
stopped. This negative effect therefore has to be considered in maintenance planning
and optimization. It arises specifically in the costing of downtime and
in opportunity maintenance. All articles taking the effect of production on maintenance
explicitly into account fall into this category.
Second, maintenance can also be seen as a production process which needs to be
planned. Planning in this respect implies determining appropriate levels of capacity
(e.g. manpower) relative to the demand.
Third, we are concerned with production planning in which one needs to take
maintenance jobs into account. The point is that the maintenance jobs take pro-
duction capacity away and hence they need to be planned together with production.
Maintenance has to be done either because of a failure or because the quality of the
produced items is not high enough. In this third category we also consider the
integrated planning of production and maintenance.
The relation between maintenance and production is also determined by the
business sector. We consider the following sectors: railways, road, airlines and
electrical power system maintenance.
The outline of the rest of this chapter is now as follows. In Section 13.2 we
present an overview of the main elements of maintenance planning as these are
essential to understand the rest of this chapter. Following our classification scheme,
in Section 13.3 we review articles in which maintenance is modelled explicitly and
where the needs of production are taken into account. Since these needs differ
between business sectors, we discuss in Section 13.4 the relation between pro-
duction and maintenance for some specific business sectors. In Section 13.5 we
consider the second category in our classification scheme: maintenance as a pro-
duction process which needs to be planned. In Section 13.6 we are concerned with
production planning in which one needs to take maintenance jobs into account
(integrated production and maintenance planning). Trends and open research areas
are discussed in Section 13.7 and, finally, conclusions are drawn in Section 13.8.

13.2 Maintenance Planning and Optimization: An Overview

In maintenance several important decisions have to be made. We distinguish between
(i) the long-term strategy and maintenance concept, (ii) medium-term planning, (iii)
short-term scheduling and finally (iv) control and performance indicators.
Major strategic decisions concerning maintenance are made in the design
process of systems. What type of maintenance is appropriate and when should it be
done? This is laid down in the so-called maintenance concept. Many optimization
models address this problem and the relation with production is implicit in some of
them.
Another important strategic problem is the organization of the maintenance
department. Is maintenance done by production personnel (in the way total produc-
tive maintenance prescribes) or is there specific maintenance personnel? Second,
questions such as “Where is it located?”, “Are specific types of work outsourced?”,
etc. should be answered. Although they are important topics, they are more the
concern of industrial organization than the topic of mathematical models.
Further important strategic issues concern how a system can be maintained,
whether specific expertise or equipment is needed, whether one can easily reach
the subsystems, what information is available and what elements can be easily
replaced. These are typical maintainability aspects, but they have little to do with
production.
In the tactical phase, usually between a month and a year, one plans for the major
maintenance/upgrade of major units and this has to be done in cooperation with the
production department. Accordingly, specific decision support is needed in this
respect. Another tactical problem concerns the capacity of the maintenance crew.
Is there enough manpower to carry out the preventive maintenance program?
These questions can be addressed by use of models as will be indicated later on.
In the short term scheduling phase one determines the moment and order of
execution, given an amount of outstanding corrective or preventive work. This is
typically the domain of work scheduling where extensive model-based support can
be given.
We will next consider another important aspect in maintenance, which is the
type of maintenance. A typical distinction is made between corrective and preven-
tive maintenance work. The first is carried out after a failure, which is defined as
the event by which a system stops functioning in a prescribed way. Preventive
work, however, is carried out to prevent failures. Although this distinction is often
made, we would like to remark that the difference is not as clear as it may seem. This is
due to the definition of failure. An item may be in a bad state, while still func-
tioning and one may or may not consider this as a failure. Anyhow, an important
distinction between the two is that corrective maintenance is usually not plannable,
but preventive maintenance typically is.
The execution of maintenance can also be triggered by condition measurements
and then we speak of condition-based maintenance. This has often been advocated
as more effective and efficient than time-based preventive maintenance. Yet it is
very hard to predict failures well in advance, and hence condition-based maintenance
is often unplannable. Instead of time-based maintenance one can also base
preventive maintenance on utilisation (run hours, mileage), which can be a more
appropriate indicator of wear-out.
Finally one may also have inspections which can be done by sight or instruments
and often do not affect operation. They do not improve the state of a system how-
ever, but only the information about it. This can be important in case machines start
producing items of a bad quality. There are inspection-quality problems where in-
spection optimization is connected to quality control.
Another distinction concerns the amount of work. Often there are small jobs,
grouped into maintenance packages. They may start with inspection and cleaning, followed
by improvement actions such as lubricating and/or replacing some parts. These
are typically part of the preventive maintenance program attached to a system and
have to be done on a repetitive basis (monthly, quarterly, yearly or two-yearly).
Next, one has replacements of parts or subsystems and overhauls or refurbishments
where a substantial system is improved. The latter are planned well in advance and
carried out as projects with individual (or separate) budgets.
A traditional optimization problem has been the choice and trade-off between
preventive and corrective maintenance. The typical motivation is that preventive
maintenance is cheaper than corrective. Maintenance costs are usually due to man-
hours, materials and indirect costs. The difference between corrective and preven-
324 G. Budai, R. Dekker and R. Nicolai

tive maintenance costs is especially in the latter category. They represent loss of
production and environmental damage or safety consequences. Costing these
consequences can be a difficult problem and is tackled in Section 13.3.1. It will
also be clear that preventive maintenance should be done when production is least
affected. This can be done using opportunities, which has given rise to a specific
class of models dealt with in a separate section (Section 13.3.2).

13.3 When to do Maintenance in Relation to Production

In this section we discuss articles in which maintenance (planning or scheduling) is
modelled explicitly and the needs of production are taken into account. The latter,
however, is not usually modelled as such, but it is taken into account in the form of
constraints or requirements. Alternatively the effect of maintenance on varying
production scenarios may be considered. Following this reasoning we arrive at
three streams of research. A first stream assesses the costs of downtime, which are
important in the planning of maintenance. The second stream deals with studies
where one tries to schedule maintenance work at those moments that units are not
needed for production (opportunities) and in the last stream articles are considered
which schedule maintenance in line with production. Each stream is dealt with in a
separate section.

13.3.1 Costing of Downtime

Assessing the costs of downtime is an important step in the determination of costs
of preventive and corrective maintenance. Although exact values are not necessary
as most optimization results show, it is important to assess these values with a
reasonable accuracy. It is easier to determine downtime costs in case of preventive
maintenance than in case of corrective maintenance as failures may have many
unforeseen consequences. Yet even in case of preventive maintenance the assess-
ment can be difficult, e.g. in case of highway shutdowns or railway stoppage.
Another problem to be tackled is the system-unit relation. A system can be a
complex configuration of different units, which may imply that downtime of one
unit does not necessarily halt the full system. Accordingly, an assessment of the
consequences of unit downtime on system performance has to be made. This is
especially a problem in case of k-out-of-n systems or even in more general configurations.
Several articles deal with this issue. Some give an overall model, others de-
scribe a detailed case. Geraerds (1985) gives an outline of a general structuring to
determine downtime costs. In Dekker and Van Rijn (1996) a downtime model is
described for k-out-of-n systems used on oil production platforms. Edwards et
al. (2002) give a detailed model for the costs of equipment downtime in open-pit
mining. They use regression models based on historical data.
Knights et al. (2005) present a model to assist maintenance managers in
evaluating the economic benefits of maintenance improvement projects.
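The system-unit relation discussed above can be illustrated with a small calculation. The sketch below is not taken from any of the cited articles; it assumes that the n units are identical and fail independently, that each unit has a known steady-state availability, and that the system produces as long as at least k units are up. The function name and the numbers are hypothetical.

```python
from math import comb


def k_out_of_n_availability(k: int, n: int, unit_availability: float) -> float:
    """Probability that at least k of n independent, identical units are up."""
    a = unit_availability
    return sum(comb(n, j) * a**j * (1 - a) ** (n - j) for j in range(k, n + 1))


# Hypothetical example: a 2-out-of-3 system whose units are each up 90% of the time.
system_availability = k_out_of_n_availability(2, 3, 0.9)
# Fraction of time during which unit downtime actually halts the system.
system_downtime_fraction = 1.0 - system_availability
```

Under these assumptions a unit that is down 10% of the time halts the system for less than 3% of the time, which is why unit downtime costs cannot simply be equated with system downtime costs.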

13.3.2 Opportunity Maintenance

Opportunity maintenance is maintenance that is carried out at an opportune
moment, i.e. a moment at which the units to be maintained are needed less for their
function than normal. We speak of opportunities if these events occur occasionally
and if they are difficult to predict in advance. There can be several reasons for a
maintenance opportunity:
• Failure and hence repairs of other units/components. The failure of one com-
ponent is often an opportunity to preventively maintain other components.
Especially if the failure causes the breakdown of the production system it is
favourable to perform preventive maintenance on other components. After
all, little or no production is lost above that resulting from the original failure.
An example is given in Van der Duyn Schouten et al. (1998) who consider
the replacement of traffic lights at an intersection.
• Other interruptions of production. Production processes are not only interrupted
by failures or repairs. Several outside events may create an opportunity
as well. These can be market interruptions, or other work for which
production needs to be stopped (e.g. replacing catalysts), which provides an
opportunity to combine it with preventive maintenance.
According to the foregoing discussion there are two approaches to opportunities.
The first models a whole multi-component system in which upon a failure
preventive maintenance can be carried out on other components as well. In the
second the opportunities are modelled as outside events at which one may
do maintenance. In its simplest form one considers a single component, with maintenance
which may be done at opportunities, or also with a forced shutdown.
Bäckert and Rippin (1985) consider the first type of opportunistic maintenance
for plants subject to breakdowns. In this article three methods are proposed to solve
the problem. In the first two cases the problem is formulated as a stochastic
decision tree and solved using a modified branch and bound procedure. In the third
case the problem is formulated as a Markov decision process. The planning period
is discretised, resulting in a finite state space to which a dynamic programming
procedure can be applied.
In Wijnmalen and Hontelez (1997) a multi-component system is considered
where failures of one component may create an opportunity, but the opportunity
process is approximated by an independent process with the same mean rate. In
this way they circumvent the problem of dimensionality which appears in the study
of Bäckert and Rippin (1985).
There are several articles considering the other stream. Tan and Kramer (1997)
propose a general framework for preventive maintenance optimization in chemical
process operations. The authors combine Monte Carlo simulation with a genetic
algorithm. Opportunities are the failure of other components.
In Dekker and Dijkstra (1992) and Dekker and Smeitink (1991) it is assumed
that the opportunity-generating process is completely independent of the failure
process and is modelled as a renewal process. Dekker and Smeitink (1994)
consider multi-component maintenance at opportunities of restricted duration and
determine priorities of what preventive maintenance to do at an opportunity.
In Dekker and Van Rijn (1996) a decision-support system (PROMPT) for op-
portunity-based preventive maintenance is discussed. PROMPT was developed to
take care of the random occurrence of opportunities of restricted duration. Here,
opportunities are not only failures of other components, but also preventive main-
tenance on (essential) components. Many of the techniques developed in the
articles of Dekker and Smeitink (1991), Dekker and Dijkstra (1992) and Dekker
and Smeitink (1994) are implemented in the decision-support system. In PROMPT
preventive maintenance is split up into packages. For each package an optimum
policy is determined, which indicates when it should be carried out at an opportunity.
From the separate policies a priority measure is derived that indicates which
maintenance package should be executed at a given opportunity.
In Dekker et al. (1998b) the maintenance of light-standards is studied. A light-
standard consists of n independent and identical lamps screwed on a lamp assembly.
To guarantee a minimum luminance, the lamps are replaced if the number of failed
lamps reaches a prespecified number m. In order to replace the lamps the assembly
has to be lowered. As a consequence, each failure is an opportunity to combine
corrective and preventive maintenance. Several opportunistic age-based variants of
the m-failure group replacement policy (in its original form only corrective main-
tenance is grouped) are considered. Simulation optimization is used to determine the
optimal opportunistic age threshold.
Dagpunar (1996) introduces a maintenance model where replacement of a com-
ponent within a system is possible when some other part of the system fails, at a
cost of c2. The opportunity process is Poisson. A component is replaced at an
opportunity if its age exceeds a specified control limit t. Upon failure a component
is replaced at cost c4 if its age exceeds a specified control limit x, otherwise it is
minimally repaired at cost c1. In case of a minimal repair the age and failure rate of
the component after the repair are as they were immediately before failure. There is also
a possibility of a preventive or “interrupt” replacement at cost c3 if the component
is still functioning at a specified age T. A procedure to optimise the control limits t
and T is given in Dekker and Plasmeijer (2001).
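The policy just described can be explored with a simple Monte Carlo experiment. The sketch below is only an illustration and is not the optimisation procedure of the cited articles. It assumes a Weibull lifetime with shape beta and scale eta (the text does not specify a lifetime distribution), treats the limits t, x and T as given, and estimates the long-run cost per unit time by simulating renewal cycles. The function name and all parameter values are hypothetical.

```python
import random


def simulate_cost_rate(t, x, T, c1, c2, c3, c4, opp_rate, beta, eta,
                       n_cycles=20000, seed=1):
    """Estimate the long-run cost rate of the (t, x, T) policy by simulation."""
    rng = random.Random(seed)
    total_cost = 0.0
    total_time = 0.0
    for _ in range(n_cycles):
        age = 0.0
        while True:
            # Under minimal repair, failures follow the Weibull hazard in the
            # component age: the cumulative hazard to the next failure is Exp(1).
            hazard_now = (age / eta) ** beta
            fail_age = eta * (hazard_now + rng.expovariate(1.0)) ** (1.0 / beta)
            # Poisson opportunities are memoryless, so they can be resampled.
            opp_age = age + rng.expovariate(opp_rate)
            if T <= min(fail_age, opp_age):
                total_cost += c3          # "interrupt" replacement at age T
                age = T
                break
            if opp_age < fail_age:
                age = opp_age
                if age > t:
                    total_cost += c2      # replacement at an opportunity
                    break
            else:
                age = fail_age
                if age > x:
                    total_cost += c4      # replacement upon failure
                    break
                total_cost += c1          # minimal repair, age unchanged
        total_time += age
    return total_cost / total_time


# Sanity check: with t = T and x infinite the policy reduces to periodic
# replacement with minimal repair, whose cost rate is (c3 + c1 * (T/eta)**beta) / T.
rate = simulate_cost_rate(t=1.0, x=float("inf"), T=1.0, c1=1.0, c2=1.5,
                          c3=2.0, c4=5.0, opp_rate=1.0, beta=2.0, eta=1.0)
```

In the sanity check the analytical cost rate equals 3, so the simulated value should be close to that. Evaluating the function on a grid of (t, T) values gives a crude picture of the trade-off that the control limits have to balance.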

13.3.3 Maintenance Scheduling in Line with Production

Here we consider models where the effect of production on maintenance is explicitly
taken into account. These models only address maintenance decisions, but
they do not give advice on how to plan production.
The models developed in the articles in this category show that a good
maintenance plan, one that is integrated with the production plan, can result in
considerable cost savings. This integration with production is crucial because
production and maintenance have a direct relation. Any breakdown in machine
operation results in disruption of production and leads to additional costs due to
downtime, loss of production, decrease in productivity and quality, and inefficient
use of personnel, equipment and facilities. Below we review articles following this
stream of research in chronological order.
Dedopoulos and Shah (1995) consider the problem of determining the optimal
preventive maintenance policy parameters for individual items of equipment in
multipurpose plants. In order to formulate maintenance policies, the benefits of
maintenance, in the form of reduced failure rates, must be weighed against the
costs. The approach in this study first attempts to estimate the effect of the failure
rate of a piece of equipment on the overall performance/profitability of the plant.
An integrated production and maintenance planning problem is also solved to
determine the effects of PM on production. Finally, the results of these two
procedures are then utilized in a final optimization problem that uses the relation-
ship between profitability and failure rate as well as the costs of different main-
tenance policies to select the appropriate maintenance policy.
Vatn et al. (1996) present an approach for identifying the optimal maintenance
schedule for the components of a production system. Safety, health and environ-
ment objectives, maintenance costs and costs of lost production are all taken into
consideration, and maintenance is thus optimized with respect to multiple ob-
jectives. The approach is flexible as it can be carried out at various levels of detail,
e.g. adapted to available resources and to the management’s willingness to give
detailed priorities with respect to objectives on safety vs. production loss.
Frost and Dechter (1998) define the scheduling of preventive maintenance of
power generating units within a power plant as constraint satisfaction problems.
The general purpose of determining a maintenance schedule is to determine the du-
ration and sequence of outages of power generating units over a given time period,
while minimizing operating and maintenance costs over the planning period.
Vaurio (1999) develops unavailability and cost rate functions for components
whose failures can occur randomly. Failures can only be detected through periodic
testing or inspections. If a failure occurs between consecutive inspections, the unit
remains failed until the next inspection. Components are renewed by preventive
maintenance periodically, or by repair or replacement after a failure, whichever
occurs first (age-replacement). The model takes into account finite repair and
maintenance durations as well as costs due to testing, repair, maintenance and lost
production or accidents. For normally operating units the time-related penalty is
loss of production. For standby safety equipment it is the expected cost of an
accident that can happen when the component is down due to a dormant failure,
repair or maintenance. The objective is to minimize the total cost rate with respect
to the inspection and the replacement interval. General conditions and techniques
are developed for determining optimal test and maintenance intervals, with and without
constraints on the production loss or accident rate. Insights are gained into how the
optimal intervals depend on various cost parameters and reliability characteristics.
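The trade-off behind an optimal inspection interval can be made concrete with a much simpler model than the one in the article. The sketch below assumes exponentially distributed times to failure with rate failure_rate, failures that stay hidden until the next inspection, and negligible test and repair durations, so that the mean unavailability over an interval of length T is 1 - (1 - exp(-λT)) / (λT). Preventive replacement is ignored. All function names, cost parameters and numbers are hypothetical.

```python
import math


def mean_unavailability(failure_rate: float, interval: float) -> float:
    """Average fraction of time a hidden failure is present between inspections."""
    x = failure_rate * interval
    return 1.0 - (1.0 - math.exp(-x)) / x


def cost_rate(failure_rate, interval, test_cost, downtime_cost_rate):
    """Inspection cost per unit time plus expected downtime penalty per unit time."""
    return (test_cost / interval
            + downtime_cost_rate * mean_unavailability(failure_rate, interval))


def best_interval(failure_rate, test_cost, downtime_cost_rate, candidates):
    """Pick the candidate inspection interval with the lowest cost rate."""
    return min(candidates,
               key=lambda T: cost_rate(failure_rate, T, test_cost, downtime_cost_rate))


# Hypothetical numbers: one failure per 100 time units, a test costs 1,
# and each unit of time spent in the failed state costs 100.
chosen = best_interval(0.01, 1.0, 100.0, [0.5, 1.0, 1.5, 2.0, 3.0])
```

Short intervals waste money on testing and long intervals leave failures undetected for too long; for small λT the optimum lies near sqrt(2 · test_cost / (downtime_cost_rate · λ)).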
Van Dijkhuizen (2000) studies the problem of clustering preventive main-
tenance jobs in a multiple set-up multi-component production system. This article
has been reviewed in Chapter 11, which gives an overview of multi-component
maintenance models.
Cassady et al. (2001) introduce the concept of selective maintenance. Often
production systems are required to perform a sequence of operations with finite
breaks between each operation. The authors establish a mathematical programming
framework for assisting decision-makers in determining the optimal subset of main-
tenance activities to perform prior to beginning the next operation. This decision
making process is referred to as selective maintenance.
The article of Haghani and Shafahi (2002) deals with the problem of scheduling
bus maintenance activities. A mathematical programming approach to the problem
is proposed. This approach takes as input a given daily operating schedule for all
buses assigned to a depot along with available maintenance resources. Then a daily
inspection and maintenance schedule is designed for the buses that require
inspection so as to minimize the interruptions in the daily bus-operating schedule,
and maximize the reliability of the system and efficiently utilize the maintenance
Charles et al. (2003) examine the interaction effects of maintenance policies on
batch plant scheduling in a semiconductor wafer fabrication facility. The purpose
of the work is the improvement of the quality of maintenance department activities
by the implementation of optimized preventive maintenance (PM) strategies and
comes within the scope of a total productive maintenance (TPM) strategy. The
production of semiconductor devices is carried out in a wafer lab. In this produc-
tion environment equipment breakdown or procedure drifting usually induces un-
scheduled production interruptions.
Cheung et al. (2004) consider a plant with several units of different types.
There are several shutdown periods for maintenance. The problem is to allocate
units to these periods in such a way that production is least affected. Maintenance
is not modelled in detail, but incorporated through frequency or period restrictions.

13.4 Specific Business Sectors

The purpose here is to illustrate the interdependence between maintenance and
production for some specific sectors in more detail. Moreover, we show which ideas
were employed in which sector and the differences between them. Although many
sectors could be distinguished we take those where maintenance plays an important
role. Not surprisingly, these are all capital intensive sectors with high maintenance
expenditure and we discuss railway, road, airline and electric power system maintenance.

13.4.1 Railway Maintenance

Since rail is an important transportation mode, proper maintenance of the existing
lines and timely repairs and replacements are all important to ensure
efficient operation. Moreover, since some failures might have a strong impact on
the safety of the passengers, it is important to prevent these failures by carrying out
preventive maintenance works in time and according to predefined schedules.
The preventive maintenance works are small routine works and/or
projects. The routine (spot) maintenance activities, which consist of inspections and
small repairs (see Esveld 2001), do not take much time to perform and are
done regularly, with frequencies varying between monthly and once a year. The
projects include renewal works and they are carried out once or twice every few
years.
In the literature there are a couple of articles that provide useful methods for
finding optimal track possession intervals for carrying out preventive maintenance
works, i.e. time periods when a track is required for maintenance and is therefore
blocked for operation. In production planning terms track possession means
downtime required for maintenance. The main question is when to carry out
maintenance such that the inconvenience for the train operators, the disruption to
the scheduled trains and the infrastructure possession time for maintenance
are minimized and the maintenance cost is the lowest possible. For a more detailed
overview of techniques used in planning railway infrastructure maintenance we
refer to Dekker and Budai (2002) and Improverail (2002). In some articles (see,
e.g. Higgins 1998, Cheung et al. 1999 and Budai et al. 2006) the track possession
is modelled in between operations. This can be done for occasionally used tracks,
which is the case in Australia and some European countries. If tracks are used
frequently, one has to perform maintenance during nights, when the train traffic is
almost absent or during weekends (with possible interruption of the train services),
when there are fewer disturbances for the passengers. In the first case one can either
make a cyclic static schedule, which is done by Den Hertog et al. (2005) and Van
Zante-de Fokkert (2001) for the Dutch situation, or a dynamic schedule with a
rolling horizon, which is done in Cheung et al. (1999). The latter schedule has to
be made regularly.
Some other articles deal with grouping railway maintenance activities to reduce
costs, downtime and inconvenience for the travellers and operators. Here we
mention the study of Budai et al. (2006) in which the preventive maintenance
scheduling problem is introduced. This problem arises in other public/private
sectors as well, since preventive maintenance of other technical systems (machine,
road, airplanes, etc.) also contains small routine works and large projects.

13.4.2 Road Maintenance

Road maintenance has many common characteristics with railway maintenance.
Failures are often indirect, in the sense that norms are surpassed, but there may not
be any consequences. The production function is indirect, but that does not mean
that it is not felt by many. Governments may define a cost penalty due to one hour
waiting per vehicle because of congestion caused by road maintenance. Similar to
railway maintenance one sees that work is shifted to nights or a lot of work is combined
into a large project about which the public is informed long before it is started.
The night work causes high logistics costs for maintenance, yet it is useful for
small repairs or patches.
Other similarities with railroads are the large number of identical parts (a road
is typically split up in lanes of 100 meters about which information is stored). Vans
with complex road analysing equipment are used to assess the road quality. For
railways special trains with complex measuring equipment are used. Videos are
used in both cases. Next, both roads and rails have multiple failure modes. Furthermore,
the assets to be maintained are spread out geographically, which results in
high logistics costs for maintenance. This is also true for airline and truck main-
tenance. Both road and rail need much maintenance and as a result large budgets
need to be allocated for both.
Although several articles have been written on road maintenance, few take the
production or user consequences into account. We would like to mention Dekker et
al. (1998a) who compare two concepts to do road maintenance – one with small
projects carried out during nights and the other where large road segments (some
4 km) are overhauled in one stretch. In the latter case the traffic is diverted to other
lanes or the side of the road. It is shown that the latter is both advantageous for the
traffic and cheaper, provided the volume of traffic on the road is not too
high. Another interesting contribution is from Rose and Bennett (1992) who pro-
vide a model to locate and decide on the size (or capacity) of road maintenance
depots, for corrective maintenance.

13.4.3 Airline Maintenance

Maintenance costs are a substantial part of an airline’s costs. Estimates are that
20% of the cost is due to maintenance. Maintenance is crucial because of safety
reasons and because of high downtime costs. Apart from a crash, the worst event
for an airline is an aircraft on ground (AOG) because of failures. Accordingly a lot
of technology has been developed to facilitate maintenance. We would like to mention
in-flight diagnosis, such that quick actions can be taken on the ground, and a very high
level of modularity, such that failed components can easily be replaced. Yet in an
aircraft there is still a high level of time-based preventive maintenance rather than
condition-based maintenance. A plane has to undergo several checks, ranging from
an A check taking about an hour after each flight, to a monthly B check, a yearly C
check and a five-yearly D check, where it is completely overhauled and which can
take a month. The presence of the monthly check implies that planes cannot always
fly the same route, but need to be rotated on a regular basis. It also implies that
airlines need multiple units of a type in order to provide a consistent service.
Several studies have addressed the issue of fleet allocation and maintenance
scheduling. In the fleet allocation one decides which planes fly which route and at
which time. One would preferably make an allocation which remains fixed for a
whole year, but due to the regular maintenance checks this is not possible. Gopalan
and Talluri (1998) give an overview of mathematical models on this problem.
Moudani and Mora-Camino (2000) present a method to do both flight assignment
and maintenance scheduling of planes. It uses dynamic programming and heuris-
tics. A case of a charter airline is considered. Sriram and Haghani (2003) also
consider the same problem. They solve it in two phases. Finally, Feo and Bard
(1989) consider the problem of maintenance base planning in relation to an airline’s
fleet rotation, while Cohn and Barnhart (2003) consider the relation between crew
scheduling and key maintenance routing decisions.
In another line of research, Dijkstra et al. (1994) develop a model to assess
maintenance manpower scheduling and requirements in order to perform inspec-
tion checks (A type) between flight turnarounds. It appears that their workload is
quite peaked because of many flights arriving more or less at the same time (so-
called banks) in order to allow fast passenger transfers.
The same problem is also tackled by Yan et al. (2004). The articles in this line
of research consider in effect the production planning of maintenance, a topic also
addressed in Section 13.5.
As the last article in this category we would like to mention Cobb (1995) who
presents a simulation model to evaluate current maintenance system performance
or the positive effect of ad hoc operating decisions on maintenance turn times (i.e.
the time maintenance takes to carry out a check or to do a repair).

13.4.4 Electric Power System Maintenance

Kralj and Petrovic (1988) have presented an overview article on optimal main-
tenance of thermal generating units in power systems. They primarily focused on
articles published in IEEE Transactions on Power Apparatus and Systems. Here we
will briefly discuss the typical problems of the maintenance of power systems and
review two articles dealing with these problems.
First of all, note that maintenance of power systems is costly, because it is im-
possible to store generated electrical energy. Moreover, the continuity of supply is
very important for its customers.
A second problem of scheduling the maintenance of power systems is that joint
maintenance of units is often impossible or very expensive, since that would affect
production too much.
Frost and Dechter (1998) consider the problem of scheduling preventive
maintenance of power generating units within a power plant. The purpose of the
maintenance scheduling is to determine the duration and sequence of outages of
power generating units over a given time period, while minimizing operating and
maintenance costs over the planning period, subject to various constraints. A
subset of the constraints contains the pairs of components that cannot be maintained
simultaneously. In this article the maintenance problems are cast as constraint
satisfaction problems (CSPs). The optimal solution is found by solving a series of
CSPs with successively tighter cost-bound constraints.
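The idea of optimising by solving a series of CSPs with ever tighter cost bounds can be sketched on a toy instance. The code below is not the algorithm of the article; it is a plain backtracking search that assumes each unit needs exactly one maintenance period, that costs per unit and period are non-negative, and that conflicting units may not share a period. All names and data are hypothetical.

```python
def solve_csp(units, periods, conflicts, cost, bound):
    """Backtracking search for a schedule with total cost strictly below bound."""
    def extend(i, assignment, spent):
        if i == len(units):
            return dict(assignment)
        u = units[i]
        for p in periods:
            c = spent + cost[u][p]
            if c >= bound:
                continue  # cost-bound constraint violated (costs are non-negative)
            if any(assignment.get(v) == p for v in conflicts.get(u, ())):
                continue  # a conflicting unit is already maintained in period p
            assignment[u] = p
            result = extend(i + 1, assignment, c)
            if result is not None:
                return result
            del assignment[u]
        return None
    return extend(0, {}, 0)


def minimise_by_tightening(units, periods, conflicts, cost):
    """Solve CSPs with successively tighter cost bounds until none is feasible."""
    best, best_cost = None, float("inf")
    while True:
        solution = solve_csp(units, periods, conflicts, cost, best_cost)
        if solution is None:
            return best, best_cost
        best = solution
        best_cost = sum(cost[u][p] for u, p in solution.items())


# Hypothetical data: three generating units, three periods, units A and B
# may not be out of service in the same period.
units = ["A", "B", "C"]
periods = range(3)
conflicts = {"A": {"B"}, "B": {"A"}}
cost = {"A": [1, 3, 5], "B": [1, 2, 4], "C": [2, 1, 3]}
plan, total = minimise_by_tightening(units, periods, conflicts, cost)
```

Each feasible schedule found becomes the new cost bound, so the last feasible schedule before the search fails is optimal for this toy instance.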
Langdon and Treleaven (1997) study the problem of scheduling maintenance
for electrical power transmission networks. Grouping maintenance in the network
may prevent the use of a cheap electricity generator, so requiring a more expensive
generator to be run in its place. That is, some parts of the network should not be
maintained simultaneously. These exclusions are modelled by adding restrictions
to the MIP formulation of the problem.

13.5 Production Planning of Maintenance

Maintenance can also be regarded as a production process which needs to be
planned. Planning in this respect implies determining appropriate levels of capacity
concerning the demand. It will be clear that this activity can only be carried out for
plannable maintenance, e.g. overhauls or refurbishment and that it is only needed
when there are capacity restrictions, e.g. in a shipyard.
What distinguishes maintenance production planning from standard production
planning is that there tend to be more unforeseen events and more intervening
corrective maintenance work.
Articles in this category are Dijkstra et al. (1994) and Yan et al. (2004), who
both consider manpower determination and allocation problems in case of a
fluctuating workload for aircraft maintenance. Shenoy and Bhadury (1993) use the
MRP approach to develop a maintenance-manpower plan. Bengü (1994) discusses
the organization of maintenance centres that are specialized to carry out particular
types of maintenance jobs in the telecommunication sector. Al-Zubaidi and
Christer (1997) consider the problem of manpower planning for hospital building
Another typical production planning problem concerns layout planning.
A case study for a maintenance tool room is described in Rosa and Feiring (1995).
The study by Rose and Bennett (1992), which was discussed in Section 13.4, also
falls into this category.

13.6 Integrated Production and Maintenance Planning

In recent years there has been considerable interest in models attempting to
integrate production, quality and maintenance (Ben-Daya 2001). Whereas in the
past these aspects have been treated as separate problems, nowadays models take
into account the mutual interdependencies. Production planning typically concerns
determining lot sizes and evaluating capacity needs, in case of fluctuating demand.
Both the optimal lot size and the capacity needs are influenced by failures. On the
other hand, maintenance prevents breakdowns and improves quality. Accordingly,
they should be planned in an integrated way (see, e.g. Nahmias 2005).
We subdivide the class of integrated production and maintenance planning
models into four categories: high-level models considering conceptual and process
design problems (Section 13.6.1); the economic manufacturing quantity model,
which was originally posed as a simple inventory problem, but has been (success-
fully) extended to deal with quality and failure aspects (Section 13.6.2); models of
production systems with buffer capacities, which by definition are suitable to deal
with breakdowns (Section 13.6.3); finally, production and maintenance rate optimi-
zation models, which aim to find the production and preventive/corrective main-
tenance rates of machines so as to minimize the total cost of inventory, production
and maintenance (Section 13.6.4). In Section 13.6.5 we discuss articles which do
not fit in any of the above categories.

13.6.1 Conceptual and Design Models

In a number of articles conceptual models are developed that integrate the pre-
ventive and corrective aspects of the maintenance planning, with aspects of the
production system such as quality, service level and priority and capacity activities.
For instance, Finch and Gilbert (1986) present an integrated conceptual framework
for maintenance and production in which they focus especially on manpower
issues in corrective and preventive work. Weinstein and Chung (1999) test the
hypothesis that integrating the maintenance policy with the aggregate production
planning will significantly influence total cost reduction. It appears that this is the
case in the experimental setting investigated in this study. Lee (2005) considers
production inventory planning, where high level decisions on maintenance (viz.
their effects) are made.
Another group of articles deals with integrating process design, production and
maintenance planning. Decisions on the process system and the initial reliabilities
of the equipment are already made at the design stage. Pistikopoulos et al. (2000)
describe an optimization framework for general multipurpose process models,
which determine both the optimal design as well as the production and main-
tenance plans simultaneously. In this framework, the basic process and system
reliability-maintainability characteristics are determined in the design phase with
the selection of system structure, components, etc. The remaining characteristics
are determined in the operation phase with the selection of appropriate operating
and maintenance policies. Therefore, the optimization of process system effective-
ness depends on the simultaneous identification of optimal design, operation and
maintenance policies having properly accounted for their interactions. In Goel et
al. (2003) a reliability allocation model is coupled with the existing design, produc-
tion, and maintenance optimization framework. The aim is to identify the optimal
size and initial reliability for each unit of equipment at the design stage. They
balance the additional design and maintenance costs with the benefits obtained due
to increased process availability.

13.6.2 EMQ Problems

In the classical economic manufacturing quantity (EMQ) model items are produced
at a constant rate p and the demand rate for the items is equal to d < p. The aim of
the model is to find the production uptime that minimizes the sum of the inventory
holding cost and the average fixed ordering cost. This model is an extension of
the well-known economic order quantity (EOQ) model, the difference being that in
the EOQ model orders are placed when there is no inventory. Note that the EMQ
model is also referred to as the economic production quantity (EPQ) model.
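As a rough numerical illustration of the classical EMQ model, the following Python sketch computes the textbook lot size Q* = sqrt(2Kd / (h(1 - d/p))) and the corresponding production uptime Q*/p. The symbols K (fixed setup cost per production run) and h (holding cost per item per unit time), the function names and the numerical values are assumptions introduced here for illustration only.

```python
import math


def emq_lot_size(setup_cost, holding_cost, d, p):
    """Textbook EMQ lot size Q* = sqrt(2*K*d / (h*(1 - d/p))).

    setup_cost (K) and holding_cost (h) are illustrative symbols;
    d is the demand rate and p the production rate, with d < p.
    """
    if not 0 < d < p:
        raise ValueError("the EMQ model requires 0 < d < p")
    return math.sqrt(2.0 * setup_cost * d / (holding_cost * (1.0 - d / p)))


def production_uptime(lot_size, p):
    """Time needed to produce one lot at the constant rate p."""
    return lot_size / p


# Hypothetical data: K = 100, h = 2, d = 500 items/time, p = 1000 items/time.
lot = emq_lot_size(setup_cost=100.0, holding_cost=2.0, d=500.0, p=1000.0)
print(round(lot, 1), round(production_uptime(lot, 1000.0), 4))
```

When p becomes very large relative to d, the factor (1 - d/p) tends to one and the expression reduces to the familiar EOQ formula sqrt(2Kd/h).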
In the extensive literature on production and inventory problems, it is often
assumed that the production process does not fail, that it is not interrupted and that
it only produces items of acceptable quality. Unfortunately, in practice this is not
always the case. A production process can be interrupted due to a machine break-
down or because the quality of the produced items is not acceptable anymore. The
EMQ model has been extended to deal with these aspects and we thus divide the
literature on EMQ models into two categories. First, we consider EMQ problems
that take into account the quality aspects of the items produced. The second
category of EMQ models analyzes the effects of (stochastic machine) breakdowns
on the lot sizing decision.

EMQ Problems with Quality Aspects

One of the reasons why a production process is interrupted is the (lack of) quality
of the items produced. Obviously, items of inferior quality can only be sold at a
lower revenue or cannot be sold at all. Thus, the production of these items results
in a loss (or a lower profit) for the firm. This type of interruption is usually
modelled as follows. It is assumed that at the start of the production cycle the
production is in an “in-control” state, producing items of acceptable quality. After
some time the production process may then shift to an “out-of-control” state. In
this state a certain percentage of the items produced are defective or of sub-
standard quality. The elapsed time for the process to be in the in-control state,
before the shift occurs, is a random variable. Once a shift to the out-of-control state
has occurred, it is assumed that the production process stays in that state unless it is
discovered by (a periodic) inspection of the process, followed by corrective maintenance.
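The in-control/out-of-control mechanism can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that the time to shift X is exponentially distributed with rate mu, that no inspection takes place during a run of length T, and that a fixed fraction of the items produced after the shift is defective. Under these assumptions the expected out-of-control time in a run is T - (1 - exp(-mu*T))/mu. All names and parameters are hypothetical and are not taken from any of the cited models.

```python
import math


def expected_defectives(run_length, shift_rate, p, defect_fraction):
    """Expected number of defective items in one uninspected production run.

    Assumptions (illustrative only): the shift time X ~ Exp(shift_rate),
    the process stays out of control until the run ends, and a fixed
    defect_fraction of the output produced after the shift is defective.
    """
    # E[min(X, T)] for an exponential X equals (1 - exp(-mu*T)) / mu.
    expected_in_control = (1.0 - math.exp(-shift_rate * run_length)) / shift_rate
    expected_out_of_control = run_length - expected_in_control
    return defect_fraction * p * expected_out_of_control


# Hypothetical run: T = 1, mu = 1, p = 100 items/time, 10% defectives after a shift.
print(expected_defectives(run_length=1.0, shift_rate=1.0, p=100.0, defect_fraction=0.1))
```

Because the expected out-of-control time grows faster than linearly in T, longer runs are penalised more heavily, which is consistent with the finding reported below that quality deterioration leads to lot sizes smaller than the classical EMQ.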

One of the earliest works that consider the problem of finding the optimal lot
size and optimal inspection schedule is the article of Lee and Rosenblatt (1987).
They show that the derived optimal lot size is smaller than the classical EMQ if the
time for the process to be in the in-control state follows an exponential distribution.
Lee and Rosenblatt (1989) have extended this work by assuming that the cost of
restoration is a function of the elapsed time since a shift from an in-control to an
out-of-control state of the production process has occurred. In addition, the possibi-
lity of incurring shortages in the model is allowed.
Many attempts have been made to extend these two models. For instance,
Tseng (1996) assumes that the process lifetime is arbitrarily distributed with an
increasing failure rate. Furthermore, two maintenance actions are considered. The
first is a perfect maintenance action, which restores the system to an as-good-as-new
condition if the process is in the in-control state. If, however, the production
process is in the out-of-control state, it is restored to the in-control state at a given
restoration cost. Secondly, maintenance is always done at the end of a production
cycle to ensure that the process is perfect at the beginning of each production
cycle.
Wang and Sheu (2003) assume that the periodic inspections are imperfect. Two
types of inspection errors are considered, namely (I) the process is declared out-of-
control when it is in-control and (II) the process is declared in-control when it is
out-of-control. They use a Markov chain to jointly determine the production cycle,
process inspection intervals, and maintenance level. Wang (2006) derives some
structural properties for the optimal production/preventive maintenance policy,
under the assumption that the (sufficient) conditions for the optimality of the equal-
interval PM schedule hold. This increases the efficiency of the solution procedure.
The quality characteristics of the product in a production process can be
monitored by an x̄-control chart. The economic design of the x̄-control chart deter-
mines the sample size n, sampling interval h, and the control limit coefficient k
such that the total cost is minimized.
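To indicate how the three design parameters enter such a cost trade-off, the following sketch evaluates the two standard error probabilities of an x̄-chart with k-sigma limits under a normality assumption. The shift size delta (in process standard deviations) and all function names are assumptions made for this illustration; the cost function itself, which differs between the cited models, is not reproduced.

```python
import math


def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def false_alarm_probability(k):
    """Probability that an in-control sample mean falls outside the k-sigma limits."""
    return 2.0 * (1.0 - phi(k))


def detection_probability(n, k, delta):
    """Probability that one sample of size n signals after the process mean
    has shifted by delta process standard deviations (delta is an assumption)."""
    beta = phi(k - delta * math.sqrt(n)) - phi(-k - delta * math.sqrt(n))
    return 1.0 - beta


def average_time_to_signal(n, h, k, delta):
    """Sampling interval h times the expected number of samples until a signal,
    ignoring where within an interval the shift occurred."""
    return h / detection_probability(n, k, delta)


# Hypothetical design: n = 4, h = 1 time unit, k = 3, shift of one standard deviation.
print(false_alarm_probability(3.0), detection_probability(4, 3.0, 1.0))
```

Larger n and smaller h or k speed up detection but raise sampling and false alarm costs, which is the balance an economic design seeks.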
Rahim (1994) develops an economic model for joint determination of produc-
tion quantity, inspection schedule and control chart design for a production process
which is subject to a non-Markovian random shock. In this model it is assumed
that the in-control period follows a general probability distribution with an in-
creasing failure rate and that production ceases only if the process is found to be
out of control during inspection. However, if the alarm turns out to be false the
time for searching an assignable cause is assumed to be zero. Rahim and Ben-Daya
(1998) generalize the model of Rahim (1994) by assuming that the production
stops for a fixed amount of time not only for a true alarm, but also whenever there
is a false alarm during the in-control state. Rahim and Ben-Daya (2001) further
extend the model of Rahim (1994) by looking at the effect of deteriorating pro-
ducts and a deteriorating production process on the optimal production quantity,
inspection schedule and control chart design parameters. The deterioration times
for both product and process are assumed to follow Weibull distributions. It is
assumed that the process is stopped either at failure or at the m-th inspection
interval, whichever occurs first. Furthermore, the inventory is depleted to zero
before a new cycle starts.
Tagaras (1988) develops an economic model that incorporates both process
control and maintenance policies, and simultaneously optimizes their design
parameters. Lam and Rahim (2002) present an integrated model for joint determi-
nation of economic design of x̄-control charts, economic production quantity, pro-
duction run length and maintenance schedules for a deteriorating production
system. In the model of Ben-Daya and Makhdoum (1998) PM activities are also
coordinated with quality control inspections, but they are carried out only when a
preset threshold of the shift rate of the production process is reached.

EMQ Problems with Failure Aspects

A couple of articles study the EMQ model in the presence of random machine
breakdowns or random failures of a bottleneck component. For instance,
Groenevelt et al. (1992a) consider the effects of stochastic machine breakdowns
and corrective maintenance on economic lot sizing decisions. Maintenance of the
machine is carried out after a failure or after a predetermined time interval, which-
ever occurs first. They consider two production control policies. Under the first
policy when the machine breaks down the interrupted lot is not resumed and a new
lot starts only when all available inventory is depleted. In the second policy,
production is immediately resumed after a breakdown if the current on-hand in-
ventory is below a certain threshold level. They show that under these policies
the optimal lot size increases with the failure rate and that, assuming a constant
failure rate and instantaneous repair times, the optimal lot sizes are always larger
than the EMQ. Nevertheless, Groenevelt et al. (1992a) propose to use the EMQ as an
approximation to the optimal production lot size. Chung (2003) provides a better
approximation to the optimal production lot size. Groenevelt et al. (1992b) study
the problem of selecting the economic lot size for an unreliable manufacturing
facility with a constant failure rate and generally distributed repair times. The
quantity of safety stock that is used while the machine is being repaired is
derived from the managerially prescribed service level.
Makis and Fung (1995) present a model for joint determination of the lot size,
inspection interval and preventive replacement time for a production facility that is
subject to random failure. The time that the process stays in the in-control state is
exponentially distributed and once the process is in the out-of-control state, a certain
percentage of the items produced is defective or qualitatively not acceptable.
Periodic inspections are done to review the production process and the time to
machine failure is a generally distributed random variable. Preventive replacement of
the production facility is based on operation time, i.e. after a certain number of
production runs the production facility is replaced.
Some other articles are concerned with PM policies for EMQ models. For
instance, in Srinivasan and Lee (1996) an (S, s) policy is considered, i.e. as soon as
the inventory level reaches S, a preventive maintenance operation is initiated and
the machine becomes as good as new. After the preventive maintenance operation,
production resumes as soon as the inventory level drops down to or below a
prespecified value, s, and the facility continues to produce items until the inventory
level is raised back to S. If the facility breaks down during operation, it is mini-
mally repaired and put back into commission. Okamura et al. (2001) generalize the
model of Srinivasan and Lee (1996) by assuming that both the demand and
the production process are continuous-time renewal counting processes. Further-
more, they suppose that machine breakdown occurs according to a non-homo-
geneous Poisson process. In Lee and Srinivasan (2001) the demand and production
rates are considered constant and a production run begins as soon as the inventory
drops to zero. If the facility fails during operation, it is repaired, but the repair
restores the facility only to the condition it was in before the failure. Lee and
Srinivasan (2001) consider an (S, N) policy, where the control variable N specifies
the number of production cycles the machine should go through before it is set
aside for a preventive maintenance overhaul, which restores the facility to its original
condition.
Recently, Lin and Gong (2006) determined the effect of breakdowns on the
decision of optimal production uptime for items subject to exponential deteriora-
tion under a no-resumption policy. Under this policy, a production run is executed
for a predetermined period of time provided that no machine breakdown has
occurred in this period. Otherwise, the production run is immediately aborted. The
inventories are built up gradually during the production uptime and a new pro-
duction run starts only when all on-hand inventories are depleted. If a breakdown
occurs then corrective maintenance is carried out and this takes a fixed amount of
time. If the inventory build-up during the production uptime is not enough to meet
the demand during the entire period of the corrective maintenance, shortages (lost
sales) will occur. Maintenance restores the production system to the same initial
working conditions.
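The mechanics of a single cycle under a no-resumption policy can be sketched as follows. This is a deliberately simplified illustration and not the model of Lin and Gong (2006): item deterioration is ignored, the breakdown time is passed in as an input rather than sampled from a distribution, and all names and values are hypothetical.

```python
def no_resumption_cycle(planned_uptime, breakdown_time, p, d, repair_time):
    """Return (uptime, inventory at end of uptime, lost sales) for one cycle.

    Simplifying assumptions (illustrative only): no deterioration of stored
    items, a fixed repair_time, and breakdown_time=None when no breakdown
    occurs during the planned production uptime.
    """
    failed = breakdown_time is not None and breakdown_time < planned_uptime
    # The run is aborted at the breakdown; otherwise it lasts as planned.
    uptime = breakdown_time if failed else planned_uptime
    # Inventory builds up at rate p - d while the machine is producing.
    inventory = (p - d) * uptime
    # Demand during repair that cannot be met from stock is lost.
    lost_sales = max(0.0, d * repair_time - inventory) if failed else 0.0
    return uptime, inventory, lost_sales


# Hypothetical cycle: planned uptime 10, breakdown at time 2, p = 5, d = 3, repair takes 4.
print(no_resumption_cycle(10.0, 2.0, p=5.0, d=3.0, repair_time=4.0))
```

In this example the stock accumulated before the breakdown covers only part of the demand during repair, so the remainder is counted as lost sales.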

13.6.3 Deteriorating Production System with Buffer Capacity

In order to reduce the negative effect of a machine breakdown on the production
process, a buffer inventory may be built up during the production uptime (as it is
done in the EMQ model). The role of this buffer inventory is that if an unexpected
failure of the installation occurs then this inventory is used to satisfy the demand
during the period that corrective maintenance is carried out. One of the earliest
works on this subject is Van der Duyn Schouten and Vanneste (1995). In their
model the demand rate is constant and equal to d (units/time) and as long as the
fixed buffer capacity (K) is not reached the installation operates at a constant rate
of p units/time (p>d) and the excess output is stored in the buffer. When the buffer
is full, the installation reduces its speed from p to d. Upon failure corrective
maintenance starts and the installation becomes as good as new. It is possible to
perform preventive maintenance, which takes less time than repair and it also
brings the installation back into the as-good-as-new condition. The decision to start
a preventive maintenance action is not only based on the condition of the installa-
tion, but also on the level of the buffer. The criterion is to minimize the average
inventory level and the average number of backorders. Since the optimal policy is
difficult to implement, the authors develop suboptimal (n, N, k) control-limit
policies. Under this policy if the buffer is full, preventive maintenance is under-
taken at age n. If the buffer is not full, but it has at least k items, preventive main-
tenance is undertaken at age N. Maintenance is never performed unless the system
has at least k items. The objective is to obtain the best values for n, N and k.
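The (n, N, k) rule described above can be written as a short decision function. The sketch below is one possible reading of the verbal description: "at age n" is interpreted as "once the age has reached n", and the argument names, including the buffer capacity K, are chosen here for illustration.

```python
def start_preventive_maintenance(age, buffer_level, n, N, k, K):
    """Decide whether to start preventive maintenance under an (n, N, k) rule.

    Interpretation (an assumption): maintenance starts once the age reaches
    the relevant threshold. K is the buffer capacity and k <= K.
    """
    if buffer_level >= K:
        # Full buffer: maintain from age n onwards.
        return age >= n
    if buffer_level >= k:
        # Buffer not full but holding at least k items: maintain from age N onwards.
        return age >= N
    # Fewer than k items in the buffer: never start preventive maintenance.
    return False


# Hypothetical parameters: n = 5, N = 8, k = 4, buffer capacity K = 10.
print(start_preventive_maintenance(age=6, buffer_level=10, n=5, N=8, k=4, K=10))
print(start_preventive_maintenance(age=6, buffer_level=6, n=5, N=8, k=4, K=10))
```

One would typically expect n to be no larger than N, since a full buffer offers the most protection against demand during maintenance, but the function does not enforce this.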
Iravani and Duenyas (2002) extend the above model by assuming a stochastic
demand and production process. Demand that cannot be met from the inventory is
lost and a penalty is incurred. Moreover, it is assumed that the production charac-
teristics of the system change with usage and the more the system deteriorates the
more its production rate decreases and the more its maintenance operation becomes
time-consuming and costly. In a recent article, Yao et al. (2005) assume that the
production system can produce at any rate from 0 (idle) to its maximum rate if it is
in working state. Upon failure corrective maintenance is performed immediately to
restore the system to the working state. Preventive maintenance actions can be
performed as well. Both the failure process and the times to complete corrective/
preventive maintenance are assumed to be stochastic. Thus, in addition to the direct
cost of performing corrective/preventive maintenance the non-negligible main-
tenance completion time leads to an indirect cost of lost production capacity due to
system unavailability.
Kyriakidis and Dimitrakos (2006) study an infinite-state generalisation of Van
der Duyn Schouten and Vanneste (1995). The deterioration process of the installa-
tion is considered nonstationary, i.e. the transition probabilities depend not only on
the working conditions of the installation but on its age and buffer level as well.
Furthermore, the cost structure is more general than in Van der Duyn Schouten and
Vanneste (1995) since it includes operating and maintenance costs of the installa-
tion as well as storage and shortage costs. It is assumed that the operating costs of
the installation depend on both the working condition and the age of the installation.
Another way of maintaining the buffer inventory is according to an (S, s)
policy, i.e. the system stops production when the buffer inventory reaches S and the
production restarts when the inventory drops to s. This idea is used by Das and
Sarkar (1999). They assume that exogenous demand for the product arrives
according to a Poisson process. Back-orders are not allowed. The unit production
time, the time between failures, and the repair and maintenance times are assumed
to have general probability distributions. Preventive maintenance decisions are
made only at the time that the buffer inventory reaches S, and they depend on both
the current inventory level and the number of items produced since the last
repair/maintenance operation. The objective is to determine when to perform
preventive maintenance on the system in order to improve the system performance.
A different approach to integrated maintenance/production
scheduling with buffer capacity is presented in Chelbi and Ait-Kadi (2004). They
assume that preventive maintenance actions are performed regularly (every T time
units) and that the durations of corrective and preventive maintenance
actions are random. The proposed strategy consists of building up a buffer stock
whose size S covers at least the average consumption during the repair periods
following breakdowns within the period of length T. When the production unit has
to be stopped to undertake the planned preventive maintenance actions, a certain
level of buffer stock must still be available in order to avoid stoppage of the
subsequent assembly line. The two decision variables are: the period T at which
preventive maintenance must be performed, and the level S of the buffer stock.
A recent article of Kenne et al. (2006) considers the effects of both preventive
maintenance policies and machine age on optimal safety stock levels. Significant
stock levels hedge against the more frequent random failures that occur as the
machine ages. The objective of the study is to determine when to perform preventive
maintenance on the machine and to find the level of the safety stock to be maintained.

13.6.4 Production and Maintenance Rate Optimization

Integrated production and maintenance planning can also be achieved by
optimizing the production and maintenance rates of the machines under considera-
tion. In this line of research we mention the work of Gharbi and Kenne (2000,
2005), Kenne and Boukas (2003) and Kenne et al. (2003). In these articles a
multiple-identical-machine manufacturing system with random breakdowns, re-
pairs and preventive maintenance activities is studied. The objective of the control
problem is to find the production and the preventive maintenance rates of the ma-
chines so as to minimize the total cost of inventory/backlog, repair and preventive
maintenance.

13.6.5 Miscellaneous

Finally, we list some articles that deal with integrated maintenance and production
planning, but whose modelling approaches or problem settings differ from those
of the articles in the previous categories. For instance, the
model presented in Ashayeri et al. (1996) deals with the scheduling of production
and preventive maintenance jobs on multiple production lines, where each line has
one bottleneck machine. The model indicates whether or not to produce a certain
item in a certain period on a certain production line.
In Kianfar (2005) the manufacturing system is composed of one machine that
produces a single product. The failure rate of the machine is a function of its age
and the demand for the manufactured product is time-dependent; the demand rate
depends on the level of advertising of the product. The objective is to maximize the
expected discounted total profit of the firm over an infinite time horizon.
Sarper (1993) considers the following problem. Given a fixed repair/main-
tenance capacity, how many of each of the low demand large items (LDLIs) should
be started so that there are no incomplete jobs at the end of the production period?
The goal is to ensure that the portion of the total demand started will be completed
regardless of the amount by which some machines may stay idle due to insufficient
work. A mixed-integer model is presented to determine what portion of the
demand for each LDLI type should be rejected as lost sales so that the remaining
portion can be finished completely.

13.7 Trends and Open Areas

Initial publications on models in the production and maintenance area date from
the end of the 1980s (Lee and Rosenblatt 1987). Since that time many papers have
been published with the majority dating from the 1990s and the new millennium.
The most popular area in this review is also the oldest one, i.e. integrated
models for maintenance and production. Many papers still appear in that
area and the models become more and more complex, with more decision
parameters and more aspects.
The topics on opportunity maintenance and scheduling maintenance in line
with production have also been popular, but maybe more in the past than today.
We expected to find more studies on specific business sectors, but found a
substantial number only for the airline sector. That sector seems to be the most popular as it has both
a lot of interaction between maintenance and production as well as high costs
involved. In the other sectors, we do see the interaction, but perhaps more papers
will be published in the future. The other sections are interesting but small in terms
of papers published.
In general, the demands on maintenance become higher as the public and com-
panies are less likely to accept failures, bad quality products or non-performance.
Yet at the same time society’s inventory of capital goods in western societies is
both increasing and ageing. This is very much the case for roads, railways,
electric power generation, transport, and aircraft. As there are continuous pres-
sures on maintenance budgets we do foresee the need for research supporting
maintenance and production decisions, also because decision support software is
gaining in popularity and more data becomes electronically available. A theory is
therefore needed for such decision support systems. As several case studies have
taught us that practical problems have many complex aspects, there is a high need
for more theory that can help us to understand and improve complex maintenance

13.8 Conclusions
In this chapter we have given an overview of planning models for production and
maintenance. These models are classified on the basis of the interactions between
maintenance and production. First, although maintenance is intended to allow
production, production is often stopped during maintenance. The question arises
when to do maintenance such that production is least affected. In order to answer
this question planning models should take into account the needs of production.
These needs are business sector specific and thus applications of planning models
in different areas have been considered. In comparison with other specific sectors,
much work has been done on modelling maintenance for the airline sector. Second,
maintenance itself can also be seen as a production process which needs to be
planned. Models for maintenance production planning mainly address allocation
and manpower determination problems. Finally, maintenance also affects the pro-
duction process since it takes capacity away. In production processes maintenance
is mostly initiated by machine failures or low quality items. Maintenance and
production should therefore be planned in an integrated way to deal with these
aspects. Indeed, integrated maintenance and production planning models determine
optimal lot sizes while taking into account failure and quality aspects. We observe
continuing attention to such models, which take more and more “real world” as-
pects into account.
Although many articles have been written on the interaction between production
and maintenance, a careful reader will detect several open issues in this review. The
theory developed thus far is far from complete and any real application is likely to
reveal many more open issues.

13.9 Acknowledgements
The authors would like to thank Georgios Nenes, Sophia Panagiotidou, and the
editors for their helpful suggestions and comments.

13.10 References
Al-Zubaidi H, Christer A, (1997) Maintenance manpower modelling for a hospital building
complex. European Journal of Operational Research 99:603–618
Ashayeri J, Teelen A, Selen W, (1996) A production and maintenance planning model for
the process industry. International Journal of Production Research 34: 3311–3326
Bäckert W, Rippin D, (1985) The determination of maintenance strategies for plants subject
to breakdown. Computers and Chemical Engineering 9(2):113–126
Ben-Daya M, Makhdoum M, (1998) Integrated production and quality model under various
preventive maintenance policies. Journal of the Operational Research Society 49(8):
Ben-Daya M, Rahim M, (2001) Integrated production, quality & maintenance models: an
overview. In Rahim M, Ben-Daya M, (eds), Integrated models in production
planning, inventory, quality, and maintenance, Kluwer Academic Publishers, 3–28
Bengü G, (1994) Telecommunications systems maintenance. Computers and Operations
Research 21:337–351
Budai G, Huisman D, Dekker R, (2006) Scheduling preventive railway maintenance
activities. Journal of the Operational Research Society 57:1035–1044
Cassady C, Pohl E, Murdock W, (2001) Selective maintenance modeling for industrial
systems. Journal of Quality in Maintenance Engineering 7(2):104–117
Charles A, Floru I, Azzaro-Pantel C, Pibouleau L, Domenech S, (2003) Optimization of
preventive maintenance strategies in a multipurpose batch plant: application to
semiconductor manufacturing. Computers and Chemical Engineering 27:449–467
Chelbi A, Ait-Kadi D, (2004) Analysis of a production/inventory system with randomly
failing production unit submitted to regular preventive maintenance. European Journal
of Operational Research 156:712–718
Cheung B, Chow K, Hui L, Yong A, (1999) Railway track possession assignment using
constraint satisfaction. Engineering Applications of AI 12(5):599–611
Cheung K, Hui C, Sakamoto H, Hirata K, O'Young L, (2004) Short-term site-wide
maintenance scheduling. Computers and Chemical Engineering 28:91–102
Cho D, Parlar M, (1991) A survey of maintenance models for multi-unit systems. European
Journal of Operational Research 51:1–23
Chung K, (2003) Approximations to production lot sizing with machine breakdowns.
Computers & Operations Research 30:1499–1507
Cobb R, (1995) Modeling aircraft repair turntime: simulation supports maintenance
marketing. Journal of Air Transport Management 2:25–32
Cohn A, Barnhart C, (2003) Improving crew scheduling by incorporating key maintenance
routing decisions. Operations Research 51(3):387–396
Dagpunar J, (1996) A maintenance model with opportunities and interrupt replacement
options. Journal of the Operational Research Society 47:1406–1409
Das T, Sarkar S, (1999) Optimal preventive maintenance in a production inventory. IIE
Transactions 31:537–551
Dedopoulos L, Shah N, (1995) Preventive maintenance policy optimisation for multipurpose
plant equipment. Computers and Chemical Engineering 19:693–698
Dekker R, Budai G, (2002) An overview of techniques used in planning railway
infrastructure maintenance. In Geraerds W, Sherwin D, (eds), Proceedings of
IFRIMmmm (maintenance management and modelling) conference, Vaxjo University,
Sweden, 1–8
Dekker R, Dijkstra M, (1992) Opportunity-based age replacement: exponentially distributed
times between opportunities. Naval Research Logistics 39:175–190
Dekker R, Plasmeijer R, (2001) Multi-parameter maintenance optimisation via the marginal
cost approach. Journal of the Operational Research Society 52:188–197
Dekker R, Smeitink E, (1991) Opportunity-based block replacement, European Journal of
Operational Research 53:46–63
Dekker R, Smeitink E, (1994) Preventive maintenance at opportunities of restricted
duration. Naval Research Logistics 41:335–353
Dekker R, van Rijn C, (1996) Prompt - a decision support system for opportunity based
preventive maintenance. In Özekici S, (ed) Reliability and Maintenance of Complex
Systems, NATO ASI series 154:530–549
Dekker R, Plasmeijer R, Swart J, (1998a) Evaluation of a new maintenance concept for the
preservation of highways. IMA Journal of Mathematics applied in Business and Industry
Dekker R, van der Meer J, Plasmeijer R, Wildeman R, (1998b) Maintenance of light-
standards - a case-study. Journal of the Operational Research Society 49:132–143
Den Hertog D, van Zante-de Fokkert J, Sjamaar S, Beusmans R, (2005) Optimal working
zone division for safe track maintenance in the Netherlands. Accident Analysis and
Prevention 37:890–893
Dijkstra M, Kroon L, Salomon M, van Nunen J, van Wassenhoven L, (1994) Planning the
size and organization of KLM's aircraft maintenance personnel. Interfaces 24:47–58
Edwards D, Holt G, Harris F, (2002) Predicting downtime costs of tracked hydraulic
excavators operating in the UK opencast mining industry. Construction Management &
Economics 20:581–591
Esveld C, (2001) Modern Railway Track. MRT-Productions, Zaltbommel, The Netherlands
Feo T, Bard J, (1989) Flight scheduling and maintenance base planning. Management
Science 35(12):1415–1432
Finch B, Gilbert J, (1986) Developing maintenance craft labor efficiency through an
integrated planning and control system: a prescriptive model. Journal of Operations
Management 6(4):449–459
Frost D, Dechter R, (1998) Optimizing with constraints: a case study in scheduling
maintenance of electric power units. Lecture Notes in Computer Science 1520:469–488
Geraerds W, (1985) The cost of downtime for maintenance: preliminary considerations.
Maintenance Management International 5:13–21
Gharbi A, Kenne J, (2000) Production and preventive maintenance rates control for a
manufacturing system: an experimental design approach. International Journal of
Production Economics 65:275–287
Gharbi A, Kenne J, (2005) Maintenance scheduling and production control of multiple-
machine manufacturing systems. Computers and Industrial Engineering 48:693–707
Goel H, Grievink J, Weijnen M, (2003) Integrated optimal reliable design, production, and
maintenance planning for multipurpose process plant. Computers and Chemical
Engineering 27:1543–1555
Gopalan R, Talluri K, (1998) Mathematical models in airline schedule planning: a survey.
Annals of Operations Research 76(1): 155–185
Groenevelt H, Pintelon L, Seidmann A, (1992a) Production batching with machine
breakdowns and safety stocks. Operations Research 40(5):959–971
Groenevelt H, Pintelon L, Seidmann A, (1992b) Production lot sizing with machine
breakdowns. Management Science 38(1):104–123
Haghani A, Shafahi Y, (2002) Bus maintenance systems and maintenance scheduling:
model formulations and solutions. Transportation Research Part A 36:453–482
Higgins A, (1998) Scheduling of railway maintenance activities and crews. Journal of the
Operational Research Society 49:1026–1033
Improverail (2002) (accessed
September 26, 2006)
Iravani S, Duenyas I, (2002) Integrated maintenance and production control of a
deteriorating production system. IIE Transactions 34:423–435
Kenne J, Boukas E, (2003) Hierarchical control of production and maintenance rates in
manufacturing systems. Journal of Quality in Maintenance Engineering 9:66–82
Kenne J, Boukas E, Gharbi A, (2003) Control of production and corrective maintenance
rates in a multiple-machine, multiple-product manufacturing system. Mathematical and
Computer Modelling 38:351–365
Kenne J, Gharbi A, Beit M, (2006) Age-dependent production planning and maintenance
strategies in unreliable manufacturing systems with lost sale. Accepted for publication in
European Journal of Operational Research 178(2):408–420
Kianfar F, (2005) A numerical method to approximate optimal production and maintenance
plan in a flexible manufacturing system. Applied Mathematics and Computation
Knight P, Jullian F, Jofre L, (2005) Assessing the “size” of the prize: developing business
cases for maintenance improvement projects. Proceedings of the International Physical
Asset Management Conference, 284–302
Kralj B, Petrovic R, (1988) Optimal preventive maintenance scheduling of thermal
generating units in power systems – a survey of problem formulations and solution
methods. European Journal of Operational Research 35:1–15
Kyriakidis E, Dimitrakos T, (2006) Optimal preventive maintenance of a production system
with an intermediate buffer. European Journal of Operational Research 168:86–99
Lam K, Rahim M, (2002) A sensitivity analysis of an integrated model for joint
determination of economic design of x̄-control charts, economic production quantity
and production run length for a deteriorating production system. Quality and Reliability
Engineering International 18:305–320
Langdon W, Treleaven P, (1997) Scheduling maintenance of electrical power transmission
networks using genetic programming. In Warwick K, Ekwue A, Aggarwal A, (eds),
Artificial intelligence techniques in power systems, Institution of Electrical Engineers,
Stevenage, UK, 220–237
Lee H, (2005) A cost/benefit model for investments in inventory and preventive
maintenance in an imperfect production system. Computers and Industrial Engineering
Lee H, Rosenblatt M, (1987) Simultaneous determination of production cycle and inspection
schedules in a production system. Management Science 33:1125–1137
Lee H, Rosenblatt M, (1989) A production and maintenance planning model with restoration
cost dependent on detection delay. IIE Transactions 21(4):368–375
Lee H, Srinivasan M, (2001) A production/inventory policy for an unreliable machine. In
Rahim M, Ben-Daya M, (eds) Integrated models in production planning, inventory,
quality, and maintenance, Kluwer Academic Publishers, 79–94
Lin G, Gong D, (2006) On a production-inventory system of deteriorating items subject to
random machine breakdowns with a fixed repair time. Mathematical and Computer
Modelling 43:920–932
Makis V, Fung J, (1995) Optimal preventive replacement, lot sizing and inspection policy
for a deteriorating production system. Journal of Quality in Maintenance Engineering,
1(4): 41–55
Moudani WE, Mora-Camino F, (2000) A dynamic approach for aircraft assignment and
maintenance scheduling by airlines. Journal of Air Transport Management 6:233–237
Nahmias S, (2005) Production and operations analysis (5th ed). McGraw-Hill, Boston
Okamura H, Dohi T, Osaki S, (2001) Computation algorithms of cost-effective EMQ
policies with PM. In Rahim M, Ben-Daya M, (eds) Integrated models in production
planning, inventory, quality, and maintenance, Kluwer Academic Publishers, 31–65
Pistikopoulos E, Vassiliadis C, Papageorgiou L, (2000) Process design for maintainability:
an optimization approach. Computers and Chemical Engineering 24:203–208
Rahim M, (1994) Joint determination of production quantity, inspection schedule, and
control chart design. IIE Transactions, 26(6), 2–11
Rahim M, Ben-Daya M, (1998) A generalized economic model for joint determination of
production run, inspection schedule and control chart design. International Journal of
Production Research 36:277–289
Rahim M, Ben-Daya M, (2001) Joint determination of production quantity, inspection
schedule, and quality control for an imperfect process with deteriorating products.
Journal of the Operational Research Society 52(12):1370–1378
Rosa L, Feiring B, (1995) Layout problem for an aircraft maintenance company tool room.
International Journal of Production Economics 40:219–230
Rose G, Bennett D, (1992) Locating and sizing road maintenance depots. European Journal
of Operational Research 63:151–163
Sarper H, (1993) Scheduling for the maintenance of completely processed low-demand large
items. Applied Mathematical Modelling 17:321–328
Shenoy D, Bhadury B, (1993) MRSRP – a tool for manpower resources and spares
requirements planning. Computers and Industrial Engineering 24:421–439
Srinivasan M, Lee H, (1996) Production-inventory systems with preventive maintenance.
IIE Transactions 28:879–890
Sriram C, Haghani A, (2003) An optimization model for aircraft maintenance scheduling
and re-assignment. Transportation Research Part A 37:29–48
Tagaras G, (1988) An integrated cost model for the joint optimization of process control and
maintenance. Journal of the Operational Research Society 39(8):757–766
Tan J, Kramer M, (1997) A general framework for preventive maintenance optimization in
chemical process operations. Computers and Chemical Engineering 21(12):1451–1469
Tseng S, (1996) Optimal preventive maintenance policy for deteriorating production
systems. IIE Transactions 28:687–694
Van der Duyn Schouten F, Vanneste S, (1995) Maintenance optimization of a production
system with buffer capacity. European Journal of Operational Research 82:323–338
Van der Duyn Schouten F, van Vlijmen B, Vos de Wael S, (1998) Replacement policies for
traffic control signals. IMA Journal of Mathematics Applied in Business and Industry
Van Dijkhuizen G, (2000) Maintenance grouping in multi-setup multi-component
production systems. In Ben-Daya M, Duffuaa M, Raouf A, (eds) Maintenance,
Modeling and Optimization, Kluwer Academic Publishers, 283–306
Van Zante-de Fokkert J, den Hertog D, van den Berg F, Verhoeven J, (2001) Safe track
maintenance for the Dutch Railways, Part II: Maintenance schedule. Technical report,
Tilburg University, the Netherlands
Vatn J, Hokstad P, Bodsberg L, (1996) An overall model for maintenance optimization.
Reliability Engineering and System Safety 51:241–257
Vaurio J, (1999) Availability and cost functions for periodically inspected preventively
maintained units. Reliability Engineering and System Safety 63:133–140
Wang C, (2006) Optimal production and maintenance policy for imperfect production
systems. Naval Research Logistics 53:151–156
Wang C, Sheu S, (2003) Determining the optimal production-maintenance policy with
inspection errors: using a Markov chain. Computers & Operations Research 30:1–17
Weinstein L, Chung C, (1999) Integrating maintenance and production decisions in a
hierarchical production planning environment. Computers & Operations Research
Wijnmalen D, Hontelez A, (1997) Coordinated condition-based repair strategies for
components of a multi-component maintenance system with discounts. European
Journal of Operational Research 98:52–63
Yan S, Yang T, Chen H, (2004) Airline short-term maintenance manpower supply planning.
Transportation Research Part A 38:615–642
Yao X, Xie X, Fu M, Marcus S, (2005) Optimal joint preventive maintenance and
production policies. Naval Research Logistics 52:668–681

Delay Time Modelling

Wenbin Wang

14.1 Introduction
In this chapter we present a modelling tool that was created to model the problems
of inspection maintenance and planned maintenance interventions, namely delay
time modelling (DTM). This concept provides a modelling framework readily
applicable to a wide class of actual industrial maintenance problems of assets in
general, and inspection problems in particular.
The concept of the delay time was first mentioned by Christer (1976) in the context
of building maintenance. It was not until 1984 that the concept was first applied to
an industrial maintenance problem (Christer and Waller 1984). Since then, a series
of research papers has appeared on the theory and applications of delay
time modelling of industrial asset inspection problems; see Christer (1999) for a
detailed review. The delay time concept itself is simple: it defines the failure
process of an asset as a two-stage process. The first stage is the normal operating
stage, from new to the point at which a hidden defect first becomes identifiable. The
second stage is the failure delay time, from the point at which the defect becomes
identifiable to failure. It is the existence of such a failure delay time that provides
the opportunity for preventive maintenance to be carried out to remove or rectify the
identified defects before failure. With appropriate modelling of the durations of these
two stages, optimal inspection intervals can be identified to optimise a criterion
function of interest.
The delay time concept is similar in definition to the well-known potential
failure (PF) interval in reliability centred maintenance (Moubray 1997). It is noted,
however, that two differences between these definitions mark a fundamental
difference in modelling maintenance inspection of assets. First, the delay time is
random in Christer’s definition, while the PF interval is assumed to be constant.
Second, the initial point of a defect is very important to the set-up of
an appropriate inspection interval, but is ignored by Moubray. Moreover, Moubray
did not provide any means of modelling the inspection practice, while DTM
provides a rich source of modelling methodologies ranging from the concept to
practical solutions.
Asset inspection modelling has long been researched by many others. Among
these models, the one proposed by Barlow and Proschan (1965) is perhaps the most
famous. They consider a unit subject to inspections as follows. The unit is
inspected at prespecified times, where each inspection is executed perfectly and
instantaneously. The policy terminates with an inspection which detects the unit
failure. This implies that the unit may have already failed during an operation
interval between inspections, but the failure can only be identified at the forthcoming
inspection. Various modifications and extensions to Barlow and Proschan’s model
have been proposed; see, for example, Thomas et al. (1991), Luss (1983),
Abdel-Hameed (1995), Kaio and Osaki (1989) and McCall (1965). The delay time
inspection model differs from the classical Barlow and Proschan model on
two counts. First, a failure is identified immediately when it occurs. This is
perhaps more rational than Barlow and Proschan’s model since, if the system
fails, it may have stopped operating and the failure should be observed immediately
by the operators. Second, there is a failure delay time in DTM which characterises the
abnormal deterioration before failure, which is not defined in Barlow and
Proschan’s model. It is noted, however, that for a certain class of equipment such as
fire extinguishers, Barlow and Proschan’s model is appropriate.
To clarify the objective of the type of inspection modelling we are concerned
with here, consider a plant item that is inspected every period T (say, weeks or
months), with repair of failures undertaken as they arise. The inspection
consists of a checklist of activities to be undertaken, and a general inspection of
the operational state of the plant. Any defect identified leads to immediate repair,
and the objective of the inspection is to minimise operational downtime. Other
objectives could be considered, for example cost, availability or output. There are
other types of inspection activities, such as condition monitoring and preventive
maintenance, which are introduced and discussed elsewhere in this book; for
now we focus on the inspection practice outlined above using the delay time
inspection modelling technique.
This chapter is organised as follows. Section 14.2 gives an outline of the delay
time concept. Sections 14.3 and 14.4 introduce two delay time inspection models,
for a complex system and for a single component respectively. Section 14.5 discusses
the parameter estimation techniques used in DTM. Section 14.6 highlights
extensions to the basic delay time model and future research in DTM, and Section 14.7
concludes the chapter.

14.2 The Delay Time Concept

We are interested in the relationship between the performance of assets and
inspection intervention, and to capture this, the conventional reliability analysis of
time to first failure, or time between failures, requires enrichment. Consider a
repairable item of an asset. It could be, say, a component, a machine, a building, or
an integrated set of machines forming a production line, but viewed by management
as a unit. For now we take a complex system of multiple components as an
example; the case of a single component is considered in Section 14.4. The
interaction between inspection and equipment performance may be captured using
the delay time concept presented below.
Let the item of an asset be maintained on a breakdown basis. The time history
of breakdown or failure events is a random series of points; see Figure 14.1. For
any one of these failures, the likelihood is that, had the item been inspected at some
point just prior to failure, the inspection could have revealed a defect which, though
the item was still working, would ultimately lead to a failure. Signals of such a defect
include excessive vibration, unusual noise, excessive heat, surface staining, smell,
reduced output, increased quality variability, etc. The first instance where the
presence of a defect might reasonably be expected to be recognised by an inspection,
had it taken place, is called the initial point u of the defect, and the time h to failure
from u is called the delay time of the defect; see Figure 14.2. Had an inspection taken
place in (u, u + h), the presence of a defect could have been noted and corrective
actions taken prior to failure. Given that a defect arises, its delay time represents a
window of opportunity for preventing a failure. Clearly, the delay time h is a
characteristic of the item concerned, the type of defect, the nature of any inspection,
and perhaps the person inspecting. For example, if the item was a vehicle, and the
maintenance practice was to respond when the driver reported a problem, then there is
in effect a form of continuous monitoring inspection of cab-related aspects of the
vehicle, with a reasonably long delay time consistent with the rate of deterioration of
the defect. However, should the exhaust collapse because a support bracket was
corroded through, the likely warning period for the driver, the delay time, would be
virtually zero, since he would not normally be expected to look under the vehicle.
At the same time, had an inspection been undertaken by a service mechanic, the
delay time may have been measured in weeks or months. Had the exhaust
collapsed because securing bolts became loose before falling out, then the driver
could have had a warning period of excessive vibration, and perhaps noise, and the
defects would have had a driver-related delay time measured in days or weeks.

● ● ● ● ● ● ●
Figure 14.1. Failure points ‘●’

○ ●
u failure
Figure 14.2. The delay time for a defect
To see why the delay time concept is of use, consider Figure 14.3, which incorporates
the same failure point pattern as Figure 14.1 along with the initial points associated
with each failure arising under a breakdown system. Had an inspection taken place
at point (A), one defect could have been identified and the seven failures could
have been reduced to six. Likewise, had inspections taken place at points (B) and
(A), four defects could have been identified and the seven failures could have
been reduced to three. Figure 14.3 demonstrates that, provided it is possible to
model the way defects arise, that is, the rate of arrival of defects λ(u), and their
associated delay time h, the delay time concept can capture the relationship
between the inspection frequency and the number of plant failures.
We are assuming for now that inspections are perfect, that is, a defect is recognised
if, and only if, it is there, and is removed by corrective action. Delay time
modelling is still possible if these assumptions are not valid, but this more complex
case is discussed in Section 14.3.2.

○ ○ ● ○ ● ● ○ ○● ● ○○ ● ●
Figure 14.3. ‘○’ initial points; ‘●’ failure points

14.3 Delay Time Models for Complex Plant

14.3.1 Perfect Inspections

A complex plant, or multi-component plant, is one where a large number of failure
modes arise, and the correction of one defect or failure has nominal impact in the
steady state upon the overall plant failure characteristics. Consider the following
basic complex plant maintenance modelling scenario where:

1. An inspection takes place every $T$ time units, costs $c_s$ units and requires
   $d_s$ time units, where $d_s \ll T$.
2. Inspections are perfect in that all (and only) defects present are identified.
3. Defects identified are repaired during the inspection period.
4. Defects arise according to a homogeneous Poisson process (HPP) with the
   rate of occurrence of defects, $\lambda$, per unit time.
5. The delay time, $H$, of a random defect is described by a pdf $f(h)$ and cdf
   $F(h)$, and is independent of the initial point $U$.
6. Failures are repaired immediately at an average cost $c_f$ and downtime
   $d_f$.
7. The plant has operated sufficiently long since new to be considered effectively
   in a steady state.
8. Defects and failures only arise whilst plant is operating.

These assumptions characterise the simplest non-trivial inspection maintenance
problem (Christer et al. 1995) and would, of course, only be agreed in any particular
case after careful analysis and investigation of the specific situation. We now
proceed to construct the mathematical model of the relationship between T and an
objective function of interest.
From assumptions 1–4, it follows that the numbers of system failures over successive
inspection intervals are independent and identically distributed, so we can simply
study the behaviour of the failure process over one interval, say the first interval
$[0, T)$. If we take the expected downtime per unit time, $D(T)$, as
our objective function, the relationship between $T$ and $D(T)$ can be
established directly by using the renewal reward theorem (Ross 1983) as

$$ D(T) = \lim_{t \to \infty} \frac{E(\text{Downtime over } t)}{t} = \frac{d_f E[N_f(T)] + d_s}{T + d_s} \tag{14.1} $$

where $E[N_f(T)]$ is the expected number of failures within $[0, T)$. Clearly, if
$E[N_f(T)]$ is available, $D(T)$ can be readily calculated.
It can be shown that the failure process shown in Figure 14.3 is a marked
Poisson process (Taylor and Karlin 1998), with the delay time $h$ as the mark. It
has been proved that this failure process over $[0, T)$ is a nonhomogeneous Poisson
process (NHPP) (Taylor and Karlin 1998; Christer and Wang 1995). To derive the
rate of occurrence of failures (ROCOF), $\nu(t)$, for this NHPP within $[0, T)$, we
start by deriving the expected number of failures within $[0, T)$. Since the
expected number of defects arriving within $[t, t + \delta t)$, $0 \le t < T$, is
$\lambda \delta t$, the expected number of failures within $[0, T)$ caused by these
defects is $\lambda F(T - t) \delta t$. Integrating over $t$ from 0 to $T$ and changing
the variable of integration from $t$ to $T - t$, we have

$$ E[N_f(T)] = \int_0^T \lambda F(T - t)\,dt = \int_0^T \lambda F(t)\,dt \tag{14.2} $$

Differentiating Equation 14.2 with respect to $T$, and then writing $t$ for $T$, we have

$$ \nu(t) = \lambda F(t) \tag{14.3} $$

The original model developed in Christer and Waller (1984) for Equation 14.2
uses a different approach, but leads to the same result.
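
As a minimal numerical sketch of Equations 14.1 and 14.2, the following Python code evaluates the expected number of failures by the trapezoidal rule and searches a grid of inspection intervals for the smallest expected downtime per unit time. The function names, the exponential delay time distribution and all parameter values are assumptions made for illustration; they are not taken from the chapter.

```python
import math

def expected_failures(lam, F, T, n=2000):
    """Equation 14.2: integral over [0, T] of lam * F(t) dt (trapezoidal rule)."""
    step = T / n
    total = 0.5 * (F(0.0) + F(T)) + sum(F(k * step) for k in range(1, n))
    return lam * step * total

def downtime_per_unit_time(lam, F, T, d_f, d_s):
    """Equation 14.1: expected downtime per unit time for inspection interval T."""
    return (d_f * expected_failures(lam, F, T) + d_s) / (T + d_s)

# Hypothetical parameter values, for illustration only.
lam = 2.0    # defect arrival rate (defects per day)
mu = 0.5     # rate of an assumed exponential delay time (mean 2 days)
d_f = 0.1    # downtime per failure (days)
d_s = 0.05   # downtime per inspection (days)

def F(h):
    """cdf of the assumed exponential delay time."""
    return 1.0 - math.exp(-mu * h)

grid = [0.1 * k for k in range(1, 101)]   # candidate intervals, 0.1 to 10 days
best_T = min(grid, key=lambda T: downtime_per_unit_time(lam, F, T, d_f, d_s))
print(f"best T on the grid: {best_T:.1f} days")
```

For an exponential delay time the integral in Equation 14.2 has the closed form $\lambda [T - (1 - e^{-\mu T})/\mu]$, which offers a simple check on the numerical integration.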

14.3.2 Imperfect Inspections

Section 14.3.1 outlined a basic delay time model under perfect inspections. It is
established under a set of assumptions, some of which may not be valid in
practical situations. These assumptions greatly simplify the mathematics involved
but also restrict a wider use of the models developed. Perhaps the most restrictive
assumption is that of perfect inspections. In almost all the case studies conducted
using the delay time concept, none supported the perfect inspection
assumption. The other assumption of concern is the HPP for defect arrival in
the case of a complex system. One would naturally expect that, as the system ages,
there could be more defect arrivals than in a younger system. In this section, we
introduce one delay time model that relaxes the perfect inspection assumption. The
delay time model using an NHPP is presented in Christer and Wang (1995) and
Wang and Christer (2003). These models are mainly developed for complex systems,
but a non-perfect inspection single component delay time model can also be
developed along a similar line (Baker and Wang 1991).
All the assumptions proposed in Section 14.3.1 hold except that of perfect
inspection. Assume for now that if a defect is present at an inspection, then
there is a probability r that the defect is identified. This implies that there is a
probability 1 − r that the defect goes unnoticed. Figure 14.4 depicts such a
failure process.

Two defects were not identified

○ ○ ● ○ ○ ○● ● ● ○○ ●
A B C time
Figure 14.4. Failure process of a multi-component system subject to three non-perfect
inspections at points A, B, and C; two potential failures were removed and two missed
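
The kind of process depicted in Figure 14.4 can be explored by simulation. The sketch below is a hedged illustration, not the analytical model developed in this section: it assumes HPP defect arrivals, an exponential delay time, instantaneous inspections and repairs, and independent detection with probability r at each inspection. The function name `simulate_failures_per_interval` and all parameter values are hypothetical.

```python
import math
import random

def simulate_failures_per_interval(lam, mean_delay, T, r,
                                   n_intervals=20000, burn_in=50, seed=1):
    """Estimate the long-run mean number of failures per inspection interval."""
    rng = random.Random(seed)
    horizon = n_intervals * T
    start = burn_in * T                       # discard the early, non-stationary part
    failures = 0
    u = rng.expovariate(lam)                  # initial point of the first defect
    while u < horizon:
        failure_time = u + rng.expovariate(1.0 / mean_delay)
        k = math.floor(u / T) + 1             # index of the first inspection after u
        detected = False
        while k * T < failure_time:           # inspections inside the delay time
            if rng.random() < r:              # defect identified with probability r
                detected = True
                break
            k += 1
        if not detected and start <= failure_time < horizon:
            failures += 1
        u += rng.expovariate(lam)             # next defect arrival (HPP)
    return failures / (n_intervals - burn_in)

# Hypothetical values: two defects per time unit, mean delay time 0.5, T = 1.
perfect = simulate_failures_per_interval(lam=2.0, mean_delay=0.5, T=1.0, r=1.0)
imperfect = simulate_failures_per_interval(lam=2.0, mean_delay=0.5, T=1.0, r=0.5)
print(perfect, imperfect)
```

With r = 1 the estimate should be close to the perfect inspection result of Equation 14.2; lowering r lets more defects survive inspections and raises the number of failures per interval.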

It has been proved that the failure process over each inspection interval is still
an NHPP (Christer and Wang 1995), but that it is not identical over the earlier
inspection intervals of the system. It can be shown that, as the number of inspections
increases, the failure process over each inspection interval becomes stable,
so we need to study the asymptotic behaviour of the failure process assuming the
number of previous inspections is very large. We use the following notation:
$i$ --- index of the $i$-th inspection
$U$ --- random variable of the initial time $u$
$r$ --- probability that a defect present at an inspection is identified
$\nu_i(t)$ --- ROCOF at time $t$, $t \in [(i-1)T, iT)$
$E[N_f((i-1)T, iT)]$ --- expected number of failures over $[(i-1)T, iT)$
$E[N_s(iT)]$ --- expected number of defects identified at $iT$

It can be shown (Christer et al. 1995; Christer and Wang 1995) that $\nu_i(t)$ is given
by

$$ \nu_i(t) = \lambda \sum_{n=1}^{i-1} (1-r)^{i-n+1} \left[ F(t-(n-1)T) - F(t-nT) \right] + \lambda F(t-(i-1)T) \tag{14.4} $$

for $t \in [(i-1)T, iT)$.

It can also be proved by induction that $\nu_{i-1}(t) \approx \nu_i(t)$ when $i$ is large. Given that
Equation 14.4 is available, it is straightforward that the expected number of failures
over $[(i-1)T, iT)$ is given by

$$ E[N_f((i-1)T, iT)] = \int_{(i-1)T}^{iT} \nu_i(t)\,dt = \int_{(i-1)T}^{iT} \left\{ \lambda \sum_{n=1}^{i-1} (1-r)^{i-n+1} \left[ F(t-(n-1)T) - F(t-nT) \right] + \lambda F(t-(i-1)T) \right\} dt \tag{14.5} $$

The number of defects found at an inspection point, say, $iT$, is also a
Poisson variable, with the mean given by (Christer et al. 1995; Christer and Wang
1995)

$$ E[N_s(iT)] = \lambda \sum_{n=1}^{i-1} (1-r)^{i-n+1} r \int_{(n-1)T}^{nT} [1 - F(iT-u)]\,du + \lambda r \int_{(i-1)T}^{iT} [1 - F(iT-u)]\,du \tag{14.6} $$

The expected downtime is given by Equation 14.1 with the expected number of
failures given by Equation 14.5, so that

$$ D(T) = \frac{d_f E[N_f((i-1)T, iT)] + d_s}{T + d_s} \tag{14.7} $$

The use of Equation 14.7 assumes that the system is already in a steady state
with $i \to \infty$. For computational purposes we can select a large $i$, and then start $n$
from the first $k$ where $(1-r)^{i-k+1} \ge \varepsilon$ and $\varepsilon$ is a very small number.
Equation 14.7 is established assuming that the defects identified at an inspection
will always be removed without incurring any extra downtime or cost. This assumption
can be relaxed. Let $d_r$ be the mean downtime per defect being repaired. Then,
using the same approach as before, the expected downtime is given by

$$ D(T) = \frac{d_f E[N_f((i-1)T, iT)] + d_s + d_r E[N_s(iT)]}{T + d_s + d_r E[N_s(iT)]} \tag{14.8} $$

If the objective function is the expected cost per unit time, we obtain this by
simply substituting the downtime parameters in Equations 14.7 or 14.8 by the
corresponding cost parameters.

Example 14.1 Assume that the rate of occurrence of defects is two per day, and the
delay time distribution is exponential with scale parameter 0.03 measured in days.
The downtime measures are $d_f = 30$ min and $d_s = 30$ min respectively. The probability
$r$ of identifying a defect present at an inspection is assumed to be 0.7. Using
Equations 14.5 and 14.7, we have the expected downtime against inspection intervals
as shown in Figure 14.5. It can be seen from Figure 14.5 that a weekly inspection
interval is the best.

Figure 14.5. Expected downtime per unit time vs. inspection interval (in days)

14.4 Delay Time Model for a Component Subject to a Single Failure Mode (Single Component System)

Most DTM applications are for multiple component systems subject to independent
failure modes. Although most maintained equipment falls into this category, there
are plant items which may have a single dominant failure mode, and may be, in
some cases, replaced or renewed upon failure. Examples of such plant items are
batteries, traffic lights, small pumps and motors. Such plant items are called single
component systems. Note that a system in this category may not actually be a
single component; the key difference compared with a complex multi-component
system is that a single component system is subject to a single failure mode,
and the only maintenance action is to renew the whole system either by a complete
replacement or by a renewal type of repair. This implies that at any point of time, only
one defect of the dominant failure mode can exist. This contrasts with a complex
system with many failure modes, where only the failed component is replaced or
repaired upon a failure, at any point of time there could be many defects
present, and the system is not renewed at failures.
The failure process of this type of single plant item is different from that of a
multi-component complex system; see Figures 14.6 and 14.7.

○ ○ ● ○ ● ● ○ ○● ● ○○ ● ●
Figure 14.6. Failure process of a multi-component system, where ‘○’ denotes initial points;
‘●’ failure points
○ ● ○ ● ○ ●
Figure 14.7. Failure process of a single component system

For the system in Figure 14.6, the system may be renewed at inspection points if
these inspections are perfect and the rate of arrival of defects is constant. However,
for the system in Figure 14.7, the system can be renewed either at a failure or at an
inspection. We present the case with a perfect inspection assumption. The case of an
imperfect inspection delay time model for a single component can be found in
Baker and Wang (1991, 1993).
We need the following additional assumptions and notation:

1. The system is renewed at either a failure repair or at a repair done at an
   inspection if a defect is identified.
2. After either a failure renewal or inspection renewal the inspection process
   restarts.
3. The initial time, $U$, to the appearance of a random defect has a probability
   density function $g(u)$.
4. The defective component identified at an inspection will be renewed either
   by a repair or a replacement at an average cost of $c_r$ and downtime $d_r$.

14.4.1 Inspection Model Based on an Exponentially Distributed Initial Time

We first consider the simple case in which an inspection renews the system regardless
of whether a defect was identified or not. This effectively assumes an exponential
distribution for the initial time $U$.
Since each failure or inspection renews the system with associated downtimes
or costs, the process is a renewal reward process, and the long term expected cost
per unit time, $C(T)$, is given by (Ross 1983):

$$ C(T) = \frac{E(CC)}{E(CL)} $$

where $CC$ is the renewal cycle cost and $CL$ is the renewal cycle length, which is
the interval between two consecutive renewals. There could be two different
renewal cycles: one is the failure renewal and the other is the inspection renewal.
Taking the expected cost per renewal cycle as an example, since a failure will
cost $c_f$ and happens with probability $P(X < T)$, the expected cost due
to a failure renewal within $T$ is

$$ c_f P(X < T) = c_f \int_0^T g(u) F(T-u)\,du \tag{14.9} $$

where $X$ is the time to failure.

The expected cost due to an inspection renewal with a defect identified at $T$ is

$$ (c_r + c_s) P(U < T \cap X \ge T) = (c_r + c_s) \int_0^T g(u) \{1 - F(T-u)\}\,du \tag{14.10} $$

and finally the expected cost due to an inspection renewal without a defect being
identified at $T$ is given by

$$ c_s P(U \ge T) = c_s \int_T^\infty g(u)\,du \tag{14.11} $$

From Equations 14.9–14.11 we have the expected cost per renewal cycle:

$$ E(CC) = c_f \int_0^T g(u) F(T-u)\,du + (c_r + c_s) \int_0^T g(u) \{1 - F(T-u)\}\,du + c_s \int_T^\infty g(u)\,du \tag{14.12} $$

As to the expected cycle length, we model two possibilities. The first is that the
cycle ends at a failure before $T$. Define $p(t)$ as the density function of the time to
failure, which is given readily by

$$ p(t) = \frac{d}{dt} P(X \le t) = \int_0^t g(u) f(t-u)\,du $$

The second is that the cycle ends at the inspection at $T$. Since $1 - P(X < T)$ is the
probability of no failure, which implies an inspection renewal and is given by
$1 - \int_0^T g(u) F(T-u)\,du$, we have

$$ E(CL) = \int_0^T t \int_0^t g(u) f(t-u)\,du\,dt + T \left( 1 - \int_0^T g(u) F(T-u)\,du \right) \tag{14.13} $$

For the detailed derivation of Equations 14.9–14.13 see Baker and Wang (1991,
1993). Finally, the expected cost per unit time is given by

$$ C(T) = \frac{c_f \int_0^T g(u) F(T-u)\,du + (c_r + c_s) \int_0^T g(u) \{1 - F(T-u)\}\,du + c_s \int_T^\infty g(u)\,du}{\int_0^T t \int_0^t g(u) f(t-u)\,du\,dt + T \left( 1 - \int_0^T g(u) F(T-u)\,du \right)} \tag{14.14} $$

The expected downtime can be obtained in a similar manner.

Example 14.2 Assume both the initial time and delay time distributions are
exponential with scale parameters 0.6 and 0.75 respectively. The time unit is 100 days
and the cost parameter values are $c_f$ = £1000, $c_r$ = £150 and $c_s$ = £15 respectively.
Using Equation 14.14, the calculated expected cost per unit time as a function of $T$
is shown in Figure 14.8.

Figure 14.8. Expected cost per unit time vs. inspection interval

The optimal inspection interval is 0.4 × 100 = 40 days, so a monthly inspection
schedule is appropriate.
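
A calculation of the kind used in Example 14.2 can be sketched numerically as follows. This is an illustration under stated assumptions, not the computation behind Figure 14.8: the "scale parameters" 0.6 and 0.75 are interpreted here as the rates of the two exponential distributions, which the text does not confirm, and the function names, the midpoint integration rule and the search grid are choices made for this sketch.

```python
import math

def integrate(func, a, b, n=200):
    """Composite midpoint rule for the integral of func over [a, b]."""
    step = (b - a) / n
    return step * sum(func(a + (k + 0.5) * step) for k in range(n))

def cost_rate(T, g, f, F, c_f, c_r, c_s):
    """Equation 14.14: expected cycle cost (14.12) over expected cycle length (14.13)."""
    p_fail = integrate(lambda u: g(u) * F(T - u), 0.0, T)             # P(X < T)
    p_defect = integrate(lambda u: g(u) * (1.0 - F(T - u)), 0.0, T)   # P(U < T, X >= T)
    p_clean = 1.0 - p_fail - p_defect                                 # P(U >= T)
    cost = c_f * p_fail + (c_r + c_s) * p_defect + c_s * p_clean
    # First term of Equation 14.13, with the order of integration swapped.
    mean_fail_time = integrate(
        lambda u: g(u) * integrate(lambda t: t * f(t - u), u, T, n=50), 0.0, T)
    length = mean_fail_time + T * (1.0 - p_fail)
    return cost / length

# Assumption: 0.6 and 0.75 are treated as exponential rates (time unit 100 days).
rate_u, rate_h = 0.6, 0.75

def g(u):
    return rate_u * math.exp(-rate_u * u)      # pdf of the initial time

def f(h):
    return rate_h * math.exp(-rate_h * h)      # pdf of the delay time

def F(h):
    return 1.0 - math.exp(-rate_h * h)         # cdf of the delay time

grid = [0.05 * k for k in range(1, 41)]        # candidate T from 0.05 to 2.0
best_T = min(grid, key=lambda T: cost_rate(T, g, f, F, 1000.0, 150.0, 15.0))
print(f"lowest-cost T on the grid: {best_T:.2f} (x 100 days)")
```

Because the result depends on how the distribution parameters are interpreted, the grid minimum found by this sketch should be read as indicative only and compared with Figure 14.8 rather than taken as a replacement for it.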

14.4.2 Inspection Model Based on a Non-exponentially Distributed Initial Time

If the initial time $U$ is not exponentially distributed, then we cannot assume that an
inspection renews the system unless a defect is identified at the inspection and the
system is replaced or repaired to as-new condition. In this case a renewal cycle may
span several inspection intervals.
Using a similar framework as before and now taking the expected downtime
per renewal cycle as an example, the expected downtime due to a failure renewal at
time $X$, where $X \in [(i-1)T, iT)$, is

$$ [(i-1)d_s + d_f] P((i-1)T < X < iT) = [(i-1)d_s + d_f] \int_{(i-1)T}^{iT} g(u) F(iT-u)\,du \tag{14.15} $$

This is because inspections are perfect, so that if a failure occurs at time $X$, then the
initial time $U$ must be bounded within $[(i-1)T, X)$, $X < iT$. There are $(i-1)$
inspections with no defect identified before the failure, so $(i-1)$ times the
inspection downtime is added.
Equation 14.15 models only one of the possibilities. A failure can occur in any
of the inspection intervals, so summing over all possible intervals $i$ from 1 to
infinity gives the expected downtime due to a failure:

$$ \sum_{i=1}^{\infty} [(i-1)d_s + d_f] P((i-1)T < X < iT) = \sum_{i=1}^{\infty} [(i-1)d_s + d_f] \int_{(i-1)T}^{iT} g(u) F(iT-u)\,du \tag{14.16} $$

Equation 14.16 is always finite since all the probability terms for large $i$ tend
to zero, because $g(u)$ tends to zero for $u > (i-1)T$ when $i$ is large.
Similarly, the expected downtime due to an inspection renewal with a defect
identified is

$$ \sum_{i=1}^{\infty} ((i-1)d_s + d_r) \int_{(i-1)T}^{iT} g(u) [1 - F(iT-u)]\,du \tag{14.17} $$

Summing Equations 14.16 and 14.17 gives the complete expected downtime
per renewal cycle:


$$\text{Expected downtime per renewal cycle} = \sum_{i=1}^{\infty}\left\{[(i-1)d_s + d_r]\int_{(i-1)T}^{iT} g(u)\,du + (d_f - d_r)\int_{(i-1)T}^{iT} g(u)\,F(iT-u)\,du\right\} \tag{14.18}$$
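
The step from Equations 14.16 and 14.17 to Equation 14.18 can be checked term by term.
The following is a short worked rearrangement for a single interval i, using only the
symbols already defined above; summing both sides over i gives Equation 14.18.

```latex
\begin{align*}
&[(i-1)d_s + d_f]\int_{(i-1)T}^{iT} g(u)F(iT-u)\,du
 + [(i-1)d_s + d_r]\int_{(i-1)T}^{iT} g(u)\,[1-F(iT-u)]\,du \\
&\quad= [(i-1)d_s + d_r]\int_{(i-1)T}^{iT} g(u)\,du
 + \big\{[(i-1)d_s + d_f] - [(i-1)d_s + d_r]\big\}\int_{(i-1)T}^{iT} g(u)F(iT-u)\,du \\
&\quad= [(i-1)d_s + d_r]\int_{(i-1)T}^{iT} g(u)\,du
 + (d_f - d_r)\int_{(i-1)T}^{iT} g(u)F(iT-u)\,du .
\end{align*}
```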

The expected cycle length is obtained in a similar manner and is given by

$$E(CL) = \sum_{i=1}^{\infty}\left\{\int_{(i-1)T}^{iT} t \int_{(i-1)T}^{t} g(u)\,f(t-u)\,du\,dt + iT\int_{(i-1)T}^{iT} g(u)\,\{1 - F(iT-u)\}\,du\right\} \tag{14.19}$$

Finally the expected downtime per unit time is given by

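$$C(T) = \frac{\displaystyle\sum_{i=1}^{\infty}\left\{[(i-1)d_s + d_r]\int_{(i-1)T}^{iT} g(u)\,du + (d_f - d_r)\int_{(i-1)T}^{iT} g(u)\,F(iT-u)\,du\right\}}{\displaystyle\sum_{i=1}^{\infty}\left\{\int_{(i-1)T}^{iT} t \int_{(i-1)T}^{t} g(u)\,f(t-u)\,du\,dt + iT\int_{(i-1)T}^{iT} g(u)\,[1 - F(iT-u)]\,du\right\}} \tag{14.20}$$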

14.4.3 A Case Example

The medical physics department of a teaching hospital in England, which maintains
a large number of items of medical equipment, records the history of breakdowns and
repairs using history cards for each individual item of departmental equipment. The
information available included purchase date, dates of preventive maintenance,
failures and some description of the work carried out. No costs were recorded, but
some estimated cost values were provided by the hospital staff.
Delay Time Modelling 357

Following a discussion with the chief technician, it seemed best to focus on the
following items, to ensure a sample of similar machine types, under heavy and
constant use, with a usefully long history of failures, and with reasonably
well-defined modes of failure. Two types of pump were chosen, namely volumetric
infusion pumps and peristaltic pumps, all from the intensive-care, neurosurgery and
heart-care units. There were 105 volumetric pumps, whose most frequent failure mode
was failure of the pressure transducer, and 35 peristaltic pumps, whose most frequent
failure mode was battery failure. For a detailed description of the case, data and
model fitting see Baker and Wang (1991). Several candidate distributions were tried
for the initial and delay time distributions of both pumps; in both cases a Weibull
distribution gave the best fit for the initial time and an exponential distribution
for the delay time. The parameter values estimated from the history data by the
maximum likelihood method are shown in Table 14.1.

Table 14.1. Estimated parameter values for the pumps

Pump                  Initial time pdf                       Delay time pdf
                      g(u) = αη(αu)^(η−1) e^(−(αu)^η)        f(h) = β e^(−βh)

Volumetric infusion   α̂ = 0.0017, η̂ = 1.42                   β̂ = 0.0174
Peristaltic           α̂ = 0.0007, η̂ = 2.41                   β̂ = 0.0093

Although the cost data were not recorded, it was relatively easy to estimate the
cost of an inspection (called preventive maintenance in the hospital) and the cost of
an inspection repair if a defect was identified. However, it was extremely difficult
to obtain an estimate of the failure cost, since if a pump failed to work when
needed the penalty cost could be very high compared with the cost of the pump
itself. Nevertheless, some estimates were provided, which are shown in Table 14.2.

Table 14.2. Cost estimates

Pump                  Inspection cost   Inspection repair cost   Failure cost

Volumetric infusion   £15               £50                      £2000
Peristaltic           £15               £70                      £1000

This time we cannot derive an analytical formula for the expected cost because
of the use of the Weibull distribution. Numerical integration has to be used to
evaluate Equation 14.20 (with the downtime parameters replaced by the corresponding
costs). We did this using the maths software package MathCad and the results are
shown in Figures 14.9 and 14.10.
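
As an illustration of the numerical integration involved, the following is a minimal
Python sketch, not the MathCad worksheet used in the study. It assumes that the cost
model has the same form as Equation 14.20 with d_s, d_r and d_f replaced by the costs
c_s, c_r and c_f of Table 14.2, and that the delay time is exponential, so the inner
integral over the failure time has a closed form. The function names, the midpoint
rule, the number of steps and the truncation tolerance are all choices made here for
illustration.

```python
import math


def weibull_pdf(u, alpha, eta):
    """Initial-time pdf g(u) = alpha*eta*(alpha*u)**(eta-1) * exp(-(alpha*u)**eta)."""
    if u <= 0.0:
        return 0.0
    z = alpha * u
    return alpha * eta * z ** (eta - 1.0) * math.exp(-(z ** eta))


def expected_cost_rate(T, alpha, eta, beta, c_s, c_r, c_f, steps=200, tol=1e-10):
    """Cost analogue of Equation 14.20 (assumed form), using the midpoint rule.

    Weibull initial time (alpha, eta), exponential delay time with rate beta.
    The sum over intervals stops once the Weibull survival function at the
    start of the interval falls below tol.
    """
    cost = 0.0
    length = 0.0
    i = 1
    while math.exp(-((alpha * (i - 1) * T) ** eta)) > tol:
        lo, hi = (i - 1) * T, i * T
        h = (hi - lo) / steps
        int_g = int_gF = int_len = 0.0
        for k in range(steps):
            u = lo + (k + 0.5) * h
            g = weibull_pdf(u, alpha, eta)
            w = hi - u                      # time left until the next inspection
            surv = math.exp(-beta * w)      # 1 - F(iT - u)
            F = 1.0 - surv
            # closed form of the integral of t*f(t-u) over t in [u, iT)
            fail_time = u * F + 1.0 / beta - surv * (w + 1.0 / beta)
            int_g += g * h
            int_gF += g * F * h
            int_len += g * (fail_time + hi * surv) * h
        cost += ((i - 1) * c_s + c_r) * int_g + (c_f - c_r) * int_gF
        length += int_len
        i += 1
    return cost / length


# Parameter and cost estimates for the volumetric infusion pump (Tables 14.1, 14.2)
volumetric = dict(alpha=0.0017, eta=1.42, beta=0.0174, c_s=15.0, c_r=50.0, c_f=2000.0)
for T in (10, 30, 60, 120, 180):
    print(T, round(expected_cost_rate(T, **volumetric), 3))
```

Scanning a grid of candidate intervals in this way and taking the smallest value gives
an approximate optimum; a finer grid or a one-dimensional optimiser could be used to
refine it.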

Figure 14.9. Expected cost per unit time vs. inspection interval for the volumetric infusion pump

Figure 14.10. Expected cost per unit time vs. inspection interval for the peristaltic pump

Time is given in days in Figures 14.9 and 14.10, so the optimal inspection
interval for the volumetric infusion pump is about 30 days and for the peristaltic
pump is around 70 days. The hospital at the time checked the pumps at an interval
of six months, so clearly for both pumps the inspection intervals should be
shortened. However, it has to be pointed out that the model is sensitive to the
failure cost, and had a different estimate been provided, the recommendation
would have been different.

14.5 Delay Time Model Parameter Estimation

14.5.1 Introduction

In previous sections, delay time models for both a complex system and a single
component have been introduced. However, in a practical situation, before expected
cost or downtime models can be constructed, it is necessary to estimate the values
of the parameters that characterise the defect arrival and failure processes. In this
section we discuss various methods developed to estimate the parameters from
either ‘subjective’ data of experts’ opinions or ‘objective’ data collected at failures
and inspections.
Naturally, the parameter estimation process is not the same for the different
types of delay time model, i.e. single component models, where a single potential
failure mode is modelled and only one defect may (or may not) be present at any
one time, compared with complex system models, where many defects can exist
simultaneously and many failures can occur in the interval between inspections.
This is particularly important for the method using objective data. In this section,
we mainly focus on the estimation methods for complex systems since these
systems are the asset items to which DTM is most applicable. The details of the
approaches developed for parameter estimation for a single component DTM can be
found in Baker and Wang (1991, 1993).

14.5.2 Subjective Data Method

If the maintenance records of failures and the findings recorded at maintenance
interventions such as inspections (collectively called objective data in this chapter)
are available and sufficient in quantity and quality, the delay time distribution and
parameters can be estimated by the classical statistical method of maximum
likelihood; see Section 14.5.3 and the paper by Christer et al. (1995). If, however,
such a data set does not exist, or is insufficient in quality or quantity for the purpose
of estimation, the alternative is to use the subjective judgement of experienced
maintenance engineers or technicians to obtain the delay time distribution and
parameters. This section introduces three methods, developed by Christer and Waller
(1984), Wang (1997) and Wang and Jia (2007), for estimating the delay time
distribution and the associated parameters using subjective data.

Subjective Estimation of the Delay Times Through an On-site and On-spot

This method needs to be carried out over a period of time to collect detailed
information and assessments at every maintenance intervention or failure (Christer
and Waller 1984).
At every failure repair, the maintenance technician repairing the plant would be
asked to estimate:
HLA: how long ago the defect causing the failure may first have been expected
to have been recognised at an inspection.
If a defect was identified at an inspection, then in addition to HLA, the techni-
cian would be asked to estimate:

HML: how much longer could the defect be left unattended before a repair was
The estimates are given by ĥ = HLA for a failure, and ĥ = HLA + HML for an
inspection repair; see Figure 14.11a,b. f (h) is then estimated from the data of { ĥ }.


Figure 14.11. HLA and HML estimates at (a) failure and (b) inspection

At the time of repair, the maintenance technician has information available to
produce his estimate. In addition to his experience, the defect is present, the plant
may be examined, and operatives questioned.
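
The way the HLA and HML answers are turned into delay time estimates can be sketched
as follows. This is an illustrative Python fragment only: the `Response` record, the
sample answers and the choice of an exponential delay time pdf (fitted by maximum
likelihood, i.e. rate = 1 / sample mean) are assumptions made here, not part of the
original survey design.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Response:
    """One questionnaire answer (hypothetical record layout)."""
    hla: float                    # "how long ago", e.g. in days
    hml: Optional[float] = None   # "how much longer"; None at a failure


def delay_time_estimate(r: Response) -> float:
    """h_hat = HLA for a failure, HLA + HML for an inspection repair."""
    return r.hla if r.hml is None else r.hla + r.hml


def fit_exponential_rate(delay_times):
    """ML estimate of the rate of an exponential pdf: 1 / sample mean."""
    return 1.0 / mean(delay_times)


# Made-up answers: two failures and two inspection repairs
responses = [Response(10.0), Response(4.0, 16.0), Response(7.0), Response(2.0, 21.0)]
h_hat = [delay_time_estimate(r) for r in responses]
rate = fit_exponential_rate(h_hat)
print(h_hat, rate)
```

Any other family of distributions could be fitted to the same set of estimates; the
exponential is used here only because its maximum likelihood estimate is a one-liner.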
The rate of defect arrivals can be estimated directly from the number of observed
failures and defects identified over the survey period. For a case study using
this approach to estimate delay time model parameters, see Christer and Waller
(1984).

Subjective Estimation of the Delay Times Based on Identified Failure Modes

The method introduced earlier is a questionnaire survey based approach in which the
subjective opinions of maintenance engineers are sought. It has the advantage that
the engineer is directly facing the defect or failure when the information regarding
the delay time is requested. However, it also has the following problems: (a)
conducting such a survey is time consuming, particularly when the frequency of
failures or defects is not high, which implies a longer time to get sufficient data;
(b) the estimation process is not easy to control since all the forms are left in the
hands of the maintenance engineers involved without an analyst present, which may
result in confusion and mistakes, as experienced in the studies of Christer and
Waller (1984) and Christer et al. (1998b).
Wang (1997) recommended a new approach to estimate the delay time distribution
directly, based on pre-defined major failure modes or types. The idea is as follows.

1. If the estimates are based on pre-selected major failure types instead of the
   individual failure or defect when it occurs, the time spent on the
   questionnaire survey is greatly reduced, since the estimates for all major
   failure types can be made at the same time, which may take only a few hours.
   This also creates the opportunity for an analyst to be present to reduce
   possible confusion and mistakes.
2. A group of experts should be questioned on the same failure type, and their
   opinions can be properly combined to reduce sampling errors.
3. The question asked should be a probabilistic measure of the delay time over
   all possible ranges.

The following phases for estimating the delay time were suggested (Wang 1997).

The problem identification phase. This phase identifies all major failure types and
the possible causes of the failures. It is normally done via a failure mode and
criticality analysis so that a list of dominant failures can be obtained. The
process entails a series of discussions with the maintenance engineers to clarify
any hidden issues. If some failure data exist they should be used to validate the
list; otherwise a questionnaire should be designed and forwarded to the person
concerned to obtain a list of dominant failure types.

Expert identification and choice phase. The term ‘expert’ is not defined by any
quantitative measure of resident knowledge. However, it is clear in the case here
that a person who is regarded by others as being one of the most knowledgeable
about the machine should be chosen as the expert. The shop floor fitters or any
maintenance technicians or engineers who maintain the machine would be the
desired experts; see Christer and Waller (1984). After the set of experts is
identified, a choice is made of which experts to use in the study. Full discussion
with management is necessary in order to select the persons who know the machine
‘best’. On psychological grounds, no more than five and no fewer than three experts
are expected to take part in the exercise.

The question formulation phase. The quantities we want to ask about in this case are
the rate of occurrence of defects (assuming we are modelling a complex plant) and the
delay time distribution. For the rate of arrival of a defect type, we can simply ask
for a point estimate since it is not a random variable. Without maintenance
interventions, this would, in the long term, be equal to the average number of
failures of the same type per unit time. For example we may ask ‘how many failures of
this type will occur per year, month, week or day?’. It is noted that this quantity
is usually observable. In fact, our focus is mainly on the delay time.
Given the amount of uncertainty inherent in making a prediction of the delay
time, the experts may feel uncomfortable about giving a point estimate, and may
prefer to communicate something about the range of their uncertainty. Accepting
these points, perhaps the best that experts could do in this case would be to give
their subjective probability mass function for the quantity in question. In other
words, they could provide a histogram over intervals of the delay time such that the
mass above each interval is proportional to their subjective probability for that
interval. Alternatively, three point estimates can be asked for, such as the most
likely, the minimum and the maximum durations of the delay times for a particular
type of failure.
The term ‘delay time’ was not used in the question since it takes some effort to
explain what the delay time is. Instead, we just asked a question similar to HLA.
Based upon our case experience, this question was still difficult for the experts to
understand. The lesson learned is to work through one example with them before
starting the session.

The elicitation phase. Elicitation should be performed with each expert individually.
If possible, the analyst should be present, which proved to be vital in our case
studies. The above-mentioned histogram was used to draw the answer from the
experts so that they can have a visual overview of their estimates, and a smooth
histogram can be achieved if the experts are advised to produce one. The maximum
number of histogram intervals is set to five, as advised by psychological
experiments.

The calibration phase. Roughly speaking, calibration is intended to measure the
extent to which a set of probability mass functions ‘correspond to reality’. Having
reviewed the problem, we concluded that subjective calibration is not recommended
because it is time consuming. If any objective data are available, we may calibrate
the experts’ opinion by a Bayesian approach, as discussed by many others. Another
approach is to calibrate the estimate by matching an observed statistic. If a
significant difference is found, the estimates must be revised.

The combination phase. Expert resolution, or combining probabilities from experts,
has received some attention. Here we use one of the simplest approaches, namely the
weighting method. It is simply a weighted average of the estimates of all experts.
The weights need to be selected carefully according to each expert’s level of
expertise, and their sum should be equal to one. Other more complicated methods are
available; see Wang (1997).
It is noted that the combined delay time distribution obtained from this phase is
in the form of a discrete probability distribution, whereas a continuous delay time
distribution is needed in delay time inspection modelling. To achieve this, based
upon the number of delay times in each interval, an estimated continuous delay time
distribution F̂(h) of F(h) can be obtained by fitting a distribution from a known
family of failure distributions, such as exponential or Weibull, using the least
squares method or the maximum likelihood method.
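
The weighting method and the subsequent fit of a continuous distribution can be
sketched in a few lines of Python. This is a minimal illustration under stated
assumptions: the experts' histograms, the weights and the interval edges below are
made up, the fitted family is exponential, and the fit is a grouped-data maximum
likelihood fit found by a golden-section search (a choice made here, not a method
prescribed by the text).

```python
import math


def combine(pmfs, weights):
    """Weighted average of the experts' probability mass functions."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to one")
    n_bins = len(pmfs[0])
    return [sum(w * p[k] for w, p in zip(weights, pmfs)) for k in range(n_bins)]


def fit_exponential(edges, pmf):
    """Grouped-data ML fit of F(h) = 1 - exp(-beta*h) to a histogram.

    edges has len(pmf) + 1 entries; the last edge may be math.inf.
    """
    def cdf(h, beta):
        return 1.0 if math.isinf(h) else 1.0 - math.exp(-beta * h)

    def loglik(beta):
        total = 0.0
        for a, b, p in zip(edges[:-1], edges[1:], pmf):
            if p > 0.0:
                # floor avoids log(0) for extreme trial values of beta
                total += p * math.log(max(cdf(b, beta) - cdf(a, beta), 1e-300))
        return total

    # golden-section search for the maximum over log(beta)
    gr = (math.sqrt(5.0) - 1.0) / 2.0
    lo, hi = math.log(1e-6), math.log(1e3)
    x1, x2 = hi - gr * (hi - lo), lo + gr * (hi - lo)
    f1, f2 = loglik(math.exp(x1)), loglik(math.exp(x2))
    for _ in range(200):
        if f1 < f2:
            lo, x1, f1 = x1, x2, f2
            x2 = lo + gr * (hi - lo)
            f2 = loglik(math.exp(x2))
        else:
            hi, x2, f2 = x2, x1, f1
            x1 = hi - gr * (hi - lo)
            f1 = loglik(math.exp(x1))
    return math.exp(0.5 * (lo + hi))


# Hypothetical five-interval histograms (delay time in days) from three experts
edges = [0.0, 5.0, 10.0, 20.0, 40.0, math.inf]
experts = [
    [0.40, 0.25, 0.20, 0.10, 0.05],
    [0.30, 0.30, 0.25, 0.10, 0.05],
    [0.50, 0.20, 0.15, 0.10, 0.05],
]
weights = [0.5, 0.3, 0.2]
combined = combine(experts, weights)
beta_hat = fit_exponential(edges, combined)
print(combined, beta_hat)
```

A Weibull fit would follow the same pattern with a two-parameter search in place of
the one-dimensional one.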

The updating phase. This phase applies mainly once some failure data and recorded
findings become available. In a sense it is a way of calibrating.
A case study using the above method is detailed in Akbarov et al. (2006).

An Empirical Bayesian Approach for Estimating the DTM Parameters Based on Subjective Data

In previous subjective data based delay time estimation approaches (Christer and
Waller 1984; Wang 1997; Akbarov et al. 2006), direct subjective estimates of the
delay time are required. These have been found to be extremely difficult for the
experts to provide, since the delay time is not usually observable and is difficult
to explain (Akbarov et al. 2006).
We now introduce a recently developed approach which starts with subjective data
and then updates the estimates when objective data become available. The initial
estimates are made using the empirical Bayesian method, by matching a few subjective
summary statistics provided by the experts. These statistics should be designed to
be easy to obtain from the experience of the experts and from observed practice,
rather than from unobservable delay times. The updating mechanism enters the process
when objective data become available, and requires a repeated evaluation of the
likelihood function, which will be introduced later. In the framework of Bayesian
statistics, and assuming no objective data are available at the beginning, we first
assume a prior on the parameters which characterise the underlying defect and
failure arrival processes. When objective data become available, we calculate the
joint posterior distribution of the parameters, and we may then use this posterior
distribution to evaluate the expected cost or downtime per unit time conditional on
the observed data.
Assume for now that we are interested in the rate of arrival of defects, λ, and
the delay time pdf, f(h), which is characterised by a two-parameter distribution
f(h | α, β). Unlike the methods proposed in Christer and Waller (1984) and Wang
(1997), here we treat λ and the parameters α and β in f(h | α, β) as random
variables. The classical Bayesian approach is used to define the prior distributions
for the model parameters λ, α and β as f(λ | Φ_λ), f(α | Φ_α) and f(β | Φ_β), where
Φ_• is the set of hyper-parameters within f(• | Φ_•).
Once the Φ_• are available, the point estimates of λ, α and β are their expected
values, given by

$$\hat{\lambda} = \int_0^{\infty} \lambda f(\lambda \mid \Phi_\lambda)\,d\lambda, \qquad \hat{\alpha} = \int_0^{\infty} \alpha f(\alpha \mid \Phi_\alpha)\,d\alpha \qquad \text{and} \qquad \hat{\beta} = \int_0^{\infty} \beta f(\beta \mid \Phi_\beta)\,d\beta .$$

Let g(λ, α, β) denote a statistic of interest, which may be a function of λ, α
and β, say the mean number of failures within an inspection interval, and let
E[g(Φ_λ, Φ_α, Φ_β)] denote its expected value in terms of Φ_λ, Φ_α and Φ_β. Then we
have

$$E[g(\Phi_\lambda, \Phi_\alpha, \Phi_\beta)] = \int_0^{\infty}\!\!\int_0^{\infty}\!\!\int_0^{\infty} g(\lambda, \alpha, \beta)\, f(\lambda \mid \Phi_\lambda)\, f(\alpha \mid \Phi_\alpha)\, f(\beta \mid \Phi_\beta)\,d\lambda\,d\alpha\,d\beta . \tag{14.21}$$

If we can obtain a subjective estimate of E[g(Φ_λ, Φ_α, Φ_β)] provided by the
experts, denoted by g_s, then letting E[g(Φ_λ, Φ_α, Φ_β)] = g_s, we have

$$g_s = \int_0^{\infty}\!\!\int_0^{\infty}\!\!\int_0^{\infty} g(\lambda, \alpha, \beta)\, f(\lambda \mid \Phi_\lambda)\, f(\alpha \mid \Phi_\alpha)\, f(\beta \mid \Phi_\beta)\,d\lambda\,d\alpha\,d\beta . \tag{14.22}$$

Equation 14.22 is only one such equation; if several different subjective estimates
are provided, we obtain a set of equations of the same form. The hyper-parameters
Φ_• may be estimated by solving these equations, provided that the number of
equations is at least the number of hyper-parameters in Φ_•. We now demonstrate this
in our case.
Suppose that the experts can provide the following subjective statistics for
estimating Φ_λ:

• The average number of failures within [0, T), denoted by n_f.
• The average number of defects identified at inspection time T, denoted by n_d.
• The average probability of no defect at all in [0, T), denoted by p_nd.
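In this case, if the statistic of interest is the average number of defects within
[0, T), we have from the property of the HPP that g(λ, α, β) = λT, and then

$$E[g(\Phi_\lambda, \Phi_\alpha, \Phi_\beta)] = \int_0^{\infty}\!\!\int_0^{\infty}\!\!\int_0^{\infty} \lambda T\, f(\lambda \mid \Phi_\lambda)\, f(\alpha \mid \Phi_\alpha)\, f(\beta \mid \Phi_\beta)\,d\lambda\,d\alpha\,d\beta = \int_0^{\infty} \lambda T\, f(\lambda \mid \Phi_\lambda)\,d\lambda .$$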

Since, if inspection is perfect, we have g_s = n_f + n_d, it follows from Equation
14.22 that

$$n_f + n_d = \int_0^{\infty} \lambda T\, f(\lambda \mid \Phi_\lambda)\,d\lambda . \tag{14.23}$$

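Similarly, from the property of the HPP, that is,
P(N_d(0, T) = n | λ) = e^(−λT) (λT)^n / n!, we have

$$p_{nd} = \int_0^{\infty} \Pr\big(N_d(0,T) = 0 \mid \lambda\big)\, f(\lambda \mid \Phi_\lambda)\,d\lambda = \int_0^{\infty} e^{-\lambda T} f(\lambda \mid \Phi_\lambda)\,d\lambda , \tag{14.24}$$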

where N_d(0, T) is the number of defects in [0, T). If we have only two
hyper-parameters in Φ_λ, then solving Equations 14.23 and 14.24 simultaneously in
terms of Φ_λ gives the estimated values of the hyper-parameters in Φ_λ. Note that λ
is independent of α and β, so the integrals of f(α | Φ_α) and f(β | Φ_β) drop out of
Equation 14.21. Similarly, if more subjective estimates are provided, the
hyper-parameters in Φ_α and Φ_β can be obtained. For a detailed description of such
an approach to estimating delay time model parameters see Wang and Jia (2007).
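
To make the solution of Equations 14.23 and 14.24 concrete, the following Python
sketch assumes, purely for illustration, that the prior on λ is a gamma distribution
with shape a and rate b, so that Φ_λ = {a, b}. The text does not specify the prior
family; the gamma is chosen here because both integrals then have closed forms:
Equation 14.23 becomes n_f + n_d = aT/b and Equation 14.24 becomes
p_nd = (b/(b + T))^a. The subjective values used in the example are made up.

```python
import math


def gamma_hyperparameters(n_f, n_d, p_nd, T):
    """Solve Equations 14.23 and 14.24 for an assumed gamma(shape a, rate b) prior.

    With m = n_f + n_d, Equation 14.23 gives b = a*T/m, and Equation 14.24
    reduces to a*log(a/(a + m)) = log(p_nd), solved here by bisection.
    """
    m = n_f + n_d                       # subjective mean number of defects in [0, T)
    if not math.exp(-m) < p_nd < 1.0:
        raise ValueError("a solution needs exp(-(n_f + n_d)) < p_nd < 1")
    target = math.log(p_nd)

    def h(log_a):
        a = math.exp(log_a)
        # a*log(a/(a+m)) written with log1p for numerical stability
        return -a * math.log1p(m / a) - target   # decreasing in a

    lo, hi = -30.0, 30.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if h(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    a = math.exp(0.5 * (lo + hi))
    b = a * T / m
    return a, b


# Hypothetical subjective statistics for an inspection interval of T = 1 time unit
a, b = gamma_hyperparameters(n_f=0.2, n_d=0.3, p_nd=0.64, T=1.0)
lambda_hat = a / b                      # point estimate: prior mean of lambda
print(a, b, lambda_hat)
```

The range check reflects the fact that, for any prior, the average probability of no
defect cannot be smaller than exp(−(n_f + n_d)); subjective answers that violate it
cannot be matched and would need to be revisited with the experts.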
This approach improves on the previously developed subjective methods in terms of
both the way the data are obtained and the accuracy of the estimated parameters. It
is also naturally linked, via Bayes’ theorem, to the objective method for estimating
DTM parameters presented in the next section, once objective data become available
(Wang and Jia 2007).

14.5.3 Objective Data Method

Objective data for complex systems under regular inspections should consist of the
failures (and associated times) in each interval of operation between inspections
and the number of defects found in the system at each inspection. From these data,
we estimate the parameters for the chosen form of the delay time model.

Initially, we consider a simple case of the estimation problem for the basic
delay time model where only the number of failures, m_i, occurring in each cycle
[(i − 1)T, iT) and the number of defects found and repaired, j_i, at each inspection
(at time iT) are required. We do not know the actual failure times within the cycles.
The probability of observing m_i failures in [(i − 1)T, iT) is

$$P\big(N_f((i-1)T, iT) = m_i\big) = \frac{e^{-E[N_f((i-1)T,\,iT)]}\; E[N_f((i-1)T, iT)]^{m_i}}{m_i!} \tag{14.25}$$

Similarly the probability of removing ji defects at inspection i (at time iT ) is

$$P\big(N_s(iT) = j_i\big) = \frac{e^{-E[N_s(iT)]}\; E[N_s(iT)]^{j_i}}{j_i!} \tag{14.26}$$

As the observations are independent, the likelihood of observing the given data
set is just the product of the Poisson probabilities of observing each cycle of data,
mi and ji . As such, the likelihood function for K intervals of data is

$$L(\Theta) = \prod_{i=1}^{K}\left\{\left(\frac{e^{-E[N_f((i-1)T,\,iT)]}\; E[N_f((i-1)T, iT)]^{m_i}}{m_i!}\right)\left(\frac{e^{-E[N_s(iT)]}\; E[N_s(iT)]^{j_i}}{j_i!}\right)\right\}, \tag{14.27}$$

where Θ is the set of parameters within the delay time model. The likelihood
function is optimised with respect to the parameters to obtain the estimated values.
This process can be simplified by taking natural logarithms. The log-likelihood
function is

$$\ell(\Theta) = \sum_{i=1}^{K}\Big(m_i \log\{E[N_f((i-1)T, iT)]\} + j_i \log\{E[N_s(iT)]\} - E[N_f((i-1)T, iT)] - E[N_s(iT)]\Big) - \sum_{i=1}^{K}\big(\log(m_i!) + \log(j_i!)\big) \tag{14.28}$$

where the final summation term is irrelevant when maximising the log-likeli-
hood as it is a constant term and therefore not a function of any of the parameters
under investigation.
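
The log-likelihood of Equation 14.28 is straightforward to code once expressions for
E[N_f] and E[N_s] are available. The sketch below assumes the simplest case, which
is an assumption made here for illustration: defects arrive as an HPP with rate λ,
the delay time is exponential with rate β, and inspections every T time units are
perfect, so that E[N_f] = λ[T − (1 − e^(−βT))/β] and E[N_s] = λT − E[N_f] in every
interval. Under these assumptions the two Poisson means can be matched to the sample
means exactly, which gives the maximum likelihood estimates without a general
optimiser. The counts in the example are made up.

```python
import math


def expected_counts(lam, beta, T):
    """E[N_f] and E[N_s] per interval under the assumed basic model."""
    e_f = lam * (T - (1.0 - math.exp(-beta * T)) / beta)
    return e_f, lam * T - e_f


def log_likelihood(lam, beta, T, m, j):
    """Equation 14.28 for failure counts m[i] and defect counts j[i]."""
    e_f, e_s = expected_counts(lam, beta, T)
    return sum(
        mi * math.log(e_f) + ji * math.log(e_s) - e_f - e_s
        - math.lgamma(mi + 1) - math.lgamma(ji + 1)   # log(m_i!) + log(j_i!)
        for mi, ji in zip(m, j)
    )


def fit(T, m, j):
    """ML estimates (lam, beta); needs at least one failure and one defect found."""
    m_bar, j_bar = sum(m) / len(m), sum(j) / len(j)
    lam = (m_bar + j_bar) / T
    frac = m_bar / (m_bar + j_bar)        # share of defects that end in failure
    lo, hi = 1e-9 / T, 1e9 / T
    for _ in range(200):
        mid = math.sqrt(lo * hi)          # bisection on a log scale
        x = mid * T
        share = 1.0 + math.expm1(-x) / x  # 1 - (1 - exp(-x))/x, increasing in x
        if share < frac:
            lo = mid
        else:
            hi = mid
    return lam, math.sqrt(lo * hi)


# Hypothetical data: five weekly intervals (T = 7 days)
m = [2, 1, 3, 0, 2]     # failures per interval
j = [4, 5, 3, 6, 4]     # defects found at each inspection
lam_hat, beta_hat = fit(7.0, m, j)
print(lam_hat, beta_hat, log_likelihood(lam_hat, beta_hat, 7.0, m, j))
```

For non-perfect inspection, an NHPP or other delay time distributions, the expected
counts differ between intervals and `fit` would be replaced by a numerical optimiser
applied to `log_likelihood`.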
When the times of failures are available, it is often necessary to refine the
likelihood function at Equation 14.27 by considering the detailed pattern of
behaviour within each interval in terms of the number of failures and their
associated times. Define t_ij as the time of the j-th failure in the i-th inspection
interval; the likelihood is given by (Christer et al. 1998a)

$$L(\Theta) = \prod_{i=1}^{K}\left\{\left[\prod_{j=1}^{m_i} v_i(t_{ij})\right] e^{-E[N_f((i-1)T,\,iT)]}\left(\frac{e^{-E[N_s(iT)]}\; E[N_s(iT)]^{j_i}}{j_i!}\right)\right\} \tag{14.29}$$

where v_i(t_ij) is given by Equation 14.4.


In the case study of Christer et al. (1995), only the daily numbers of failures were
available. They formulated a different likelihood taking account of this pattern of
data. This was done essentially by formulating the probability of a particular number
of failures for each day over each inspection interval; the likelihood for a
particular inspection interval is then just the product of these probabilities and
the probability of observing some number of defects at the inspection. See Christer
et al. (1995) for details.

14.5.4 A Case Example

A copper works in the north-west of England has used the same extrusion press for
over 30 years, and the plant is a key item in the works since 70% of its products go
through this press at some stage of their production. The machine comprises a
1700-ton oil-hydraulic extrusion press with one 1700 kW induction heater and
completely mechanized gear for the supply of billets to the press and for the
removal of the extruded products. The machine was operated 15–18 h a day (two
shifts), five days a week, excluding holidays and maintenance downtime. Preventive
maintenance (PM) has been carried out on this machine since 1993. It consisted of a
thorough inspection of the machinery, along with any subsequent adjustments or
repairs if the defects found could be rectified within the PM period. Any major
defects which could not be rectified during the PM time were supposed to be dealt
with during non-production hours. PM lasted about 2 h and was performed once a week
at the beginning of each week.
Questions of concern are (i) whether PM is or could be effective for this
machine; (ii) whether the current PM period is the right choice, particularly the one
week PM interval which was based upon maintenance engineers’ subjective judge-
ment; (iii) whether PM is efficient, i.e. whether it can identify most defects present
and reduce the number of failures caused by those defects.
In this case study, the delay time model introduced earlier was used to address
the above questions. The first question can also be answered in part by comparing
the total downtime per week under PM with the total downtime per week in the
previous years without PM. A parallel study carried out by the company revealed
that PM has lowered the total downtime. The proportion of downtime was reduced
from 7.8% to 5.8%.
To establish the relationship between the downtime measure and the PM
activities using the delay time concept, the first task is to estimate the parameters
of the underlying delay time distribution from available data, and hence build a
model to describe the failure and PM processes. The type of delay time model used
in the study is the non-perfect inspection model.
In the original study (Christer et al. 1995), a number of candidate delay time
distributions were considered, including exponential and Weibull distributions. The
chosen form for the delay time distribution is a mixed distribution consisting of an
exponential distribution (scale parameter α) with a proportion P of defects having a
delay time of 0. The cdf is given by

$$F(h) = 1 - (1 - P)\,e^{-\alpha h}$$
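
The mixed distribution above is easy to evaluate and to simulate from, which can be
useful when checking a fitted model against data. The following Python sketch is an
illustration only: the function names and the inverse-transform sampler are choices
made here, and the parameter values are the estimates reported in Table 14.3. The
mean delay time implied by this cdf is (1 − P)/α.

```python
import math
import random


def mixed_cdf(h, P, alpha):
    """F(h) = 1 - (1 - P)*exp(-alpha*h) for h >= 0, with an atom of size P at 0."""
    return 0.0 if h < 0.0 else 1.0 - (1.0 - P) * math.exp(-alpha * h)


def sample_delay_time(P, alpha, rng):
    """Inverse-transform sample: 0 with probability P, otherwise exponential."""
    u = rng.random()
    if u < P:
        return 0.0
    return -math.log((1.0 - u) / (1.0 - P)) / alpha


P_hat, alpha_hat = 0.5546, 0.0178          # estimates from Table 14.3
rng = random.Random(12345)
draws = [sample_delay_time(P_hat, alpha_hat, rng) for _ in range(200_000)]
print(sum(draws) / len(draws), (1.0 - P_hat) / alpha_hat)
```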

An optimisation algorithm is required for maximisation of the likelihood with
respect to the parameters. The estimated values are given in Table 14.3 with their
associated coefficients of variation (CV).

Table 14.3. Estimated model parameters

Parameter                                    Estimate       CV

Rate of occurrence of defects                λ̂ = 1.3561     0.0832
Probability of perfect inspection            r̂ = 0.902      3.4956
Proportion of defects with zero delay time   P̂ = 0.5546     0.4266
Scale parameter                              α̂ = 0.0178     1.1572

Inserting the optimal parameter estimates into the log-likelihood function gives
an ML value of 101.86. See Christer et al. (1995) on the analysis and the fit of the
model to the data.

14.6 Other Developments in DTM and Future Research

Several useful extensions have been made over the last decade to make the delay
time model more realistic, at the cost of increased mathematical complexity.
Christer and Wang (1995) addressed an NHPP non-perfect inspection delay
time model of multiple component systems. In this case the constant inspection
interval assumption no longer holds, and a recursive algorithm was developed in
Wang and Christer (2003) to find the optimal non-constant intervals until final
replacement. Christer and Redmond (1990) reported a problem of sampling bias,
and proposed ways of estimating the delay time distribution from subjective data.
Wang and Christer (1997) modelled a single component system subject to inspections
over a finite time horizon. Christer et al. (1997) used an NHPP in modelling
the rate of arrival of defects within a case study. Wang (2000) developed a model
of nested inspections using the delay time concept. Wang and Jia (2007) reported
the use of empirical Bayesian statistics in the estimation of delay time model
parameters using subjective data, which overcame a number of problems in previous
subjective delay time parameter estimation. Christer et al. (2000) addressed the
case where the downtime due to failures cannot be ignored in the calculation of the
expected number of failures during an inspection interval, and proposed a refined
method. Christer et al. (2001) compared the delay time model with an equivalent
semi-Markov setting to explore the robustness of both modelling techniques to the
Markov assumption. Carr and Christer (2003) studied the problem of non-perfect
repairs at failures, which allows failures to re-occur if the repair is not perfect.
The future research on the DTM relies on the application areas, the data in-
volved, and the objective function chosen. We consider that the following areas or
problems are worthy of research using the delay time concept:

1. PM type of inspections. Inspections may consist of many activities, and
   some of them are purely preventive, such as greasing, topping up oil and
   cleaning, which may have no connection with defect identification. It is
   noted, however, that this type of PM may change the rate of defect
   arrivals and therefore change the expected number of failures within an
   inspection interval. This problem has not been modelled in previous DTM
   research, but it is a reality we have to face. An initial idea is to introduce
   another parameter in the rate of defect arrivals to model the
   effectiveness of such PM activities.
2. Multiple inspection schemes. It is again common in practice that more
   than one inspection interval, of different scales or types, is in place. Wang
   (2000) developed a DTM for nested inspections, but the model is not generic,
   and can only be used for a specific type of problem.
3. Condition monitoring (CM). This is becoming more popular in industry and
   offers abundant modelling opportunities with a large amount of data. CM
   may identify the initial point of a random defect at an earlier stage than
   manual inspections, and it is possible that u, the initial point of a random
   defect, becomes observable by CM. A pilot study has been carried out to
   investigate the use of DTM in condition based maintenance modelling
   (Wang 2006).
4. Parameter estimation. This is still an on-going research item since for each
   specific problem we may have to develop a tailor-made approach. The
   empirical Bayesian approach outlined earlier is promising since it combines
   both subjective and objective data. It is noted, however, that the
   computation involved is intensive, and therefore algorithmic developments are
   required to speed up the process.

14.7 Conclusion
There is considerable scope for advances in maintenance modelling that impact
productively upon current maintenance practice. This chapter reports upon one
methodology for modelling inspection practice. The power of mathematics and
statistics is used to exploit an elementary mathematical construct of the failure
process to build operational models of maintenance interactions. The delay time
concept is
a natural one within the maintenance engineering context. More importantly, it can
be used to build quantitative models of the inspection practice of asset items, which
have proved to be valid in practice. The theory is still developing, but so far there
has been no technical barrier to developing DTM for any plant items studied.
This chapter has introduced the delay time concept and has shown how it can
be applied to various production equipment to optimise inspection intervals. To
provide substance to this statement, the processes of model parameter estimation
and case examples outlining the use of delay time modelling in practice are
introduced. We have presented only some fundamental DTMs and the associated parameter
estimation procedures; interested readers can refer to the references listed at the
end of the chapter for further detail.

14.8 Dedications
This chapter is dedicated to Professor Tony Christer who recently passed away.
Tony was a “world class” researcher with an international reputation. He was the
originator of the delay time concept and had produced in conjunction with others a
considerable number of papers in delay time modelling theory and applications. He
was a great man who enthused, mentored and guided many of us to strive for
higher quality research. He will be sadly missed by all who knew him.

14.9 References
Abdel-Hameed, M., (1995), Inspection, maintenance and replacement models, Computers
and Operations Research, V22, 4, 435–441
Akbarov, A., Wang W. and Christer A.H., (2006), Problem identification in the frame of
maintenance modelling: a case study, to appear in I. J. Prod. Res.
Baker, R.D. and Wang, W., (1991), Estimating the delay time distribution of faults in
repairable machinery from failure data, IMA J. Maths. Applied in Business and Industry,
4, 259–282.
Baker, R. and Wang, W., (1993), Developing and testing the delay time model, Journal of
Operational Research Society, Vol. 44, No. 4, 361–374.
Barlow, R.E and Proschan, F., (1965), Mathematical theory of reliability, Wiley, New York.
Carr, M.J., and Christer, A.H, (2003) Incorporating the potential for human error in
maintenance models, J. Opl. Res. Soc., 54 (12), 1249–1253
Christer, A.H., (1976), Innovative decision making, proceedings of NATO conference on
the role of effectiveness of theory of decision in practice, eds. Bowen K.C and White
D.J., Hodder and Stoughton, 368–377.
Christer, A.H., (1999), Developments in delay time analysis for modeling plant main-
tenance, J. Opl. Res. Soc., 50, 1120–1137.
Christer, A.H. and Redmond, D.F., (1990), A recent mathematical development in main-
tenance theory, Int. J. Prod. Econ, 24, 227–234.
Christer, A.H. and Waller, W.M., (1984), Delay time Models of Industrial Inspection Main-
tenance Problems, J. Opl. Res. Soc., 35, 401–406.
Christer, A.H and Wang, W., (1995), A delay time based maintenance model of a multi-
component system, IMA Journal of Maths. Applied in Business and Industry, Vol. 6,
Christer, A.H and Whitelaw, J. (1983), An Operational Research approach to breakdown
maintenance: problem recognition, J Opl Res Soc, 34, 1041–1052.
Christer, A.H., Wang, W., Baker, R.D. and Sharp, J.M., (1995), Modelling maintenance
practice of production plant using the delay time concept, IMA J. Maths. Applied in
Business and Industry, Vol. 6, 67–83.
Christer, A.H., Wang, W., Sharp, J.M. and Baker, R.D., (1997), A stochastic modelling
problem of high-tech steel production plant, in Stochastic Modelling in Innovative
Manufacturing, Lecture Notes in Economics and mathematical Systems, (Eds. by A.H
Christer, Shunji Osaki and L. C. Thomas), Springer, Berlin, 196–214.
Christer, A.H., Wang, W., Choi, K. and Sharp, J.M., (1998a), The delay-time modelling of
preventive maintenance of plant given limited PM data and selective repair at PM, IMA
J. Maths. Applied in Business and Industry, Vol. 9, 355–379.
Christer, A.H., Wang, W., Sharp, J.M. and Baker, R.D., (1998b), A case study of modelling
preventive maintenance of production plant using subjective data, J. Opl. Res. Soc., 49,
Christer, A.H., Wang, W. and Lee, C., (2000), A data deficiency based parameter estimating
problem and case study in delay time PM modelling, Int. J. Prod. Eco. Vol. 67, No. 1,
Christer, A.H., Wang, W., Choi, K. and Schouten, F.A., (2001), The robustness of the
semi-Markov and delay time maintenance models to the Markov assumption, IMA. J.
Management Mathematics, 12, 75–88.
Kaio, N. and Osaki, S., (1989), Comparison of Inspection Policies, Journal of the
Operational Research Society, Vol. 40, No. 5, 499–503
Luss, H., (1983), An Inspection Policy Model for Production Facilities, Management
Science, Vol. 29, No. 9, 1102–1109
McCall, J., (1965), Maintenance Policies for Stochastically Failing Equipment: A Survey,
Management Science, Vol. 11, No. 5, 493–524
Moubray, J., (1997), Reliability Centred Maintenance, Butterworth-Heineman, Oxford.
Ross, (1983), Stochastic processes, Wiley, New York
Taylor, H.M., and Karlin, S., (1998), An introduction to stochastic modeling, 3rd Ed.,
Academic press, San Diego.
Thomas, L.C., Gaver, D.P. and Jacobs, P.A. (1991), Inspection Models and their application,
IMA Journal of Management Mathematics, 3(4):283–303
Wang, W., (1997), Subjective estimation of the delay time distribution in maintenance mo-
delling, European Journal of Operational Research, 99, 516–529.
Wang W., (2000), A model of multiple nested inspections at different intervals, Computers
and Operations Research, 27, 539–558.
Wang W., (2006), Modelling the probability assessment of the system state using available
condition information, to appear in IMA. J. Management Mathematics
Wang W. and Christer A.H., (1997), A modelling procedure to optimise component safety
inspection over a finite time horizon, Quality and Reliability Engineering International,
13, No. 4, 217–224.
Wang W. and Christer A.H., (2003), Solution algorithms for a multi-component system
inspection model, Computers and OR, 30, 190–134.
Wang W. and Jia, X., (2007), A Bayesian approach in delay time maintenance model
parameters estimation using both subjective and objective data, Quality Maintenance
and reliability Int. , 23, 95–105
Part E


Maintenance Outsourcing

D.N.P. Murthy and N. Jack

15.1 Introduction
Every business (mining, processing, manufacturing and service-oriented businesses
such as transport, health, utilities, communication) needs a variety of equipment to
deliver its outputs. Equipment is an asset that is critical for business success in the
fiercely competitive global economy. However, equipment degrades with age and
usage and ultimately becomes non-operational, and businesses incur heavy losses
when their equipment is not in full operational mode. For example, in open cut
mining, the loss in revenue resulting from a typical dragline being out of action is
around one million dollars per day and the loss in revenue from a 747 plane being
out of action is roughly half a million dollars per day. Non-operational equipment
leads to delays in delivery of goods and services and this in turn causes customer
dissatisfaction and loss of goodwill.
Rapid changes in technology have resulted in equipment becoming more com-
plex and expensive. Maintenance action can reduce the likelihood of such equip-
ment becoming non-operational (referred to as preventive maintenance) and also
restore a non-operational unit to an operational state (referred to as corrective main-
tenance). For most businesses it is no longer economical to carry out maintenance in
house. There are a variety of reasons for this including the need for a specialist work
force and diagnostic tools that often require constant upgrading. In these situations it
is more economical to outsource the maintenance (in part or total) to an external
agent through a service contract. Campbell (1995) gives details of a survey where it
was reported that 35% of North American companies had considered outsourcing
some of their maintenance.
Consumer durables (products such as kitchen appliances, televisions, auto-
mobiles, computers, etc.) that are bought by individuals are certainly getting more
complex. A 1990 automobile is immensely more complex than its 1950 counter-
part. Customers need assurance that a new product will perform satisfactorily over
its lifetime. In the case of consumer durables, manufacturers have used warranties
to provide this assurance during the early part of a product’s useful life. Under
warranty the manufacturer repairs all failures that occur within the warranty period
and this is often done at no cost to the customer. The warranty period for most
consumer durables has been increasing and the warranty terms have been
becoming more favourable to the customer. For example, the typical warranty pe-
riod for an automobile in 1930 was 90 days, in 1970 it was 1 year, and in 1990 it
was 3 years. A warranty is tied to the sale of a product and the cost of servicing the
warranty is factored into the sale price. For customers who need assurance beyond
the warranty period, manufacturers and/or third parties (such as financial institu-
tions, insurance companies and independent operators) offer extended warranties
(or service contracts) at an additional cost to the customer. Extended warranties for
automobiles of 5–7 years are now fairly common.
Governments (local, state or national) own infrastructure (roads, rail and com-
munication networks, public buildings, dams, etc.) that were traditionally main-
tained by in-house maintenance departments. Here there is a growing trend towards
outsourcing these maintenance activities to external agents so that the governments
can focus on their core activities.
In all the above cases, we have an asset (complex equipment, consumer durable
or an element of public infrastructure) that is owned by the first party (the owner)
and the asset maintenance is outsourced to the second party (the service agent who
is also referred to as the “contractor” in many technical papers) under a service
contract. This chapter deals with maintenance outsourcing from the perspectives of
both the owner (the customer for the maintenance service) and the service agent
(the service provider). We focus on the first case (where the customer is a
business) and we develop a framework to indicate the different issues involved,
carry out a review of the literature, and indicate topics that need further investiga-
tion and research.
The outline of the chapter is as follows. Section 15.2 deals with the customer
and the agent perspectives. In Section 15.3, we propose a framework to study main-
tenance outsourcing. Section 15.4 reviews the relevant literature on maintenance
outsourcing and on extended warranties. Section 15.5 deals with a game theoretic
approach to maintenance outsourcing and extended warranties. In Section 15.6 we
briefly discuss agency theory and its relevance to maintenance outsourcing and, in
Section 15.7 we conclude with a brief discussion of future research in maintenance
outsourcing.

15.2 Customer and Service Agent Perspectives

15.2.1 Customer

Outsourcing of maintenance involves some or all of the maintenance actions
(preventive and/or corrective) being carried out by an external service agent under
a service contract. The contract specifies the terms of maintenance and the cost
issues. It can be simple or complex and can involve penalty and incentive terms.

Businesses

Businesses (producing products and/or services) need to come up with new
solutions and strategies to develop and increase their competitive advantage.
Outsourcing is one of these strategies that can lead to greater competitiveness
(Embleton and Wright 1998). It can be defined as a managed process of trans-
ferring activities performed in-house to some external agent. The conceptual basis
for outsourcing (see Campbell 1995) is as follows:

1. Domestic (in-house) resources should be used mainly for the core com-
petencies of the company.
2. All other (support) activities that are not considered strategic necessities,
and/or for which the company does not possess adequate competences
and skills, should be outsourced (provided there is an external agent
who can carry out these activities in a more efficient manner).

Most businesses tend not to view maintenance as a core activity and have
moved towards outsourcing it. The advantages of outsourcing maintenance are as
follows:

1. Better maintenance due to the expertise of the service agent.

3. Access to high-level specialists on an “as and when needed” basis.
4. Fixed cost service contract removes the risk of high costs.
5. Service providers respond to changing customer needs.
6. Access to latest maintenance technology.
7. Less capital investment for the customer.
8. Managers can devote more resources to other facets of the business by re-
ducing the time and effort involved in maintenance management.

However, there are some disadvantages of outsourcing the maintenance and
these are indicated below:

1. Dependency on the service provider.

2. Cost of outsourcing.
3. Loss of maintenance knowledge (and personnel).
4. Becoming locked in to a single service provider.

For very specialised (and custom built) products, the knowledge to carry out the
maintenance and the spares needed for replacement need to be obtained from the
original equipment manufacturer (OEM). In this case, the customer is forced into
having a maintenance service contract with the OEM and this can result in a non-
competitive market. In the USA, Section II of the Sherman Act (Khosrowpour
1995) deals with this problem by making it illegal for OEMs to act in this manner.
When the maintenance service is provided by an agent other than the original
equipment manufacturer (OEM), the cost of switching often prevents customers
from changing their service agent. In other words, customers get “locked in” and
are unable to do anything about it without a major financial consequence.
As a result, it is very important for businesses to carry out a proper evaluation
of the implications of outsourcing their maintenance. If done properly, outsourcing
can be cheaper than in-house maintenance and can lead to greater business
profitability.

Owners of Infrastructure

Traditionally, governments owned and operated infrastructures (such as road, rail,
water and electricity networks). There has been a growing trend towards selling
these assets to private businesses who either lease them back to the government or
operate the asset. The maintenance of the asset is often outsourced as it is again
viewed as not being the core activity of the business owning the asset. A com-
plicating factor is the additional parties involved and these are shown in Figure
15.1. For example, in the case of a rail network, the operators are the different rail
companies that use the track and the maintenance is outsourced to specialist
contractors. The government plays a critical role in terms of providing loans to
and/or acting as a guarantor for the owner and the regulators are independent
authorities responsible for ensuring public safety. The role of maintenance now
becomes important in the context of safety and risk. For further discussion see
Vickerman (2004).

Figure 15.1. Different parties that need to be considered in the maintenance of infrastructures

Individual Consumers

In the case of consumer durables, the cost of rectifying failures in the
post-warranty period is a concern to buyers. The uncertainty in the cost of repair and
the customer's attitude to risk determine the amount a customer is willing to pay for
an extended warranty or service contract. In one sense, opting for an extended warranty can be
viewed as taking out an insurance to cover future potential costs resulting from the
product failures in the post-warranty period.

Decision Problems

In the case of businesses (producing goods and services) and infrastructure
operators the decision problems are (i) whether to outsource or not, (ii) what main-
tenance activities to outsource and, (iii) how to implement and manage the process.
We will discuss these issues in a later section.
In the case of an extended warranty, the customer has to decide (i) whether or
not to buy an extended warranty and (ii) the best one to buy when there are several
different options.

15.2.2 Service Agent – Issues and Decisions

The service agent providing the maintenance needs to operate as a service
business. This implies that issues such as return on investment (ROI), the number of
customers to service (market share), the location of operations and the range of
service contracts to offer are important in the context of
strategic management of the business. The type of contract depends on the needs of
customers, and contracts can be either standard or customized. At the operational
level, the service agent needs to deal with issues such as scheduling of
maintenance tasks, spare part inventory control, etc.
The pricing of the different service contracts offered is critical for business
profitability. If the price is too low, the service agent might end up making a loss
instead of a profit. On the other hand, if it is too high then there might be no
customers for the service. The price must cover the costs, and estimating the cost is a
challenge due to information uncertainties.

Extended Warranty Providers – Issues and Decisions

For most products, the product market has become global and highly competitive,
resulting in many similar brands. Survival and growth in such an environment
requires the manufacturers to differentiate their products from those of competi-
tors. Product support provides the mechanism for this differentiation. Product
support deals with issues such as providing better information about the product
before sale and post-sale support in the form of warranty, extended warranty,
training, upgrades, spares, etc. The bundling of products with product-support is a
mechanism that manufacturers have used very effectively to market their products
(see Eppen et al. 1991).
In many industries (for example, consumer electronics) extended warranties
have been highly profitable to manufacturers (see Padmanabhan 1996 and the UK
Competition Commission Report 2003). The popularity of extended warranties has
resulted in third parties (financial institutions, insurance companies and indepen-
dent operators) providing these to customers.
The decision problem here is the pricing of extended warranties. The price must
exceed the cost of servicing claims over the warranty period. In the case where the
extended warranty is offered by the manufacturer, the manufacturer has some
information about product reliability. However, third parties offering extended
warranties lack this information and as such the decision on pricing must take into
account this uncertainty.
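
The pricing condition above can be illustrated with a minimal sketch. It assumes, purely for illustration, that failures follow a point process whose expected cumulative number of failures by age t is (t/scale)^shape, that every claim costs the same amount to service, and that a third party represents its lack of reliability information as a small set of candidate scale values with subjective probabilities. The function names and parameters are hypothetical and do not come from the chapter.

```python
def expected_ew_cost(base_period, ew_period, scale, shape, repair_cost):
    """Expected cost of servicing claims during the extended warranty.

    Assumes the expected number of failures by age t is (t / scale) ** shape,
    so the expected number of claims in the extended period is the increment
    of that function over (base_period, base_period + ew_period].
    """
    def cumulative_failures(t):
        return (t / scale) ** shape

    expected_claims = (cumulative_failures(base_period + ew_period)
                       - cumulative_failures(base_period))
    return repair_cost * expected_claims


def expected_cost_under_uncertainty(base_period, ew_period, shape,
                                    repair_cost, scale_beliefs):
    """Average the expected cost over candidate scale values.

    scale_beliefs is a list of (scale, probability) pairs; it is a
    hypothetical way of encoding what a third party believes about
    product reliability.
    """
    return sum(prob * expected_ew_cost(base_period, ew_period, scale,
                                       shape, repair_cost)
               for scale, prob in scale_beliefs)


if __name__ == "__main__":
    known = expected_ew_cost(1.0, 2.0, 1.0, 2.0, 100.0)
    unsure = expected_cost_under_uncertainty(
        1.0, 2.0, 2.0, 100.0, [(1.0, 0.5), (2.0, 0.5)])
    print(known, unsure)
```

Any price set by the provider would then need to exceed the relevant expected cost, with the margin reflecting how much weight is placed on the less favourable reliability scenarios.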
15.3 Framework to Study Maintenance Outsourcing

A proper framework to study maintenance outsourcing from both customer and
service agent points of view involves several interlinked elements as indicated in
Figure 15.2. In Section 15.2 we discussed the customer and the service agent elements
and in this section we discuss the remaining elements.










Figure 15.2. Framework for study of maintenance outsourcing

15.3.1 Asset and State of Asset

In general, an asset is a complex system comprising several components. The state
of the system degrades with age and/or usage and this leads to a failure. An asset is
said to be in failed state when it is no longer functioning properly. In the case of
equipment, or a consumer durable, the failure is due to the failure of one or more
components. In the case of infrastructure, for example a road, a failure occurs when
a pothole reaches some size or the number of potholes per kilometre exceeds some
specified amount.
In the case of a new asset, the initial state is determined by the decisions made
during its design and construction (or manufacture). The asset reliability character-
ises the probability of no failure and this decreases with age. The field reliability
also depends on the operating stress (load) on the asset and the operating
environment. The stress can be thermal, mechanical, electrical, etc., and the reliability
decreases as the stress increases and/or the environment gets harsher.
When a failure occurs, the asset can be restored to an operational state through
corrective maintenance (CM). In the case of equipment, this involves repairing or
replacing the failed components. In the case of the road example, the CM involves
filling the potholes and resealing a section of the road. The degradation in the asset
state can be controlled through use of preventive maintenance (PM) and, in the
case of equipment, this involves regular monitoring and replacing of components
before failure.
The asset state at any given time (subsequent to it being put into operation) is a
function of its inherent reliability and past history of usage and maintenance. This
information is important in the context of maintenance service contracts for used
assets. The information that the service agent (and the customer) has can vary from
very little to a lot (if detailed records of past usage and maintenance have been kept).
Finally, for some assets, the delivery of maintenance requires the service agent
to visit the site where the asset is located (for example, lifts in buildings and roads)
and for others (most consumer durables and some industrial equipment) the failed
asset can be brought to a service centre to carry out the maintenance actions.

15.3.2 Maintenance

Corrective Maintenance (CM)

These are corrective actions performed when the asset has a failure. The most
common form of CM is “minimal repair” where the state of the asset after repair is
nearly the same as that just before failure. The other extreme is “as good as new”
repair and this is seldom possible unless one replaces the failed asset by a new one.
Any repair action that restores the asset to a state better than that before failure but
not as good as that of a new asset is referred to as “imperfect repair”.

Preventive Maintenance (PM)

In the case of equipment or consumer durables, PM actions are carried out at com-
ponent level where components are replaced based on age, usage and/or condition.
As a result, there are several different kinds of PM policies (Blischke and Murthy
2000). Some of the more commonly used ones are the following:
• Age based maintenance. Replace a component (under PM) when it reaches
age T (after being put into use) or on failure under CM, if the item fails earlier.
• Clock based maintenance. Replace a component (under PM) at set times
t = kT, k = 1, 2, …, or on failure under CM.
• Opportunistic maintenance. This is based on exploiting opportunities that
become available. An example is PM actions for some components being
carried out at the same time as the CM action for a failed component.
• Condition-based maintenance. Here, the maintenance action is based on an
assessment of the state of a component from a set of measurement data
obtained. For example, the state of a turbine bearing is assessed on data
relating to noise, vibration, wear debris in oil, etc.
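
As an illustration of how an age based policy can be evaluated, the sketch below computes the long-run cost per unit time using the standard renewal-reward argument: the expected cost of a replacement cycle divided by its expected length. The Weibull lifetime distribution, the cost parameters and the function names are assumptions made for this example only.

```python
import math


def weibull_reliability(t, scale, shape):
    """Probability that a new component survives beyond age t."""
    return math.exp(-((t / scale) ** shape))


def age_policy_cost_rate(T, scale, shape, cost_pm, cost_cm, steps=2000):
    """Long-run cost per unit time of replacing at age T or on failure."""
    # Expected cycle length is the integral of R(t) over [0, T],
    # approximated here with the trapezoidal rule.
    h = T / steps
    area = 0.5 * (weibull_reliability(0.0, scale, shape)
                  + weibull_reliability(T, scale, shape))
    for i in range(1, steps):
        area += weibull_reliability(i * h, scale, shape)
    expected_cycle_length = area * h

    # A cycle ends with a PM replacement if the component survives to T,
    # otherwise with a CM replacement at the failure time.
    survive = weibull_reliability(T, scale, shape)
    expected_cycle_cost = cost_pm * survive + cost_cm * (1.0 - survive)
    return expected_cycle_cost / expected_cycle_length


if __name__ == "__main__":
    # Scan a grid of candidate replacement ages (illustrative values).
    candidates = [0.1 * k for k in range(1, 31)]
    best_T = min(candidates,
                 key=lambda T: age_policy_cost_rate(T, 1.0, 3.0, 1.0, 10.0))
    print(best_T, age_policy_cost_rate(best_T, 1.0, 3.0, 1.0, 10.0))
```

A grid scan is used rather than an optimiser to keep the sketch dependency-free; with a constant failure rate (shape equal to 1) preventive replacement brings no benefit, which provides a simple sanity check.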

Modeling Failures and Maintenance Actions

To evaluate different maintenance actions, mathematical models are needed for the
failure of assets and the effect of maintenance on these failures. The modeling can
be done at two levels – system or component.

System level modeling. If only CM and no PM is used and the time to repair is very
much smaller than the time between failures, then one can model failures over time as
a stochastic point process with an intensity function λ(t) that is increasing with t
(time or age) to capture the degradation with time (see Rigdon and Basu 2000).
The effect of operating stress and operating environment can be modeled through a
Cox-regression model where the intensity function is modified to g(z)λ(t), where
z is the vector of covariates representing the stress and environmental variables
(see Cox and Oakes 1984).
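
A minimal sketch of this system level model follows. The power-law form of λ(t) and the exponential form of g(z) are common illustrative choices rather than forms specified in the chapter, and all names and parameter values are hypothetical.

```python
import math


def baseline_intensity(t, scale, shape):
    """Power-law intensity; increasing in t when shape > 1 (assumed form)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def covariate_factor(z, coeffs):
    """g(z) = exp(sum of coeff * covariate); an assumed log-linear form."""
    return math.exp(sum(b * zi for b, zi in zip(coeffs, z)))


def modified_intensity(t, scale, shape, z=(), coeffs=()):
    """Intensity g(z) * lambda(t) under the given stress covariates."""
    return covariate_factor(z, coeffs) * baseline_intensity(t, scale, shape)


def expected_failures(t, scale, shape, z=(), coeffs=()):
    """Expected number of failures in [0, t]: g(z) times (t / scale) ** shape."""
    return covariate_factor(z, coeffs) * (t / scale) ** shape


if __name__ == "__main__":
    # One covariate (for example a normalised load level) that doubles
    # the intensity when it equals 1.
    print(expected_failures(2.0, 1.0, 2.0))
    print(expected_failures(2.0, 1.0, 2.0, z=(1.0,), coeffs=(math.log(2.0),)))
```

Because g(z) does not depend on t in this sketch, a harsher environment scales the expected number of failures over any interval by the same factor.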
The effect of PM actions can be modeled through a reduction in the intensity
function as shown in Figure 15.3. The level of PM (indicated by δ in the figure)
determines the reduction in the intensity function and the cost of a PM action
increases with the level of PM.


Figure 15.3. Effect of PM actions on the intensity function (PM levels δ1, δ2)
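
The effect of PM on the intensity function can be sketched as follows. The sketch assumes that each PM action lowers the intensity by its level δ from the time of the action onwards and that the intensity cannot fall below zero; the linear baseline and the numerical values are hypothetical.

```python
def linear_baseline(t):
    """Hypothetical increasing baseline intensity used for illustration."""
    return 0.1 * t


def intensity_with_pm(t, baseline, pm_actions):
    """Intensity at time t after applying all PM actions carried out by t.

    pm_actions is a list of (time, delta) pairs, where delta is the
    reduction in intensity achieved by that PM action.
    """
    reduction = sum(delta for when, delta in pm_actions if when <= t)
    return max(baseline(t) - reduction, 0.0)


def expected_failures(horizon, baseline, pm_actions, steps=10000):
    """Expected failures over [0, horizon] by midpoint-rule integration."""
    h = horizon / steps
    return h * sum(intensity_with_pm((i + 0.5) * h, baseline, pm_actions)
                   for i in range(steps))


if __name__ == "__main__":
    no_pm = expected_failures(10.0, linear_baseline, [])
    with_pm = expected_failures(10.0, linear_baseline, [(5.0, 0.3)])
    print(no_pm, with_pm)
```

Comparing the saving in expected failure cost with the cost of the PM action, which increases with δ, gives a simple basis for choosing the PM level.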

Component level modeling. If a component of the asset fails and is non-repairable
and/or too costly to repair, then it is replaced by a new one. If the replacement time
is small relative to the mean time to failure, then it can be ignored and component
failures (over time) can be modeled by a renewal process (see Ross 1980). If the
component is repairable and costly and a failed component is subjected to minimal
repair, then failures (over time) can be modeled by a stochastic point process with
intensity function having the same form as the hazard function of the component.
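
The two component level models can be compared with a short sketch. It assumes Weibull lifetimes (an illustrative choice), estimates the mean number of replacements under the renewal model by simulation, and uses the cumulative hazard (t/scale)^shape for the expected number of failures under minimal repair. Function names and parameter values are hypothetical.

```python
import random


def simulate_renewal_count(horizon, scale, shape, rng):
    """Number of replacements by new in [0, horizon] for one sample path."""
    t, count = 0.0, 0
    while True:
        # random.weibullvariate takes the scale first, then the shape.
        t += rng.weibullvariate(scale, shape)
        if t > horizon:
            return count
        count += 1


def mean_renewals(horizon, scale, shape, runs=20000, seed=0):
    """Monte Carlo estimate of the expected number of renewals."""
    rng = random.Random(seed)
    total = sum(simulate_renewal_count(horizon, scale, shape, rng)
                for _ in range(runs))
    return total / runs


def expected_minimal_repairs(horizon, scale, shape):
    """Expected failures under minimal repair: the cumulative hazard."""
    return (horizon / scale) ** shape


if __name__ == "__main__":
    print(mean_renewals(5.0, 1.0, 2.0))
    print(expected_minimal_repairs(5.0, 1.0, 2.0))
```

For a component with an increasing hazard, minimal repair leaves the component old, so more failures are expected over the same horizon than under replacement by new.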
15.3.3 Contract

The contract is a legal document that is binding on both parties (customer and
service agent) and it needs to deal with technical, management and economic
issues.

Technical and Management Issues

Maintenance of an asset involves carrying out several activities as indicated in
Figure 15.4 (adapted from Dunn 1999). There are many different contract scenarios
depending on how these activities are outsourced. Table 15.1 indicates three
different scenarios (S-1 to S-3), defined by who makes each of the following decisions:
• (D-1). What (components) need to be maintained?
• (D-2). When should the maintenance be carried out?
• (D-3). How should the maintenance be carried out?





Figure 15.4. Activities in asset maintenance

Table 15.1. Different contract scenarios

Scenario   Decisions made by customer   Decisions made by service agent
S-1        D-1, D-2                     D-3
S-2        D-1                          D-2, D-3
S-3        -                            D-1, D-2, D-3

In scenario S-1, the service agent is only providing the resources (workforce
and material) to execute the work. This corresponds to the minimalist approach to
outsourcing. In scenario S-2, the service agent decides how and when the maintenance
is carried out, while the customer decides what is to be maintained. Finally, in scenario S-3 the service agent
makes all three decisions.
There is a growing trend towards functional guarantee contracts. Here the contract
specifies a level for the output generated from equipment, for example, the amount
of electricity produced by a power plant, or the total length of flights and number of
landings and takeoffs per year. The service agent has the freedom to decide on the
maintenance needed (subject to operational constraints), with incentives if the target
levels are exceeded and/or penalties if they are not met. For more on this, see Kumar and
Kumar (2004).
In the context of infrastructures, there is a trend towards giving the service
agent the responsibility for ongoing upgrades or the responsibility for the initial
design resulting in a BOOM (build, own, operate and maintain) contract.
The levels of risk to both parties vary with the contract scenario.

Economic Issues

There are a number of alternative contract payment structures. The following list is
from Dunn (1999):
• Fixed or firm price
• Variable price
• Price ceiling incentive
• Cost plus incentive fee
• Cost plus award fee
• Cost plus fixed fee
• Cost plus margin
Each of these price structures represents a different level of risk sharing between
the customer and the service agent. According to Vickerman (2004), an increasing
issue in privatized infrastructure is the appropriate incentives needed to ensure
adequate maintenance of the infrastructure as a public resource.

Other Issues

Some other issues are as follows:
Requirements. Both parties might need to meet some stated requirement. For
example, the customer needs to ensure that the stresses on the asset do not exceed
the levels specified in the contract as this can lead to greater degradation and
higher servicing costs to the service agent. Similarly, the service agent needs to en-
sure proper data recording.
Contract duration. This is usually fixed with options for renewal at the end of the
contract period.
Dispute resolution. This specifies the avenues to follow when there is a dispute.
Resolving the dispute can involve going to a third party (legal courts).
Unless the contract is written properly and relevant data (relating to equipment
and collected by the service agent) are analysed properly by the customer, the long-
term costs and risks will escalate.

15.3.4 Maintenance Outsourcing Market

Whether the maintenance outsourcing market is competitive or not depends on the
number of customers and service agents. Table 15.2 indicates the different market
scenarios. These have an impact on issues such as the types of service contracts
available to customers and the pricing of the contracts.
Table 15.2. Maintenance outsourcing market scenarios


ONE A-1 B-1
FEW A-2 B-2
MANY A-3 B-3

15.4 Review of Literature

There is a vast literature on maintenance and it covers a range of topics (approaches
to maintenance, mathematical models for deciding optimal maintenance, main-
tenance management, etc.). There are several review papers that have appeared over
the last 40 years and these include McCall (1965), Pierskalla and Voelker (1976),
Jardine and Buzacott (1985), Sherif and Smith (1986), Thomas (1986),
Valdez-Flores and Feldman (1989), Pintelon and Gelders (1992) and Scarf (1997). Cho and
Parlar (1991) and Dekker et al. (1997) deal with the maintenance of multi-component
systems. There are also several maintenance books. In contrast, the literature
on maintenance outsourcing is very limited and in this section we briefly review this
literature.

15.4.1 Maintenance Outsourcing

The literature deals with maintenance outsourcing mainly from the customer
perspective and is focussed on management issues. More specifically, attempts are
made to address one or more of the following questions in a qualitative manner:

1. Does outsourcing make sense?

2. Are the objectives achievable?
3. Is the organisation ready?
4. What are the outsourcing alternatives?
5. What maintenance activities should be outsourced?
6. How should the best service agent be selected?
7. What are the negotiating tactics for contract formation?

Some of the relevant papers are Campbell (1995), Judenberg (1994), Martin
(1997), Levery (1998) and Sunny (1995).
Unfortunately, cost has been the sole basis used by businesses for making
maintenance outsourcing decisions. Sunny (1995) looks at what activities are to be
outsourced by considering the long-term strategic dimension (core competencies) as well
as the short-term cost issues.
Bertolini et al. (2004) take a quantitative approach and use the analytic hierarchy
process (AHP) to make decisions regarding the outsourcing of maintenance.
Ashgarizadeh and Murthy (2000) and Murthy and Ashgarizadeh (1998, 1999)
look at maintenance outsourcing from both customer and service agent perspectives
and propose game-theoretic models to determine the optimal strategies for
both parties. This approach is discussed further in Section 15.5.
On the application side, Armstrong and Cook (1981) look at clustering of
highway sections for awarding maintenance contracts to minimise the cost and use
a fixed-charge goal programming model to determine the optimal strategy.
Bevilacqua and Braglia (2000) illustrate their AHP model in the context of an
Italian brick manufacturing business having to make decisions regarding main-
tenance outsourcing.
Stremersch et al. (2001) look at the industrial maintenance market.

15.4.2 Extended Warranties

The literature can be broadly divided into three groups.

Group 1: Warranty cost analysis

The cost analysis of many different types of basic warranties can be found in
Blischke and Murthy (1994, 1996). For a review of more recent literature, see
Murthy and Djamaludin (2002). These techniques can be easily extended to obtain
the costs for extended warranties and this has been done by Sahin and Polatoglu
(1998).

Group 2: Warranty Servicing Strategy

When a repairable asset fails under warranty, the manufacturer has the choice of
either repairing or replacing it with a new one. The first option costs less than the
second but a repaired asset has a greater probability of failing during the remainder
of the warranty period. It is therefore important for the manufacturer to choose an
appropriate servicing strategy in order to minimise the expected cost of servicing
the warranty per asset sold.
Servicing strategies for products sold with one-dimensional warranties have
received considerable attention. Biedenweg (1981) and Nguyen and Murthy (1986,
1989) assume that repaired items have independent and identically distributed
lifetimes different from that of a new item and consider strategies where the
warranty period is divided into distinct intervals for repair and replacement.
Nguyen (1984) introduces the first servicing model with minimal repair (see
Barlow and Hunter 1960), with the warranty period split into a replacement
interval followed by a repair interval. The length of the first interval is selected
optimally to minimize the expected warranty cost.
Jack and Van der Duyn Schouten (2000) show that this strategy is sub-optimal
and that the optimal servicing strategy is in fact characterized by three distinct
intervals – [0, x), [x, y] and (y, W] where W is the warranty period. The optimal
strategy is to carry out minimal repairs in the first and last intervals and to use
either minimal repair or replacement by new in the middle interval depending on
the age of the item at failure. This strategy is difficult to implement, so Jack and
Murthy (2001) propose a near optimal strategy involving the same three intervals
but with only the first failure in the middle interval resulting in a replacement and
all other failures being minimally repaired.
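The near-optimal three-interval strategy can be illustrated with a small Monte Carlo sketch. It is an illustration only, not the model of Jack and Murthy (2001): the Weibull failure distribution, the parameter values, the cost names c_repair and c_replace, and the function names are all assumptions introduced here. Minimal repair is represented by leaving the age of the item unchanged, so that the next failure is drawn from the hazard function at the current age.

```python
import random


def servicing_cost(W, x, y, shape, scale, c_repair, c_replace, rng):
    """Simulate the servicing cost of one item over the warranty period [0, W].

    Assumed strategy: the first failure in [x, y] is rectified by
    replacement, every other failure by minimal repair.
    Assumed failure model: Weibull time to first failure.
    """
    t = 0.0           # time since sale
    age = 0.0         # age of the item currently in service
    replaced = False
    cost = 0.0
    while True:
        # Next failure age: the cumulative hazard grows by an Exp(1) amount.
        e = rng.expovariate(1.0)
        next_age = scale * ((age / scale) ** shape + e) ** (1.0 / shape)
        t += next_age - age
        age = next_age
        if t > W:
            return cost
        if x <= t <= y and not replaced:
            cost += c_replace
            age = 0.0             # replacement by new resets the age
            replaced = True
        else:
            cost += c_repair      # minimal repair leaves the age unchanged


def expected_cost(n_runs=20000, seed=1, **params):
    """Average simulated servicing cost per item sold."""
    rng = random.Random(seed)
    return sum(servicing_cost(rng=rng, **params) for _ in range(n_runs)) / n_runs


if __name__ == "__main__":
    base = dict(W=2.0, shape=2.0, scale=1.0, c_repair=1.0, c_replace=3.0)
    # An empty middle interval (x > y) gives a pure minimal-repair strategy.
    print("repair only      :", expected_cost(x=2.0, y=0.0, **base))
    print("replace in [x, y]:", expected_cost(x=0.8, y=1.4, **base))
```

Comparing the two printed averages for different choices of x and y gives a rough feel for how the interval limits trade replacement cost against later repair costs; a proper optimisation would search over x and y.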

Servicing strategies for products sold with two-dimensional warranties have
been studied by Iskandar and Murthy (2003) who propose two strategies similar to
those from Nguyen and Murthy (1986, 1989) but with minimal repair. Iskandar et
al. (2005) deal with a servicing strategy similar to that given in Jack and Murthy
(2001).
When the cost of replacement is high compared to the cost of a minimal repair
then strategies involving replacement are not appropriate. In this case, strategies
involving imperfect repair (where the failure characteristics of the repaired asset
are better than those after minimal repair but are not the same as a new item) are
more appropriate. The advantage of using imperfect repair is that the degree of
improvement in the reliability after repair is a decision variable under the control
of the manufacturer. Yun et al. (2006) discuss this topic.
Every EW provider also needs to choose appropriate servicing strategies to
minimise the costs of servicing the EWs that they have sold. The techniques that
have been developed for basic warranties can easily be adapted to the EW case.

Group 3: Market for EWs

There are a number of studies that have been carried out to show how EWs can be
used as a tool for market segmentation. Unfortunately, most of the failure modeling
used in these studies is static in nature: the asset either functions or does not
function properly during the EW period. Padmanabhan and Rao (1993) consider
strategies that manufacturers should adopt for warranty provision when consumers
vary in risk attitude and consumer moral hazard is also present. Moral hazard
problems occur when consumers who have purchased EWs reduce their level of
maintenance effort and this causes increased servicing costs to EW providers. Lutz
and Padmanabhan (1994) look at the effect of income variation on EW purchasing
and Padmanabhan (1995) and Hollis (1999) consider heterogeneity in consumer
usage. Lutz and Padmanabhan (1998) investigate differences in consumers’ valu-
ations of a working asset and the effect of independent EW providers in the
market. Desai and Padmanabhan (2004) consider the impact of different distri-
butional arrangements for the sale of assets and their EWs.

15.5 Game Theoretic Approach

In the game theoretic approach, the outsourcing problem is viewed as a game with
two players – customer and service agent. Each player has his/her own goal or
objective and a set of decisions that need to be selected optimally. There are
several different scenarios depending on whether there is a dominant player (a
leader-follower situation where the actions of the follower depend on the actions of
the leader – referred to as a “Stackelberg game formulation”) or there isn’t (both
players decide on their actions either in a cooperative or non-cooperative mode –
referred to as a “Nash game formulation”), and also on the kinds of information
available to each player and their attitudes to uncertainty and risk. This approach
allows maintenance outsourcing to be studied from both customer and service
agent perspectives.

15.5.1 Maintenance Outsourcing

Consider the case where the service agent is the leader and offers $n$ options
$A_i(\theta_i)$, $1 \le i \le n$, to the customer, where $\theta_i$, $1 \le i \le n$, are the decision variables
corresponding to the different options that the agent needs to select optimally. As
an illustrative case, let $n = 2$ and the two options that the service agent offers for
CM actions are as follows:
Option 1 [Fixed Price Service Contract – $A_1(\theta_1)$]: For a fixed price $P$, the
service agent agrees to rectify all failures occurring over a period $L$ at no
additional cost to the customer. If a failure is not rectified within a period $\tau$, the
service agent incurs a penalty. If $Y$ denotes the time for which the equipment is in
the non-operational state before it becomes operational, then the penalty incurred is
given by $\max\{0, \alpha(Y - \tau)\}$, where $\alpha$ is the penalty cost per unit time. This
ensures that the service agent does not deprive the customer of the use of the
equipment for too long. Here, $\theta_1 = \{P, \tau, \alpha\}$.
Option 2 [Pay for each repair contract – $A_2(\theta_2)$]: In this case, whenever a
failure occurs, the service agent charges an amount $C_s$ for each repair and does not
incur any penalty if the equipment is in the non-operational state for longer than $\tau$
units of time. Here, $\theta_2 = \{C_s\}$.
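The penalty term of Option 1 is straightforward to evaluate once a distribution for the downtime Y is chosen. The sketch below assumes, purely for illustration, that Y is exponentially distributed with rate mu; this assumption and all function names are introduced here and are not part of the text. Under it, the expected penalty per failure is alpha * exp(-mu * tau) / mu, and a simulation is included as a cross-check.

```python
import math
import random


def penalty(y, tau, alpha):
    """Penalty max{0, alpha * (y - tau)} for a single downtime y."""
    return max(0.0, alpha * (y - tau))


def expected_penalty_exponential(tau, alpha, mu):
    """Closed form of E[penalty] when Y ~ Exponential(rate=mu) (assumed)."""
    return alpha * math.exp(-mu * tau) / mu


def simulated_expected_penalty(tau, alpha, mu, n_runs=200000, seed=1):
    """Monte Carlo estimate of the same expectation."""
    rng = random.Random(seed)
    total = sum(penalty(rng.expovariate(mu), tau, alpha) for _ in range(n_runs))
    return total / n_runs


if __name__ == "__main__":
    tau, alpha, mu = 1.0, 2.0, 1.0
    print("closed form:", expected_penalty_exponential(tau, alpha, mu))
    print("simulated  :", simulated_expected_penalty(tau, alpha, mu))
```

A larger tau or a faster repair rate mu lowers the agent's expected penalty, which is the lever the contract uses to discourage long downtimes.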
In the Stackelberg game formulation, given the set of options (along with the
values for the decision variables of the service agent), the customer chooses the
best option to optimize his/her goal. This generates the optimal response function
$A^*(\theta_1, \theta_2, \ldots, \theta_n)$ as shown in Figure 15.5. Using this, the service agent then
optimally selects the decision variables to optimize his/her objective.

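Figure 15.5. Stackelberg game formulation (options $A_i(\theta_i)$, $1 \le i \le n$, offered by the service agent; optimal response $A^*(\theta_1, \theta_2, \ldots, \theta_n)$ of the customer)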
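The leader-follower logic can be sketched as a grid search in which the service agent evaluates each candidate (P, Cs), predicts the customer's best response, and keeps the most profitable offer. Everything specific in this sketch is an assumption made for illustration: both players are taken to be risk neutral, failures occur at a constant rate lam over the period L, the penalty is assumed to be paid to the customer with expected value pen per failure, c_repair is a hypothetical repair cost to the agent, and reserve is a hypothetical cost the customer bears if no contract is taken.

```python
from itertools import product


def customer_best_response(P, Cs, lam, L, pen, reserve):
    """Follower: choose the cheapest of Option 1, Option 2 or no contract (0)."""
    expected_costs = {
        1: P - lam * L * pen,   # fixed price less expected penalty received
        2: Cs * lam * L,        # pay for each repair
        0: reserve,             # hypothetical cost of having no contract
    }
    return min(expected_costs, key=expected_costs.get)


def agent_profit(choice, P, Cs, lam, L, pen, c_repair):
    """Leader's expected profit for a given customer choice."""
    if choice == 1:
        return P - lam * L * (c_repair + pen)
    if choice == 2:
        return lam * L * (Cs - c_repair)
    return 0.0


def solve_stackelberg(P_grid, Cs_grid, lam, L, pen, c_repair, reserve):
    """Return (profit, P, Cs, choice) for the agent's best offer on the grid."""
    best = None
    for P, Cs in product(P_grid, Cs_grid):
        choice = customer_best_response(P, Cs, lam, L, pen, reserve)
        profit = agent_profit(choice, P, Cs, lam, L, pen, c_repair)
        if best is None or profit > best[0]:
            best = (profit, P, Cs, choice)
    return best


if __name__ == "__main__":
    print(solve_stackelberg(range(0, 16), range(0, 7),
                            lam=2.0, L=1.0, pen=0.5, c_repair=1.0, reserve=10.0))
```

With risk-neutral players the leader simply prices each option up to the customer's reserve cost, so the two options earn the agent the same profit. Differences between the options only emerge once attitudes to risk and information are modelled, which is why the text lists them as factors that distinguish the scenarios.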

Murthy and Asgharizadeh (1998, 1999) and Asgharizadeh and Murthy (2000)
use a Stackelberg game formulation for a special case where the time between
equipment failures is given by an exponential distribution so that the failures over
time occur according to a Poisson process. They consider the two options dis-
cussed earlier and consider the following three cases:

1. Single service agent and single customer (Case A-1)
2. Single service agent, multiple customers and one repair facility, so that
only one failed item of equipment can be repaired at any given time (Case A-2)
3. Single service agent, multiple customers and more than one repair
facility (Case A-3)

In case 2 the service agent has to decide the optimal number of customers to
service, and in case 3 the optimal number of repair facilities.
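The failure model used in these papers, exponential times between failures leading to a Poisson process, can be checked numerically. The following sketch counts failures over a contract period and compares the sample with the Poisson distribution with mean equal to rate times horizon. The function names and parameter values are illustrative assumptions, and repair times are taken to be negligible so that the failure clock never stops.

```python
import math
import random


def count_failures(rate, horizon, rng):
    """Number of failures in [0, horizon] with Exponential(rate) gaps."""
    t, n = 0.0, 0
    while True:
        t += rng.expovariate(rate)
        if t > horizon:
            return n
        n += 1


def poisson_pmf(k, mean):
    """P(N = k) for a Poisson random variable with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def sample_counts(rate, horizon, n_runs=50000, seed=1):
    """Simulate the failure count for n_runs independent contracts."""
    rng = random.Random(seed)
    return [count_failures(rate, horizon, rng) for _ in range(n_runs)]


if __name__ == "__main__":
    rate, horizon = 0.5, 4.0
    counts = sample_counts(rate, horizon)
    print("sample mean :", sum(counts) / len(counts))
    print("Poisson mean:", rate * horizon)
    for k in range(4):
        observed = counts.count(k) / len(counts)
        print(k, round(observed, 4), round(poisson_pmf(k, rate * horizon), 4))
```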

15.5.2 Extended Warranties

Jack and Murthy (2006) consider the case where the product is complex and so the
specialist knowledge of the manufacturer is required to carry out any repairs after
the base warranty expires. The consumer must decide how long to keep the item
and how to maintain it until replacement. Two maintenance options are available:
the consumer can (i) pay the manufacturer to repair the item each time it fails, or
(ii) purchase an extended warranty (EW) from the manufacturer. These are similar
to Options 2 and 1 respectively, discussed earlier. The EW contract specifies that
the manufacturer will again rectify all failures free of charge to the consumer. The
consumer has flexibility in choosing when the EW will begin and the length of
cover. The price of the EW depends on these two variables and is set by the
manufacturer. The manufacturer also has to decide the price of each repair if the
item fails and the consumer does not have an EW. A Stackelberg game formulation
is used to determine the optimal strategies for both the consumer and the manufacturer.

15.6 Agency Theory (The Principal – Agent Problem)

Agency theory deals with the relationship that exists between two parties (a princi-
pal and an agent) where the principal delegates work to the agent who performs
that work and a contract defines the relationship. Agency theory is concerned with
resolving two problems that can occur in agency relationships.
The first problem arises when the two parties have conflicting goals and it is
difficult or expensive for the principal to verify the actual actions of the agent and
whether the agent has behaved properly or not. The second problem involves the
risk sharing that takes place when the principal and agent have different attitudes to
risk (due to various uncertainties).
According to Eisenhardt (1989), the focus of the theory is on determining the
optimal contract, behaviour vs. outcome, between the principal and the agent.
Many different cases have been studied in depth in the principal-agent literature
and these deal with the range of issues indicated in Figure 15.6. Agency theory has
also been applied in many different disciplines. For an overview see Van Ackere (1993).

Figure 15.6. Issues in agency theory

15.6.1 Issues in Agency Theory

Moral hazard. Moral hazard refers to lack of effort (or shirking) on the part of the
agent. The agent does not put in the agreed-upon effort because the objectives of
the two parties are different and the principal cannot assess the level of effort that
the agent has actually used.
Adverse selection. Adverse selection refers to any misrepresentation of ability by
the agent that the principal is unable to verify completely before deciding to
hire the agent.
Information. To counteract adverse selection, the principal can invest in getting
information about the agent’s ability. One way of getting the desired information is
by contacting people for whom the agent has provided service in the past.
Monitoring. The principal can counteract the moral hazard problem by monitoring
the actions of the agent. Monitoring provides information about the agent’s actual actions.
Information asymmetry. There are several uncertainties that affect the overall
outcome of the relationship. The two parties, in general, will have different infor-
mation to make an assessment of these uncertainties and will also differ in terms of
other information.
Risk. This results from the different uncertainties that affect the outcome of the
relationship. The risk attitudes of the two parties, in general, will differ for a variety
of reasons. A problem arises when the two parties disagree over the allocation of
risk between them.
Costs. There are various kinds of costs for both parties. Some of these depend on
the outcome (which is influenced by uncertainties), while others are incurred in
acquiring information, monitoring and administering the contract. The heart of
principal-agent theory is the trade-off between (i) the cost of monitoring the
actions of the agent and (ii) the cost of measuring the outcomes of the relationship
and transferring risk to the agent.
Contract. The design of the contract that takes into account the issues discussed
above is the challenge that lies at the heart of the principal-agent relationship.

15.6.2 Relevance to Maintenance Outsourcing and Extended Warranties

Maintenance Outsourcing

Outsourcing of maintenance involves all the Agency Theory issues discussed in
Section 15.6.1 with the customer as the principal and the maintenance service
provider as the agent. The key factor is the contract that specifies what, when, and
how maintenance is to be carried out. This needs to be designed taking into
account all the various issues. Kraus (1996) reviews the literature on incentive contracting.
The customer and service agent both potentially face moral hazard. This can
occur for the customer when the service agent shirks to reduce costs and doesn’t do
proper maintenance and it can occur for the agent when the customer uses the asset
in a manner different to that stated in the contract. Adverse selection can also take
place when the customer chooses from a pool of potential maintenance service
providers (the B scenarios in Table 15.2). The two parties have different infor-
mation about asset state, usage level, care and attention of the asset, and quality of
maintenance used and this asymmetry will affect the outcome of their relationship.
The different market scenarios for maintenance outsourcing are as indicated in
Table 15.2. In scenario A-1, the classical principal-agent model discussed in
Section 15.6.1 is appropriate with a single principal (customer) and a single agent
(maintenance provider). This could be a large business unit, for example.
In the remaining five scenarios, there are multiple principals and/or multiple
agents. In scenarios A-2 and A-3, the equipment under consideration could be a
particular brand of lift installed in different buildings within a city. In this case, all
the equipment is maintained either by the OEM or an agent of the OEM. There is
an extensive literature dealing with the design of contracts for multiple principal/
multiple agent problems (Macho-Stadler and Perez-Castrillo 1997 and Laffont and
Martimort 2002 are two examples from this literature) and all
the issues from Section 15.6.1 are still relevant. The principal-agent models that
have been studied in the literature are static in nature and new, dynamic models
need to be formulated so that they can be applied meaningfully in the context of
maintenance outsourcing.

Extended Warranties

This case is similar to A-3. In the case of standard commercial and industrial
products and consumer durables, the EW policy is decided by the EW provider and
the customer does not have any direct input. The issues (such as moral hazard,
adverse selection, risk, monitoring, etc.) from agency theory are all relevant for EW
policies. Current EW offerings lack flexibility from the customer point of view and
there is a perception (amongst customers and EW regulators) that the pricing of
EWs is not fair. This provides an opportunity for EW providers to offer flexible
warranties to meet the different needs across the customer population. Agency
theory offers a framework to evaluate the costs of different policies taking into
account all the relevant issues.

15.7 Conclusion and Topics for Future Research

In this chapter we have proposed a framework to look at maintenance outsourcing
from both the equipment owner (customer for maintenance service) and the service
agent (maintenance service provider) perspectives. A review of the literature indi-
cates that the bulk of it is qualitative with only very few papers dealing with the
topic in a more quantitative manner. Also, not all the relevant issues have been
addressed effectively. Agency theory provides an approach to address all these
issues in a unified manner. This will require building new models and offers scope
for a lot of new research in the future.
The provision of extended warranties is very similar to maintenance out-
sourcing. We have highlighted this link and have also discussed the concept of
flexible EWs. The framework proposed in this chapter combined with Agency
theory can be used by EW providers to obtain better estimates of the cost of offer-
ing different EW options in a more objective and scientific manner where all the
various issues such as moral hazard, adverse selection, risk, etc., are taken into
account. Again, there is considerable scope for further research on EWs.

15.8 References
Armstrong, R.D. and Cook, W.D. (1981), The contract formation problem in preventive
pavement maintenance: A fixed-charge goal-programming model, Comp. Environ.
Urban Systems, 6, 147–155
Asgharizadeh, E. and Murthy, D.N.P. (2000), Service contracts – a stochastic model,
Mathematical and Computer Modelling, 31, 11–20
Barlow, R.E. and Hunter, L.C. (1960), Optimum preventive maintenance policies,
Operations Research, 8, 90–100
Bertolini, M., Bevilacqua, M., Braglia, M. and Frosolini, M. (2004), An analytical method
for maintenance outsourcing service selection, International Journal of Quality &
Reliability Management, 21, 772–788
Bevilacqua, M. and Braglia, M. (2000), The analytic hierarchy process applied to
maintenance strategy selection, Reliability Engineering & System Safety, 70, 71–83.
Biedenweg, F. M. (1981), Warranty Analysis: Consumer Value vs. Manufacturers Cost,
Unpublished Ph.D. Thesis, Stanford University, U.S.A.
Blischke, W.R. and Murthy, D.N.P. (1994), Warranty Cost Analysis, Marcel Dekker, New York
Blischke, W.R. and Murthy, D.N.P. (1996), Product Warranty Handbook, Marcel Dekker,
New York
Blischke, W.R. and Murthy, D.N.P. (2000), Reliability, Wiley, New York
Campbell, J.D. (1995), Outsourcing in maintenance management: a valid alternative to self-
provision, Journal of Quality in Maintenance Engineering, 1, 18–24.
Cho, D. and Parlar, M. (1991), A survey of maintenance models for multi-unit systems,
European Journal of Operational Research, 51, 1–23.
Cox, D.R. and Oakes, D. (1984), Analysis of Survival Data, Chapman and Hall, New York
Day, E. and Fox, R.J. (1985), Extended warranties, service contracts and maintenance
agreements – A marketing opportunity? Journal of Consumer Marketing, 2, 77–86
Dekker, R., Wildeman, R.E. and van der Duyn Schouten, F.A. (1997), Review of multi-
component models with economic dependence, Zor/Mathematical Methods of
Operations Research, 45, 411–435.
Desai, P.S. and Padmanabhan, V. (2004), Durable good, extended warranty and channel
coordination. Review of Marketing Science, 2, Article 2, available at
Dunn, S. (1999), Maintenance outsourcing – Critical issues, available at: www.plant-
Eisenhardt, K.M. (1989), Agency theory: An assessment and review, The Academy of
Management Review, 14, 57–74
Embleton, P.R. and Wright, P.C. (1998), A practical guide to successful outsourcing,
Empowerment in Organizations, 6, 94–106
Eppen, G.D., Hanson, W.A. and Martin, R.K. (1991), Bundling – new products, new
markets, low risks, Sloan Management Review, Summer, 7–14
Hollis, A. (1999), Extended warranties, adverse selection and aftermarkets. The Journal of
Risk and Insurance, 66, 321–343
Iskandar, B.P., and Murthy, D.N.P. (2003), Repair-replace strategies for two-dimensional
warranty policies, Mathematical and Computer Modelling, 38, 1233–1241
Iskandar, B.P., Murthy, D.N.P. and Jack, N. (2005), A new repair-replace strategy for items
sold with a two-dimensional warranty, Computers and Operations Research, 32,
Jack, N. and Murthy, D.N.P. (2001), A servicing strategy for items sold under warranty, Jr.
Oper. Res. Soc., 52, 1284–1288
Jack, N. and Murthy, D.N.P. (2006), A Flexible Extended Warranty and Related Optimal
Strategies, Jr. Oper. Res. Soc. (accepted for publication)
Jack, N. and Van der Duyn Schouten, F. (2000), Optimal repair-replace strategies for a
warranted product, Int. J. Production Economics, 67, 95–100
Jardine, A.K.S. and Buzacott, J.A. (1985), Equipment reliability and maintenance, European
Journal of Operational Research, 19, 285–296.
Judenberg, J. (1994), Applications maintenance outsourcing, Information Systems
Management, 11, 34–38
Khosrowpour, M. (ed) (1995), Managing Information Technology Investments with
Outsourcing, Idea Group Publishing, Harrisburg
Kraus, S. (1996), An overview of incentive contracting, Artificial Intelligence, 83, 297–346
Kumar, R. and Kumar, U. (2004), Service delivery strategy: Trends in mining industries, Int.
J. Surface Mining, Reclamation and Environment, 18, 299–307
Laffont, J. and Martimort, D, (2002) The Theory of Incentives: the Principal-Agent Model,
Princeton University Press
Levery, M. (1998), Outsourcing maintenance: a question of strategy, Engineering
Management Journal, February, 34–40.
Lutz, N.A. and Padmanabhan, V. (1994), Income variation and warranty policy. Working
Paper, Graduate School of Business, Stanford University.
Lutz, N.A. and Padmanabhan, V. (1998), Warranties, extended warranties and product
quality. International Journal of Industrial Organization, 16, 463–493.
Macho-Stadler, I. and Perez-Castrillo, D. (1997), An Introduction to the Economics of
Information, Oxford University Press
Martin, H.H. (1997), Contracting out maintenance and a plan for future research, Journal of
Quality in Maintenance Engineering, 3, 81–90
McCall, J.J. (1965), Maintenance policies for stochastically failing equipment: A survey,
Management Science, 11, 493–524.
Murthy, D.N.P. and Asgharizadeh, E. (1998), A stochastic model for service contract, Int. Jr.
of Reliability Quality and Safety Engineering, 5, 29–45
Murthy, D.N.P. and Asgharizadeh, E. (1999), Optimal decision making in a maintenance
service operation, European Journal of Operational Research, 116, 259–273
Murthy, D.N.P. and Djamaludin, I. (2002), Product warranty – A review, International
Journal of Production Economics, 79, 231–260
Nguyen, D.G. (1984), Studies in Warranty Policies and Product Reliability. Unpublished
Ph.D. Thesis, The University of Queensland, Australia.
Nguyen, D.G. and Murthy, D.N.P. (1986), An optimal policy for servicing warranty, Jr.
Oper. Res. Soc., 37, 1081–1088
Nguyen, D.G. and Murthy, D.N.P. (1989), Optimal replace-repair strategy for servicing
items sold with warranty, Euro. Jr. of Oper. Res., 39, 206–212
Padmanabhan, V. (1995), Usage heterogeneity and extended warranties. Journal of
Economics and Management Strategy, 4, 33–53
Padmanabhan, V. (1996), Extended warranties, in Product Warranty Handbook, W.R.
Blischke and D.N.P. Murthy (eds), Marcel Dekker, New York
Padmanabhan, V. and Rao, R.C. (1993), Warranty policy and extended warranties: theory
and an application to automobiles. Marketing Science, 12, 230–247
Pierskalla, W.P. and Voelker, J.A. (1976), A survey of maintenance models: The control and
surveillance of deteriorating systems, Naval Research Logistics Quarterly, 23, 353–388.
Pintelon, L.M. and Gelders, L. (1992), Maintenance management decision making,
European Journal of Operational Research, 58, 301–317.
Rigdon, S.E. and Basu, A.P. (2000), Statistical Methods for the Reliability of Repairable
Systems, Wiley, New York
Ross, S.M. (1980), Stochastic Processes, Wiley, New York
Sahin, I. and Polatoglu, H. (1998), Quality, warranty and preventive maintenance. Kluwer:
Scarf, P.S. (1997), On the application of mathematical models to maintenance, European
Journal of Operational Research, 63, 493–506.
Sherif, Y.S. and Smith, M.L. (1986), Optimal maintenance models for systems subject to
failure – A review, Naval Research Logistics Quarterly, 23, 47–74.
Stremersch, S., Wuyts, S. and Frambach, R.T. (2001), The purchasing of full-service
contracts: An exploratory study within the industrial maintenance market, Industrial
Marketing Management, 30, 1–12
Sunny, I. (1995), Outsourcing maintenance: making the right decisions for the right reasons,
Plant Engineering, 49, 156–157.
Thomas, L.C. (1986), A survey of maintenance and replacement models for maintainability
and reliability of multi-item systems, Reliability Engineering, 16, 297–309
UK Competition Commission (2003): A report into the supply of extended warranties on
domestic electrical goods within the UK, available at:
Valdez-Flores, C. and Feldman, R.M. (1989), A survey of preventive maintenance models
for stochastically deteriorating single-unit systems, Naval Research Logistics Quarterly,
36, 419–446.
Van Ackere, A. (1993), The principal-agent paradigm: Its relevance to various functional
fields, European Journal of Operational Research, 70, 83–103
Vickerman, R. (2004), Maintenance incentives under different infrastructure regimes,
Utilities Policy, 12, 315–322
Yun, W.Y., Murthy, D.N.P. and Jack, N. (2006), Warranty servicing with imperfect repair,
Submitted for publication

Maintenance of Leased Equipment

D.N.P. Murthy and J. Pongpech

16.1 Introduction
Businesses need equipment to produce their outputs (goods/services). Equipment
degrades with age and usage, and eventually fails (Blischke and Murthy 2000).
This impacts business performance in several ways – reduced equipment avail-
ability, lower output quality, higher operating costs, increased customer dissatisfac-
tion, etc. The degradation can be controlled through preventive maintenance (PM)
actions whilst corrective maintenance (CM) actions restore failed equipment to its
working state.
Prior to 1970, businesses owned the equipment, and maintenance was done in
house. Since 1970, there has been a shift towards outsourcing of maintenance. This
was primarily due to a change in the management paradigm where activities in a
business were classified as either core or non-core, with the non-core activities to
be outsourced to external agents if this was deemed to be cost effective. Also, as
technology became more complex it was no longer economical to carry out in-
house maintenance due to the need for expensive maintenance equipment and
highly trained maintenance staff.
Since 1990, there has been an increasing trend towards leasing rather than
owning equipment. According to Fishbein et al. (2000) there are several reasons
for this. Some of these are as follows:
• Rapid technological advances have resulted in improved equipment ap-
pearing on the market, making the earlier generation equipment obsolete at
an ever-increasing pace.
• The cost of owning equipment has been increasing very rapidly.
• Businesses view maintenance as a non-core activity.
• It is often economical to lease equipment, rather than buy, as this involves
less initial capital investment and often there are tax benefits that make it attractive.

In the USA, the Equipment Leasing Association (ELA) conducted a survey in
2002 (ELA 2002a) and the findings were as follows:
• 80% of businesses acquire equipment through leasing.
• Leasing accounts for roughly 30% of business capital investment.
• Nearly 50% of office equipment is leased.
• Leasing companies own more equipment than companies in other US industries.
The leasing industry grew from 1990 until the last quarter of 2001, when it
experienced an economic downturn due to the impact from 9/11. In 2002, the
predictions made by the Department of Commerce for equipment leasing volume
for 2003 and 2004 were $208 and $218 billion respectively.
The ELA Online Focus Groups Report (ELA 2002b) states that 60% of leasing
benefits come from maintenance options. This is because some equipment leases
come with maintenance as an integral part of the lease so that the physical equip-
ment is bundled with maintenance service and offered as a package under a lease
contract. This implies that the lessee can focus on the core activities of the business
and not be distracted by equipment maintenance.
Maintenance of leased equipment raises several new issues for both the lessor
and the lessee (Desai and Purohit 1998; Kleiman 2001). The strategic issues deal
with the size and composition of the equipment fleet, the number and the location
of lease centers, workshop facilities, warehouse for spares, etc. The operational
issues include logistics, pricing, marketing, and maintenance strategies. In this
chapter we touch on these issues and then focus our attention on maintenance
strategies for leased equipment.
The outline of the chapter is as follows. Section 16.2 starts with a general intro-
duction to equipment leasing and then the different types of leases are discussed.
Section 16.3 deals with a framework to study equipment leasing and reviews the
relevant literature. In Section 16.4, we look at the maintenance of equipment under
operational lease. We discuss the modeling issues and propose various main-
tenance policies. Section 16.5 looks at the analysis of two of these policies and the
optimal selection of the policy parameters. We conclude with a brief discussion of
topics for future research in Section 16.6. We use the following abbreviations and notation.

AFT: Accelerated failure time
PH: Proportional hazard
NHPP: Non-homogeneous Poisson process
ROCOF: Rate of occurrence of failure
CM: Corrective maintenance
PM: Preventive maintenance

$F(t)$: Failure distribution for the time to first failure of new equipment
$f(t)$, $r(t)$: Failure density and hazard functions associated with $F(t)$
$\lambda_0(t)$: Intensity function with only CM actions
$\lambda(t)$: Intensity function with both CM and PM actions
$A$: Age of used equipment
$x$: Reduction in age with PM action
$L$: Duration of lease period
$\delta_j$: Reduction in intensity function with $j$-th PM action
$t_j$: Time instant of $j$-th PM action
$N(L)$: Number of equipment failures over the lease period
$Y$: Time to carry out minimal repair (random variable)
$G(y)$: Distribution function for $Y$
$\gamma$, $\tau$: Parameters of penalty cost
$C_p(\delta)$: Cost of PM action with reduction in intensity function $\delta$
$C_u(x)$: Cost of PM action with reduction in virtual age $x$
$C_f$: Mean cost of a CM action (minimal repair)
$C_n$: Penalty cost per failure (when number of failures exceeds $\gamma$)
$C_t$: Penalty cost per unit time (when repair time exceeds $\tau$)
16.2 Equipment Leasing

16.2.1 Lease Definition

A lease is a contractual agreement under which the owner of equipment (referred
to as the “lessor”) allows another person (referred to as the “lessee”) to operate the
equipment for a stated period of time and under specified conditions. Examples of
equipment can include aircraft, computers, telecommunications equipment, hospi-
tal equipment, office equipment, cars, forklifts, etc.

16.2.2 Types of Leases

There are several types of leases but, unfortunately, there is no standard terminol-
ogy. The terms used in the USA often differ from those used in the UK. We briefly
discuss the three main types.

Operating Lease

In an operating lease the lessee pays the lessor for the use of equipment over a
specified period. Usually, new equipment (for example, cars) is leased with an
operating lease but in some cases used equipment is also leased with this type of
lease. The lease period is much shorter than the equipment’s expected useful life.
At the end of the lease period, the lessor retains ownership of the equipment and
can renew the lease contract (if the lessee is interested), lease the equipment to
some other lessee, or sell the equipment as second-hand equipment. Additional
services, such as operator training (to ensure that the leased item is operated pro-
perly – for example, the leasing of specialized industrial equipment) and main-
tenance (to ensure that the equipment is in a proper operating condition and meets
the requirements stated in the lease contract), are provided by the lessor as part of
the lease contract. This kind of lease is also referred to as a “true” lease. In the
USA, the Internal Revenue Code defines a true lease as a transaction that allows
the lessor to claim ownership and the lessee to claim rental payments as tax deductions.
The advantages and disadvantages of an operating lease from the lessee’s per-
spective are as follows:

Advantages:
• The lessee can obtain new equipment (based on the latest technologies) and
thus avoid the risks associated with equipment obsolescence.
• The lessee usually gets maintenance and other support from the lessor so
that the business can focus on core activities.
• Equipment disposal is the lessor’s responsibility.

Disadvantages:
• If the lessee’s needs change over the lease period, then premature termination
of the lease agreement can incur penalties.
• There is a risk that the lessor does not provide the level of maintenance
needed.

Finance Lease

In a finance lease, the lessee pays the lessor for the use of equipment over a
specified period. At the end of the lease period, the lessee gets the ownership of the
equipment either at no cost or at a previously established price. The total payments
by the lessee must cover the lessor’s initial investment (for acquiring the
equipment) and the profit margin. The type of equipment leased under this type of
lease can vary from very expensive industrial and commercial equipment (such as
a financial institution leasing aircraft to an airline operator) to less expensive con-
sumer products (banks or retailers leasing domestic appliances, cars, etc. to con-
sumers who own the equipment at the end of the lease). This type of lease is also
referred to as a “capital” or “full payout” lease.
The advantages and disadvantages of a finance lease from the lessee’s perspective
are as follows:

Advantages:

• The lessee is able to spread the payments over the lease period (no need for
initial cash at purchase).
• It offers greater flexibility as the lessee can choose from a range of lease
options, especially in the consumer product market when there are
several institutions offering different types of leases.

Disadvantages:

• If the lessee fails to make lease payments as per schedule, the leased equipment
can be repossessed and sold by the lessor to recover the payments.
Maintenance of Leased Equipment 399

• Maintenance is often not a part of the lease agreement so that the lessee has
to provide for this separately.
• The overall cost to the lessee is significantly higher than the purchase price of
the equipment because the payments include not only the financing costs,
but also other costs associated with insurance, taxes, etc.

Sale and Leaseback

Under a sale and leaseback lease, the owner sells the equipment to a lessor (usually
a finance company) and immediately leases it back without ever surrendering the use of
the equipment. The maintenance is carried out either by the lessee or some third party.
This type of lease is used mainly for infrastructure assets such as rail transport,
electricity, sewerage and water pipe networks, etc.
The main benefit of using such a lease is that both the lessor and the lessee are
eligible for tax deductions.

Other Types of Leases

For a discussion of other types of leases see Coyle (2000) and ELA (2005).

16.3 A Framework for Study of Equipment Lease

A framework for the study of equipment leasing involves several key elements and
these are shown in Figure 16.1. We discuss each of these briefly.






Figure 16.1. Framework for study of equipment lease

Customer: The customer is the lessee. The lessee can be an individual (purchasing
a car under finance lease), a business (operating industrial or commercial equip-
ment under operational lease) or a government agency (responsible for operating
an infrastructure, such as train network, under a buyback lease).
Equipment: Equipment can be an infrastructure (for example, parts of road net-
work, railway network, sewerage and water network, electricity network, etc.);
industrial equipment (for example, trucks, cranes, plant machinery, etc.); commer-
cial equipment (for example, office furniture, vending machines, photocopiers,
etc.) and, consumer products (for example, refrigerators, computers, etc.). The cost
of the equipment (or asset) can vary significantly. Ezzel and Vora (2001) give
some interesting statistics relating to sale and leaseback, and operating leases in the
USA over the period 1984–1991.
Owner: The owner is a person or agency that owns the equipment from a legal
point of view. In the case of a finance lease, the financial institution is the owner as
the equipment is mortgaged to the institution.
Service provider: In the case of an operating lease, the lessor is the service pro-
vider. However, if the lessor decides to outsource the maintenance to some external
service agent, then the agent is the service provider. In the case of a finance lease,
the lessee is responsible for the maintenance and might decide to outsource it to an
external agent.
Outputs (products/services): If the lessee is a business, then the leased equipment
is used to produce its outputs – goods and/or services as discussed in Section 16.1.
For consumer goods, the output is the utility (in the case of a kitchen appliance) or
the satisfaction (in the case of a television) derived by the lessee.
Operator: In general, the lessee is the operator of the equipment. However, the
lessee, in turn, might hire some other business to operate the equipment and
produce the desired outputs. An example of this is a business that leases a fleet of
aircraft, then outsources the flying to another business that employs the crew and
operates the planes.
Government: Government plays an important role in the context of sale and buy-
back leases of infrastructure. The lessee can be a department of the government or
an independent unit acting as a proxy for the government. Decisions relating to
subsidy, tax incentives, etc., are made by the government and have a significant
impact on the lease structure.
Regulator: This applies mainly for equipment used in certain industry sectors
(such as health, transport, energy) where public safety is of great concern. The
regulator is often an independent body that monitors and makes recommendations that
can be binding on the owners and operators of equipment.
Vickerman (2004) deals with the infrastructure maintenance issues in the
context of rail and road transport in the UK and discusses the role of government
and regulators. Interested readers should consult the references cited in the paper
for more details.

16.3.1 Different Scenarios of Leasing

There are many different scenarios depending on the number of parties involved.
Table 16.1 gives three different scenarios involving up to four parties. Other scenarios
can include additional parties such as the government and/or the regulator.
In the remainder of the chapter we focus our attention on industrial and
commercial equipment leased under an operating lease and this corresponds to
Scenario 1.

Table 16.1. Three different scenarios of leasing

                             Scenario 1          Scenario 2          Scenario 3
Number of parties involved   2                   3                   4
First party                  Lessor: Owner &     Lessor: Owner       Lessor: Owner
                             Service Provider
Second party                 Lessee: User &      Lessee: User &      Lessee: User
                             Operator            Operator
Third party                  --                  Service Provider    Service Provider
Fourth party                 --                  --                  Operator

16.3.2 Business Equipment and Operating Lease

According to Baker and Hayes (1981), some of the pioneers in business equipment
leasing were IBM and Xerox. Since then, the number of businesses that lease
business equipment has grown significantly and many kinds of equipment are
leased. ELA (2005) gives a list of some of the businesses leasing their products
under operating leases.
We focus on the maintenance (provided by the lessor) of equipment leased
under an operating lease.1 A framework to study this involves several key elements
and these are indicated in Figure 16.2.




Figure 16.2. Conceptual model of equipment leasing

1 In the case of a finance lease, the lessee has the option of either doing the maintenance in
house or outsourcing it to some third party. For more on maintenance outsourcing, see
Deelen et al. (2003).

Lessor: The lessor is not only the owner of the leased equipment, but also the
maintenance service provider. The lessor is a business (either manufacturer or
some other entity) and as such has certain business objectives. At the strategic level
these can include issues such as ROI, market share, profits, etc. In order to achieve
these objectives, the lessor needs to have proper strategies at the strategic level (to
deal with issues such as type and number of equipment to lease, upgrade options to
compensate for technological obsolescence, etc.) and at the operational level
(maintenance servicing, inventory of spares, crew size, etc.).
Lessee: The lessee is a business that leases the equipment to produce its outputs –
goods and/or services. The lessee has to choose which equipment to lease when
there are several competing brands, the best lease arrangement from the set of lease
options available, the terms of the lease, etc. Critical to this decision-making are
issues such as equipment availability, cost, etc. Also, the lessee needs to take into
account the effect of failures on production and their subsequent impact on
customer satisfaction. As a result, the lessee’s objectives are different from those of
the lessor.
Equipment: A critical factor is the reliability of equipment. One needs to differen-
tiate between new and used equipment. The reliability of new equipment is the
inherent reliability and this depends on the decisions made by the manufacturer
during the design and production of the equipment. The field reliability depends on
factors such as usage intensity (which determines the load on the equipment) and
the operating environment. In the case of used equipment, the reliability depends
on the inherent reliability and the operating and maintenance history.
Maintenance: Equipment degrades with age and usage, and ultimately fails. Main-
tenance actions can be broadly grouped into two categories – corrective main-
tenance (CM) and preventive maintenance (PM). CM actions are needed to restore
failed equipment to an operational state. PM actions are needed to control equip-
ment degradation and reduce the likelihood of failure.
Contract: The contract needs to take into account the interests of both the lessor
and the lessee. The contract defines the terms and conditions of the lease (lease
period, rental payments, renewal options, penalty for early termination, equipment
upgrade, etc.). From the lessee’s point of view, the number of failures over the
lease period and the recovery time after each failure are important as they affect
equipment availability and the smooth running of the operations. The contract can
include terms to ensure that failures occur infrequently and the recovery times are
small. The lessor incurs penalties if these terms are violated. Also, in the case of
incentive-oriented contracts, the lessor is paid a bonus if equipment-related measures
(such as availability, number of failures, etc.) exceed or stay below, as appropriate,
some specified values stated in the contract.

16.3.3 Literature Review

The literature on equipment leasing deals with a variety of issues. For a broader
overview see Baker and Hayes (1981), Schallheim (1994) and Coyle (2000).
The bulk of the literature deals with issues from the lessee’s perspective, and
these can be broadly divided into two groups – (a) management oriented and (b)
economics and finance oriented. The management oriented literature is mainly
qualitative and deals with the following issues:
• Buy vs. lease options through proper cost and benefit analysis
• Selection of the most appropriate lease option
• Negotiating the terms of the lease option
• Administration of lease contracts

See Deelen et al. (2003) and ELA (2005) for more details.
The economics and finance oriented literature looks at both the lessor and
lessee perspectives and the leased equipment market resulting from the interaction
between these two parties. Ezzel and Vora (2001), Sharpe and Nguyen (1995),
Desai and Purohit (1998), Stremersch et al. (2001), Handa (1991) and Kim et al.
(1978) are an illustrative sample where readers can find more details.
The literature on maintenance is vast and there are many survey papers and
books on the topic. They deal with a range of issues – determining optimal main-
tenance strategies, planning and implementation of maintenance actions, logistics
of maintenance, etc. References to these can be found in review/survey papers
(McCall 1965; Pierskalla and Voelker 1976; Sherif and Smith 1976; Jardine and
Buzacott 1985; Gits 1986; Thomas 1986; Valdez-Flores and Feldman 1989; Cho
and Parlar 1991; Pintelon and Gelders 1992; Dekker et al. 1997; Scarf 1997).
There are very few papers dealing with the maintenance of leased equipment and
these will be discussed later in the chapter.

16.4 Maintenance of Equipment Under an Operational Lease

The lessee has to decide first on whether to lease or buy equipment and, once a
decision is made to lease, the next step is to decide on the lease contract. The lease
contract might be decided by the lessor or by the lessee or jointly. Figure 16.3
shows the key elements that are involved in the decision-making processes of the
lessor and the lessee. We focus on the maintenance of the leased equipment in the
remainder of the chapter.
The lessor has to decide on an effective maintenance strategy for the leased
equipment. The maintenance decision depends on the following factors:
• The duration of the lease.
• The penalty terms in the lease contract.
• The usage intensity (which is under the control of the lessee) and the operating
environment (which might or might not be under the control of the lessee).
• The initial state of the equipment (in the case of used equipment).
To determine the optimal maintenance for a specific leased equipment, the
lessor has to decide on the maintenance policy, and then determine the optimal
values for the parameters of this policy. In order to do this, both failures and the
effect of maintenance actions on failures need to be modeled. Figure 16.3 shows
the key elements for determining the optimal maintenance.

Figure 16.3. Framework for decision-making with regards leased equipment

16.4.1 Equipment Failures

One needs to differentiate between first and subsequent failures. The first failure
depends on the age of the equipment (in the case of used equipment) and the
subsequent failures depend on the type of CM actions (to rectify failures) and the PM
actions (to avoid failures).

First Failure

In the case of new equipment, the time to first failure is a random variable and is
modeled by a distribution function F(t). The failure density function f(t) and
the hazard function r(t) are given by

\[ f(t) = \frac{dF(t)}{dt} \quad \text{and} \quad r(t) = \frac{f(t)}{1 - F(t)} \tag{16.1} \]

respectively. In the case of used equipment, let A denote the age at the start of the
lease. Then, the time to first failure is given by the conditional failure distribution

\[ F(t \mid A) = \frac{F(t) - F(A)}{1 - F(A)}, \quad t \ge A. \tag{16.2} \]

Corrective Maintenance (CM) Actions

CM actions are performed to restore failed equipment to its operational state.
Depending on the effect of CM on the failure rate we have many different models.
If the failure rate after repair is essentially the same as that if the equipment had not
failed then it is called “minimal repair” (see Barlow and Hunter 1960). This is
appropriate for complex equipment where the equipment failure is due to failure of
one or a few components. The equipment becomes operational by replacing (or re-
pairing) the failed components. This action has very little impact on the reliability
characteristics of the equipment. If the failure rate changes (in either direction)
after repair, it is called “imperfect repair”. Many different types of imperfect repair
models have been proposed and for a review of such models see Pham and Wang
The time to repair is in general a random variable and needs to be modeled by a
distribution function. Typically, the time to repair is very much smaller than
the time between failures (in a statistical sense) so that one can ignore
it and treat repair as being instantaneous for determining failures over time. With
this assumption, the failures over time (with only CM actions in the form of minimal
repair) occur according to a non-homogeneous Poisson process (NHPP) with intensity
function λ0(t) = r(t), the hazard function defined earlier. The intensity function
(characterizing the failures over time) is also referred to as “rate of occurrence of
failure” (ROCOF).
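
The NHPP description above can be illustrated with a minimal sketch. It assumes a Weibull time to first failure with scale `alpha` and shape `beta` (the form the chapter adopts later for its numerical examples); the function names are hypothetical and chosen only for this illustration. Under minimal repair, the expected number of failures over an interval is the integral of the intensity function over that interval.

```python
import math


def weibull_cdf(t, alpha, beta):
    """F(t) for a two-parameter Weibull time to first failure (assumed form)."""
    return 1.0 - math.exp(-((t / alpha) ** beta))


def hazard(t, alpha, beta):
    """r(t) = f(t) / (1 - F(t)); closed form for the Weibull case."""
    return (beta / alpha) * (t / alpha) ** (beta - 1)


def conditional_cdf(t, age, alpha, beta):
    """F(t | A) as in Equation 16.2, valid for t >= age."""
    if t < age:
        raise ValueError("t must be at least the age at the start of the lease")
    f_age = weibull_cdf(age, alpha, beta)
    return (weibull_cdf(t, alpha, beta) - f_age) / (1.0 - f_age)


def expected_failures(start, end, alpha, beta):
    """Expected NHPP failure count on [start, end] with minimal repair only.

    Integrating r(t) gives (end/alpha)^beta - (start/alpha)^beta.
    """
    return (end / alpha) ** beta - (start / alpha) ** beta


# New item leased for 5 years versus an item already 5 years old (alpha = 1, beta = 2).
new_item = expected_failures(0.0, 5.0, 1.0, 2.0)
used_item = expected_failures(5.0, 10.0, 1.0, 2.0)
```

With these illustrative values the used item is expected to fail three times as often over the same lease length, which is why the age at the start of the lease matters for the lessor.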
The cost of repair is also a random variable and needs to be modeled by a
distribution function. Let C_f denote the average cost of each minimal repair.

Preventive Maintenance (PM) Actions

PM actions are used to control the degradation process and to reduce the likelihood
of failure occurrences. Inspection, cleaning, lubrication, adjustment and calibra-
tion, replacement of degraded components, and major overhaul are some common
tasks that are carried out under PM. The effect of PM action is to improve the
reliability of the equipment. There are several ways of modeling this improvement
and we discuss two of them (Reduction in Failure Intensity and Reduction in Age)
later in the section.
The time needed to carry out PM actions can vary and needs to be modeled
properly. For minor PM actions, the time needed is small relative to the time be-
tween failures and can be ignored. For a major overhaul, the time can be significant
and cannot be ignored. The cost of PM action comprises the administration cost,
labor cost, material cost, and spare parts inventory cost, and some of these costs are

Reduction in intensity function: Here a PM action results in a reduction in the
intensity function (ROCOF). λ0(t) is the intensity function without any PM
actions. Let λ(t) denote the intensity function with PM actions. We assume that
the time for PM action is small relative to the mean time between failures so that it
can be ignored. The effect of PM on the intensity function is given by

\[ \lambda(t_j^{+}) = \lambda(t_j^{-}) - \delta_j \tag{16.3} \]

where δ_j is the reduction resulting from the PM action at time t_j. δ_j depends on
the level of PM effort and is constrained as follows:

\[ 0 \le \delta_j \le \lambda(t_j^{-}) - \lambda(0) \tag{16.4} \]

This implies that PM action cannot make the equipment better than new.
As a result, if PM actions are carried out at time instants t_j, j ≥ 1, and the
reduction in the intensity function is given by δ_j, j ≥ 1, then the intensity function is
given by

\[ \lambda(t) = \lambda_0(t) - \sum_{i=0}^{j} \delta_i, \quad t_j < t < t_{j+1}, \tag{16.5} \]

for j ≥ 0, with t_0 = 0 and δ_0 = 0. This implies that the reduction resulting from
action at t_j lasts for all t ≥ t_j as shown in Figure 16.4.

[Plot of the intensity functions λ0(t) and λ(t) against time, with PM actions at t1 and t2]

Figure 16.4. Effect of PM action on the intensity function for new equipment

The cost of each PM action depends on the reduction in the intensity function.
Let C_p(δ) denote the cost of PM action; this is an increasing function of δ.
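
The reduction-in-intensity model can be sketched in a few lines. The example below assumes a Weibull baseline intensity with `alpha = 1` and `beta = 2` and a PM schedule given as sorted times with matching reductions; the function names and the sample schedule are hypothetical. It evaluates the intensity with PM (each reduction persists after its PM time) and checks the "no better than new" constraint before each PM action.

```python
def baseline_intensity(t, alpha=1.0, beta=2.0):
    """lambda0(t): assumed Weibull intensity without any PM actions."""
    return (beta / alpha) * (t / alpha) ** (beta - 1)


def intensity_with_pm(t, pm_times, reductions, alpha=1.0, beta=2.0):
    """lambda(t) = lambda0(t) minus every reduction whose PM time precedes t."""
    removed = sum(d for t_i, d in zip(pm_times, reductions) if t_i < t)
    return baseline_intensity(t, alpha, beta) - removed


def reductions_are_feasible(pm_times, reductions, alpha=1.0, beta=2.0):
    """Check 0 <= delta_j <= lambda(t_j^-) - lambda(0) for each PM action.

    pm_times must be sorted in increasing order.
    """
    at_zero = baseline_intensity(0.0, alpha, beta)
    cumulative = 0.0
    for t_j, d_j in zip(pm_times, reductions):
        just_before = baseline_intensity(t_j, alpha, beta) - cumulative
        if d_j < 0.0 or d_j > just_before - at_zero + 1e-12:
            return False
        cumulative += d_j
    return True


# Hypothetical schedule: PM at t = 1 and t = 2 with reductions 1.0 and 1.5.
times, deltas = [1.0, 2.0], [1.0, 1.5]
between_pm = intensity_with_pm(1.5, times, deltas)   # 3.0 - 1.0
after_both = intensity_with_pm(2.5, times, deltas)   # 5.0 - 2.5
```

Because every reduction is subtracted for all later times, the curve with PM stays parallel to the baseline between PM actions and steps down at each one.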

Reduction in age: Used equipment can be subjected to an upgrade (or overhaul)
where components that have degraded significantly are replaced with new ones so
that the equipment is in a sense younger (from a reliability point of view). If the age
of the equipment is A before it is subjected to PM action, then it can be viewed as
equipment of virtual age A − x after the PM action. The reduction in the age is
x, 0 < x < A. As a result, the intensity function decreases after PM action as shown
in Figure 16.5.

[Plot of the intensity functions λ0(t) and λ(t) against time, with the virtual age A − x and the age A marked on the time axis]

Figure 16.5. Effect of upgrade action on the intensity function for used equipment

The cost of this type of PM action depends on the reduction in the virtual age
and is modeled by a function C_u(x) which is an increasing function of x.

Usage Intensity and Operating Environment

Equipment is usually designed for some nominal usage intensity and operating
environment. When it is operated under these conditions, the ROCOF (with no PM
actions) is given by λ0 (t ) . If the equipment is used in a more intense mode and/or
the operating environment becomes harsher, then the ROCOF can increase signifi-
cantly. As a result, failures occur more frequently. Many different models have
been proposed to model this change. Two of the well-known ones are (i) the accelerated
failure time (AFT) model and (ii) the proportional hazard (PH) model. For more
on this, see Blischke and Murthy (2000).
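
The text does not spell these two models out, so the following is only a sketch of the forms in which they are commonly written. Here ψ is a hypothetical severity factor (ψ = 1 for nominal conditions, ψ > 1 for more intense usage or a harsher environment), and F_0 and r_0 denote the distribution and hazard functions under nominal conditions.

```latex
\[ \text{AFT:} \quad F(t) = F_0(\psi t), \qquad r(t) = \psi \, r_0(\psi t) \]
\[ \text{PH:} \quad r(t) = \psi \, r_0(t) \]
```

For a Weibull hazard with scale α and shape β, the AFT form amounts to replacing α by α/ψ, whereas the PH form multiplies the hazard by ψ at every age.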

16.4.2 Penalties

Both the lessor and the lessee can incur penalties if they violate the terms of the
contract. In the case of the lessee, it could be the usage intensity exceeding that
specified in the contract (provided the lessor can monitor this). In the case of the
lessor, the penalties are linked to equipment failures and the time to repair failed
equipment.
Two simple forms of penalty are as follows.
Penalty 1: Let N(L) denote the number of equipment failures over the lease
period L. If N(L) exceeds γ (a pre-specified value) the lessor incurs a penalty.
The amount that the lessor pays to the lessee at the end of the contract is
C_n [max{N(L) − γ, 0}].
Penalty 2: Let the random variable Y denote the time that the lessor takes to restore
failed equipment to its working state. If Y exceeds τ (a pre-specified value) then
the lessor incurs a penalty given by C_t [max{(Y − τ), 0}].
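
The two penalty forms can be written directly as functions. This is a minimal sketch for one realised lease, assuming the repair times and τ are expressed in the same unit; the function names and the sample repair times are hypothetical.

```python
def failure_count_penalty(num_failures, gamma, c_n):
    """Penalty 1: C_n * max(N(L) - gamma, 0), paid at the end of the contract."""
    return c_n * max(num_failures - gamma, 0)


def repair_time_penalty(repair_time, tau, c_t):
    """Penalty 2: C_t * max(Y - tau, 0) for a single repair."""
    return c_t * max(repair_time - tau, 0.0)


def total_penalty(repair_times, gamma, tau, c_n, c_t):
    """Total penalty for one lease; repair_times has one entry per failure."""
    late = sum(repair_time_penalty(y, tau, c_t) for y in repair_times)
    return failure_count_penalty(len(repair_times), gamma, c_n) + late


# Three failures with repair times of 1.0, 3.5 and 0.5 days (hypothetical data),
# an allowance of gamma = 2 failures and a repair limit of tau = 2 days.
example = total_penalty([1.0, 3.5, 0.5], gamma=2, tau=2.0, c_n=200.0, c_t=300.0)
```

In this example one failure exceeds the allowance (200) and one repair overruns the limit by 1.5 days (450), so the lessor pays 650 in total.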

16.4.3 Optimal Maintenance

Whenever a failure occurs, the lessor incurs a direct cost in restoring the failed
equipment to its operating state. Also, the lessor can incur indirect costs resulting
from the penalties incurred. As a result, the total CM costs are the sum of both the
direct and the indirect costs. These costs can be lowered through greater PM effort
but this implies increased PM costs. The total cost to the lessor as a function of the
PM effort is as shown in Figure 16.6 and the optimal PM effort is one that mini-
mizes the total costs.
Since the CM costs are uncertain, the optimal PM effort is based on minimizing
the expected total cost. This requires the lessor to first define the kind of PM policy
that would be employed and then to optimally select the parameters of the policy
so as to minimize the expected total cost.





Figure 16.6. Optimal PM effort

16.4.4 Maintenance Policies

One can define many different types of PM policies that the lessor can use. We
first consider new equipment lease. We define a few policies and indicate the
parameters that need to be optimally selected. Later we look at used equipment
lease.

New Equipment Lease

Policy 1: The equipment is subjected to k preventive maintenance actions over the
lease period. The time instants at which these actions are carried out are given by
{t_j, 1 ≤ j ≤ k} with t_i < t_j for i < j. The reduction in the intensity function during
the PM action is δ_j. All failures over the lease period are rectified through
minimal repair. As a result, the policy is characterized by the parameter set
θ ≡ {k, t_j, δ_j, 1 ≤ j ≤ k}.
Policy 2: The equipment is subjected to preventive maintenance actions periodically
so that the j-th PM action is carried out at time t_j = jT, j = 1, 2, ..., k. After each
PM action the intensity function is reduced by δ_j. All failures over the lease
period are rectified through minimal repair. The policy is characterized by the
parameter set θ ≡ {T, δ_j}.
Policy 3: The equipment is subjected to preventive maintenance action whenever
the intensity function reaches a specified level ρ. Each PM action reduces the
intensity function by a fixed amount δ. All failures over the lease period are
rectified through minimal repair. The policy is characterized by the parameter set
θ ≡ {ρ, δ}.
Policy 4: Let 0 < ς1 < ς2 < L. The equipment is subjected to no PM actions in the
interval [0, ς1), periodic PM actions with period 2∆ in the interval [ς1, ς2) and
period ∆ in the interval [ς2, L). Each PM reduces the intensity function to a
specified level ν. All failures over the lease period are rectified through minimal
repair. The policy is characterized by the parameter set θ ≡ {ς1, ς2, ν}.

Used Equipment Lease

In this case, the lessor has the additional option of subjecting the equipment to an
overhaul. This can be modeled as a reduction in the virtual age so that we now
have an additional parameter x (the reduction in age). During the lease period the
lessor can use the PM policies defined above for a new equipment lease (Section 16.4.4).

16.5 Analysis and Optimisation of Maintenance Policies

We confine our attention to Policies 1 and 2 with new equipment lease and Policy
1 with lease of used equipment.
Let J(θ) denote the expected total cost to the lessor. This includes the CM and
PM costs as well as the penalty costs. We assume γ = 0 so that the lessor incurs a
penalty even if there is one failure. (The expressions are slightly more complicated
when γ > 0 and the analysis is a lot more difficult.) We present the final expressions
for J(θ) and indicate references where interested readers can get the details.
It is not possible to derive any analytical results. A computational scheme is
needed to obtain the optimal values for the parameters of the policy. Our focus is
on the effect of penalty terms in the contract on the optimal maintenance strategies.
We illustrate this through numerical examples based on the Weibull intensity function
given by

\[ \lambda_0(t) = \frac{\beta}{\alpha} \left( \frac{t}{\alpha} \right)^{\beta - 1} \tag{16.6} \]

where α is the scale parameter and β is the shape parameter. The repair time
distribution is given by a two-parameter Weibull distribution

\[ G(y) = 1 - \exp\left[ -\left( \frac{y}{n} \right)^{m} \right], \quad 0 \le y < \infty \tag{16.7} \]

with shape parameter m < 1 (implying decreasing repair rate) and scale parameter
n > 0. We assume the following parameter values:
Intensity function: α = 1 (year) and β > 1 (implying increasing failure rate)
Repair time: m = 0.5 and n = 0.5 (mean time to repair is one day)
Reduction in intensity function: C_p(δ) = 100 + 50δ ($)
Reduction in age: C_u(x) = … / (1 − e^{−ϕ(A − x)}) ($) with w = 10 and ϕ = 0.1
Cost parameters: C_f = 100 ($), C_n = 200 ($), C_t = 300 ($)
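
The repair-time parameters can be checked numerically. The sketch below assumes the Weibull form of Equation 16.7 with y, n and τ all measured in days; the function names, the Simpson integration and its step settings are choices made for this illustration only. It computes the mean repair time n·Γ(1 + 1/m) and the expected overrun of a repair beyond a limit τ, using the identity E[max(Y − τ, 0)] = ∫ from τ to ∞ of (1 − G(y)) dy.

```python
import math


def mean_repair_time(m, n):
    """Mean of a Weibull repair time with shape m and scale n."""
    return n * math.gamma(1.0 + 1.0 / m)


def expected_overrun(tau, m, n, steps=20000, u_span=60.0):
    """E[max(Y - tau, 0)] for Weibull G, by Simpson's rule (intended for m <= 1).

    The substitution u = (y / n)^m turns the integral of 1 - G(y) into
    (n / m) * integral of u^(1/m - 1) * exp(-u) from u0 = (tau / n)^m upwards;
    the upper limit is truncated at u0 + u_span, where exp(-u) is negligible.
    """
    if math.isinf(tau):
        return 0.0
    u0 = (tau / n) ** m
    h = u_span / steps  # steps must be even for Simpson's rule

    def integrand(u):
        return u ** (1.0 / m - 1.0) * math.exp(-u)

    total = integrand(u0) + integrand(u0 + u_span)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(u0 + i * h)
    return (n / m) * total * h / 3.0


mean_days = mean_repair_time(0.5, 0.5)       # 0.5 * Gamma(3) = 1 day
overrun_tau2 = expected_overrun(2.0, 0.5, 0.5)
```

For m = 0.5 the overrun has the closed form 2n(u0 + 1)e^(−u0) with u0 = (τ/n)^0.5, which gives 3e^(−2), roughly 0.41 days, for τ = 2; the numerical value can be compared against it.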

16.5.1 Policy 1 (New Equipment Lease)

From Jaturonnatee et al. (2005), the expected total cost is given by

\[ J(\theta) = C_f E[N(L)] + \sum_{j=1}^{k} C_p(\delta_j) + C_t E[N(L)] \left\{ \int_{\tau}^{\infty} (y - \tau) g(y) \, dy \right\} + C_n E[N(L)] \tag{16.8} \]

The first term on the right-hand side is the cost of rectifying failures, the second term is the
PM costs, and the third and fourth terms represent the penalty costs associated with
repair times and number of failures over the lease period. The parameters, given by
the set θ ≡ {k, t_j, δ_j, 1 ≤ j ≤ k}, need to be selected optimally to minimize J(θ).
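
The cost expression can be evaluated for a given schedule with a short sketch. It assumes the Weibull intensity of Equation 16.6, the PM cost C_p(δ) = 100 + 50δ and γ = 0; the function names are hypothetical, and the expected overrun per repair (the integral in the third term) is passed in as a number. Since each reduction δ_j persists from t_j to L, E[N(L)] equals (L/α)^β minus the sum of δ_j (L − t_j). The schedule used below (four PM actions at multiples of 0.9 years, each removing the intensity accumulated since the previous one) is an assumption made for illustration; the source does not list the optimal times and reductions.

```python
def expected_failures(lease, pm_times, reductions, alpha, beta):
    """E[N(L)]: integral of lambda(t) over [0, L] under minimal repair."""
    baseline = (lease / alpha) ** beta
    return baseline - sum(d * (lease - t) for t, d in zip(pm_times, reductions))


def pm_cost(delta):
    """C_p(delta) = 100 + 50 * delta ($), as in the parameter list."""
    return 100.0 + 50.0 * delta


def expected_total_cost(lease, pm_times, reductions, alpha, beta,
                        c_f, c_n, c_t, overrun):
    """J(theta) as the sum of the four terms of the cost expression (gamma = 0)."""
    n_fail = expected_failures(lease, pm_times, reductions, alpha, beta)
    repair = c_f * n_fail
    preventive = sum(pm_cost(d) for d in reductions)
    late_repair_penalty = c_t * n_fail * overrun
    failure_penalty = c_n * n_fail
    return repair + preventive + late_repair_penalty + failure_penalty


# Assumed schedule for beta = 2, L = 5: PM at 0.9, 1.8, 2.7, 3.6 with delta = 1.8 each.
times = [0.9 * j for j in range(1, 5)]
deltas = [1.8] * 4
no_penalty = expected_total_cost(5.0, times, deltas, 1.0, 2.0,
                                 c_f=100.0, c_n=0.0, c_t=300.0, overrun=0.0)
with_failure_penalty = expected_total_cost(5.0, times, deltas, 1.0, 2.0,
                                           c_f=100.0, c_n=200.0, c_t=300.0, overrun=0.0)
```

With no penalties this assumed schedule evaluates to about 1280, the value the text quotes for the no-penalty case with β = 2 and L = 5. Keeping the same schedule with C_n = 200 gives about 2320, which illustrates why a failure penalty pushes the lessor towards a different, more PM-intensive schedule.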

Example 16.1 Table 16.2 (extracted from Table 3 of Jaturonnatee et al. 2005)
shows k*, the optimal number of PM actions (the optimal values for the remaining
parameters are omitted) and J(θ*), the corresponding expected costs for a range
of τ and C_n.
The optimisation needs to take into account the following constraint:

\[ 0 < \sum_{i=0}^{j} \delta_i < \lambda_0(t_j) - \lambda_0(0), \quad j \ge 1 \tag{16.9} \]

with t_0 = 0 and δ_0 = 0.

Table 16.2. Optimal maintenance under Policy 1

                    C_n = 0 ($)           C_n = 200 ($)
β    τ (days)    k*    J(θ*)           k*    J(θ*)
     1            4    $1002.27         5    $1298.34
     2            3    $907.63          5    $1223.91
     3            3    $838.08          5    $1179.39
     ∞            2    $615.31          4    $1042.31

     1            7    $1992.32        10    $2531.77
     2            6    $1811.05         9    $2399.16
     3            6    $1693.48         9    $2317.58
     ∞            4    $1280.00         7    $2067.71

     1           19    $7511.43        26    $8962.08
     2           16    $7009.92        24    $8610.66
     3           16    $6677.07        23    $8388.23
     ∞           10    $5437.03        20    $7712.87

The case of no penalty corresponds to C_n = 0 and τ → ∞. In this case, for β = 2
and L = 5, we have from Table 16.2 k* = 4 and J(θ*) = 1280.00 ($). With only
the penalty for repair not being completed within the specified time (τ = 2 and
C_n = 0), k* increases to 6 and the expected total cost increases to 1811.05 ($).
With only the penalty for failure occurrence (C_n = 200 and τ → ∞), k* increases
to 7 and the expected total cost increases to 2067.71 ($). With both penalties
(τ = 2 and C_n = 200), k* increases to 9 and the expected total cost increases to
2399.16 ($). The impact of the penalty is more significant as β increases.

16.5.2 Policy 2 (New Equipment Lease)

The number of PM actions carried out over the lease period is k(T), given by the
largest integer less than L/T. The expected total cost is given by Equation 16.8
with t_j = jT, j ≥ 1, and the parameters, given by the set θ ≡ {T, δ_j, 1 ≤ j ≤ k(T)},
need to be selected optimally to minimize J(θ) subject to the constraint given by
Equation 16.9.
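
A periodic schedule can be evaluated with the following sketch for the no-penalty case (C_n = 0 and τ → ∞). It assumes the Weibull intensity with α = 1, the PM cost 100 + 50δ, and, as an illustrative choice not taken from the source, that each PM action removes all of the intensity accumulated since the previous one (which sits on the boundary allowed by the "no better than new" constraint). The function names are hypothetical.

```python
import math


def num_pm_actions(lease, period):
    """k(T): the largest integer strictly less than L / T."""
    return math.ceil(lease / period) - 1


def periodic_cost_no_penalty(lease, period, alpha, beta, c_f=100.0):
    """Expected cost of a periodic PM schedule when no penalties apply."""

    def lam0(t):
        return (beta / alpha) * (t / alpha) ** (beta - 1)

    k = num_pm_actions(lease, period)
    times = [j * period for j in range(1, k + 1)]
    # Assumption: each PM removes the intensity built up since the previous PM.
    reductions = [lam0(t) - lam0(t - period) for t in times]
    n_fail = (lease / alpha) ** beta - sum(
        d * (lease - t) for t, d in zip(times, reductions)
    )
    pm_total = sum(100.0 + 50.0 * d for d in reductions)
    return k, c_f * n_fail + pm_total


k_half_year, cost_half_year = periodic_cost_no_penalty(5.0, 0.5, alpha=1.0, beta=3.0)
```

For β = 3, L = 5 and T = 0.5 this gives nine PM actions and an expected cost of about 5750, so the evaluation can be used as a building block inside a numerical search over T.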

Example 16.2 Table 16.3 (extracted from Pongpech and Murthy 2006) shows T *
(the optimal values for the other parameters are omitted) and the corresponding
expected total cost for β = 3 and L = 5 .

Table 16.3. Optimal maintenance under Policy 2

               C_n = 0 ($)             C_n = 200 ($)
τ (days)     T*        J(θ*)         T*        J(θ*)
1            0.2381    $7827.21      0.1786    $9336.99
2            0.2778    $7312.50      0.1923    $8969.90
3            0.3125    $6968.53      0.2000    $8737.14
∞            0.5000    $5750.00      0.2273    $8034.90

When there is no penalty (C_n = 0 and τ → ∞) we see from Table 16.3 that
T* = 0.5 and J(θ*) = 5750.00 ($). The effect of the repair time penalty is that
T* decreases as τ decreases. The effect of the failure penalty is also similar, with
T* decreasing as C_n increases.

16.5.3 Policy 1 (Used Equipment Lease)

The age of the used equipment is A and the lessor carries out an overhaul which
reduces its age by x before the equipment is leased out. The analysis is similar to
Policy 1 and the expected total cost (see Pongpech et al. (2006) for details) is
given by

\[ J(\theta) = C_f E[N(L)] + \sum_{j=1}^{k} C_p(\delta_j) + C_t E[N(L)] \left\{ \int_{\tau}^{\infty} (y - \tau) g(y) \, dy \right\} + C_n E[N(L)] + C_u(x) \tag{16.10} \]

where θ ≡ {x, k, t_j, δ_j, 1 ≤ j ≤ k}. This differs from Equation 16.8 in two ways:
(i) E[N(L)] depends on A (the age of the used equipment) and (ii) the last term is
the cost of PM action before the equipment is leased out.

Example 16.3 Table 16.4 (extracted from Pongpech et al. 2006) shows x* and k *
(the optimal values for the remaining parameters are omitted) and the correspond-
ing optimal expected total cost for A = 5 , β = 2 and L = 5 .

Table 16.4. Optimal maintenance under Policy 1 for used equipment

               C_n = 0 ($)                 C_n = 200 ($)
τ (days)     x*     k*    J(θ*)          x*     k*    J(θ*)
1            3.5     7    $8484.36       4.0    10    $11312.57
2            3.5     6    $7488.90       4.0     9    $10637.18
3            3.5     6    $6884.50       4.0     9    $10231.04
∞            2.5     4    $4792.55       3.5     7    $8918.59

With no penalty (C_n = 0 and τ → ∞) we have x* = 2.5, k* = 4 and J(θ*) =
4792.55 ($). The effect of the repair time penalty is that x* increases and then stays
constant as τ decreases. Similarly, k* increases (implying more frequent PM
actions over the lease period). The effect of the failure penalty is also similar, with
x* and k* increasing as C_n increases.
Table 16.5 shows the results with A ranging from one to seven years.

Table 16.5. Effect of variation in A on optimal strategy

A x* k* J (θ * )
1 0.0 4 $2280.00
2 0.6 4 $3111.58
3 1.2 4 $3752.68
4 2.0 4 $4290.03
5 2.5 4 $4792.55
6 3.6 4 $5198.07
7 4.2 4 $5601.10

As can be seen, x* (the reduction in age due to PM actions before the equipment is
leased out) increases with A as is to be expected since the ROCOF increases with
age. Note that no upgrade is needed when the equipment is fairly young ( A = 1 ).
Also, k * does not change when β = 2 . However, when β > 2 , then we find that
k * increases as A increases.

16.6 Topics for Future Research

In this section we briefly discuss some future research areas.

1. The occurrence of failures depends on factors such as usage intensity,
operating environment and operator skills. These can vary across the lessee
population. One way of modeling this is through the Cox regression model
where the intensity function includes an extra term to reflect the effect of
these variables.
2. The penalty terms in the lease contract studied so far are fairly simple – a
penalty when the repair time and/or the number of failures over the lease
period exceed some specified limits. The lease contract can involve more
complex penalty terms. For example, different upper limits on the number
of failures for different intervals over the lease period, the time interval
between subsequent failures, etc.
3. The time to carry out CM actions depends on the availability of repair crew
and spare parts. This raises several issues such as the optimal inventory
levels for spares, number of repair crew, etc., that the lessor needs to deal
with. Large inventory and a greater number of crews reduce the penalty
cost but increase the inventory holding and operating costs. As a result,
these parameters must be selected optimally to achieve a proper trade-off
between the two costs.
4. The research so far has focussed mainly on issues of interest to the lessor.
When the lessor offers a wide range of options the lessee has to decide on
the optimal choice. This needs to take into account the price of the lease
and a proper cost-benefit analysis of each option.
414 D. Murthy and J. Pongpech

5. From the lessor’s point of view, the size and variety of equipment to stock
for leasing are both important issues. The optimal choice of these and the
replacement decisions must take into account the needs of different lessees
and the investment needed for the purchase of new stock.
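
For topic 1 above, a common way to write a Cox-type regression model for the failure intensity is the proportional-intensity form sketched below. The symbols are notation assumed here for illustration, not taken from the chapter: \lambda_0(t) is a baseline intensity, z is a vector of covariates for a given lessee (for example usage intensity, operating environment and operator skill), and \gamma is a vector of regression coefficients (written \gamma to avoid confusion with the shape parameter β used earlier).

```latex
\lambda(t \mid z) = \lambda_0(t)\,\exp\!\left(\gamma^{\top} z\right)
```

Under this form the covariates scale the baseline intensity multiplicatively, so a lessee with z = 0 has exactly the baseline intensity \lambda_0(t).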

16.7 References
Baker CR, Hayes RS (1981) Lease Financing — A Practical Guide, John Wiley, New York,
Barlow RE, Hunter LC (1960) Optimum preventive maintenance policies, Operation
Research, 8:90–100
Blischke WR, Murthy DNP (2000) Reliability Modeling, Prediction, and Optimization, John
Wiley, New York, USA
Cho D, Parlar M (1991) A survey of maintenance models for multi-unit systems, European
Journal of Operational Research, 51:1–23
Coyle B (2000) Leasing, Glenlake, Chicago, USA
Deelen L, Dupleich M, Othieno L, Wakelin O (2003) Leasing for small and micro
enterprises – a guide for designing and managing leasing schemes in developing
countries, Berold, R. (ed), Cristina Pierini, Turin, Italy.
Dekker R, Wildeman RE, Van Der Duyn Schouten FA (1997) Review of multi-component
models with economic dependence, Mathematical Methods of Operations Research,
Desai P, Purohit D (1998) Leasing and selling: optimal marketing strategies for a durable
goods firm, Management Science, 44 (11):19–34
ELA (2002a) Equipment Leasing and Financial Foundation 2002 State of the Industry
Report, Price Water House Coopers, Available on
ELA (2002b) Equipment Leasing Association Online Focus Groups Report, Available on
ELA (2005) The economic contribution of equipment leasing to the U.S. economy: growth,
investment & jobs—update, Equipment Leasing Association, Global Insight, Advisory
Services Group, Available on
Ezzel JR, Vora PP (2001) Leasing versus purchasing: Direct evidence on a corporation’s
motivation for leasing and consequences of leasing, The Quarterly Review of Economics
and Finance, 41:33–47
Fishbein BK, McCarry LS, Dillon PS (2000) Leasing: A step toward producer responsi-
bility, Available on
Gits CW (1986) On the maintenance concept for a technical system: II. Literature review,
Maintenance Management International, 6:181–196
Handa P (1991) An economic analysis of leasebacks, Review of Quantitative Financing and
Accounting, 1:177–189
Jardine AKS, Buzacott JA (1985) Equipment reliability and maintenance, European Journal
of Operational Research, 116:259–273
Jaturonnatee J, Murthy DNP, Boondiskulchok R (2005) Optimal preventive maintenance of
leased equipment with corrective minimal repair, European Journal of Operational
Research, Available online 30 March 2005
Kim EH, Lweellen WG, McConnell JJ (1978) Sale-and-leaseback agreements and enterprise
valuation, Journal of Financial and Quantitative Analysis, 13:871–881
Maintenance of Leased Equipment 415

Kleiman RT (2001) The characteristics of venture lease financing, Journal of Equipment

Lease Financing, 19 (1):1–10
McCall JJ (1965) Maintenance policies for stochastically failing equipment: A survey,
Management Science, 11:493–524
Pham H, Wang H (1996) Imperfect maintenance, European Journal of Operational
Research, 94:425–438
Pierskalla WP, Voelker JA (1976) A survey of maintenance models: The control and
surveillance of deteriorating systems, Naval Logistics Research Quarterly, 23:353–388
Pintelton LM, Gelders L (1992) Maintenance management decision making, European
Journal of Operational Research, 58:301–317
Pongpech J, Murthy P (2006) Optimal periodic preventive maintenance policy for leased
equipment, Reliability Engineering and System Safety, 91(7):772–777
Pongpech J, Murthy DNP, Boondiskulchok R (2006) Maintenance strategies for used
equipment under lease, Journal of Quality in Maintenance Engineering, 12(1): 52–67
Scarf PS (1997) On the application of mathematical models to maintenance, European
Journal of Operational Research, 63:493–506
Schallheim JS (1994) Lease or Buy? Principles for Sound Decision Making, Harvard
Business School Press, Cambridge, Mass.
Sharpe S A, Nguyen H H (1995) Capital market imperfections and incentive to lease,
Journal of Financial Economics, 39:271–294
Sherif YS, Smith ML (1976) Optimal maintenance models for systems subject to failure—A
review, Naval Logistics Research Quarterly, 23:47–74
Stremersch S, Wuyts S, Rambach RT (2001) The Purchasing of Full-Service Contracts: An
Exploratory Study within the Industrial Maintenance Market, Industrial Marketing
Management, 30(1):1–12
Thomas LC (1986) A survey of maintenance and replacement models for maintainability
and reliability of multi-unit systems, Reliability Engineering, 16:297–309
Valdez-Flores C, Feldman RM (1989) A survey of preventive maintenance models for
stochastically deteriorating single-unit systems, Naval Logistics Research Quarterly,
Vickerman R (2004) Maintenance incentives under different infrastructure regimes, Utilities
Policy, 12:315–322

Computerised Maintenance Management Systems

Ashraf Labib

17.1 Introduction
Computerised maintenance management systems (CMMSs) are vital for the
co-ordination of all activities related to the availability, productivity and
maintainability of complex systems. Modern computational facilities offer
considerable scope for improving the effectiveness and efficiency of maintenance.
CMMSs have existed, in one form or another, for several decades.
The software has evolved from relatively simple mainframe planning of
maintenance activity to Windows-based, multi-user systems that cover a multitude of
maintenance functions. The capacity of CMMSs to handle vast quantities of data
purposefully and rapidly has opened new opportunities for maintenance,
facilitating a more deliberate and considered approach to managing assets.
Some of the benefits that can result from the application of a CMMS are:
• Resource control – tighter control of resources
• Cost management – better cost management and auditability
• Scheduling – ability to schedule complex, fast-moving workloads
• Integration – integration with other business systems
• Reduction of breakdowns – improved reliability of physical assets through
the application of an effective maintenance programme
The most important factor may be reduction of breakdowns. This is the aim of
the maintenance function and the rest are ‘nice’ objectives (or by-products).
This is a fundamental issue, as some system developers and vendors, as well as
some users, lose focus and compromise reduction of breakdowns in order to
maintain standardisation and integration objectives, thus confusing the aim with
the objectives. As a result, the majority of CMMSs on the market suffer from
serious drawbacks, as will be shown in the following section.

The term maintenance has many definitions. One comprehensive definition is
provided by the UK Department of Trade and Industry (DTI):
“The management, control, execution and quality of those activities which will
ensure that optimum levels of availability and overall performance of plant are
achieved, in order to meet business objectives”.
It is worth noting that the definition implies that maintenance is a managerial
and strategic activity; today, the term ‘asset management’ is often used instead. It
is also worth noting that the word ‘optimum’ was used rather than ‘maximum’,
which implies that maintenance is an optimisation problem in which both
over-maintenance and under-maintenance should be avoided.
In this chapter an investigation of the characteristics of computerised
maintenance management systems (CMMSs) is carried out in order to highlight the
need for them in industry and identify their current deficiencies. This is achieved
through the assessment of the state-of-the-art of existing CMMSs.
A proposed model is then presented to provide a decision analysis capability
that is often missing in existing CMMSs. The effect of such a model is to contribute
towards the optimisation of the functionality and scope of CMMSs for enhanced
decision analysis support. The system is highly adaptive and has been successfully
applied in industry. The proposed model employs a hybrid of intelligent
approaches. In this chapter, we also demonstrate the use of AI techniques in CMMSs,
show how this integrates with the work of Kobbacy in Chapter 9, and
outline features of next generation maintenance systems.
The chapter is organised as follows. Section 17.2 provides evidence of the
existence of ‘black holes’ in the CMMS market. An alternative is provided in Section
17.3 where a model for decision analysis called the Decision Making Grid (DMG)
is introduced. Section 17.4 describes maintenance policies that are covered by the
DMG. This is then followed by demonstration of incorporating the DMG into a
CMMS through a case study in Section 17.5 with a discussion of the results. The
final two sections (Sections 17.6 and 17.7) deal with the unmet needs in CMMSs
and a discussion of future directions for research.

17.2 Evidence of ‘Black Holes’

Most existing off-the-shelf software packages, especially CMMSs and enterprise
resource planning (ERP) systems, tend to be ‘black holes’. This term has been
coined by the author as a description of systems that are greedy for data input but
that seldom provide any output in terms of decision support. In astronomical terms,
‘black holes’ used to be stars at some time in the past and now possess such a high
gravitational force that they absorb everything that comes across their fields and do
not emit anything at all, including light. This is analogous to systems that, at worst,
are hungry for data and resources and, at best, provide the decision-maker with
information that he/she already knows. Companies consume a significant amount
of management and supervisory time compiling, interpreting and analysing the
data captured within the CMMS. Companies then encounter difficulties analysing
equipment performance trends and their causes as a result of inconsistency in the
form of the data captured and the historical nature of certain elements of it. In
short, companies tend to spend a vast amount of capital on the acquisition of
off-the-shelf systems for data collection, but their added value to the business
is questionable.
Few books have been published about the subject of CMMSs (Bagadia 2006;
Mather 2002; Cato and Mobley 2001; Wireman 1994). However, they tend to
highlight its advantages rather than its drawbacks.
All CMMSs offer data collection facilities; more expensive systems offer
formalised modules for the analysis of maintenance data, and the market leaders
allow real time data logging and networked data sharing (see Table 17.1). Yet,
despite the observations made above regarding the need for information to aid
maintenance management, a ‘black hole’ exists in the row titled ‘Decision
analysis’ in Table 17.1, because virtually no CMMS offers decision support.1 This is a
definite problem, because the key to systematic and effective maintenance is
managerial decision-making that is appropriate to the particular circumstances of
the machine, plant or organisation. This decision-making process is made all the
more difficult if the CMMS package can only offer an analysis of recorded data.
As an example, when a certain preventive maintenance (PM) schedule is input into
a CMMS, for example to change the oil filter every month, the system will simply
produce a monthly instruction to change the oil filter and is thus no more than a

Table 17.1. Facilities offered by commercially available CMMS packages

Price range          £1,000+    £10,000+    £30,000+    £40,000+

Data collection         ✓          ✓           ✓           ✓
Data analysis           –          ✓           ✓           ✓
Decision analysis              A “black hole”

A step towards decision support is to vary the frequency of PM depending on
the combination of failure frequency and severity. A more intelligent feature would
be to generate and prioritise PM according to modes of failure in a dynamic
real-time environment. A PM schedule is usually static and theoretical in that it
does not reflect shop floor realities. In addition, a PM schedule that is copied
from machine manuals is usually inapplicable because:
• Machines work in different environments and would therefore need
different PMs
• Machine designers often have a different experience of machine failures
and means of prevention from those who operate and maintain them
• Machine vendors may have a hidden agenda of maximising spare parts
replacements through frequent PMs

The use of CMMSs for decision support lags significantly behind the more
traditional applications of data acquisition, scheduling and work order issuing.
While many packages offer inventory tracking and some form of stock level
monitoring, the reordering and inventory holding policies remain relatively
simplistic and inefficient. See the work of Exton and Labib (2002) and Labib and
Exton (2001). Also, there is no mechanism to support managerial decision-making
with regard to inventory policy, diagnostics or setting of adaptive and appropriate
preventive maintenance schedules.
The most noticeable problem with current CMMS packages is therefore the lack of
decision support. Figure 17.1 illustrates how far decision-support applications
lag behind data acquisition, scheduling and work-order issuing in terms of the
percentage of systems incorporating each module.

[Figure 17.1: horizontal bar chart titled “Applications of CMMS Modules”.
Modules, from top to bottom: Maintenance budgeting; Predictive maintenance data
analysis; Equipment failure diagnosis; Inventory control; Spare parts
requirements planning; Material and spare parts purchasing; Manpower planning
and scheduling; Work-order planning and scheduling; Equipment parts list;
Equipment repair history; Preventative maintenance planning and scheduling.
Horizontal axis: percentage of systems incorporating module (70 to 100).
Annotation at the top of the chart: “A Black Hole”.]

Figure 17.1. Extent of CMMS module usage (from Swanson 1997)

According to Boznos (1998):

“The primary uses of CMMS appear to be as a storehouse for equipment infor-
mation, as well as a planned maintenance and a work maintenance planning
The same author suggests that CMMS appears to be used less often as a device
for analysis and co-ordination and that:
“Existing CMMS in manufacturing plants are still far from being regarded as
successful in providing team based functions”.
He has surveyed CMMS as well as total productive maintenance (TPM) and
reliability-centred maintenance (RCM) concepts and the extent to which the two
concepts are embedded in existing marketed CMMSs. He has concluded that:
“It is worrying the fact that almost half of the companies are either in some
degree dissatisfied or neutral with their CMMS and that the responses indicated
that manufacturing plants demand more user-friendly systems.”
This is further evidence of the existence of a ‘black hole’. To make matters
worse, it appears that there is a new breed of CMMSs that are complicated and lack
basic aspects of user-friendliness. Although they emphasise integration and
logistics capabilities, they tend to ignore the fact that the fundamental reason for
implementing CMMSs is to reduce breakdowns. These systems are difficult to handle
for both production operators and maintenance engineers; they are accounting-
and/or IT-orientated rather than engineering-orientated. Results of an investigation
(EPSRC – GM/M35291) show that managers’ lack of commitment to maintenance
models can be attributed to a number of reasons:
• Managers are unaware of the various types of maintenance models.
• A full understanding of the various models and the appropriateness of these
systems to companies is not available.
• Managers do not have confidence in mathematical models due to their
complexities and the number of unrealistic assumptions they contain.
This correlates with surveys of existing maintenance models and optimisation
techniques. Ben-Daya et al. (2001) and Sherwin (2000) have also noticed that
models presented in their work have not been widely used in industry for several
reasons, such as:
• Unavailability of data
• Lack of awareness about these models
• Restrictive assumptions of some of these models
Finally, here is an extract from Professor Nigel Slack’s (Warwick University)
textbook on operations management, giving a critical commentary on ERP
implementations (which may equally apply to CMMSs, as many of them nowadays tend
to be classified as specialised ERP systems):
“Far from being the magic ingredient which allows operations to fully integrate
all their information, ERP is regarded by some as one of the most expensive
ways of getting zero or even negative return on investment. For example, the
American chemicals giants, Dow Chemical, spent almost half-a-billion dollars
and seven years implementing an ERP system which became outdated almost
as it was implemented. One company, FoxMeyer Drug, claimed that the ex-
pense and problems which it encountered in implementing ERP eventually
drove it to bankruptcy. One problem is that ERP implementation is expensive.
This is partly because of the need to customise the system, understand its
implications for the organisation, and train staff to use it. Spending on what
some call the ERP ecosystem (consulting, hardware, networking and compli-
mentary applications) has been estimated as being twice the spending on the
software itself. But it is not only the expense which has disillusioned many
companies, it is also the returns they have had for their investment. Some
studies show that the vast majority of companies implementing ERP are
disappointed with the effect it has had on their businesses. Certainly many
companies find that they have to (sometimes fundamentally) change the way
they organise their operations in order to fit in with ERP systems. This or-
ganisational impact of ERP (which has been described as the corporate
equivalent of dental root canal work) can have a significantly disruptive effect
on the organisation’s operations.”
Hence, theory and implementation of existing maintenance models are, to a
large extent, disconnected. It is concluded that there is a need to bridge the gap
between theory and practice through intelligent optimisation systems (e.g. rule-
based systems). It is also argued that the success of this type of research should be
measured by its relevance to practical situations and its impact on the solution of
real maintenance problems. The developed theory must be made accessible to
practitioners through IT tools. Efforts need to be made in the data capturing area to
provide necessary data for such models. Obtaining useful reliability information
from collected maintenance data requires effort. In the past, this has been referred
to as ‘data mining’ as if data can be extracted in its desired form if only it can be
In the next section we introduce a decision analysis model. We then show how
such a model has been implemented for decision support in maintenance systems.

17.3 Application of Decision Analysis in Maintenance

The proposed maintenance model is based on the concept of effectiveness and
adaptability. Mathematical models have been formulated for many typical
situations. These models can be useful in answering questions such as “How much
maintenance should be done on this machine? How frequently should this part be
replaced? How many spares should be kept in stock? How should the shutdown be
scheduled?” It is generally accepted that the vast majority of maintenance models
are aimed at answering efficiency questions, that is, questions of the form “how can
this particular machine be operated more efficiently?”, and not at effectiveness
questions, like “which machine should we improve and how?”. The latter question
is often the one in which practitioners are interested. From this perspective it is not
surprising that practitioners are often dissatisfied if a model is directly applied to
an isolated problem. This is precisely why, in the integrated approach, efficiency
analysis as proposed by the author (do the things right) is preceded by
effectiveness analysis (do the right thing). Hence, two techniques were employed to
illustrate the above-mentioned concepts, namely the fuzzy logic rule-based decision
making grid (DMG) and the analytic hierarchy process (AHP), as proposed by
Labib et al. (1998). The proposed model is illustrated in Figure 17.2.
The decision-making grid (DMG) acts as a map where the performances of the
worst machines are placed based on multiple criteria. The objective is to
implement appropriate actions that will lead to the movement of machines towards an
improved state with respect to multiple criteria. These criteria are determined
through prioritisation based on the analytic hierarchy process (AHP) approach. The
AHP is also used to prioritise failure modes and fault details of components of
critical machines within the scope of the actions recommended by the DMG.

The model is based on identification of criteria of importance such as downtime
and frequency of failures. The DMG then proposes different maintenance policies
based on the state in the grid. Each system in the grid is further analysed in terms
of prioritisation and characterisation of different failure types and main
contributing components.

Figure 17.2. Decision analysis maintenance system

17.4 Maintenance Policies

Maintenance policies can be broadly categorised as technology or systems
oriented (systems or engineering), human factors management oriented, and
monitoring and inspection oriented.
RCM is a technology-based concept in which the reliability of machines is
emphasised. RCM is a method for defining the maintenance strategy in a coherent,
systematic and logical manner. It is a structured methodology for determining the
maintenance requirements of any physical asset in its operating context. The
primary objective of RCM is to preserve system function. The RCM process consists
of looking at the way equipment fails, assessing the consequences of each failure
(for production, safety, etc.), and choosing the correct maintenance action to ensure
that the desired overall level of plant performance (i.e. availability, reliability) is
met. The term RCM was originally coined by Nolan and Heap (1979). For more
details on RCM see Moubray (1991, 2001) and Netherton (2000).

TPM is a human-based technique in which maintainability is emphasised. TPM is
a tried and tested way of cutting waste, saving money, and making factories better
places to work. TPM gives operators the knowledge and confidence to manage
their own machines. Instead of waiting for a breakdown and then calling the
maintenance engineer, they deal directly with small problems before they become big
ones. Operators investigate and then eliminate the root causes of machine errors.
They also work in small teams to achieve continuous improvements to the
production lines. For more details on TPM see Nakajima (1988), Hartmann (1992) and
Willmott (1994).
Condition based maintenance (CBM) – not condition based monitoring – is a
sensing technique in which availability based on inspection and follow-up is
emphasised. In the British Standards, CBM is defined as “the preventive
maintenance initiated as a result of knowledge of the condition of an item from
routine or continuous monitoring” (BS 3811, 1984). It is the means whereby
sensors, sampling of lubricant products, and visual inspection are utilised to permit
continued operation of critical machinery and avoid catastrophic damage to vital
components. The integral components for the successful application of condition
monitoring of machinery are: reliable detection, correct diagnosis, and dependable
decision-making. For more details on CBM, see Brashaw (1998) and Holroyd
The proposed approach in this chapter differs from the above-mentioned
ones in that it offers a decision map, adaptive to the collected data, which suggests
the appropriate use of RCM, TPM, and CBM.

17.5 The DMG Through an Industrial Case Study

This case study demonstrates the application of the proposed model and its effect
on asset management performance. The application of the model is shown through
the experience of a company seeking to achieve world-class status in asset
management. The company has implemented the proposed model which has had the effect
of reducing total downtime from an average of 800 h per month to less than 100 h
per month as shown in Figure 17.3.

17.5.1 Company Background and Methodology

In this particular company there are 130 machines, varying from robots and
machine centres to manually operated assembly tables. Notice that, in this case
study, only two criteria are used (frequency and downtime). However, if more
criteria are included, such as spare parts cost and scrap rate, the model becomes
multi-dimensional, with low, medium, and high ranges for each identified criterion.
The methodology implemented in this case was to follow three steps. These steps
are: i. criteria analysis, ii. decision mapping, and iii. decision support.

[Figure 17.3: chart of breakdown trends (h) by month, from November to the
following November.]

Figure 17.3. Total breakdown trends per month

17.5.2 Step 1: Criteria Analysis

As indicated earlier, the aim of this phase is to establish a Pareto analysis of two
important criteria: downtime, the main concern of production, and frequency of
calls, the main concern of asset management. The objective of this phase is to
assess how bad the worst performing machines are over a certain period of time, say
one month. The worst performers in both criteria are sorted and grouped into high,
medium, and low sub-groups. These ranges are selected so that machines are
distributed evenly across the sub-groups for every criterion. This is presented in
Figure 17.4. In this particular case, the total number of machines is 120. Machines
include CNCs, robots, and machine centres.
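
The sorting and grouping in this step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `rank_and_group`, the machine labels, the monthly figures and the choice of six worst performers are all hypothetical, and an even three-way split is just one way of reading "distributed evenly".

```python
from typing import Dict, List


def rank_and_group(values: Dict[str, float], top_n: int) -> Dict[str, List[str]]:
    """Sort machines worst-first on one criterion and split the top_n into thirds."""
    worst_first = sorted(values, key=lambda m: values[m], reverse=True)[:top_n]
    size = -(-len(worst_first) // 3)  # ceiling division keeps the groups even
    return {
        "high": worst_first[:size],
        "medium": worst_first[size:2 * size],
        "low": worst_first[2 * size:],
    }


# Hypothetical one-month records (not data from the case study)
downtime_h = {"A": 30.0, "B": 27.0, "C": 25.0, "D": 20.0, "E": 18.0, "F": 16.0, "G": 2.0}
calls = {"A": 4, "B": 1, "C": 12, "D": 9, "E": 3, "F": 2, "G": 15}

downtime_groups = rank_and_group(downtime_h, top_n=6)
frequency_groups = rank_and_group(calls, top_n=6)
print(downtime_groups)
print(frequency_groups)
```

Each machine then carries two labels, one per criterion, which is the input needed for the decision mapping in Step 2.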

Figure 17.4. Step 1: criteria analysis


17.5.3 Step 2: Decision Mapping

The aim of this step is twofold: it scales the high, medium, and low groups so that
the genuinely worst machines in both criteria can be monitored on the grid, and it
monitors the performance of different machines and suggests appropriate actions.
The next step is to place the machines in the “decision making grid” shown in
Figure 17.5 and, accordingly, to recommend asset management decisions to
management. This grid acts as a map where the performances of the worst machines are
placed based on multiple criteria. The objective is to implement appropriate actions
that will lead to the movement of machines towards the north-west section of low
downtime and low frequency. In the top-left region, the action to implement, or the
rule that applies, is OTF (operate to failure). The rule that applies for the
bottom-left region is SLU (skill level upgrade) because data collected from breakdowns
attended by maintenance engineers indicate that machine [G] has been visited
many times (high frequency) for limited periods (low downtime). In other words,
maintaining this machine is a relatively easy task that can be passed to operators
after upgrading their skill levels.
A machine located in the top-right region, such as machine [B], is a
problematic machine, in maintenance words “a killer”. It does not break down
frequently (low frequency), but when it stops it is usually a big problem that lasts
for a long time (high downtime). In this case the appropriate action to take is to
analyse the breakdown events and closely monitor its condition, i.e. condition-based
monitoring (CBM).
A machine that enters the bottom-right region is considered to be one of the
worst performing machines based on both criteria. It is a machine that maintenance
engineers are used to seeing out of operation rather than performing its normal
operating duty. A machine of this category, such as machine [C], will need to be
structurally modified and major design-out projects need to be considered;
hence the appropriate rule to implement is design-out maintenance (DOM).
If one of the antecedents is a medium downtime or a medium frequency, then
the rule to apply is to carry on with the preventive maintenance schedules. However,
not all of the medium regions are the same. Some regions are near the top-left
corner, where FTM (fixed time maintenance) is “easy” because it is close to the
OTF region; it requires re-addressing who will perform the instruction or when the
instruction will be implemented. For example, machines [I] and [J] are situated in
the region between OTF and SLU, and the question is who will carry out the
instruction: operator, maintenance engineer, or sub-contractor. Also, a machine such
as machine [F] has been shifted from the OTF region due to its relatively higher
downtime, and hence the timing of instructions needs to be addressed.
Other preventive maintenance schedules need to be addressed in a different
manner. The “difficult” FTM issues are the ones related to the contents of the
instruction itself. It might be the case that the wrong problem is being solved or the
right one is not being solved adequately. In this case machines such as [A] and [D]
need to be investigated in terms of the contents of their preventive instructions, and
expert advice is needed.
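
The nine regions described above amount to a lookup from a (frequency level, downtime level) pair to a recommended policy. The sketch below is one possible encoding, not the author's implementation: the names `DMG`, `to_level` and `recommend` are hypothetical, and the numeric limits are placeholder values chosen for illustration, since in the model the ranges are derived from the data in Step 1.

```python
from typing import Tuple

# DMG[frequency_level][downtime_level] -> policy, following the regions in the text
DMG = {
    "low":    {"low": "OTF",        "medium": "FTM (when?)", "high": "CBM"},
    "medium": {"low": "FTM (who?)", "medium": "FTM",         "high": "FTM (what?)"},
    "high":   {"low": "SLU",        "medium": "FTM (how?)",  "high": "DOM"},
}


def to_level(value: float, limits: Tuple[float, float]) -> str:
    """Classify a crisp value as low, medium or high using two limits."""
    low_limit, high_limit = limits
    if value < low_limit:
        return "low"
    if value < high_limit:
        return "medium"
    return "high"


def recommend(
    frequency: float,
    downtime_h: float,
    frequency_limits: Tuple[float, float] = (10, 20),    # assumed limits
    downtime_limits: Tuple[float, float] = (100, 300),   # assumed limits
) -> str:
    """Return the DMG policy for one machine."""
    row = to_level(frequency, frequency_limits)
    column = to_level(downtime_h, downtime_limits)
    return DMG[row][column]


print(recommend(frequency=25, downtime_h=50))   # frequent but short stoppages
print(recommend(frequency=5, downtime_h=400))   # rare but long stoppages
```

Keeping the grid as data rather than as nested conditionals makes it straightforward to add criteria or to change a recommended policy for a region.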

Decision making grid (columns: downtime; rows: frequency; the values 10 and 20
are marked on the grid)

            Low                     Med.                    High
Low         O.T.F. [H]              F.T.M. (When?) [F]      C.B.M. [B]
Med.        F.T.M. (Who?) [I] [J]   F.T.M.                  F.T.M. (What?) [A]
High        S.L.U. [G]              F.T.M. (How?) [D]       D.O.M. [C]

CBM: condition base monitoring      OTF: operate to failure
SLU: skill level upgrade            DOM: design out M/C
FTM: fixed time maintenance

Figure 17.5. Step 2: decision mapping

17.5.4 Step 3: Multileveled Decision Support

Once the worst performing machines are identified and the appropriate action is
suggested, it is now a case of identifying a focused action to be implemented. In
other words, we need to move from the strategic systems level to the operational
component level. Using the analytic hierarchy process (AHP), one can model a
hierarchy of levels related to objectives, criteria, failure categories, failure details
and failed components. For more details on the AHP readers can consult Saaty
(1988). This step is shown in Figure 17.6.
The AHP is a mathematical model developed by Saaty (1980) that prioritises
every element in the hierarchy relative to other elements in the same level. The
prioritisation of each element is achieved with respect to all elements in the level
above. Therefore, we obtain a global prioritised value for every element in the
lowest level. In doing so, we can then compare the prioritised fault details (level 4
in Figure 17.6) with PM signatures (keywords) related to the same machine. PMs
can then be varied accordingly in an adaptive manner to shop floor realities.
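
To make the idea of local and global priorities concrete, the following sketch derives priorities from pairwise comparison matrices and combines them down a two-level hierarchy. It rests on several assumptions: the judgement values are invented, only two criteria and two fault categories are shown, and the priorities are computed with normalised row geometric means, a common approximation used here in place of the principal-eigenvector calculation usually associated with Saaty's method (the two agree for perfectly consistent matrices).

```python
import math
from typing import Dict, List


def ahp_priorities(pairwise: List[List[float]]) -> List[float]:
    """Approximate AHP priorities by normalising the row geometric means."""
    geo_means = [math.prod(row) ** (1.0 / len(row)) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


# Level 1: hypothetical judgement that downtime is twice as important as frequency
criteria = ["downtime", "frequency"]
criteria_weights = ahp_priorities([[1, 2], [1 / 2, 1]])

# Lower level: hypothetical local priorities of fault categories under each criterion
fault_categories = ["electrical", "mechanical"]
local: Dict[str, List[float]] = {
    "downtime": ahp_priorities([[1, 3], [1 / 3, 1]]),
    "frequency": ahp_priorities([[1, 1], [1, 1]]),
}

# Global priority = sum over criteria of (criterion weight x local priority)
global_priority = {
    fault: sum(w * local[c][i] for c, w in zip(criteria, criteria_weights))
    for i, fault in enumerate(fault_categories)
}
print(global_priority)
```

The same weighted sum is repeated at each further level, so fault details at the bottom of the hierarchy inherit the weights of every level above them.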
The proposed decision analysis maintenance model, shown previously in
Figure 17.2, combines both fixed rules and flexible strategies since machines are
compared on a relative scale. The scale itself is adaptive to machine performance
with respect to identified criteria of importance. Hence the concept of flexibility
is embedded in the proposed model.

[Figure 17.6: hierarchy diagram titled “Multiple Criteria Decision Analysis
(MCDA)”. Level 1 (criteria): Downtime, Frequency, Spare Parts, Bottlenecks.
Level 2: System A, System B, System C, ... Level 3: Electrical, Mechanical,
Hydraulic, Pneumatic, Software. Level 4 (fault details): Motor Faults, No Power
Faults, Panel Faults, Switch Faults, Limit Faults, Proximity Faults, Pressure
Faults, Push Button Faults.]

Figure 17.6. Step 3: decision support

Fuzzy Logic Rule Based Decision Making Grid

In practice, however, there are two cases in which the model needs to be refined.
The first case is when two machines are located close to each other but on opposite
sides of a boundary between two policies. In this case we apply two different
policies despite a minor performance difference between the two machines. The
second case is when two machines are at opposite extremes of the quadrant of a
certain policy. In this case we apply the same policy despite the fact that they are
not near each other. Both cases are illustrated in Figure 17.7. For both cases we can
apply the concept of fuzzy logic, where boundaries are smoothed and rules are
applied simultaneously with varying weights.
In fuzzy logic, one needs to identify membership functions for each controlling
factor, in this case frequency and downtime, as shown in Figure 17.8a,b. A
membership function defines a fuzzy set by mapping crisp inputs from its domain to
degrees of membership in the interval [0, 1]. The domain of a membership function
is the range over which it is mapped, and its scope is the width of that range. Here
the domain of the fuzzy set medium frequency is from 10 to 40 and its scope is 30
(40–10), whereas the domain of the fuzzy set high downtime is from 300 to 500 and
its scope is 200 (500–300), and so on.
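As a minimal sketch of such a mapping, the code below evaluates a triangular membership function for the fuzzy set medium frequency. Only the domain (10 to 40) comes from the text. The triangular shape, the peak at the midpoint (25) and the function name are assumptions made for illustration, since the shapes in Figure 17.8 are not reproduced here.

```python
def triangular(x, left, peak, right):
    """Degree of membership in [0, 1] for a triangular fuzzy set."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Fuzzy set "medium frequency": domain 10 to 40 (from the text),
# peak assumed at the midpoint of the domain.
MEDIUM_FREQUENCY = (10.0, 25.0, 40.0)

membership_at_12 = triangular(12.0, *MEDIUM_FREQUENCY)
membership_at_25 = triangular(25.0, *MEDIUM_FREQUENCY)
print(membership_at_12, membership_at_25)
```

A crisp input can belong to two neighbouring sets at once, each with its own degree of membership, which is what smooths the boundaries between policies.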
Computerised Maintenance Management Systems 429

Figure 17.7. Special cases for the DMG model

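[Figure 17.8 (plots not reproduced): a frequency axis from 0 to 50 (no. of times)
with fuzzy sets low, medium and high, example input 12; b downtime axis from 0 to
500 (hrs) with fuzzy sets low, medium and high, example input 380.]

Figure 17.8. a Membership function of frequency b Membership function of downtime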


The output strategies have a membership function, and we have assumed a cost
(or benefit) function that is linear and follows the relationship (DOM >
CBM > SLU > FTM > OTF), as shown in Figure 17.9a.
The rules are then constructed based on the DMG, giving nine rules in total.
Examples of the rules are as follows:
• If frequency is high and downtime is low then maintenance strategy is SLU
(skill level upgrade).
• If frequency is low and downtime is high then maintenance strategy is
CBM (condition based maintenance).
Rules are shown in Figure 17.9b.
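The two example rules can be evaluated with a short sketch. It assumes the minimum operator for the "and" in each rule, which is a common choice in rule-based fuzzy systems; the chapter does not state which operator is used. The membership degrees are hypothetical inputs, and only the two rules quoted above are encoded, not the full set of nine.

```python
# The two example rules from the text: (antecedent, suggested strategy).
RULES = [
    ({"frequency": "high", "downtime": "low"}, "SLU"),
    ({"frequency": "low", "downtime": "high"}, "CBM"),
]


def firing_strengths(memberships):
    """Firing strength of each strategy, using min for the fuzzy 'and'."""
    strengths = {}
    for antecedent, strategy in RULES:
        strength = min(memberships[var][label] for var, label in antecedent.items())
        strengths[strategy] = max(strengths.get(strategy, 0.0), strength)
    return strengths


# Hypothetical degrees of membership for one machine.
memberships = {
    "frequency": {"low": 0.8, "medium": 0.2, "high": 0.0},
    "downtime": {"low": 0.0, "medium": 0.3, "high": 0.7},
}
strengths = firing_strengths(memberships)
print(strengths)
```

With all nine rules encoded, the firing strengths would be combined with the output membership functions to produce the suggested strategy.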

[Figure 17.9 (plots not reproduced): output axis with ticks 0, 20, 30, 40, 50
(x £1,000/unit).]

Figure 17.9. a Output (strategies) membership function. b The nine rules of the DMG

The fuzzy decision surface is shown in Figure 17.10. In this figure, given any
combination of frequency (x-axis) and downtime (y-axis), one can determine the
most appropriate strategy to follow (z-axis).



Figure 17.10. The fuzzy decision surface

It can be seen from Figure 17.11 that the relationship (DOM > CBM
> SLU > FTM > OTF) is maintained. As illustrated in Figure 17.11, given a downtime
of 380 h and a frequency of 12 failures, the suggested strategy to follow is CBM.

Figure 17.11. The fuzzy decision surface showing the regions of different strategies

17.5.5 Discussion

The concept of the DMG was originally proposed by Labib (1996). It was then
implemented in a company that has achieved a world-class status in maintenance
(Labib 1998a). The DMG model has also been extended to be used as a technique
to deal with crisis management in an award winning paper (Labib 1998b).
The DMG can be used as a practical continuous improvement process because,
when machines in the top ten have been addressed, they will then, if and only if
appropriate action has been taken, move down the list of the top ten worst machines.
When they move down the list, other machines show that they need improvement,
and resources can then be directed towards the new offenders. If this practice is
used continuously, then eventually all machines will be running optimally.
If problems are chronic, i.e. regular, minor and usually neglected, some of these
could be due to the incompetence of the user, and thus skill level upgrading would
be an appropriate solution. However, if machines tend towards RCM, then the
problems are more sporadic and, when they occur, could be catastrophic. The use of
techniques such as FMEA and FTA can help determine the cause and
may help predict failures, thus allowing a prevention scheme to be devised.
Figure 17.12 shows when to apply TPM and RCM. TPM is appropriate in the
SLU range, since skill level upgrade of machine tool operators is a fundamental
concept of TPM, whereas RCM is applicable for machines exhibiting severe
failures (high downtime and low frequency). CBM and FMEA are also well suited
to this kind of machine, and hence an RCM policy will be most applicable. The
significance of this approach is that RCM and TPM are combined in a single
unified model rather than treated as two competing concepts.

Figure 17.12. When to apply RCM and TPM in the DMG


Figure 17.13. Parts of PM schedules that need to be addressed in the DMG

Generally the easy preventive maintenance (PM), fixed time maintenance
(FTM) questions are who? and when? (efficiency questions). The more difficult
ones are what? and how? (effectiveness questions), as indicated in Figure 17.13.

17.6 Unmet Needs in Responsive Maintenance

According to Professor Jay Lee, of the National Science Foundation (NSF)
Industry/University Cooperative Research Centre on Intelligent Maintenance Sys-
tems (IMS) at the University of Cincinnati, unmet needs in responsive maintenance
can be categorised as follows:
• Machine intelligence – intelligent monitoring, prediction, prevention and
compensation and reconfiguration for sustainability (self-maintenance)
• Operations intelligence – prioritisation, optimisation and responsive main-
tenance scheduling for reconfiguration needs
• Synchronisation intelligence – autonomous information flow from market
demand to factory asset utilisation
It can be concluded that the challenges and research questions facing research
and development (R&D) concerning next generation maintenance systems are:
• How to adapt PM schedules to cope dynamically with shop-floor reality
• How to feed back information and knowledge gathered in maintenance to
the designers
• How to link maintenance policies to corporate strategy and objectives
• How to synchronise production scheduling based on maintenance performance

17.7 Future Directions and Conclusions

Training and educational programmes should be designed to address the existence
of the considerable gap between the skills that are essential to maximise the poten-
tial benefits from these advanced systems and technologies in the area of main-
tenance and asset management and the skills that currently exist in the maintenance
sections of most industries.
Existing ERP and CMMS systems tend to put much emphasis on data collec-
tion and analysis rather than on decision analysis. Although the existing teaching
programmes already address some of the issues related to next-generation main-
tenance systems, there is still room for considering other issues, such as:
• Emphasis on CMMS and ERP systems in the market, as well as their use
and limitations
• Design awareness in maintenance and design for maintainability
• Learning from failures across different industries and disciplines
• Emphasis on prognostics rather than diagnostics
• e-Maintenance and remote maintenance, including self-powered sensors
• Modelling and simulation using OR tools and techniques
• AI applications in maintenance
As the success of systems implementation is based on two factors, human and
systems, it is important to develop and nurture skills as well as to use advanced
In this chapter we have investigated the characteristics of computerised main-
tenance management systems (CMMSs) and have highlighted the need for them in
industry and identified their current deficiencies.
A proposed model was then presented to provide a decision analysis capability
that is often missing in existing CMMSs. The effect of such a model is to contribute
towards the optimisation of the functionality and scope of CMMSs for enhanced
decision analysis support.
We have also demonstrated the use of AI techniques in CMMSs and shown how
this work integrates with the work of Kobbacy in Chapter 9. Finally, we have
outlined features of next generation maintenance systems.

17.8 References
Bagadia, K. (2006), Computerized Maintenance Management Systems Made Easy,
Brashaw, C. (1998), Characteristics of acoustic emission (AE) signals from ill-fitted copper
split bearings, Proc 2nd Int. Conf on Planned Maintenance, Reliability and Quality.
Ben-Daya, M., Duffuaa, S.O. and Raouf, A. (eds) (2001), Maintenance Modelling and
Optimisation, Kluwer Academic Publishers, London.
Boznos, D. (1998), The Use of CMMSs to Support Team-Based Maintenance, MPhil thesis,
Cranfield University.
Cato, W. and Mobley, K. (2001), Computer-Managed Maintenance Systems: A Step-by-Step
Guide to Effective Management of Maintenance, Labor, and Inventory, Butterworth
Heinemann, Oxford.
Exton, T. and Labib, A.W. (2002), Spare parts decision analysis – The missing link in
CMMSs (Part II), Journal of Maintenance & Asset Management, 17, 14–21.
Hartmann, E.H. (1992), Successfully Installing TPM in a Non-Japanese Plant, TPM Press,
Inc., New York.
Holroyd, T. (2000), Acoustic Emission & Ultrasonics, Coxamoor Publishing Company,
Labib, A.W. and Exton, T. (2001), Spare parts decision analysis – The missing link in CMMSs
(Part I), Journal of Maintenance & Asset Management, 16(3), 10–17.
Labib, A.W., Williams, G.B. and O’Connor, R.F. (1998), An intelligent maintenance model
(system): An application of the analytic hierarchy process and a fuzzy logic rule-based
controller, Journal of the Operational Research Society, 49, 745–757.
Labib, A.W. (1998a), World-class maintenance using a computerised maintenance
management system, Journal of Quality in Maintenance Engineering, 4, 66–75.
Labib, A.W. (1998b), A Logistic approach to managing the millennium information systems
problem, Journal of Logistics Information Management, 11, 285–384.
Labib, A.W. (1996), An integrated appropriate productive maintenance, PhD Thesis,
University of Birmingham.
Mather, D. (2002), CMMS: A Timesaving Implementation Process, CRC PRESS, New
Moubray, J. (2001), The case against streamlined RCM, Maintenance & Asset Management,
16, 15–27.
Moubray, J. (1991), Reliability Centred Maintenance, Butterworth-Heinemann Ltd, Oxford.
Nakajima, S. (1988), Total Productive Maintenance, Productivity Press, Illinois.
Netherton, D. (2000), RCM Standard, Maintenance & Asset Management, 15, 12–20.
Nolan, F. and Heap, H. (1979), Reliability Centred Maintenance, National Technical
Information Service Report, # A066-579.
Saaty, T.L. (1988), The Analytic Hierarchy Process, Pergamon Press, New York.
Saaty, T.L. (1980), The Analytic Hierarchy Process: Planning, Priority Setting – Resource
Allocation, McGraw-Hill, New York.
Sherwin, D., (2000) A review of overall models for maintenance management, Journal of
Quality in Maintenance Engineering, 6, 138–164
Swanson, L. (1997), Computerized Maintenance Management Systems: A study of system
design and use, Production and Inventory Management Journal, Second Quarter: 11–14.
Willmott, P. (1994), Total Productive Maintenance. The Western Way, Butterworth
Heinemann Ltd., Oxford
Wireman, T. (1994), Computerized Maintenance Management Systems, 2nd edition,
Industrial Press Inc, New York.

Risk Analysis in Maintenance

Terje Aven

18.1 Introduction
This chapter discusses the use of risk analysis to support decision making on
maintenance activities. In recent years there has been a growing interest in the use
of risk analysis and risk based (informed) approaches for guiding decisions on
maintenance; see, e.g., Vatn et al. (1996), Clarotti et al. (1997), Dekker (1996) and
Cepin (2002). This topic has also been given much attention in industry; see, for
example, van Manen et al. (1997), Knoll et al. (1996), Perryman et al. (1995) and
Podofillini et al. (2006). This chapter provides a critical review of some of the key
building blocks of the theories and methods developed. We also discuss some
critical factors for ensuring a successful use of risk analysis for maintenance
applications. The issues discussed include:
• Risk descriptions and categorisations
• Uncertainty assessments
• Risk acceptance and risk informed decision making
• Selection of appropriate methods and tools
An example is presented of a detailed risk analysis, showing the effect of mainten-
ance efforts on risk.
The chapter is organised as follows. First in Section 18.2 we review the basic
elements of risk management and risk management processes, and clarify the risk
perspective adopted in this chapter. Then in Section 18.3 we address the use of risk
analysis to support decisions on maintenance. Various types of decision situations
and analyses are covered. Section 18.4 presents the case mentioned above. In
Section 18.5 we discuss key building blocks of the theories and methods devel-
oped, as well as the critical factors for ensuring a successful use of risk analysis for
maintenance applications. Section 18.6 concludes. When not otherwise stated, we
use terminology from ISO (2002).

List of abbreviations:
PLL     Potential loss of life (expected number of fatalities per year)
FAR     Fatal accident rate (expected number of fatalities per 100 million exposed hours)
ETA     Event tree analysis
FTA     Fault tree analysis
CCA     Cause consequence analysis
FMECA   Failure mode and effect and criticality analysis
HAZOP   Hazard and operability studies
RIF     Risk influencing factor
BORA    Barrier and operational risk analysis
RCM     Reliability centred maintenance
HMI     Human machine interface
TTS     Technical condition safety

18.2 Basics of Risk Management and Risk Analysis

18.2.1 General

The purpose of risk management is to ensure that adequate measures are taken to
protect people, the environment and assets from harmful consequences of the
activities being undertaken, as well as balancing different concerns, in particular
risks and costs. Risk management includes measures both to avoid the occurrence
of hazards and to reduce their potential harm. Traditionally risk management was
based on a prescriptive regulating regime, in which detailed requirements were set
for the design and operation of the arrangements. This regime has gradually been
replaced by a more goal oriented regime, putting emphasis on what to achieve
rather than the solutions.
Risk management is an integral aspect of a goal oriented regime. It is
acknowledged that risk cannot be eliminated but must be managed. There is nowadays
an enormous drive and enthusiasm in various industries, and in society as a whole,
to implement risk management in organizations. There are high expectations that
risk management is the proper framework for obtaining high levels of performance.
To support decision making on design and operation, risk analyses are conducted.
The analyses cover identification of hazards and threats, cause analyses,
consequence analyses and risk description. Evaluations of the results of the analyses
are carried out. The totality of the analyses and the evaluations is referred to as
risk assessment. Risk assessment is followed by risk treatment, which is the
process of selecting and implementing measures to modify risk, including measures
to avoid, reduce (“optimize”), transfer or retain risk. Risk transfer means sharing
with another party the benefit or loss associated with a risk. It is typically effected
through insurance. Risk management covers all co-ordinated activities to direct and
control an organisation with regard to risk. The risk management process is the
systematic application of management policies, procedures and practices to the
tasks of establishing the context, assessing, treating, monitoring, reviewing and
communicating risks; see Figure 18.1.

Figure 18.1. Risk management process (based on ISO 2005)

Risk management involves managing to achieve an appropriate balance
between realizing opportunities for gains while minimizing losses. It is an integral
part of good management practice and an essential element of good corporate
governance. It is an iterative process consisting of steps that, when undertaken in
sequence, enable continuous improvement in decision making and facilitate
continuous improvement in performance.
“Establishing the context” (see Figure 18.1) defines the basic frame conditions
within which the risks must be managed and sets the scope for the rest of the risk
management process. The context includes the organization’s external and internal
environment and the purpose of the risk management activity. This also includes
consideration of the interface between the external and internal environments.
Establishing the context also means defining suitable decision criteria as well as
structures for how to carry out the risk assessment process.

Risk analysis is often used in combination with risk acceptance criteria, as
inputs to risk evaluation. Sometimes the term risk tolerability limits is used instead
of risk acceptance criteria. The criteria state what is deemed as an unacceptable
risk level. The need for risk reducing measures is assessed with reference to these
criteria. In some industries and countries, it is a requirement in regulations that
such criteria should be defined in advance of performing the analyses.

18.2.2 Risk Perspective Adopted

The discussion in this chapter is based on a risk perspective characterised by the
following points:

1. Risk is defined by the combination of possible consequences associated
with an activity and the assessor’s uncertainty about these consequences.
The consequences are normally expressed by quantities that can be
measured (such as money, loss of lives, etc.). A set of quantities is typically
needed to give a proper description of the consequences. We refer to
these quantities as observable quantities or just observables.
2. Risk (uncertainty) is quantitatively expressed by probabilities and expected
values. We assess the uncertainties and assign probabilities (and hence we
assign values for risk). A probability is always conditional on some infor-
mation and knowledge.
3. Risk analyses provide decision support by analysing and describing risk
(uncertainty). The risk analysts analyse and evaluate the risks, i.e.
they discuss the significance of the risks in relation to comparable activities
and possible criteria. The analyses need to be evaluated in light of their
premises, assumptions and limitations. The analyses are
based on background information that must be reviewed together with the
results of the analyses. The decision maker performs what we refer to as a
managerial review and judgment.
4. It is essential to distinguish between the expected values
determined at the point of decision making and the real outcomes.
The expected values give, to a varying degree, good predictions of the
future outcomes. Uncertainty and safety management are justified by
reference to these outcomes and not the expected values alone.
5. What is acceptable risk and the need for risk reduction cannot be deter-
mined just by reference to the results of risk analyses. To be precise, we do
not accept a risk, but we accept a solution, with all its attributes.
6. Cost-benefit analysis means calculating expected net present values with a
risk adjusted discount rate or risk-adjusted cash-flows. In a societal context,
the society’s willingness to pay is the appropriate reference, whereas for
businesses it is the decision maker’s willingness to pay that is to be used.
7. Cost-effectiveness analysis means calculating measures such as the
expected cost per expected saved life.

8. A multi-attribute analysis is an analysis of the various attributes (costs,
safety, …) of the decision problem, separately for each attribute.
9. Risk and decision analyses need extensive use of sensitivity and robustness
analyses.

Thus we adopt a broad perspective on risk, acknowledging that risk cannot be
distinguished from the context it is a part of, the aspects that are addressed, those
who assess the risk, the methods and tools used, etc.
Following our definition of risk, a low degree of uncertainty does not ne-
cessarily mean a low risk, or a high degree of uncertainty does not necessarily
mean a high level of risk. As risk is defined as the combination of possible
consequences and the associated uncertainties (quantified by probabilities), any
judgment about the level of risk needs to consider both dimensions. For example,
consider a case where only two outcomes are possible, 0 and 1, corresponding to 0
and 1 fatality, and the decision alternatives are A and B, having uncertainty
(probability) distributions (0.5, 0.5), and (0.0, 1.0), respectively. Hence for alterna-
tive A there is a higher degree of uncertainty than for alternative B. However,
considering both dimensions, we would of course judge alternative B to have the
highest risk as the negative outcome 1 is certain to occur.
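The two alternatives can be compared numerically with a small sketch. Using the variance of the number of fatalities as a summary of the degree of uncertainty is an assumption made here for illustration; the chapter itself expresses uncertainty through the probability distributions.

```python
def mean_and_variance(outcomes, probabilities):
    """Expected value and variance of a discrete outcome distribution."""
    mean = sum(x * p for x, p in zip(outcomes, probabilities))
    variance = sum(p * (x - mean) ** 2 for x, p in zip(outcomes, probabilities))
    return mean, variance


outcomes = (0, 1)  # number of fatalities
mean_a, var_a = mean_and_variance(outcomes, (0.5, 0.5))  # alternative A
mean_b, var_b = mean_and_variance(outcomes, (0.0, 1.0))  # alternative B
print(mean_a, var_a, mean_b, var_b)
```

Alternative A has the larger spread but the lower expected number of fatalities, whereas alternative B has no spread at all and the worse outcome is certain.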
The above building blocks are premises for the analysis and discussion in this
chapter. For their justification and suitability we refer to Aven and Kristensen
(2005), Aven et al. (2007) and Aven and Vinnem (2007). Some aspects of particular
importance for the maintenance applications are addressed in Section 18.5.

18.3 Risk Analysis to Support Decisions on Maintenance

Our starting point is a decision maker facing some decision points in a project.
These decision points include problems and opportunities related to maintenance.
Having identified the main decision points, adequate decision alternatives need to
be generated and assessed, relating to whether or not to execute an activity,
alternative maintenance policies, etc. Our focus is on situations characterized by a
potential for rather large consequences, with large associated uncertainties and/or
high probabilities of these consequences occurring if the alternatives are in fact
realised, i.e. high risks according to our definition of risk. The consequences and
associated uncertainties relate to economic performance, possible accidents leading
to loss of lives and/or environmental damage, etc. Risk analyses are considered to
give valuable decision support in such situations.
In this chapter we are particularly concerned about how the maintenance
activities are reflected in the risk analysis. A distinction between different types of
analysis methods is then required. To identify hazards and risks, FMECA (failure
mode and effect and criticality analysis) and HAZOP (hazard and operability
studies) are two of the most common methods. In FMECA categories of the
possible consequences and associated likelihoods are introduced and the criticality
is determined using a risk matrix approach. Using this approach, different main-
tenance strategies may be assessed with respect to risk (criticality) and compared
using the risk matrix. This is a crude risk analysis. The next level of sophistication
of risk analysis is obtained when models are developed to represent cause and/or
consequence scenarios. The standard tools used are FTA (fault tree analysis) and
ETA (event tree analysis) and the combination of the two, CCA (cause consequence
analysis). These models are important elements in a qualitative risk analysis,
and provide the basis for a quantitative risk analysis. These are all standard risk
analysis methods and we refer to textbooks for a description and discussion of these
methods; see, e.g., Aven (1992) and Modarres (1993).
The models are used to identify critical systems, and thus provide a basis for
selecting appropriate maintenance activities. To illustrate this, let R be a risk index,
for example expressing the expected number of fatalities (PLL) or the probability
of a system failure, and let Ri be the risk index when subsystem i is in the
functioning state. Then a common way of ranking the different subsystems is to
compute the risk improvement potential (also referred to as the risk achievement
worth) Ii = R – Ri, i.e. the maximum potential risk improvement that can be
obtained by improving subsystem i (Aven 1992; Haimes 1998). The potential Ii is
referred to as a risk importance measure. An application of this approach is
presented in Brewer and Canady (1999). Criteria are established based on such a
ranking to identify when maintenance improvements are needed to reduce risks.
Identifying critical items is an important basis for maintenance management, and is
one of the key steps in various maintenance frameworks, e.g. the RCM (reliability
centred maintenance) approach (Andersen and Neri 1990).
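The ranking can be sketched for a small hypothetical system. Here the potential is taken as the reduction in R obtained when subsystem i is set to the functioning state, i.e. R – Ri, a sign convention assumed so that the potential is non-negative. The system structure, the failure probabilities and the function names are illustrative assumptions, and the subsystems are assumed independent.

```python
def system_failure_probability(q):
    """Assumed structure: the system fails if subsystem 1 fails,
    or if both subsystems 2 and 3 fail (independent subsystems)."""
    return 1.0 - (1.0 - q[1]) * (1.0 - q[2] * q[3])


def improvement_potentials(q):
    """Risk improvement potential R - Ri for each subsystem i."""
    base = system_failure_probability(q)
    potentials = {}
    for i in q:
        perfect = dict(q)
        perfect[i] = 0.0  # subsystem i is in the functioning state
        potentials[i] = base - system_failure_probability(perfect)
    return potentials


# Hypothetical subsystem failure probabilities.
q = {1: 0.01, 2: 0.10, 3: 0.30}
potentials = improvement_potentials(q)
ranking = sorted(potentials, key=potentials.get, reverse=True)
print(potentials, ranking)
```

In this example subsystems 2 and 3 have the same potential, because making either of them perfect removes the same failure path, and subsystem 1 ranks last.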
In risk analysis, the maintenance efforts are incorporated by:

1. Showing the relation between maintenance effort and component performance
2. Showing the relation between component performance and overall risk

An example demonstrating level 1 (the component level) is the periodic testing of a
component, where the component has a failure rate λ and the testing interval is τ.
Then the unavailability of the component is approximated by λτ/2, expressing the
mean fractional down time of the component. We refer to the literature for further
details on this example and related models and methods, including Markov
methods; see, e.g., Aven (1992), Rausand and Høyland (2003) and Modarres
(1993). The component measures often express features of the performance
of safety barriers, reflected in the event trees. In this way a link is established
between the component performance level and risk (level 2). For the periodic
testing example, suppose that the component is a safety barrier modelled as a
branching event of the event tree. Then the unavailability λτ/2 expresses the
probability that this barrier is not functioning on demand.
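As a hedged numerical sketch, the approximation λτ/2 can be compared with the mean unavailability over a test interval under some simplifying assumptions: exponentially distributed time to failure, failures revealed only at tests, and negligible test and repair times. The failure rate, test interval and demand frequency below are hypothetical values, and the function names are illustrative.

```python
import math


def unavailability_approx(failure_rate, test_interval):
    """First-order approximation: lambda * tau / 2."""
    return failure_rate * test_interval / 2.0


def unavailability_exact(failure_rate, test_interval):
    """Average of 1 - exp(-lambda * t) over one test interval."""
    x = failure_rate * test_interval
    return 1.0 - (1.0 - math.exp(-x)) / x


failure_rate = 2e-5     # failures per hour (hypothetical)
test_interval = 4380.0  # hours between tests (hypothetical)

approx = unavailability_approx(failure_rate, test_interval)
exact = unavailability_exact(failure_rate, test_interval)

# Link to level 2: if the component is a safety barrier in an event tree,
# the frequency of the branch "barrier fails on demand" is the demand
# frequency multiplied by the unavailability.
demand_frequency = 0.5  # demands per year (hypothetical)
barrier_failure_frequency = demand_frequency * approx
print(approx, exact, barrier_failure_frequency)
```

The approximation is slightly conservative and is close to the averaged value as long as λτ is small, so halving the test interval roughly halves the barrier unavailability.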
In Figure 18.2 we present a model for integrating maintenance activities and
risk analysis, taken from Apeland and Aven (2000), which also shows the two
levels 1 and 2 mentioned above. On the low system level we have maintenance,
component and operating characteristics, describing alternative maintenance
actions and strategies, alternative components available and relevant operating

Predictions concerning alternative maintenance strategies’ effect on the main
objectives are normally subject to uncertainty, and in the model we apply risk
analyses for expressing this uncertainty. In risk analyses we evaluate the effect of
different low system level alternatives on the maintenance performance and
component performance, for example described through time to failures and test
The system performance describes how the maintenance and component
performance affect systems on different levels, for example through resulting
production capacity, quality, availability and reliability, and through occurrence of
accidents and other undesirable events.
On the high system level we have the organization’s main objectives. In the
figure we refer to indices describing risk closely linked to the main objectives as
system attributes. One example could be the PLL value. Since the main objectives
should include elements relating to humans, the environment and assets/financial
interests, the risk indices would normally relate to each of these categories.
Applying this model will produce risk results for each relevant low system
level alternative, and this forms a basis for deciding which maintenance
alternative to apply.
To be able to quantify the effect of the performed maintenance actions on the
organization’s main objectives, the following issues have to be discussed:
• Which system attributes should be applied for describing performance
related to the main objectives?
• Which indices should be applied for describing low and intermediate
analysis level performance?
• How should risk analyses be applied for describing the relationship between
high and low system level elements, how should engineering judgments be
integrated into the analyses and how should uncertainty be expressed?
We return to these issues in Sections 18.4 and 18.5.
Risk-based inspection is an example of a risk informed approach (Faber 2002).
Here risk analysis principles are used to manage inspection programs for plant
equipment. The need for inspections and the level of inspections are determined by
reference to the risks, for example described by risk matrices and expected cost
figures, together with other relevant information.

[Figure 18.2 (diagram not reproduced): from top to bottom, the high system level
(main objectives), the high analysis level (system attributes), an intermediate
analysis level (system performance), the low analysis level (maintenance
performance, component performance) and the low system level (maintenance,
operating and component characteristics). Risk analysis, supported by historical
data and suitable models, links the levels.]

Figure 18.2. Model showing the relationship between maintenance efforts and risk
(Apeland and Aven 2000)

Traditionally, risk analyses using FTA and ETA have not had the level of detail
that is necessary to support many decisions related to maintenance. However, recent
developments within risk analysis allow for more detailed analysis taking into
account risk influencing factors, for example maintenance activities. In Section 18.4
we will look more closely at this type of risk analysis and show how maintenance
activities can be incorporated. Here we summarise the basic features of the method,
using a cause analysis based example as an illustration:

1. Identify top events A that summarise essential barrier performance. An
example is ‘ignition’ or ‘avoid ignition’ given a specific leakage scenario.
The event A must be precisely defined – no ambiguity can exist.
2. Establish a deterministic model that links A and events Bi and quantities Xi
on a more detailed level. A fault tree is an example of such a model.
3. Specify a set of operational and management factors Fi that could influence
the performance of the barriers, and which have not been included in the
fault tree model. Examples of such factors are the quality of the maintenance
work, the level of competence and the adequacy of organisation.
4. Specify probabilities P(Bi | F), where F is the vector of the Fi.
5. Use probability calculus to obtain P(A | F).
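Steps 2 to 5 can be sketched for a deliberately small case. The fault tree structure A = B1 or (B2 and B3), the single factor F (quality of the maintenance work, with states "good" and "poor"), the conditional probabilities and the assumption that the Bi are independent given F are all hypothetical and chosen only to show the calculation.

```python
# Hypothetical conditional probabilities P(Bi | F) for two states of one factor F.
P_B_GIVEN_F = {
    "good": {"B1": 0.01, "B2": 0.05, "B3": 0.10},
    "poor": {"B1": 0.03, "B2": 0.15, "B3": 0.20},
}


def p_top_event(p):
    """P(A | F) for the assumed fault tree A = B1 OR (B2 AND B3),
    with the basic events independent given F."""
    p_and = p["B2"] * p["B3"]
    return 1.0 - (1.0 - p["B1"]) * (1.0 - p_and)


p_a_given_f = {state: p_top_event(p) for state, p in P_B_GIVEN_F.items()}
print(p_a_given_f)
```

Comparing P(A | F) across the states of F shows how a change in a maintenance related factor propagates to the top event probability.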

To carry out such an analysis there are a number of challenges, of which the
following are some of the more important:
• Determine which F factors should be included in the fault tree. The F
factors are fixed, meaning that the probability assignments are conditioned
on these factors. If some of the F factors are to be considered unknown to
the analyst, these factors need to be included in the fault tree, or the factors
should be divided into two categories, reflecting unknown factors on the
one hand and some given factors on the other. Such a distinction is made in
the SAM-method (Pate-Cornell and Murphy 1996).
• Find adequate procedures for specifying the probabilities P(Bi|F). These
procedures need to be based on models and methods used for barrier per-
formance analyses, such as human reliability analysis.
We refer to Section 18.4. The above analysis provides decision support by
describing the effect of maintenance efforts on risk. To make a decision, costs and
other aspects also need to be considered, and an important issue is then how this
should be done. A standard approach is the cost-benefit analysis based on
computation of the expected net present value. We will discuss this issue in Section 18.5.
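A minimal sketch of an expected net present value calculation is given below. The cash flows, the discount rate and the function name are hypothetical; the example assumes expected yearly cash flows discounted at a single (possibly risk adjusted) rate, with the investment made at time 0.

```python
def expected_npv(expected_cash_flows, discount_rate):
    """Expected net present value; expected_cash_flows[t] is the expected
    cash flow at the end of year t, with t = 0 being the present."""
    return sum(
        cash_flow / (1.0 + discount_rate) ** t
        for t, cash_flow in enumerate(expected_cash_flows)
    )


# Hypothetical maintenance improvement: cost 100 now, expected savings of
# 60 in each of the following two years, discount rate 10 %.
npv = expected_npv([-100.0, 60.0, 60.0], 0.10)
print(round(npv, 2))
```

A positive value indicates that the expected discounted savings exceed the cost, but it does not by itself reflect the uncertainty in the outcomes.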

18.4 A Case
In this section we present a risk analysis incorporating operational and main-
tenance factors. The presentation is based on Sklet et al. (2005), and is referred to
as the BORA (barrier and operational risk analysis) approach. The approach is
inspired by the I-Risk method (Papazoglou et al. 2003). The case relates to an
offshore installation, and releases of hydrocarbons.
The BORA approach consists of the following steps:

1. Development of a basic risk model.

2. Assignment of industry average frequencies/probabilities of initiating events
and basic events.
3. Identification of risk influencing factors (RIFs) and development of risk in-
fluence diagrams.
4. Assessment of the status of RIFs.
5. Calculation of installation specific frequencies/probabilities.
6. Calculation of installation specific risk, incorporating the effect of technical
systems, technical conditions, human factors, operational conditions, and
organizational factors.
446 T. Aven

18.4.1 Development of a Basic Risk Model

The basic building blocks of the BORA model are barrier block diagrams, event
trees, fault trees, and influence diagrams. Barrier block diagrams are used to
illustrate the event scenarios and the effect of barrier systems on the event
sequences and consist of initiating events, barriers aimed to influence the event
sequence in a desired direction, and possible outcomes of the event sequence.
Event trees are used in the quantitative analysis of the scenarios. The performance
of the safety barriers is analyzed using fault trees. Influence diagrams are used to
analyze how the RIFs affect the initiating events in the event trees and the basic
events in the fault trees.
This case restricts attention to modeling of the containment function (“prevent
release of hydrocarbons”). For this function a number of release scenarios have
been modeled by use of barrier block diagrams. Each barrier block diagram com-
prises the following:
• An initiating event, i.e. a deviation from the normal situation which may
cause a release of hydrocarbons.
• Barrier systems aimed to prevent release of hydrocarbons.
• The possible outcomes of the event sequence, which depend upon the
successful operation of the barrier system(s).
The barrier block diagram for the release scenario “Release due to valve(s) in
wrong position after maintenance” is illustrated in Figure 18.3.

[Figure 18.3: barrier block diagram. Initiating event: valve(s) in wrong position
after maintenance. Barrier functions: detection of valve(s) in wrong position
(self control/checklists (isolation plan); 3rd party control of work) and detection
of release prior to normal production (leak test). End events: "safe state"
(failure revealed) or release of hydrocarbons.]

Figure 18.3. Barrier block diagram for one release scenario

As seen in Figure 18.3, several of the barriers are non-physical by nature, thus
requiring human and operational factors to be included in the risk model.
In order to perform a quantitative risk analysis, frequencies/probabilities of
three main types of events need to be quantified:

1. The frequency of the initiating event, i.e. in the example case: “The fre-
quency of valve in wrong position after maintenance”.
2. The probability of failure of the barrier systems, which for the example case
includes: i) failure to reveal valve(s) in wrong position after maintenance by
self control/use of checklists, ii) failure to reveal valve(s) in wrong position
after maintenance by third party control of work, and iii) failure to detect
potential release during leak test prior to start-up.
3. The (end event) frequency of release of hydrocarbons due to valve in wrong
position (needed for further analysis of the effect of the consequence

The frequency of the initiating event is in our example a function of the annual
number of maintenance operations where valve(s) may be set in wrong position in
hydrocarbon systems, and the probability of setting a valve in wrong position per
maintenance operation.
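
The quantities listed above can be written compactly. The following is only a sketch in notation that is not used in the source: N is the annual number of relevant maintenance operations, p the probability of setting a valve in wrong position per operation, and q_k the failure probability of barrier no. k. The product form assumes that a release requires all three barriers in Figure 18.3 to fail and that the barrier failures are independent given the initiating event.

```latex
f_{\mathrm{IE}} = N \cdot p ,
\qquad
f_{\mathrm{release}} = f_{\mathrm{IE}} \cdot \prod_{k=1}^{3} q_k
```

If the barriers share common causes (for example time pressure affecting both self control and third party control), the product would understate the release frequency.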
In order to determine the probability of failure of barrier systems, the barrier
systems may be further analyzed by use of fault trees as shown in Figure 18.4.

[Figure 18.4: fault tree. Top event: failure to reveal valve(s) in wrong position
after maintenance by self control/use of checklists. Inputs: "Self control not
performed/checklists not used" and "Operator fails to detect a valve in wrong
position by self control/use of checklists". The first input is developed into the
basic events A11 "Use of self control/checklists not specified in program" and
A12 "Activity specified, but not performed".]

Figure 18.4. Fault tree for failure of one barrier

Corresponding analysis may be performed for all barriers for all the identified
release scenarios. For further illustration of the quantification methodology in the
BORA project, we consider the initiating event and the basic events shown in
Figures 18.3 and 18.4:
• Valve(s) in wrong position after maintenance that may cause release (the
initiating event).
• Use of self control/checklists not specified in program (basic event A11).
• Use of self control/checklists specified, but not performed (basic event A12).
• The operator fails to detect valve(s) in wrong position by self control/use of
checklists (basic event A13).

18.4.2 Assignment of Average Frequencies/Probabilities

The first step in the quantification process is to assign industry average frequencies
and probabilities for all the initiating events in the event trees and basic events in
the fault trees.
Generic data may be found in generic databases or company internal databases.
Alternatively, industry average values can be established by use of expert judg-
ment. For our example case, Table 18.1 shows the assigned industry average fre-
quencies and probabilities for the initiating events and basic events in Figure 18.4.

Table 18.1. Assigned average frequencies (F) and probabilities (P)

Event description                                                   Assigned value
Annual frequency of valve(s) in wrong position after maintenance    F = 6
that may cause release
Failure to specify self control/use of checklist                    P = 0.1
Failure to perform self control/use of checklist                    P = 0.05
Failure of operator to detect valve(s) in wrong position by self    P = 0.06
control/use of checklist
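
As an illustration of how the values in Table 18.1 could feed the fault tree of Figure 18.4, the sketch below combines the three basic events into a barrier failure probability. It assumes that all gates in the tree are OR gates and that the basic events are independent; neither assumption is stated in the text, and the function and variable names are hypothetical.

```python
from math import prod

# Industry average values taken from Table 18.1.
F_INITIATING = 6.0  # valve(s) in wrong position after maintenance, per year
P_A11 = 0.10        # self control/checklist not specified in program
P_A12 = 0.05        # self control/checklist specified but not performed
P_A13 = 0.06        # operator fails to detect valve(s) in wrong position


def or_gate(probabilities):
    """Probability that at least one event occurs, assuming independence."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Assumption: every gate in the fault tree of Figure 18.4 is an OR gate.
p_not_performed = or_gate([P_A11, P_A12])
p_barrier_fails = or_gate([p_not_performed, P_A13])

# Annual frequency of wrong valve positions that self control does not reveal.
f_unrevealed = F_INITIATING * p_barrier_fails

print(f"P(self control barrier fails) = {p_barrier_fails:.4f}")
print(f"Unrevealed events per year    = {f_unrevealed:.3f}")
```

Under these assumptions the barrier fails with a probability of about 0.20, so roughly 1.2 of the 6 annual events would pass on to the next barrier.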

18.4.3 Qualitative Risk Influence Modeling

RIFs for every initiating event in the event trees and every basic event in the fault
trees need to be identified. An example of an influence diagram for the basic event
“Operator fails to detect a valve in wrong position by self check/checklist” is
shown in Figure 18.5.

[Figure 18.5: influence diagram. Event: area technician fails to detect a valve in
wrong position by self control/use of checklists. RIFs shown: HMI,
maintainability/accessibility, competence of area technician, procedures for self
control, work permit.]

Figure 18.5. Influence diagram for the basic event “Operator fails to detect a valve in wrong
position by self check/checklist”

Table 18.2 shows the RIFs for all the relevant events in our example case.

Table 18.2. Proposed RIFs for basic events in the example case

Event description                     RIFs
Valve in wrong position after         Process complexity
maintenance                           Maintainability/accessibility
                                      HMI (valve labeling and position feedback features)
                                      Time pressure
                                      Competence (of area technician)
                                      Work permit
Self control/use of                   Program for self control
checklists not specified
Self control/use of                   Work practice (regarding use of self control/checklists)
checklists not performed              Time pressure
                                      Work permit
Area technician fails to detect       HMI (valve labeling and position feedback features)
valve(s) in wrong position by         Maintainability/accessibility
self control/use of checklists        Time pressure
                                      Competence (of area technician)
                                      Procedures for self control
                                      Work permit

The next step is the quantification process.

18.4.4 Scoring of RIFs

The first step is to assess the status of the RIFs. Two schemes are being used for
scoring of RIFs.

Scheme 1. Use of results from existing projects like technical condition safety –
TTS (Thomassen and Sørum 2002), the risk level on the Norwegian continental
shelf (PSA 2004), and investigations of incidents. The TTS project is a review
method to map and monitor the technical safety level based on the status of safety
critical elements and safety barriers, and each system is given a score (rating)
according to predefined performance standards. Table 18.3 shows the definition of
the grades.

Table 18.3. Definition of grades in the TTS project

Rating Description of safety level
A Condition is significantly better than the reference level
B Condition is in accordance with the reference level
C Condition is satisfactory, but does not fully comply with the reference level
D Condition is acceptable and within the statutory regulations’ minimum
intended safety level, but deviates significantly from the reference level
E Condition with significant deficiencies as compared with “D”
F Condition is unacceptable

Scheme 2. Expert judgment of status of RIFs on a specific platform. A scoring
scheme for each RIF will be developed as a basis for this assessment. An example
of a scoring scheme is shown in Table 18.4.

Table 18.4. Example of scoring scale for the RIF procedures

Score Grade characteristics for the RIF procedures
A Almost perfect procedures, with checklists, highlighting of important
information, illustrations, etc.
B Procedures better than industry average
C Industry average procedures
D Poorly written procedures and no highlighting
E Procedures incomplete, out-of-date, inaccurate, much cross-referencing, etc.
F No procedures, even though the task demands them

18.4.5 Calculation of Installation Specific Frequencies/Probabilities

The next task is to adjust the industry average probabilities based on the scoring of
the RIFs. Three main aspects are discussed: a) the formulas for calculation of
installation specific frequencies/probabilities, b) assignment of appropriate RIF
scores, and c) weighting of RIFs. The procedure is illustrated by use of numbers
from the example case.

Principles for Adjustment

The following principles for adjustment are proposed. Let Prev(A) be the
“installation specific” probability of the failure event A. The probability Prev(A) is
determined by the following procedure:

$$P_{rev} = P_{ave} \sum_{i=1}^{n} w_i Q_i \tag{18.1}$$

where Pave is the industry average probability, wi is the weight/importance of RIF
no. i for the event, Qi is a measure of the status of RIF no. i, and n is the number of
RIFs. Here

$$\sum_{i=1}^{n} w_i = 1 \tag{18.2}$$

The challenge is now to determine appropriate values for Qi and wi.

Determining Appropriate Values of Qi

To determine the Qis we need to associate a number with each of the scores A–F. This
can be done in many ways, and the proposed scheme is:
• Determine by expert judgment Plow as the lower limit for Prev

• Determine by expert judgment Phigh as the upper limit for Prev
• Then put for i =1,2,…n

$$Q_i = \begin{cases} P_{low}/P_{ave} & \text{if } s_i = A \\ 1 & \text{if } s_i = C \\ P_{high}/P_{ave} & \text{if } s_i = F \end{cases} \tag{18.3}$$

where si denotes the score or status of RIF no i. Hence if the score si is A, and Plow
is 10% of Pave, then Qi is equal to 0.1. And if the score si is F, and Phigh is ten times
higher than Pave, then Qi is equal to 10. If the score si is C, then Qi is equal to 1.
Furthermore, if all scores are C, then Prev = Pave, if all scores are A, then Prev = Plow,
and if all scores are F, then Prev = Phigh.
Note that in this study we use a fixed factor of ten to describe the variations
caused by different scores, from A to F. That is, if all scores are A, Plow is 10% of
Pave, and if all the scores si are F, then Phigh is ten times higher than Pave.
Furthermore, we have adopted the grade scores from the TTS project: A=3,
B=2, C=1, D=0, E= –2 and F= –5. Thus, letting Qi(j) denote the value of
Qi if the score si takes the value j, we have the results shown in Table 18.5.

Table 18.5. Adaptation of scores from the TTS-project

Score si = j   3 (A)   2 (B)   1 (C)   0 (D)   –2 (E)   –5 (F)
Qi(j)          0.10            1                        10

Hence it remains to determine Qi(j) for j = 2, 0 and –2. Using a linear transform
seems natural, and we obtain the following Q values.
For j = 0 and –2 (D and E):
Qi(j) = Qi(–5) + (j + 5)(Qi(1) – Qi(–5))/(1 – (–5)).
And for j = 2 (B):
Qi(j) = Qi(1) + (j – 1)(Qi(3) – Qi(1))/(3 – 1),
which gives the values for Qi as shown in Table 18.6.
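
One way to write the linear transform so that it reproduces the values in Table 18.6 is sketched below. The function interpolates between the neighbouring fixed points (score 3 and score 1 for the better grades, score 1 and score –5 for the worse ones). The function name and the example inputs Pave = 0.01, Plow = 0.001 and Phigh = 0.1 are chosen for illustration; they correspond to the fixed factor of ten used in the study.

```python
def q_value(score, p_ave, p_low, p_high):
    """Map a TTS grade score (3, 2, 1, 0, -2 or -5) to the status measure Q."""
    q_best = p_low / p_ave    # Q for score 3 (grade A)
    q_worst = p_high / p_ave  # Q for score -5 (grade F)
    if score >= 1:
        # Straight line through (1, 1) and (3, q_best).
        return 1.0 + (score - 1) * (q_best - 1.0) / (3 - 1)
    # Straight line through (-5, q_worst) and (1, 1).
    return q_worst + (score + 5) * (1.0 - q_worst) / (1 - (-5))


SCORES = {"A": 3, "B": 2, "C": 1, "D": 0, "E": -2, "F": -5}
q_table = {grade: round(q_value(s, 0.01, 0.001, 0.1), 2)
           for grade, s in SCORES.items()}
print(q_table)
```

With these inputs the grades A to F map to 0.1, 0.55, 1, 2.5, 5.5 and 10.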

Table 18.6. Specified values for Qis

Score si = j   3 (A)   2 (B)   1 (C)   0 (D)   –2 (E)   –5 (F)
Qi(j)          0.10    0.55    1       2.5     5.5      10

Weighting of RIFs

To determine the weights wi, we start by assigning a weight of 10 to the
most important RIF. The other RIFs are then given relative
weights (10 – 8 – 6 – 4 – 2). The idea is to think of relative changes in the
probability given that the score of RIF no. i is changed from A to F. According to
Equation 18.2, normalization is required to ensure that the sum of the wis is
equal to 1.
Calculation Example

An example of the calculation of Prev when Pave = 0.01, Phigh = 0.1, and
Plow = 0.001 is shown in Table 18.7.

Table 18.7. Example – calculation of Prev

RIF no. i   Weight of RIF i   Normalized weight (wi)   Status of RIF i (si)   Qi     wi · Qi
1           4                 0.12                     B                      0.55   0.065
2           6                 0.18                     C                      1      0.176
3           4                 0.12                     E                      5.5    0.647
4           6                 0.18                     D                      2.5    0.441
5           10                0.29                     C                      1      0.294
6           4                 0.12                     D                      2.5    0.294
Sum         34                1.0                      –                      –      1.918

By use of (18.1), Prev is equal to (Pave x 1.918). In our example case, the RIF
analysis gave an increase of the probability of occurrence of the basic event by a
factor of 1.9 (from Pave = 0.01 to Prev = 0.019).
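
The calculation in Table 18.7 can be reproduced with a few lines of code. The sketch below applies Equation 18.1 with weights normalized as required by Equation 18.2; the weights, grades and Q values are those of Tables 18.6 and 18.7, while the function and variable names are illustrative only.

```python
P_AVE = 0.01

# Q values per grade, as in Table 18.6.
Q_BY_GRADE = {"A": 0.10, "B": 0.55, "C": 1.0, "D": 2.5, "E": 5.5, "F": 10.0}

# (relative weight, grade) for RIF no. 1 to 6, as in Table 18.7.
rifs = [(4, "B"), (6, "C"), (4, "E"), (6, "D"), (10, "C"), (4, "D")]


def revised_probability(p_ave, rifs, q_by_grade):
    """Return (P_rev, adjustment factor) according to Equation 18.1."""
    total_weight = sum(weight for weight, _ in rifs)
    # Dividing by the total weight normalizes the weights so they sum to 1.
    factor = sum((weight / total_weight) * q_by_grade[grade]
                 for weight, grade in rifs)
    return p_ave * factor, factor


p_rev, factor = revised_probability(P_AVE, rifs, Q_BY_GRADE)
print(f"adjustment factor = {factor:.3f}")
print(f"P_rev             = {p_rev:.4f}")
```

The factor evaluates to about 1.918 and Prev to about 0.019, in agreement with the table.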

18.4.6 Recalculation of the Installation Specific Risk

A revised value for the installation specific risk may be calculated by use of the
platform specific data (Prev) as input data in the risk model (event trees/fault trees)
described above.

18.4.7 Remarks

We refer to Sklet et al. (2005) for a detailed discussion of this approach, and
relevant references for similar methods.
Compared to a traditional QRA model, the BORA approach is a more detailed
method and includes considerably more risk influencing factors, which gives more
detailed information on the factors contributing to the total risk, i.e. a more detailed
risk picture. The analysis allows one to study the effect of maintenance efforts on
risk, and thus provides support for maintenance decisions. The risk analysis can be
used to identify the critical factors, as well as to express the effect of risk reducing
measures.

18.5 Discussion of Critical Issues

In maintenance applications, it is common to define risk as the expected loss, i.e.
risk is equal to the probability of failure multiplied by the consequence of failure,
see e.g. Khan and Haddara (2003), or in probabilistic terms, E[X], where X is the
possible consequences measured for example in fatalities or economic values. It
seems to be a common understanding among many risk and maintenance analysts
that the use of expected values is the appropriate criterion for determining the best
policies. The justification is the statistical property of a mean. If we consider a
large set of similar activities and Xi is the consequence of the i-th activity, then the
law of large numbers says that under certain conditions the mean of the Xis is
approximately equal to E[Xi]. Portfolio theory also supports the use of
expected values; see e.g. Abrahamsen et al. (2005).
The use of traditional cost-benefit analyses to support decision making is based
on the same type of logic. Cost-benefit analysis means that we assign monetary
values to all relevant attributes, including costs and safety, and summarise the
performance of an alternative by the expected net present value, E[NPV]. The main
principle in the transformation of goods into monetary values is to find the
maximum amount society is willing to pay to obtain an improved performance.
Use of cost-benefit analysis is seen as a tool for obtaining efficient allocation of the
resources, by identifying which potential actions are worth undertaking and in what
fashion. By adopting the cost-benefit method the total welfare is optimised. This is
the rationale for the approach. Although cost-benefit analysis was originally
developed for the evaluation of public policy issues, the analysis is also used in
other contexts, in particular for evaluating projects and activities in firms. The
same principles apply, but using values reflecting the decision maker’s benefits and
costs, and the decision maker’s willingness to pay.
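
For readers who want to see the E[NPV] criterion in operation, the following sketch discounts a stream of expected yearly cash flows to the present. The discounting formula is the standard one; the cash flows and the 8% discount rate are hypothetical numbers, not taken from the chapter.

```python
def expected_npv(expected_cash_flows, discount_rate):
    """Sum of expected yearly cash flows discounted to year 0.

    expected_cash_flows[t] is the expected net cash flow in year t,
    with t = 0 for the present.
    """
    return sum(cash_flow / (1.0 + discount_rate) ** year
               for year, cash_flow in enumerate(expected_cash_flows))


# Hypothetical maintenance alternative: invest 100 now and expect a net
# benefit of 30 per year (including monetised safety gains) for five years.
flows = [-100.0] + [30.0] * 5
print(round(expected_npv(flows, 0.08), 2))
```

Under the expected value criterion, an alternative with positive E[NPV] would be judged worth undertaking; the discussion that follows explains why this single number is not a sufficient basis on its own.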
However, risk is more than expected values. The most common definition of
risk in the engineering community is that risk is the combination of consequences
and probability, i.e. the combination (X, P), where P refers to probability; see e.g.
ISO (2002). We extend this definition by using the pair (X, U), where U refers to
uncertainty. Probability is a way of expressing the uncertainties. Following these
perspectives on risk, there is a need to see beyond the expected values. The argu-
ments can be summarised as follows.
What we search for is desirable outcomes X, for example no accidents and high
profit. In practice we have a finite number of projects, and the mean numbers based
on these projects are not the same as the expected value. An accident could result
in losses that are significant also in a corporate perspective – the standard deviation
of the project loss could be significant relative to the total cash flow of the firm.
And since the uncertainties in the consequences are large, the assumptions and
suppositions made in the calculation of the expected value may influence the
results to a large extent. The assessments made should be seen as considerations
based on relevant information, but there could be different assessments, different
views and different perspectives on the uncertainties. This applies in particular to
assigned, small probabilities of rare events.
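
The gap between the mean over a finite number of projects and the expected value can be illustrated numerically. The sketch below uses a deliberately simple, hypothetical model that is not part of the source: n independent projects, each with probability p of one accident causing a fixed loss. It returns the expected loss per project and the standard deviation of the average loss over the n projects.

```python
from math import sqrt


def mean_loss_spread(n_projects, p_accident, loss):
    """Expected value and standard deviation of the average loss per project.

    Assumes independent projects, each with at most one accident of fixed size.
    """
    expected = p_accident * loss
    std_of_mean = loss * sqrt(p_accident * (1.0 - p_accident) / n_projects)
    return expected, std_of_mean


for n in (10, 1000, 100000):
    expected, spread = mean_loss_spread(n, p_accident=0.001, loss=1000.0)
    print(n, expected, round(spread, 3))
```

With ten projects the standard deviation of the average loss is about ten times the expected loss, so the expected value says little about the outcome actually experienced; only for a very large portfolio does the average settle near the expected value.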
A complicating factor is that safety and risk involve the balance between
different attributes, including lives and money. The above expected value approach,
for example based on cost-benefit analyses, is based on one being able to transform
all values to one unit, the economic value. And from a business perspective, firms
may argue that this is the only relevant value. All relevant values should be trans-
formed to this unit. This means that the expected costs of accidents and lives should
be incorporated in the evaluations.
But what is the economic value of a life? For most human beings it is infinite;
most people would not be willing to give their life for a certain amount of
money. We say that a life has a value in itself. But of course, an individual may
accept a risk for certain money or other benefits. And for the firm, this is the way
of thinking – the balance of costs and risk. The challenge is however to perform
this balance. What are reasonable numbers for the firm to use for valuing that a life
has a value in itself? Obviously there are no correct answers, as it is a managerial
and strategic issue. High values may be used if it can be justified that this would
produce high performance levels, on both safety and production.
Consequently, uncertainty needs to be considered, beyond the expected values,
which means that the principles of robustness and caution (precaution) have a role
to play. Risk-averse behaviour is often the result. The point is that we put more
weight on possible negative outcomes than the expected values support. Many
firms seem in principle to be in favour of a risk neutral strategy for guiding their
decisions, but in practice it turns out that they are often risk averse. The justifica-
tion is partly based on the above arguments. In the case with a large accident, the
possible total consequences could be rather extreme – the total loss for the firm in a
short and long term perspective is likely to be high due to loss of production,
penalties, loss of reputation, changes in the regulation regimes, etc. The overall
loss is difficult to quantify – the uncertainties are large – and it is seldom done in
practice, but the overall conclusion is that investments in safety are required. The
expected value is not the only basis for making this conclusion. We apply a
cautionary principle, expressing that in the face of uncertainty, caution should be a
ruling principle. For example, in a process plant, major hydrocarbon leaks might
occur, requiring investments in various safety systems and barriers to reduce the
possible consequences – we are cautious. Uncertainties in phenomena and pro-
cesses justify investments in safety.
Thus to conclude on maintenance alternatives, we need an approach which
provides decision support beyond expected values. We recommend an assessment
process following a structure as summarized in the following (Aven and Vinnem
For a specified alternative, say A, we assess the consequences or effects of this
alternative seen in relation to the defined attributes (safety, costs, reputation, etc.).
Hence we first need to identify the relevant attributes (X1, X2, …) and then assess
the consequences of the alternative for these attributes. These assessments could
involve qualitative or quantitative analysis. Regardless of the level of quantifica-
tion, the assessments need to consider both what the expected consequences are, as
well as uncertainties related to the possible consequences. Often the uncertainties
could be large. In line with the adopted perspective on risk, we recommend a struc-
ture for the assessment according to the following scheme:

1. Identify the relevant attributes (safety, costs, reputation, alignment with
main concerns, ...).
2. What are the assigned expected consequences, i.e. E[Xi] given the available
knowledge and assumptions?
3. Are there special features of the possible consequences? In addition to
assessing the consequences on the quantities Xi, some aspects of the possible
consequences might need special attention. Examples are the temporal
extension and aspects of the consequences that could cause social
mobilization, i.e. violation of individual, social or cultural interests and
values generating social conflicts and psychological reactions by individuals
and groups who feel afflicted by the risk consequences. A system based on
the scheme developed by Renn and Klinke (2002) is recommended; see
Sandøy et al. (2005).
4. Are there large uncertainties related to the underlying phenomena, and do
experts have different views on critical aspects? The aim is to identify factors
that could lead to consequences Xi far from the expected consequences E[Xi].
A system for describing and characterising the associated uncertainties is
outlined in Sandøy et al. (2005). This system reflects features such as the
current knowledge and understanding about the underlying phenomena and
the systems being studied, the complexity of technology, the level of
predictability, the experts’ competence, and the vulnerability of the system. If a
quantitative analysis is performed, the uncertainties are expressed by
probability distributions.
5. The level of manageability during project execution – to what extent is it
possible to control and reduce the uncertainties, and obtain desired out-
comes? The expected values and the probabilistic assessments performed in
the risk analyses provide predictions for the future, but some risks are more
manageable than others, meaning that the potential for reducing the risk is
larger for some risks compared to others. By proper uncertainty and safety
management, we seek to obtain desirable consequences. This leads to
considerations on for example how to run processes reducing risks (un-
certainties) and how to deal with human and organisational factors and
obtain a good safety culture.

Hence for each alternative and attribute we may have information covering the
following points:
• Predictions of attribute (e.g. zero fatalities)
• Expected value (e.g. 0.1 fatalities)
• Probability distribution (e.g. expressing a probability of a “major accident”)
• Risk description on a “lower level” (e.g. prediction of number of leaks,
expected number of leaks, etc.)
• Aspects of the consequences
• Uncertainty factors
• Manageability factors
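
The information listed above could, for instance, be held in a simple record per alternative and attribute. The sketch below is one possible layout; the class name, field names and most example values are hypothetical, and only the examples "zero fatalities" and "0.1 fatalities" come from the list.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AttributeAssessment:
    """Risk description for one attribute of one maintenance alternative."""
    attribute: str
    prediction: Optional[float] = None       # e.g. zero fatalities
    expected_value: Optional[float] = None   # e.g. 0.1 fatalities
    major_accident_probability: Optional[float] = None
    lower_level_description: dict = field(default_factory=dict)
    consequence_aspects: list = field(default_factory=list)
    uncertainty_factors: list = field(default_factory=list)
    manageability_factors: list = field(default_factory=list)


# Hypothetical entry for the safety attribute of an alternative "A".
safety = AttributeAssessment(
    attribute="safety (fatalities)",
    prediction=0.0,
    expected_value=0.1,
    lower_level_description={"predicted number of leaks": 2},
    uncertainty_factors=["limited data on barrier performance"],
    manageability_factors=["training of area technicians"],
)
print(safety.attribute, safety.prediction, safety.expected_value)
```

Keeping qualitative items (consequence aspects, uncertainty and manageability factors) next to the numbers reflects the point that the expected value alone is not the whole risk description.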
These assessments provide a basis for comparing alternatives and making a decision.
Compared to standard ways of presenting risk results, this basis is much more
comprehensive. In addition, sensitivity analyses and robustness analyses are to be
performed. Of course, the depth of the analysis will be a function of the decision
situation, the risks involved and the resources to be used. The full risk descriptions
as outlined above would be used only in special situations, requiring a comprehen-
sive decision support basis.
We refer to Aven and Vinnem (2007) for further reflections on the above
issues, and in particular the use of cost-benefit analyses. A key question discussed
is to what extent it is appropriate to adjust the value of a (statistical) life and adjust
the discount rate to take into account the uncertainties.
In maintenance applications there is often reference to the use of risk acceptance
criteria, as upper limits of risk acceptance expressed for example by the PLL or
FAR values; see e.g. Khan and Haddara (2003). We are sceptical of the prevailing
thinking concerning risk acceptance criteria; see Aven and Vinnem (2005, 2007).
We all agree on the need for considering risk as a basis for making decisions under
uncertainty. Such considerations must however be seen in relation to other concerns,
costs and benefits. Care should be shown when using pre-determined risk
acceptance criteria in order to obtain good arrangements, plans and measures, as
they easily lead to the wrong focus: risk analysis is used to verify that these limits
are met, and there is no drive for risk reduction and safety improvements.
The use of risk acceptance criteria cannot replace managerial review and
judgement. The decision support analyses need to be evaluated in the light of the
premises, assumptions and limitations of these analyses. The analyses are based on
background information that must be reviewed together with the results of the
analyses. Risk analysis provides decision support, not hard decisions. We refer to
Aven and Vinnem (2007).

18.6 Conclusions
This chapter has presented and discussed the use of risk analysis for the selection
and prioritisation of maintenance activities. The chapter has reviewed some critical
aspects of risk analysis important for the successful implementation of such analy-
ses in maintenance. This relates to risk descriptions and categorisations, uncer-
tainty assessments, risk acceptance and risk informed decision making, as well as
selection of appropriate methods and tools.
In the risk analysis, the maintenance efforts are incorporated by:
• Showing the relation between maintenance effort and component performance
• Showing the relation between component performance and overall risk in–
An example is shown in Section 18.4. This example demonstrates some of the
problems related to incorporating the maintenance efforts into the risk analysis.
The analysis needs to be rather detailed to support the decision making. Develop-
ing suitable methodology is not straightforward, for example on how to assign
installation specific probabilities, based on the information available (including
reliability and maintenance data). Further research is undoubtedly required to give
confidence in the methods to be used. A detailed analysis requires substantial input
data, and the data must be relevant. Such analyses cannot be performed without
extensive use of expert judgment. However, expert judgment is not to be seen as
something negative. The risk analysis is a tool for summarising the information
available (including uncertainties), and expert judgment constitutes an important
part of this information.

18.7 References
Abrahamsen, E.B., Aven, T., Vinnem, J.E. and Wiencke, H.S. (2005) Safety Management
and the use of expected values. Risk, Decision and Policy, 9, 347–358.
Andersen, R.T. and Neri, L. (1990) Reliability-Centred Maintenance. Management and
Engineering Methods, Elsevier Applied Sciences, London.
Apeland, S. and Aven, T. (2000) Risk based maintenance optimization: foundational issues.
Reliability Engineering and System Safety, 67, 285–292.
Aven, T. (1992) Reliability and Risk Analysis, Elsevier Applied Science, London.
Aven, T. and Jensen, U. (1999) Stochastic Models in Reliability, Springer-Verlag, New
York.
Aven, T. and Kristensen, V. (2005) Perspectives on risk – Review and discussion of the
basis for establishing a unified and holistic approach. Reliability Engineering and
System Safety, 90, 1–14.
Aven, T. and Vinnem, J.E. (2005) On the use of risk acceptance criteria in the offshore oil
and gas industry. Reliability Engineering and System Safety, 90, 15–24.
Aven, T., Vinnem, J.E. and Wiencke, H.S. (2007) A decision framework for risk
management. Reliability Engineering and System Safety, 92, 433–448.
Aven, T. and Vinnem, J.E. (2007) Risk Management, with Applications from the Offshore
Oil and Gas Industry, Springer Verlag, New York.
Brewer, H.D. and Canady, K.S. (1999) Probabilistic safety assessment support for the
maintenance rule at Duke Power Company. Reliability Engineering and System Safety,
63, 243–249.
Cepin, M. (2002) Optimization of safety equipment outages improves safety. Reliability
Engineering and System Safety, 77, 71–80.
Clarotti, C.A., Lannoy, A. and Procaccia, H. (1997) Probabilistic risk analysis of ageing
components which fail on demand; A Bayesian model: Application to maintenance
optimization of diesel engine linings. In Proceedings of Ageing of materials and
methods for the assessment of lifetimes of engineering plant, Cape Town, pp. 85–94.
Dekker, R. (1996) Applications of maintenance optimization models: A review and analysis.
Reliability Engineering and System Safety, 51, 229–240.
Faber, M.H. (2002) Risk-Based Inspection: An Introduction, Structural Engineering
International, 12, 186–194.
Haimes, Y.Y. (1998) Risk modeling, Assessment, and Management, Wiley, New York.
ISO (2002) Risk management vocabulary. ISO/IEC Guide 73.
Khan, F.I. and Haddara, M.M. (2003) Risk-based maintenance (RBM): a quantitative
approach for maintenance/inspection scheduling and planning, Journal of Loss
Prevention, 16, 561–573.
Knoll, A., Samanta, P.K. and Vesely, W.E. (1996) Risk based optimization of the Frequency
of EDG on-line maintenance at Hope Creek. In Proceedings of Probabilistic Safety
Assessment, Park City, pp. 378–384.
Modarres, M. (1993) What Every Engineer should Know about Reliability and Risk
Analysis, Marcel Dekker, New York.
van Manen, S.E., Janssen, M.P. and van den Bunt, B. (1997) Probability-based optimization
of maintenance of the River Maas Weir at Lith. In Proceedings of European Safety and
Reliability conference (ESREL), Lisbon, pp. 1741–1748.
Papazoglou, I.A., Bellamy, L.J., Hale, A.R., Aneziris, O.N., Post, J.G. and Oh, J.I.H. (2003)
I-Risk: development of an integrated technical and management risk methodology for
chemical installations. Journal of Loss Prevention in the Process Industries, 16, 575–591.
Paté-Cornell, E.M. and Murphy, D.M. (1996) Human and management factors in
probabilistic risk analysis: the SAM approach and observations from recent applications.
Reliability Engineering and System Safety, 53, 115–126.
Perryman, L.J., Foster, N.A. and Nicholls, D.R. (1995) Using PRA in support of
maintenance optimization, International Journal of Pressure Vessels & Piping, 61, 593–
Podofillini, L., Zio, E. and Vatn, J. (2006) Risk-informed optimisation of railway tracks
inspection and maintenance procedures, Reliability Engineering and System Safety, 91,
PSA (2004) Trends in Risk Levels on the Norwegian Continental Shelf, Main report Phase 4
– 2003 (in Norwegian). The Petroleum Safety Authority Norway, Stavanger, Norway.
Rausand, M. and Høyland, A. (2003) System Reliability Theory, Wiley, New York.
Renn, O. and Klinke, A. (2002) A New approach to risk evaluation and management: Risk-
based precaution-based and discourse-based strategies, Risk Analysis, 22, 1071–1094.
Sandøy, M., Aven, T. and Ford, D. (2005) On integrating risk perspectives in project
management. Risk Management: an International Journal, 7, 7–21.
Sklet, S., Hauge, S., Aven, T. and Vinnem, J.E. (2005) Incorporating human and
organizational factors in risk analysis for offshore installations. Proceedings ESREL
2005, pp. 1839–1847.
Thomassen, O. and Sørum, M. (2002) Mapping and monitoring the safety level. SPE 73923,
Society of Petroleum Engineers.
Vatn, J., Hokstad, P. and Bodsberg, L. (1996) An overall model for maintenance
optimization. Reliability Engineering and System Safety, 51, 241–257.

Maintenance Performance Measurement (MPM)


Uday Kumar and Aditya Parida

19.1 Introduction
Maintenance is an important support function for business processes with
significant investment in physical assets, and it plays an important role in achieving
organizational goals. However, the cost of maintenance and downtime is too high
for many industries. For example, the cost of maintenance in a highly mechanized
mine can be 40–60% of the operating cost (Campbell 1995); maintenance
spending in the UK’s manufacturing industry ranges from 12 to 23% of the total
factory operating costs (Cross 1988); and, according to a study in Germany, the annual
spending on maintenance in Europe is around 1500 billion euros (Altmannshopfer
2006). These figures have motivated senior managers and maintenance engineers to
measure the contribution of maintenance towards total business goals, for example in
terms of return on investment.
Prior to the 1940s, maintenance was considered as a necessary evil and the
general attitude to maintenance was “It costs what it costs.” During 1950–80, with
the advent of techniques like preventive maintenance and condition monitoring, the
perception changed to “maintenance is an important support function and it can be
planned and controlled.” Today maintenance is considered as an integral part of the
business process and it is perceived as: “It creates additional value” (Liyanage and
Kumar 2003). The creation of additional value by maintenance is expressed in
terms of increased productivity, better utilisation of plant and system, lower
accident rates and better working environment. With increasing awareness that
maintenance creates additional value in the business process, more and more
companies are treating maintenance as an integral part of the business process, and the
maintenance function has become an essential element of the strategic thinking of
many companies involved in service and manufacturing industry. With this change
in the mindset of senior asset managers and owners, it has become essential to
measure the performance of manufacturing process to understand the tangible and,
if possible, intangible contribution of maintenance towards business goals. How-
ever, without any formal measures of performance, it is difficult to plan, control
and improve the maintenance process. With this, the focus has shifted to measure
the performance of maintenance. Maintenance performance needs to be measured
to evaluate, control and improve the maintenance activities for ensuring achieve-
ment of organizational goals and objectives.
In recent years, maintenance performance measurement (MPM) has received a
great amount of attention from researchers and practitioners due to a paradigm shift
in maintenance. This chapter deals with the broad topic of performance measurement
(PM) and with metrics and measures for MPM, reviews the existing MPM frameworks,
and discusses various issues and challenges associated with the development
and implementation of an MPM system. The outline of the chapter is as follows: an
overview of various PM frameworks and their development are presented in
Section 19.2. Definitions of maintenance performance indicator (MPI), and MPM
system, and their salient features are discussed in Section 19.3. The important
issues associated with the development of MPM system are discussed in Section
19.4, while the MPIs under different criteria are explained in Section 19.5. The
MPM system and the framework are explained in Section 19.6. Some of the MPIs
and MPM system in different industries are discussed in Section 19.7. The final
section concludes the chapter with limitations of the current literature and practice.

19.2 Performance Measurement – An Overview

In the past two decades, performance measurement (PM) has received a great
amount of attention from researchers, practitioners and industry alike.
Andersen and Fagerhaug (2002) have listed reasons for measuring performance:
providing employees with feedback on the work they are performing; providing the
information on which employees and management can base correct decisions; helping
to implement the strategies and policies of an organization; and using PM data to
monitor the performance trend over time.
PM is defined as the process of quantifying the efficiency and effectiveness of
past and future activities. Major issues in this field concern what to measure
and how to measure it in a practically feasible and cost-effective way (Neely
1999). Measurement gives the status of a variable, compares the data with
target or standard data and points out what corrective and preventive actions
should be taken and where they should be applied. Without adequate data, it is
extremely difficult to develop models for supporting the decision-making process
(Wealleans 2000). Today, PM is related to product, operation process,
partnering, stakeholders and production. PM of a process essentially involves
mapping the process, measuring its performance, undertaking root-cause
analysis and benchmarking the performance. PM is a multi-disciplinary activity
as it involves multiple stakeholders. A PM system needs to be integrated,
linking all the perspectives in a balanced manner, and to take a
holistic approach for the entire organization to achieve the stakeholders’ goals at
various levels.

19.2.1 Metrics, Measures and Indicators

Performance measure is the term used when talking about PM in general.
Performance indicators (PIs) are measures that describe how well an operation is
achieving its objectives. A PI of an activity is a ratio of two variables: the output to
the input of that activity. A performance measure can thus be defined as a metric for
quantifying the efficiency and/or effectiveness of past or future activities, whereas
a performance metric is the definition of the scope, content and component parts of
a broadly based performance measure (Neely et al. 2002). The characteristics of
performance measures include relevance, interpretability, timeliness, reliability and
validity (Al-Turki and Duffuaa 2003). A PI is a more specific measurement: it gauges
or indicates performance. PIs are broadly classified as leading or lagging indicators.
Leading indicators are performance drivers: they are used for understanding the
present status, in comparison with a reference indicator level, and for taking
corrective measures to achieve the desired target. A leading indicator is of the
non-financial and statistical type and predicts outcomes fairly and reliably
in advance. At the maintenance department level, condition monitoring indicators
such as noise, vibration, thermographic measurements and particles in oil can serve
as leading indicators.
Lagging indicators are outcome measures and provide a basis for studying deviations
after the completion of the activities. Cost of maintenance and mean time
between failures (MTBF) are examples of lagging indicators. Whereas a PI is
just an indicator of performance, a key performance indicator (KPI) is an aggregation
of various PIs in a logical way. Thus, a KPI is a more strategic and important
indicator of performance (Wireman 1998). The main purpose of a KPI is to pinpoint
possible areas for improvement within an organization.
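As a rough illustration of the definitions above, the following Python sketch computes a PI as an output-to-input ratio and the lagging indicator MTBF from an operating record. The function names and the sample figures are assumptions made for this example only, and MTBF is taken here in its common form as operating time divided by the number of failures.

```python
def performance_indicator(output: float, input_: float) -> float:
    """PI of an activity: the ratio of its output to its input."""
    if input_ <= 0:
        raise ValueError("input must be positive")
    return output / input_


def mtbf(operating_hours: float, failures: int) -> float:
    """Mean time between failures, assumed here to be operating time per failure."""
    if failures <= 0:
        raise ValueError("at least one failure is needed to estimate MTBF")
    return operating_hours / failures


if __name__ == "__main__":
    # Hypothetical figures: 4200 t produced using 120 maintenance man-hours.
    print(performance_indicator(4200.0, 120.0))  # tons per maintenance man-hour
    # Hypothetical record: 4 failures in 700 operating hours.
    print(mtbf(700.0, 4))  # operating hours per failure
```

A leading indicator such as a vibration level would be compared with its reference level in the same way, but before any failure has occurred.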
Until 1980, PM was mostly based on financial measures. Kaplan and Norton
(1992) suggested the balanced scorecard as a more pragmatic and progressive
framework to measure performance in a balanced way. The balanced scorecard
focuses on both tangible and intangible aspects of the business process through
its four perspectives: customers, internal processes, financial, and innovation
and learning. Subsequently, various researchers have developed frameworks
considering non-financial measurements and intangible assets to achieve competitive
advantages (Parida and Kumar 2006). Some studies have shown that companies
using an integrated balanced PM system perform better than those which do not
measure their performance (Kennerly and Neely 2003; Lingle and Schiemann
1996). Some of the major PM frameworks developed by various authors and
researchers are the Du Pont Pyramid (Chandler 1977), the PM matrix (Keegan et al. 1989),
the results and determinants matrix (Fitzgerald et al. 1991), the balanced scorecard (BSC)
(Kaplan and Norton 1992), the SMART pyramid (Lynch and Cross 1991), the integrated
PM framework (Medori and Steeple 2000), the performance prism (Neely et al. 2002),
the BSC of advanced information (Abran and Buglione 2003), and the European
Foundation for Quality Management (EFQM) (Wongrassamee et al. 2003).

19.3 Maintenance Performance Measurement (MPM)

Generally, a maintenance performance measurement (MPM) system forms part of
the organization’s operational system and includes all related maintenance
performance indicators (MPIs) and their interrelationships within the whole
maintenance process. MPM is the process of measuring maintenance performance in order
to know how well the maintenance process is performing and to identify
opportunities for improvement. In an MPM system, data are collected and analyzed, and
relevant information is extracted for timely decision making. MPM is a complex task
involving measurement of varying inputs and multiple outputs of the maintenance
process. One way of measuring performance is to develop PIs and implement
them with the total involvement of the entire organisation. An indicator is a function of
several metrics; when used for measuring maintenance, it is called a maintenance
performance indicator (MPI). MPIs are ratios of two maintenance-related
variables, which need to be defined beforehand, and their values may change with
time. For example, the value of an MPI may change after five years as compared to
the first year. MPIs are the means to measure the performance of a maintenance
process and are used to facilitate the understanding and measurement of past
performance, so that future performance can be predicted and appropriate
decisions made. MPIs can act as an early warning system for the operation and
maintenance process, indicating the present status of the process, so as to enable
evaluation, prediction and corrective action.
The data from measurement tell us the status of the job carried out, what
action is to be taken thereafter and where that action should be
targeted. For example, MPIs could be used for financial reports; for control of the
performance of employees and other resources, as in the costing and appraisal
system; for finding the competitive position within business organizations, as with
customer satisfaction and competitor ranking; for health, safety and environmental
(HSE) rating in production industry; and for finding internal effectiveness, as with the
overall equipment effectiveness (OEE) in the manufacturing and process industry.
The selection of MPIs to follow up the contribution of maintenance is an important
but complex issue. Thus, the structure of the MPIs needs to be considered from
different perspectives of the maintenance process in an integrated manner. A
large number of MPIs are used by different industries today, and they need to be
carefully identified and selected to meet the specific requirements of the organization.
While defining or identifying MPIs, it is important to relate them to both the
process inputs and the process outputs. If this is carried out properly, then MPIs
can (Kumar and Ellingsen 2000):
• Provide or identify a basis for resource allocation and control
• Help to identify problem areas
• Provide individuals and teams with the means to measure their performance
• Provide teams/individuals with the means to measure their contribution to the
business objectives
• Facilitate easy benchmarking of performance
• Provide trends in performance
• Indicate the contribution of maintenance to overall business objectives
Some of the MPIs provide quality information for monitoring operational
safety performance during the implementation phase of the MPM system. These
MPIs are critical for industries where safety aspects play an important role,
especially nuclear power plants. The characteristics of operational safety
performance indicators as applicable to nuclear power plants, which can be applied to
other industries as well, are (IAEA 2000):
• There is a direct relationship between the indicator and safety
• Necessary data are available or capable of being generated
• Indicators are unambiguous, their significance is understood, they can be
expressed in quantitative terms and local action can be taken on the basis of indicators
• They are not susceptible to manipulation, form a manageable set, are meaningful, and
can be validated and integrated into normal operational activities
• They can be linked to the cause of a malfunction, and the accuracy of the data at
each level can be subjected to quality control and verification
The MPIs could be time and target-based, giving a positive or negative
indication. An MPI could be trend-based in some cases. If it is positive or steady,
meaning that everything is working well, then no action may be required to be
undertaken. If it shows a negative trend and has crossed the lower limit of the
target, then the decision is to act immediately. Whenever the value of the MPI falls
within the target limits (as set by the decision maker), then the decision is “wait
and see”. Different types of graphs and figures could be used for indicating the
health state of the technical system using different color codes for “excellent”,
“satisfactory”, “improvement required” and “unsatisfactory performance level”.
There could be other visualization techniques using bar charts or other graphical
tools for monitoring MPIs.
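The decision rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the MPI is one where a higher value is better (for example availability), the trend is approximated crudely as the change between the first and last observation, and any case the text does not cover explicitly (such as a value below the lower limit that is nevertheless improving) is mapped to "wait and see". The names `Decision` and `decide` are hypothetical.

```python
from enum import Enum
from typing import Sequence


class Decision(Enum):
    NO_ACTION = "no action required"
    WAIT_AND_SEE = "wait and see"
    ACT_IMMEDIATELY = "act immediately"


def decide(history: Sequence[float], lower_limit: float) -> Decision:
    """Map a trend-based MPI to a decision (higher values assumed to be better)."""
    if len(history) < 2:
        raise ValueError("need at least two observations to judge a trend")
    latest = history[-1]
    trend = latest - history[0]  # crude trend: change over the window (assumption)
    if trend < 0 and latest < lower_limit:
        return Decision.ACT_IMMEDIATELY  # negative trend and lower limit crossed
    if trend >= 0 and latest >= lower_limit:
        return Decision.NO_ACTION  # positive or steady, still within limits
    return Decision.WAIT_AND_SEE  # all remaining cases (assumption)


if __name__ == "__main__":
    # Hypothetical monthly availability figures (%) against a 94% lower limit.
    for series in ([96.0, 96.2, 96.1], [96.0, 95.1, 94.5], [96.0, 94.8, 93.2]):
        print(series, "->", decide(series, 94.0).value)
```

The three decisions could then be mapped onto the color codes mentioned above when the MPIs are displayed.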
The SMART test developed by the Department of Energy (DOE) can also be used
effectively to describe the five characteristics of an MPI (DOE 2002):

S = Specific; clear and focused to avoid misinterpretation. Should include
measure assumptions and definitions and be easily interpreted.
M = Measurable; can be quantified and compared to other data. It should allow
for meaningful statistical analysis. Avoid “yes/no” measures except in limited
cases, such as start-up or systems-in-place situations.
A = Attainable; achievable, reasonable, and credible under conditions expected.
R = Realistic; fits into the organization’s constraints and is cost-effective.
T = Timely; the indicator should be reflecting the status in real time and on

19.4 Development and Implementation Issues

Today, many companies involved in industrial production measure their maintenance
performance in order to remain competitive in the market. However,
improper implementation and management of a measurement system that aims to use
new measures to reflect new priorities often lead to ineffective results. This is due
to the failure of the organization to discard measures reflecting old priorities,
uncorrelated and inconsistent indicators and inadequate measurement techniques
(Meyer and Gupta 1994). Understanding the need for MPM in the business for
effective management of maintenance and its work process is critical for the
development and successful implementation of the MPM system. In developing the
MPM system, maintenance performance issues need to be considered,
including the complexity of tasks, the multiple inputs and outputs of the maintenance
process and stakeholders’ continuously changing requirements. Some of the important
issues associated with the development of an MPM system are as follows.

19.4.1 Measuring Values Created by the Maintenance

The most important issue in developing an MPM system is to measure the value
created by the maintenance process. As a manager, one must know that what is being
done is what is needed by the business process; if the maintenance output is
not creating any value for the business, it needs to be restructured.
Examples of relevant measures are ratios based on the investment made and trends in
cost per ton.

19.4.2 Justifying Investment

The second issue in developing an MPM system is to justify the organization’s
investment in the maintenance organisation: not so much whether one is
doing the right thing, but whether the investment being made is producing a
return on the resources that are being consumed.

19.4.3 Revising Resource Allocations

The third issue in developing an MPM system is to determine whether additional
investment is required in maintenance and to justify it. Alternatively, such
measurement of activities also permits one to determine the need for change or
how to carry out the current activities more effectively by using the resources
allocated.

19.4.4 Health, Safety and Environmental (HSE) Issues

The fourth issue is to understand the contribution of maintenance towards HSE
issues. Bad maintenance performance can lead to accidents (a safety issue) and
pollution (health hazards and environmental issues), besides encouraging an
unhealthy work culture and environment.

19.4.5 Adapting to New Trends in Operation and Maintenance Strategy

New operating and maintenance strategies are adopted and followed by industries
in quick response to market demand, for the reduction of production loss and
process waste. MPM measures the value created by the maintenance. Some of the
important questions related to strategy are as follows:
• How does one assess and respond to stakeholders’ (internal and external)
• How does one translate the corporate goal and strategy into targets and
goals at the operational level (converting a subjective vision into objective
• How does one integrate the results and outcomes from the operational level
to develop lead indicators at the corporate level (converting objective out-
comes into strategic KPIs and linking them to strategic goals and targets)?
• How to support innovation and training for the employees to facilitate an
MPM oriented culture?

19.4.6 Measuring What is Easy to Measure

Most organizations make the mistake of measuring what is easy to measure, rather
than what is required to be measured. Thus, over a period of time, the indicators
fall out of tune with the corporate strategy. Besides, a large amount of undesired
data creates a data overload, and such data are rarely utilised for analysis or decision
making. Therefore, the MPIs need to be identified and selected to meet the specific
requirements of the organization and its related issues.

19.4.7 Organizational Issues

Today organizations are trying to adopt a flat and compact organizational structure,
a virtual work organization, and empowered, self-managing, knowledge management
work teams and work stations. The organisational maintenance issues are to
measure maintenance effectiveness and the resources spent on maintenance. Typically
in an organization, the top level looks for the investment and decides the corporate
strategy, based on which the operation and maintenance strategies are formulated.
Depending on the maintenance strategy, the maintenance program and policies are
defined, which are implemented by the middle level. The operational level undertakes
the actual tasks of performing the activities. The issues pertaining to the
organization are:
• Need for developing a reliable and meaningful MPM system.
• Commitment of the top management for the MPM system.
• Converting the subjective corporate goals to specific targets and MPIs re-
quired to be measured.
• Involvement of the employees in implementation of the MPM system.
• Method and means of these measurements.
• Periodicity (time period) of such measurements.
• Analysis of the collected data, its conversion to information, owner of the
information and its accountability within the organization.
• Effective and efficient communication within and outside the organization
on issues related to information and decision making.

19.5 Framework for MPM System

A conceptual framework explains, either graphically or in narrative form, the main
things to be studied, the key factors or variables and the presumed relationships
between them. Frameworks can be rudimentary, elaborate, theory driven, descriptive
or causal (Miles and Huberman 1994). An MPM framework that links the
multiple criteria of MPIs needs to consider the internal and external stakeholders’
requirements and the different hierarchical levels of the organization. There is
also a need to map the maintenance process and identify the gaps between
maintenance planning and execution, so that the MPIs can take care of these gaps.

19.5.1 Multiple Hierarchical Levels in MPM System

In order to accomplish the top-level objectives of the espoused maintenance
strategy, these objectives need to be cascaded into team and individual goals. The
adoption of fair processes is the key to successful alignment of these goals. It helps
to harness the energy and creativity of committed managers and employees to drive
the desired organizational transformations (Tsang 1998). Murthy et al. (2002)
mentioned that maintenance management needs to be carried out in both strategic
and operational contexts and that the organization is generally structured
into three levels. At each level, the linkage and relationship between maintenance
and operation need to be clearly understood. Defining the measures and making the
actual measurements for monitoring and control constitute an extremely complex task
for large organizations. The complexity of MPM is further increased for multiple
criteria objectives.
In the MPM system, MPIs are considered from the multiple hierarchical levels.
The first hierarchical level could correspond to the corporate or strategic level, the
second to the tactical or managerial level, and the third to the functional/opera-
tional level. Depending on the organizational structure, the hierarchical levels
could be more than three. Three hierarchical levels given in Figure 19.1 (adapted
from Parida and Kumar 2006) are considered for the proposed MPM framework.
The top level is responsible for framing the mission/vision statement, goals,
objectives, which form part of the strategic management. They decide the invest-
ment to be made for the infrastructure, manpower and what will be the conse-
quences or likely return on investment. The detailed activities are not in focus at
this level. The maintenance data at the functional level are aggregated and linked to
tactical or middle level to help the management for analysis and decision making at
strategic or tactical level. The corporate KPIs are cascaded down from strategic to
MPIs at operational level in a top-down manner and the MPIs are aggregated from
operational to strategic level in a bottom-up information flow. The subjectivity
increases as we integrate the objective outcomes from the shop floor into the
organizational goals at a higher level. An illustration of the breaking down of the
corporate goals into objective targets at shop floor level is shown in Figure 19.2.
Similarly, Figure 19.3 exhibits an example of aggregation of MPIs.

Figure 19.1. Linkages between objective outcomes at operational level to strategic level and
breaking down of goals into objective targets

As shown in Figure 19.2, while cascading down the corporate goals of a mining
company with an installed capacity of 0.6 million ton per month, the monthly
production target of 0.51 million ton of iron ore pellets cascades down
to a system availability of 96% at the tactical level, which must be translated into
a maximum allowed planned stop of 20 h per month and unplanned plant stop of 8.8 h
per month. Similarly, MPIs such as planned and unplanned stops need to be
aggregated to a higher level in terms of availability and capacity
utilization. The calculations are as follows:
• Plant capacity = 0.6 million ton per month
• Saleable quantity = 0.51 million ton per month
• Plant capacity is 835 tons per hour
• Goals (tactical): Availability (A) = 96%, Speed (P) = 90%
and Quality (Q) = 99%
• OEE = A × P × Q = 0.96 × 0.90 × 0.99 = 0.85
• Non-availability = 24 × 30 × 0.04 = 28.8 h per month
• Planned stop = 20 h/month and unplanned stop = 8.8 h/month
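
The arithmetic of this example can be retraced with a short Python sketch. It is only a check of the figures quoted above under the same assumption of a 30-day month; the function names are hypothetical, and the 20 h planned stop is taken as given rather than derived. The exact product of the three rates is about 0.855, which the calculation above reports as 0.85.

```python
HOURS_PER_MONTH = 24 * 30  # 30-day month, as assumed in the calculation above


def oee(availability: float, performance: float, quality: float) -> float:
    """Overall equipment effectiveness as the product A x P x Q."""
    return availability * performance * quality


def allowed_downtime_hours(availability_target: float,
                           hours: float = HOURS_PER_MONTH) -> float:
    """Cascade down: total stop hours permitted by an availability target."""
    return hours * (1.0 - availability_target)


def availability_from_stops(planned_h: float, unplanned_h: float,
                            hours: float = HOURS_PER_MONTH) -> float:
    """Aggregate up: availability implied by planned and unplanned stop hours."""
    return 1.0 - (planned_h + unplanned_h) / hours


if __name__ == "__main__":
    capacity_t = 0.6e6  # installed capacity, ton per month
    target_oee = oee(0.96, 0.90, 0.99)
    print(f"OEE target: {target_oee:.3f}")
    print(f"Saleable quantity: {target_oee * capacity_t / 1e6:.2f} million ton")
    downtime = allowed_downtime_hours(0.96)
    print(f"Allowed stops: {downtime:.1f} h per month")
    print(f"Unplanned share after 20 h planned: {downtime - 20.0:.1f} h")
    print(f"Availability from stops: {availability_from_stops(20.0, 8.8):.2f}")
```

The same two functions run in opposite directions: `allowed_downtime_hours` breaks a tactical target down into shop-floor limits, while `availability_from_stops` aggregates recorded stops back into the tactical indicator.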

Figure 19.2. An example of cascading down of corporate goal to operational targets

Figure 19.3. Aggregation of MPIs from operational level to corporate level

Since the actual production and the OEE level have gone down, the
management now has to take remedial measures and appropriate decisions to
achieve the desired level of OEE and production.

19.5.2 Multiple Criteria of MPM System

The objectives of the organizational decision makers are expressed in terms of
different criteria. For example, at the beginning of the twentieth century, financial cost
was the single criterion used by managers. After the 1980s, it was felt by
management that a single criterion was unable to meet all their objectives, and the
concept of multiple criteria evolved. When there are a number of criteria, the
multi-criteria choice problem arises, which is solved by obtaining information
about all the criteria and their relative priorities. For the MPM system, different
MPIs are grouped under different criteria as per the organization’s requirements,
based on the stakeholders’ needs. The multiple criteria of the MPIs can be
considered from a balanced and integrated point of view. Besides the four perspectives
(customer, financial, internal processes, and innovation and learning) of Kaplan and
Norton (1992), three more criteria, namely HSE, employee satisfaction and
maintenance task related, are considered and included in the MPM framework. Some of
the MPIs thus grouped under seven criteria associated with the development of the
MPM framework are selected to improve productivity, quality and safety of the
organization (Parida et al. 2005). The seven criteria considered are discussed
below.

Plant/Equipment Related Indicators

The indicators under this criterion measure the performance pertaining to the plant
and equipment of the organization. These MPIs provide relevant information to
the management at different hierarchical levels for appropriate decision making.
Some of the MPIs under this criterion are:
• Availability. Availability is the percentage of time for which the plant is
available for manufacturing/production. This is calculated as the
ratio of the mean time to failure (MTTF) to the total time, i.e. MTTF plus
the mean time to repair (MTTR).
• Performance (output per hour). This MPI indicates the speed of production
and is expressed as a percentage of the production/performance speed.
• Quality. This MPI refers to the quality of the product/service. This is the
percentage of good parts produced out of the total number of parts produced.
The overall equipment effectiveness (OEE) is one of the main
benchmarks or key performance indicators for the total process of a company.
The OEE is the product of the equipment availability, performance
and quality.
• Number of minor and major stops. This indicator is the number of stops,
either minor or major. Stoppages can also be quantified in time (hours and
minutes).
• Down-time for the number of minor and major stops. This is expressed in
hours and minutes for the total number of stops or for each minor and
major stop.
• Rework. Rework due to maintenance lapses (for example, not sharpening
the tools) is expressed in time (hours and minutes), the number of pieces on
which rework has been carried out and the cost of the rework undertaken.

Maintenance Task Related Indicators

MPIs under this criterion pertain to the maintenance tasks carried out. These MPIs
indicate the efficiency and effectiveness of the maintenance department of the
organization. The MPIs are:
• Change over time
• Planned maintenance task (preventive maintenance)
• Unplanned maintenance tasks (corrective maintenance)
• Response time for maintenance

Finance/Cost-related Indicators

The finance or cost-related MPIs are the information most sought after by
management; these measures are valuable in summarizing the readily measurable
economic results of the business. The MPIs under this criterion relate to
maintenance and production costs. Besides, management of the organization can include
other financial and cost-related MPIs or PIs as per their need. Some of the MPIs of
this criterion are:
• Maintenance cost/unit
• Production cost per unit
• Total maintenance cost

Customer Satisfaction

Customer satisfaction is one of the most important criteria for an organization to
focus on. This criterion measures the organization’s performance in satisfying its
customers; the measures are derived from the organizational business strategy. Some of
the MPIs under this criterion are:
• Number of quality complaints
• Low quality returns (number/quantity)
• Customer satisfaction (value-for-money feedback etc.)
• Customer retention
• Number of new customers added

Learning and Growth

This criterion is related to the infrastructure of the organization required for creating
long-term growth and improvement. The global competitive environment compels
companies to improve continuously their capabilities for delivering the required
value to customers and other stakeholders. Root cause analysis is carried out
for checking the frequency of failure and the time taken to fix the failure. Some of the
MPIs considered under this criterion are:
• Number of new ideas generated for improvement
• Skills and competency development/training

Health, Safety, and the Environment (HSE)

Today, all plants and organizations are compelled to consider criteria related to
societal and environmental issues, besides economy. All safety precautions like
protective clothing and safety against chemical cleaning are undertaken by the
organization as they are mandatory requirements. Health and safety, which forms
part of the societal requirements, besides the environmental issues, are considered
by the organization under this criterion. Some of the MPIs under this criterion are:
• Number of incidents/accidents
• Lost time due to HSE issues
• Number of legal cases
• Number of compensation cases/amount of compensation paid
• Number of HSE complaints

Employee Satisfaction

Employees are among the important partners and stakeholders of any organization
today. Therefore, their satisfaction is essential to implement the MPM
system successfully and achieve the desired goals of the organization. Sample MPIs
under this criterion, which indicate the motivation and satisfaction level of the
employees, are:
• Employee absentees
• Employee complaints
• Employee retention

19.5.3 Multiple Criteria and Hierarchical MPM Framework

While developing an MPM framework, multiple criteria and hierarchical levels of
the organization are considered. Based on the stakeholders’ requirements, corporate
objectives and strategy, multiple criteria MPIs are considered for integrating them
to different hierarchical levels of the organization involving the employees at all
levels. At the functional level, the corporate objectives are converted to specific
measurable targets. It is essential that all the employees speak the same language
throughout the entire organization.
In addition to external stakeholders’ requirements, the internal aspects like the
capacity and capability of the organization comprising the departments, employee
requirements, the organizational climate and skill enhancement are taken into
consideration. An MPM framework considering the multi-criteria and hierarchical
approach is given in Table 19.1, with sample MPIs.

Table 19.1. A multi-criteria hierarchical maintenance performance measurement (MPM) framework

| Multi-criteria | Level 1: Strategic/top management | Level 2: Tactical/middle management | Level 3: Functional/operational |
|---|---|---|---|
| Equipment/process related | Capacity utilization | Availability; OEE; Production rate; Quality; Number of stops | Production rate; Number of defects/rework; Number of stops/downtime; Vibration & thermography |
| Cost/finance related | Maintenance budget; ROMI | Maintenance production cost per ton; Maintenance/production cost | Maintenance cost per ton |
| Maintenance task related | Cost of maintenance tasks | Quality of maintenance task; Change over time; Planned maintenance task; Unplanned maintenance task | Change over time; Planned maintenance task; Unplanned maintenance task |
| Learning, growth & innovations | Generation of a number of new ideas; Skill improvement training | Generation of a number of new ideas; Skill improvement training | Generation of a number of new ideas; Skill improvement training |
| Customer satisfaction related | Quality complaint numbers; Quality return; Customer satisfaction; Customer retention | Quality complaint numbers; Quality return; Customer satisfaction; New customer addition | Quality complaint numbers; Quality return; Customer satisfaction |
| Health, safety & security, environment | Number of accidents; Number of legal cases; HSSE losses; HSSE complaints | Number of accidents/incidents; Number of legal cases; Compensation paid; HSSE complaints | Number of accidents/incidents; HSSE complaints |
| Employee satisfaction | Employee satisfaction; Employee complaints | Employee turnover rate; Employee complaints | Employee absentees; Employee complaints |

The side column of the table groups the organization's concerns as follows: front-end process (timely delivery, quality, HSE issues); effectiveness (customers/stakeholders, compliance with regulations; reliability, productivity, efficiency, growth & innovation); back-end process (process, supply chain, HSE).

The MPIs at the functional and tactical levels are aggregated into KPIs at the
strategic level. For example, MPIs such as availability, performance (production
rate) and quality at the operational level aggregate to OEE at the tactical level, and to
capacity utilization at the strategic level under the plant/equipment criterion.
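
The aggregation of operational MPIs into OEE can be illustrated with a minimal sketch. It assumes the widely used definition of OEE as the product of availability, performance and quality; the chapter itself does not state a formula, and the record fields and figures below are hypothetical. The further roll-up to capacity utilization at the strategic level is not shown, since it depends on how an organization defines that KPI.

```python
from dataclasses import dataclass


@dataclass
class ShiftRecord:
    """Hypothetical operational-level data for one machine over one shift."""
    planned_time_h: float     # scheduled production time (hours)
    downtime_h: float         # stops during the scheduled time (hours)
    ideal_rate_per_h: float   # nominal production rate (units per hour)
    units_produced: int       # total output, good and defective
    units_defective: int      # rejected or reworked units


def oee(rec: ShiftRecord) -> dict:
    """Aggregate three operational MPIs into OEE (assumed product form)."""
    run_time_h = rec.planned_time_h - rec.downtime_h
    availability = run_time_h / rec.planned_time_h
    performance = rec.units_produced / (rec.ideal_rate_per_h * run_time_h)
    quality = (rec.units_produced - rec.units_defective) / rec.units_produced
    return {
        "availability": availability,
        "performance": performance,
        "quality": quality,
        "oee": availability * performance * quality,
    }


# Illustrative figures only: an 8 h shift with 1 h of stops.
record = ShiftRecord(planned_time_h=8.0, downtime_h=1.0,
                     ideal_rate_per_h=100.0,
                     units_produced=630, units_defective=30)
result = oee(record)
print(result)
```

With these figures the three components are 0.875, 0.9 and about 0.952, so the OEE is 0.75, which equals the 600 good units divided by the 800 units that the shift could ideally have produced.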

19.6 Some Examples from Different Industries

Each industry has its own system for MPM, which is especially relevant for
industries such as nuclear power and oil and gas. MPM frameworks and indicators to
monitor, control and evaluate performance are in use in different industries.
More and more industries are trying to develop specific MPIs for their own
organizations and to identify the indicators best suited to their industry. The
industries where MPIs have been tried out include the nuclear, oil and gas (O&G),
railway, process and energy sectors. Different approaches have been applied
to developing the MPM framework and indicators for different industries,
according to the stakeholders’ requirements. Some of the MPIs used in different industries
are briefly discussed below.

19.6.1 Nuclear Industry

The International Atomic Energy Agency (IAEA) has been actively sponsoring
work in the area of indicators to monitor nuclear power plant (NPP) operational
safety performance since the early 1990s. The safe operation of nuclear power
plants is the accepted goal for top management. A high level of safety results
from the integration of good design, operational safety and human performance.
To be effective, a holistic and integrated approach is required for
providing a performance measurement framework and identifying the
performance indicators with the desired safety attributes for the operation of the
nuclear plant.
The NPP performance parameters include both safety and economic performance
indicators, with safety aspects overriding. To assess the operational safety of an
NPP, a set of tools such as probabilistic safety assessment (PSA), regulatory inspection, quality
assurance and self-assessment is used. Two categories of indicators commonly
applied are risk-based indicators and safety culture indicators.

Operational Safety Performance Indicators

Indicator development starts with the attributes of safe operation, from which the
operational safety performance indicators are identified. Under each attribute, overall indicators are
established to provide an overall evaluation of the relevant aspects of safety performance,
and under each overall indicator, strategic indicators are identified. The
strategic indicators are meant to bridge the gap between the overall and specific
indicators. Finally, a set of specific indicators is identified or developed for each
strategic indicator to cover all the relevant safety aspects of the NPP. Specific indicators
are used to measure performance and to identify declining performance,
so that management can take corrective decisions. Some of the indicators used
in the plants are given in Table 19.2 (IAEA 2000).

Table 19.2. Some of the operational safety performance indicators

| Attributes | Overall indicators | Strategic indicators | Specific indicators |
|---|---|---|---|
| 1. Operates smoothly | 1. Operating performance | 1. Forced power reductions & outages | 1. No. of forced power reductions and outages due to internal causes |
| | | | 2. No. of forced power reductions & outages due to external causes |
| | 2. State of structures, systems and components | 1. Corrective work orders issued | 1. No. of corrective work orders issued for safety system |
| | | | 2. No. of corrective work orders issued for risk important BOP systems |
| | | | 3. Ratio of corrective work orders executed to work orders |
| | | | 4. No. of pending work orders for more than 3 months |
| | | 2. Material condition | 1. Chemistry Index (WANO performance indicators) |
| | | | 2. Ageing related indicators |
| | | 3. State of the barriers | 1. Fuel reliability (WANO) |
| | | | 2. RCS leakage |
| | | | 3. Containment leakage |
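
The four-level structure described above (attribute, overall indicator, strategic indicator, specific indicator) lends itself to a nested mapping. The following is a minimal sketch only: the entries are transcribed from Table 19.2, while the variable and function names are hypothetical and no measured values are attached to the indicators.

```python
# Nested mapping: attribute -> overall -> strategic -> list of specific indicators.
indicator_tree = {
    "Operates smoothly": {
        "Operating performance": {
            "Forced power reductions & outages": [
                "No. of forced power reductions and outages due to internal causes",
                "No. of forced power reductions & outages due to external causes",
            ],
        },
        "State of structures, systems and components": {
            "Corrective work orders issued": [
                "No. of corrective work orders issued for safety system",
                "No. of corrective work orders issued for risk important BOP systems",
                "Ratio of corrective work orders executed to work orders",
                "No. of pending work orders for more than 3 months",
            ],
            "Material condition": [
                "Chemistry Index (WANO performance indicators)",
                "Ageing related indicators",
            ],
            "State of the barriers": [
                "Fuel reliability (WANO)",
                "RCS leakage",
                "Containment leakage",
            ],
        },
    },
}


def specific_indicators(tree):
    """Yield one (attribute, overall, strategic, specific) path per specific indicator."""
    for attribute, overall_map in tree.items():
        for overall, strategic_map in overall_map.items():
            for strategic, specifics in strategic_map.items():
                for specific in specifics:
                    yield attribute, overall, strategic, specific


paths = list(specific_indicators(indicator_tree))
for path in paths:
    print(" > ".join(path))
```

Keeping the full path with each specific indicator makes it possible to trace a declining measurement back to the strategic and overall indicators it supports.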

19.6.2 Oil and Gas Industry

The cost of maintenance and its influence on the total system effectiveness of the
oil and gas industry are too high to ignore (Kumar and Ellingsen 2000). The safe
operation of oil and gas production units is the accepted goal for the management
of the industry. A high level of safety is achieved through the integration of
good design, operational safety and human performance. To be effective, an integrated
approach is required for identifying the MPIs with the desired
safety attributes for the operation of the oil and gas production unit.
Some of the MPIs reported from plant level to result unit level to result area
level for the Norwegian oil and gas industry, grouped into different categories, are
as follows (Kumar and Ellingsen 2000):
• Production
– Produced volume (Sm3)
– Planned production (Sm3)
• Technical integrity
– Backlog preventive maintenance (man-hours)
– Backlog corrective maintenance (man-hours)
• Maintenance
– Maintenance man-hours total
– Maintenance man-hours safety systems
• Deferred production
– Due to maintenance (Sm3)
– Due to operation (Sm3)
– Due to drilling/well operations (Sm3)
– Weather and other causes (Sm3)

19.6.3 Railway Industry: Example from Rail Infrastructure

Railway operation and maintenance are meant to provide acceptable service to users
while meeting the regulatory authorities’ requirements. Today, one of the requirements
for infrastructure managers is to achieve cost-effective maintenance activities
and a punctual and cost-effective railroad transport system. A
research project for the Swedish railroad transport system identified the following
maintenance performance indicators (Åhren and Kumar 2004):
• Capacity utilization of infrastructure
• Capacity restriction of infrastructure
• Hours of train delays due to infrastructure
• Number of delayed freight trains due to infrastructure
• Number of disruptions due to infrastructure
• Degree of track standard
• Markdown in current standard
• Maintenance cost per track-km
• Traffic volume
• Number of accidents involving railway vehicles
• Number of accidents at level crossings
• Energy consumption per area
• Use of environmentally hazardous materials
• Use of non-renewable materials
• Total number of functional disruptions
• Total number of urgent inspection remarks

19.6.4 Process and Utility Industries

Measuring maintenance performance has drawn considerable interest in the utility,
manufacturing and process industries over the last decade. Organizations are keen to
know the return on investment in maintenance, while meeting
business objectives and strategy. Given the challenges of increasing technological
change, implementing an appropriate performance measurement system in an
organization ensures that actions are aligned with the strategies and objectives of the
organization. The MPIs for a utility in the energy sector will differ from
those of the process industry. Some of the MPIs identified for a European energy sector
organization are:

(a) Customer satisfaction related

• SAIDI (system average interruption duration index)
• CAIDI (customer average interruption duration index)
• CSI (customer satisfaction index)

(b) Cost related

• Total maintenance cost
• Profit margin

(c) Plant/process related

• Down time
• OEE rating

(d) Maintenance task related

• Unplanned stops (number and time)
• Number of emergency work
• Inventory cost

(e) Learning and growth/innovation related

• Number of new ideas generated
• Skill and improvement training

(f) Health, safety and environment related

• Number of accidents
• Number of HSE complaints

(g) Employee satisfaction related

• Employee satisfaction level
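
The two interruption indices listed under (a) can be computed from a log of outage events. The sketch below assumes the usual definitions (as in IEEE Std 1366): SAIDI is the total customer interruption duration divided by the number of customers served, and CAIDI is the same total divided by the number of customers interrupted. The chapter does not give these formulas, and the event data are hypothetical.

```python
def saidi(interruptions, customers_served):
    """System average interruption duration index (minutes per customer served).

    interruptions: list of (customers_affected, duration_minutes) tuples.
    """
    customer_minutes = sum(n * d for n, d in interruptions)
    return customer_minutes / customers_served


def caidi(interruptions):
    """Customer average interruption duration index (minutes per interruption)."""
    customer_minutes = sum(n * d for n, d in interruptions)
    customers_interrupted = sum(n for n, _ in interruptions)
    return customer_minutes / customers_interrupted


# Hypothetical outage log for one reporting period.
events = [(200, 90.0), (50, 30.0), (1000, 12.0)]
print(saidi(events, customers_served=10_000))
print(caidi(events))
```

For this log the total is 31 500 customer-minutes, giving a SAIDI of 3.15 minutes over 10 000 customers served and a CAIDI of 25.2 minutes over the 1250 customer interruptions.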

19.7 Concluding Remark

The MPM system for each organization needs to be different as each organization
is unique. A holistic and balanced MPM system should be
developed and implemented by involving all the stakeholders of the maintenance
process.
Even though there has been a several-fold growth in research publications dealing
with the area of performance measurement, the researchers and managers
dealing with the specific area of maintenance are still continuing their efforts to
find a universal maintenance performance measurement system that shows the
“added value” generated by the maintenance process and its contribution towards
the business goal of the company. Therefore, in future, it will be challenging for
maintenance professionals to show the contribution of maintenance towards the total
business goal and to provide metrics to measure the “added value” generated by the
maintenance process. Thus, future research will need to focus on understanding
the maintenance process and on developing simple, easy-to-implement performance
measurement frameworks. There is further scope to study the impact of different
cultural and human behavioral aspects associated with MPM.

19.8 References
Abran, A. and Buglione, L. (2003), A multidimensional performance model for consolidating
Balanced Scorecards, Advances in Engineering Software, 34, pp. 339–349
Åhren, T and Kumar, U. (2004), Use of maintenance performance indicators: a case study at
Banverket. Conference proceedings of the 5th Asia-Pacific Industrial Engineering and
Management Systems Conference (APIEMS2004). Gold Coast, Australia
Altmannshoffer, R. (2006). Industrielles FM, Der Facility Manager (In German), April
Issue, pp. 12–13.
Al-Turki, U. and Duffuaa, S. (2003), Performance measures for academic departments,
International Journal of Educational Management, Vol. 17, No. 7, pp. 330–338
Andersen, B. and Fagerhaug, T. (2002), Eight steps to a new performance measurement
system, Quality Progress, 35, 2, pp. 1125.
Campbell, J.D. (1995), Uptime: Strategies for Excellence in Maintenance Management.
Portland, OR: Productivity Press
Chandler, A.D. (1977), The Visible Hand: the Managerial Revolution in American Business,
Boston, MA, Harvard University Press, pp. 417
Cross, M. (1988), Raising the value of maintenance in the corporate environment,
Management Research News, Vol. 11, No. 3, pp. 8–11
DOE-HDBK-1148-2002 (2002) Work Smart Standard (WSS) Users’ Handbook, Department
of Energy, USA,
Fitzgerald, L., Johnson, R., Brignall, S., Silvestro, R. and Voss, C. (1991), Performance
Measurement in Service Businesses, London, CIMA
IAEA, International Atomic Energy Agency, (2000), A Framework for the Establishment of
Plant specific Operational Safety Performance Indicators, Report, Austria
Kaplan, R.S. and Norton, D.P. (1992), The balanced scorecard: measures that drive
performance, Harvard Business Review, January–February, pp. 71–79
Keegan, D., Eiler, R. and Jones, C. (1989), Are your performance measures obsolete?
Management Accounting, June, pp. 45–50
Kennerley, M. and Neely, A. (2003), Measuring performance in a changing business
environment, International Journal of Operation and Production Management, Vol. 23,
No. 2, pp. 213–229
Kumar, U. and Ellingsen, H. P. (2000), Development and implementation of maintenance
performance indicators for the Norwegian oil and gas industry, Conference proceedings
of 15th European Maintenance Conference (Euro Maintenance 2000), Gothenburg,
Lingle, J.H. and Schiemann, W.A. (1996), From balanced scorecard to strategy gauge: is
measurement worth it? Management Review, March, pp. 56–62
Liyanage, J.P. and Kumar, U. (2003), Towards a value-based view on operations and
maintenance performance management, Journal of Quality in Maintenance Engineering,
Vol. 9, pp. 333–350
Lynch, R.L. and Cross, K.F. (1991), Measure up!: the Essential Guide to Measuring
Business Performance, London, Mandarin
Medori, D. and Steeple, D. (2000), A framework for auditing and enhancing performance
measurement systems, International Journal of Operation & Production Management,
Vol. 20, No. 5, pp. 520–533
Meyer, M.W. and Gupta, V. (1994), The performance paradox, in Staw, B.M. and
Cummings, L.L. (Eds), Research in Organizational Behavior, Vol. 16, Greenwich, CT,
JAI Press, pp. 309–369
Miles, M.B. and Huberman, A.M. (1994). Qualitative Data Analysis, Sage Publication,
California, USA.
Murthy, D.N.P, Atrens, A. and Eccleston, J.A. (2002), Strategic maintenance management,
Journal of Quality in Maintenance Engineering, Vol. 8, No. 4, pp. 287–305
Neely, A.D. (1999), The performance measurement revolution: why now and where next,
International Journal of Operation and Production Management, Vol. 19, No. 2, pp.
Neely, A., Adams, C. and Kennerley, M. (2002), The Performance Prism, Prentice Hall,
Financial Times, Harlow, UK
Parida, A., Chattopadhyay, G. and Kumar, U. (2005), Multi criteria maintenance performance
measurement: a conceptual model, in Proceedings of the 18th International Congress of
COMADEM, 31st Aug–2nd Sep 2005, Cranfield, UK, pp. 349–356
Parida, A. and Kumar, U. (2006), Maintenance performance measurement (MPM): issues
and challenges, Journal of Quality in Maintenance Engineering, Vol. 12, No. 3, pp.
Tsang, A.H.C. (1998), A strategic approach to managing maintenance performance, Journal
of Quality in Maintenance Engineering, Vol. 4, No. 2, pp. 87–94
Wealleans, D. (2000), Organizational Measurement Manual, Abingdon, Oxon, GBR,
Ashgate Publishing Limited
Wireman, T. (1998), Developing Performance Indicators for Managing Maintenance, New
York, Industrial Press, Inc.
Wongrassamee, S., Gardiner, P.D. and Simmons, J.E.L. (2003), Performance measurement
tools: the balanced scorecard and the EFQM Excellence Model, Measuring Business
Performance, Vol. 7, pp. 14–29

Forecasting for Inventory Management of Service Parts

John E. Boylan and Aris A. Syntetos

20.1 Introduction
Service parts are ubiquitous in modern societies. The need for them arises whenever a
component fails or requires replacement. In some sectors, such as the aerospace
and automotive industries, a very wide range of service parts are held in stock, with
significant implications for availability and inventory holding. Their management
is therefore an important task.
A distinction should be drawn between preventive maintenance and corrective
maintenance. Demand arising from preventive maintenance is scheduled and is
deterministic, at least in principle. Demand arising from corrective maintenance,
after a failure has occurred, is stochastic and requires forecasting.
Fortuin and Martin (1999) categorise the contexts for service logistics as follows:
• Technical systems under client control (e.g. machines in production departments,
transport vehicles in a warehouse)
• Technical systems sold to customers (e.g. telephone exchange systems,
medical systems in hospitals)
• End products used by customers (e.g. TV sets, personal computers, motor
cars)
In the first context, there is usually a specialist department within the client
organization performing maintenance activities and managing service parts in-
ventories. In the second context, a specialist department within the vendor organi-
zation will generally undertake these tasks. In both cases, a large amount of infor-
mation is known by the vendor, or can be shared with the vendor. This information
may include scheduled (preventive) maintenance activities, times between failures,
usage rates and condition of equipment.
When a wealth of data is available, it is possible to identify explanatory
variables which may be used to predict the demand of service parts. For example,
Ghobbar and Friend (2002) showed that the average demand interval for aircraft
spare parts depends on the aircraft utilization rate, the component overhaul life and
the type of primary maintenance process. In a further study, Ghobbar and Friend
(2003) showed how forecast accuracy depends on various characteristics of the
demand process, including the seasonal period length, as well as the primary
maintenance process. Hua et al. (2006) used two zero-one explanatory variables,
plant overhaul and equipment overhaul, to help predict demand of spare parts
in the petrochemical industry. In other cases, explanatory variables have been used
to predict part of the demand for a stock keeping unit (SKU). For example,
Kalchschmidt et al. (2006) identified clusters of customers whose sales were
correlated with promotional activities and clusters of customers that were unaffec-
ted, using appropriate forecasting methods for each group.
In the third context, parts are used by consumers and much less information is
available. Fortuin and Martin (1999, p 957) commented, “Clients are anonymous,
their usage of consumer products and their ‘maintenance concept’ are not known”.
Most demand arises from purely corrective maintenance (e.g. on TV sets, personal
computers) required in the case of a defect. Even when preventive maintenance
occurs (e.g. on motor cars), prediction is complicated by the ‘maintenance concept’
of consumers being unknown. For example, customers may not bring in their cars
at the correct time for a service, or may not bring them in at all. In many practical
situations where end products are used by consumers, the vendor must gauge
demand for service parts from the demand history alone. Such demand patterns are
often sporadic, with occasional ‘spikes’ of demand. Alternatively, demand for an
SKU may be decomposed into regular and irregular components (Kalchschmidt et
al. 2006). In both cases, sporadic demand for service parts poses a considerable
challenge to those responsible for managing inventories. It is this challenge that
will be addressed in this chapter.
The remainder of the chapter is structured as follows. In the next section we
address issues pertinent to the classification of service parts for forecasting and
inventory management related purposes. Parametric and non-parametric approaches
to forecasting service parts requirements are then discussed in Sections 20.3 and
20.4 respectively. In Section 20.5, we present various metrics appropriate for
measuring the performance of the inventory management system whereas in
Section 20.6 we review the limited number of studies that provide empirical
evidence on: i) the performance of forecasting methods for service parts and ii) the
empirical fit of statistical distributions to the corresponding underlying demand
patterns. Finally, the conclusions of our work are summarized in Section 20.7.

20.2 Classification of Service Parts

Service parts for consumer products are highly varied, with differing costs, service
requirements and demand patterns. Classification of stock keeping units (SKUs) is
widely adopted by organizations, but the method of classification varies widely.
This is to be expected, as classification serves a number of different purposes.

20.2.1 Service Requirement

The first aim of classification is to determine service requirements. It is common
for organizations to segment their service parts, assigning higher service-level
targets to some segments than others. A direct approach is to classify according to
a part’s service criticality. The ‘criticality’ may be determined informally or by
formal methods such as failure mode, effects and criticality analysis (FMECA).
According to this method, criticality analysis is defined as “A procedure by which
each potential failure mode is ranked according to the combined influence of
severity and probability of occurrence” (Department of Defense 1980, p 3). This
approach is likely to be more appropriate for those situations where technical
systems are being managed by the vendor or are under client control. However, it
may also be applicable to service parts for consumers. For example, safety-critical
automotive components, such as brakes, may be assigned to a higher criticality
category than automotive accessories, such as furry dice.
Alternatively, an ABC (Pareto) classification can be used to determine service
requirements. A Pareto report lists all SKUs in descending order, by total volume,
or total value of sales. An ABC analysis by value is often used as a proxy for
criticality, with the A items being assumed to be the most critical and requiring the
highest service levels. Some authors also argue that the sophistication of the
replenishment method should reflect the ABC classification: “For a true C item
the low total of replenishment, carrying and shortage costs implies that, regardless
of the type of control system used, we cannot achieve a sizable absolute savings in
these costs. Therefore, the guiding principle should be to use simple procedures
that keep the control costs per SKU quite low…” (Silver et al. 1998, p 359). For a
single SKU, this is undoubtedly correct. However, for many hundreds or thousands
of SKUs, the argument has less force. The additional savings accruing from more
sophisticated forecasting and stock control methods potentially outweigh any
additional investment or system running costs.
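
A Pareto report and the resulting ABC classes can be produced in a few lines. This is a minimal sketch rather than a prescribed procedure: the cumulative-share cut-offs of 80% and 95% are a common convention assumed here, not values given in the text, and the SKU names and sales values are hypothetical.

```python
def abc_classify(sales_value, a_cut=0.80, b_cut=0.95):
    """Assign A/B/C classes from a Pareto ranking by total value of sales.

    sales_value: mapping of SKU -> total sales value over the review horizon.
    a_cut, b_cut: assumed cumulative-share boundaries for classes A and B.
    """
    ranked = sorted(sales_value.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(sales_value.values())
    classes = {}
    cumulative = 0.0
    for sku, value in ranked:
        # Classify on the cumulative share reached before this SKU,
        # so that the highest-value SKU is always placed in class A.
        share_before = cumulative / total
        if share_before < a_cut:
            classes[sku] = "A"
        elif share_before < b_cut:
            classes[sku] = "B"
        else:
            classes[sku] = "C"
        cumulative += value
    return classes


# Hypothetical annual sales values for five automotive service parts.
sales = {"brake-pad": 50_000, "oil-filter": 30_000, "wiper-blade": 12_000,
         "bulb": 5_000, "furry-dice": 3_000}
classes = abc_classify(sales)
print(classes)
```

The output depends only on value of sales, which is why such a ranking says nothing about the intermittence or variability of demand for each SKU.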
A further disadvantage of the ABC classification is that it is not obvious in
what different ways the categories should be treated. (This basic requirement of
inventory classification schemes was first discussed by Williams 1984.) In particu-
lar, the choice of forecasting methods for slower demand items depends on the
degree of intermittence and the variability of demand, neither of which is fully
captured by a Pareto analysis by volume or value. Therefore, if a Pareto classifica-
tion is adopted, it may be advantageous to supplement it with further categoriza-
tions. For example, classification based on value of sales is often used to determine
the frequency of orders to be placed, whereas classification of demand characteris-
tics (to be examined later in this section) is a more effective way to determine the
order levels and the safety stocks.
Categorization of service parts by cost is common practice. In some organi-
zations, a two-way classification by cost and volume is employed. This allows
greater flexibility in adjusting service targets, by category, in order to achieve over-
all targets at minimum cost. It is a slightly more sophisticated variation of the ABC
approach, again requiring supplementary classifications for forecasting purposes.

20.2.2 Inventory Decision

A product life cycle approach is often used in marketing, with three phases of
growth, maturity and decline. A similar classification may be adopted for stock
control, with the phases aligned directly to the decisions required for the inventory
management of service parts.
Fortuin (1980) suggested three phases: initial, normal and final. In the initial
phase, when the part is introduced, there are two decisions: i) should the item be
stocked and ii) if so, what are the initial stock requirements? In the normal phase,
an inventory policy must be determined and the parameters estimated. If an order-up-to
(OUT) policy is adopted, for example, then the order-up-to level must be
calculated. As the part nears the end of its life, suppliers may become reluctant to
manufacture small volumes, as required by clients, particularly if the part has high
manufacturing set-up costs. In this final phase, a decision must be taken on the size
of a single order to cover all remaining demand (sometimes known as an ‘all time
buy’). Teunter (1998) analysed this problem from a theoretical perspective, while
Teunter and Fortuin (1998) reported a case-study of a company facing such a

20.2.3 Forecasting Approach

Forecasting approaches may be broadly divided into two categories:

• Dependent on explanatory variables (causal methods)
• Dependent only on the history of demand (time-series methods)
A classification of service parts according to the product life cycle can assist in
choosing the better approach. As discussed in the first section of this chapter, the
choice of forecasting approach is mainly determined by the availability of data on
explanatory variables, such as the timing of preventive maintenance activities.
However, the forecasting approach is also driven by the availability of demand his-
tory data which, in turn, is determined by the stage of the service part’s life cycle.
Causal methods are particularly useful in the initial phase, when the part is
introduced, since the lack of an adequate length of demand history precludes the
use of extrapolative time-series methods. Models linking sales to promotional
expenditure, for example, can be applied. In the normal phase, which is the focus
of this chapter, causal methods are used when maintenance activities are under the
control of the vendor or the client (if the client is not an end-consumer). For con-
sumer clients, historical data for the explanatory variables are usually not available,
and time-series methods are used to forecast service parts’ requirements. In the
final phase, when an ‘all time buy’ from a supplier is required, extrapolative
methods can be applied. For example, a regression model on the logarithm of sales
against time may be used, assuming an exponential decline in demand over time.
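
The regression on the logarithm of sales mentioned above can be sketched as follows. The code fits ln(demand) against time by ordinary least squares and sums the extrapolated demand over an assumed remaining horizon. The function names, the horizon and the demand history are hypothetical; the sketch also assumes strictly positive demand in every period (the logarithm of zero is undefined) and ignores the bias introduced by transforming back from the log scale.

```python
import math


def fit_exponential_decline(demand):
    """Least-squares fit of ln(demand) = intercept + slope * t, for t = 0..n-1."""
    n = len(demand)
    t = list(range(n))
    y = [math.log(d) for d in demand]  # requires demand > 0 in every period
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    slope = (sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
             / sum((ti - t_mean) ** 2 for ti in t))
    intercept = y_mean - slope * t_mean
    return intercept, slope


def all_time_buy(demand, horizon):
    """Sum of extrapolated demand over the next `horizon` periods."""
    intercept, slope = fit_exponential_decline(demand)
    n = len(demand)
    return sum(math.exp(intercept + slope * t) for t in range(n, n + horizon))


# Hypothetical yearly demand, halving each year.
history = [400.0, 200.0, 100.0, 50.0]
print(all_time_buy(history, horizon=3))
```

Because the illustrative history halves exactly each year, the fitted curve extrapolates to 25, 12.5 and 6.25 units, a total of 43.75 units for a three-year horizon. In practice a safety margin would normally be added to such a point estimate.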

20.2.4 Forecasting Method

Faster moving service parts are commonly forecast using time-series methods. The
specific method that should be employed depends on the characteristics of the
demand pattern. For non-intermittent demand, exponential smoothing methods are
often used, with appropriate variants for trended, damped trended and seasonal
data. For intermittent demand, with some periods showing no demand at all, differ-
ent methods are needed.
Demand is said to be ‘intermittent’ if it is “infrequent in the sense that the
average time between consecutive transactions is considerably larger than the unit
time period, the latter being the interval of forecast updating” (Silver et al. 1998,
p 127). An item with ‘erratic demand’ is “one having primarily small demand
transactions with occasional very large transactions” (Silver 1970, p 7).
Intermittent and erratic demand patterns are very common amongst service
parts. If an item is both intermittent and erratic, it is said to be ‘lumpy’. The graph
in Figure 20.1 shows examples of intermittent and lumpy demand patterns (based
on annual demand history for two service parts used in the aerospace industry).



[Figure: demand (units) by time period, January to December, for two series labelled “Slow demand” and “Lumpy demand”]

Figure 20.1. Intermittent and lumpy demand patterns

A further approach to the categorization of service parts is to examine the
sources of intermittence and erraticness in the demand pattern. Bartezzaghi et al.
(1996) identified the following factors contributing to these demand characteristics:

1. Numerousness of potential customers
2. Frequency of customer requests
3. Heterogeneity of customers (measured by Gini’s index)
4. Variety of customers’ requests (measured by the coefficient of variation of
the demand of a single customer)
5. Correlation between customers’ requests

The first and second factors determine the intermittence of demand. In response
to this intermittence, for those SKUs with very few customers, it may become
feasible to liaise directly with them, and to enhance forecasts accordingly.
The third and fourth factors determine the ‘erraticness’ of demand. As orders
become more irregular, exploiting early information at the customer level becomes
more attractive. Of course, such early indications are not always available. This
will often be the case when addressing consumer demand. It is also possible that
early confirmed orders may give a good indication of final orders. This is particu-
larly useful when there is a strong correlation between customers’ demands.
The five factors, and their effect on intermittence, erraticness and lumpiness,
are summarised in Figure 20.2.

[Figure: diagram linking the five factors (numerousness of customers, frequency of individual orders, correlation between customers’ requests, heterogeneity of customers, variety of customers’ requests) to the resulting demand characteristics, including lumpiness]

Figure 20.2. Categorization based on the sources of demand characteristics

For those items without early indicators, forecasting must be undertaken using
a purely time-series approach. This is usually linked to a demand distribution, so
that inventory levels may be set to achieve high percentage service level targets.
Many inventory management systems make distributional assumptions of demand
according to the ABC classification. For example, A and B items may be taken to
be normally distributed, whilst C items are assumed to be Poisson. In practice,
however, many service parts have demand that is more erratic than Poisson
(sometimes known as ‘over-dispersed’). The Poisson dispersion index (ratio of the
variance to the mean of demand, including zero demands) can be used to classify
SKUs as Poisson or non-Poisson. If the index is close to unity, then a Poisson
distribution is indicated; if the index is greater than unity, then other distributions,
such as the negative binomial, may be more appropriate, or a non-parametric
approach may be required, as discussed in Section 20.4.
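
The dispersion index described above is straightforward to compute from a demand history. The sketch below uses the sample variance and includes zero-demand periods, as the text specifies. The tolerance used to decide whether the index is "close to unity" is an assumption made for illustration, as are the two demand series.

```python
from statistics import mean, variance


def dispersion_index(demand):
    """Ratio of the variance to the mean of demand per period (zeros included)."""
    m = mean(demand)
    if m == 0:
        raise ValueError("dispersion index is undefined when mean demand is zero")
    return variance(demand) / m


def is_over_dispersed(demand, tolerance=0.2):
    """True if the index exceeds unity by more than the assumed tolerance."""
    return dispersion_index(demand) > 1.0 + tolerance


# Hypothetical demand histories with the same mean of 2 units per period.
regular = [1.0, 3.0, 2.0, 0.0, 4.0, 2.0, 1.0, 3.0]
lumpy = [0.0, 0.0, 10.0, 0.0, 0.0, 0.0, 6.0, 0.0]

print(dispersion_index(regular), is_over_dispersed(regular))
print(dispersion_index(lumpy), is_over_dispersed(lumpy))
```

The first series has an index of about 0.86, which is close to unity, while the second has an index above 7 and would point towards a negative binomial or a non-parametric treatment. With short histories the index is itself subject to considerable sampling error, so any fixed threshold should be treated with caution.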
An obvious way to classify service parts is by frequency of demand. As demand
occurrence becomes more infrequent, with some periods having no demand at all, a
number of difficulties emerge. First, from a forecasting perspective, methods such as
single exponential smoothing (SES) can no longer be recommended (Croston 1972).
Second, assumptions of normality of demand become unsustainable, as demand
becomes more skewed. Johnston and Boylan (1996) examined the conditions under
which Croston’s method (designed for intermittent demand and reviewed in the next
section of this chapter) is more accurate than SES. The authors concluded, on the
basis of simulation of a wide range of conditions, that Croston’s method is more
accurate (in terms of mean square error) when the average demand interval (p)
exceeds 1.25 review periods. In a recent case-study, Boylan et al. (2006) showed
that overall forecast accuracy is robust to the choice of break-points above 1.25 but
much less so to values below 1.25. For practical applications, break-points may be
determined using simulation studies. The important point is that it is preferable to
identify conditions for superior forecasting performance, and then to categorise
demand based on these results, rather than the other way round.
A complementary method of classification is by variability of demand size.
From a forecasting perspective, Syntetos et al. (2005) re-examined the comparison
between methods such as Croston (1972), based on intermittent demand, and SES.
This study was based on comparison of approximate expressions for the theoretical
mean square error of different methods. The authors identified two key categoriza-
tion variables, namely the average demand interval (p) and the squared coefficient
of variation of demand size (CV²). (Note that the latter measure ignores zero
demand periods.) Comparisons between forecasting methods yield regions of supe-
rior performance. The performance of SES depends on whether it is assessed at all
points in time, or only immediately after demand occurrences (which trigger stock-
orders in most inventory systems). The performance of Croston’s method does not
depend on this timing. Suppose that SES is compared with Croston’s method, for
the forecasts made immediately after a demand occurrence; then the matrix pre-
sented in Figure 20.3 is obtained.

                      p < 1.34 (low)       p > 1.34 (high)
CV² > 0.28 (high)     Erratic (Croston)    Lumpy (Croston)
CV² < 0.28 (low)      Smooth (SES)         Intermittent (Croston)

Fig. 20.3. Categorisation of SKUs by forecast accuracy
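A minimal sketch of the categorisation in Figure 20.3 is given below. The break-points are those shown in the figure (reading the second as a value of CV²); the way p and CV² are estimated from a demand history (number of periods divided by number of non-zero periods, and population variance of non-zero sizes) is an assumption made for illustration.

```python
from statistics import mean, pvariance

P_BREAK = 1.34     # average demand interval break-point (Figure 20.3)
CV2_BREAK = 0.28   # squared coefficient of variation break-point (Figure 20.3)


def categorise(demand):
    """Return (category, suggested method, p, CV^2) for a per-period demand history."""
    sizes = [d for d in demand if d > 0]          # CV^2 ignores zero-demand periods
    if not sizes:
        raise ValueError("no demand occurrences in the history")
    p = len(demand) / len(sizes)                  # assumed estimator of the mean interval
    cv2 = pvariance(sizes) / mean(sizes) ** 2
    if p > P_BREAK:
        category = "lumpy" if cv2 > CV2_BREAK else "intermittent"
    else:
        category = "erratic" if cv2 > CV2_BREAK else "smooth"
    method = "SES" if category == "smooth" else "Croston"
    return category, method, p, cv2


print(categorise([0, 0, 3, 0, 0, 0, 9, 0, 1, 0, 0, 6]))
print(categorise([2, 2, 3, 2, 2, 3, 2, 2]))
```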

In summary, service parts may be in the initial, normal or final phases of their
life cycle. In this chapter, we focus attention on the normal phase. Although service
parts may be classified as A, B or C in a Pareto analysis, it is likely that most parts
will be categorized as C. The service requirements for the part may be guided by
criticality and cost considerations, as well as the ABC classification. Further
refinements are necessary to the Pareto classification in order to allocate the most
appropriate forecasting methods to each SKU.
Enhancement of the ABC classification in the manner described above gives a
coherent approach to classification according to forecasting performance and a
foundation for theoretically-informed usage of terms such as ‘erratic’, as shown in
Figure 20.4 (after Syntetos 2001, adapted by Boylan et al. 2006).

[Figure 20.4: demand patterns categorised by mean inter-demand interval, mean demand size and coefficient of variation of demand sizes; the categories shown include slow, erratic, non-erratic, lumpy and clumped.]

Fig. 20.4. Categorisation of demand patterns for service parts

As in Figure 20.2, a ‘lumpy’ SKU is defined as one that is both ‘intermittent’
and ‘erratic’; definitions of ‘slow’ and ‘clumped’ are also included. Figure 20.4
offers a different perspective from Figure 20.2. The former diagram shows the
measures that may be used to classify SKUs as intermittent and erratic, whereas the
latter showed the factors that lead to intermittence and erraticness. An understand-
ing of both issues is required for effective forecasting of service parts.

20.3 Parametric Forecasting

Practical parametric approaches to inventory management rely upon estimates of
some essential demand distribution parameters. The decision parameters of the
inventory systems (such as the re-order point or the order-up-to-level) are then
based on these estimates.
Different inventory systems require different variables to be forecasted. Some
of the most widely cited systems, for example (R, s, S) policies (Naddor 1975; Ehrhardt and
Mosier 1984), require only estimates of the mean and variance of demand. (In such
systems, the inventory position is reviewed every R periods and, if it has
dropped to or below the re-order point s, enough is ordered to bring it up to the
order-up-to-level S.)
In other cases, and depending on the objectives or constraints imposed on the
system, such estimates are also necessary, although they do not constitute the ‘key’
quantities to be determined. We may consider, for example, an (R, S) or an (s, Q)
policy operating under a fill-rate constraint – known as P2 and discussed in Section
20.5. (In the former case, the inventory position is reviewed periodically, every R
periods, and enough is ordered to bring it up to S. In the latter case, there is a
continuous review of the inventory position and as soon as that drops to, or below,
s an order is placed for a fixed quantity Q.) In those cases we wish to ensure that
x% of demand is satisfied directly off-the-shelf and estimates are required for the
probabilities of any demands exceeding S or s. Such probabilities are typically
estimated indirectly, based on the mean demand and variance forecast in conjunc-
tion with a hypothesized demand distribution. Nevertheless, a reconstruction of the
empirical distribution through a bootstrapping (non-parametric) procedure would
render such forecasts redundant and this issue is further discussed in the following
section. Similar comments apply when these systems operate under a different
service driven constraint: there is no more than x% chance of a stock-out during
the replenishment cycle (this service measure is known as P1). Consequently, we
need to estimate the (100–x)-th percentile of the demand distribution.
In summary, parametric approaches to forecasting involve estimates of the
mean and variance of demand. In addition, a demand distribution needs also to be
hypothesized, in the majority of stock control applications, for the purpose of
estimating the quantities of interest. Issues related to the hypothesized demand
distribution are addressed in the following sub-section. The estimation of the mean
and variance of demand is addressed in Sections 20.3.2 and 20.3.4 respectively.

20.3.1 The Demand Distribution

Demand for service parts is most commonly intermittent in nature. The demand
pattern is characterized by infrequent demands, often of variable size, occurring at
irregular intervals. Consequently, as discussed in Section 20.3.2, it is preferable to
model demand from constituent elements, i.e. the demand size and inter-demand
interval. Therefore, compound theoretical distributions (that explicitly take into
account the size-interval combination) are typically used in such contexts of
application. We first discuss some issues related to modelling demand arrivals and
hence inter-demand intervals. We then extend our discussion to compound demand
distributions.

In a service parts demand context, two demand generation processes have
dominated the literature. If time is treated as a discrete (whole number) variable,
demand may be generated based on a Bernoulli process, resulting in a geometric
distribution of the inter-demand intervals. When time is treated as a continuous
variable, the Poisson demand generation process results in negative exponentially
distributed inter-arrival intervals.
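As an illustration of the discrete-time case, the following sketch simulates a Bernoulli demand-occurrence process and measures the resulting inter-demand intervals, whose mean should be close to p when demand occurs with probability 1/p in each period. The function names, the seed and the value p = 4 are assumptions chosen for the example.

```python
import random


def simulate_bernoulli_demand(n_periods, p, seed=0):
    """Return 0/1 occurrence flags; demand occurs with probability 1/p per period."""
    rng = random.Random(seed)
    return [1 if rng.random() < 1.0 / p else 0 for _ in range(n_periods)]


def inter_demand_intervals(flags):
    """Number of periods between successive demand occurrences."""
    positions = [i for i, flag in enumerate(flags) if flag]
    return [later - earlier for earlier, later in zip(positions, positions[1:])]


flags = simulate_bernoulli_demand(20000, p=4.0, seed=42)
intervals = inter_demand_intervals(flags)
mean_interval = sum(intervals) / len(intervals)
print(round(mean_interval, 2))  # expected to be close to 4 for a long simulation
```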
There is sound theory in support of both the geometric and the exponential distributions
for representing the time interval between successive demands. There is also
empirical evidence in support of both distributions (e.g. Dunsmuir and Snyder 1989;
Kwan 1991; Willemain et al. 1994; Janssen 1998; Eaves 2002). With Poisson
arrivals of demands and an arbitrary distribution of demand sizes, the resulting
distribution of total demand over a fixed lead time is compound Poisson. Geometrically
distributed inter-demand intervals, in conjunction with an arbitrary distribution
for the sizes, result in a compound binomial distribution.
Regarding the compound Poisson distributions, the stuttering Poisson, which is
a combination of a Poisson distribution for demand occurrence and a geometric
distribution for demand size, has received the attention of many researchers (for
example: Gallagher 1969; Ward 1978; Watson 1987). Another possibility is the
combination of a Poisson distribution for demand occurrence and a normal distri-
bution for demand sizes (Vereecke and Verstraeten 1994), although the latter
assumption has little empirical support. Quenouille (1949) showed that a Poisson-
logarithmic process yields a negative binomial distribution (NBD). When order
occasions are assumed to be Poisson distributed and the order size is not fixed but
follows a logarithmic distribution, total demand is then negative binomially distri-
buted over time.
Another possible distribution for representing demand is the gamma distribu-
tion. The gamma distribution is the continuous analogue of the NBD and “although
not having a priori support (in terms of an explicit underlying mechanism such as
that characterizing compound distributions), the gamma is related to a distribution
which has its own theoretical justification” (Boylan 1997, p 168). The gamma
covers a wide range of distribution shapes, it is defined for non-negative values
only and it is generally mathematically tractable in its inventory control applica-
tions (Burgin and Wild 1967; Burgin 1975; Johnston 1980). Nevertheless if it is
assumed that demand is discrete, then the gamma can be only an approximation to
the distribution of demand. At this point it is important to note that the use of both
NBD and gamma distributions requires estimation of the mean and variance of
demand only. In addition, and as discussed in section 20.6, there is empirical
evidence in support of both distributions and therefore they are recommended for
practical applications.
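Because the NBD requires only the mean and variance of demand, its parameters can be obtained by the method of moments. The sketch below does this and then finds a percentile of the fitted distribution, of the kind needed under a P1-type constraint. The parameterisation (r, prob) and the function names are assumptions for illustration; the fit is only defined when the variance exceeds the mean.

```python
import math


def nbd_params(mean, variance):
    """Method-of-moments NBD parameters (r, prob); requires variance > mean."""
    if variance <= mean:
        raise ValueError("NBD needs a variance greater than the mean")
    prob = mean / variance
    r = mean * mean / (variance - mean)
    return r, prob


def nbd_pmf(k, r, prob):
    """P(demand = k) for an NBD with mean r(1 - prob)/prob."""
    log_pmf = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
               + r * math.log(prob) + k * math.log(1.0 - prob))
    return math.exp(log_pmf)


def nbd_percentile(target, r, prob):
    """Smallest integer S such that P(demand <= S) >= target."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    cumulative, k = 0.0, 0
    while True:
        cumulative += nbd_pmf(k, r, prob)
        if cumulative >= target:
            return k
        k += 1


r, prob = nbd_params(mean=2.0, variance=6.0)
print(r, prob, nbd_percentile(0.95, r, prob))
```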
Vereecke and Verstraeten (1994) presented an algorithm developed for the
implementation of a computerised stock control system for spare parts in a chemi-
cal plant. Of the items, 90% were classified as lumpy, with the remaining 10%
consisting of slow or fast movers. The demand was assumed to occur as a Poisson
process with a package of several pieces being requested at each demand occur-
rence. The parameters of the distribution of the demand size can be estimated from
the variance and the average of the demand history data of each item. The resulting
distribution of demand per period was called a ‘package Poisson’ distribution. The
same distribution has appeared in the literature under the name ‘hypothetical SKU’
(h-SKU) Poisson distribution (Williams 1984), where demand is treated as if it
occurs as a multiple of some constant, or ‘clumped Poisson’ distribution, for mul-
tiple item orders for the same SKU of a fixed ‘clump size’ (Ritchie and Kingsman
1985) (please also refer to Figure 20.4 where a definition of ‘clumped’ demand is
offered). In an earlier work, Friend (1960) also discussed the use of a Poisson
distribution for demand occurrence, combined with demands of constant size. The
‘package Poisson’ distribution requires, as the Poisson distribution itself, an esti-
mate of the mean demand only.
If demand occurs as a Bernoulli process and orders follow the logarithmic-
Poisson distribution (which is not the same as the Poisson-logarithmic process that
yields NBD demand) then the resulting distribution of total demand per period is
the log-zero-Poisson (Kwan 1991). The log-zero-Poisson is a three parameter dis-
tribution and requires a rather complicated estimation method. Moreover, it was
found by Kwan (1991) to be empirically outperformed by the NBD. Hence, the
log-zero Poisson cannot be recommended for practical applications. One other
compound binomial distribution that has appeared in the literature involves
normally distributed demand sizes (Croston 1972, 1974). However, as discussed
above, a normality assumption is unrealistic and therefore the distribution is not
recommended for practical applications.

20.3.2 Estimation of Mean Demand: Size – Interval Methodology

Single exponential smoothing (SES) and simple moving averages (SMA) are often
used in practice to forecast intermittent demand. Both methods have been shown to
perform satisfactorily on real service parts data. However, the ‘standard’ fore-
casting method for such items is considered to be Croston’s method (Croston 1972,
as corrected by Rao 1973). Croston suggested treating the size of orders ($z_t$) and
the intervals between them ($p_t$) as two separate series and combining their
exponentially weighted moving averages (obtained using SES) to achieve a forecast
of the demand per period. (Recently, some adaptations of Croston’s method have
appeared in the literature that rely upon SMA rather than SES estimates and such
modifications are further discussed later in this section.)
In Croston’s work, both demand sizes and intervals were assumed to have
constant means and variances, for modelling purposes, and demand sizes and
demand intervals to be mutually independent. Demand was assumed to occur as a
Bernoulli process. Consequently, the inter-demand intervals are geometrically
distributed (with mean $p$). The demand sizes were assumed to follow the normal
distribution (with mean $\mu$ and variance $\sigma^2$).
These assumptions have been challenged in respect of their realism (see, for
example, Willemain et al. 1994) and they have also been challenged in respect of
their theoretical consistency with Croston’s forecasting method. The latter issue is
further discussed in Section 20.3.3.
Croston’s method works in the following way: SES estimates of the average
size of the demand ($\hat{z}_t$) and the average interval between demand incidences ($\hat{p}_t$)
are made after demand occurs (using the same smoothing constant value, $\alpha$). If no
demand occurs, the estimates remain exactly the same. The forecast of demand per
period ($\hat{Y}_t$) is given by $\hat{Y}_t = \hat{z}_t / \hat{p}_t$. If demand occurs in every time period,
Croston’s estimator is identical to SES. For constant lead times of length $L$, the
mean lead-time demand estimate ($\hat{Y}_L$) is then obtained as follows:

$$\hat{Y}_L = L \hat{Y}_t \tag{20.1}$$
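The updating scheme described above can be sketched in a few lines. The initialisation (first observed demand size and first observed interval) is an assumption made for illustration, since the text does not specify one, and the function name is hypothetical.

```python
def croston(demand, alpha=0.1):
    """Per-period forecast from Croston's method after the last observation.

    Returns None if the history contains no demand. Sizes and intervals are
    smoothed with the same constant and updated only when demand occurs.
    """
    z_hat = None                 # smoothed demand size
    p_hat = None                 # smoothed inter-demand interval
    periods_since_demand = 0
    for y in demand:
        periods_since_demand += 1
        if y > 0:
            if z_hat is None:    # assumed initialisation from the first occurrence
                z_hat, p_hat = float(y), float(periods_since_demand)
            else:
                z_hat += alpha * (y - z_hat)
                p_hat += alpha * (periods_since_demand - p_hat)
            periods_since_demand = 0
    if z_hat is None:
        return None
    return z_hat / p_hat


per_period = croston([0, 0, 6, 0, 0, 0, 4, 0, 2], alpha=0.5)
lead_time_demand = 3 * per_period   # Equation 20.1 with L = 3
print(per_period, lead_time_demand)
```

When demand occurs in every period the interval estimate stays at one, so the forecast reduces to the SES estimate of demand size, as stated in the text.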

Despite the theoretical superiority of such an estimator, only modest benefits were
recorded in the literature when the method was actually applied to real data.
Syntetos and Boylan (2001) showed that Croston’s estimator is biased. The bias is
introduced by estimating the probability of demand occurrence from the average
inter-demand interval (inversion bias). This is now explained in some more detail.
We start with Croston’s assumptions:

$$E(z_t) = E(\hat{z}_t) = \mu \tag{20.2a}$$
$$E(p_t) = E(\hat{p}_t) = p \tag{20.2b}$$

According to Croston, the expected estimate of demand per period in that case
would be $E(\hat{Y}_t) = E(\hat{z}_t / \hat{p}_t) = E(\hat{z}_t) / E(\hat{p}_t) = \mu / p$ (i.e. the method is unbiased).
If it is assumed that estimators of demand size and demand interval are
independent, then

$$E\left(\frac{\hat{z}_t}{\hat{p}_t}\right) = E(\hat{z}_t)\, E\left(\frac{1}{\hat{p}_t}\right) \tag{20.3}$$

but

$$E\left(\frac{1}{\hat{p}_t}\right) \neq \frac{1}{E(\hat{p}_t)} \tag{20.4}$$

and therefore Croston’s method is biased. It is clear that this result does not depend
on Croston’s assumptions of stationarity and geometrically distributed demand
intervals.
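The size of the inversion bias can be illustrated with a worked case. As an assumption made purely for illustration, take the extreme smoothing constant $\alpha = 1$, so that the interval estimate is simply the most recent interval, and let intervals be geometric with mean $p$ (demand occurring with probability $1/p$ per period).

```latex
\begin{align*}
E\left(\frac{1}{p_t}\right)
  &= \sum_{x=1}^{\infty} \frac{1}{x}\,\frac{1}{p}\left(1-\frac{1}{p}\right)^{x-1}
   = \frac{\ln p}{p-1}, \\
p = 2:\quad E\left(\frac{1}{p_t}\right)
  &= \ln 2 \approx 0.693 \;>\; \frac{1}{E(p_t)} = 0.5 .
\end{align*}
```

In this extreme case the estimator would overstate the demand rate by almost 40 per cent; smaller smoothing constants reduce the bias but do not remove it.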
More recently, Boylan and Syntetos (2003), Syntetos and Boylan (2005) and
Shale et al. (2006) presented correction factors to overcome the bias associated with
Croston’s approach. Some of these papers discuss: i) Croston’s applications under a
Poisson demand arrival process and ii) estimation of demand sizes and intervals
using an SMA (using the ratio of the former to the latter as an estimate of demand
per period). The correction factors are summarized in Table 20.1 (where k is the
length of the moving average and α is the smoothing constant for SES).

Table 20.1. Bias correction factors

Estimation    Demand generation process
method        Bernoulli                       Poisson
SES           1 − α/2                         1 − α/(2 − α)
              (Syntetos and Boylan 2005)      (Shale et al. 2006)
SMA           k/(k + 1)                       (k − 1)/k
              (Boylan and Syntetos 2003)      (Shale et al. 2006)

At this point it is important to note that SMA and SES are often treated as
equivalent when the average age of the data in the estimates is the same (Brown,
1963). A relationship links the number of points in an arithmetic average (k) with
the smoothing parameter of SES ( α ) for stationary demand. Hence it may be used
to relate the correction factors presented in Table 20.1 for each of the two demand
generation processes considered. The linking equation is

$$k = \frac{2 - \alpha}{\alpha} \tag{20.5}$$
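The correction factors of Table 20.1 and the linking equation can be checked numerically. In the sketch below the function names and the string labels for process and estimator are hypothetical; the factors themselves are those of the table, and each would multiply the ratio of the size estimate to the interval estimate.

```python
def correction_factor(process, estimator, alpha=None, k=None):
    """Bias correction factor from Table 20.1 for the given process and estimator."""
    if estimator == "SES" and alpha is not None:
        if process == "Bernoulli":
            return 1 - alpha / 2
        if process == "Poisson":
            return 1 - alpha / (2 - alpha)
    if estimator == "SMA" and k is not None:
        if process == "Bernoulli":
            return k / (k + 1)
        if process == "Poisson":
            return (k - 1) / k
    raise ValueError("unknown combination or missing parameter")


def equivalent_sma_length(alpha):
    """Moving-average length with the same average age of data as SES (Equation 20.5)."""
    return (2 - alpha) / alpha


alpha = 0.2
k = equivalent_sma_length(alpha)   # approximately 9 periods
for process in ("Bernoulli", "Poisson"):
    ses = correction_factor(process, "SES", alpha=alpha)
    sma = correction_factor(process, "SMA", k=k)
    print(process, round(ses, 4), round(sma, 4))
```

Substituting Equation 20.5 into the SMA factors reproduces the SES factors for both demand generation processes, which is why the two rows of the table agree for matched values of α and k.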

20.3.3 Method – Model Inconsistencies

Snyder (2002) pointed out that Croston’s model assumes stationarity of demand
intervals and yet an SES estimator is used, implying a non-stationary demand
process. The same comment applies to demand sizes. Snyder commented that this
renders the model and method inconsistent and he proposed some alternative
models, and suggested a new forecasting approach based on parametric bootstrap-
ping. Shenstone and Hyndman (2005) developed this work by examining Snyder’s
models. In their paper they commented on the wide prediction intervals that arise
for non-stationary models and recommended that stationary models should be re-
considered. However, they concluded: “... the possible models underlying Croston’s
and related methods must be non-stationary and defined on a continuous sample
space. For Croston’s original method, the sample space for the underlying model
included negative values. This is inconsistent with reality that demand is always
non-negative” (Shenstone and Hyndman, 2005, pp 389–390).
In summary, any potential non-stationary model assumed to be underlying
Croston’s method must have properties that do not match the demand data being
modeled. Obviously, this does not mean that Croston’s method and its variants are
not useful. Such methods do constitute the current state of the art in intermittent
demand parametric forecasting. An interesting line of further research would be to
consider stationary models for intermittent demand forecasting rather than restricting
attention to models implying Croston’s method. For example, Poisson autoregressive
models have been suggested to be potentially useful by Shenstone and
Hyndman (2005).

20.3.4 Estimation of Demand Variance

In parametric forecasting and inventory control applications, estimating the variability
of the lead-time demand forecast error is as important as estimating the
level of demand itself. In this section we address issues related to the estimation of
the error variance but it is important to note that such an estimate may not always
be required. Under the assumption of Poisson distributed demand, for example, an
estimate of the mean demand only would be sufficient.
In practical applications, and assuming constant lead-times, the variance of the
lead-time forecast error is most often taken as the sum of the error variances of the
individual forecast periods. In particular, if $L$ is the length of the (constant)
lead-time and $\hat{\sigma}_t$ is the standard deviation of the one-step-ahead forecast error at time
$t$, then the standard deviation of the lead-time forecast error ($\hat{\sigma}_L$) is estimated as

$$\hat{\sigma}_L = \sqrt{L}\, \hat{\sigma}_t \tag{20.6}$$

In theory, the standard deviation of the one-step-ahead forecast error $\hat{\sigma}_t$ can be
estimated by using either the mean squared error (MSE) or the mean absolute deviation
(MAD) approach. However, the ‘smoothing’ versions of those error measures
are most often used in practice to improve the responsiveness of the system (see, for
example, Silver et al. 1998):

$$\hat{\sigma}_t = \sqrt{MSE_t} \tag{20.7}$$

where $MSE_t = \alpha (Y_{t-1} - \hat{Y}_{t-1})^2 + (1 - \alpha) MSE_{t-1}$, or

$$\hat{\sigma}_t \approx 1.25\, MAD_t \tag{20.8}$$

where $MAD_t = \alpha |Y_{t-1} - \hat{Y}_{t-1}| + (1 - \alpha) MAD_{t-1}$.

In the above calculations, $\alpha$ is the smoothing constant, $Y_{t-1}$ the actual demand
in period $t-1$ and $\hat{Y}_{t-1}$ the forecast of demand for period $t-1$.
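The smoothed error measures can be updated recursively, as in the following sketch. Starting the recursions from the first observed error is an assumption (the text does not give initial values), and the function name is hypothetical.

```python
import math


def smoothed_error_sd(actuals, forecasts, alpha):
    """Return (sqrt of smoothed MSE, 1.25 x smoothed MAD) after the last period."""
    errors = [y - f for y, f in zip(actuals, forecasts)]
    mse = errors[0] ** 2          # assumed initial value
    mad = abs(errors[0])          # assumed initial value
    for e in errors[1:]:
        mse = alpha * e ** 2 + (1 - alpha) * mse
        mad = alpha * abs(e) + (1 - alpha) * mad
    return math.sqrt(mse), 1.25 * mad


sd_from_mse, sd_from_mad = smoothed_error_sd([10, 12, 8], [11, 11, 11], alpha=0.5)
print(sd_from_mse, sd_from_mad)
```

The two estimates generally differ; the factor 1.25 makes them comparable only when forecast errors are approximately normal, which is often not the case for intermittent demand.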
If the mean demand level fluctuates over time (steady state model or auto-
regressive integrated moving average – ARIMA process of order (0,1,1)), it has
been shown (Johnston and Harrison 1986) that Equation 20.6 neglects any correla-
tion between the estimates of demand. This correlation exists, at least in part,
because of the uncertainty in the estimate of the true underlying level of demand
that is carried from one period to another.
When using SES, under the above model formulation, the standard deviation of
the lead time forecast error was shown (Johnston and Harrison 1986) to be
correctly calculated as follows:

$$\hat{\sigma}_L = \sqrt{L + \alpha (L - 1) L \left(1 + \frac{\alpha (2L - 1)}{6}\right)}\; \hat{\sigma}_t \tag{20.9}$$
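To show how much the independence assumption can understate lead-time variability, the sketch below compares the square-root-of-L scaling with the steady state expression, written here as $\hat{\sigma}_L = \sqrt{L + \alpha (L-1) L (1 + \alpha(2L-1)/6)}\,\hat{\sigma}_t$. The function names and the example values are assumptions.

```python
import math


def lead_time_sd_independent(sigma_t, L):
    """Lead-time error s.d. when per-period errors are treated as independent."""
    return math.sqrt(L) * sigma_t


def lead_time_sd_ses(sigma_t, L, alpha):
    """Lead-time error s.d. for SES under the steady state model."""
    factor = L + alpha * (L - 1) * L * (1 + alpha * (2 * L - 1) / 6)
    return math.sqrt(factor) * sigma_t


for L in (1, 3, 6):
    print(L, round(lead_time_sd_independent(1.0, L), 3),
          round(lead_time_sd_ses(1.0, L, alpha=0.2), 3))
```

The two expressions coincide for a lead-time of one period and diverge as the lead-time or the smoothing constant grows.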

Under the stationary mean model assumption (the demand level is assumed to
be constant) the forecast error correlation still exists because of the uncertainty
associated with the variance of the forecasts, which is carried forward from one
period to another. In addition, if a biased estimator is in place to forecast future
demand requirements, the auto-correlation can be also attributed to the bias. This
issue has been analytically addressed by Strijbosch et al. (2000) and Syntetos et al.

20.4 Non-parametric Forecasting

As discussed in Section 20.2, demand for service parts may often be lumpy in nature.
Considering Figure 20.3, such SKUs are characterized by very infrequent demand
occurrences (intermittence) coupled with highly erratic demand sizes, when de-
mand occurs.
Croston’s method and its variants (in conjunction with an appropriate dis-
tribution) have been reported to offer tangible benefits to stockists forecasting
intermittent demand. (Relevant empirical evidence follows in Section 20.6.) Never-
theless, there are certainly some restrictions regarding the degree of lumpiness that
may be dealt with effectively by any parametric distribution. In addition to the
average inter-demand interval, the coefficient of variation of demand sizes has
been shown in the literature to be very important from a forecasting perspective for
demand classification purposes. However, as the data become more erratic, the true
demand size distribution may not comply with any standard theoretical distribu-
tion. This challenges the effectiveness of any parametric approach. When SKUs
exhibit a lumpy demand pattern such as that presented in Figure 20.1, one could
argue that only non-parametric approaches may provide opportunities for further
improvements in this area.
Willemain et al. (2004) developed a patented non-parametric forecasting
method for intermittent demand data. Their method is not model-based but instead
is a heuristic that combines a Markov process, bootstrapping and ‘jittering’ to
simulate an entire distribution for lead-time demand rather than a single forecast.
The method works according to the following steps:

1. Obtain historical demand data in chosen time buckets (e.g. days, weeks,
months)
2. Estimate transition probabilities for a two-state (zero vs. non-zero) Markov
model
3. Conditional on the last observed demand, use the Markov model to generate a
sequence of zero/non-zero values over the forecast horizon
4. Replace every non-zero state marker with a numerical value sampled at
random, with replacement, from the set of observed non-zero demands
5. ‘Jitter’ the non-zero demand values – this is effectively an ad hoc procedure
designed to allow greater variation than that already observed. The process
enables the sampling of demand size values that have not been observed in
the demand history
6. Sum the forecast values over the horizon to get one predicted value of lead
time demand (LTD)
7. Repeat steps 3 – 6 many times
8. Sort and use the resulting distribution of LTD values.
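The steps above can be sketched as follows. This is an illustrative reading of the procedure, not the patented implementation: the function names are hypothetical, the fall-back used when a state has no observed transitions is an assumption, and the jitter rule shown (a normal perturbation scaled by the square root of the sampled size) is one plausible form adopted here for illustration.

```python
import math
import random


def estimate_transitions(demand):
    """Step 2: P(non-zero next period | current state), state 0 = zero, 1 = non-zero."""
    counts = {0: [0, 0], 1: [0, 0]}
    for prev, curr in zip(demand, demand[1:]):
        counts[int(prev > 0)][int(curr > 0)] += 1
    overall = sum(d > 0 for d in demand) / len(demand)
    probs = {}
    for state, (to_zero, to_nonzero) in counts.items():
        total = to_zero + to_nonzero
        # Assumption: use the overall non-zero share if a state has no transitions.
        probs[state] = to_nonzero / total if total else overall
    return probs


def jitter(size, rng):
    """Step 5 (assumed form): perturb a sampled size; keep it if the result is not positive."""
    jittered = 1 + int(size + rng.gauss(0, 1) * math.sqrt(size))
    return jittered if jittered > 0 else size


def bootstrap_ltd(demand, horizon, n_reps=1000, seed=0):
    """Steps 3 to 8: sorted simulated lead-time demand values."""
    rng = random.Random(seed)
    probs = estimate_transitions(demand)
    sizes = [d for d in demand if d > 0]
    results = []
    for _ in range(n_reps):
        state = int(demand[-1] > 0)            # condition on the last observation
        total = 0
        for _ in range(horizon):
            state = int(rng.random() < probs[state])
            if state:
                total += jitter(rng.choice(sizes), rng)
        results.append(total)
    return sorted(results)


history = [0, 0, 3, 0, 0, 0, 5, 0, 2, 0, 0, 0, 0, 4, 0]
ltd = bootstrap_ltd(history, horizon=3, n_reps=200, seed=1)
percentile_95 = ltd[int(0.95 * len(ltd))]
print(percentile_95)
```

The sorted output gives an empirical lead-time demand distribution from which a percentile can be read directly, without hypothesizing a demand distribution.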

Willemain et al. (2004, p 381) argued that “… we need to assess the quality not
of a point forecast of the mean but of a forecast of the entire distribution”, but they
conceded that it is impossible to compare this on an item-specific basis. Instead,
the authors recommended pooling percentile estimators across items and measur-
ing the conformance of the observations (expressed using the corresponding
percentiles) to a uniform distribution. The researchers claimed significant improve-
ments in forecasting accuracy achieved by using their approach over single ex-
ponential smoothing and Croston’s method. (Issues related to assessing forecasting
performance are further considered in the next section.)
Gardner and Koehler (2005) criticized this study in terms of its methodological
arrangements and experimental structure, pointing out that:
• Willemain et al. did not use the correct lead time demand distribution for
either SES or Croston’s method. This was a twofold criticism consisting of
arguments against the use of Equation 20.6 for estimating the lead-time
demand variance (please refer to Section 20.3.4) and the use of the normal
distribution for representing demand
• They did not consider published modifications to Croston’s method such as
the estimator proposed by Syntetos and Boylan (2005)
Further empirical evidence is required in order to develop our understanding of
the benefits offered by such a non-parametric approach. In particular, a comparison
between the recently developed adaptations of Croston’s method – see Table 20.1
(in conjunction with an appropriate distribution) – with the bootstrapping approach
should prove to be beneficial from both theoretical and practitioner perspectives.

20.5 Performance Measurement

In assessing the performance of an inventory management system, there are two
essential measures: stock-holding cost and service level. Stock-holding costs are
relatively straightforward to interpret. They are generally calculated as a percent-
age of the value of inventory investment, where the percentage takes into account
such factors as the cost of capital, insurance, warehousing and obsolescence costs.
Often, the same percentage is applied to all service parts. However, one could
argue that slow-moving service parts should attract a higher percentage cost, since
these parts are at the highest risk of obsolescence.
‘Service level’ is generally interpreted as ‘off the shelf availability’ but the way
in which it is measured varies. Three common measures are defined as follows
(Silver et al. 1998):
• The fraction of replenishment cycles in which total demand can be
delivered from stock (known as P1). This is equivalent to a specified prob-
ability of no stock-outs during a replenishment cycle.
• The fraction of total demand that can be delivered from stock (known as
P2). This is also called the ‘fill rate’.
• The fraction of time during which there is stock on the shelf (known as P3).
This is sometimes used when equipment is needed for emergency purposes.
The ‘fill rate’ is probably the measure with the greatest appeal to practitioners,
since it relates most directly to customer satisfaction. Care is needed with its appli-
cation, since different results are obtained if it is calculated over a lead-time or
over all time. If unsatisfied demand is back-ordered, Brown (1967) showed that

$$P_2 = 1 - \frac{LS}{Q}\,(1 - P_2^{LT}) \tag{20.10}$$

where $P_2^{LT}$ is the measure over lead-time, $P_2$ is the measure over all time, $L$ is the
lead-time, $S$ is total demand in a year, $Q$ is the order-quantity and it is assumed that
$LS < Q$.
Ronen (1982) showed that, if unsatisfied demand is lost, then

$$P_2 = \frac{1}{\frac{LS}{Q}\,(1 - P_2^{LT}) + 1} \tag{20.11}$$
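The difference between the two fill-rate conversions can be seen in a small numerical example. The sketch assumes the relationships take the form $P_2 = 1 - (LS/Q)(1 - P_2^{LT})$ for back-ordering and $P_2 = 1 / \left((LS/Q)(1 - P_2^{LT}) + 1\right)$ for lost sales, with $L$ expressed in years; the function names and input values are hypothetical.

```python
def fill_rate_backorders(p2_lt, L, S, Q):
    """Fill rate over all time when unsatisfied demand is back-ordered (assumes L*S < Q)."""
    return 1 - (L * S / Q) * (1 - p2_lt)


def fill_rate_lost_sales(p2_lt, L, S, Q):
    """Fill rate over all time when unsatisfied demand is lost (assumes L*S < Q)."""
    return 1 / ((L * S / Q) * (1 - p2_lt) + 1)


# Lead-time of 0.1 years, annual demand 1000 units, order quantity 500 units.
backordered = fill_rate_backorders(0.9, L=0.1, S=1000, Q=500)
lost = fill_rate_lost_sales(0.9, L=0.1, S=1000, Q=500)
print(round(backordered, 4), round(lost, 4))
```

In both cases a 90 per cent fill rate measured over the lead-time corresponds to a much higher fill rate over all time, because only a fraction of demand falls within lead-times.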

These measures are based on the fraction of units satisfied from stock. Some
organizations also use measures that relate to the successful completion of an
‘order-line’ for a number of units of the same SKU. Typically, these are based on
the fraction of order-lines completely satisfied (partial satisfaction does not count).
Boylan and Johnston (1994) identified relationships between such measures and
In addition to these standard measures, other suggestions have been made.
Gardner (1990) recommended the use of trade-off curves, showing the effect of
inventory investment on the average delay in filling backorders. Separate curves
are drawn for each forecasting method, allowing the manager to see at a glance if
one method dominates the others.
Sani and Kingsman (1997) proposed the use of average regret measures. The
service regret is the amount each method falls short of the maximum service level
over all methods for that SKU. (The ‘method’ may be a forecasting or inventory
method.) The regret is then divided by the maximum service level and the ratios
are averaged across all SKUs. A cost regret measure is defined similarly. This
approach allows more detailed assessments of the interaction between forecasting
and inventory methods. Eaves and Kingsman (2004) suggested assessment of fore-
cast performance according to implied stock-holdings. These are based on a calcu-
lation of the exact safety margin providing a maximum stock-out of zero. The
advantage of this approach is that it gives monetary values of stock-savings. How-
ever, these savings may not be achieved in practice using a standard stock control
method based on the mean and variance of lead-time forecasts.
Whilst it is essential to assess stock-holding costs and service levels, it is also
important to be able to diagnose the reasons for any deterioration in these measures.
Boylan and Syntetos (2006) argue that, since this may arise as a result of forecasting
methods or inventory rules (see Figure 20.5), the accuracy of forecasting methods
should also be monitored.

[Figure 20.5: diagram linking the forecasting method and the inventory system rules to stock-holding costs and service level.]

Figure 20.5. Inventory system performance measurement

Forecast error measures may be used to detect changes in forecast accuracy
over time or to determine the relative accuracy of two (or more) forecasting
methods. For faster moving service parts, a measure that serves both purposes and
is easy to interpret is the mean absolute percentage error (MAPE):

$$MAPE = \frac{1}{n} \sum_{t=1}^{n} \frac{|Y_t - \hat{Y}_t|}{Y_t} \tag{20.12}$$

where $n$ is the number of historical forecasts included in the error measure, $Y_t$ is
the observation at time $t$ and $\hat{Y}_t$ is the forecast of demand at time $t$.
Unfortunately, this error measure is not defined for zero observations. This
rules out its application for service parts with intermittent demand. An alternative
measure, suggested by Makridakis (1993) and used by Makridakis and Hibon
(2000), is the Symmetric MAPE (sMAPE):

$$sMAPE = \frac{1}{n} \sum_{t=1}^{n} \frac{|Y_t - \hat{Y}_t|}{|Y_t + \hat{Y}_t| / 2} \tag{20.13}$$

However, as pointed out by Syntetos (2001) and by Boylan and Syntetos
(2006), the sMAPE term will always be 2 (i.e. 200 per cent) for any period when the actual demand is
zero, regardless of the size of the error. Therefore, the sMAPE does not discrimi-
nate between forecast methods in this case. Since it does not allow a satisfactory
comparison of forecast methods, it is inappropriate for intermittent demand.
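The behaviour of both measures at zero demand is easy to verify. In the sketch below (function names are hypothetical, and results are left as ratios rather than multiplied by 100) the MAPE fails on a zero observation, while the sMAPE returns the same value whatever the forecast.

```python
def mape(actuals, forecasts):
    """Mean absolute percentage error as a ratio; undefined if any actual is zero."""
    return sum(abs(y - f) / y for y, f in zip(actuals, forecasts)) / len(actuals)


def smape(actuals, forecasts):
    """Symmetric MAPE as a ratio (2.0 corresponds to 200 per cent)."""
    terms = [abs(y - f) / (abs(y + f) / 2) for y, f in zip(actuals, forecasts)]
    return sum(terms) / len(terms)


# Three zero-demand periods with very different forecasts give the same sMAPE.
print(smape([0, 0, 0], [0.5, 5, 50]))

try:
    mape([0, 3], [1, 3])
except ZeroDivisionError:
    print("MAPE is undefined when an observation is zero")
```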
A simple measure that can be used to assess the bias of forecasting methods is
the mean error (ME):

$$ME = \frac{1}{n} \sum_{t=1}^{n} (Y_t - \hat{Y}_t) \tag{20.14}$$

This measure is simple to interpret: if it is close to zero, then the forecast method
is unbiased; negative values indicate that forecasts are consistently too high, while
positive values show that forecasts are too low. Its use is recommended for inter-
mittent service parts.
If a forecast method has high forecast errors, but is approximately unbiased, then
the positive and negative errors cancel one another out, yielding a mean error close
to zero. To capture the degree of error, regardless of sign, other error measures are
required. The mean absolute error (MAE) is often used for an individual SKU and is
defined as follows:

$$\mathrm{MAE} = \frac{1}{n} \sum_{t=1}^{n} |Y_t - \hat{Y}_t| \tag{20.15}$$

This error measure should not be averaged over a whole set of parts, since it
may be dominated by a few SKUs with large errors. To avoid this problem, four
alternatives have been suggested: the MAE: mean ratio, the geometric mean ab-
solute error, the percentage better measure and the mean absolute scaled error.
Each of these measures will be reviewed in turn.
Hoover (2006) proposed the application of the MAE: mean ratio for intermit-
tent demand:

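$$\text{MAE:Mean} = \frac{\sum_{t=1}^{n} |Y_t - \hat{Y}_t| \, / \, n}{\sum_{t=1}^{n} Y_t \, / \, n} = \frac{\sum_{t=1}^{n} |Y_t - \hat{Y}_t|}{\sum_{t=1}^{n} Y_t} \tag{20.16}$$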

This measure is robust to outlying data and is easy to interpret. Hyndman (2006)
observed that the MAE: mean ratio assumes that the data is stable over time
and that, for seasonal intermittent data, the measure may become unreliable. This
problem may be overcome, for non-trended data, by calculating the measure over a
full set of seasonal cycles. If the data is trended, however, then Hyndman’s criti-
cism stands. Therefore, the MAE: mean ratio can be recommended for non-trended
intermittent service parts.
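As an illustration of the mean error, the mean absolute error and the MAE: mean ratio for a single SKU, the following minimal sketch may help. The function names and the demand history are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def mean_error(actuals, forecasts):
    """ME: average signed error; close to zero for an unbiased method."""
    return sum(y - f for y, f in zip(actuals, forecasts)) / len(actuals)


def mean_absolute_error(actuals, forecasts):
    """MAE: average error size, regardless of sign."""
    return sum(abs(y - f) for y, f in zip(actuals, forecasts)) / len(actuals)


def mae_mean_ratio(actuals, forecasts):
    """MAE divided by mean demand, i.e. sum of |errors| over sum of demand."""
    total_demand = sum(actuals)
    if total_demand == 0:
        raise ValueError("ratio is undefined when total demand is zero")
    return sum(abs(y - f) for y, f in zip(actuals, forecasts)) / total_demand


# Hypothetical intermittent history with a flat forecast of 1 unit per period.
actuals = [0, 3, 0, 0, 5, 0]
forecasts = [1, 1, 1, 1, 1, 1]

print(mean_error(actuals, forecasts))           # 2/6: forecasts slightly too low
print(mean_absolute_error(actuals, forecasts))  # 10/6
print(mae_mean_ratio(actuals, forecasts))       # 10/8 = 1.25
```

Because the ratio is scale-free, it can be compared across SKUs with very different demand volumes, which the raw MAE cannot.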
A second alternative to the MAE is the geometric mean absolute error (GMAE)
defined below; for a single series:

$$\mathrm{GMAE} = \left( \prod_{t=1}^{n} |Y_t - \hat{Y}_t| \right)^{1/n} \tag{20.17}$$

This can be generalized across series by taking the geometric mean again to
obtain the geometric mean (across series) of the geometric mean (across time) of
the absolute errors (GMGMAE):

$$\mathrm{GMGMAE} = \left( \prod_{i=1}^{N} \left( \prod_{t=1}^{n} |Y_{it} - \hat{Y}_{it}| \right)^{1/n} \right)^{1/N} \tag{20.18}$$

where $Y_{it}$ is the observation for the i-th SKU at time t, $\hat{Y}_{it}$ is the forecast of
demand for the i-th SKU at time t, and N is the number of SKUs.
An outlying observation, producing a large error by any statistical method, will
affect the GMAE similarly for all methods, and so the ratio of the GMAE for one
method to another will be robust to outliers. (The same robustness property applies
to the GMGMAE). This was first shown by Fildes (1992), using a general argu-
ment, and applied to intermittent data by Syntetos and Boylan (2005). In fact, these
authors used a slightly more complex measure, the geometric root mean square
error (GRMSE); however, Hyndman (2006) pointed out that the GRMSE and the
GMAE are identical.
Although the measure is robust to outliers, it is sensitive to zero errors (Boylan
and Syntetos 2006). Just one exact forecast will yield a zero error and a zero
GMAE, regardless of the size of the other errors. This problem may be overcome,
for stationary errors, by using the geometric mean (across series) of the arithmetic
mean (across time) of the absolute errors (GMAMAE):

$$\mathrm{GMAMAE} = \left( \prod_{i=1}^{N} \left( \frac{1}{n_i} \sum_{t=1}^{n_i} |Y_{it} - \hat{Y}_{it}| \right) \right)^{1/N} \tag{20.19}$$

This measure collapses to zero only if a series has zero forecast errors for all
periods of time, and so is more robust to zero errors than the GMGMAE. The
measure is also robust to occasional large forecast errors, provided the remaining
errors are stable, and are not unduly affected by trend or seasonality. It can there-
fore be recommended for application in these cases.
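The sensitivity of the geometric measures to zero errors can be seen in a small sketch. The helper names and the two series are hypothetical. Computing the geometric mean through logarithms is an implementation choice made here to avoid overflow on long series; it is not something the text prescribes.

```python
import math


def geometric_mean(values):
    """Geometric mean of non-negative values; zero if any value is zero."""
    if any(v == 0 for v in values):
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))


def abs_errors(actuals, forecasts):
    return [abs(y - f) for y, f in zip(actuals, forecasts)]


def gmae(actuals, forecasts):
    """Geometric mean (across time) of the absolute errors of one series."""
    return geometric_mean(abs_errors(actuals, forecasts))


def gmgmae(series):
    """Geometric mean across series of each series' GMAE."""
    return geometric_mean([gmae(a, f) for a, f in series])


def gmamae(series):
    """Geometric mean across series of each series' arithmetic MAE."""
    maes = [sum(abs_errors(a, f)) / len(a) for a, f in series]
    return geometric_mean(maes)


# Two hypothetical SKUs; the second has one exact forecast (a zero error).
series = [
    ([0, 4, 0], [1, 2, 1]),   # absolute errors 1, 2, 1
    ([2, 0, 0], [2, 1, 1]),   # absolute errors 0, 1, 1
]

print(gmgmae(series))  # collapses to zero because of the single exact forecast
print(gmamae(series))  # stays positive: sqrt((4/3) * (2/3))
```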
Another approach, which is simple to use and interpret, is the percentage better
method. According to this approach, for each service part, one forecast method is
compared to another according to a criterion such as mean error or geometric root
mean square error (Syntetos and Boylan 2005). The percentage better shows the
percentage of series for which one method has the lower error. This approach is
robust to large forecast errors and the results can be subjected to formal statistical
tests (Syntetos 2001). It is a useful measure, although it does not quantify the
degree of improvement in forecast error.
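A minimal sketch of the percentage better calculation follows. The function name and the per-series error values are hypothetical, and treating ties as "not better" is an assumption of this sketch that the text does not settle.

```python
def percentage_better(errors_a, errors_b):
    """Percentage of series on which method A has a strictly lower error than B.

    errors_a and errors_b hold one error value per series (for example the
    MAE of each SKU), listed in the same order.
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("both methods must be evaluated on the same series")
    wins = sum(1 for a, b in zip(errors_a, errors_b) if a < b)
    return 100.0 * wins / len(errors_a)


# Hypothetical per-SKU errors for two methods on four service parts.
method_a = [1.0, 2.0, 3.0, 4.0]
method_b = [2.0, 2.0, 2.0, 5.0]

print(percentage_better(method_a, method_b))  # A is better on 2 of 4 series
```

The result reports how often one method wins but, as noted above, says nothing about the size of the improvement on each series.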
Hyndman (2006) recently suggested a new error measure for intermittent de-
mand. This measure, known as the mean absolute scaled error (MASE), is defined
as follows:

$$\mathrm{MASE} = \mathrm{mean}(q_t) \tag{20.20a}$$

$$q_t = \frac{|Y_t - \hat{Y}_t|}{\dfrac{1}{n-1} \sum_{i=2}^{n} |Y_i - Y_{i-1}|} \tag{20.20b}$$

The errors are scaled based on the in-sample MAE from the naïve forecasting
method (i.e. the forecast for the next period is this period’s observation). The
measure is robust to outliers, and is valid for all non-constant series.
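The scaling step can be sketched as follows. The function name, the split into an in-sample history and an evaluation period, and the data are assumptions made for illustration.

```python
def mase(in_sample, actuals, forecasts):
    """Mean absolute scaled error.

    The scale is the in-sample MAE of the naive method, i.e. the mean
    absolute difference between consecutive in-sample observations.
    """
    n = len(in_sample)
    if n < 2:
        raise ValueError("need at least two in-sample observations")
    scale = sum(abs(in_sample[i] - in_sample[i - 1]) for i in range(1, n)) / (n - 1)
    if scale == 0:
        raise ValueError("MASE is undefined for a constant in-sample series")
    scaled = [abs(y - f) / scale for y, f in zip(actuals, forecasts)]
    return sum(scaled) / len(scaled)


# Hypothetical intermittent history and a two-period evaluation window.
history = [0, 2, 0, 0, 4, 0]   # naive in-sample MAE = (2+2+0+4+4)/5 = 2.4
actuals = [0, 3]
forecasts = [1, 1]             # absolute errors 1 and 2

print(mase(history, actuals, forecasts))  # 1.5 / 2.4 = 0.625
```

A value below one indicates that the method's errors are, on average, smaller than the in-sample errors of the naive method.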
Hyndman (2006) gave an example of the application of the MASE on
intermittent data from a major Australian lubricant manufacturer. He compared the
out-of-sample MASE of four methods: naïve, overall mean, single exponential
smoothing and Croston’s method. The naïve method has the lowest MASE. This
result is valid statistically, but is counter-intuitive from an inventory-management
perspective. Boylan and Syntetos (2006) commented that the naïve method is
sensitive to large demands and will generate high forecasts. Its use will almost
certainly lead to over-stocking and possibly to obsolescence. This example high-
lights the danger of relying on statistical error measures alone. As noted earlier in
this section, attention should always be paid to the stock-holding and service
implications of different forecasting methods. Improvements in forecasting accu-
racy do not necessarily translate into improved stock-control performance. How-
ever, if stock-control performance has deteriorated, then forecast error measures
can be used to diagnose problems with forecasting methods, and to suggest alternatives.

20.6 Empirical Evidence

Empirical evidence on the performance of forecasting methods for service parts is
not extensive. The same is true regarding the empirical fit of various distributions
to the underlying demand patterns of such SKUs. In this section the relevant
studies are reviewed.

20.6.1 Statistical Distributions

Kwan (1991) conducted research to identify the theoretical distributions that best fit
the empirical distributions of demand sizes, inter-demand intervals and demand per
unit time period for low demand items. Regarding inter-demand intervals, both the
geometric and the negative exponential distribution were found to provide a good fit
to the demand patterns observed. The geometric distribution was also found to be a
reasonable approximation to the distribution of inter-demand intervals, for real
demand data, by Dunsmuir and Snyder (1989) and Willemain et al. (1994). Janssen
(1998) tested the Bernoulli demand generation process on a set of empirical data
obtained from a Dutch wholesaler of fasteners. The results indicated that the
Bernoulli demand generation process is a reasonable approximation for intermittent
demand processes. Finally, Eaves (2002) examined the demand patterns associated
with 6795 service parts from the Royal Air Force (UK). The findings of this detailed
study provide support for both Poisson and Bernoulli processes. In particular, the
geometric distribution was found to provide a statistically significant fit (5% sig-
nificance level) to 91% of the sample, whereas the negative exponential distribution
fitted 88% of the demand histories examined.
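The correspondence between the demand generation process and the interval distribution can be stated briefly. The notation here ($p$, $\lambda$, $I$ and $T$) is introduced only for this illustration. If demand occurs independently in each period with probability $p$ (a Bernoulli process), the inter-demand interval $I$, measured in periods, is geometric:

$$P(I = k) = p\,(1 - p)^{k-1}, \quad k = 1, 2, \ldots, \qquad E[I] = \frac{1}{p}$$

If demands instead arrive as a Poisson process with rate $\lambda$, the time $T$ between demands is negative exponential:

$$f(t) = \lambda e^{-\lambda t}, \quad t \ge 0, \qquad E[T] = \frac{1}{\lambda}$$

A good fit of the geometric distribution to observed intervals is therefore consistent with a Bernoulli process, and a good fit of the negative exponential with a Poisson process.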
Kwan (1991) tested the empirical fit of the log-zero-Poisson (lzP) and negative
binomial (NBD), amongst other possible underlying demand distributions. The
NBD was found to be the best, fitting 90% of the SKUs. Boylan (1997) tested the
goodness-of-fit of four demand distributions (NBD, lzP, condensed negative bino-
mial distribution (CNBD) and gamma distribution) on real demand data. The CNBD
arises if we consider a condensed Poisson incidence distribution (‘censored’ Poisson
process in which only every second event is recorded) assuming that the mean rate
of demand incidence is not constant, but varies according to a gamma distribution.
The empirical sample used for testing goodness-of-fit contained the six-month
histories of 230 SKUs, with demand recorded weekly. The analysis showed strong
support for the NBD. The results for the gamma distribution were also encouraging,
although not as good, for slow moving SKUs, as the NBD.

20.6.2 Demand Estimators

Willemain et al. (1994) compared SES and Croston’s method on both theoretically
generated and empirical intermittent demand data (54 series). They concluded that
Croston's method is robustly superior to exponential smoothing and can provide
tangible benefits to stockists dealing with intermittent demand. A very important
feature of their research, though, was the fact that industrial results showed very
modest benefits as compared with the simulation results.
Sani and Kingsman (1997) compared the performance (service level and in-
ventory costs) of various empirical and theoretically proposed stock control poli-
cies for low demand items as well as that of various forecasting methods (SMA,
SES, Croston) on 30 service parts. Their results indicated: i) the very good overall
performance of SMA; ii) the fact that stock control policies that have been
developed in conjunction with specific distributional assumptions – such as the
power approximation (explicitly built upon the assumption of a compound Poisson
underlying demand pattern – please also refer to Section 20.3) perform particularly
Willemain et al. (2004) assessed the forecast accuracy of SES, Croston’s
method (both in conjunction with a hypothesised normal distribution) and the non-
parametric approach that they proposed (please refer to Section 20.4) on 28,000
service inventory items. They concluded that the bootstrap method was the most
accurate forecasting method and that Croston’s method had no significant ad-
vantage over SES. As discussed in Section 20.4, some reservations have been
expressed regarding the study’s methodology. Nevertheless, the bootstrapping
approach is intuitively appealing for very lumpy demand items. More empirical
studies are needed to substantiate its forecast accuracy in comparison with other methods.
Syntetos and Boylan (2005) conducted an empirical investigation to compare
the forecast accuracy of SMA, SES, Croston’s method and a bias-corrected
adaptation of Croston’s estimator (termed the Syntetos-Boylan approximation,
SBA; please refer to Table 20.1). The forecast accuracy of these methods was
tested, using a wide range of forecast accuracy metrics, on 3000 service parts from
the automotive industry. The results demonstrated quite conclusively the superior
forecasting performance of the SBA method. In a later project, Syntetos and
Boylan (2006) assessed the empirical stock control implications of the same esti-
mators on the same 3000 SKUs. The results demonstrated that the increased
forecast accuracy achieved by using the SBA method (also known as the ‘Approxi-
mation Method’) is translated to a better stock control performance (service level
achieved and stock volume differences). A similar finding was reported in an
earlier research project conducted by Eaves and Kingsman (2004). They compared
the empirical stock control performance (implied stock holdings given a specified
service level) of the above discussed estimators on 18750 service parts from the
Royal Air Force (UK). They concluded that ‘the best forecasting method for a
spare parts inventory is deemed to be the approximation method’ (Eaves and
Kingsman 2004, p 436).

20.7 Conclusions
Service parts, particularly those subject to corrective maintenance, present a
considerable challenge for both forecasting and inventory management. If stocking
decisions are made injudiciously, then the result will be poor service or excessive
stock-holdings, possibly leading to obsolescence. Conversely, effective forecasting
and stock control will lead to cost savings and improved customer service.
A number of stock-control methods may be employed for slow-moving service
parts. Sani and Kingsman (1997) recommended the (R, s, S) policy, based on its
inventory cost and service performance in an empirical study. However, empirical
evidence is not extensive, and further research is needed in this area.
Classification of service parts is an essential element in their management. Four
purposes are served by classification:
• Determination of service targets
• Establishment of inventory decisions
• Choice of forecasting approach
• Choice of forecasting method
Determining service level requirements may be supported by a criticality classification,
undertaken using management judgment or a more formal approach, such
as an assessment of the risk and severity of a part failure. Alternatively, a Pareto
classification may be used as a proxy for criticality, with the A items being deemed
the most important. A variation on this approach is to use a matrix of cost and sales
volume to determine service requirements.
Inventory decisions relate directly to a product life cycle classification. Initial
provisioning decisions must be taken during the initial phase of the life cycle.
Decisions should be taken regarding stocking locations, too. In the normal phase,
the inventory policy and the appropriate parameters must be determined. The in-
ventory rules depend on forecasts of demand over lead-time, so the most appro-
priate forecasting method should be chosen. In the final phase, an ‘all time buy’
requires a decision on the final order quantity.
The product life cycle can also be used to help determine the forecasting ap-
proach: causal or time-series. Causal methods are often used in the initial phase,
because of the lack of data on demand history. In the normal phase, causal methods
also have an important role, if data on explanatory variables are available. Some
models have been proposed, for example, that link the sales rate with the renewal
function associated with the part replacement in order to derive the demand for
spares (Blischke and Murthy 1994). If such data are not available, then time-series
methods are used, usually based on exponential smoothing. In the final phase,
regression-based extrapolations have been recommended, assuming an exponential
decline of demand.
A further aim of classification, in the normal phase, is to determine the most
appropriate forecasting method. By examining the sources of intermittence and
erraticness of demand, it may be possible to identify parts with few customers and
to forecast using advance information or to predict responses to promotional
activity. For SKUs where this is not possible, two approaches have been proposed:
bootstrapping and distribution-based. In the former case, no distributional assump-
tions are made and the lead-time distribution of demand is generated by re-sam-
pling from previous observations. In the latter case, a demand distribution must be
determined and its parameters estimated. Classification by the shape of the demand
distribution allows the system to determine whether the Poisson, the compound
Poisson or some other distribution should be used. Classification by demand
frequency and by demand size variability allows the system to choose between
smoothing methods such as single exponential smoothing (for non-intermittent,
non-lumpy data) and methods such as Croston’s (for intermittent or lumpy data).
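The re-sampling idea behind the bootstrapping approach can be illustrated with a deliberately simple sketch. This is a plain bootstrap that draws past period demands independently and with replacement. It is not the specific procedure of Willemain et al. (2004), and the function names, parameters and demand history are hypothetical.

```python
import random


def bootstrap_lead_time_demand(history, lead_time, n_samples=1000, seed=0):
    """Simulate lead-time demand totals by re-sampling past period demands.

    Each sample is the sum of `lead_time` draws (with replacement) from the
    demand history. No demand distribution is assumed.
    """
    rng = random.Random(seed)
    return [sum(rng.choice(history) for _ in range(lead_time))
            for _ in range(n_samples)]


def empirical_quantile(samples, level):
    """Simple empirical quantile, e.g. level=0.95 for a 95% service target."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(level * len(ordered)))
    return ordered[index]


# Hypothetical weekly demand history for one service part.
history = [0, 0, 3, 0, 1, 0, 0, 2, 0, 0]
samples = bootstrap_lead_time_demand(history, lead_time=3)

# A stock level covering about 95% of the simulated lead-time demands.
print(empirical_quantile(samples, 0.95))
```

Independent re-sampling ignores any autocorrelation in the demand history and can only reproduce demand sizes that have already been observed, so it should be read as a starting point rather than a complete method.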
The time-series method that should be employed for service parts depends on
the characteristics of the demand pattern. For non-intermittent demand, exponential
smoothing methods should be employed, with appropriate variants being chosen
for trended, damped trended and seasonal data. For intermittent demand series,
Croston’s method is the standard approach and has been adopted by a number of
forecast packages. The performance of Croston’s method can be improved by
applying an appropriate adjustment factor to reduce the bias of the forecast. This
has been shown by Eaves and Kingsman (2004) and by Syntetos and Boylan
(2006) to improve the inventory performance of the system.
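Croston's method and the bias adjustment can be sketched as follows. Demand sizes and inter-demand intervals are smoothed separately, and only in periods with demand. The Syntetos-Boylan approximation then multiplies the size-to-interval ratio by (1 - alpha/2). The function name, the use of a single smoothing constant and the initialisation from the first observed demand are assumptions of this sketch.

```python
def croston_forecast(demand, alpha=0.1, sba=False):
    """Forecast of mean demand per period after processing the whole history.

    z is the smoothed demand size and p the smoothed inter-demand interval;
    both are updated only when a demand occurs. With sba=True the estimate
    is multiplied by (1 - alpha / 2) to reduce the bias of the ratio z / p.
    """
    z = None          # smoothed demand size
    p = None          # smoothed inter-demand interval
    periods = 1       # periods since the last demand, counting the current one
    for y in demand:
        if y > 0:
            if z is None:
                # Initialise from the first demand (an assumption of this sketch).
                z, p = float(y), float(periods)
            else:
                z += alpha * (y - z)
                p += alpha * (periods - p)
            periods = 1
        else:
            periods += 1
    if z is None:
        return 0.0    # no demand observed at all
    factor = 1.0 - alpha / 2.0 if sba else 1.0
    return factor * z / p


# Hypothetical intermittent history.
demand = [0, 0, 6, 0, 0, 0, 2]
print(croston_forecast(demand, alpha=0.5))            # z = 4, p = 3.5
print(croston_forecast(demand, alpha=0.5, sba=True))  # 0.75 * 4 / 3.5
```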
The performance of forecasting methods for non-intermittent demand can be
assessed directly using measures such as the mean absolute percentage error
(MAPE) and the mean absolute scaled error (MASE). However, the MAPE is not
defined for intermittent series. For an individual series, forecast accuracy may be
assessed using the mean error and the mean absolute error: mean demand ratio.
The latter measure has the shortcoming of producing unreliable results for trended
data, but trend is often barely perceptible in intermittent series. Alternatively, a
geometric mean (across series) of the arithmetic mean (across time) of the absolute
errors (GMAMAE) can be used. Statistical measures of forecast accuracy should
not be used alone, however, since optimization of forecast accuracy does not
necessarily lead to optimization of inventory performance. Inventory measures
assessing inventory costs and service level, particularly the fill rate, should also be
considered. Taken together, forecast accuracy and inventory measures provide the
manager with a comprehensive overview of the system’s performance.
Empirical evidence on the forecasting and inventory management of service
parts is not extensive, but has grown in recent years. There is good empirical sup-
port for both compound Bernoulli and compound Poisson demand distributions.
The negative binomial distribution has been found to be a good fit to many parts’
distributions, although some SKUs do not appear to be well represented by any
standard statistical distribution. Simple forecasting methods often work well, with
the simple moving average being a good benchmark method for service parts with
intermittent demand. Croston’s method and its bias-reduced variants, including the
Syntetos-Boylan approximation (SBA), should be considered. The SBA method
has been shown to perform well from both forecasting and inventory management perspectives.
In summary, there has been substantial progress in research on forecasting for
inventory management of service parts over recent decades. Three challenges
remain: for researchers to resolve theoretical inconsistencies and develop more
powerful methods, for software manufacturers to reflect the state of the art in their
packages, and for both researchers and software developers to work with practi-
tioners to broaden the base of empirical evidence in this field.

20.8 References
Bartezzaghi E, Verganti R, Zotteri G, (1996) A framework for managing uncertain lumpy
demand. Paper presented at the 9th International Symposium on Inventories, Budapest,
Blischke WR, Murthy DNP, (1994) Warranty cost analysis. Marcel Dekker, Inc., New York
Boylan JE, (1997) The centralisation of inventory and the modelling of demand. Unpublished
Ph.D. Thesis, University of Warwick, UK
Boylan JE, Johnston FR, (1994) Relationships between service level measures for inventory
systems. Journal of the Operational Research Society 45: 838–844
Boylan JE, Syntetos AA, (2003) Intermittent demand forecasting: size-interval methods
based on average and smoothing. Proceedings of the International Conference on
Quantitative Methods in Industry and Commerce, Athens, Greece
Boylan JE, Syntetos AA, (2006) Accuracy and accuracy-implication metrics for intermittent
demand. Foresight: the International Journal of Applied Forecasting 4: 39–42.
Boylan JE, Syntetos AA, Karakostas GC, (2006) Classification for forecasting and stock-
control: a case-study. Journal of the Operational Research Society: in press
Brown RG, (1963) Smoothing, forecasting and prediction of discrete time series. Prentice-
Hall, Inc., Englewood Cliffs, N.J.
Brown RG, (1967) Decision rules for inventory management. Holt, Rinehart and Winston,
Burgin TA, (1975) The gamma distribution and inventory control. Operational Research
Quarterly 26: 507–525
Burgin TA, Wild AR, (1967) Stock control experience and usable theory. Operational
Research Quarterly 18: 35–52
Croston JD, (1972) Forecasting and stock control for intermittent demands. Operational
Research Quarterly 23: 289–304
Croston JD, (1974) Stock levels for slow-moving items. Operational Research Quarterly 25:
Department of Defense USA, (1980) Procedures for performing a Failure Mode, Effects and
Criticality Analysis. MIL-STD-1629A
Dunsmuir WTM, Snyder RD, (1989) Control of inventories with intermittent demand.
European Journal of Operational Research 40: 16–21
Eaves AHC, (2002) Forecasting for the ordering and stock holding of consumable spare
parts. Unpublished Ph.D. thesis, Lancaster University, UK
Eaves A, Kingsman BG, (2004) Forecasting for ordering and stock holding of spare parts.
Journal of the Operational Research Society 55: 431–437
Ehrhardt R, Mosier C, (1984) A revision of the power approximation for computing (s, S)
inventory policies. Management Science 30: 618–622
Fildes R, (1992) The evaluation of extrapolative forecasting methods. International Journal
of Forecasting 8: 81–98
Fortuin L, (1980) The all-time requirements of spare parts for service after sales –
theoretical analysis and practical results. International Journal of Operations and
Production Management 1: 59–69
Fortuin L, Martin H, (1999) Control of service parts. International Journal of Operations and
Production Management 19: 950–971
Friend JK, (1960) Stock control with random opportunities for replenishment. Operational
Research Quarterly 11: 130–136
Gallagher DJ, (1969) Two periodic review inventory models with backorders and stuttering
Poisson demands. AIIE Transactions 1: 164–171
Gardner ES, (1990) Evaluating forecast performance in an inventory control system.
Management Science 36: 490–499
Gardner ES, Koehler AB, (2005) Correspondence: Comments on a patented bootstrapping
method for forecasting intermittent demand. International Journal of Forecasting 21:
Ghobbar AA, Friend CH, (2002) Sources of intermittent demand for aircraft spare parts
within airline operations. Journal of Air Transport Management 8: 221–231
Ghobbar AA, Friend CH, (2003) Evaluation of forecasting methods for intermittent parts
demand in the field of aviation: a predictive model. Computers and Operations Research
30: 2097–2114
Hoover J, (2006) Measuring forecast accuracy: omissions in today’s forecasting engines and
demand-planning software. Foresight: the International Journal of Applied Forecasting
4: 32–35
Hua ZS, Zhang B, Yang J, Tan DS, (2006) A new approach of forecasting intermittent
demand for spare parts inventories in the process industries. Journal of the Operational
Research Society: in press.
Hyndman RJ, (2006) Another look at forecast-accuracy metrics for intermittent demand.
Foresight: the International Journal of Applied Forecasting 4: 43–46
Janssen FBSLP, (1998) Inventory management systems; control and information issues.
Published Ph.D. thesis, Centre for Economic Research, Tilburg University, The
Johnston FR, (1980) An interactive stock control system with a strategic management role.
Journal of the Operational Research Society 31: 1069–1084
Johnston FR, Boylan JE, (1996) Forecasting for items with intermittent demand. Journal of
the Operational Research Society 47: 113–121
Johnston FR, Harrison PJ, (1986) The variance of lead-time demand. Journal of the
Operational Research Society 37: 303–308
Kalchschmidt M, Verganti R, Zotteri G, (2006) Forecasting demand from heterogeneous
customers. International Journal of Operations and Production Management 26: 619–638
Kwan HW, (1991) On the demand distributions of slow moving items. Unpublished Ph.D.
thesis, Lancaster University, UK
Makridakis S, (1993) Accuracy measures: theoretical and practical concerns. International
Journal of Forecasting 9: 527–529
Makridakis S, Hibon M, (2000) The M3-Competition: results, conclusions and implications.
International Journal of Forecasting 16: 451–476
Naddor E, (1975) Optimal and heuristic decisions in single and multi-item inventory
systems. Management Science 21: 1234–1249
Quenouille MH, (1949) A relation between the logarithmic, Poisson and negative binomial
series. Biometrics 5: 162–164
Rao AV, (1973) A comment on: Forecasting and stock control for intermittent demands.
Operational Research Quarterly 24: 639–640
Ritchie E, Kingsman BG, (1985) Setting stock levels for wholesaling: performance
measures and conflict of objectives between supplier and stockist. European Journal of
Operational Research 20: 17–24
Ronen D, (1982) Measures of product availability. Journal of Business Logistics 3: 45–58
Sani B, Kingsman BG, (1997) Selecting the best periodic inventory control and demand
forecasting methods for low demand items. Journal of the Operational Research Society
48: 700–713
Shale EA, Boylan JE, Johnston FR, (2006) Forecasting for intermittent demand: the
estimation of an unbiased average. Journal of the Operational Research Society 57:
Shenstone L, Hyndman RJ, (2005) Stochastic models underlying Croston’s method for
intermittent demand forecasting. Journal of Forecasting 24: 389–402
Silver EA (1970) Some ideas related to the inventory control of items having erratic demand
patterns. CORS Journal 8: 87–100.
Silver EA, Pyke DF, Peterson R, (1998) Inventory management and production planning
and scheduling (3rd edition). John Wiley & Sons, New York
Snyder R, (2002) Forecasting sales of slow and fast moving inventories. European Journal
of Operational Research 140: 684–699
Strijbosch LWG, Heuts RMJ, van der Schoot EHM, (2000) A combined forecast-inventory
control procedure for spare parts. Journal of the Operational Research Society 51:
Syntetos AA, (2001) Forecasting of intermittent demand. Unpublished PhD Thesis,
Buckinghamshire Chilterns University College, Brunel University, UK
Syntetos AA, Boylan JE, (2001) On the bias of intermittent demand estimates. International
Journal of Production Economics 71: 457–466
Syntetos AA, Boylan JE, (2005) The accuracy of intermittent demand estimates.
International Journal of Forecasting 21: 303–314
Syntetos AA, Boylan JE, Croston JD, (2005) On the categorization of demand patterns.
Journal of the Operational Research Society 56: 495–503
Syntetos AA, Boylan JE (2006) On the stock control performance of intermittent demand
estimators. International Journal of Production Economics 103: 36–47
Teunter RH, (1998) Inventory control of service parts in the final phase. Published PhD
Thesis, University of Groningen, The Netherlands
Teunter RH, Fortuin L, (1998) End-of-life-service: a case-study. European Journal of
Operational Research 107: 19–34
Vereecke A, Verstraeten P, (1994) An inventory management model for an inventory
consisting of lumpy items, slow movers and fast movers. International Journal of
Production Economics 35: 379–389
Ward JB, (1978) Determining re-order points when demand is lumpy. Management Science
24: 623–632
Watson RB, (1987) The effects of demand-forecast fluctuations on customer service and
inventory cost when demand is lumpy. Journal of the Operational Research Society 38:
Willemain TR, Smart CN, Shockor JH, DeSautels PA, (1994) Forecasting intermittent
demand in manufacturing: a comparative evaluation of Croston’s method. International
Journal of Forecasting 10: 529–538
Willemain TR, Smart CN, Schwarz HF, (2004) A new approach to forecasting intermittent
demand for service parts inventories. International Journal of Forecasting 20: 375–387
Williams TM, (1984) Stock control with sporadic and slow-moving demand. Journal of the
Operational Research Society 35: 939–948
Part F

Applications (Case Studies)


Maintenance in the Rail Industry

Jørn Vatn

21.1 Introduction
This chapter presents two case studies of maintenance optimization in the rail
industry. The first case study discusses grouping of maintenance activities into
maintenance packages. The second case study uses a life cycle cost approach to
prioritize between maintenance and renewal projects under budget constraints.
Grouping of maintenance activities into maintenance packages is an important
issue in maintenance planning and optimization. Grouping matters both
economically, because it reduces set-up costs, and administratively, because it
yields more manageable solutions. If several maintenance activities can be
specified as one work order in the computerized maintenance management system,
there are fewer work orders to administer. The maintenance intervals are usually
determined by considering the various components or activities separately, and
the activities are then grouped into maintenance packages. By executing several
activities at the same time, the set-up costs may be shared by several activities.
However, this requires shifting the intervals for the individual activities. If too
many activities are put into the same group, the gain in set-up costs may be
outweighed by the costs of changing the intervals for the individual activities.
The case study presented for maintenance grouping concerns train maintenance,
with a focus on activities related to components in the bogie.
Another problem facing most industries is that the resources available for
maintenance and renewal are limited, so that optimization has to be conducted
under budget constraints. Two main questions should then be addressed. The first
is whether the budget constraints should be relaxed to some extent, by putting
more resources into maintenance and renewal, when there are more good projects
than resources. The second is how to prioritize, given the budget constraints. The
case study presents an approach to cost-benefit analysis of the various projects,
which gives a ranked list of projects to consider for execution. The proposed
method has been implemented by the Norwegian National Rail Administration (JBV),
which is responsible for the Norwegian railway network.
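The idea of a ranked project list under a budget constraint can be sketched in a few lines. This is a generic illustration only, not the method implemented by JBV. The benefit-cost ratio as ranking criterion, the greedy selection rule, the function name and all project figures are assumptions.

```python
def prioritize(projects, budget):
    """Rank projects by benefit-cost ratio and select greedily within budget.

    projects: list of (name, cost, benefit) tuples with positive costs.
    Returns the full ranked list of names and the names selected for execution.
    """
    ranked = sorted(projects, key=lambda p: p[2] / p[1], reverse=True)
    selected = []
    remaining = budget
    for name, cost, benefit in ranked:
        if cost <= remaining:
            selected.append(name)
            remaining -= cost
    return [name for name, _, _ in ranked], selected


# Hypothetical projects: (name, cost, discounted benefit), in arbitrary money units.
projects = [
    ("ballast cleaning", 5.0, 10.0),
    ("rail grinding", 4.0, 12.0),
    ("level tamping", 3.0, 3.0),
]

ranked, selected = prioritize(projects, budget=8.0)
print(ranked)    # highest benefit-cost ratio first
print(selected)  # projects that fit within the budget, in ranked order
```

Comparing the ranked list with the selected list also shows which worthwhile projects are left out, which bears on the first question of whether the budget itself should be increased.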
Section 21.2 presents some general information about rail maintenance in
Norway as a basis for the two case studies. The first case study in Section 21.3
discusses grouping of maintenance activities into maintenance packages. The
second case study in Section 21.4 uses a life cycle cost approach to prioritize be-
tween maintenance and renewal projects under budget constraints.

21.2 Background Information About Rail Maintenance

During the past few decades there has been a dramatic change in the organization
of the European railways. The European Union has been an important driving
force, and legislation has been introduced to split the former state railways into one
national infrastructure manager, and one or more train operators (railway under-
takings). The idea has been to allow for many train operators to compete against
each other to offer train services on the European network. Further, the
maintenance of both the rolling stock (trains) and the infrastructure has to a
great extent been outsourced.
In the following sections we present some Norwegian case studies, and give
some background information about organization of the maintenance in Norway.
The Norwegian State Railways (NSB) is the main Norwegian railway undertaking.
NSB has outsourced most of the maintenance to MANTENA, a maintenance
contractor. The preventive maintenance is based on activity-based contracts where
NSB decides the type and amount of maintenance, whereas corrective maintenance is
compensated for by a lump sum. The potential for the contractor to earn more money
lies in effective grouping of the maintenance, and in improved work processes for
the organization and execution of maintenance. NSB has implemented reliability
centred maintenance (RCM) as the basis for the preventive maintenance program.
The objective is that maintenance should be executed in natural lulls in the
timetable. Major revisions, e.g. of the bogies, need longer depot stops.
JBV is the infrastructure manager of the Norwegian network. The level of
outsourcing of maintenance work is relatively low. Less than 10% of the operations
and corrective maintenance work is performed by external contractors, whereas for
preventive maintenance contract work represents 10–20%. For renewals the per-
centage is almost 70, and for investment (new lines) the percentage is more than 80.
JBV has also implemented RCM as a basis for the preventive maintenance program.
For larger maintenance projects and all renewal projects a prioritization regime
supported by life cycle cost considerations has been implemented. The Norwegian
network is split into three regions, where each region is responsible for prioritization
of the resources that are allocated by the central maintenance administration. For the
track and overhead line, data from special measuring wagons are an important input to
the models used to support prioritization between large maintenance and renewal
projects such as rail grinding, level tamping, ballast cleaning and rail repair and
Most European infrastructure managers have introduced more formalized opti-
mization models for maintenance and renewal planning. Some recent references
Maintenance in the Rail Industry 511

are Carretero et al. (2003), Zoeteman (2003), Veit and Wogowitsch (2003), Vatn
et al. (2003), Zarembski and Palese (2003), Pedregal et al. (2004), Meier-Hirmer
et al. (2005), Budai et al. (2005) and Reddy et al. (2006). Railway research related
to maintenance is, however, dominated by wear modelling. Especially wheel-rail
wear models and track degradation models are important because the major main-
tenance and renewal costs of a railway line are due to track components. Some
important references are Bing and Gross (1983), Li and Selig (1995), Sato (1995),
Bogdański et al. (1996), Ferreira and Murray (1997), Zhang et al. (1997), Kay
(1998), Zakharov et al. (1998), Salim (2004), Telliskivi and Olofsson (2004),
Grassie (2005) and Braghin et al. (2006). A complete survey of reported models is
beyond the scope of this chapter.

21.3 Case Study 1

21.3.1 Grouping of Maintenance Activities

Rolling stock maintenance is characterized by the fact that the trains have to be
taken out of service while they are maintained in a maintenance depot. This causes
a lot of challenges related to scheduling of the train services taking the need for
maintenance into account. The scheduling problem is not considered here, and we
only present a rather simple model for grouping of some maintenance activities
assuming that we have access to the train whenever we want. Sriskandarajah et al.
(1998) present a methodology utilizing genetic algorithms on a much more com-
plex situation within train maintenance scheduling. In our example we only con-
sider the following cost elements:
• Man-hour costs and material costs related to preventive maintenance of
each component.
• Set-up costs to get access to the components to be maintained, and by
paying the set-up costs access to several components is obtained.
• Costs of taking the train out of service. These costs are included in the set-
up costs from a modelling point of view.
• Man-hour costs and material costs related to corrective maintenance.
Typically, set-up costs cannot be shared by other components unless preventive
maintenance is advanced (opportunity maintenance).
• Costs related to the effect of a failure, i.e. punctuality, safety and material
damage costs.
In classical maintenance optimization the objective is to find the optimum
frequency of maintenance of one component at a time. However, in the multi-
component situation there exist dependencies between the components, e.g. they
may share common set-up costs (economy of scope), the costs may be reduced if
the contract to a maintenance contractor is huge (economy of scale), etc. This will
complicate the modelling from the single component approach, e.g. see Dekker et
al. (1997) for a survey of models used in the multi-component situation. In this
chapter we only consider the situation where we can save some set-up costs by
executing several maintenance activities at the same time.
512 J. Vatn

We often distinguish between the static and the dynamic planning regimes. In
the static regime the grouping is fixed during the entire system lifetime, whereas in
the dynamic regime the groups are re-established over and over again. The static
grouping situation may be easier to implement than the dynamic, and the main-
tenance effort is constant, or at least predictable. The advantage of the dynamic
grouping is that new information, unforeseen events, etc., may require a new
grouping and changing of plans. For an introduction to maintenance grouping we
refer to Wildeman (1996) who discusses these different regimes in detail. In the
example that follows we illustrate some aspects of dynamic grouping related to
maintenance activities on a train bogie.

21.3.2 Modelling Framework for the Grouping of Maintenance Activities

The trains are regularly taken out of service and sent to the maintenance depot for
execution of maintenance. Several subsystems are maintained at the same time,
and this makes the definition of set-up costs rather complicated when we develop
grouping strategies. In principle, some of the set-up costs are related to the fact that
the train is sent to the depot for maintenance, whereas some other parts of the
set-up costs are specific to one subsystem. In the following, we will simplify and only
consider costs related to the bogie, i.e. we assume one fixed set-up cost related to
the bogie. We also assume that the train is available at the maintenance depot at
any time. This is also a simplification, since each train follows a schedule, and can
only enter the maintenance depot at some of the end stations for the different
services. In order to get access to the various components in the bogie some dis-
assembling is required before maintenance can be executed, and also some re-
assembling is required after execution of maintenance. The costs of disassembling
and re-assembling are here included in the set-up cost. In the model presented we
also assume that the set-up costs are the same for all activities. It is further assumed
that there is one and only one maintenance activity related to each component. This
simplifies notation because we then may alternate between failure of component i
and executing maintenance activity i where there is a unique relation between
component and activity. The basic notation to be used is as follows.

ciP Planned maintenance cost, exclusive set-up cost. Typically the costs of
replacing one unit periodically.
cUi Unplanned costs upon a failure. These costs include the corrective
maintenance costs, safety costs, punctuality costs, and costs due to ma-
terial damage.
S Set-up costs, i.e. the costs of preparing the preventive maintenance of a
group of components maintained at the same time. We assume the same
set-up costs for all activities.
λE,i(x) Effective failure rate for component i when maintained at intervals of
length x.
Mi(x) Mi(x) = x × cUi × λE,i(x) = expected costs due to failures in a period [0,x)
for a component maintained at time 0, exclusive planned maintenance
Φi(x,k) Φi(x,k) = [ciP + S/k + Mi(x)]/x = average costs per unit time if x is the
length of the interval between planned maintenance, and the set-up
costs are shared by totally k activities.
Φ*i,k The minimum value of Φi(x,k), i.e., minimization over x.
x*i,k The x-value that minimizes Φi(x,k).
ki,Av Average number of components sharing the set-up costs for the i-th
component, i.e. the i-th component is in average maintained together
with ki,Av –1 other components.
Φ*i,Av Average minimum costs per unit time over all k-values.
x*i,Av Optimum value of xi over all k-values. x*i,Av is measured in million
kilometres since last maintenance on component i.
t0 Point of time when we are planning the next group of activities. Initially
t0 = 0. t0 is measured in running (million) kilometres since t = 0.
xi Age of component i at time t0, i.e., time since preventive maintenance
t*i,Av t*i,Av = t0 +x*i,Av – xi = optimum time in running (million) kilometres.
Kk Candidate group, i.e. the set of the first k components to be maintained
according to individual schedule with t*i,Av as the basis for due time.
N Number of activities/components.
T End of planning horizon, i.e. we are planning from t0 = 0 to T.

The optimization problem is basically a question of balancing planned costs
against unplanned costs. The planned costs are paid when the train is taken out of
service for preventive maintenance, whereas the unplanned costs arise upon failures,
i.e. corrective maintenance costs (repairs), costs related to accidents, delays, etc.
For each component there is an expected time dependent cost which is a
function of the time since the last preventive maintenance activity, i.e. Mi(x). In
order to establish Mi(x) we need: (i) to establish the accumulated expected number
of failures in the period [0,x), (ii) to specify the expected corrective maintenance
costs for the repair of each failure, and (iii) to specify the impact of the failure on
safety, punctuality, etc., and quantify these into cost figures. In the model
presented here we assume that the effective failure rate, λE,i(x), may be established for
the different failure characteristics and maintenance strategies (e.g. periodic
replacement and condition monitoring). Next, the costs associated with a failure of
component i can in principle be found by risk modelling, punctuality modelling, etc. (see
Chapter 4). The result of such modelling is one figure for the expected costs, i.e. cUi.
Thus, Mi(x) = x × cUi × λE,i(x).
The planned costs comprise the costs of executing the maintenance on com-
ponent i (ciP ) and set-up costs (S) of getting access to the component. The set-up
costs may in general be shared with k–1 other activities.
The average contribution to the total costs for component i per unit time is
given by

Φi(x,k) = [ciP+ S/k + Mi(x)]/x (21.1)
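
As a minimal sketch of how Equation 21.1 can be minimized for one component, the Python code below assumes (this is not stated in the chapter) that the expected number of failures in [0,x) is approximated by the Weibull cumulative hazard (x/η)^α with η = MTTF/Γ(1+1/α), so that Mi(x) = cUi (x/η)^α. Under that assumption the minimizer has a closed form, which is compared with a plain numerical search. All function names are illustrative, and this simple approximation should not be expected to reproduce the tabulated x*i,Av values of the example exactly.

```python
from math import gamma

def weibull_scale(mttf, alpha):
    """Weibull scale parameter eta implied by the mean time to failure."""
    return mttf / gamma(1.0 + 1.0 / alpha)

def phi(x, k, c_p, c_u, setup, mttf, alpha):
    """Equation 21.1 with the ASSUMED M(x) = c_u * (x / eta)^alpha."""
    failure_cost = c_u * (x / weibull_scale(mttf, alpha)) ** alpha
    return (c_p + setup / k + failure_cost) / x

def optimum_closed_form(k, c_p, c_u, setup, mttf, alpha):
    """Solve M'(x) = Phi(x, k) for x; valid for alpha > 1."""
    ratio = (c_p + setup / k) / (c_u * (alpha - 1.0))
    return weibull_scale(mttf, alpha) * ratio ** (1.0 / alpha)

def optimum_by_search(k, c_p, c_u, setup, mttf, alpha, lo=1e-3, hi=100.0):
    """Ternary search; Phi is unimodal in x when alpha > 1."""
    for _ in range(200):
        a, b = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if (phi(a, k, c_p, c_u, setup, mttf, alpha)
                < phi(b, k, c_p, c_u, setup, mttf, alpha)):
            hi = b
        else:
            lo = a
    return 0.5 * (lo + hi)

# Component 3 of the bogie example: costs in Euros, MTTF in million km.
args = dict(k=13, c_p=680.0, c_u=6230.0, setup=3000.0, mttf=1.33, alpha=3.5)
x_closed = optimum_closed_form(**args)
x_search = optimum_by_search(**args)
print(f"x* = {x_closed:.3f} (closed form), {x_search:.3f} (search)")
```

The closed form is convenient because the same expression can be re-evaluated cheaply whenever the assumed number k of activities sharing the set-up cost changes.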

If the grouping was fixed, i.e. static grouping, the optimization problem would
just be to minimize ΣiΦi(x,k) for all k components maintained at the same time.
Static grouping will not be discussed, but we present an approach for dynamic
grouping. Mathematically, the challenge now is to establish the grouping either in a
finite or infinite time horizon. In addition to the grouping, we also have to schedule
the execution time for each group (maintenance package). The grouping and the
scheduling cannot be done separately. Generally, such optimization problems are
NP-hard (see Garey and Johnson (1979) for a definition), and heuristics are
required. Before we propose our heuristic we present some motivating results.
Let Φ*i,k be the minimum average costs when one component is considered
individually, and let x*i,k be the corresponding optimum x value. It is then easy to
prove that mi(x*i,k) = M’i(x*i,k) = Φ*i,k meaning that when the instantaneous expected
unplanned costs per unit time, mi(x), exceeds the average costs per unit time,
maintenance should be carried out. The way to use the result is now the following.
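
The identity can be checked from the first-order condition for Equation 21.1. The short derivation below assumes that Mi is differentiable and writes mi(x) = M'i(x):

$$
\frac{\partial \Phi_i(x,k)}{\partial x}
= \frac{x\,M_i'(x) - \left[c_i^P + S/k + M_i(x)\right]}{x^2} = 0
\;\Longrightarrow\;
M_i'(x^*_{i,k}) = \frac{c_i^P + S/k + M_i(x^*_{i,k})}{x^*_{i,k}} = \Phi^*_{i,k}
$$
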
Assume we are going to determine the first point of time to execute the
maintenance, i.e. to find t = x*i,k starting at t = 0. Further, assume that we know the
average costs per unit time (Φ*i,k) but that we have for some reason “lost” or
“forgotten” the value of x*i,k. What we can then do is to find t such that mi(t) = M’i(t) =
Φ*i,k yielding the first point of time for maintenance. Then from time t and the
remaining planning horizon we can pay Φ*i,k as the minimum average costs per unit
time. This is the traditional marginal costs approach to the problem, and brings the
same result as minimizing Equation 21.1. The advantage of the marginal thinking
is that we are now able to cope with the dynamic grouping. Assume that the time
now is t0, and xi is the age (time since last maintenance) for component i in the
group we are considering for the next execution of maintenance. Further, assume
that the planning horizon is [t0,T). The problem now is to determine the point of
time t (≥t0) when the next maintenance is to be executed. The total costs of
executing the maintenance activities in a group is S + ΣiciP which we pay at time t.
Further, the expected unplanned costs in the period [t0 , t) is ΣiMi(t-t0+xi) –ΣiMi(xi).
For the remaining time of the planning horizon the total costs are (T–t)ΣiΦ*i,k
provided that each component i can be maintained at “perfect match” with k–1
activities the rest of the period. Since Φ*i,k depends on how many components
share the set-up cost, which we do not know at this time, we use some average
value Φ*i,Av. We assume that we know this average value at the first planning. To
determine the next point of time for maintaining a given group of components we
thus minimize:

$$
c_1(t;k) = S + \sum_{i \in K_k} \left[ c_i^P + M_i(t - t_0 + x_i) - M_i(x_i) + (T - t)\,\Phi^*_{i,Av} \right] \tag{21.2}
$$

The costs in Equation 21.2 depend on which components to include in the
group of activities to be executed next. The more activities we include, the higher
the costs will be. For some activities it might thus be cheaper to include them in
groups to be executed later. For activities we do not include in this first group we
assume that they will be maintained at their “optimum” time t*i,Av > t. The total
contribution to the costs related to these activities in [t0,T) is

$$
c_2(t;k) = \sum_{i \notin K_k} \left[ c_i^P + S/k_{i,Av} + M_i(x^*_{i,Av}) - M_i(x_i) + (T - t^*_{i,Av})\,\Phi^*_{i,Av} \right] \tag{21.3}
$$

provided they can be maintained at “perfect match” with other activities, i.e. the
set-up costs are shared with ki,Av – 1 activities, and executed at time t*i,Av. The total
optimization problem related to the next group of activities is therefore to minimize

$$
\begin{aligned}
c(t;k) = {} & S + \sum_{i \in K_k} \left[ c_i^P + M_i(t - t_0 + x_i) - M_i(x_i) + (T - t)\,\Phi^*_{i,Av} \right] \\
& + \sum_{i \notin K_k} \left[ c_i^P + S/k_{i,Av} + M_i(x^*_{i,Av}) - M_i(x_i) + (T - t^*_{i,Av})\,\Phi^*_{i,Av} \right]
\end{aligned} \tag{21.4}
$$

The idea is simple: we first determine the best group to execute next, and the
best time to execute it. Further, we assume that subsequent activities can be
executed at their local optimum. We would expect to do better by taking the second
grouping into account when planning the first group, rather than treating the
remaining activities individually. See, e.g. Budai et al. (2005) for more advanced
heuristics in similar situations to those presented here. The heuristic is as follows.
Step 0: Initialization. This means to find initial estimates of ki,Av, and use these k-
values as basis for minimization of Equation 21.1. This will give initial estimates
for x*i,Av and Φ*i,Av. Finally the time horizon for the scheduling is specified, i.e., we
set t0 = 0 and choose an appropriate end of the planning horizon (T).
Step 1: Prepare for defining the group of activities to execute next. First calculate
t*i = x*i,Av + t0 – xi and sort in increasing order.
Step 2: Establish the candidate groups, i.e. for k = 1 to N we use the ordered t*i s to
find a candidate group of size k to be executed next. If t*k > mini<k (t*i +x*i,Av) this
means that at least one activity in the candidate group needs to be executed twice
before the last one is scheduled which does not make sense. Hence, in this situation
the last candidate group is dropped and we are not searching for more candidate
groups at the time being.
Step 3: For each candidate group Kk, minimize c(t,k) in Equation 21.4 with respect
to execution time t. Next choose the candidate group Kk that gives the minimum
cost. This group should then be executed at the corresponding optimum time t.
Step 4: Prepare for the next group, i.e. we assume that all activities in the chosen
candidate group are executed at time t. This corresponds to setting xi = 0 for i ∈ Kk,
xi = xi +t–t0 for i ∉ Kk and then update the current time, i.e. t0 = t. If t0 < T GoTo
Step 1, else we are done.
There are several ways to improve the algorithm. One intuitive improvement is
to improve the estimates of ki,Av and corresponding x*i,Av and Φ*i,Av to be specified in
Step 0. This is easy, since in Step 4 we get a new value of k for those activities
included in the candidate group, and when the algorithm terminates we simply set
ki,Av as the average for each activity i in the period [0,T). We may then start over
again at Step 0 with these new values of ki,Av.
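
The steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation used for the results in this chapter: Mi(x) is assumed to be cUi (x/η)^α with η = MTTF/Γ(1+1/α), the same ki,Av is used for every activity, and the names make_component, plan_groups and argmin_convex are hypothetical. With this assumed Mi the minimization in Step 3 is a one-dimensional convex problem, so a ternary search is sufficient.

```python
from math import gamma

def make_component(c_p, c_u, mttf, alpha):
    """Data for one component; eta is the Weibull scale implied by the MTTF."""
    return {"c_p": c_p, "c_u": c_u, "alpha": alpha,
            "eta": mttf / gamma(1.0 + 1.0 / alpha)}

def M(comp, x):
    """ASSUMED failure cost M_i(x) = c_U * (x / eta)^alpha."""
    return comp["c_u"] * (x / comp["eta"]) ** comp["alpha"]

def individual_optimum(comp, setup, k_av):
    """Step 0: (x*_{i,Av}, Phi*_{i,Av}) minimising Equation 21.1 for k = k_av."""
    fixed = comp["c_p"] + setup / k_av
    ratio = fixed / (comp["c_u"] * (comp["alpha"] - 1.0))
    x_opt = comp["eta"] * ratio ** (1.0 / comp["alpha"])
    return x_opt, (fixed + M(comp, x_opt)) / x_opt

def argmin_convex(f, lo, hi, iterations=100):
    """Ternary search for the minimiser of a convex function on [lo, hi]."""
    for _ in range(iterations):
        a, b = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if f(a) < f(b):
            hi = b
        else:
            lo = a
    return 0.5 * (lo + hi)

def plan_groups(comps, setup, horizon, k_av):
    """Return [(time, [component indices]), ...] for the horizon [0, horizon)."""
    n = len(comps)
    opt = [individual_optimum(c, setup, k_av) for c in comps]  # Step 0
    age, t0, schedule = [0.0] * n, 0.0, []
    while True:
        # Step 1: individual due times t*_i, sorted in increasing order.
        due = [t0 + opt[i][0] - age[i] for i in range(n)]
        order = sorted(range(n), key=lambda i: due[i])
        best = None
        for k in range(1, n + 1):
            group, rest = order[:k], order[k:]
            # Step 2: stop when an earlier activity would be due a second time.
            if k > 1 and due[group[-1]] > min(due[i] + opt[i][0] for i in group[:-1]):
                break

            def c1(t):  # Equation 21.2 for the current candidate group
                return setup + sum(
                    comps[i]["c_p"] + M(comps[i], t - t0 + age[i])
                    - M(comps[i], age[i]) + (horizon - t) * opt[i][1]
                    for i in group)

            # Equation 21.3: activities left for later groups.
            c2 = sum(
                comps[i]["c_p"] + setup / k_av + M(comps[i], opt[i][0])
                - M(comps[i], age[i]) + (horizon - due[i]) * opt[i][1]
                for i in rest)
            # Step 3: best execution time for this candidate group.
            t = argmin_convex(c1, t0, max(t0, due[group[-1]]))
            cost = c1(t) + c2
            if best is None or cost < best[0]:
                best = (cost, t, group)
        _, t, group = best
        if t >= horizon:
            return schedule
        schedule.append((t, sorted(group)))
        # Step 4: reset ages of executed activities, age the others, move on.
        age = [0.0 if i in group else age[i] + t - t0 for i in range(n)]
        t0 = t

# Rows 3, 11, 12, 13 and 14 of Table 21.2 (C_P, C_U, MTTF, alpha).
bogie = [make_component(*row) for row in [
    (680, 6230, 1.33, 3.5), (424, 5786, 1.61, 3.5), (384, 5711, 1.61, 3.5),
    (384, 5711, 1.78, 3.5), (184, 5336, 1.78, 3.5)]]
schedule = plan_groups(bogie, setup=3000.0, horizon=15.0, k_av=3)
for t, group in schedule[:3]:
    print(f"t = {t:.3f}: components {group}")
```

The improvement described above corresponds to counting, for each activity, the sizes of the groups in which it appears in the returned schedule, averaging them, and calling plan_groups again with the updated value; the sketch would need a per-activity k_av for that refinement.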

Table 21.1. Snapshot of FMECA for bogie components

# | Component | Function | Failure type | Failure effect
1 | Torsion bar and lever, motor bogie | Anti roll device | Crack | Potential reduction of antitilting
2 | ZF-Ecomat 5HP600 | Transmission between motor and axle gear | Wear and tear | Defect of gear
3 | Flexible coupling bearing (CENTA) | Coupling between diesel engine/gear | Wear and tear | Worn out bearing –> vibrations
4 | Deep groove ball | Power transfer | Wear and tear | Worn out bearing
5 | Aeration valve | Pressure balance | Locked | Problems with fuel oil filling
6 | Torque reaction arm | Torque reaction link | Wear and tear | Fissure and damaged rubber of silent blocks
7 | Diesel engine Cummins N14-R | Actuation of half train set | Wear and tear | Functional failure or lower compression of engine
8 | Engine attachment (bearing NS3.59) | Engine seat | Wear and tear | Worn out bearing
9 | Plant frame bearing (NS3.61) | Damping of vibrations | Wear and tear | Worn out bearing
10 | Primary damper | Absorbing the vibration between axle box and bogie | Functional failure | Reduced dynamic characteristics
11 | Horizontal damper, motor bogie | Absorbing the vibration between bogie and car body | Functional failure | Reduced dynamic characteristics
12 | Horizontal damper, motor bogie | Absorbing the vibration between bogie and car body | Functional failure | Reduced dynamic characteristics
13 | Vertical damper, motor bogie | Absorbing the vibration between bogie and car body | Functional failure | Reduced dynamic characteristics
14 | Vertical damper, motor bogie | Absorbing the vibration between bogie and car body | Functional failure | Reduced dynamic characteristics
15 | Longitudinal car body damper | Absorbing vibrations between car bodies | Functional failure | Reduced dynamic characteristics
16 | Brake beam support bush | Fixing pin for brake beam | Wear and tear | Increased gap between pin and bush
17 | Bush for brake pad link | Reduction of wear between bolts brake support | Wear and tear | Increased gap between pin and bush
18 | Bush for brake unit | Reduction of wear between bolts and brake unit support | Wear and tear | Increased gap between pin and bush
19 | Cylindrical roller bearing | Bearing rotor actuation of generator | Wear and tear | Rotor of generator blocks
20 | Cardan shaft | Power transmission from gear box to | Wear and tear | Fracture joint bearing

The procedure is demonstrated by analyzing components in a train bogie. A
snapshot of the corresponding FMECA is presented in Table 21.1.
Table 21.2 gives cost figures for the bogie components. All failure times are
assumed to be Weibull distributed, where we specify the mean time to failure
(MTTF, given in million kilometres), and the aging (shape) parameter α. The
parameter values have been established in cooperation with NSB experts. However,
some of the parameters have been deliberately modified for competitive reasons.
The example is thus realistic, but no single figure should be regarded as
approved by NSB. The format and quality of the available data within the main-
tenance organization of NSB is currently not compatible with requirements for
estimating aging parameters or fitting parametric distributions. The shape para-
meters have therefore been established on a very qualitative understanding of failure
mechanisms, and the Weibull distribution has been chosen due to convenience
considerations. Set-up costs are assumed to be 3000 Euros for all activities. We
assume a standard age replacement model, but it is easy to adapt it to more complex
situations where we, for example, combine inspection and replacement upon
condition rather than age (see, e.g. Podofillini et al. 2006 for an example model).
In Step 0 of the algorithm we first set ki,Av = 13 for all activities, meaning
that we initially believe that on average more than half of the activities are included
in each execution of a maintenance group. We then use Equation 21.1 to find x*i,Av
for each activity. The result is shown in
Table 21.2. The values of Φ*i,Av are not presented here. The time horizon is set to T
= 15 million kilometres.
In Step 1 we calculate the optimum of each individual activity, t*i = x*i,Av + t0 – xi.
In the example we have assumed that initially all xi’s are zero (a new train), and
since t0 also is zero initially, we simply have t*i = x*i,Av. These values are sorted in
Table 21.3 (values given in million kilometres).
In Step 2 we establish candidate groups. For k =12 we note that t*12 > t*1 + x*1,Av
which means that we only process candidate groups with k < 12.
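
With the sorted values of Table 21.3 and all ages equal to zero, the Step 2 test can be written out explicitly. Because t*i = x*i,Av initially and the values are sorted, the smallest sum t*i + x*i,Av is obtained for the first activity in the list:

$$
t^*_{12} = 1.375 > t^*_1 + x^*_{1,Av} = 0.674 + 0.674 = 1.348,
\qquad
t^*_{11} = 1.221 < 1.348
$$

Hence k = 11 is the largest candidate group, which agrees with the last evaluated row of Table 21.3.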
In Step 3 we calculate c(t,k), and the minimum values are shown in Table 21.3.
The minimum is found for k = 10. Further c(t,10) has its minimum for t* = 0.829
million kilometres. We observe that for those activities included in the first group,
the t*i -values are rather close to 0.829 million kilometres.
In Step 4 we now proceed, and set xi to 0 for those activities which are executed
(i.e. i ≤ 10), whereas xi = xi + 0.829 million kilometres for i > 10. Finally we set t0
= 0.829 million kilometres before we go to Step 1 again. The next group of activities
is similarly found to be executed at t* = 1.606 million kilometres. This next
group comprises some activities not included in the first group, but also some
activities that were executed in the first group and are now executed for the second
time. We proceed until t0 > 15.
When the procedure terminates, we have a total cost of 1.2 million Euros. We
have also recorded the average values of ki,Av, which in this example range from
13.5 to 17, slightly higher than the initial assessment of ki,Av = 13. By
repeating the entire procedure with the new values for ki,Av a small reduction in costs
of 1% is obtained.

Table 21.2. Cost figures and reliability parameters

# CP (€) CU (€) MTTF (10^6 km) Aging, α x*i,Av (10^6 km)

1 960 6,740 2.56 3.5 1.38
2 9,600 22,400 3.33 3 2.48
3 680 6,230 1.33 3.5 0.67
4 632 5,960 2.22 3.5 1.12
5 720 6,320 10.00 2 4.76
6 400 5,720 2.11 3.5 0.98
7 37,000 72,500 2.00 3.5 7.90
8 520 5,960 4.17 3.5 2.01
9 780 6,440 12.50 3.5 6.46
10 664 6,236 1.60 3.5 0.80
11 424 5,786 1.61 3.5 0.75
12 384 5,711 1.61 3.5 0.74
13 384 5,711 1.78 3.5 0.82
14 184 5,336 1.78 3.5 0.74
15 600 6,116 1.78 3.5 0.88
16 1,440 7,580 2.67 3.5 1.53
17 4,060 12,590 2.67 3.5 1.77
18 1,160 7,130 2.67 3.5 1.48
19 6,080 16,220 1.61 2.5 1.22
20 6,400 16,700 1.33 3.5 0.93

21.3.3 Opportunity Based Maintenance

The dynamic scheduling regime presented above is a good basis for opportunity
based maintenance. The scheduling we have proposed may be used to set up an
explicit maintenance plan for the time horizon [0, T). But even though the plan
exists, we may consider changing it as new information becomes available, either
in terms of new reliability parameter estimates, or if unforeseen failures occur. In
operation, for any time t0 we may update the scheduling of preventive mainten-

Table 21.3. Results for the first maintenance group

# Activity t*i (10^6 km) k c(t*,k) (10^6 €) t* (10^6 km)

3 PM 0.674 1 1.2009 0.659
6 PM 0.740 2 1.2007 0.682
10 PM 0.742 3 1.2005 0.690
11 PM 0.751 4 1.2002 0.700
12 PM 0.805 5 1.2000 0.718
13 PM 0.819 6 1.1998 0.728
14 PM 0.879 7 1.1996 0.743
15 PM 0.932 8 1.1995 0.814
20 PM 0.979 9 1.1993 0.820
1 PM 1.120 10 1.1991 0.829
9 Wait 1.221 11 1.1993 0.872
18 Wait 1.375 12 . .
2 Wait 1.475 13 . .
16 Wait 1.534 14 . .
5 Wait 1.769 15 . .
17 Wait 2.013 16 . .
7 Wait 2.483 17 . .
8 Wait 4.760 18 .
19 Wait 6.461 19 .
4 Wait 7.904 20 .

Upon a failure requiring the set-up costs to be paid, it is rather obvious that
activities that already were due if they were treated individually according to
Equation 21.1 should be executed upon this opportunity. Further, activities not
scheduled in the next group (maintenance package) should not be executed since
they were not even included in a group to be executed later than the time of this
opportunity. The basic question is thus which of the remaining activities in the next
due group that should be executed at this opportunity. Let Kk be the set of k
activities in this group. Assume that we have found that it is favourable to execute
the first i–1 < k activities on this opportunity. The procedure to test whether or not
activity i also should be executed is as follows:
• First perform a scheduling by starting at Step 1 in Section 21.3.2, assuming
that all activities up to i are executed on this opportunity, i.e. xj = 0 for
j ≤ i, and xj is set to the time since activity j was executed for j > i.
• Let C1 be the minimum value of c(t,k) obtained in Step 3 plus the marginal
cost, ciP, of executing activity i.
• Next, we assume that only activities up to i–1 are executed, i.e. xj = 0 for j ≤ i–1,
and xj is set to the time since activity j was executed for j ≥ i.
• Let C2 be the minimum value of c(t,k) obtained in Step 3 this second time.
• If C1 > C2 it is not beneficial to do activity i.
If it was beneficial to do activity i at t0 we should test for i = i+1 as long as i ≤ k.
The procedure is demonstrated by the following example.
We assume that a failure occurs at time t = 0.8 million km. From Table 21.3 we
observe that the first 10 activities were scheduled for execution at time 0.829
million km. Since the set-up costs are already paid by the corrective activity, it is
obvious that the first four activities, i.e. those with individual optimum less than
t = 0.8 million km, should be done. Then we test whether activity 5 (t*5 = 0.805)
should be done at this opportunity. We calculate C1 = 1.188267 million Euros and
C2 = 1.188274 million Euros, hence activity 5 should be done. Then we proceed
similarly, and find that activity 6 should also be executed. For activity 7 (t*7 = 0.879)
we find that it is not cost effective to execute this activity. Since the first six
activities have been executed upon this opportunity, the next planned maintenance
can be postponed from the original t = 0.829 million km to t = 0.985 million km.
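
The opportunity test described in this section can be sketched as a short loop. The callback min_schedule_cost is hypothetical: it stands for rerunning Steps 1 to 3 with the ages reset for the activities executed now and returning the minimum of c(t,k). The toy cost function at the end is made up purely to exercise the loop and carries no engineering meaning.

```python
def advance_on_opportunity(group, n_due, c_p, min_schedule_cost):
    """Select the activities of the next due group to execute at an opportunity.

    group: activity indices of the next due group, sorted by individual due time.
    n_due: number of leading activities that are already individually due.
    c_p: mapping from activity index to its planned cost c_i^P.
    min_schedule_cost: hypothetical callback (see lead-in) taking the list of
        activities executed now and returning the minimum of c(t, k).
    """
    executed = list(group[:n_due])
    for i in group[n_due:]:
        c1 = min_schedule_cost(executed + [i]) + c_p[i]  # activity i done now
        c2 = min_schedule_cost(executed)                 # activity i postponed
        if c1 > c2:
            break  # not beneficial; later activities are not tested
        executed.append(i)
    return executed

# Made-up cost function: every activity done now lowers the future cost by 10.
toy_cost = lambda done: 100.0 - 10.0 * len(done)
chosen = advance_on_opportunity(
    group=[0, 1, 2, 3], n_due=1,
    c_p={0: 1.0, 1: 5.0, 2: 5.0, 3: 20.0},
    min_schedule_cost=toy_cost)
print(chosen)
```

Stopping at the first activity that fails the test mirrors the sequential procedure in the text, where activity i+1 is only tested if activity i was found beneficial.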

21.4 Case Study 2

21.4.1 Prioritization of Major Maintenance and Renewal Projects

The infrastructure manager usually has a limited budget for maintenance and re-
newal of the railway network. This calls for a structured approach to prioritization
of possible projects. In this section we discuss a portfolio approach to larger
projects, in contrast to the situation in Section 21.3 where the scheduling of periodic
activities was discussed. Examples of such larger projects are:
• Ballast cleaning when the ballast is polluted and stones are crushed
• Rail grinding when the rail surface is rough
• Tamping and leveling when track geometry is degraded
• Sandblasting of bridges exposed to corrosion
• Renewal of overgrown ditches
• Point replacement of rails, e.g. in curvatures with high wear factor

The challenge is to schedule the candidate projects proposed by the local
railway departments. Scheduling here means to decide which projects to include in
the renewal plan for the next 10 years, and the order of executing the proposed
projects. JBV requires that all candidate projects are subject to a cost-benefit
analysis (CBA). For such projects we need to consider a time span of several
decades; hence it is natural to calculate the net present value (NPV) as a basis for
CBA. The CBA figures will only be used as input to the decision process, since
there may be considerations other than the pure CBA figures that are taken into
account when projects are selected.


ρC/B Cost-benefit ratio, i.e. the net present value of the benefits divided by
the net present value of the costs of the project
{RC(t)} Portfolio costs of renewals without the project
{RC*(t)} Portfolio costs of renewals with the project
{T*} Set of renewal times with the project
{T} Set of renewal times without the project
c(t) Time dependent cost at point of time t (from now)
c*(t) Time dependent cost when a maintenance or renewal project is executed
d Factor to describe increase in time dependent cost due to degradation,
i.e. the increase from one year to another is d × 100%
LCC Life cycle cost
N Calculation period for net present value calculations
r Discount rate
RIF Risk influencing factor, i.e. a factor that influences the risk level
RLT Residual lifetime without the project
RLT* Residual lifetime with the project

21.4.2 Model Formulation Related to Case Study 2

The basic situation is that the railway infrastructure is deteriorating as a function of
time and operational load. This deterioration may be transformed into cost func-
tions, and when the costs become very large it may be beneficial to maintain or
renew the infrastructure. In the following we introduce the notation c(t) for the
time dependent costs as a function of time. In c(t) we include costs related to (i)
punctuality loss, (ii) accidents, and (iii) extra maintenance and operation due to
reduced track quality. By executing a maintenance or renewal project we typically
reset the time dependent cost function c(t), either to zero, or at least a level sig-
nificantly below the current value. Thus, the operating costs will be reduced in the
future if we execute the maintenance or renewal project.
Figure 21.1 shows the savings in operational costs, c(t)–c*(t), if we perform
maintenance or renewal at time T. In addition to the savings in operational costs,
we will also often achieve savings due to an increased “residual lifetime”.
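
To illustrate how the savings c(t)–c*(t) enter a cost-benefit figure, the sketch below discounts the yearly savings over the calculation period and divides by the project cost. It is deliberately simplified and rests on assumptions: the time dependent cost is taken to grow as f(1+d)^(t–1), the project cost is paid now, and the savings from an increased residual lifetime are ignored. The function names and all numbers are hypothetical.

```python
def degrading_cost(f, d, years):
    """Yearly cost c(t) = f * (1 + d)^(t - 1) for t = 1..years (assumed form)."""
    return [f * (1.0 + d) ** (t - 1) for t in range(1, years + 1)]

def npv(yearly_amounts, r):
    """Net present value of amounts falling due at the end of years 1, 2, ..."""
    return sum(x / (1.0 + r) ** t for t, x in enumerate(yearly_amounts, start=1))

def cost_benefit_ratio(c_without, c_with, project_cost, r):
    """NPV of the savings c(t) - c*(t) divided by a project cost paid now."""
    savings = [a - b for a, b in zip(c_without, c_with)]
    return npv(savings, r) / project_cost

# Hypothetical project: the current cost level drops from 100 to 20 per year.
c_without = degrading_cost(f=100.0, d=0.10, years=10)
c_with = degrading_cost(f=20.0, d=0.10, years=10)
ratio = cost_benefit_ratio(c_without, c_with, project_cost=400.0, r=0.05)
print(f"cost-benefit ratio = {ratio:.2f}")
```

A ratio above one indicates that the discounted savings exceed the project cost under these assumptions.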

[Figure 21.1. Cost savings. Costs versus time, showing the curves c(t) and c*(t), the renewal costs at time T and the resulting savings.]

Special attention will be paid to projects that aim at extending the lifelength of
a railway system. A typical example is rail grinding for lifelength extension of the
rail, but the fastenings, sleepers and ballast will also benefit from the rail
grinding. Figure 21.2 shows how a smart activity may suppress the increase
in c(t) and thereby postpone the point of time when the costs explode and a renewal
is necessary.
From a modelling point of view the situation is rather complex because
different projects are interconnected. For example, by executing a ballast cleaning
project the track quality is increased, reducing the need for tamping and leveling.
On the other hand, by tamping and point-wise supplement of ballast in pumping
areas (surface water) we may postpone the much more expensive ballast cleaning.
A third factor to take into account is the fact that for each tamping cycle there is
some stone crushing, and hence we should also be reluctant to do too much
tamping. Despite the fact that railways have existed for over 160 years there is a
lack of documented mathematical models describing the interaction between
different components in the railway, and the effect of the various maintenance
activities. When developing a tool for prioritization it has therefore been necessary
to base the model on model parameters specified by the maintenance planners and
their experts. In the future, it is planned to improve the models based on the
findings from a joint research project between Norway and Austria.
In the following we describe the basic input for performing the cost benefit
analysis. The numerical calculations are supported by a computerized tool (PriFo).

Qualitative Information

The situation leading up to each proposed project is described. This is typically
information from measurements and analysis of track quality, trends, etc. It is
important to describe the situation qualitatively before any quantitative parameters
are assessed. It is, however, a great challenge to transform the qualitative problem
description into quantitative numbers. In the future this can be supported by the
expected results from various research projects on deterioration models.

Safety Related Information

A general risk model has been derived in which important risk influencing factors
(RIFs) have been identified. The RIFs relate both to the accident frequency, such as
the number of cracks in the rails, and to the accident consequences, such as speed,
terrain description, etc.
Table 21.4 shows an example related to the derailment frequency. In the
modelling, f0 corresponds to the “average” derailment frequency related to rail
problems. The value of f0 is found by analysing statistics over derailments in
Norway, where we find f0 = 3 × 10–4 per kilometre per year.

Figure 21.2. Lifelength extension (vertical axis: variable cost; marked point: smart
maintenance activity, e.g., rail grinding)

The variation width (w) in Table 21.4 shows the maximum negative or positive
effect of each RIF. In this model the values of the various RIFs are standardised,
which means that –1 represents the “worst value” of the RIF, 0 represents the “base
case”, and +1 represents the “best value” of the RIF. The interpretation of w is as
follows: if one RIF equals –1, then the derailment frequency is w times higher than
for the base case, and if the RIF equals +1 then the derailment frequency is w times
lower than for the base case. Assuming that the various RIFs act independently of
each other, an influence model for the derailment frequency may be written

\[ f = f_0 \prod_i w_i^{-\mathrm{RIF}_i} \qquad (21.5) \]

where wi is the variation width of RIF number i, and RIFi is the value of RIF
number i. By using Equation 21.5 with the generic weights from Table 21.4, we
may easily assess the derailment frequency only by assessing the values of the
RIFs for a given railway line or section.
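The influence model of Equation 21.5 can be illustrated with a minimal Python sketch. The variation widths and f0 are taken from Table 21.4 and the text; the dictionary keys, the function name and the rule that unlisted RIFs sit at the base case (value 0) are assumptions made here for illustration and are not part of the PriFo tool.

```python
from math import prod

# Variation widths w_i from Table 21.4 (the short key names are chosen here).
VARIATION_WIDTH = {
    "cracks": 4.0,
    "rail_quality": 2.0,
    "gradient": 2.0,
    "sleepers_ballast_fastening": 2.0,
    "fixed_points": 1.5,
    "horizontal_geometry": 1.5,
}

F0 = 3e-4  # average derailment frequency per kilometre per year (from the text)


def derailment_frequency(rif_values, f0=F0, widths=VARIATION_WIDTH):
    """Equation 21.5: f = f0 * prod_i w_i ** (-RIF_i).

    rif_values maps a RIF name to its standardised value in [-1, 1].
    RIFs that are not listed are assumed to be at the base case (0),
    which contributes a factor of 1.
    """
    for name, value in rif_values.items():
        if name not in widths:
            raise KeyError(f"unknown RIF: {name}")
        if not -1.0 <= value <= 1.0:
            raise ValueError(f"RIF value for {name} must lie in [-1, 1]")
    return f0 * prod(widths[name] ** (-value) for name, value in rif_values.items())


# Worst-case crack situation, everything else at the base case:
# the frequency is w = 4 times the average.
f_worst_cracks = derailment_frequency({"cracks": -1.0})
```

Because the RIFs enter multiplicatively, a bad value of one factor can be offset by a good value of another, which is a direct consequence of the independence assumption stated above.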
In addition to the current value of the risk, the future increase also has to be
described, corresponding to the two cost curves c(t) and c*(t) in Figure 21.1. For
example, we might use an exponential growth of the form $c(t) = f(1+d)^{t-1}$, where d
is the degradation from one year to the next. The rationale behind an exponential
growth is that the forces driving the track deterioration are often assumed to be
proportional to the deviation from an ideal track. A simple differential equation
argument then gives an exponential growth.

Punctuality Information

The basic punctuality information to be specified is the ordinary speed for the line,
and any speed reductions due to the degradation the project is intended to fight
against. Based on the amount of speed restrictions it is rather easy to calculate the
corresponding train delay minutes. Very often such delays cause cascading effects
in a tight network.

Table 21.4. Example of effect of risk influencing factors

Risk influencing factor, RIF                  Variation width, w

Number of failures/cracks                     4
Rail quality (age, type, rail profile)        2
Gradient                                      2
Quality of sleepers, ballast and fastening    2
Number of fixed points with narrow filling    1.5
Horizontal geometry                           1.5

Such effects cannot be assessed unless we have a good understanding of the
network capacity, and the possibilities for change of crossings, etc. In Norway,
where most lines are single-track lines, a change of crossing may cause large
disturbances in the network. It may also be possible to catch up with a delay if
there is slack in the schedule.

Maintenance and Operating Information

The degradation of the permanent way will very often require extra maintenance
and operating costs. Examples of such costs are extra runs of the measurement car,
extra line inspections, use of alternative transportation such as buses, shorter
lifetime of influenced components, etc. These costs need to be quantified in the
model. Describing the change in maintenance and operating costs is very
challenging because short-term and long-term activities interact. It is possible to
perform explicit modelling of such interactions if we have a good understanding of
the physical deterioration. Welte et al. (2006) have, for example, used a Markov state
model to model degradation and the effect of different inspection and renewal
strategies.

Residual Lifelength

To calculate the economic gain due to increased lifelengths, the residual
lifelength must be described both if the proposed project is executed, RLL*,
and if the project is not executed, RLL.

Project Costs

The project costs are specified for each year in the project period.

Cost Parameters

A set of general cost parameters is common to all projects. For JBV these are:
• The discount rate is r = 4%. Note that the discount rate is here defined
as the difference between the interest rate and the inflation rate.
• Monetary values for safety consequence classes as given in Table 21.5.
• Costs per kiloton freight delayed 1 min = 160 Euros.
• Costs per passenger delayed 1 min = 0.4 Euros. A train with 250 passengers
then gives 100 Euros per minute delayed.

Table 21.5. Monetary values in Euros for each safety consequence class

Safety consequence          Monetary value (€)

C1  Minor injury            2 000
C2  Medical treatment       33 000
C3  Serious injury          330 000
C4  1 fatality              1.7 million
C5  2-10 fatalities         11 million
C6  > 10 fatalities         175 million

21.4.3 LCC Calculations

A life cycle cost (LCC) perspective will be taken with respect to calculating the
cost benefit ratio for the different projects. This includes a net present value
analysis, taking the following aspects into consideration:
• Change in variable costs, c(t)
• The effect of extending the lifelength
• The project costs

Change in Variable Costs

The variable cost contributions from the dimensions safety, punctuality, and
maintenance and operation can be treated similarly from a methodical point of view.
Let c(t) denote the variable costs in year t (from now) if the project is not
executed, and similarly let c*(t) be the cost if the project is run. See Figure 21.1
for an illustration. For example, for the safety dimension we have

\[ \Delta LCC_S = \sum_{t=1}^{N} \left[ c(t) - c^*(t) \right] (1+r)^{-t} \qquad (21.6) \]

where r is the discount rate, and N is the calculation period. N is here the residual
lifelength (RLL) if nothing is done. This means that we compare the situation with
and without the project in the period from now until we have to do something in any
case. Similarly we obtain the change in punctuality costs, ∆LCCP, and the change in
maintenance and operational costs, ∆LCCM&O.
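The discounted sum over the calculation period can be evaluated directly for any pair of cost curves. The following Python sketch is one possible way to do this; the function names are chosen here for illustration, and the exponential cost curve is the growth form suggested earlier in the chapter, not a requirement of the method.

```python
def variable_cost(first_year_cost, d, t):
    """Exponential growth c(t) = c1 * (1 + d) ** (t - 1) for year t = 1, 2, ..."""
    return first_year_cost * (1.0 + d) ** (t - 1)


def delta_lcc(cost_without, cost_with, r, n_years):
    """Discounted difference between two yearly cost curves.

    cost_without(t) is c(t), cost_with(t) is c*(t); both are callables taking
    the year t = 1..N. r is the discount rate and n_years is the calculation
    period N (the residual lifelength if nothing is done).
    """
    return sum(
        (cost_without(t) - cost_with(t)) * (1.0 + r) ** (-t)
        for t in range(1, n_years + 1)
    )


# Hypothetical illustration: a constant saving of 100 per year over 5 years
# at a 4% discount rate.
saving = delta_lcc(lambda t: 100.0, lambda t: 0.0, r=0.04, n_years=5)
```

Passing the cost curves as callables keeps the same routine usable for the safety, punctuality, and maintenance and operation dimensions.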
To calculate Equation 21.6 we may in some special situations find closed
formulas. For example, if c(t) is constant, i.e. c(t) = c, the formula for the sum of a
geometric series yields

\[ \sum_{t=1}^{N} c\,(1+r)^{-t} = c \left[ \frac{1 - (1+r)^{-N}}{r} \right] \qquad (21.7) \]

Further, if c(t) the first year is c1 and c(t) increases by a factor (1+d) each year we
obtain

\[ \sum_{t=1}^{N} c_1 (1+d)^{t-1} (1+r)^{-t} = c_1 \left[ \frac{1 - \left( \frac{1+d}{1+r} \right)^{N}}{r-d} \right] \qquad (21.8) \]

The Effect of Extending the Lifelength

To motivate the calculation, Figure 21.3 sketches the need for renewal both with
and without the proposed project.
We now let:
{RC(t)} = Portfolio costs of renewals without the project
{RC*(t)} = Portfolio costs of renewals with the project
{T} = Set of renewal times without the project
{T*} = Set of renewal times with the project.
The cost contribution related to increased residual lifetime may now be found from

\[ \Delta LCC_{RLT} = \sum_{t \in \{T\}} RC(t)\,(1+r)^{-t} - \sum_{t \in \{T^*\}} RC^*(t)\,(1+r)^{-t} \qquad (21.9) \]

The Project Costs

The LCC contribution from the project cost, LCCI, is the net present value of the
project costs in the project period. The project costs may be spread over some
years, and hence we have to calculate the NPV of the project cost profile.

Total LCC Contribution

The total gain in terms of life cycle costs is

\[ \Delta LCC = \Delta LCC_S + \Delta LCC_P + \Delta LCC_{M\&O} + \Delta LCC_{RLT} - LCC_I \qquad (21.10) \]

The cost benefit ratio, or more precisely the benefit cost ratio, is given by

\[ \rho_{C/B} = \frac{\Delta LCC_S + \Delta LCC_P + \Delta LCC_{M\&O} + \Delta LCC_{RLT}}{LCC_I} \qquad (21.11) \]

Figure 21.3. Renewals if and if not the project is executed

21.4.4 Illustrative Example

As a calculation example we consider a rail-grinding project. Grooves and wave
formations have a strong impact on the track and rolling stock through increased
dynamic loads and vibrations. This in turn shortens the lifelength of the rails, the
sleepers, fastenings and ballast. Increased noise and energy consumption, and lower
comfort, can also be expected.
A 160-km section on the Rauma line in Norway has rails that are 40–50 years old,
and rail grinding is recommended primarily to extend the lifelength of the rails.

Safety Costs

The derailment frequency due to rail breakages is estimated at 0.01 per year. For
the most severe consequences we have the following distribution: P(C4) = 13.5%,
P(C5) = 11% and P(C6) = 5%, where the consequence classes are explained in Table
21.5. The material damage cost given a derailment is estimated at 1,300,000 Euros.
Thus the yearly “safety cost” is found to be 0.01 × (0.135 × 1.7 + 0.11 × 11 + 0.05
× 175 + 1.3) million Euros, which is roughly 110,000 Euros. It is further expected
that the rate of rail breakages leading to derailments will increase by a factor
d = 7% each year if no grinding is performed. If the grinding project is executed,
the derailment probability the first year is assumed to be reduced by 50%, and the
deterioration factor is also assumed to be reduced to d = 3% each year. By
utilizing Equation 21.8 we have the following contribution to the safety part of the
LCC (the calculation period is set to N = 5 years, which is the expected residual
life of the rails if no grinding is performed):

\[ \Delta LCC_S = 110\,000 \left[ \frac{1 - \left( \frac{1+0.07}{1+0.04} \right)^{5}}{0.04-0.07} \right] - 55\,000 \left[ \frac{1 - \left( \frac{1+0.03}{1+0.04} \right)^{5}}{0.04-0.03} \right] \approx 300\,000 \text{ €} \]
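The safety contribution can be reproduced numerically. The sketch below applies the closed formula of Equation 21.8 to the figures quoted above and cross-checks the first term against a direct year-by-year summation; the function and variable names are chosen here for illustration.

```python
def npv_growing_cost(c1, d, r, n_years):
    """Closed form of Equation 21.8 for a cost growing by (1 + d) each year.

    c1 is the first-year cost, r the discount rate, n_years the period N.
    The formula is undefined for r == d, so that case is rejected.
    """
    if r == d:
        raise ValueError("Equation 21.8 is undefined for r == d")
    return c1 * (1.0 - ((1.0 + d) / (1.0 + r)) ** n_years) / (r - d)


R, N = 0.04, 5  # discount rate and calculation period from the text

without_grinding = npv_growing_cost(110_000, 0.07, R, N)
with_grinding = npv_growing_cost(55_000, 0.03, R, N)
delta_lcc_s = without_grinding - with_grinding  # roughly 0.3 million Euros

# Cross-check of the closed form against the explicit discounted sum.
direct_sum = sum(110_000 * 1.07 ** (t - 1) * 1.04 ** (-t) for t in range(1, N + 1))
```

The same function, called with the punctuality figures, gives the corresponding punctuality contribution.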
Punctuality Costs

Due to a high number of cracks it is recommended to reduce the speed from 80 to
70 km/h for a section of 20 km. The speed reduction corresponds to a 2-minute
increase in travelling time. Slightly more than a thousand passengers travel this
line per week, thus the yearly delay time cost is in the order of 50,000 Euros. In
addition, there are freight delay time costs in the order of 60,000 Euros the first
year. An increase in the speed restriction of d = 10% per year is expected if the
grinding project is not executed. If the grinding project is executed, we may relax
the speed restriction, yielding a punctuality loss the first year of 40,000 Euros,
and then a yearly increase of d = 3%. Again, utilizing Equation 21.8 we have

\[ \Delta LCC_P = 110\,000 \left[ \frac{1 - \left( \frac{1+0.10}{1+0.04} \right)^{5}}{0.04-0.10} \right] - 40\,000 \left[ \frac{1 - \left( \frac{1+0.03}{1+0.04} \right)^{5}}{0.04-0.03} \right] \approx 400\,000 \text{ €} \]

Maintenance and Operation Costs

From different studies it is found that rail grinding every 40 megatons reduces the
wear of other components (sleepers, ballast and fastenings) by an amount
corresponding to 3 Euros per metre per year. This corresponds to a yearly (fixed)
cost of 480,000 Euros for the actual 160 km section. Using Equation 21.7 with N = 5
this corresponds to an NPV of 2.1 million Euros. The reduction in critical cracks
that have to be fixed is estimated at 10 per year, and with a cost of 2500 Euros per
crack this gives an NPV of 110,000 Euros. Finally, extra yearly ultrasonic
inspection accounts for 12,000 Euros per year, corresponding to an NPV of 50,000
Euros. The total extra maintenance and operation costs are therefore found to be
almost 2.3 million Euros.

Extended Lifelength

With the rail grinding project it is assumed that the rails may be kept going for
another 15 years, whereas a rail renewal is expected after 5 years if the project is
not run. The lifelength of new rails is approximately 40 years. The cost of new
rails is in the order of 250 Euros per metre. The LCC contribution is thus the
difference between changing the rails in 5 years, 45 years, 85 years, etc. and
changing the rails in 15 years, 55 years, 95 years, etc. With a discount rate of
r = 4% it is sufficient to count only the first two renewals, hence:

\[ \Delta LCC_{RLT} = 250 \times 160\,000 \left[ 1.04^{-5} + 1.04^{-45} - 1.04^{-15} - 1.04^{-55} \right] \approx 12.9 \text{ million €} \]

Project Costs

The cost of rail grinding is in the order of 8 Euros per metre, giving a total cost of
1.3 million Euros. In addition we have to expect a second grinding within 5–10
years, giving an additional contribution. The net present value of the grinding
activity is then 2.2 million Euros.

Cost Benefit Ratio

Summing up we find the following contributions to the change in LCC (million
Euros):

∆LCCS = 0.3
∆LCCP = 0.4
∆LCCM&O = 2.3
∆LCCRLT = 12.9
LCCI = 2.2

This yields a cost benefit ratio of ρC/B = 7.2, meaning that for each Euro put
into rail grinding, the payback is about 7 Euros.
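The dominant term and the resulting ratio can be checked with a few lines of Python. This is a minimal sketch using the figures of the example: the renewal cost, renewal times and discount rate come from the text, while the function names are chosen here for illustration.

```python
def npv_of_renewals(renewal_cost, renewal_times, r):
    """Net present value of renewals of equal cost at the given years."""
    return sum(renewal_cost * (1.0 + r) ** (-t) for t in renewal_times)


R = 0.04
RENEWAL_COST = 250 * 160_000  # 250 Euros per metre over the 160 km section

# Only the first two renewals are counted, as in the text.
delta_lcc_rlt = (
    npv_of_renewals(RENEWAL_COST, [5, 45], R)     # without grinding
    - npv_of_renewals(RENEWAL_COST, [15, 55], R)  # with grinding
)


def benefit_cost_ratio(benefits, project_cost):
    """Sum of the LCC gains divided by the NPV of the project cost."""
    return sum(benefits) / project_cost


# Rounded contributions in million Euros, as listed in the summary above:
# safety, punctuality, maintenance and operation, extended lifelength.
ratio = benefit_cost_ratio([0.3, 0.4, 2.3, 12.9], 2.2)
```

The extended lifelength accounts for most of the benefit, so the ratio is most sensitive to the assumed 10-year postponement of the renewal.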
By calculating the cost benefit ratio for the various maintenance and renewal
projects, we get a sorted list of the most promising projects. In principle, we should
execute those projects having a cost benefit ratio, ρC/B, higher than one. If the
budget constraints imply that we cannot execute all projects with ρC/B higher than
one, it would be necessary to have a thorough discussion related to the budget for
maintenance and renewal. Since most organizations suffer from the short-term
cost-cutting syndrome, it is a hard struggle to argue for spending more money now
in order to save money over a five- to ten-year horizon.
Even if we cannot do much about the budget situation, we may use the results
from the cost-benefit analysis to prioritize between the various projects.
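One simple way to turn the ratios into a priority list is sketched below. The greedy rule (fund projects in order of decreasing ratio, skipping those that no longer fit the remaining budget) is an assumption made here for illustration and is not described in the chapter; apart from the rail grinding figures, the projects in the usage example are hypothetical.

```python
def prioritize(projects, budget):
    """Rank projects by benefit cost ratio and fund them while the budget lasts.

    projects is a list of dicts with keys "name", "benefit" and "cost"
    (benefit and cost as net present values in the same unit).
    Only projects with a ratio above one are considered.
    """
    ranked = sorted(projects, key=lambda p: p["benefit"] / p["cost"], reverse=True)
    selected, remaining = [], budget
    for project in ranked:
        ratio = project["benefit"] / project["cost"]
        if ratio > 1.0 and project["cost"] <= remaining:
            selected.append(project["name"])
            remaining -= project["cost"]
    return selected


candidates = [
    {"name": "rail grinding", "benefit": 15.9, "cost": 2.2},   # from the example
    {"name": "hypothetical A", "benefit": 3.0, "cost": 2.0},   # illustrative only
    {"name": "hypothetical B", "benefit": 0.8, "cost": 1.0},   # ratio below one
]
funded = prioritize(candidates, budget=5.0)
```

Such a ranking is decision support only; interactions between projects, which the chapter notes are difficult to model, are ignored by this rule.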

21.5 Conclusions
The two case studies presented elaborate on some of the challenges in Norwegian
rail maintenance. Both the railway undertaking (NSB) and the infrastructure
manager (JBV) aim at implementing more proactive strategies for maintenance and
renewal based on more formal methods such as RCM and NPV/CBA. These
methods require reliability parameters of a much higher level of detail than the
current experience databases can offer today. Therefore both NSB and JBV have
started the process of restructuring databases, and emphasize the importance of
proper failure reporting. Due to the lack of experience data it has up to now been
necessary to utilize expert judgment to a great extent. It is further important to
emphasize that optimization models like the ones presented here should be con-
sidered as decision support, rather than decision rules. In order to improve on these
areas we believe that more systematic collection and analysis of reliability data is
an important factor, and here the rail industry may learn from the offshore industry
where joint data collection exercises have been run for 25 years (OREDA 2002).
Another challenge of such modelling is the lack of consistent degradation
models. For example, for the track there is a good qualitative understanding of
factors affecting degradation such as water in the track, contamination, geometry
failures, heavy axles, etc. However, the quantitative models for degradation taking
these factors into account are not very well developed. Research has paid much
attention to design problems to ensure long service life but it is difficult to use the
research results for maintenance and renewal considerations. More empirical re-
search on degradation mechanisms will also be important in the future.

21.6 References
Bing AJ, Gross A, (1983) Development of Railroad Track Degradation Models.
Transportation Research Record 939, Transportation Research Board, National
Research Council, National Academy Press, Washington, D.C, USA.
Bogdański S, Olzak M, Stupnicki J, (1996) Numerical stress analysis of rail rolling contact
fatigue cracks. Wear 191:14–24
Braghin F, Lewis R, Dwyer-Joyce RS, Bruni S, (2006) A mathematical model to predict
railway wheel profile evolution due to wear. Accepted for publication in Wear.
Budai G, Huisman D, Dekker R. (2005) Scheduling Preventive Railway Maintenance
Activities. Accepted for publication in Journal of the Operational Research Society.
Carretero J, Perez JM, García-Carballeira F, Calderon A, Fernandez J, García JD,
Lozano A, Cardona L, Cotaina N, Prete P, (2003) Applying RCM in large scale
systems: a case study with railway networks. Reliability Engineering and System Safety
Dekker R, Wildeman RE, Van der Duyn Schouten, FA, (1997). A Review of Multi-
Component Maintenance Models with Economic Dependence. Mathematical Methods
of Operations Research, 45:411–435.
Ferreira L, Murray M, (1997) Modelling rail track deterioration and maintenance: current
practices and future needs. Transport Reviews, 17(3): 207–221.
Garey MR, Johnson DS (1979). Computers and Intractability: a Guide to the Theory of NP-
Completeness. W.H. Freeman and Company: New York.
Grassie SL (2005). Rolling contact fatigue on the British railway system: treatment. Wear
Hecke A, (1998) Effects of future mixed traffic on track deterioration. Report TRITA-FKT
1998:30, Railway Technology, Department of Vehicle Engineering, Royal Institute of
Technology, Stockholm.
Kay AJ, (1998) Behaviour of Two Layer Railway Track Ballast under Cyclic and Monotonic
Loading. PhD Thesis, University of Sheffield, UK.
Li D, Selig ET, (1995) Evaluation of railway sub grade problems. Transportation Research
Record. 1489:17–25.
Meier-Hirmer C, Sourget F, Roussignol M, (2005). Optimising the strategy of track
maintenance. Advances in Safety and Reliability – Kołowrocki (ed.) Taylor & Francis
Group, London.
OREDA, (2002) Offshore Reliability Data, 4th ed. OREDA Participants. Available from
Det Norske Veritas, NO-1322 Høvik, Norway.
Pedregal DJ, García FP, Schmid F (2004) RCM2 predictive maintenance of railway
systems based on unobserved components models. Reliability Engineering and System
Safety 83:103–110
Podofillini L, Zio E, Vatn J. Risk-informed optimization of railway tracks inspection and
maintenance procedures. Reliability Engineering and System Safety 91:20–30, 2006
Reddy V, Chattopadhyay G, Larsson-Kråik PO, Hargreaves DJ, (2006). Modelling and
analysis of rail maintenance cost. Accepted for publication in International Journal of
Production Economics.
Salim W, (2004) Deformation and degradation aspects of ballast and constitutive modeling
under cyclic loading. PhD Thesis, University of Wollongong, Australia.
Sato Y, (1995) Japanese studies on deterioration of ballasted track. Vehicle System Dynamics,
Sriskandarajah C, Jardine AKS, Chan CK (1998). Maintenance scheduling of rolling stock
using a genetic algorithm. European J. Oper. Res., 35:1–15.
Telliskivi T, Olofsson U, (2004) Wheel–rail wear simulation. Wear 257 1145–1153.
Vatn J, Podofillini L, Zio E (2003). A risk based approach to determine type of ultrasonic
inspection and frequencies in railway applications. World Congress on Railway
Research. Edinburgh, Scotland 28 September – 1 October 2003.
Veit P, Wogowitsch M, (2003) Track Maintenance based on life-cycle cost calculations. In
Innovations for a cost effective Railway Track.
Welte T, Vatn J, Heggset J, (2006) Markov state model for optimization of maintenance and
renewal of hydro power components. 9th International Conference on Probabilistic
Methods Applied to Power Systems, KTH, Stockholm, 11–15 June 2006.
Wildeman RE (1996). The art of grouping maintenance. PhD Thesis, Erasmus University
Rotterdam, Faculty of Economics.
Zakharov S, Komarovsky I, Zharov I (1998). Wheel flange/rail head wear simulation. Wear
215. 18–24
Zarembski AM, Palese JW, (2003) Risk Based Ultrasonic Rail Test scheduling: Practical
Application in Europe and North America. 6th International Conference on Contact
Mechanics and Wear of Rail/Wheel Systems (CM2003) in Gothenburg, Sweden June
10–13, 2003
Zhang YJ, Murray MH, Ferreira L, (1997). Railway track performance models: degradation
of track structures. Road and transport Research. 6(2):4–19
Zoeteman A, 2003. Life Cycle Management Plus. In Innovations for a cost effective Railway

Condition Monitoring of Diesel Engines

Renyan Jiang, Xinping Yan

22.1 Introduction
The engine is the heart of the ship, and the lubricant is the lifeblood of the engine.
Wear is one of the main causes of engine failures. It is desirable to avoid
engine breakdowns for reasons of safety and economy. This has led to an increasing
interest in engine condition monitoring and performance modeling so as to
provide useful information for maintenance decisions.
Generally, an engine goes through three phases: (i) a running-in phase with an
increasing wear rate, (ii) a normal operational phase with a roughly constant wear
rate, and (iii) a wear-out phase with a quickly increasing wear rate. The wear state
can be effectively monitored by a number of techniques. The most popular technique
is lubrication oil testing and analysis. Other techniques such as vibration and
acoustic emission analyses also provide evidence of the wear state. A more
effective way may be an integrated use of various monitoring techniques. In this
chapter we confine our attention to oil analysis.
Oil analysis techniques fall into the following three types. The first is concen-
tration analysis of wear particles in lubricant. This can be conducted in the field or
the laboratory. The second is wear debris analysis. This deals with examination of
the shape, size, number, composition, and other characteristics of the wear particles
so as to identify the wear state. This is usually conducted in the laboratory. The
third is lubricant degradation analysis. This is used to analyze physical and chemi-
cal characteristics of lubricant and determine the state of lubricant. This can be
conducted in the field or the laboratory.
To avoid the use of expensive laboratory instrumentation for wear state identi-
fication, a usual practice is to build a quantitative relation (or discriminant model)
between the condition variables (e.g. concentrations of wear particles) and the wear
state using an observation sample obtained from both field and laboratory analysis.
Once such a relation is built and verified, only field analysis is needed in practical
applications. As a result, a key issue is to develop an effective and quantitative
condition monitoring model.
In this chapter we present a case study, which deals with applying oil analysis
techniques to condition monitoring of marine diesel engines. We present a system-
atic approach to identify the important condition variables, construct a multivariate
control chart, build the quantitative relation between the condition variables and
the wear state, and establish the state discrimination criterion or critical value. The
proposed approach is formulated based on intuitive reasoning, optimization tech-
nique and real data.
The chapter is organized as follows. Section 22.2 presents a literature review on
condition-based maintenance (CBM) and its applications to diesel engines. Section
22.3 provides the background details and presents the monitoring and experimental
results. The results are analyzed and modeled in Section 22.4. Finally, we conclude
the chapter with a summary and discussion in Section 22.5.

Notation and Acronyms

AE Acoustic emission
AI Artificial intelligence
CBM Condition-based maintenance
CM Condition monitoring
CV Coefficient of variation
TBM Time-based maintenance
f(x) Pdf of X
F(x) Cdf of X
m Mean
r Correlation coefficient
V Variance
φ() Standard normal pdf
Φ() Standard normal cdf
µ, σ Distribution model parameters
and so on

22.2 CBM and its Applications to Diesel Engines

CBM is a maintenance approach, where a maintenance action is performed only
when needed. Research and development in the CBM area has been growing rapid-
ly. This section outlines the CBM concept and summarizes the relevant literature
on the applications of CBM to diesel engines.

22.2.1 CBM and its Main Constituent Elements

Traditional maintenance policies are run-to-failure and time-based maintenance
(TBM). TBM is usually carried out at regular and fixed intervals that are determined
based on experience or the recommendations of manufacturers. The CBM
decision is based on the information collected through condition monitoring (CM)
and hence it is particularly applicable to situations where maintenance and
failure are very costly.
From the viewpoint of reliability, TBM is based on traditional reliability models
and CBM on dynamic multivariate models. According to Lu et al. (2001), a
traditional reliability model is represented by a probability distribution of time to
failure of a population, which reflects the average behavior of the population’s
reliability characteristics, while a dynamic multivariate model focuses on estimating
individual system reliability under dynamic operating and environmental conditions.
As such, CBM usually deals with condition monitoring and reliability
evaluation of individual systems in a quantitative and real-time manner.
Jardine et al. (2006) provide a comprehensive literature review on the recent
research and developments in diagnostics and prognostics of mechanical systems
implementing CBM. They divide a CBM program into three main steps: data
acquisition, data processing and maintenance decision-making. Saranga (2002)
considers that a complete architecture for CBM systems should cover the range of
functions from data collection through the recommendation of specific mainten-
ance actions. He enumerates the following key functions:
• Sensing and data acquisition
• Signal processing and feature extraction
• Production of alarms or alerts
• Failure or fault diagnosis and health assessment
• Prognostics
• Projection of health profiles to future health or estimation of remaining use-
ful life
• Decision aiding
• Maintenance recommendations, or evaluation of asset readiness for a parti-
cular operational scenario
• Management and control of data flows or test sequences
• Management of historical data storage and historical data access
• System configuration management
• Human system interface
In this chapter, we summarize the CBM literature from the following five
perspectives:
• Data acquisition
• Data processing
• Diagnosis and prognostics
• Maintenance decision-making
• Computerized CBM management system

Data Acquisition

Data or information is the basis of CBM decisions. There are three main sources:
field records, CM, and expert knowledge.
Field records provide event data such as breakdown, minor repair, overhaul, oil
change, etc. CM data are the measurements related to the health condition of the
system. They can be vibration signals, acoustic signals, debris concentrations,
temperature, pressure, etc., obtained using various sensors or techniques such as
accelerometers, laser vibrometers, microphones, acoustic emission sensors,
ferrography, spectroscopy, thermography, thermocouples, etc. Finally, the knowledge
and experience of experts provide important information in determining the system
state and the importance of relevant factors, indices or measures.
CM can be continuous or intermittent. The former is often expensive and
may be inaccurate; the latter may be more cost effective and accurate but
may miss some failure events. Thus, determining the optimal monitoring (or
inspection or sampling) interval has been an important issue.

Data Processing

According to Jardine et al. (2006), CM data fall into three categories: value type,
waveform type, and image type. Data processing for value-type data is called data
analysis; data processing for waveform and image data is called signal processing;
and the procedure of extracting useful information from raw signals is called
feature extraction.
Two commonly used techniques for analyzing value-type data are trend
analysis and time series modeling. When the problem involves a number of variables,
dimension reduction becomes very important. In a CBM setting, a reliability
model with covariates can combine event data (e.g. times to failure) with CM data
(or covariates). One such model is the proportional hazards model.
Another well-known approach uses two-interval models, where the failure
process is divided into two intervals: the time interval from working state to the
initiation of the defect, and the time interval from the initiation of the defect to
failure. Moubray (1997) describes the latter as the P-F interval, where P means
potential failure and F means functional failure. Goode et al. (2000) describe the
former as the I-P interval, which is the time interval from machine installation to
its potential failure. Each of the intervals can be represented by a certain
distribution. Based on the fitted distributions and the outcomes of condition
monitoring, machine prognosis can be derived.
Waveform data analysis includes three main categories: time-domain analysis,
frequency-domain analysis and time-frequency analysis. Time-domain analysis
calculates some descriptive statistics such as mean, standard deviation, root mean
square (RMS), skewness, kurtosis, time synchronous average, etc., based on the
time waveform. More advanced approaches include time series, autoregressive and
autoregressive moving average models. The most widely used frequency-domain
analysis is spectrum analysis by means of fast Fourier transform. A typical time-
frequency analysis is the wavelet transform.
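The time-domain statistics listed above are simple to compute. The following Python sketch, using only the standard library, shows one possible set of definitions; the function name is chosen here, and the population (1/n) forms of standard deviation, skewness and kurtosis are an assumption, since the text does not specify which estimators are meant.

```python
from math import sqrt
from statistics import fmean, pstdev


def time_domain_features(signal):
    """Descriptive statistics of a sampled time waveform.

    Uses population (1/n) moments; kurtosis is the plain fourth standardised
    moment (3 for a normal distribution), not the excess kurtosis.
    """
    samples = list(signal)
    if len(samples) < 2:
        raise ValueError("at least two samples are required")
    mean = fmean(samples)
    std = pstdev(samples, mu=mean)
    rms = sqrt(fmean([x * x for x in samples]))
    if std == 0.0:
        # Skewness and kurtosis are undefined for a constant signal.
        skewness = kurtosis = float("nan")
    else:
        skewness = fmean([((x - mean) / std) ** 3 for x in samples])
        kurtosis = fmean([((x - mean) / std) ** 4 for x in samples])
    return {
        "mean": mean,
        "std": std,
        "rms": rms,
        "skewness": skewness,
        "kurtosis": kurtosis,
    }


features = time_domain_features([1.0, -1.0, 1.0, -1.0])
```

In practice such features are computed over successive windows of the waveform so that their trend over time can be followed.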
Image processing is similar to waveform signal processing but more complicated.

Diagnostics and Prognostics

Diagnostics. Diagnostics deals with detection, isolation and identification of faults.
It maps the monitoring information and extracted features to machine faults. This
mapping process is usually called pattern recognition. Typical fault diagnostic
approaches are model-based (or first principles; see Grimmelius et al. 1999),
statistical, and artificial intelligence (AI) approaches.
The model-based approaches use mathematical simulation models based on the
underlying physical principles of the monitored machine. This kind of approach
requires specific mechanistic knowledge and theory relevant to the monitored
machine.
The statistical process control approach has been widely used for fault detection.
It compares the monitored signal with a reference signal representing the
normal condition to determine whether the monitored signal is within the control
limits or not. Cluster analysis is a statistical classification approach that groups
signals into different fault categories based on a certain distance or similarity
measure between two signals. The measure is usually derived from a certain
discriminant function in statistical pattern recognition.
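The control-limit comparison can be sketched in a few lines of Python. The sketch assumes a univariate monitored value and symmetric limits at k = 3 sample standard deviations around the mean of a reference sample, which is a common convention but is an assumption here rather than something specified in the text; the function names and the numbers in the usage example are hypothetical.

```python
from statistics import fmean, stdev


def control_limits(reference, k=3.0):
    """Lower and upper control limits from a reference (normal condition) sample."""
    centre = fmean(reference)
    sigma = stdev(reference)  # sample standard deviation, needs >= 2 points
    return centre - k * sigma, centre + k * sigma


def out_of_control(samples, limits):
    """Indices of monitored values that fall outside the control limits."""
    lower, upper = limits
    return [i for i, x in enumerate(samples) if not lower <= x <= upper]


# Hypothetical readings of a monitored quantity under normal condition,
# followed by new observations to be checked.
limits = control_limits([10.0, 10.2, 9.8, 10.1, 9.9])
alarms = out_of_control([10.1, 10.6, 9.9, 9.4], limits)
```

With several correlated condition variables, a multivariate control chart would replace the per-variable limits used in this sketch.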
AI approaches have been increasingly applied to machine diagnosis. Typical
AI techniques include artificial neural networks, expert systems, fuzzy logic
systems, and evolutionary algorithms.
According to Grimmelius et al. (1999), the model-based approaches can be
applied efficiently to newly developed machinery because the design data is
already available; the other two kinds of approaches depend strongly on the
availability of measured data and are more suited for application to existing
machinery.
Prognostics. Unlike diagnostics, which deals with posterior event analysis,
prognostics deals with fault prediction. Prognostics usually needs to evaluate the
remaining useful life or the probability that a machine will operate normally for a
given time interval. As with diagnosis, the prognosis approaches include
model-based, statistical, and AI approaches.
Data fusion. For a complex system, a single sensor cannot provide sufficient
information for producing accurate results from analysis. In such cases, multiple
sensors are needed to obtain additional condition information, and multi-sensor
data fusion techniques are used to combine the information from these sensors for
more accurate diagnosis and prognosis. Data collected from each sensor may be a
mixture of data from several sources. Some of the sources are related to a par-
ticular machine condition of interest. Thus, an issue is to separate different sources
by fusing the observed multi-sensor data. Fusion can be conducted at data-level,
feature-level, or decision-level.

Maintenance Decision

The outcome of prognosis provides decision support for maintenance actions.
Therefore, prognostics and maintenance optimization are often considered together.
Maintenance optimization is usually based on certain criteria such as risk, cost,
reliability and availability. Widely used criteria are cost and availability. However,
risk and reliability criteria may be more appropriate for critical equipment or
situations where the consequence cannot be estimated by cost. Many CBM
optimization models have been developed and can be found in the literature. Their
implementation often needs specially developed software packages.

Computerized CBM Management System

A well-developed CBM system can provide the user with a simple and automated
method to plan and implement maintenance quickly and efficiently. To achieve
538 R. Jiang and X. Yan

this, a computerized management system with all the functions mentioned in
Saranga (2002) is a crucial tool.

22.2.2 Applications of CBM to Diesel Engines: A Literature Survey

We classify the literature based on the following five dimensions:

• Reference
• Machinery type: diesel engine or marine diesel engine
• CM technique: oil, vibration, acoustic emission (AE), others (including
• Modeling technique: model-based, statistical, AI, others
• Focus: data processing and modeling, development of sensors, and develop-
ment of a CBM system
The relevant literature is summarized in Table 22.1. From the table, we can draw
the following observations:

1. CM technique: among the 23 references, 10 deal with oil analysis, 5 with
vibration analysis, 4 with AE analysis, and 7 with other analysis techniques
(mainly multi-sensor techniques). This implies that oil analysis is the most
widely used CM technique for diesel engines.
2. Analysis and modeling technique: 4 references deal with model-based
approaches, 9 with statistical approaches, 10 with AI approaches, and 3 with
other approaches (mainly integrated approaches). This implies that statistical
and AI approaches are widely used analysis techniques for CM data of
diesel engines.
3. Application type: 15 references deal with data processing and/or modeling,
4 with development of on-line sensors or measurement systems, and 4 with
development and/or application of integrated CBM systems. This implies
that data processing and modeling play a key role in a CBM program.

Table 22.1. Summary of literature in CBM of diesel engines

Reference | Machinery | CM technique | Model type | Focus
Anderson et al. (1983) | Engine | Quantitative analytical ferrography | | Development of a standard ferrography analysis procedure, evaluation of a high gradient magnetic
Douglas et al. (2006) | Diesel engines | Acoustic emission | Statistical, AE energy | Identification of AE signals of ring/liner
Gorin and Shay (1997) | Marine diesel engine | Oil analysis | | Development of onboard oil analysis meters
Grimmelius et al. (1999) | Marine diesel engines | Torsional vibration of crank shaft | First principles, feature extraction, neural networks | Demonstration of modeling techniques through two cases
Hargis et al. (1982) | Marine diesel engines | Oil, ferrography | Discriminant score plotting technique | Identification of normal and abnormal states, condition-based inspection
Hofmann (1987) | Ship main engine | Vibration analysis | | Vibration monitoring program for maintenance onboard
Hojen-Sorensen et al. (2000) | Marine diesel engines | Vibration, acoustic emission | Neural network, discriminant methods, hidden Markov decision | On-line classification scheme
Hountalasa and Kouremenosa (1999) | Marine diesel engines | Thermodynamics | Model-based, simulation model | Automatic troubleshooting method
Hubert et al. (1983) | Cummins VT-903 diesel engine | Ferrography | Model-based approach | Development of a testing methodology for determining wear particle generation rates and filter
Jakopovic and Bozicevic (1991) | Marine engine | Oil analysis | Expert system, theory of fuzzy sets | Expert system for assessing the quality of lubricant and diagnosing engine
Jardine et al. (1989) | Diesel engine | Metal concentration of engine oil | Proportional hazard model | Fit proportional hazards model to oil analysis data
Johnson and Hubert (1983) | Medium duty truck engine | Analytical ferrography | Statistical | Evaluation of the particle generation rate and the filtering
Liu et al. (2000) | Marine diesel engines | Ferrograph, grid capacitance and photo-electric sensors | | On-line wear condition monitoring system
Logan (2005) | Gas turbine and diesel generators | Electric sensing devices | Neural network diagnostic inferencing | Intelligent diagnostic software agents operating in real-time onboard naval ships
Pontoppidan and Larsen (2003) | Marine diesel engines | Acoustical emission | Independent component analysis | Detection of condition changes
Priha (1991) | Marine diesel engines | | Knowledge-based | Development of Fault Avoidance Knowledge System
Scherer et al. (2004) | Diesel engines | Viscosity, permittivity, temperature, IR-spectros- | | Development and application of prototype of an oil condition sensor
Sharkey (2001) | Diesel engines | Vibration, AE, cylinder pressure | Neural network | Decision fusion through a multinet system
Sun et al. (1996) | Diesel engines | Oil analysis | Artificial neural network | Application of multisensor fusion
Tang et al. (1998) | Marine diesel engine | Temperature, pressure, combustion air flow | Fuzzy neural network, combustion simulation model | Condition monitoring system
Wang and Wang (2000) | Diesel engine | | Evidence theory, decision-layer multisensor data fusion | Approach to diagnose multiple faults of a working diesel engine
Wu et al. (2001) | Diesel engine | Vibration | Statistics analysis, multi-index fusion | Analysis of the piston–liner wear
Zhang et al. (2003) | Marine diesel engines | Oil spectrometric analysis | Grey system theory | Determination of the turning point

22.3 Problem Background and Observation Results

22.3.1 Problem Background

Wear states of two 8NVD48A-2u marine main propulsion diesel engines were
experimentally investigated at the Reliability Institute of Wuhan University of
Technology, China. The overall objective of the program was to develop a CBM
technique that provides condition information to support maintenance decisions for the engines.
There are three kinds of condition variables to represent the wear condition of
the engines:
• Wear particle concentrations
• Lubricant quality parameters such as viscosity and contamination index
• Operational parameters such as vibration level, shaft torque moment and
instantaneous rotation velocity
Condition Monitoring of Diesel Engines 541

In this case study, we focus on the concentrations of wear particles, which
reflect the wear condition of the main tribo-pairs (e.g. ring-liner, shaft-bearing, and
gears) in the engines. The metallic elements in the wear particles consist of ferrous
elements and non-ferrous elements. Elemental Fe comes from many parts such as
valve, bearings, piston ring, shaft, and so forth. The other ferrous elements include
Cr, Mn, and Ni. Among them, Cr is from the surface coating of the first piston
ring, Mn from cylinder liner and Ni from transmission gears. The non-ferrous
elements include Al, Cu, Pb and Si. Among them, Al is from piston, Cu mainly
from the bearing of connecting rod, Pb from crankshaft bearing and Si from piston
or contaminant.

22.3.2 Observation Results

The experiment was started after an overhaul of the engines, which is assumed to
restore the engines to a good-as-new condition, and finished at the time of the next
overhaul. During the experiment the engines ran cumulatively for 4831 h, the
engine oil was sampled periodically, and a total of 110 oil samples were taken
from the two engines.
Various pieces of equipment such as direct reading ferrograph, rotary ferro-
graph, infrared spectrum analyzer, scanning electron microscope and electronic
digital analyzer, viscosity meter, and lubricant quality meter were used to analyze
the oil samples in order to classify the wear state.
In this study, the wear is divided into two states: normal (or State 0) and
abnormal (or State 1). The wear state can be determined by analyzing the size,
composition, and type of wear particles. Several different techniques were used to
identify the wear state in the laboratory. For more details about the state classifi-
cation based on wear particle morphology, see Roylance et al. (1994), Roylance
and Raadnui (1994) and Raadnui and Roylance (1995).
Most of the observations were under normal operational conditions. A trend
analysis of concentration vs. time was carried out. The main findings were as
follows:

1. There exists a close relation between oil degradation and abnormal wear. It
was observed that the concentration of wear particles increases as viscosity
decreases and the contamination index increases.
2. There exist some differences among the outcomes provided by different
analysis techniques; and sometimes the outcomes are in disagreement.

The trend analysis identified 28 observations within the neighboring regions of
the condition change point, which is somewhat similar to the P-point of the P-F
interval. Among them, 12 observations are identified as abnormal and 16
observations as normal. These 28 observations are shown in Table 22.2 for further analysis
and modeling. In the table, j denotes sample number, and the 14th row gives the
mean concentration values of the first 12 observations.

Table 22.2. Concentration of main elements in oil samples (ppm)

j State Fe Cr Ni Mn Al Cu Pb Si
1 1 52.18 2.95 2.66 2.36 8.7 10.98 13.29 5.32
2 1 52.73 3.25 2.55 1.78 8.04 8.93 9.65 5.46
3 1 35.31 0.95 0.68 1.26 5.57 4.33 6.23 4.57
4 1 32.2 1.35 1.17 1.03 5.70 3.57 5.89 4.56
5 1 82.87 4.74 2.61 1.85 9.85 13.34 17.1 7.22
6 1 48.22 2.17 1.94 1.37 7.08 6.82 8.05 4.88
7 1 30.78 1.03 0.00 1.15 4.71 4.18 5.94 4.00
8 1 37.99 1.30 0.00 1.07 6.07 4.52 5.87 3.90
9 1 39.51 1.39 0.41 1.04 7.24 3.17 5.51 7.15
10 1 33.47 1.06 0.35 0.86 6.77 2.87 4.95 7.18
11 1 36.50 2.17 1.05 1.61 7.73 3.88 6.29 7.61
12 1 35.03 1.73 0.57 1.30 7.68 3.47 5.11 7.43
Mean 43.07 2.01 1.17 1.39 7.10 5.84 7.82 5.77
13 0 28.2 0.39 0.00 0.72 4.04 2.71 3.70 3.96
14 0 27.02 0.79 0.40 0.87 4.09 3.24 5.02 5.71
15 0 25.66 0.43 0.00 0.69 3.64 2.65 3.57 5.29
16 0 22.25 0.50 0.18 0.50 3.94 2.15 3.80 5.50
17 0 30.72 1.28 0.64 1.09 5.09 4.15 6.63 3.99
18 0 29.4 0.58 0.00 1.01 4.30 4.20 4.92 3.65
19 0 29.17 0.47 0.00 0.97 4.12 3.67 4.73 3.67
20 0 31.45 1.10 0.00 1.12 4.73 4.27 5.96 4.01
21 0 30.04 0.43 0.00 1.02 4.16 3.91 5.30 3.90
22 0 29.48 0.66 0.00 0.91 4.49 3.58 4.79 3.85
23 0 25.97 0.34 0.00 0.68 3.69 2.56 3.71 4.03
24 0 42.05 2.34 1.98 1.94 7.75 11.12 12.78 5.93
25 0 43.16 2.10 2.16 1.92 7.41 10.64 13.25 5.67
26 0 23.38 0.96 0.31 0.58 4.63 2.13 3.21 6.74
27 0 29.16 0.62 1.07 0.91 3.23 2.95 4.93 8.52
28 0 22.82 0.66 0.31 0.91 4.52 2.06 3.66 6.35

22.4 Development of Multivariate Control Chart and Discriminant Model

The data of Table 22.2 has been modeled using a stepwise pluralistic regression
approach by Zhao et al. (2003). There, an empirical relation between the state and
debris concentrations was built, which included three major elements: Mn, Cr, and
Cu. Jiang and Jardine (2006) propose a composite scale modeling approach, where
the data is used as a numerical example and is reanalyzed. It is shown that the
composite scale approach gives a better result in terms of statistical significance
and failure (or abnormal) prediction capability. In this section, we propose a new
approach to model the data. Compared with the previous approaches, it appears
more straightforward and comprehensive.

22.4.1 Correlation Analysis

The correlation coefficient r is a measure of strength of linear relationship between
two variables (Blischke and Murthy 2000, p 367). In this section we conduct a
correlation analysis to:
• Examine whether the correlation is dependent on the wear state or not
• Identify potentially significant variables for further analysis
For the former issue we examine the correlation coefficient matrices for both
States 0 and 1. The correlation coefficient matrix associated with State 1 can be
obtained from the first 12 rows of Table 22.2. The first figure of each entry in
Table 22.3 gives the correlation coefficient for this case. Similarly, the correlation
coefficient matrix associated with State 0 can be obtained from the last 16 rows of
Table 22.2. The second figure of each entry in Table 22.3 gives the correlation
coefficient for this case.
As can be seen from the table, the two figures of each entry are close to each
other with an average relative error about 10% except those of the column corres-
ponding to Si. Thus, we can roughly assume that the correlation is state-indepen-
dent. In the following discussion, when we mention the correlation coefficients,
they are the first figure of each entry in Table 22.3.
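As an illustration of how an entry of the State 1 correlation matrix can be obtained, the sketch below computes the sample correlation coefficient of Cu and Pb from the first 12 rows of Table 22.2. The helper name `pearson_r` is ours, and the product-moment formula is assumed to be the one used in the chapter.

```python
from math import sqrt

def pearson_r(x, y):
    """Sample (product-moment) correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Cu and Pb concentrations (ppm) of the 12 State 1 samples in Table 22.2.
cu = [10.98, 8.93, 4.33, 3.57, 13.34, 6.82, 4.18, 4.52, 3.17, 2.87, 3.88, 3.47]
pb = [13.29, 9.65, 6.23, 5.89, 17.1, 8.05, 5.94, 5.87, 5.51, 4.95, 6.29, 5.11]

r_cu_pb = pearson_r(cu, pb)
print(round(r_cu_pb, 2))
```

The result agrees, to two decimals, with the value r(Cu, Pb) = 0.98 reported for State 1. The same function applied to the last 16 rows would give the State 0 figure.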
To judge the significance of a linear correlation, we need to determine a critical
value for the correlation coefficient r. According to Fisher (1970), given a
correlation coefficient r, the significance of the linear correlation of two variables can be
tested by transforming the correlation coefficient to a
Student’s t-value:

$$ t = \frac{r}{\sqrt{1 - r^2}}. \tag{22.1} $$

The critical value of t associated with the 95% level, one tail, and the degrees of
freedom 12–1 = 11 is 1.7959. This implies that the critical value of r is 0.8737.
Namely, the linear relation between two variables is significant if their correlation
coefficient is larger than 0.8737 in this application.
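The step from the critical t-value to the critical r-value is an inversion of Equation 22.1, namely r = t / sqrt(1 + t^2). A minimal sketch follows; the function names are ours. Many textbooks state the statistic with an additional factor sqrt(n − 2), so the sketch should be read as following Equation 22.1 as given here rather than as a general recipe.

```python
from math import sqrt

def r_to_t(r):
    """Equation 22.1: transform a correlation coefficient into a t-value."""
    return r / sqrt(1.0 - r * r)

def t_to_r(t):
    """Inverse of Equation 22.1: the r-value that corresponds to a given t-value."""
    return t / sqrt(1.0 + t * t)

t_crit = 1.7959            # one-tailed 95% critical value with 11 degrees of freedom
r_crit = t_to_r(t_crit)
print(round(r_crit, 4))
```

Applying `r_to_t` to the result returns the original critical t-value, which is a quick check that the two transformations are consistent.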
As can be seen from Table 22.3, there are eight correlation coefficients that are
larger than 0.8737. They are:
r(Cu, Pb) = 0.98, r(Fe, Cr) = 0.95, r(Fe, Pb) = 0.94, r(Fe, Cu) = 0.93,
r(Cr, Cu) = r(Cr, Pb) = 0.92, r(Cr, Al) = r(Cu, Ni) = 0.88. (22.2)
Equation 22.2 involves six elements: Fe, Cr, Ni, Al, Cu, and Pb. Among them, Ni
and Al appear only once and the corresponding correlation coefficients (= 0.88) are
very close to the critical value. Thus, we may classify the elements into three groups:
• Strong correlation group: (Fe, Cr, Cu, Pb)
• Weak correlation group: (Ni, Al)
• Independent group: (Mn, Si)

Table 22.3. Correlation coefficients matrices

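(Each entry: State 1 / State 0)

Element | Cr | Ni | Mn | Al | Cu | Pb | Si
Fe | 0.95 / 0.84 | 0.78 / 0.80 | 0.66 / 0.96 | 0.81 / 0.85 | 0.93 / 0.96 | 0.94 / 0.96 | 0.24 / -0.04
Cr | | 0.87 / 0.88 | 0.78 / 0.89 | 0.88 / 0.96 | 0.92 / 0.91 | 0.92 / 0.93 | 0.32 / 0.23
Ni | | | 0.84 / 0.83 | 0.75 / 0.82 | 0.88 / 0.85 | 0.84 / 0.89 | 0.11 / 0.50
Mn | | | | 0.73 / 0.90 | 0.84 / 0.97 | 0.81 / 0.97 | 0.11 / 0.05
Al | | | | | 0.74 / 0.93 | 0.76 / 0.93 | 0.64 / 0.06
Cu | | | | | | 0.98 / 0.99 | 0.02 / 0.05
Pb | | | | | | | 0.12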

When two variables are strongly correlated and their means differ significantly,
then one can ignore the one with the smaller mean and simply use the one with the
larger mean. Using this reasoning, we may delete some of the elements in the
strong correlation group. Consider the first three correlation coefficients of Equa-
tion 22.2, which have larger r values. According to the first correlation coefficient
and the means given in Table 22.2, Cu may be deleted. Similarly, Cr and Pb may
be deleted based on the second and third correlation coefficients, respectively. As a
result, only five elements (Fe, Ni, Mn, Al and Si) are retained for further analysis.
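The elimination rule used above (for each strongly correlated pair, drop the element with the smaller mean) can be written out as a short sketch. The means are those of the "Mean" row of Table 22.2 and the pairs are the first three of Equation 22.2; the function name `prune` is hypothetical.

```python
# State 1 mean concentrations (ppm) from the "Mean" row of Table 22.2.
means = {"Fe": 43.07, "Cr": 2.01, "Ni": 1.17, "Mn": 1.39,
         "Al": 7.10, "Cu": 5.84, "Pb": 7.82, "Si": 5.77}

# The three most strongly correlated pairs of Equation 22.2, in order.
strong_pairs = [("Cu", "Pb"), ("Fe", "Cr"), ("Fe", "Pb")]

def prune(means, pairs):
    """For each pair, drop the element with the smaller mean; return the rest."""
    retained = list(means)
    for a, b in pairs:
        drop = a if means[a] < means[b] else b
        if drop in retained:
            retained.remove(drop)
    return retained

retained = prune(means, strong_pairs)
print(retained)
```

With these inputs Cu, Cr and Pb are dropped in turn, leaving the five elements named in the text.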
A physical interpretation of the correlation in this case study is that the wear
debris may not be pure metal and can come from different parts. Its mathematical
interpretation is that an increase (decrease) of the readings of one element implies
a possible increase (decrease) of the readings of a positively correlated element,
and a possible decrease (increase) of the readings of a negatively correlated element.
When the absolute value of the readings is very small,
e.g. some of the readings of Si, the correlation should be considered insignificant.

22.4.2 State Discrimination Capability of Condition Variables

Each condition variable contributes partial information for identifying the state of
the monitored system. By quantitatively examining their contributions, we can
identify those variables which carry more state information. This study develops a
method to quantitatively evaluate contributions of the condition variables. It starts
with building the marginal distributions associated with the abnormal and normal
states for each condition variable.

Marginal Distribution Associated with State 1

We use index 1 ≤ i ≤ 5 to denote the elements (Fe, Ni, Mn, Al, Si), respectively. For
a given element i, the entries of the corresponding column in Table 22.2 form a
censored sample, denoted $\{x_{ij},\ j = 1, 2, \ldots, 28\}$. Assume that $X_i$ follows a
certain distribution $F_1^{(i)}(x)$. The data associated with State 0 can be viewed as
right-censored. Namely, if the observed value associated with State 0 is $x_{ij}^{+}$, then
the corresponding value of x associated with State 1 meets the relation $x > x_{ij}^{+}$. Its
likelihood function is given by $1 - F_1^{(i)}(x_{ij}^{+})$. The overall likelihood
function is given by

$$ L_1^{(i)} = \prod_{j=1}^{12} f_1^{(i)}(x_{ij}) \prod_{j=13}^{28} \left[ 1 - F_1^{(i)}(x_{ij}^{+}) \right]. \tag{22.3} $$

The model parameters can be estimated by maximizing $L_1^{(i)}$ or $\ln(L_1^{(i)})$.
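To make Equation 22.3 concrete, the sketch below evaluates the logarithm of the censored likelihood for Fe, assuming a lognormal form for the State 1 distribution (the model type selected for Fe later in the chapter). The function names are ours, and the second parameter pair is an arbitrary comparison point. The sketch only evaluates the function; it does not perform the maximization.

```python
from math import erf, log, pi, sqrt

def norm_cdf(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def lognormal_logpdf(x, mu, sigma):
    """Log of the lognormal density at x."""
    z = (log(x) - mu) / sigma
    return -0.5 * z * z - log(x * sigma * sqrt(2.0 * pi))

def log_likelihood_state1(mu, sigma, exact, censored):
    """ln of Equation 22.3: exact State 1 readings plus right-censored State 0 readings."""
    ll = sum(lognormal_logpdf(x, mu, sigma) for x in exact)
    ll += sum(log(1.0 - norm_cdf((log(x) - mu) / sigma)) for x in censored)
    return ll

# Fe column of Table 22.2: State 1 (j = 1..12) and State 0 (j = 13..28).
fe_state1 = [52.18, 52.73, 35.31, 32.2, 82.87, 48.22,
             30.78, 37.99, 39.51, 33.47, 36.50, 35.03]
fe_state0 = [28.2, 27.02, 25.66, 22.25, 30.72, 29.4, 29.17, 31.45,
             30.04, 29.48, 25.97, 42.05, 43.16, 23.38, 29.16, 22.82]

ll_reported = log_likelihood_state1(3.7554, 0.1897, fe_state1, fe_state0)
ll_shifted = log_likelihood_state1(3.5, 0.1897, fe_state1, fe_state0)
print(ll_reported, ll_shifted)
```

The first call uses the parameter values reported for Fe in Table 22.4; moving the location parameter down to 3.5 lowers the log-likelihood, as expected. A numerical optimizer would be needed to locate the maximum itself.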

Murthy et al. (2003) present a model selection method based on a WPP (Weibull
plotting paper) plot. The method is based on a match between the WPP plot of data
and the WPP plot of a model. The WPP plots of data are shown in Figure 22.1. As
can be seen from the figure, the WPP plots of data have three kinds of different
shapes: convex for Ni, S-shaped for Mn and Al, and concave for Si and Fe.
It is well known that the WPP plot of the two-parameter Weibull distribution is
a straight line, and hence it is not an appropriate model for modeling the data. We
examined the WPP plots of some common two-parameter distributions and found
that the WPP plot is S-shaped for the normal distribution truncated at x = 0,
concave for the lognormal distribution, and convex for the Gumbel distribution of the
smallest extreme truncated at x = 0 given by

$$ F(x) = 1 - \exp\left[ -\exp\left( \frac{x - \mu}{\sigma} \right) \right] \exp\left( e^{-\mu/\sigma} \right), \qquad x \ge 0. \tag{22.4} $$


Figure 22.1. WPP plots of data
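For readers unfamiliar with WPP plots, the sketch below shows the usual Weibull-plot transformation, (t, F) to (ln t, ln(−ln(1 − F))), and confirms that a two-parameter Weibull cdf maps onto a straight line whose slope equals the shape parameter. The transformation is the standard one and is assumed to be the one used by Murthy et al. (2003); the scale and shape values are hypothetical.

```python
from math import exp, log

def wpp_point(t, F):
    """Map (t, F(t)) to Weibull-plot coordinates (ln t, ln(-ln(1 - F)))."""
    return log(t), log(-log(1.0 - F))

def weibull_cdf(t, scale, shape):
    """Two-parameter Weibull cdf."""
    return 1.0 - exp(-((t / scale) ** shape))

scale, shape = 5.0, 2.0          # hypothetical parameter values
ts = [1.0, 2.0, 4.0, 8.0]
pts = [wpp_point(t, weibull_cdf(t, scale, shape)) for t in ts]

# Slopes between consecutive points; a straight line gives identical slopes.
slopes = [(y2 - y1) / (x2 - x1) for (x1, y1), (x2, y2) in zip(pts, pts[1:])]
print(slopes)
```

Curvature in the transformed data (convex, concave or S-shaped) is therefore evidence against the two-parameter Weibull model, which is the reasoning used in the model selection here.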

Their WPP plots are shown in Figure 22.2. Clearly, for each WPP plot of data in
Figure 22.1 one can find a shape that matches one of the WPP plots in Figure 22.2.
Thus, an appropriate model can be found from these three models for each variable.
Once the model type is determined, the maximum likelihood method can be
used to obtain the estimates of the model parameters. The estimated parameters,
$(\mu_1^{(i)}, \sigma_1^{(i)})$, are shown in Table 22.4.
In later analysis, we need to know the means and variances of the fitted
marginal distributions. For the truncated normal distribution, the mean and
variance are given by

$$ m = \mu + \sigma \frac{\phi(-\mu/\sigma)}{1 - \Phi(-\mu/\sigma)}, \qquad V = \sigma^2 - m(m - \mu), \tag{22.5} $$

where µ and σ are the model parameters, and φ(.) and Φ(.) are the pdf and cdf of
the standard normal distribution, respectively. For the truncated Gumbel
distribution, the mean and variance are given by

$$ m = \mu + \sigma \exp\left( e^{-\mu/\sigma} \right) I_1, \qquad V = (m - \mu)\left[ (\mu - m) + \sigma I_2 / I_1 \right], \tag{22.6} $$

where

$$ I_1 = \int_{e^{-\mu/\sigma}}^{\infty} \ln(s)\, e^{-s}\, ds, \qquad I_2 = \int_{e^{-\mu/\sigma}}^{\infty} \ln^2(s)\, e^{-s}\, ds. \tag{22.7} $$

For the lognormal distribution, the mean and variance are given by

$$ m = \exp(\mu + \sigma^2 / 2), \qquad V = m^2 \left[ \exp(\sigma^2) - 1 \right]. \tag{22.8} $$
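Equations 22.5 and 22.8 are easy to check numerically. The sketch below evaluates them for two State 0 parameter pairs from Table 22.4 (Mn, truncated normal; Fe, lognormal). The function names are ours. The truncated Gumbel case (Equations 22.6 and 22.7) is omitted because it requires numerical integration. In our check, the dispersion values tabulated in Table 22.4 agree with the square root of V from these formulas, so the sketch prints the square roots.

```python
from math import erf, exp, pi, sqrt

def phi(z):
    """Standard normal pdf."""
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def truncated_normal_moments(mu, sigma):
    """Equation 22.5: mean and variance of a normal truncated at x = 0."""
    a = -mu / sigma
    m = mu + sigma * phi(a) / (1.0 - Phi(a))
    v = sigma ** 2 - m * (m - mu)
    return m, v

def lognormal_moments(mu, sigma):
    """Equation 22.8: mean and variance of a lognormal distribution."""
    m = exp(mu + sigma ** 2 / 2.0)
    v = m ** 2 * (exp(sigma ** 2) - 1.0)
    return m, v

m_mn, v_mn = truncated_normal_moments(0.0548, 0.8948)   # Mn, State 0
m_fe, v_fe = lognormal_moments(3.3532, 0.1243)          # Fe, State 0
print(m_mn, sqrt(v_mn), m_fe, sqrt(v_fe))
```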


Figure 22.2. WPP plots of truncated normal, lognormal, and truncated Gumbel distributions

Table 22.4. Maximum likelihood estimates of the distribution parameters

Parameter | i = 1, Fe | i = 2, Ni | i = 3, Mn | i = 4, Al | i = 5, Si
Distribution | Lognormal | Truncated Gumbel | Truncated normal | Truncated normal | Lognormal
$\mu_0^{(i)}$ | 3.3532 | 1.0286 | 0.0548 | 4.4977 | 1.5194
$\sigma_0^{(i)}$ | 0.1243 | 0.6361 | 0.8948 | 1.1658 | 0.1880
$m_0^{(i)}$ | 28.8153 | 0.9530 | 0.7342 | 4.4980 | 4.6511
$\sqrt{V_0^{(i)}}$ | 3.5965 | 0.8064 | 0.5494 | 1.1653 | 0.8820
$\mu_1^{(i)}$ | 3.7554 | 2.0401 | 1.5644 | 7.3574 | 1.8317
$\sigma_1^{(i)}$ | 0.1897 | 0.7985 | 0.4440 | 1.3679 | 0.1964
$m_1^{(i)}$ | 43.5269 | 1.7723 | 1.5648 | 7.3574 | 6.3660
$\sqrt{V_1^{(i)}}$ | 8.3335 | 1.1040 | 0.4433 | 1.3679 | 1.2624
$x_c^{(i)}$ | 34.3485 | 1.5703 | 1.0493 | 5.9022 | 5.3512
$Err_1^{(i)}$ | 0.0701 | 0.1171 | 0.2540 | 0.1142 | 0.2005
$Err_2^{(i)}$ | 0.1244 | 0.3797 | 0.1230 | 0.1437 | 0.2159
$P(x_c^{(i)})$ | 0.0973 | 0.2484 | 0.1885 | 0.1289 | 0.2082
Rank | 1 | 5 | 3 | 2 | 4

Marginal Distribution Associated with State 0

Denote $F_0^{(i)}(x)$ as the marginal distribution associated with State 0. The data
associated with State 1 can be viewed as left-censored (see Blischke and Murthy
2000). Namely, if the observed value associated with State 1 is $x_{ij}^{-}$, then the
corresponding value of x associated with State 0 meets the relation $x < x_{ij}^{-}$. Its
likelihood function is given by $F_0^{(i)}(x_{ij}^{-})$. The overall likelihood function
is given by:

$$ L_0^{(i)} = \prod_{j=1}^{12} F_0^{(i)}(x_{ij}^{-}) \prod_{j=13}^{28} f_0^{(i)}(x_{ij}). \tag{22.9} $$

A careful examination has been carried out to determine the model type of $F_0^{(i)}(x)$.
We found that $F_0^{(i)}(x)$ has the same model type as $F_1^{(i)}(x)$. The maximum
likelihood estimates of the model parameters, $(\mu_0^{(i)}, \sigma_0^{(i)})$, are also shown in Table
22.4.

Critical Value Between States 0 and 1

For a given element i and observation value $x_i$, we need to establish a state
discrimination criterion based on a certain critical value $x_c^{(i)}$. Namely, we classify
it as normal if $x_i < x_c^{(i)}$; otherwise, as abnormal and accordingly initiate an
appropriate maintenance action. We propose the following method to determine
the value of $x_c^{(i)}$.
Consider the case $x_i = x_c^{(i)}$. As can be seen from Figure 22.3, there can be two
kinds of misjudgments or errors:
• The real state is 0 but is misjudged as State 1
• The real state is 1 but is misjudged as State 0
The former error is given by

$$ Err_1^{(i)} = 1 - F_0^{(i)}(x_c^{(i)}), \tag{22.10} $$

and the latter error is given by

$$ Err_2^{(i)} = F_1^{(i)}(x_c^{(i)}). \tag{22.11} $$

The average of the two errors is given by

$$ P(x_c^{(i)}) = \left( Err_1^{(i)} + Err_2^{(i)} \right) / 2. \tag{22.12} $$

Figure 22.3. Distributions of condition variable and critical value (curves $f_0(x)$ and $f_1(x)$; thresholds $x_a$ and $x_c$ marked on the x-axis)

Thus, $x_c^{(i)}$ can be determined by minimizing $P(x_c^{(i)})$, i.e.

$$ \frac{dP(x_c^{(i)})}{dx_c^{(i)}} = 0 \quad \text{or} \quad f_0^{(i)}(x_c^{(i)}) = f_1^{(i)}(x_c^{(i)}). \tag{22.13} $$
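Equation 22.13 says that the critical value is the point where the two fitted densities cross. The sketch below locates that crossing for Fe by bisection, using the lognormal parameters reported in Table 22.4, and then evaluates Equations 22.10 to 22.12. The bisection routine, the bracket between the two means and all function names are our own choices; the chapter does not state which numerical method was used.

```python
from math import erf, exp, log, pi, sqrt

def norm_cdf(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def lognormal_pdf(x, mu, sigma):
    z = (log(x) - mu) / sigma
    return exp(-0.5 * z * z) / (x * sigma * sqrt(2.0 * pi))

def lognormal_cdf(x, mu, sigma):
    return norm_cdf((log(x) - mu) / sigma)

def critical_value(pdf0, pdf1, lo, hi, tol=1e-10):
    """Bisection on g(x) = f0(x) - f1(x); assumes one sign change in [lo, hi]."""
    g = lambda x: pdf0(x) - pdf1(x)
    if g(lo) * g(hi) > 0:
        raise ValueError("bracket does not contain a sign change")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(lo) * g(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

p0 = (3.3532, 0.1243)   # Fe, State 0 (mu, sigma) from Table 22.4
p1 = (3.7554, 0.1897)   # Fe, State 1 (mu, sigma) from Table 22.4

# Bracket the crossing between the two fitted means (about 28.8 and 43.5).
x_c = critical_value(lambda x: lognormal_pdf(x, *p0),
                     lambda x: lognormal_pdf(x, *p1), 28.8, 43.5)
err1 = 1.0 - lognormal_cdf(x_c, *p0)    # Equation 22.10
err2 = lognormal_cdf(x_c, *p1)          # Equation 22.11
p_mis = 0.5 * (err1 + err2)             # Equation 22.12
print(x_c, err1, err2, p_mis)
```

The values obtained this way are close to those listed for Fe in Table 22.4.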

The specific values of the relevant parameters ($Err_1^{(i)}$, $Err_2^{(i)}$, $P(x_c^{(i)})$, $x_c^{(i)}$) for
each element are shown in Table 22.4.

Discussion

$P(x_c^{(i)})$ is a measure of misjudgment probability. The smaller it is, the better is the
discrimination capability of variable i, namely, the variable contains more state
information. Using it as an importance criterion, we can rank the condition vari-
ables. The last row of Table 22.4 shows the rank number of each variable.
As can be seen from the table, Fe has the best discrimination capability. This is
consistent with the result of correlation analysis, which shows that it is highly
correlated with (Cr, Cu, Pb). Namely, the concentration of Fe comprehensively
reflects the concentrations of Cr, Cu, Pb and itself, and hence the reading of Fe
reflects the wear state to a great extent. The second most significant element is Al.
This also appears reasonable since debris of Al and Cr (the latter is reflected by Fe)
mainly comes from piston and piston rings, which are the main wear parts. Mn and
Si have almost the same discrimination capability. This appears reasonable due to
their independence. Finally, it is noted that Ni has the worst state discrimination
capability. This can be explained by the dispersion of its readings (see Table 22.2),
and the fact that the wear of the transmission gears may not be a major problem.

22.4.3 Construction of a Multivariate Control Chart

A multivariate control chart can intuitively display the results of condition moni-
toring and evolution trend. Therefore, it appears especially important to set an
alarm threshold and an abnormal threshold. Usually, the thresholds are optimized
in a CBM model. Here, our focus is on the construction of such a control chart, and
hence we only present a simple method to set the thresholds when the optimal
thresholds are unavailable.
We define $x_c^{(i)}$ as the abnormal threshold, and define the alarm threshold $x_a^{(i)}$ by

$$ F_1^{(i)}(x_a^{(i)}) = \alpha < Err_2^{(i)}. \tag{22.14} $$

Here, α depends on the condition degradation speed, sampling interval,
maintenance reaction time, and the values of $Err_2^{(i)}$, i = 1, …, 5. In the current
example, we take α = 5%.
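For the two elements modeled by a lognormal distribution (Fe and Si), Equation 22.14 reduces to a lognormal quantile. The sketch below computes it with the State 1 parameters of Table 22.4. The function name is ours. The truncated normal and truncated Gumbel cases need their own quantile functions and are not covered.

```python
from math import exp
from statistics import NormalDist

def lognormal_quantile(p, mu, sigma):
    """x such that F(x) = p for a lognormal distribution with parameters (mu, sigma)."""
    return exp(mu + sigma * NormalDist().inv_cdf(p))

alpha = 0.05
x_a_fe = lognormal_quantile(alpha, 3.7554, 0.1897)   # Fe, State 1 parameters
x_a_si = lognormal_quantile(alpha, 1.8317, 0.1964)   # Si, State 1 parameters
print(x_a_fe, x_a_si)
```

Both values agree closely with the alarm thresholds for Fe and Si given in Table 22.5.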
The control chart is designed to have the following features:
• It is displayed in the x-y plane, with the elements ordered from the most important
one to the least important one
• The abnormal thresholds are normalized to 1 for all the elements
• The alarm thresholds are transformed to the same value, γ, for all the elements
• The overall state is represented along the y-axis

To achieve the second and third features, we use the following relation to
transform an observed concentration $x_i$ into a normalized concentration $y_i$ without
changing the relative magnitude of the original readings:

$$ y_i = a_i + b_i x_i, \qquad b_i > 0. \tag{22.15} $$

The abnormal and alarm thresholds must map to 1 and γ, respectively:

$$ 1 = a_i + b_i x_c^{(i)}, \qquad \gamma = a_i + b_i x_a^{(i)}. \tag{22.16} $$

From Equation 22.16 we have

$$ a_i = \frac{\gamma x_c^{(i)} - x_a^{(i)}}{x_c^{(i)} - x_a^{(i)}}, \qquad b_i = \frac{1 - \gamma}{x_c^{(i)} - x_a^{(i)}}. \tag{22.17} $$

To specify the value of γ, we let

$$ \sum_i a_i = 0 \tag{22.18} $$

so as to decrease the influence of the constant term in Equation 22.15. This yields

$$ \gamma = \sum_i \frac{x_a^{(i)}}{x_c^{(i)} - x_a^{(i)}} \Big/ \sum_i \frac{x_c^{(i)}}{x_c^{(i)} - x_a^{(i)}}. \tag{22.19} $$

Clearly, γ is a function of α. For the present case, γ = 0.8404.
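The rescaling of Equations 22.15 to 22.19 can be reproduced in a few lines. The sketch below takes the abnormal thresholds from Table 22.4 and the alarm thresholds from Table 22.5 as given; the dictionary layout and the function name `rescale` are our own.

```python
# Abnormal thresholds (Table 22.4) and alarm thresholds (Table 22.5), in ppm.
x_c = {"Fe": 34.3485, "Ni": 1.5703, "Mn": 1.0493, "Al": 5.9022, "Si": 5.3512}
x_a = {"Fe": 31.2898, "Ni": 0.4048, "Mn": 0.8342, "Al": 5.1075, "Si": 4.5206}

span = {k: x_c[k] - x_a[k] for k in x_c}

# Equation 22.19: the common rescaled alarm level.
gamma = (sum(x_a[k] / span[k] for k in x_c)
         / sum(x_c[k] / span[k] for k in x_c))

# Equation 22.17: intercepts and slopes of the linear transformations.
a = {k: (gamma * x_c[k] - x_a[k]) / span[k] for k in x_c}
b = {k: (1.0 - gamma) / span[k] for k in x_c}

def rescale(element, x):
    """Equation 22.15: normalized concentration of one element."""
    return a[element] + b[element] * x

print(round(gamma, 4), rescale("Fe", 35.03))
```

By construction the intercepts sum to zero (Equation 22.18), each abnormal threshold maps to 1 and each alarm threshold maps to γ. The last line rescales the Fe reading of observation 12 as a usage example.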

The alarm thresholds and relevant parameters are shown in Table 22.5. The
multivariate control chart with rescaled element concentrations associated with the
12th and 13th observations is displayed in Figure 22.4.

Table 22.5. Alarm thresholds and transformed parameters

Parameter | i = 1, Fe | i = 2, Ni | i = 3, Mn | i = 4, Al | i = 5, Si
$x_a^{(i)}$ | 31.2898 | 0.4048 | 0.8342 | 5.1075 | 4.5206
$a_i$ | –0.7925 | 0.7849 | 0.2215 | –0.1855 | –0.0284
$b_i$ | 0.0522 | 0.1370 | 0.7419 | 0.2008 | 0.1922

22.4.4 State Discriminant Model

A state discriminant model consists of a relation between the condition variables
and the wear state, together with a critical value. A composite scale modeling approach can
be used to combine several scales or variables into a single scale or variable. The
combined scale is expected to have better failure (or abnormal state) prediction
capability than individual scales. Two typical models are the linear and multipli-
cative ones. Their parameters are determined by minimizing the sample coefficient
of variation (CV) of the composite scale. The minimum CV approach is hard to
apply in the presence of censored data. In this context, Jiang and Jardine (2006)
propose a simple method to estimate the model parameters in the presence of
censored data. The method transforms censored data into complete data by adding
a mean residual value to a censored datum for each scale. Such a new data set, thus
obtained, is called an equivalent complete data set and will be used for the para-
meter estimation using the minimal CV approach under the assumption that the
transformation does not significantly impact the composite scale model to be built.
They also conclude that a small value of CV is a necessary but insufficient con-
dition of a good prediction capability of failure for the composite scale model.
Therefore, they consider more than one alternative model, use the minimum CV
method to estimate the parameters of the alternative models, and determine the best
model based on the prediction capability of the models.

Figure 22.4. Multivariate control chart (y-axis: rescaled concentration; x-axis: State, Fe, Al, Si, Mn, Ni; series: observations No. 12 and No. 13)

The above approach appears somewhat troublesome as it involves multiple steps
and intensive numerical calculations. In this subsection, we propose a simpler and
more straightforward approach. It is based on the following assumptions, which
appear plausible:
• The correlations between the individual variables under consideration can
be ignored (see the correlation analysis in Section 22.4.1)
• The composite scale is a linear combination of the individual variables, and
follows the normal distribution
Under these assumptions, the misjudgment probability can be directly
represented by a function of the parameters of the composite scale and the means and
variances of the condition variables. As a result, the parameters of the composite
scale and the misjudgment probability can be simultaneously determined by
minimizing the misjudgment probability. The critical value is then established using the
approach presented in Section 22.4.2.

Determination of a Composite Scale

Consider the following linear model:

$$ y = \sum_{i=1}^{5} c_i x_i, \qquad \sum_{i=1}^{5} c_i = 1. \tag{22.20} $$

If we want to exclude a certain variable, say $x_k$, from the model, we just need to set
$c_k = 0$.
According to the above assumptions, the composite scale, Y, is a normal
random variable. For State 1, the mean and variance of Y are given by

$$ m_1 = \sum_{i=1}^{5} c_i m_1^{(i)}, \qquad V_1 = \sum_{i=1}^{5} c_i^2 V_1^{(i)}. \tag{22.21} $$

For State 0, the mean and variance of Y are given by

$$ m_0 = \sum_{i=1}^{5} c_i m_0^{(i)}, \qquad V_0 = \sum_{i=1}^{5} c_i^2 V_0^{(i)}. \tag{22.22} $$

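According to Equation 22.13, the critical value of the composite scale, $y_c$, meets
the following relation:

$$ \phi\left( \frac{y_c - m_0}{\sqrt{V_0}} \right) \Big/ \sqrt{V_0} = \phi\left( \frac{y_c - m_1}{\sqrt{V_1}} \right) \Big/ \sqrt{V_1}. \tag{22.23} $$

From Equation 22.23 we have

$$ y_c = m_1 + \sqrt{V_1}\, \frac{\sqrt{d^2 + 2(s^2 - 1)\ln(s)} - s d}{s^2 - 1}, \tag{22.24} $$

where

$$ s = \sqrt{V_1 / V_0}, \qquad d = (m_1 - m_0) / \sqrt{V_0}. \tag{22.25} $$

According to Equation 22.12, the misjudgment probability is given by

$$ P(y_c) = \left[ 1 - \Phi\left( \frac{y_c - m_0}{\sqrt{V_0}} \right) + \Phi\left( \frac{y_c - m_1}{\sqrt{V_1}} \right) \right] \Big/ 2. \tag{22.26} $$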
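The chain from Equation 22.21 to Equation 22.26 can be evaluated directly for a given coefficient vector. The sketch below does so for two coefficient sets from Table 22.6. It assumes that the dispersion values of Table 22.4 are standard deviations (so that their squares enter Equations 22.21 and 22.22) and that the two composite variances differ (s ≠ 1). The function and variable names are ours, and no optimization over the coefficients is attempted.

```python
from math import erf, log, sqrt

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Table 22.4, order (Fe, Ni, Mn, Al, Si): means and standard deviations.
M0 = [28.8153, 0.9530, 0.7342, 4.4980, 4.6511]
SD0 = [3.5965, 0.8064, 0.5494, 1.1653, 0.8820]
M1 = [43.5269, 1.7723, 1.5648, 7.3574, 6.3660]
SD1 = [8.3335, 1.1040, 0.4433, 1.3679, 1.2624]

def composite(c):
    """Return (y_c, P(y_c)) for coefficients c via Equations 22.21-22.26."""
    m0 = sum(ci * m for ci, m in zip(c, M0))
    m1 = sum(ci * m for ci, m in zip(c, M1))
    v0 = sum((ci * sd) ** 2 for ci, sd in zip(c, SD0))
    v1 = sum((ci * sd) ** 2 for ci, sd in zip(c, SD1))
    s = sqrt(v1 / v0)                       # Equation 22.25
    d = (m1 - m0) / sqrt(v0)
    z1 = (sqrt(d * d + 2.0 * (s * s - 1.0) * log(s)) - s * d) / (s * s - 1.0)
    y_c = m1 + sqrt(v1) * z1                # Equation 22.24
    p = 0.5 * (1.0 - Phi((y_c - m0) / sqrt(v0)) + Phi((y_c - m1) / sqrt(v1)))
    return y_c, p

y_c1, p1 = composite([0.0529, 0.1182, 0.4009, 0.2310, 0.1969])   # Model 1
y_c5, p5 = composite([0.1149, 0.0, 0.0, 0.4739, 0.4112])         # Model 5
print(p1, y_c5, p5)
```

With these inputs the sketch gives misjudgment probabilities close to the values 0.0201 and 0.0323 of Table 22.6 and a critical value close to 8.9081 for the (Fe, Al, Si) model. Minimizing the returned probability over the coefficients would require a numerical optimizer on top of this function.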

Since $m_0$, $V_0$, $m_1$, and $V_1$ are functions of the decision variables $\{c_i\}$, $P(y_c)$ is a
function of $\{c_i\}$. As a result, $\{c_i\}$ can be optimally determined by directly
minimizing $P(y_c)$.

Candidate Linear Models

Considering all linear models that include at least three variables, we have
ten three-parameter models, five four-parameter models, and one five-parameter
model. If we always include the two most important elements, Fe and Al, in all the
models, then we just need to consider three three-parameter models, three
four-parameter models, and one five-parameter model. We take the latter approach.
We first consider the five-parameter model. Using the approach outlined
above for determining the composite scale, we obtained the model parameters and objective function value
shown in the second row of Table 22.6. The third row of the table shows the values
of $c_i m_1^{(i)}$, which reflect the contribution of each element to the composite scale.
The larger the value, the more important is the element. Based on this criterion, we
rerank the elements and the results are shown in the fourth row. Comparing these
results with those shown in Table 22.4, we find that the ranks are basically
consistent except that the positions of Mn and Si are exchanged.
By eliminating one of (Ni, Mn, Si), we obtain three four-parameter models,
whose parameters and objective function values are shown in rows 5–7 of Table
22.6. Similarly, by eliminating two of (Ni, Mn, Si), we obtain three three-para-
meter models, whose parameters and objective function values are shown in rows
8–10 of Table 22.6.
As can be seen from rows 5–10 of the table, the objective function value
obtained from the model excluding a less important element is smaller than that
obtained from the model excluding a more important element. This confirms the
reasonableness of the new rank.

Table 22.6. Parameters of composite scale models

Model No. | c1, Fe | c2, Ni | c3, Mn | c4, Al | c5, Si | P(yc) | rI
1 | 0.0529 | 0.1182 | 0.4009 | 0.2310 | 0.1969 | 0.0201 | 12.4486
$c_i m_1^{(i)}$ | 2.3039 | 0.2096 | 0.6274 | 1.6994 | 1.2534 | |
Rank | 1 | 5 | 4 | 2 | 3 | |
2 | 0.0602 | 0 | 0.4542 | 0.2620 | 0.2235 | 0.0224 | 14.8925
3 | 0.0917 | 0.1974 | 0 | 0.3810 | 0.3299 | 0.0290 | 11.5072
4 | 0.0664 | 0.1476 | 0.4981 | 0.2879 | 0 | 0.0294 | 11.3420
5 | 0.1149 | 0 | 0 | 0.4739 | 0.4112 | 0.0323 | 15.4769
6 | 0.0782 | 0 | 0.5837 | 0.3381 | 0 | 0.0328 | 15.2289
7 | 0.1394 | 0.2948 | 0 | 0.5658 | 0 | 0.0425 | 11.7512

Selection of the Best Model

The best model should have a small value of P(yc) and include few model
parameters. Denote by n the number of model parameters (i.e. the number of variables
included in a linear composite scale). Because of the second relation in Equation
22.20, an n-parameter model has only n − 1 independent parameters. Define the
information quantity of a model as follows:

I = 1/P(yc). (22.27)

The average information quantity per independent parameter is given by:

rI = 1/[(n − 1) P(yc)]. (22.28)

This ratio reflects both of the above requirements. A large value of rI implies
a better model. We use this criterion to select the best model. The last column
of Table 22.6 shows the values of rI. As can be seen from the table, the best model
is the three-parameter model that includes the three important elements (Fe, Al, Si).
Also to be noted is that the second best model is the three-parameter model that
includes the elements (Fe, Al, Mn). Once more, this shows that Mn and Si have
almost the same importance, as indicated in the correlation analysis.

Rescaling of the Best Model

To display the state discrimination result on the control chart, we normalize the
state critical value yc (= 8.9081) to 1. To do so, all the coefficients in the composite
condition variable are divided by yc. Similarly, we may set an alarm threshold for y
as below:

F1(ya) = β < Err2 = 4.15%. (22.29)

In the current case, we take β = 1%. This yields y0.01 = 8.1536. The rescaled alarm
threshold for the composite scale equals 0.9156, which is not equal to the rescaled
alarm threshold (= γ) for the elements; see Figure 22.4.
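
Looking back at the model selection step, the criterion of Equation 22.28 can be
recomputed from the P(yc) column of Table 22.6. The Python sketch below is only an
illustration: the dictionary layout and the function name are assumptions
introduced here, not part of the original study. Because P(yc) is tabulated to
four decimal places, the recomputed values agree with the last column of the table
only approximately.

```python
# Recompute rI = 1 / [(n - 1) * P(yc)] (Equation 22.28) from Table 22.6.
# The data layout and the helper name are illustrative assumptions.

# model number -> (elements with non-zero coefficient, P(yc) as printed)
MODELS = {
    1: (("Fe", "Ni", "Mn", "Al", "Si"), 0.0201),
    2: (("Fe", "Mn", "Al", "Si"), 0.0224),
    3: (("Fe", "Ni", "Al", "Si"), 0.0290),
    4: (("Fe", "Ni", "Mn", "Al"), 0.0294),
    5: (("Fe", "Al", "Si"), 0.0323),
    6: (("Fe", "Mn", "Al"), 0.0328),
    7: (("Fe", "Ni", "Al"), 0.0425),
}


def information_per_parameter(n_params, p_misjudge):
    """Average information quantity per independent parameter (Eq. 22.28)."""
    if n_params < 2 or p_misjudge <= 0:
        raise ValueError("need n >= 2 and P(yc) > 0")
    return 1.0 / ((n_params - 1) * p_misjudge)


# rI for every model, then the model numbers sorted from best to worst.
r_values = {
    number: information_per_parameter(len(elements), p)
    for number, (elements, p) in MODELS.items()
}
ranking = sorted(r_values, key=r_values.get, reverse=True)

for number in ranking:
    elements, _ = MODELS[number]
    print(f"model {number} ({', '.join(elements)}): rI = {r_values[number]:.2f}")
```

With the tabulated inputs, the first two entries of the ranking are the
(Fe, Al, Si) and (Fe, Mn, Al) models, in line with the selection described above.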

22.5 Conclusions and Discussion

In this case study, we have presented an approach for modeling and analysis of the
condition monitoring data of the 8NVD48A-2u marine diesel engines. The main
conclusions have been:

1. The correlation analysis is useful for identifying the correlation strengths
between the elements and whether or not the correlations are state-independent.
2. It is possible and useful to build the marginal distributions of element
concentrations associated with both abnormal state and normal state. A
discrimination capability analysis helps in evaluating the state discrimination
capability of elements.
3. A multivariate condition monitoring control chart has been developed to
provide the maintenance engineer with intuitive wear state information.
4. The composite scale modeling approach based on minimizing the misjudgment
probability is a useful technique to combine multiple variables. The proposed
information criterion for selecting the best model appears reasonable.
Some issues that need to be considered in the future are as follows:

1. Some additional work is needed to validate the proposed model. This can be
done by examining the agreement between the model prediction results and
the actual observations in the field.
2. The alarm threshold and oil sampling interval can be optimized so as to
obtain a balance between the acquired information and the effort involved.
3. To provide a more accurate assessment of engine condition, it appears
necessary to use multiple monitoring techniques. Thus, fusion of multisensor
data and aggregation of multi-state measures is an important topic
that needs further study.
4. A maintenance decision optimization model and a computerized implementation
software package need to be developed to promote greater use of this
approach in industry.

22.6 Acknowledgement
The authors wish to thank Prof. D.N.P. Murthy for his constructive comments on
an earlier version of this chapter.

22.7 References
Anderson DN, Hubert CJ, Johnson JH, (1983) Advances in quantitative analytical
ferrography and the evaluation of a high gradient magnetic separator for the study of
diesel engine wear: Wear 90(2): 297–333
Blischke WR, Murthy DNP, (2000) Reliability: modeling, prediction, and optimization.
John Wiley, New York
Douglas RM, Steel JA, Reuben RL, (2006) A study of the tribological behaviour of piston
ring/cylinder liner interaction in diesel engines using acoustic emission. Tribology
International 39(12): 1634–1642
Fisher RA, (1970) Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh
Goode KB, Moore J, Roylance BJ, (2000) Plant machinery working life prediction method
utilizing reliability and condition-monitoring data. Proceedings of the Institution of
Mechanical Engineers Part E-Journal of Process Mechanical Engineering 214: 109–122
Gorin N, Shay G, (1997) Diesel lubricant monitoring with new-concept shipboard test
equipment. TriboTest 3(4): 415–430
Grimmelius HT, Meiler PP, Maas HLMM, Bonnier B, Grevink JS, van Kuilenburg RF,
(1999) Three state-of-the-art methods for condition monitoring. IEEE Transactions on
Industrial Electronics 46(2): 407–416
Hargis SC, Taylor H, Gozzo JS, (1982) Condition monitoring of marine diesel engines
through ferrographic oil analysis. Wear 90(2): 225–238
556 R. Jiang and X. Yan

Hofmann SL, (1987) Vibration analysis for preventive maintenance: a classical case history.
Marine Technology 24(4): 332–339
Hojen-Sorensen PAdFR, de Freitas N, Fog T, (2000) On-line probabilistic classification
with particle filters. Neural Networks for Signal Processing X, 2000. Proceedings of the
2000 IEEE Signal Processing Society Workshop 1: 386–395
Hountalasa DT, Kouremenosa AD, (1999) Development and application of a fully automatic
troubleshooting method for large marine diesel engines. Applied Thermal Engineering
19(3): 299–324
Hubert CJ, Beck JW, Johnson JH, (1983) A model and the methodology for determining
wear particle generation rate and filter efficiency in a diesel engine using ferrography.
Wear 90(2): 335–379
Jakopovic J, Bozicevic J, (1991) Approximate knowledge in LEXIT, an expert system for
assessing marine lubricant quality and diagnosing engine failures. Computers in Industry
17(1): 43–47
Jardine AKS, Ralston P, Reid N, Stafford J, (1989) Proportional hazards analysis of diesel
engine failure data. Quality and Reliability Engineering International 5(3): 207–216
Jardine AKS, Lin D, Banjevic D, (2006) A review on machinery diagnostics and prognostics
implementing condition-based maintenance. Mechanical Systems and Signal Processing
20(7): 1483–1510
Jiang R, Jardine AKS, (2006) Composite scale modeling in the presence of censored data.
Reliability Engineering and System Safety 91(7): 756–764
Johnson JH, Hubert CJ, (1983) An overview of recent advances in quantitative ferrography
as applied to diesel engines. Wear 90(2): 199–219
Liu Y, Liu Z, Xie Y, Yao Z, (2000) Research on an on-line wear condition monitoring
system for marine diesel engine. Tribology International 33(12): 829–835
Logan KP, (2005) Operational Experience with Intelligent Software Agents for Shipboard
Diesel and Gas Turbine Engine Health Monitoring. 2005 IEEE Electric Ship
Technologies Symposium: 184–194
Lu S, Lu H, Kolarik WJ, (2001) Multivariate performance reliability prediction in real-time.
Reliability Engineering and System Safety 72: 39–45
Moubray J, (1997) Reliability-centred maintenance. Butterworth-Heinemann, Oxford.
Murthy DNP, Xie M, Jiang R, (2003) Weibull Models, Wiley.
Pontoppidan NH, Larsen J, (2003) Unsupervised condition change detection in large diesel
engines. 2003 IEEE XI11 Workshop On Neural Networks For Signal Processing: 565–574
Priha I, (1991) FAKS—an on-line expert system based on hyperobjects. Expert Systems
with Applications 3(2): 207–217
Raadnui S, Roylance BJ, (1995) Classification of wear particle shape. Lubrication
Engineering 51(5): 432–437
Roylance BJ, Albidewi IA, Laghari MS, Luxmoore AR, Deravi F, (1994) Computer-aided
vision engineering (CAVE): Quantification of wear particle morphology. Lubrication
Engineering 50(2): 111–116
Roylance BJ, Raadnui S, (1994) Morphological attributes of wear particles – their role in
identifying wear mechanisms. Wear 175(1-2): 115–121
Saranga H, (2002) Relevant condition-parameter strategy for an effective condition-based
maintenance. Journal of Quality in Maintenance Engineering 8(1): 92–105
Scherer M, Arndt M, Bertrand P, Jakoby B, (2004) Fluid condition monitoring sensors for
diesel engine control. Sensors, 2004. Proceedings of IEEE 1: 459–462
Sharkey AJC (2001) Condition monitoring, diesel engines, and intelligent sensor processing.
Intelligent Sensor Processing, A DERA/IEE Workshop on: 1/1 – 1/6
Sun C, Pan X, Li X, (1996) The application of multisensor fusion technology in diesel
engine oil analysis. Signal Processing, 1996., 3rd International Conference on 2:1695–
Condition Monitoring of Diesel Engines 557

Tang T, Zhu Y, Li J, Chen B, Lin R, (1998) A fuzzy and neural network integrated
intelligence approach for fault diagnosing and monitoring. UKACC International Con-
ference on Control 2: 975–980
Wang HF, Wang JP, (2000) Fault diagnosis theory: method and application based on
multisensor data fusion. Journal of Testing and Evaluation 28(6): 513–518
Wu X, Chen J, Wang W, Zhou Y, (2001) Multi-index fusion-based fault diagnosis theories
and methods. Mechanical Systems and Signal Processing 15(5): 995–1006
Zhang H, Li Z, Chen Z, (2003) Application of grey modeling method to fitting and
forecasting wear trend of marine diesel engines. Tribology International 36(10): 753–756
Zhao C, Yan X, Zhao X, Xiao H, (2003) The prediction of wear model based on stepwise
pluralistic regression. In: Proceedings of International Conference on Intelligent Main-
tenance Systems (IMS), Xi’an, China: 66–72

Benchmarking of the Maintenance Process at Banverket (The Swedish National Rail Administration)

Ulla Espling and Uday Kumar

23.1 Introduction
To sustain a competitive edge in business, railway companies all over the world are
looking for ways and means to improve their maintenance performance. Benchmarking
is a very effective tool that can assist management in its pursuit of
continuous improvement of operations. The benefits are many: benchmarking helps
in developing realistic goals and strategic targets, and facilitates the
achievement of excellence in operation and maintenance (Almdal 1994).
In this chapter three different benchmarking studies are presented: (1)
benchmarking of the maintenance process for cross-border operations, (2) a study of
the effectiveness of outsourcing of the maintenance process by different track
regions in Sweden, and (3) a study of the level of transparency among the European
railway administrations. In these case studies the focus is on railway
infrastructure, excluding the rolling stock. The outline of the chapter is as
follows. An overview of Swedish railway operations is presented in Section 23.2.
The definition and general methodology of benchmarking are discussed in Section
23.3. The special demands of benchmarking maintenance are described in Section
23.4, and Section 23.5 reviews the special considerations arising from the railway
context, generally for railways and in more detail for the Swedish context. The
case studies are discussed in Sections 23.6–23.8. The discussion and conclusions
are presented in Sections 23.9 and 23.10, respectively.
All the data pertinent to benchmarking of railway operation and maintenance are
retrieved, classified and analyzed in close cooperation with operation and
maintenance personnel from both infrastructure owners and maintenance contractors.
The chapter discusses the pros and cons, the areas for improvement and the need for
the development of a framework and metrics for benchmarking. The focus of this
chapter is to visualize best practices in maintenance and to propose means for
improvement in the railway sector, with special reference to railway infrastructure.

23.2 Swedish Railway Operations

The railway industry is presently in a state of transition, with new stakeholders
emerging and old ones trying to adjust to the new operating environment. In each
country of Europe, the railway administration was vertically integrated, i.e. it
comprised everything in “one body”, until almost the end of the 1980s, when a new
railway era started. The vertically integrated railway organisations were and still
are partly government-funded and regulated by parliament through government
directives. Figure 23.1 illustrates the organisational changes in Sweden from a
“single entity”, SJ (the Swedish State Railways), to a number of business units,
each functioning independently to achieve its business goals. During 1988, SJ as a
state authority was restructured to enhance its competitiveness and make railway
travel and transportation economically viable. The restructuring programme divided
SJ up into two major groups, namely train operating companies (TOCs) and infrastructure
owners. The TOCs are expected to take the responsibility for transportation of
goods and passengers in close cooperation with infrastructure managers. Today
there are about 20 TOCs functioning in Sweden. The railway infrastructure is
managed by ‘Banverket’ (the Swedish National Rail Administration), which is a
government body. In 1998, Banverket was reorganised into two distinct categories,
purchasers or ‘service buyers’ and contractors, or ‘service providers’. For adminis-
trative purposes, Banverket is divided into five regions, each of which is respon-
sible for maintenance planning and purchasing, and following up the execution of
the maintenance contract. In recent years, maintenance contracts have increasingly
been awarded through open tender, thus being subjected to market competition.

Figure 23.1. Organisational changes within the Swedish railway system (timeline
marks: 1988-07-01, 1998-01-01, 2001-07-01, 2004-07-01)


23.2.1 Maintenance

Railway infrastructure is a complex system. Usually such infrastructure is
technically divided into substructures, namely bridges, tunnels, permanent way,
turnouts, sleepers, electrical assets (both low and high voltage), signalling systems
including systems for traffic control, telecom systems such as systems for radio
communication, telecommunications and detectors, etc. Maintenance of all these
subsystems is a complex issue, which makes it difficult to plan and execute the
maintenance task. Factors such as geographical and geological features, topography
and climatic conditions need to be considered when planning for maintenance.
Furthermore, the availability of track for maintenance is also an important issue to
be considered when planning the maintenance tasks to be executed. Previously,
maintenance management was based on technical system characteristics instead of
asset delivery functions. Maintenance is critical for ensuring safety, train
punctuality, overall capacity utilization and lower costs for modern railways.
The deregulation, privatization and outsourcing processes have created new
situations, new organizations and new structures for collecting appropriate data
from the field operations and extracting relevant information, so as to make correct
decisions.

23.2.2 Need for Benchmarking in Maintenance

Many of the European railways have followed a similar evolution. Although many
of the countries of Europe are now members of the European Union, questions are
being raised concerning the transparency of the state-controlled railway sector in
order to make comparisons possible and to find the best practices followed within
the railway business. The European railway sector has gradually started to use
benchmarking so that the different actors may be able to learn from each other.

23.3 Benchmarking: An Overview

Benchmarking has its roots in fundamental business practice and began to take
shape at the beginning of the 1980s. It was introduced as a tool for business
development and is supposed to offer a key to large-scale improvements: it provides
a basis for learning from best practices and a road map for copying the work
process of the best in the class, i.e. it provides gains with relatively little
effort (Dunn 2003). In general the magnitude of the improvement is around 10–15%
(Varcoe 1996) and in some cases it can be as high as 35% (Burke 2004).
There are different benchmarking approaches ranging from the purely
quantitative to the highly qualitative (Oliverson 2000). Quantitative benchmarking
will benchmark, for example, the percentage of emergency work orders, the num-
ber of skilled workmen per first line supervisor or the percentage of overtime.
Moulin (2004) discusses benchmarking of the public sector, in which some aspects
of performance measurement must be considered, and states that, since
organisations in this sector often perform non-profitable administrative work, they
should be viewed from a balanced scorecard perspective (see Kaplan and Norton 1992).
Such organizational measures are useful to service users and provide a clear system
for translating feedback from the analysis into strategy for corrective actions.

23.3.1 The Benchmarking Methodology

Successful benchmarking starts with a deep understanding and good knowledge
regarding one’s own organisation’s processes; i.e. learning about one’s own
performance and bringing one’s own core business under control before learning from
others (Wireman 2004).
The most common approach to benchmarking is to compare one’s own per-
formance indicators with those of competitors or other companies in the same area,
which can be accomplished using simple questionnaires completed by personnel
involved in maintenance activities, with little or no expert help to conduct com-
prehensive studies, or with help from outside firms providing expertise in the
planning, execution and implementation of such processes. Based on what is to be
compared, benchmarking can be classified as performance, process or strategic
benchmarking (Campbell 1995). Similarly, based on whom one should make a
comparison with, benchmarking can be classified as internal, competitive, func-
tional or generic benchmarking (Zairi and Leonard 1994).
The results obtained from benchmarking identify the gap between one’s own
organisation’s performance and the one following best practices. These results are
then used to improve and develop core competencies and core businesses, leading
to lower costs, increased profit, better service towards the customers, increased
quality, and continuous improvements. In order to gain benefits, an organisation
has to mature in its own core competencies, and to ensure success, the ROI (return
on investment) should be calculated for each benchmarking exercise (Wireman
1998, 2004).
A broad survey of the literature shows that, even though all the suggested
methodologies for benchmarking are similar in their approach, they vary from a
general two-step process to a more detailed 10-step process (Varcoe 1996;
Ramabadron et al. 1997; Wireman 2004). All these steps can be related to Deming’s
famous PDCA cycle. Malano (2000) goes a little further and describes Deming’s
cycle as a “circular process” which includes the following phases: planning,
analysis, integration, action and review. The operational form of these steps for
the purpose of benchmarking may look like the following four steps:

1. Detailed planning of the benchmarking operation, to keep the goal of
benchmarking in focus (for example cost reduction, productivity, etc.) and
identify suitable partners for benchmarking. This step essentially encompasses
an internal audit to learn about the organisation’s business indicators.
2. Identifying which businesses to visit and appropriate data collection.
3. Analysis of the data and information collected to identify gaps and the
sharing of information.
4. Implementation and continuous improvement.

Most of the literature points out the fact that successful benchmarking needs a
good plan specifying what to benchmark, whom to visit (to study the best practice),
when to visit, and what types of resources are required for analysis and implemen-
tation. Often simple studies are completed at little cost and generally have no
follow-up. Good benchmarking, on the other hand, is time- and resource-consum-
ing and has well-structured follow-up plans etc. The selection of the type and scope
of the benchmarking process should be made on the basis of the impact of the
outcome on the critical success factors for the process (Mishra et al. 1998).
A benchmarking exercise is of no value if the findings are not implemented. In
fact, without implementation it would be a waste of resources. The benefits of
benchmarking do not occur until the findings from the benchmarking project are
realized, and therefore performance improvement through benchmarking needs to
be a continuous process.

23.3.2 Metrics

Metrics for benchmarking can be indicators or KPIs as discussed in Chapter 19. In
order to make the benchmarking process a successful exercise, it is important that
the areas, the process enablers and the critical success factors required for good
performance are identified, so that the common denominator or any common
structure that is important to compare can be described by indicators or other
types of measurements, often presented as percentages (%) (Wireman 2004). These
indicators can be characterized as lead and lag indicators, lead indicators
being performance drivers and lag indicators being outcome measures (Åhrén et al.

23.4 Benchmarking of Maintenance

Maintenance is treated as an enabler of improved asset or equipment performance
(see Figure 23.2) which creates additional value for the business process (Liyanage
and Kumar 2003). Its performance can be monitored by performance measures like
availability, quality, value (cost) etc. (Mishra et al. 1998).

Figure 23.2. Maintenance’s link with benchmarked value (elements shown: equipment
state, maintenance, performance measurement, comparison with benchmarked value)


Since maintenance is a process of continuous improvement of the delivered
performance, benchmarking can be used to improve efficiency in maintenance and
offer solutions for improvement in maintenance performance. One definition of
benchmarking maintenance used in practice is “the process of comparing performance
with other organisations, identifying comparatively high performance organisations,
and learning what they do that allow them to achieve the high level of
performance” (Dunn 2003).
Relevant data can contain the following: (1) the man hours, (2) the material
costs, (3) the cost of preventive maintenance, (4) the cost of predictive maintenance
and (5) the cost of maintenance contracting. In Europe the European Federation of
National Maintenance Societies (EFNMS 2006) has agreed upon 13 different
maintenance indices to be used for presenting the results from benchmarking main-
tenance organisations. These are:

1. Maintenance costs as a percentage of plant replacement value
2. Store investment as a percentage of plant replacement value
3. Contract cost as a percentage of maintenance cost
4. Preventive maintenance costs as a percentage of maintenance costs
5. Preventive maintenance man hours as a percentage of maintenance man hours
6. Maintenance cost as a percentage of turnover
7. Training man hours as a percentage of maintenance hours
8. Immediate corrective maintenance man hours as a percentage of maintenance
hours
9. Planned and scheduled man hours as a percentage of maintenance man hours
10. Required operating time as a percentage of total available time
11. Actual operating time as a percentage of required operating time
12. Actual operating time divided by the number of immediate corrective
maintenance events
13. Immediate corrective maintenance time divided by the number of immediate
corrective maintenance events
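
Most of these indices are simple ratios of annual totals. As a rough illustration,
the Python sketch below computes a subset of them (indices 1, 3, 4, 8, 11 and 12 in
the numbering above). The field names, the function name and all input figures are
hypothetical and chosen only to show the arithmetic; they are not EFNMS
definitions or data from any railway.

```python
from dataclasses import dataclass


@dataclass
class MaintenanceYear:
    """Annual totals for one organisation (all field names are illustrative)."""
    maintenance_cost: float
    plant_replacement_value: float
    contract_cost: float
    preventive_cost: float
    maintenance_man_hours: float
    immediate_corrective_man_hours: float
    required_operating_time: float   # hours
    actual_operating_time: float     # hours
    immediate_corrective_events: int


def percentage(part, whole):
    """Return part as a percentage of whole, refusing a zero denominator."""
    if whole == 0:
        raise ValueError("denominator must be non-zero")
    return 100.0 * part / whole


def efnms_subset(y):
    """Compute a few of the listed indices, keyed by their number in the list."""
    return {
        1: percentage(y.maintenance_cost, y.plant_replacement_value),
        3: percentage(y.contract_cost, y.maintenance_cost),
        4: percentage(y.preventive_cost, y.maintenance_cost),
        8: percentage(y.immediate_corrective_man_hours, y.maintenance_man_hours),
        11: percentage(y.actual_operating_time, y.required_operating_time),
        # Index 12 is a ratio in hours per event, not a percentage.
        12: y.actual_operating_time / y.immediate_corrective_events,
    }


# Hypothetical figures, for illustration only.
example = MaintenanceYear(
    maintenance_cost=2_000_000,
    plant_replacement_value=80_000_000,
    contract_cost=500_000,
    preventive_cost=800_000,
    maintenance_man_hours=40_000,
    immediate_corrective_man_hours=6_000,
    required_operating_time=8_000,
    actual_operating_time=7_600,
    immediate_corrective_events=38,
)

for number, value in efnms_subset(example).items():
    print(f"index {number}: {value:.1f}")
```

Keeping the raw totals in one record and deriving every index from it makes it
easier to check that two organisations being compared use the same denominators.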

Wireman (2004) states that the maintenance management impact on the return
on fixed assets (ROFA) can be measured by two indicators, namely:
• Maintenance cost as a percentage of the total process, production, or manu-
facturing cost
• Maintenance cost per square foot maintained

23.4.1 Decision Criteria from Benchmarking Exercise

Experience from different benchmarking projects in the US has yielded some rules
of thumb that can be used to evaluate the results as well as to make suggestions
for future actions. One rule of thumb concerns the ratio of the corrective
maintenance volume to the total maintenance volume. A level higher than 20%
indicates a reactive situation, where the future focus will be to bring the core
business under control, since planned work vs. unplanned work may have a cost ratio
as high as 1:5. Another rule of thumb concerns a high level of overtime, which
indicates reactive situations in the maintenance process. Since labour is a large
cost driver for maintenance, the amount of overtime can have a large impact on
maintenance costs. Another large cost driver is spare parts (Wireman 2004;
Hägerby 2002).
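
The first rule of thumb can be written as a one-line check. The Python sketch
below assumes that "volume" is measured in a single consistent unit (for example
man hours or work orders); the function names and the sample figures are
hypothetical. Only the 20% threshold is taken from the text above.

```python
# Threshold from the rule of thumb: more than 20% corrective work
# suggests a reactive maintenance situation.
REACTIVE_THRESHOLD = 0.20


def corrective_share(corrective_volume, total_volume):
    """Share of corrective work in the total maintenance volume (0..1)."""
    if total_volume <= 0:
        raise ValueError("total maintenance volume must be positive")
    if not 0 <= corrective_volume <= total_volume:
        raise ValueError("corrective volume must lie between 0 and the total")
    return corrective_volume / total_volume


def is_reactive(corrective_volume, total_volume, threshold=REACTIVE_THRESHOLD):
    """True when the corrective share is strictly above the threshold."""
    return corrective_share(corrective_volume, total_volume) > threshold


# Hypothetical example: 3 100 corrective man hours out of 12 400 in total.
share = corrective_share(3_100, 12_400)
print(f"corrective share: {share:.0%}, reactive: {is_reactive(3_100, 12_400)}")
```

The comparison is strict because the text speaks of a level "higher than" 20%; a
share of exactly 20% is therefore not flagged.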

23.4.2 Railway Context

Benchmarking approaches used by industries to improve their performance
through comparison with the best in the class can equally be used for
benchmarking of railway operations. But unlike the industrial sector, railway
infrastructure consists of a larger number of individual assets, including
substructure, permanent way, signalling, electrical and telecom assets, that extend
over anything from a few to hundreds of kilometres. Furthermore, there are large
differences between the
structures of the different railway organisations. At present, many organisations are
characterised by comprising one entity, whereas some are divided up into traffic
companies and infrastructure owners, with an in-house or outsourced maintenance
function. The different types of traffic on the railway tracks have different degra-
dation characteristics and, therefore, it is difficult to compare passenger-intensive
lines with heavy haul lines or lines with mixed traffic. Furthermore, the data
collected from the different partners selected for benchmarking are not always
possible to compare without normalisation. It is also important to validate and audit
the collected data to find outliers (Oliverson 2000). Some examples of the normali-
sation required within railway benchmarking are presented in the following.
In a benchmarking project called “InfraCost”, data have been collected over a
number of years to compare the asset life cycle costs of different railways. A
complex normalisation process has been used to bring all the information, for
example maintenance costs, renewal costs, local labour costs, intensity and speed
of trains, from different countries in Europe to a common base for comparison (see
Zoeteman and Swier 2005).
Another way to normalize data is to identify the cost drivers and try to establish
a link between performance and cost on the one hand, and performance and the age
of the assets on the other. In order to compare the assets, compensation factors
were established on the basis of the network complexity, measured in terms of
(Stalder et al. 2002):
• Density of turnouts
• Length of lines on bridges and in tunnels
• Degree of electrification
• Usage according to average frequencies of train per year
• Average gross tonnage per year (freight and passenger)
In the project, the cost drivers have been established, but the implementation of
life cycle cost (LCC) strategies for avoiding the difficulties of separating the main-
tenance cost from the renewal expenditures has not yet been fully realized (Stalder
et al. 2002).

When the International Union of Railways (UIC) compared costs between Europe, the
USA and Asia in its benchmarking projects between 1996 and 2002, it found big
differences in the costs. In an attempt to understand the differences, Zoeteman
and Swier (2005) developed a model that converted the benchmarked results into
life cycle cost per km of track, including the maintenance cost, renewal cost and
overhead cost both for the organization and the contractors. The major differences
are in purchasing power, wages, turnout density, degree of electrification, the
proportion of single track and intensity of use.
Benchmarking is not yet common practice within the railway sector, and there
is a need to build up a framework and metrics in order to compare and find out the
best practices.
The aim of using benchmarking as a tool to improve prevalent maintenance
practices within the railway sector is to demonstrate the measures that make it
possible to compare the result from one operation to another regarding the railway
administrations under different circumstances and conditions, and to identify the
best practices in the area. Therefore, the benchmarking process has to be evaluated
and normalised to fit the railway maintenance process. Accordingly, it is also
essential to decide what kind of KPIs (key performance indicators) need to be im-
plemented for improvement.

23.5 Benchmarking in the Swedish Railway Sector

Benchmarking within the railway sector is characterized by state ownership
and monopoly. One of the first benchmarking projects, “InfraCost”
(Zoeteman and Swier 2005), showed big differences in
maintenance costs among European, Asian and American railway administrations.
The result from this benchmarking shows the need for establishing a common
framework and common metrics for benchmarking. Initially, benchmarking within
Sweden was motivated by reasons other than finding the best practice. These were:
• Checking whether it is possible to perform benchmarking, and studying
benchmarking methods
• Finding those key areas that are critical success factors
• Finding answers to questions like “Why is it less expensive to run railways
in neighbouring countries?”
The case studies presented in Sections 23.6–23.8 have used three different
approaches concerning methodologies for data collection and classification, nor-
malization and analysis of results. The case studies are:

1. Two neighbouring local track areas sharing a line for railway traffic on
each side of the border. The aim was to compare the maintenance cost,
identify differences and find areas to improve.
2. Internal benchmarking for maintenance contracts in order to find the best
practice and to improve the maintenance contracts.
3. To determine what (maintenance) performance measures were in use
within the railway sector in Europe. The aim was to scan the possibility of
finding areas to compare, just by looking into those official documents that
some of the railways have presented.

The common denominator of these case studies is the use of benchmarking
methodologies in order to find out whether they are useful within the railway
sector. The differences between these case studies are the main objectives of the
benchmarking.

23.6 Case Study – 1: Benchmarking Across the Border

A case study benchmarking a cross-border operation and maintenance process was
initiated by Track Area A for the rail administration in Country A. Track Area A
provides railway infrastructure in the western part in Country B, between City B in
Country B and City A in their own country (Country A). The aim was to study and
understand why the operation and maintenance costs are different on the other side
of the border. They also needed to find out if those costs were comparable with the
costs in Country B and if it was possible to coordinate parts of the maintenance
work between these two countries in order to decrease the cost (Åhrén and Espling
The benchmarking process was conducted by Luleå Railway Research Center,
a neutral party to both the organizations. During the preparatory stage of the
benchmarking process, total transparency between the infrastructure owners
representing these two countries was agreed upon. It was also decided (by the
sponsor of the study) not to make the result of the study public and to keep it con-
fidential for five years.
Both track areas were organised more or less identically for the purpose of
maintenance, and the maintenance activities were planned and executed in a
similar way. It was therefore not necessary to examine and normalise the overhead
costs of both the railway administrations.

23.6.1 Metrics and Data

The metrics and data collected were the cost for the operation and maintenance and
outcome of performance losses. The data were collected for one calendar year from
the systems for accounting, planning system, failure reporting and inspection and
• Budget vs. performed outcome for maintenance costs
• Overhead costs for the local administrations
• Maintenance planning
• Failure statistics
• The inspection remarks
568 U. Espling and U. Kumar

However, the following information and data relevant to the study could not be
• Overhead cost for the contractor (not available due to the competition
between the different contractors)
• Man hours (not available, not collected in the client system from the in-
• Traffic volume
• Asset age, which were approximately the same (not necessary to collect,
since the traffic mix and volume were the same)
• Spare part costs (not available)

Normalisation
Since the organisation and accounting structure were almost the same, it was
assumed that the missing data could be disregarded. The amount of normalisation
was restricted to adjusting the currency.
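Because normalisation in this case study was limited to currency adjustment, the step amounts to a single conversion before the costs per metre are compared. The following is a minimal sketch of that conversion. The cost figure, the exchange rate and the function name `to_common_currency` are hypothetical, since the study reports neither the currencies nor the rate that was used.

```python
from decimal import Decimal


def to_common_currency(cost_per_metre: Decimal, rate_to_common: Decimal) -> Decimal:
    """Convert a cost per track metre into the common comparison currency."""
    return (cost_per_metre * rate_to_common).quantize(Decimal("0.01"))


# Hypothetical inputs: not taken from the case study.
cost_area_a_local = Decimal("250.00")  # cost per metre in Country A's currency
rate_a_to_b = Decimal("1.12")          # assumed units of Country B's currency per unit of A's

cost_area_a_common = to_common_currency(cost_area_a_local, rate_a_to_b)
print(cost_area_a_common)
```

Using `Decimal` rather than binary floating point keeps the converted amounts exact to the chosen number of decimals, which is usually preferable for accounting data.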

23.6.2 Results and Interpretations

The available data and information were then sorted as shown in Table 23.1. The
maintenance costs were grouped into the categories snow removal, corrective
maintenance and preventive maintenance; see Table 23.2.

Table 23.1. Comparing cost per metre of track

Object                                       Track Area A   Track Area B
Total cost                                   795            290
Maintenance cost                             285            280
Track area administration cost (overhead)    220            8
Other external costs, e.g. consultancy       90             2
Charges for electric power                   200            0

Table 23.2. Difference in percentage in maintenance costs between Track Areas A and B
Maintenance activities                                        Difference in percentage from Track Area B
Snow removal                                                  + 10%
Corrective maintenance, including organisation for
preparedness (emergency service)                              + 32%
Preventive maintenance, including inspection                  – 62%

The benchmarking result showed that the maintenance cost per track metre was
approximately the same in both track areas, whereas the total cost per track metre
differed considerably (Table 23.1). One of the findings was that the amount
of corrective maintenance was very high in both track areas. A closer investigation
showed that Track Area A had a larger amount of corrective maintenance and
therefore less money for preventive maintenance.
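The breakdown in Table 23.1 can be made explicit by expressing each cost component as a share of the summed cost per metre. The sketch below uses the figures transcribed from Table 23.1. The dictionary layout and the helper name `cost_shares` are illustrative choices, not part of the original study.

```python
# Cost per metre of track, transcribed from Table 23.1.
cost_components = {
    "Track Area A": {
        "maintenance": 285,
        "overhead": 220,
        "other_external": 90,
        "electric_power": 200,
    },
    "Track Area B": {
        "maintenance": 280,
        "overhead": 8,
        "other_external": 2,
        "electric_power": 0,
    },
}
reported_total = {"Track Area A": 795, "Track Area B": 290}


def cost_shares(components):
    """Return each component's share of the summed cost per metre."""
    total = sum(components.values())
    return {name: value / total for name, value in components.items()}


for area, components in cost_components.items():
    total = sum(components.values())
    print(f"{area}: components sum to {total} (table reports {reported_total[area]})")
    for name, share in cost_shares(components).items():
        print(f"  {name}: {share:.1%}")
```

With these figures the maintenance cost per metre is nearly identical in the two areas, while overhead, external costs and electric power account for almost the entire difference in total cost.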
Furthermore the overhead cost and other external costs such as travel costs,
costs for consultancy etc. in Track Area A were much higher compared to Track
Area B. One of the explanations was the geographical isolation of Track Area A
from its own administration, resulting in higher traveling costs and the necessity of
buying consultancy for some services that Track Area B could obtain from its
nearby regional office. Another explanation was that Track Area A had to finance
all its buildings, the electrical power and the cost for the traffic control centre,
while this was taken care of by a separate organization for Track Area B.
It was also possible to find those areas of work that could be mutually co-
ordinated, for example snow removal. However, this was something that needed to
be negotiated and was therefore considered a political matter.
The implementation phase was the responsibility of the national railway ad-
ministrations. The results were mainly used as arguments clarifying why the costs
were so much higher for the railway line in Country A compared with those of
other national lines.

23.7 Case Study – 2: Internal Benchmarking for Maintenance Contracts

All the maintenance work within Banverket is purchased either from the in-house
contractor or from an external contractor. This necessitates legal operations, and
maintenance business contracts are prepared and written for every maintenance
commission, containing details of the work to be provided, with targets and agreed
performance measures (for example, minimum of track down time in order to
increase the train punctuality) to control the quality of the maintenance work to be
Purchasing infrastructure maintenance is a complex issue due to the engineer-
ing complexities of railway assets, safety assurance, the usage type, the climate and
the traffic mix. In particular, it is very difficult to define the task to be performed
(procured) and the desired final outcome from the contract. Many different pro-
curement models have been tested with varying degrees of success (Larsson 2002).
This benchmarking project was launched at the request of one of the 16 local
regional track area managers (clients) responsible for procuring the maintenance
contracts. The manager had observed that their contracts with, in this case, the in-
house contractor had resulted in an increase in the cost limits, while the perform-
ance and the quality had started to decrease.
The process started with an internal survey of an ongoing contract. The contract
included snow removal and maintenance activities such as corrective maintenance
(failure repair and repair due to inspection remarks classifying faults as requiring
immediate action), inspections for safety and inspections for maintenance (classi-
fied as condition-based maintenance) and predetermined maintenance pin-pointed
by the internal regulations. Repair work due to faults not classified in inspections
as requiring immediate action was to be bought separately. The survey showed
problems such as a high amount of corrective maintenance, increasing costs for
failure repair, an increasing amount of backlogs and a long response time for
failure calls. The aim was to find ways to improve the procurement and the next
maintenance contract by learning from the experience and knowledge of other re-
gional track areas in this respect.
The benchmarking process followed the standard procedure recommended for
benchmarking as stated in an earlier section (Section 23.3). The study covered nine
local track areas named as Track Areas A–I, and six of these were selected for the
study and follow-up of qualitative interviews (D–I).

23.7.1 Metrics and Data

Before starting the collection of data and other relevant information, the existing
indicators and indices used by maintenance professionals available in the literature
and through professional bodies, for example the EFNMS indices (EFNMS 2006),
were examined for their suitability for the purpose of benchmarking maintenance
practices in different track regions at Banverket. Most of these metrics were not
found suitable for the purpose of this study and therefore actions were initiated to
establish indicators that would facilitate this benchmarking process. Furthermore,
information and data which were planned to be included in the study, namely
details of maintenance-related measures such as maintenance costs, maintenance
hours, material, maintenance vehicle costs, overhead costs etc., were missing or
only available in the aggregate form, due to the competitive situation.
As the deregulation of the railway transport system in Sweden has led to
competition among the traffic companies, it was not possible to get hold of traffic
data, i.e. how the track was used, because this information is being treated as a
business secret by the train operators.
Data from 2002 were collected from the systems for accounting, the failure
reports, the inspection remarks, and the asset information and from the train delay
reports. The following data were collected:
• Asset data from BIS: total length of track, total length of operated track,
total amount of turnouts, total amount of operated turnouts, length of
electrification, number of protected level crossings. An attempt was also
made to define their standard by the assets’ age and what type of traffic
they had been exposed to – this had to be skipped as it was not possible to
obtain complete data for all the assets and different track lines. The purpose
was to know the intensity of track utilization.
• From the accounting system AGRESSO: snow removal and maintenance
costs for one year, defined per maintenance activity corresponding to the
maintenance contract (corrective, predetermined, condition-based etc.) and
cost per asset type (rail, sleeper, turnout etc.).
• From BESSY (inspection remark system): the number of inspection re-
marks, classified as remarks requiring immediate attention or deployment
of corrective measures or remarks requiring attention or correction in the
near future (deferred inspections remarks).
• From OFELIA (failure report system): failure reports (including asset type
and type of failure, time to fault localization and time to repair, symptoms
and causes, place, date and time), and the time taken to get established at the
fault site.
• From TFÖR (train delay system): train delay statistics corresponding to
infrastructure failures. TFÖR registers all the train delays and records them
together with the respective reported infrastructure failure.
• Contracts and procurement documents.

23.7.2 Data Collection

The data collected from the accounting system in particular needed normalisation,
owing to difficulties in separating normal track maintenance activities from track
renewal activities, as the two were frequently mixed in the database. There were
also difficulties with the prescribed terminology: misunderstandings of the
maintenance concepts meant that the common structure for reporting costs back
into the system was not used, and the data had to be sorted afterwards into the
“right boxes”. Some track areas were using
maintenance definitions and concepts from other branches representing the build-
ing and construction industry. Some “outliers” were also eliminated from the data,
especially those representing some special or just-one-time investments made to
increase train punctuality or reduce winter problems.
Cost drivers leading to non-availability of infrastructure for train operation or
affecting safety were identified. The respective train delay hours were also re-
trieved. The cost drivers for the infrastructure were failure or defects in rail,
sleepers, rail joints, turnouts, level crossings, and catenaries (overhead wire). On
further investigation it was found that the cost related to sleepers could be classi-
fied as outliers, because a large amount of the sleepers replaced in the 1990s were
delivered with inbuilt defects. These sleepers are being dealt with in a replacement
phase within the framework of a large project.
In order to find the best internal practice within the organization, two para-
meters, the “amount of corrective maintenance” and the management indicator
“return on fixed asset” (ROFA), were used.

23.7.3 Results and Interpretation

Track Areas A–I are the nine track areas, D–I are those selected by the infra-
structure manager for qualitative interviews and Track Areas A–C are references.
The data pertaining to various costs, corrective maintenance, condition based
maintenance and failure and delay statistics from Track Areas A–I for the year
2002 are given in Tables A.1–A.7 of the Appendix to this chapter.
When using the parameter ROFA and the rule of thumb concerning the lowest
amount of corrective maintenance, Track Areas B, G, C and H were the best
performers (see Figure 23.3) and the ROFA measurement showed a tendency of
“more money per track metre, less corrective maintenance”; see Figure 23.4.

Figure 23.3. Share of corrective maintenance and preventive maintenance for the nine track
areas studied (Espling 2004) [chart not reproduced; series: corrective maintenance,
preventive maintenance]

Figure 23.4. Maintenance cost per square metre of track area (Espling 2004) [chart not
reproduced; vertical axis: Skr per square metre]

Another comparison was made concerning the maintenance cost per metre
within the framework of the maintenance contract for each track region under
study. Track Areas H, C and G showed the best practice; see Figure 23.5.
It was noted that the maintenance cost varies greatly per asset or per track metre
unit among the compared track areas due to the asset standard, type of wear,
climate and type of traffic.
To compare the performance, the amount of functional failures and train delay
hours were listed as failure or delay hours per metre or per cost driving asset; see
Figure 23.6. Even here the best performance was shown by Track Areas G and H.

Figure 23.5. Maintenance cost in the maintenance contract [chart not reproduced; cost per
metre for each track area, broken down into predetermined maintenance, maintenance
inspection, safety inspection, repair of immediate inspection remarks, failure repair and
snow removal]

Figure 23.6. A comparison of the performance of the different track regions [chart not
reproduced; delay hours and number of failures and immediate inspection remarks per km or
per asset for Track Areas A–I; series: inspection remarks/km, failures/crossing,
failures/turnout, failures/km, h/catenary, h/track km]

All these results obtained from the comparison of different track regions, in
combination with the content of the maintenance contract defining work specifi-
cations within the maintenance contracts, were used for the gap analysis. The gap
analysis was conducted with the help of interviews with the track area managers
for Track Areas D–I. The best practice criteria were identified with the help of
interviews and survey questionnaires. The best practices were:
• Goal-oriented maintenance contracts combined with incentives
• Scorecard perspectives, quality meetings and feedback facilitate manage-
ment by objectives
• Frequent meetings where top managers from the local areas participate
• Forms for cooperation and an open and clear dialogue, for example partner-
• Focus on increased preventive maintenance of assets with frequent func-
tional failures and a high maintenance cost will give results, e.g. turnouts
• The use of Root Cause analysis
The best practices identified from the benchmarking study were immediately
implemented in the new purchasing procedures and documents. These were used
for floating tenders and for new contracts by the infrastructure manager for the
local track area initiating this benchmark, and resulted in maintenance contracts at
a much lower price with better control of quality and performance. The bench-
marking study also identified the best practice for gaining control over backlogs by
using SMS and other internet-based tools. Besides these, the maintenance contract
was also provided with information about goals, objectives and expected incentives
related to the execution of the maintenance contracts.

23.8 Case Study – 3: Transparency Among the European Railway Administrations

In an attempt to find ways of benchmarking railway infrastructure administrations
as an “external observer” and to give an answer to the question “is there any
transparency in the railway systems of Europe?” five railway administrations were
selected; see Table 23.3.

Table 23.3. Infrastructure managers (A–E) and important organisational differences

Infrastructure manager   Outsourced maintenance       Traffic operation   Traffic operators
A                        Both external and internal   Free service        Many
B                        Internal outsourcing         Free service        Few
C                        Internal outsourcing         Included            Few
D                        Both external and internal   Free service        Many
E                        Both external and internal   Is bought           Few

23.8.1 Metrics

In this study, many official documents, such as annual reports and regulation letters
and documents, were studied in detail in order to gain insight into the types of
measures, key performance indicators and indices used by the railway administra-
tions investigated (Åhrén et al. 2005). The collected measures were then compared
with those recommended by EFNMS in order to see if these could be used in future
benchmarking exercises. Rather soon it was found that the EFNMS indices were
developed for factories and plants and were not suitable for studying or bench-
marking the performance of infrastructures, as they did not consider the type of
asset, the age of the asset, the asset condition or the practice of outsourcing main-
tenance work in an open market.

23.8.2 Normalisation

Since data were qualitative in nature, no normalisation was carried out for the pur-
pose of this study.

23.8.3 Results and Interpretation

The next step was to group the measurements according to the unit which they
measured; for example cost went into the economy group.
The parameters collected and reported by the infrastructure managers were then
classified into different categories of common denominators. These categories
comprised the following: strong denominators (Sods) collected by everyone, me-
dium denominators (Sims) collected by more than 50%, and weak denominators
(Sews) collected by less than 50%, and finally some indicators (I) also identified as
Sods presented as a percentage value; see Figure 23.7. The results show that eco-
nomic values, safety, and traffic are strong denominators, followed by quality,
assets, and labour. It is important to note that “traffic” is the total traffic volume on
a national level. These parameters could later on be used to develop new bench-
mark measures, e.g. maintenance costs per staff and amount of accidents per traffic
Today the comparable indicators are:
• Corrective maintenance cost / total maintenance cost including renewal
• Total maintenance cost / turnover
• Maintenance and renewal costs / cost for asset replacement
• Maintenance cost / track metre
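The four comparable indicators listed above are simple ratios of annual figures. The sketch below shows one way they could be computed. The field names and all input values are hypothetical. It is also an assumption here that "total maintenance cost" excludes renewal unless renewal is named explicitly, which the source does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnualFigures:
    """Annual figures for one infrastructure manager (hypothetical field names)."""
    corrective_maintenance_cost: float
    maintenance_cost: float          # assumed to exclude renewal
    renewal_cost: float
    turnover: float
    asset_replacement_cost: float
    track_metres: float


def comparable_indicators(f: AnnualFigures) -> dict:
    """Compute the four ratio indicators from one year's figures."""
    maintenance_and_renewal = f.maintenance_cost + f.renewal_cost
    return {
        "corrective_share": f.corrective_maintenance_cost / maintenance_and_renewal,
        "maintenance_per_turnover": f.maintenance_cost / f.turnover,
        "maintenance_and_renewal_per_replacement_cost":
            maintenance_and_renewal / f.asset_replacement_cost,
        "maintenance_per_track_metre": f.maintenance_cost / f.track_metres,
    }


# Hypothetical values, chosen only to illustrate the calculation.
example = AnnualFigures(
    corrective_maintenance_cost=30e6,
    maintenance_cost=100e6,
    renewal_cost=50e6,
    turnover=400e6,
    asset_replacement_cost=3000e6,
    track_metres=1e6,
)
indicators = comparable_indicators(example)
for name, value in indicators.items():
    print(f"{name}: {value:.4g}")
```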
When comparing the outcomes of the findings only highly aggregated measures
were used for the purpose of analysis, in terms of:
• Economy
• Punctuality
• Safety
• Number of staff employed
• Track quality
• Total traffic volume divided up into passenger and freight kilometres
They can be used as benchmarking measures, the lag indicators showing past
performance. This indicates that these areas of interest are important for every
studied railway administration. It is also important to note that the identified
measures can be defined as outcome measures from the railway maintenance pro-
cess. It has not been possible to find any measures reflecting the actual maintenance
performance. This can probably be explained by the fact that the maintenance ac-
tivities are carried out by either in-house or external maintenance contractors (Åhrén
et al. 2005).
Some of the maintenance performance indicators are used by various organi-
zations and provide railways with an opportunity to benchmark their operations
internationally to improve their performance. One of the findings in the studies is
that there are parameters missing regarding the traffic volume, infrastructure age,
and history of the performed maintenance.
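The grouping of parameters into strong, medium and weak denominators described in this section follows a simple rule based on how many of the studied administrations report a parameter. A minimal sketch of that rule is given below. The parameter names and reporting counts are hypothetical, and treating a share of exactly 50% as "weak" is an assumption, since the text does not say how that boundary case was handled (it cannot occur with five administrations).

```python
def classify_denominator(reporting: int, total: int) -> str:
    """Classify a parameter by the number of administrations that report it."""
    if total <= 0 or not 0 <= reporting <= total:
        raise ValueError("reporting must lie between 0 and total, and total must be positive")
    if reporting == total:
        return "strong"   # collected by everyone
    if 2 * reporting > total:
        return "medium"   # collected by more than 50%
    return "weak"         # collected by less than 50% (exactly 50%: assumption)


# Hypothetical reporting counts for five infrastructure managers.
reported_by = {
    "total maintenance cost": 5,
    "track quality index": 3,
    "maintenance staff hours": 2,
}
classification = {
    name: classify_denominator(count, 5) for name, count in reported_by.items()
}
print(classification)
```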

Figure 23.7. Classification of possible comparable parameters

23.9 Discussion
The reason why most plants do not enjoy best practices in maintenance is that they
do not picture how to structure a sustainable improvement process (Oliverson
2000). Benchmarking can then be a tool for waking up organisations and their
management in order to find improvement areas that create more value from the
business process. However, on the way there are many pitfalls to be aware of, such
as starting the process without knowing the starting point and the destination
(Oliverson 2000; Wireman 2004). Other pitfalls are:
• Just doing quantitative benchmarking. Quantitative numbers just tell parts
of the story, and the difficulty is to start the sustainable improvement pro-
cess, by focusing on qualitative benchmarking (Oliverson 2000). If the
organisation does not have maturity or self-knowledge, it just glances at the
figures and continues to do as it always has done before.
• Rejection of the results. Managers often overestimate their performance
and react with disbelief to feedback that tells them that their plants are
merely mediocre (Wiarda and Luria 1998).
• Not being aware of the need for normalisation of data, including the prob-
lem of outliers or comparing “apples with bananas”.
• Not finding the enablers (Wireman 2004).
• Using benchmarking data as a performance goal.
• Believing that it is as easy as just copying the best practice into one’s own
organisation, rather than learning.
• Unethical benchmarking.
The methodologies for performing benchmarking for plants are rather well
developed, but need to be adapted for infrastructure. Today it is difficult to
establish what is included in maintenance, renewal and new investment. Other
difficulties concern how the infrastructure administrations are organized, for
example whether a client/contractor organization is used, whether the maintenance
is outsourced, and how it is outsourced; outsourcing makes it difficult to collect
costs for overheads, maintenance, man hours, spare parts, backlogs, etc.
Today there are a number of performance indicators in use connected to main-
tenance, covering for example the areas of safety, track quality and asset reliability.
Maintenance performance and cost control are the so-called lag indicators.

23.10 Conclusion
Stating that the “benchmarking of maintenance provides gains with relatively
little effort” is a truth that needs some modification. First of all, the theory of
maintenance is a rather young science, which has resulted in a lack of common
nomenclature and understanding of maintenance through value. This is one of the
reasons why it is difficult to define what is included in maintenance and where to
put the boundaries for renewal. There can also be different structures in use to
describe what operation is and what maintenance is, and also for grouping main-
tenance into preventive and corrective maintenance. Outsourcing maintenance has
become popular in recent years, and this makes it difficult to obtain all the
necessary measurements, especially if the outsourcing is carried out in a per-
formance contract (lump sum, fixed price). The assets’ complexity and condition
are also difficult to compare and measure.
The multitude of entities involved in the railway systems after their restructur-
ing has made it considerably difficult to locate the organization responsible for the
problems encountered and to ascertain the course of action to be taken to rectify
Benchmarking cannot be used if its results are not implemented. The benefits
from benchmarking do not occur until the findings from the benchmarking project
are implemented and systematically followed up and analyzed against the set
targets and goals.
The results from the three benchmarking studies presented show that bench-
marking is a powerful tool and its methodology can be used by other industries.
Since the focus of these case studies is the benchmarking process and not the con-
tinuous improvement process, it is important to point out the need for empowered
enablers, who will be responsible for identifying the problem, finding a solution to
the problem and implementing the solution and the continuous improvement
processes. The case studies also show that there is some more improvement to be
made in order to start the whole process of benchmarking including the implemen-
tation in an integrated manner.

23.11 Further Research

Further research could be conducted to identify those parameters that are essential
for developing lead indicators (Kaplan and Norton 1992) for effective planning and
execution of railway infrastructure maintenance tasks, by developing methods to
select, evaluate and implement these indicators in open market competition.
More metrics, i.e. indicators and a measurement framework, should be devel-
oped and reconfigured for maintenance, making comparisons possible, for example
from the Life Cycle Cost perspective vis-à-vis the business perspective. In railway
administrations, one critical improvement area is enhancement of the quality of the
incoming data. This can be achieved:
• By giving details of the status of the assets (age and degree of wear), the
total traffic volume per year and the available time on track for infrastruc-
ture maintenance. This information should be incorporated as a correction
factor in the analysis
• By well-structured economic feedback reports on maintenance activities.
These should be implemented so that it is possible to differentiate between the
resources consumed by corrective maintenance activities and those consumed
by preventive maintenance activities. The structure of the reports should also
make it possible to differentiate between operation, corrective maintenance and
preventive maintenance.
• By separating the specially targeted maintenance investment from normal
“maintenance activities”; efforts to enhance punctuality in special cam-
paign form are an example of the former.

23.12 Acknowledgements
The authors are grateful to Banverket (the Swedish Rail Administration) for
sponsoring this research work and providing information and statistics through free
access to their database.

Table A.1. Failure and delay statistics from Track Areas A-I for the year 2003
Track   Train delay   Train delay   Train delay        Amount of           Amount of          Amount of           Inspection
area    h/track km    h/turnout     h/catenaries km    failures/track km   failures/turnout   failures/crossing   remarks/track km
A       1.07          0.25          0.15               4.2                 3.5                2.5                 4.7
B       0.88          0.33          0.61               3.7                 2.9                1.9                 3.1
C       0.73          0.21          0.45               2.5                 1.68               1.5                 4.2
D       0.57          0.29          0.1                3.6                 2.24               1.3                 2.7
E       0.93          0.76          0.25               4.7                 4.59               1.5                 1.4
F       0.97          0.36          0.41               3.8                 2.22               1.0                 0.9
G       0.35          0.14          0.05               2.8                 1.28               1.3                 3.5
H       0.32          0.31          0.14               2.0                 2.24               1.1                 3.0
I       1.18          0.84          0.14               6.5                 6.1                1.9                 3.2
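To compare performance across track areas, the columns of Table A.1 can be ranked from lowest to highest. The sketch below does this for two of the columns, train delay hours per track km and failures per track km, using the values transcribed from the table. The helper name `rank_by` is an illustrative choice.

```python
# Selected columns of Table A.1:
# (train delay h/track km, amount of failures/track km).
table_a1 = {
    "A": (1.07, 4.2),
    "B": (0.88, 3.7),
    "C": (0.73, 2.5),
    "D": (0.57, 3.6),
    "E": (0.93, 4.7),
    "F": (0.97, 3.8),
    "G": (0.35, 2.8),
    "H": (0.32, 2.0),
    "I": (1.18, 6.5),
}


def rank_by(column: int) -> list:
    """Return the track areas ordered from lowest to highest value in a column."""
    return sorted(table_a1, key=lambda area: table_a1[area][column])


delay_ranking = rank_by(0)
failure_ranking = rank_by(1)
print("Train delay h/track km:", delay_ranking)
print("Failures/track km:", failure_ranking)
```

On the train delay column this ordering places Track Areas H and G first, which agrees with the observation in Section 23.7.3 that these two areas showed the best performance.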

Table A.2. Cost of various maintenance activities in thousands of SEK for each track area
for the year 2003
Track area   Snow removal   Corrective maintenance   Preventive maintenance   Contract sum
A            15,325         24,189                   14,130                   53,644
B            16,801         17,792                   12,941                   47,534
C            12,908         28,728                   10,863                   52,553
D            22,085         46,772                   20,537                   89,394
E            18,074         44,168                   21,532                   83,774
F            8,250          39,181                   15,991                   63,442
G            4,336          22,050                   26,388                   52,774
H            3,041          22,854                   19,131                   45,026
I            4,976          46,414                   31,803                   83,193

Normalisation is necessary due to the investment of extra money just for one year
to enhance the preparedness to deal with failures causing train delays. The figures
in Table A.2 are the figures before normalisation.
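Since the quality of incoming data is a recurring concern in this chapter, a basic consistency check on Table A.2 is a useful first step before any comparison. The sketch below, using the values transcribed from the table, flags rows whose three components do not add up to the printed contract sum and computes the corrective share of corrective plus preventive maintenance. Defining the share this way, with snow removal excluded, is an assumption made for illustration. Any flagged rows may simply reflect typesetting or extraction errors.

```python
# Rows of Table A.2 in thousands of SEK, before normalisation:
# (snow removal, corrective maintenance, preventive maintenance, contract sum).
table_a2 = {
    "A": (15325, 24189, 14130, 53644),
    "B": (16801, 17792, 12941, 47534),
    "C": (12908, 28728, 10863, 52553),
    "D": (22085, 46772, 20537, 89394),
    "E": (18074, 44168, 21532, 83774),
    "F": (8250, 39181, 15991, 63442),
    "G": (4336, 22050, 26388, 52774),
    "H": (3041, 22854, 19131, 45026),
    "I": (4976, 46414, 31803, 83193),
}

# Rows where the components do not add up to the printed contract sum.
mismatches = [
    area
    for area, (snow, corrective, preventive, total) in table_a2.items()
    if snow + corrective + preventive != total
]

# Corrective share of (corrective + preventive); snow removal is left out (assumption).
corrective_share = {
    area: corrective / (corrective + preventive)
    for area, (_, corrective, preventive, _) in table_a2.items()
}
lowest_corrective = min(corrective_share, key=corrective_share.get)

print("Rows with inconsistent sums:", mismatches)
for area, share in corrective_share.items():
    print(f"{area}: corrective share {share:.1%}")
```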

Table A.3. Costs in thousands of SEK for corrective maintenance due to failure reports from
OFELIA for the year 2003
Track Maintenance Emergency Actual cost Fixed price Total cost SEK/ failure
area organisation organisation (lump sum) (t SEK)
spare parts)
A 2880a 7,989 10,869 5933
B 4416a 6,145 10,861 5273
C 3732a 4,128 7,860 4690
D 4701 11,448 16,150 5379
E 4776 16,078 20,854 5073
F 4884 14,095 18,897 5530
G 12,686 5838
H 3512a 7,785 11,444 6065
I 20,274 6304 28,246 145
a Extra preparedness 2003

Table A.4. Cost statistics for corrective maintenance triggered by the failure reporting
system OFELIA (in thousands of SEK) after normalisation
Track Maintenance Emergency Actual cost Fixed price Total cost SEK/ failure
area organisation organisation (lump sum) (t SEK)
A 7,989 7,989 1832
B 2156 6,145 8,601 2060
C 1472 4,128 5,600 1676
D 4701 11,448 16,150 3002
E 4776 16,078 20,854 4111
F 4884 14,095 18,897 3417
G 12,686 2173
H 3512 7,785 11,367 1887
I 20,274 6 304 28,246 5490

Table A.5. Reported corrective maintenance caused by inspection remarks classifying faults
as requiring immediate repair; also including activities such as inspection and condition-
based and predetermined maintenance that should have been booked under other codes in
the accounting system (before normalisation of the data)
Track Inspection Mixes of Inspection cost Operational Care of Condition- Total cost
area remarks inspection including actions electrical assets based
calling for remarks calling inspection due to pre- due to pre- main-
immediate for immediate remarks calling determined determined tenance
repair repair and for immediate maintenance maintenance
CBM Remarks repair
A 13,320 13,320
B 6,931 6,931
C 12,355 1485 7081 20,921
D 16,361 7614 3558 3091 30,638
E 10,864 1962 4732 1486 19,044
F 9,963 3194 4289 2756 168 20,383
G 9,346
H 11,107 303 11,410
I 18,169 18,168

Table A.6. Reported corrective maintenance caused by inspection remarks classifying faults
as requiring immediate repair; also including activities such as inspection and condition-
based and predetermined maintenance that should have been booked under other codes in
the accounting system (after normalisation)
Track Inspection remarks Inspection remarks Corrective New total cost
area calling for immediate calling for immediate maintenance booked
repair repairbooked under as inspection in the
inspection accounting system
A 13,320 13,320
B 6,931 6,931
C 12,355 995 13,350
D 16,361 1904 1506 19,771
E 10,864 491 1553 12,908
F 9,963 799 8 10,770
G 9,346
H 11,107 11,410
I 18,169 916 19,084

Table A.7. Condition-based maintenance bought as extra orders in thousands of SEK, but
including the so-called special maintenance activity
Track area   Original accounting sum   Minus defective sleepers   New sum
A            32,319                                               32,319
B            43,831                                               43,831
C            44,139                                               44,139
D            6,607                                                6,607
E            81,720                    –60,913                    20,807
F            53,797                    –27,972                    25,825
G            50,753                                               50,753
H            45,198                    –7,680                     37,518
I            63,426                    –12,722                    51,004
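The normalisation shown in Table A.7 removes the cost of replacing defective sleepers, which was treated as an outlier, from the original accounting sum. The sketch below reproduces that subtraction from the transcribed values and compares the result with the printed "New sum" column. Treating an empty deduction cell as zero is an assumption. A row that does not reproduce may reflect a typesetting or extraction error rather than a different calculation.

```python
# Rows of Table A.7 in thousands of SEK:
# (original accounting sum, deduction for defective sleepers, printed new sum).
# Empty deduction cells in the table are entered as 0 (assumption).
table_a7 = {
    "A": (32319, 0, 32319),
    "B": (43831, 0, 43831),
    "C": (44139, 0, 44139),
    "D": (6607, 0, 6607),
    "E": (81720, 60913, 20807),
    "F": (53797, 27972, 25825),
    "G": (50753, 0, 50753),
    "H": (45198, 7680, 37518),
    "I": (63426, 12722, 51004),
}

# Normalised sum: original accounting sum minus the outlier cost.
normalised = {
    area: original - deduction
    for area, (original, deduction, _) in table_a7.items()
}

# Rows where the recomputed value differs from the printed new sum.
not_reproduced = [
    area
    for area, (_, _, printed) in table_a7.items()
    if normalised[area] != printed
]

for area, value in normalised.items():
    print(f"{area}: {value}")
print("Rows not reproduced by the subtraction:", not_reproduced)
```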

23.13 References
Almdal, W. (1994), “Continuous improvement with the use of benchmarking”, CIM Bulletin,
Vol. 87, No. 983, pp. 21–26.
Burke, C.J. 2004. 10 steps to Best–Practices Benchmarking.
Campbell, J.D. (1995). Uptime: Strategies for Excellence in Maintenance Management,
Productivity Press, Portland, US
Dunn, S. (2003), Benchmarking as a Maintenance Performance Measurement and Improve-
ment Technique. Assetivity Pty Ltd,
EFNMS (2006),
Espling, U. (2004), Benchmarking av Basentreprenad år 2002 för drift och underhåll,
Research Report, LTU 2004:16, (In Swedish).
Hägerby, M., Johansson, M. (2002). Maintenance performance assessment: strategies and
indicators. Master thesis, Linköping, Linköpings tekniska högskola, LiTH – IPE Ex arb
Kaplan, R.S. and Norton, D. P. (1992), The Balanced Scorecard: the measures that drive
performance, Harvard Business Review, Jan–Feb (1992), pp. 71–79.
Larsson, L. (2002). Utvärdering av underhållspiloterna, delrapport 1. Banverket F02-
§713/AL00. (In Swedish).
Liyanage, J.P. and Kumar, U. (2003). Towards a value-based view on operations and
maintenance performance management, Journal of Quality in Maintenance Engineering,
Vol. 9, pp. 333–350.
Malano, H. (2000), Benchmarking irrigation and drainage performance: a case study in
Australia. Report on a Workshop 3 and 4 August 2000, FAO, Rome, Italy.
Mishra, C., Dutta Roy, A., Alexander, T.C. and Tyagi, R.P. (1998), Benchmarking of
maintenance practice for steel plants, Tata Search 1998, 167–172.
Moulin, M. (2004), Eight essentials of performance measurements, International Journal of
Health Care Quality Assurance, Vol. 17, No. 3, pp. 110–112.
Oliverson, R.J. (2000), Benchmarking: a reliability driver, Hydrocarbon Processing, August
2000, pp. 71–76.
Ramabadron, R., Dean Jr J.W. and Evans J.R. (1997), Benchmarking and project
management: a review and organisational model, Benchmarking for Quality
Management & Technology, Vol. 4, No. 1, pp. 437–458.
Benchmarking of the Maintenance Process at Banverket 583

Stalder, O., Bente, H. and Lüking, J. (2002), The Cost of Railway Infrastructure. ProM@ain
– Progress in Maintenance and Management of Railway Infrastructure, 2, pp. 32–37.
Varcoe, B.J. (1996), Business-driven facilities benchmarking, Facilities, Vol. 14. Number
3/4, March /April, pp. 42–48, MCB University Press.
Wiarda, E.A. and Luria, D.D. (1998), The Best-practice Company and Other Benchmarking
Wireman, T. (1998), Developing Performance Indicators in Maintenance. New York:
Industrial Press Inc.
Wireman, T. (2004), Benchmarking Best Practice in Maintenance Management. New York:
Industrial Press Inc.
Zairi, M. and Leonard, P. (1994). Practical Benchmarking: the Complete Guide. London:
Integrated e-Operations–e-Maintenance:
Applications in North Sea Offshore Assets

Jayantha P. Liyanage

24.1 Introduction
There is clear and growing interest today in the development and use of
e-maintenance concepts for industrial facilities. This is particularly seen in the
offshore oil and gas (O&G) production environment in the North Sea in relation to
a major re-engineering process termed ‘integrated operations’ (IO) that began in
2004–2005 as a new development scenario for the offshore industry (OLF 2003).
Major challenges to conventional operations and maintenance (O&M) practice have
been seen as unavoidable under this new IO initiative. Subsequently, the industry
began to take a serious interest in novel and smart solutions for O&M. The
developments began in 2005, seeking long-term changes to the conventional O&M
practice. The change process was relatively slow during the 2005–2006
period, but seems to have gathered a gradual and steady pace since. This is a
large-scale change, and hence the current plan is to realize fully functional
e-operations–e-maintenance status by about 2012–2015. Even though the integrated
e-operations and e-maintenance applications in the North Sea are still at their
inception, the learning process and the state of current knowledge can be very
valuable for similar efforts in the development and implementation of novel
solutions in other industries and/or regions in the world.
Current developments in Norway exemplify that the growing smart use of
advanced information and communication technology (ICT) solutions is a principal
driving factor in the development and implementation of novel and smart solutions
to realize e-maintenance (Liyanage and Langeland 2007; Liyanage et al. 2006). In
principle it seeks to establish better offshore-onshore connectivity and
interactivity, enhancing decisions and work processes. The emerging O&M practice
will be based on a smart blend of application technologies, novel managerial
solutions, new organizational forms, etc. to enable 24/7 online real-time
operating modes. The new set of O&M solutions for North Sea offshore assets is not
simply about the use of some form of core technologies for electronic data
acquisition and so on, but a large-scale re-engineering process dedicated to making
a significant change to the conventional O&M practice based on a solid technical
platform. It is noteworthy that, even though the changes within O&M are by far
mostly technology-dependent, their managerial implications are inevitable, and
managerial changes have to be properly blended into the technology-based change.
Such an integrated change is very critical in terms of the technical and safety
integrity of assets, and the subsequent commercial impact in terms of production,
plant economics, and safety and environmental performance.
Ongoing developments in Norway provide a good example of how an industry-wide
re-engineering process has triggered major changes in O&M practice, leading
the path towards integrated e-operations–e-maintenance. It implies that integrated
e-operations–e-maintenance initiatives in Norway are not a standalone, short-term
technical change limited to O&M, but an integral part of a wider, long-term
development process that combines various technical disciplines and different
sectors of the industry seeking an optimum and long-term solution. In this
context, there are two salient features that define the future of e-based O&M
practice in the Norwegian O&G industry:
• Integration with other technical disciplines that have major roles in the
realization of fully functional 24/7 online real-time operational status
• The important technological and managerial change that the e-approach has
to incorporate to ensure fail-safe status
Owing to the growing interest in, and the importance of, integrated
e-operations–e-maintenance, learning from different application scenarios in
various industries is timely. This chapter shares current experience and
knowledge with reference to ongoing developments in the Norwegian O&G industry.
The chapter highlights current offshore asset maintenance practice, the changing
technical and economic environment that leads the path towards an e-approach,
the development and implementation of integrated e-operations and e-maintenance
solutions in the North Sea, key features of the e-approach in North Sea assets, and
future challenges to being fully integrated and fail-safe. The specific acronyms and
their application definitions are given in Section 24.2, and Section 24.3 contains
some recent reflections on the work on e-maintenance. Section 24.4 gives a brief
introduction to offshore asset maintenance. It describes current thinking, practice,
and visible trends. The technical and economic environment that shapes and guides
a shift towards e-operations and e-maintenance is discussed in Section 24.5. It
illustrates some of the major drivers that demand technological and managerial
integration in search of comprehensive solutions for offshore assets in the North
Sea. The section that follows (Section 24.6) highlights issues related to the
development and implementation of integrated e-operations and e-maintenance
solutions on the Norwegian continental shelf. The major features of the e-approach
for operations and maintenance are highlighted in Section 24.7, which pays specific
attention to the diagnostic and prognostic technologies and the emerging
infrastructure (i.e. ICT network, onshore centers) for their implementation and use.
Since the emerging environment represents a step change towards a more complex
operational setting, there are numerous challenges to realizing a reliable, fully
integrated status and to remaining fail-safe. Section 24.8 briefly covers these
issues, and highlights the critical role and specific features of intelligent
watchdog agent technology in this context. This section also underlines some of the
important non-technical issues that play pivotal roles in terms of being fully
integrated and fail-safe.

24.2 Acronyms and Application Definitions

Acronyms given in Table 24.1 are used throughout the chapter.

Table 24.1. Acronyms in the chapter and their application definitions

Acronym Application definition

B2B Business-to-business
CBM Condition based maintenance
CMMS Computerized maintenance management system
CV Confidence value
D2A Decisions-to-actions
D2D Data-to-decisions
ERP Enterprise resource planning
ICT Information and communication technology
IO Integrated operations
IT Information technology
LAN Local area network
NCS Norwegian continental shelf
NOK Norwegian Crowns (Norwegian currency)
NPD Norwegian Petroleum Directorate
OLF Norwegian Oil Industry Association
OOC Onshore operational center
OSC Onshore support center
O&G Oil and gas
O&M Operations and maintenance
PDA Personal digital assistant
PM Preventive maintenance
PSA Petroleum safety authority
RUL Remaining useful life
R&D Research and development
SAP A commercially available business ERP system
SOIL Secure Oil Information Link
WAN Wide area network

24.3 Current Reflections on e-Maintenance

Over the last few years, e-maintenance has drawn the attention of both industry
and academia. With growing attention to and interest in near-zero-downtime
performance, cost-effective maintenance strategies, data-dependent
decision support systems, etc., conventional maintenance practices have largely
been challenged during the last couple of decades (Hansen et al. 1994; Bonissone
1995; Emmanouilidis et al. 1998; Khatib et al. 2000; Roemer et al. 2001; Koc and
Lee 2001; Swanson 2001; Djurdjanovic et al. 2002; Wang 2002; Iung 2003; Yen
2003; Arnaiz et al. 2005; Han and Yang 2006). Subsequently, industrial practice
gradually showed some inclination to adopt condition monitoring as a strategic tool
to resolve some major challenges in various plants, facilities, and industrial
settings. The emergence of various condition monitoring solutions coupled with
data acquisition and presentation software appears to have laid a good foundation
for further development of technology-based maintenance solutions leading the
path towards diagnostics and prognostics (Liao et al. 2005; Emmanouilidis et al.
2006; Jardine et al. 2006). Current waves of interest in a range of e-maintenance
solutions are largely dependent on parallel development in information technology
infrastructures and communication technologies enhancing online communication,
remote monitoring capabilities, remote expert assistance, etc. As the R&D activi-
ties gradually progress seeking novel solutions to the conventional condition moni-
toring practices, more advanced solutions have begun to appear generating a strong
focus on intelligent maintenance solutions (Sanz-Bobi et al. 2002; Iung 2003; Lee
2004; Moore and Starr 2006). The trend appears to be towards more robust and
comprehensive technical solutions where data acquisition, processing and interpre-
tation, and decision support components are integrated. Along this line of practice,
the developments in the discipline seem to be progressing towards intelligent
e-maintenance solutions. Furthermore, some interesting work has also been per-
formed incorporating for instance neural networks, expert systems, fuzzy logic,
genetic algorithms, multi-agent platforms and case based reasoning, etc. (Liang et
al. 1988; Yager and Zadeh 1992; Jantunen et al. 1996; Lee 1996; Chande and
Tokekar 1998; Sanz-Bobi and Toribio 1999; Yang et al. 2000; Garcia and Sanz-
Bobi 2002; Marceguerra et al. 2002; Yu et al. 2003; Palluat et al. 2006). Moreover,
the growth of R&D activities has resulted in introduction of novel application
concepts and products such as PROTEUS (Bangemann et al. 2006), EXAKT
(Jardine et al. 1998), Watchdog agents (Djurdjanovic et al. 2003), SIMAP (Garcia
et al. 2006), etc.
Condition monitoring and e-maintenance solutions have already
shown substantial potential for wider industrial applications. However, the type
of solutions required and the nature of the practical applications may differ from
one setting to another depending on the commercial challenges and the available
technical infrastructure. This chapter gives an overview of this with reference to
current developments in the North Sea offshore asset management environment.

24.4 Offshore Asset Maintenance

An O&G production asset in general is at the heart of the petroleum business, and
the oil or gas it produces is in fact its life-blood. A production asset can
broadly be seen as a property with an economic value. Such a physical property can
involve a number of integrated modules. In O&G terms, a production asset comprises
modules for extraction, processing, treatment and supply of raw oil and/or gas to a
refinery or straight to a market (see also Liyanage 2003). With reference to the
physical process of O&G production, a production asset primarily consists of:
• Reservoir(s) containing oil and/or gas
• Production and injection wells
• Production platform(s) and drilling/injection rig
• Pipes for export of production
An O&G production asset is a complex mechanical design involving various
machinery, tools, mechanisms, etc. in the production process. The production
process comprises different stages and has four major technical processes, namely:
• Reservoir engineering, drilling and well intervention
• Development and modifications
• Operations and maintenance
• Logistics and support services
The O&M process plays a major role in production platforms (the so-called
‘topside’) and rigs. Production and injection wells also require some maintenance,
but this is a highly specialized technical area. This chapter mainly covers
O&M aspects of the topside. O&M, inclusive of testing and inspection, is an
important discipline in terms of the technical condition and the mechanical
integrity of an O&G asset. Necessary functional and technical conditions are
achieved through a blend of O&M strategies, programs, and technologies. A diversity
of O&M strategies and management practices may be necessary during the life of an
asset, which in general is under operation for 20–30 years of commercial production.
The challenges to plant O&M can be quite dramatic, particularly at the beginning
and end of the production life cycle, i.e. in the startup phase and in the tail-end
production phase (when production begins to decline gradually). During
various stages of the life cycle, demands for maintenance can also vary, for
instance, due to design flaws, varying operational conditions (pressure,
temperature, etc.), ageing equipment, outdated O&M procedures, modifications, and
so on. The fact that a good number of production platforms on the Norwegian
shelf are at the moment in the maturity and tail-end phases of production poses a
significant challenge, and it demands novel solutions to improve maintenance.
Obviously there is a common cause for performing O&M activities in various
O&G production assets, i.e. commercial, or statutory and regulatory. However,
there can be differences among O&M programs and practices performed by
various producers. Such variations can exist, for instance, due to age of
installations and equipment, scale of production operations, level of technological
complexity, competence availability, budgeted operating costs, etc.
Preventive maintenance (PM) tasks account for the larger portion of the
maintenance work performed in offshore installations. Such PM programs can be based
on industry practices, third party recommendations, or reliability analysis. PM
programs are built into running maintenance plans and thus are executed as
calendar-based or periodical maintenance tasks. The planning process can, for
instance, be done on a 3-month or 7-week basis, and can be frozen weekly for
execution offshore. One of the major concerns related to current maintenance
practice is the consequence of performing more PM on equipment in offshore plants
than is actually required. Excessive PM has significant commercial implications in
terms of production interruptions, although it does ensure compliance with
strict regulatory requirements, particularly for safety critical equipment. Lately,
there seems to be some general preference for the use of condition monitoring
techniques and risk-based methods. While methods such as risk-based inspections
are already available, technology experts believe that application of CBM
techniques together with risk computation can be of great benefit as it can greatly
facilitate ‘need based maintenance’. This implies that the experts can identify
more precisely the point in time at which certain maintenance tasks have to be
performed, based on risk conscious decisions. This is expected to bring substantial
commercial benefits by prolonging maintenance intervals and thus reducing
production interruptions. However, CBM techniques have conventionally not been
widely applied other than on an ad hoc basis or on special rotating machinery such
as turbines. It has so far been a challenge to make effective use of condition
monitoring in the production facilities on the Norwegian shelf. Some applications
are in use, such as vibration monitoring on heavy rotating equipment, thermography
on electrical equipment, and oil analysis. However, many producers have been
struggling to capitalize on the inherent potential of CBM technologies for quite
some time. The underlying bottlenecks are largely related to the physical distance
between offshore assets and the onshore support organization, the availability of
expertise to the site at a moment’s notice, and reluctance by some of the producers
to initiate a quick response based solely on a CBM expert’s opinion, since they
conventionally rely heavily on the original equipment manufacturer’s
recommendations and guidelines, etc.
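
The idea of combining condition information with risk computation to support ‘need based maintenance’ can be illustrated with a small sketch. The Python code below is only an illustration under stated assumptions: it takes risk to be the product of an estimated failure probability and an estimated failure consequence, and flags a task when that risk exceeds a chosen limit. The class `Equipment`, the functions `risk` and `tasks_due`, the equipment tags and all numbers are hypothetical and are not taken from any North Sea application.

```python
from dataclasses import dataclass


@dataclass
class Equipment:
    tag: str
    # Assumed input: estimated probability (0..1) that the item fails before
    # its next planned task, e.g. derived from condition monitoring data.
    failure_probability: float
    # Assumed input: estimated cost of a failure (hypothetical units).
    consequence_cost: float


def risk(item: Equipment) -> float:
    """Simple risk measure: probability of failure times consequence."""
    return item.failure_probability * item.consequence_cost


def tasks_due(items, risk_limit):
    """Return the tags whose risk exceeds risk_limit, highest risk first."""
    due = [item for item in items if risk(item) > risk_limit]
    return [item.tag for item in sorted(due, key=risk, reverse=True)]


# Hypothetical example data.
FLEET = [
    Equipment("gas turbine", 0.02, 50_000),
    Equipment("seawater pump", 0.10, 2_000),
    Equipment("export compressor", 0.05, 30_000),
]

if __name__ == "__main__":
    print(tasks_due(FLEET, risk_limit=500))
```

With these illustrative figures, only the compressor and the turbine exceed the limit, so the pump's task could be deferred. In practice the probability and consequence estimates, and the risk limit itself, would come from the operator's own reliability and regulatory analysis.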
O&M organizations, on the other hand, appear to be gradually becoming more
team-based. Recent downsizing moves, an ageing workforce, and the ongoing efforts to
integrate maintenance and operational crews have contributed much to this trend.
The way in which such teams are formed and the way they carry out work may
vary from one situation to another. Work teams can be dedicated to individual
plants (i.e. dedicated work teams), and certain teams may also be involved in
campaign maintenance (i.e. fly-out maintenance) tasks. Campaign activities imply
that while dedicated work teams carry out asset-specific tasks, there are teams
(called campaign teams) with specific technical expertise (e.g. for turbine
maintenance) who carry out certain specific PM tasks in addition to the dedicated
maintenance personnel. They fly across platforms attending to pre-assigned tasks in
accordance with maintenance programs registered in the system. Administratively,
campaign teams may be responsible to the maintenance manager, while functionally
they may be responsible to individual assets. In addition, certain technical
expertise can also be called in for specialized maintenance tasks either from the
onshore support organization of the producer or from a third party maintenance
contractor. Such competence compensation strategies are relatively ad hoc and can
take place on a need-by-need basis. Regardless of various strategies aiming at the
best use of available competencies, the O&G industry in the North Sea has suffered
from a scarcity of competent labor for some years now. The situation has further
been aggravated by the ageing workforce, where in some instances close to
50–60% of competent and experienced personnel are said to be reaching the
retirement age in a few years’ time. Major producers have already begun to resort
to such options as outsourcing and insourcing to compensate for the growing
competency gap.
Despite third parties playing pivotal roles in the entire O&G production
process, contract management seemingly remains relatively undeveloped and is the
subject of continuing discussion and debate. It is often pointed out that the
potential of external maintenance technicians, engineering contractors, CBM
experts, etc. has not been fully capitalized on. However, under the competence
mapping and
outsourcing-insourcing efforts by the producers, some of the issues have been
taken up for open discussion. There seems to be a relatively wider acknowledge-
ment today that the knowledge industry that is external to the producers has a
significant potential and that more prudent use of such resources is important for
the long-term benefits of the industry. The industry in general has begun to explore
win-win options to establish better commercial relationships between producers
and third parties.
Also, there is a growing inclination towards the use of more advanced technical
means for doing maintenance today. More and more maintenance-free products and
exotic gadgets have drawn the attention of engineers. One such example is
usage of smart sensors for gas detection, whose built-in self-testing capability
removes the requirement for periodical inspection and functionality tests. Equally,
the dependence on IT tools for O&M management has gradually increased with
notable improvements in data and knowledge management capabilities. Certainly
recent developments in the IT sector have specific effects on maintenance planning
and decision-making processes. O&M management tools are often seen built into
corporate ERP systems such as SAP, but the effective use of such capabilities and
the efficiency with which they are put into use still need some major improve-
ments. In many cases, the biggest problem notably is the use of different databases
without an effective configuration for data acquisition. Given the large volume of
sensitive data accumulated and stored in those databases, effective and efficient
data management is often seen as a daunting and resource-consuming task.

24.5 e-Approach: Changing Technical and Economic Environment

The global industrial environment is being strongly challenged today both in
engineering and management terms. There is a clear growth of application
technologies, engineering techniques, organizational forms, management principles,
cooperative policies, etc. to cope with the complex socio-economic and
techno-political change processes. The trend of deviating from conventional wisdom
and practices has become more and more clear, seeking to adopt creative, innovative,
and smart solutions to manage complex systems for commercial advantage (During
et al. 2004; Hosni and Khalil 2004; Russell and Taylor 2006). With the growth of
business uncertainties, the enterprise risk profile has become more complex,
demanding more flexible, collaborative, and open strategies to support various
operational activities in industrial plants and facilities. The emerging commercial
environment has already indicated a greater reliance on new technological
and managerial solutions to manage important asset processes such as O&M,
establishing a new landscape for commercial activities. This seems to be a generic
trend among almost all commercial business sectors, though to varying degrees,
where the dependence on advanced technological solutions to manage complex
technical systems is rapidly growing. The resulting environment will obviously be
very dynamic, enabling key stakeholders of complex technical systems to remain
intact within an extended live network (Wang et al. 2006).
The production, manufacturing, and process industries are seen to be directly
impacted by the new demands and the wave of subsequent changes. Technologically
complex and high-risk businesses in particular cannot afford to divert their
management strategies for complex assets away from the mainstream technology-driven
change. Today different industrial sectors are seen adopting various novel and
integrated solutions to manage their industrial assets and internal processes to
realize major commercial benefits. Rapid advancement in information
and communication technologies (ICT) has often been a catalyst for progress in
technology applications (e.g. diagnostic technologies) and data management
solutions, particularly for complex systems such as offshore oil and gas (O&G)
production platforms.
O&G activities on the NCS began in the early 1970s with the discovery of the
great Ekofisk asset. Ever since, NCS has been a major supplier of oil to the world
energy market. Today, after more than 30 years of continuous production, NCS has
stepped up to its peak level. Despite the fact that NCS foresees a gradual decline
after 2010 or so, the remaining potential is known to be substantial. But the future
is known to have a unique set of challenges with a major need to enhance the
recovery efficiency so that the commercial lives of major production assets can be
extended by another 40–50 years. By 2003–2004, the forthcoming challenges to
O&G exploration and production activities in the North Sea became very obvious. The
major part of the industry became relatively more inclined to resort to advanced
application technologies to address underlying commercial risks. At the same time
the industry has been undergoing some other challenges widely acknowledged as
serious impediments to future growth on NCS. For instance, the industry has been
experiencing some major setbacks in attracting talent, and in centralizing core
competencies. The problem has been further aggravated by the ageing workforce
with no suitable remedy to solve competency gaps. Industry restructuring has been
seen by the majority as a feasible solution to provide a tighter integration and
partnerships with the knowledge industry. Table 24.2 illustrates the complex set of
economical and technical drivers that challenged the conventional practices in the
North Sea O&G production environment.

Table 24.2. Technical and economical factors that contributed to a step-change in North Sea
asset management practice also introducing changes to conventional O&M practices

Risk and uncertainty profile
Risk and uncertainty profile is seen to be too large to ignore due to maturing assets,
declining production, rising lifting costs, discovery of marginal fields, declining
investments for developments, lower recovery efficiency, etc.

Commercial incentives
The underlying commercial incentives of a major change have been very convincing.
This mainly includes substantial enhancement in production recovery at least by 10%
or more, significant reduction in operating costs, and major safety and
environmental benefits

Technical and managerial setting
The emerging technological and business environment have given its own solutions to
counter attack major problems. Such solutions seem to be feasible through
application technologies and ICT solutions, business-to-business collaboration
forms, closer inter-disciplinary integration to jointly manage offshore activities,
standardized platforms for dynamic data and knowledge sharing

Business conditions
Various other industrial circumstances have also constantly been demanding some form
of a change to the conventional industry practice. This is primarily due to
remaining substantial un-tapped value potential, major need for more open and
flexible partnerships, emerging competency gaps, obsolete technologies and ageing
equipment, more complex and new kind of challenges in production settings, etc.

Under such circumstances, the risk-and-uncertainty profile on the NCS appeared to
be too high to ignore for major O&G producers. This brought a major momentum
to challenge the conventional practices of core technical disciplines such as O&M,
drilling, etc. Subsequently, key stakeholders directly stepped in to re-engineer the
conventional practices, targeting long-term commercial advantages. Thus, the O&G
business in Norway has, since 2004, stepped into what has been termed integrated
e-operations as a new development scenario for the continental shelf. This is known
as the ‘third efficiency leap’ for O&G activities on the Norwegian shelf. This
was further envisioned by the Norwegian parliament through the report no. 38:
2003–2004. Today, this has become an industry-wide program with major national
interest, drawing NOK billions of investments from various sources for
re-engineering tasks and further development.
Under integrated e-operations major improvements are expected in three
technical asset processes, namely:
• Drilling and well intervention
• Reservoir management and production optimization
• Operations and maintenance
O&M drew the attention of industry slightly later than the two other technical
disciplines, but is widely acknowledged today as a technical process that has
substantial improvement potential. In fact some signs of development within O&M
began to appear in the 2005–2006 period. Nevertheless, it has been known for some
time that the conventional O&M process has large limitations and that some
well-established O&M policies have been a significant hindrance to bringing
cost-effective and efficient solutions. The integrated e-operations–e-maintenance
concept for North Sea assets brought forward a long-term development path for the
O&M process from 2005 onwards with substantial opportunities to:
• Test out and implement new technological solutions particularly enabling
predictive maintenance capabilities
• Implement more robust technical platforms for effective O&M data
• Establish new organizational forms to compensate for a lack or shortage of
experienced O&M workforces
• Standardize the technical language in use between different stakeholders to
enhance communication and cooperation
• Provide fast access to technical experts in demanding and urgent situations
• Build an agile competence network to enhance decisions and activities
The experience so far is that ongoing activities will eventually result in
a relatively more dynamic and complex functional environment. However, it is also
noteworthy that e-operations–e-maintenance is a sensitive change process in
terms of safety and security, and thus has its own challenges in becoming fully
functional and fail-safe.

24.6 Development and Implementation of Integrated e-Operations–e-Maintenance Solutions in the North Sea

The new integrated e-operations–e-maintenance scenario, as mentioned above, has
its major focus on changing conventional practices. The Norwegian O&G industry
in principle has looked into smart use of advanced application technologies and
information solutions as the driving forces to push forward smart O&M solutions
for offshore assets. Implementation of new solutions depends on the four factors
shown in Figure 24.1. These are:
• Advanced technologies that enhance the maintainability of assets
• Digital IT infrastructure that enhances reliable transfer and exchange of
O&M data between different stakeholders
• Active operational networks connecting the producer’s O&M personnel live
with those of engineering contractors, original equipment manufacturers,
logistics suppliers, other external technical groups, etc.
• Business-to-business (B2B) collaborative partnerships that lay the foundations
to create a reliable information and knowledge network

Figure 24.1. e-Approach to O&M in North Sea assets in principle relies on fourfold aspects:
advanced technologies, digital infrastructure, B2B collaborative partnerships, and active
operational networks

Partner industries, particularly those related to electronic and communication
technologies, have major effects on the new O&M environment as they have to provide
a stable and reliable technical infrastructure to support O&M decisions and work
processes. As of today, there are clear indications of the use of the following
technologies for integrated O&M solutions:
• Web-based data exchange and communication networks
• Real-time visualization and simulation tools
• Equipment with built-in smart electronics and advanced functionalities
• Online diagnostic and prognostic tools and methods
• Process automation and real-time data acquisition techniques
• Online video conferencing and monitoring facilities
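
As an illustration of what an online prognostic tool in the list above might compute, the following Python sketch estimates a remaining useful life (RUL) by fitting a straight line to a monitored condition indicator and extrapolating it to an alarm level. This is a deliberately simple trend extrapolation chosen for illustration, not the method used by any specific North Sea system; the function name `estimate_rul`, the alarm level and the sample readings are all hypothetical.

```python
def estimate_rul(times, readings, alarm_level):
    """Estimate remaining useful life by linear trend extrapolation.

    Fits readings = intercept + slope * time by ordinary least squares and
    returns the time from the last observation until the fitted line reaches
    alarm_level. Returns None when the trend is flat or decreasing, i.e. when
    the indicator is not degrading towards the alarm level.
    """
    n = len(times)
    if n < 2 or n != len(readings):
        raise ValueError("need at least two (time, reading) pairs")
    mean_t = sum(times) / n
    mean_y = sum(readings) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("observation times must not all be equal")
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, readings))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    if slope <= 0:
        return None
    time_at_alarm = (alarm_level - intercept) / slope
    # An indicator already beyond the alarm level has no remaining life.
    return max(0.0, time_at_alarm - times[-1])


if __name__ == "__main__":
    # Hypothetical vibration readings (mm/s) taken every 10 days.
    days = [0, 10, 20, 30]
    vibration = [2.0, 2.5, 3.0, 3.5]
    print(estimate_rul(days, vibration, alarm_level=5.0))  # roughly 30 days
```

Real degradation is rarely linear, so a practical tool would typically use a richer degradation model and report a confidence value alongside the estimate.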
Wireless communication capability appears to be a major step forward in the
new O&M environment. Such smart technological products as ‘VisiWear’ are al-
ready in use. This is a man-wearable technology with live video and audio commu-
nication capabilities between offshore and onshore.
The technology-dependent change has direct implications for the establishment
of new organizational forms that improve on conventional O&M management
practice. This is achieved through an active blend of digital technologies and
infrastructures with active operational networks and business-to-business
collaborative partnerships. The ongoing rapid reformation of industry infrastructure
aims to enhance the live interactivity between the different stakeholders involved
in O&M decisions and activities. This helps the systematic establishment of tight
online and real-time collaborative partnerships between the O&M crew in offshore
assets and those positioned in the onshore support system. In fact, the current
tendency to test out and implement novel O&M solutions actively seeks options to
bring in other sectors of the industry effectively (e.g. engineering contractors,
equipment suppliers, technical expert centers, spare-part vendors, logistics, etc.). It
is in this application context that large-scale ICT networks and web-enabled
solutions play a key role in establishing the necessary connectivity and interactivity
between dispersed groups and organizations. It implies that integrated O&M
solutions, as experienced in the North Sea offshore environment, break the
conventional boundaries, for instance advancing:
• From in-house competencies to collaborative shared-expertise
• From centralized databases to open data management landscapes
• From localized on-the-site diagnostics to remote monitoring
596 J. Liyange

• From on-the-site O&M expert interventions to tele-consultancy capabilities, etc.
Figure 24.2 is a schematic diagram of the technical infrastructure in the North
Sea that facilitates realizing integrated e-operations e-maintenance.

[Figure 24.2: schematic with the following labels. Sources of data at the offshore asset: distributed control and monitoring systems, experience data, direct visualization, offline-online technical data, intelligent systems and components. Links: central datahub, wireless network and satellite, radio links, fiber-optic and IP-VPN network. Onshore support system: asset operator, equipment and spare parts, logistics and emergency, offshore O&M contractors, technical/engineering expertise. Backbone: advanced fibre-optic based and wireless information and communication network.]

Figure 24.2. Technical infrastructure for integrated e-operations–e-maintenance solutions in
the North Sea

The figure highlights that the functional landscape for the establishment of an
e-based O&M setting in the North Sea is a relatively complex combination of various
technical as well as social elements. The synergy among at least three elements is
critical in the development of the necessary technical infrastructure, i.e.:
• Advanced process and safety technologies implemented in equipment in
offshore assets that allow real-time data acquisition and transfer
• Large-scale ICT network with an appropriate bandwidth that uses
wireless, fiber-optic and web-enabled capabilities to enable sharing of
acquired data and communication traffic on a 24/7 basis
• Well equipped onshore expert centers with built-in advanced data manage-
ment capabilities and collaborative technologies to process and interpret
data, and to stay connected with offshore assets as well as other partners to
interact online for enhancing decisions and activities
Such a large-scale technical setting can perhaps be considered as the heart of
e-operations–e-maintenance activities, as it allows:
• Integration of geographically dispersed knowledge centers creating a vir-
tual workplace
• Establishment of 24/7 online net-based connectivity to provide easy and
fast access to remote experience and knowledge
• Access to a reliable IT network with higher bandwidth and speed to acquire,
process, and interpret volumes of real-time data
The largest implication of such a setting by far is the significant improvement
in decision-making and work processes. The connectivity and interactivity
between offshore and onshore, as well as between different onshore-based
competence groups and knowledge centers, allow more effective decision loops and
more coordinated planning and execution of O&M activities (see Figure 24.3).
Smart combination of real-time data with multi-disciplinary expertise has major
benefits for the effectiveness and efficiency of data-to-decision (D2D) processes.
Continuous monitoring of the functional status of equipment and joint interpretation
of technical and safety integrity levels with the experts in the active network, on the
other hand, bring major benefits to the decision-to-action (D2A) processes. Such
benefits are already visible in terms of the time and quality of D2D and D2A
processes, and are said to be very encouraging and commercially attractive for
further improvements.

[Figure 24.3: digital technology platforms and collaborative management support integrated O&M work processes. D2D processes: coupling the real-time equipment data with rapid analysis techniques and joint decision making. D2A processes: joint coordination and planning of maintenance work by use of advanced communication capabilities and technical data sharing platforms.]

Figure 24.3. Integrated e-operations–e-maintenance brings key solutions to enhance
data-to-decision (D2D) and decision-to-action (D2A) processes

The targeted benefits of these developments within O&M, together with those
in other technical disciplines, are expected to accrue continuously over a 30–40
year time span. The key value creation elements identified include, for example,
methods and techniques to reduce uncertainty in data interpretation, reduced cycle
time on decisions, better planning and work coordination procedures, and reduced
offshore operating costs through offshore-onshore work re-organization and prolonged
maintenance intervals. The overall commercial benefits expected include an
approximately 10% increase in production, a 30–40% reduction in operating costs, and
significant improvements in health and safety performance.

24.7 Key features of the e-Approach for O&M in North Sea Assets
As mentioned earlier, integrated e-operations–e-maintenance is not just an effort to
introduce new technologies. It in fact represents a change in the use of technical
tools, advanced methods, and joint expertise to make O&M processes more effective
and efficient. It introduces a novel way of managing the process that steps outside
convention. However, the successful implementation and use of the e-approach
depends heavily on the synergy between remote diagnostic and prognostic
technology, onshore expert centers directly connected to offshore collaborative rooms,
and net-based web-enabled ICT solutions (Figure 24.4).

[Figure 24.4: three components: remote monitoring technology (e.g. diagnostic and prognostic); net-based and web-enabled ICT solutions (e.g. SOIL); offshore-onshore expert centers]

Figure 24.4. The solid foundation of the e-approach in O&M demands a synergy between three
main components that establish a complex and interactive technical system

24.7.1 Prognostic and Diagnostic Technologies

For a long time it was a challenge to make effective use of condition
monitoring on the Norwegian shelf (Ellingsen et al. 2006). There was ad hoc
use of some diagnostic technologies such as vibration monitoring on heavy rotating
equipment, thermography on electrical equipment and oil analysis, but mainly on a
discontinuous need-by-need basis. In most cases the use of diagnostic expertise was
limited to on-the-site tapping and data acquisition after a malfunction or some
abnormal technical indication had been reported. Today, however, many O&G producers
are keen to capitalize on the inherent potential provided by the digital
infrastructure in the North Sea and by advanced technologies. It implies that the use
of condition monitoring to support technical and safety integrity is strengthened in
the integrated environment since:
• Data acquisition techniques have developed to the extent that experts at
onshore support centers (OSCs) can tap signals from critical equipment in real time
• Online communication capability has allowed joint interpretation and trend
analysis, for instance coupling to asset operator’s OSC, and comparing
with set alarm levels
• Expert centers have acquired technological capability so that they can
secure connections to several offshore assets in a way that those assets can
be served simultaneously if necessary
The use of advanced networking technologies is in fact a landmark of inte-
grated O&M solutions for North Sea assets, as opposed to offline technologies. It
has brought some unique capabilities to share the expertise. With the rapid use of
portable communication technology, offshore personnel can also communicate
effectively with OSCs allowing more sensible use of data acquisition technologies.
The current setting has given a new dimension to the diagnostic and prognostic
efforts for North Sea assets today.
The OSC at SKF-Norway, for instance, is a CBM expert center that has remote
diagnostic and prognostic capabilities and serves various operators in the
Norwegian and Danish O&G sectors. Over the past few years it has carried out online
remote vibration monitoring of critical machinery on offshore production platforms
from its OSC. Other technical options such as wireless or web-enabled
solutions and software such as Microsoft NetMeeting, together with the highly
secure data traffic enabled by SOIL, have made it possible for experts to have
simultaneous access to critical data around the clock from different geographical
locations for rapid analysis and interpretation.
In the absence of access to fiber-optic technology, for instance due to technical
limitations, multiplexed solutions are available that reduce the amount of data
traffic to a level that can easily be handled by satellite communications. In such
cases, foreign expert centers can reasonably record data, e.g. from 20
accelerometers once every 20 s. In the presence of adequate data exchange and
communication networks, the data management systems of those foreign locations can
easily be linked to SOIL. This has proven to work successfully, for instance in the
case of mobile offshore drilling rigs and floating production and storage vessels.
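A quick calculation suggests why such a sampling scheme is light enough for a satellite link. The sketch below is illustrative only: the function name and the 64-byte payload per reading are assumptions, not figures from the installations described above.

```python
def average_link_load_bps(n_sensors: int, interval_s: float, bytes_per_reading: int) -> float:
    """Average bit rate when every sensor sends one reading per interval."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    readings_per_second = n_sensors / interval_s
    return readings_per_second * bytes_per_reading * 8


# 20 accelerometers, each read once every 20 s.
# 64 bytes per reading is an assumed payload size (value, tag, timestamp).
load = average_link_load_bps(n_sensors=20, interval_s=20.0, bytes_per_reading=64)
print(f"{load:.0f} bit/s")  # 512 bit/s
```

Under these assumptions the average load is one reading per second, which is small compared with streaming raw vibration waveforms.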
Developments in sensor systems, microelectronics, wireless systems
and software have actively contributed to the growth in the use of advanced CBM
techniques on the NCS today. In addition, advances in exception-reporting
techniques have reduced the need for massive amounts of data for failure
prediction. This has contributed to faster decision-making processes on
the basis of variations in pre-defined data sets or associations between technical
indicators. The need for a fast track of communication from remote locations has
been met by web-based reporting systems that follow a pre-defined format and a
tag-based reporting structure to enable faster action. This allows the data and
reports to be transferred automatically into CMMS systems, such as those built into
corporate ERP systems like SAP, Workmate, etc. This has substantially shortened
the time and conventional routines for data collection, analysis, reporting, work
orders and feedback.
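The idea of exception reporting with a tag-based structure can be sketched as follows. This is a minimal illustration under stated assumptions: the record fields, the equipment tag "P-101-VIB", the alarm level and the deadband are all hypothetical and do not describe any particular reporting system used on the NCS.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class ExceptionReport:
    tag: str            # equipment tag (hypothetical naming scheme)
    timestamp_s: float  # time of the sample that triggered the report
    value: float        # measured value
    alarm_level: float  # level that was exceeded


def exception_reports(
    tag: str,
    samples: Iterable[tuple[float, float]],
    alarm_level: float,
    deadband: float = 0.0,
) -> Iterator[ExceptionReport]:
    """Yield a report only when the value first rises above the alarm level.

    After a report, nothing more is sent until the value has fallen below
    alarm_level - deadband, so a sustained exceedance produces one report.
    """
    in_alarm = False
    for timestamp_s, value in samples:
        if not in_alarm and value > alarm_level:
            in_alarm = True
            yield ExceptionReport(tag, timestamp_s, value, alarm_level)
        elif in_alarm and value < alarm_level - deadband:
            in_alarm = False


# (time in s, vibration level) pairs; values are made up for illustration.
samples = [(0, 2.1), (20, 2.3), (40, 4.8), (60, 5.1), (80, 2.0), (100, 4.6)]
reports = list(exception_reports("P-101-VIB", samples, alarm_level=4.5, deadband=0.5))
print(len(samples), "samples ->", len(reports), "reports")  # 6 samples -> 2 reports
```

Because each report carries a tag and fixed fields, a receiving system could in principle map it onto a work order without manual re-entry, which is the benefit the text attributes to a pre-defined format.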

24.7.2 Onshore Remote Support Centres and Virtual Activity

Onshore Support Centres (OSCs) can be considered the active nodes of the
integrated e-operations–e-maintenance setting. Such OSCs are established on the
premises of both O&G producers and third parties. The functional characteristics
of OSCs can vary from one to another depending on the contractual roles and
specific assignments of external organizations. For instance, ConocoPhillips, as the
operator of the Ekofisk asset, has two such onshore centers. One of them is called the
onshore operational center (OOC) and has built-in integrated solutions for O&M
planning, logistics, and other production and operation related activities (Figure 24.5).

[Figure 24.5: OSC landscapes: 3D technologies and simulations; logistics and planning; conferencing; real-time monitoring]

Figure 24.5. Onshore support centers (OSCs) with built-in collaborative and
decision support technologies are the active nodes of the integrated
e-operations–e-maintenance environment on the NCS (courtesy: ConocoPhillips, Norway)

In general, OSCs have built-in communication capabilities with offshore
control rooms and external business partners. The OSCs of third-party organizations
are dedicated to providing expert assistance, for instance in logistics, vibration
monitoring, etc., on a 24/7 online and real-time basis. To enable active
collaboration, these OSCs are equipped with tabletop collaborative workstations,
back-projected large VDUs, technologies for remote monitoring, video-conferencing
facilities, other advanced technological capabilities for joint decision-making
(e.g. VisiWear, Smart boards), supporting technology to produce 3D
images and to run simulations, etc. The success stories of OSCs such as those of
ConocoPhillips have given an industry-wide boost to further advancement both in the
number of OSCs and in the type of technologies in use. This has created a very fruitful
environment for rapid exploitation of technology (e.g. CBM), decision and work
process optimization, multi-disciplinary coordination of planning (e.g. between
O&M and drilling), shared expertise, etc.
The technological capabilities built into OSCs, together with the net-based
access via the ICT infrastructure, have resulted in the establishment of a dedicated
virtual environment to support O&M decisions and activities. This takes place through:
• Real-time online connection between offshore and onshore organizations
• Real-time online connection between different technical disciplines (e.g.
planning and scheduling, transport logistics, equipment suppliers, external
service contractors, spare part suppliers, health and safety advisors, etc.)
• Real-time online connection between the asset operator and the external
experts (e.g. remote condition monitoring center of SKF-Norway and BP)
• Real-time online connection across the geographical borders to the corpo-
rate network to receive expert support for instance from Aberdeen, UK,
Houston or Alaska, USA, or to remotely monitor activities ‘following the
Certainly, the new network-based and collaborative O&M environment has
already shown its capability to make notable changes to conventional O&M
practice. The new thinking and the progress so far indicate great potential
for expansion on a substantial scale that can lead to a completely different
technological setting and operating mode by the year 2010 or so. The ongoing
developments would at some stage be coupled with other technologies, for instance
scenario simulations of technical faults and failures using 3D technologies,
intelligent watchdog agents for condition prognostics, virtual tools to train
O&M crews, etc.

24.7.3 The ICT Network: Secure Oil Information Link (SOIL)

Advanced ICT solutions are often at the heart of the principal commercial activities of
almost all industrial sectors today (Chang et al. 2004; van Oostendrep et al. 2005;
Mezgaar 2006). Current developments on the Norwegian shelf have also resorted to
such solutions as the basis for inducing change. Current ICT solutions range from
more centralized LANs, primarily localized within organizational
boundaries, to large-scale WAN solutions that open up transaction routes for
complex business-to-business (B2B) traffic. In fact, the specific need for such robust
integrated solutions for the O&G industry in the North Sea has been growing
over the last 2–3 years, demanding more common platforms, for instance to manage
complex O&M and other plant data. The large-scale ICT network established in the
North Sea is called the Secure Oil Information Link (SOIL).
SOIL was introduced to the Norwegian E&P industry in 1998. It is a result of
growing demands for integrated data management and B2B communication
solutions. SOIL consists of a number of application services actively connecting
almost all the business sectors of the Norwegian O&G industry. This network
helps establish connectivity and interactivity between different parties, for
instance offshore O&M teams, the operator’s onshore O&M support groups,
third-party CBM experts, logistics contractors, etc., through the use of fiber-optic
cables and wireless communications. Real-time equipment data can be acquired and
jointly analyzed, and results can be exchanged online between these parties, enhancing
the ability for shared interpretation and decision-making. In this context, there are two
major functional features of SOIL (see also Figure 24.6):
• A highly reliable information and knowledge-sharing network to remotely
coordinate and manage O&M activities in North Sea offshore assets
regardless of geographical location
• Many-to-many simultaneous authorized connectivity, breaking with the
conventional one-to-one solution and enhancing collaboration between experts,
third-party services, the asset operator, and the offshore crew
The conventional one-to-one setting only enabled the connectivity between two
distinctive parties, for example between an inspection engineer of a contractor and
a maintenance planner of an asset owner. However, with the use of the web-
enabled networking solutions available today, a number of distinctive groups can
stay connected and interact simultaneously (i.e. many-to-many connectivity). This
capability has major effects on improvements to D2D and D2A processes of O&M
in terms of time, cost, and quality.
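The contrast between one-to-one and many-to-many connectivity can be made concrete with a toy model. The sketch below is not a description of how SOIL is implemented; the class, method and session names are hypothetical, and the hub simply delivers each message to every other authorized member of a session.

```python
from collections import defaultdict


class CollaborationHub:
    """Toy many-to-many hub: each authorized member of a session sees every message."""

    def __init__(self) -> None:
        self._members: dict[str, set[str]] = defaultdict(set)              # session -> parties
        self._inbox: dict[str, list[tuple[str, str]]] = defaultdict(list)  # party -> (sender, text)

    def authorize(self, session: str, party: str) -> None:
        self._members[session].add(party)

    def publish(self, session: str, sender: str, text: str) -> int:
        """Deliver text to all other members; return the number of recipients."""
        members = self._members.get(session, set())
        if sender not in members:
            raise PermissionError(f"{sender} is not authorized for {session}")
        recipients = members - {sender}
        for party in recipients:
            self._inbox[party].append((sender, text))
        return len(recipients)

    def inbox(self, party: str) -> list[tuple[str, str]]:
        return list(self._inbox[party])


def one_to_one_links(n_parties: int) -> int:
    """Dedicated links needed to connect every pair of parties directly."""
    return n_parties * (n_parties - 1) // 2


hub = CollaborationHub()
for party in ("offshore crew", "operator OSC", "CBM expert", "logistics contractor"):
    hub.authorize("compressor-vibration", party)

delivered = hub.publish("compressor-vibration", "CBM expert", "Bearing trend rising")
print(delivered, one_to_one_links(4))  # 3 6
```

With four parties, a single published message reaches the other three at once, whereas pairwise connections would require six separate links to achieve the same reach.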

Figure 24.6. SOIL’s application solutions provide many-to-many connectivity and
interactivity on a 24/7 online real-time basis to enhance D2D and D2A performance of O&M

24.8 Future Challenges to be Fully-integrated and Fail-safe

The integrated e-operations–e-maintenance approach that is currently in progress
on the Norwegian shelf has given a new perspective that challenges convention.
It has already illustrated, through a number of successful implementation
tasks, how the technology can be coupled with suitable managerial solutions (e.g.
better interfacing with contractors, fast access to external expertise, etc.) to address
novel challenges of and to cater to the craving for innovative solutions by O&M
With the availability of the highly secure ICT infrastructure, the Norwegian shelf
has opened up substantial space for technological innovation seeking major
improvements in O&M practice. In fact, the SOIL-enabled operational network, together
with OSCs, has provided the structural skeleton for test-beds and for implementing
novel solutions. It is the rapid developments within data acquisition and
offshore-onshore communication technologies that are expected to take O&M to a different
mode of practice. However, certain challenges remain in the use of some of
the novel technologies, for instance:
• Portable video-communication technologies
• Smart sensors and intelligent transducers for equipment with built-in self-
diagnostic and reporting capabilities
• Electronic products such as PDAs with advanced functionalities
• 3D technologies, etc.
Regardless of the notable achievements during the last 1–2 years, the
challenges for further development of the O&M process are numerous. In pure
technological terms, smart and cost-effective use of CBM technologies in
particular still remains a significant challenge. In fact there is no argument about the
benefits of CBM in terms of being fully integrated and fail-safe. The demand by
far is for more sensitive use of the diagnostic and prognostic technologies as a
principal means to improve and to be in control of the technical and safety integrity of
assets. The demand in the current O&M setting is for advanced technical
platforms that, for instance, combine unique signal processing, risk analysis, and
decision-making features. In fact the demand is for technologies where failure
modes can be learnt by intelligent programs and reported automatically by decision
support software. It implies that, as remote support through CBM becomes the
‘practice of the future’, the O&G activities in the North Sea demand innovative
solutions such as intelligent watchdog agents, common data management platforms,
smart decision support tools, etc. to support the rapid transition towards a fully
integrated and fail-safe e-operations–e-maintenance setting.

24.8.1 Intelligent Watchdog Agents

From a pure CBM perspective, there is a greater demand for the use of enabling
technologies as integral parts of robust CBM solutions. As the operating
environment steps into a remote mode, where 24/7 access becomes a sensitive issue, the
experts need to ensure a tight technical coupling, for instance between:
• Signal-processing technology with a series of toolboxes for signal pro-
cessing and system performance evaluation to track the health of a system/
machine and provide diagnostic and prognostic information in order to
achieve the goal of “near-zero-downtime” performance
• Application software solutions to interpret optimally monitored data signals
regarding the execution of a maintenance action and to estimate remaining
useful life (RUL)
The requirement on the Norwegian shelf today is a CBM technology that is not
limited to data acquisition but also integrates advanced solutions with signal
processing and decision-making capabilities, making it a more attractive and
commercially viable solution. In a series of more recent R&D efforts, the Center for
Intelligent Maintenance Systems (IMS) at University of Wisconsin-Milwaukee and
the CBM Lab at University of Toronto have developed such an integrated O&M
optimization platform to provide asset owners and operators with an advanced tool
for signal processing and maintenance decision-making (see Jardine et al.
1997; Banjevic et al. 2001). Figure 24.7 shows the multi-sensor performance
assessment framework of this technology.
This watchdog agent constitutes a toolbox with modules for signal processing,
feature extraction, degradation assessment and performance evaluation embedded
in a common software application. It includes signal processing and feature
extraction tools built on Fourier analysis, time-frequency distribution, wavelet packet
analysis and ARMA time series models. The performance evaluation component
uses such tools as fuzzy logic, match matrix, neural network and other advanced
algorithms. Functionally, the watchdog agent is in principle used to extract features
from a series of signals under a given condition and to compare those with a
template model built from signals under a pre-identified normal condition.
The performance evaluation yields a “confidence value” (CV), which indicates the
health status of the system and is used as the basis for diagnostics and prognostics
under given circumstances. If the data can be directly associated with some failure
mode, then the most recent performance signatures, obtained through the signal
processing and feature extraction modules, can also be matched against signatures
extracted from faulty behavior data for proper decisions.
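The principle of comparing extracted features against a normal-condition template can be illustrated with a deliberately simplified sketch. It is not the watchdog agent's algorithm: the two features (RMS and peak-to-peak value), the per-feature mean and standard deviation template, and the exponential mapping to a value between 0 and 1 are all assumptions chosen for brevity, and every function name is hypothetical.

```python
import math
from statistics import mean, stdev


def features(signal: list[float]) -> tuple[float, float]:
    """Return (RMS, peak-to-peak) of one signal window."""
    rms = math.sqrt(sum(x * x for x in signal) / len(signal))
    return rms, max(signal) - min(signal)


def build_template(normal_windows: list[list[float]]) -> list[tuple[float, float]]:
    """Per-feature (mean, standard deviation) over windows recorded in normal condition."""
    columns = zip(*(features(w) for w in normal_windows))
    return [(mean(col), stdev(col)) for col in columns]


def confidence_value(window: list[float], template: list[tuple[float, float]]) -> float:
    """Map the distance from the normal template to a value in (0, 1].

    The exponential mapping is an illustrative assumption: 1 means the window
    matches the template, values near 0 mean it is far from normal behavior.
    """
    z2 = [((f - mu) / sigma) ** 2 for f, (mu, sigma) in zip(features(window), template)]
    return math.exp(-0.5 * sum(z2) / len(z2))


def sine(amplitude: float, n: int = 64) -> list[float]:
    """Synthetic vibration window: a sine wave with 16 samples per period."""
    return [amplitude * math.sin(2 * math.pi * k / 16) for k in range(n)]


# Template from three "normal" windows with slightly different amplitudes.
template = build_template([sine(a) for a in (0.9, 1.0, 1.1)])
healthy = confidence_value(sine(1.0), template)
degraded = confidence_value(sine(2.0), template)
print(healthy > degraded)  # True
```

In this toy setting a window at the nominal amplitude scores close to 1, while a window with doubled amplitude scores close to 0, which mirrors how a falling CV would flag degradation.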

[Figure 24.7: multisensor performance assessment framework. Sensory signal processing: time-frequency analysis, ARMA modeling, Fourier analysis, wavelet packet analysis. Feature extraction: time-frequency/wavelet moments and PCA, wavelet frequency bands, AR model roots, expert-extracted features (intensity, peak-to-peak value, RMS). Multisensor performance evaluation: logistic regression, statistical pattern recognition, feature map pattern matching, neural network matching, hidden Markov model, particle filter.]

Figure 24.7. The potential for further enhancement in the use of advanced CBM technologies
such as intelligent watchdog agents is very evident for North Sea assets (courtesy: CBM
Lab, University of Toronto, Canada)

24.8.2 Early Warning and Decisions Support Systems

Offshore production facilities are threatened by various unwanted events
and incidents every year. The risk exposure due to such serious events and incidents is
much higher in a 24/7 online real-time environment than in a conventional
operating mode. The former demands more robust early warning systems and
decision support tools for fast decisions and actions. The ability to better control
such events and incidents demands tools and techniques for recognition of the actual
condition of technical items (i.e. systems, sub-systems, equipment, and
components) and early prediction of imminent faults and failures that may lead to such
events and incidents, based on performance cues (or early indications). In this
context, the major challenge in avoiding serious events and catastrophic incidents
relates to the ability to employ smart technologies and techniques to obtain such
critical performance cues and to actively use such cues as a basis for diagnostic and
prognostic purposes to enable early decisions and actions prior to the ‘point of no
return’ (e.g. emergency shut down).
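One simple way to turn a performance cue into an early warning is to extrapolate its recent trend towards an alarm level. The sketch below fits a least-squares line to a short history and estimates the time remaining before the level is reached. The linear trend model, the function name and the example numbers are assumptions for illustration; real degradation is rarely this regular, and the chapter does not prescribe a particular method.

```python
from typing import Optional, Sequence


def hours_to_alarm(
    times_h: Sequence[float],
    values: Sequence[float],
    alarm_level: float,
) -> Optional[float]:
    """Estimate hours from the last sample until a linear trend reaches alarm_level.

    Returns None when the fitted trend is flat or falling, and 0.0 when the
    fitted line has already reached the alarm level.
    """
    n = len(times_h)
    if n < 2 or n != len(values):
        raise ValueError("need at least two samples and equal-length inputs")
    t_mean = sum(times_h) / n
    v_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in times_h)
    if sxx == 0:
        raise ValueError("time stamps must not all be identical")
    slope = sum((t - t_mean) * (v - v_mean) for t, v in zip(times_h, values)) / sxx
    if slope <= 0:
        return None
    intercept = v_mean - slope * t_mean
    t_cross = (alarm_level - intercept) / slope
    return max(0.0, t_cross - times_h[-1])


# Hypothetical hourly vibration readings rising by 0.5 units per hour.
remaining = hours_to_alarm([0, 1, 2, 3, 4], [2.0, 2.5, 3.0, 3.5, 4.0], alarm_level=7.0)
print(remaining)  # 6.0
```

An estimate like this gives onshore experts a lead time in which to plan an intervention before a shutdown level is reached, which is the purpose of the early indications described above.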
Some of the current R&D projects for O&M optimization seek to implement
such technical solutions as integral parts of early warning systems to deal better
with unwanted events and incidents. Such early warning systems, which quickly
initiate further technical analysis based on trends, associations, failure histories, or
expert judgments, will be built into OSCs to support decisions. Additional software
application solutions are under testing at the moment that can be mapped onto the
existing ERP systems with built-in data mining logic to tap into complex events
and incidents data. However, apart from the technology, there are other
impediments such as ontology, semantics, mechanics of reporting, custom data flow
structures, etc. that need to be addressed. It implies that the current integrated
e-operations–e-maintenance setting also requires some effort on standardization of data
to make use of reliable early warning and decision support systems.

24.8.3 Non-technical Issues

In fact, the initiatives in 2005–2006 to introduce integrated
e-operations–e-maintenance solutions to the Norwegian shelf represent a very sensitive
commercial step to enhance technology exploitation and resource management capability.
Owing to the direct long-term commercial implications of this industry-wide
program, the involvement of socio-political and authoritative organizations has
been rapidly increasing, with the intention of taking all possible measures to ensure
a fully integrated and fail-safe system in operation. The Norwegian Petroleum
Directorate (NPD), Petroleum Safety Authority (PSA) and Norwegian Oil Industry
Association (OLF) are at the forefront of this move, and have already sponsored
and launched various programs.
Breaking the conventional barriers that have been in place over the last
20–30 years is a serious challenge in itself. The implications of the ongoing change
processes go far beyond the purely technical. They induce a new socio-technical
environment with inherent characteristics and complexities. Interestingly, it is
commonly acknowledged within the industry that a much greater portion of current
challenges are in fact non-technical rather than technical. It is constantly
highlighted that increasing complexities and ill-defined solutions can easily increase the
vulnerabilities and the risk exposure. With the expansion of activities on a global
scale, the path to establishing best practice and thus achieving commercial
excellence has to overcome some critical challenges that mostly relate to the issue of
effective and efficient interfacing with fairly diversified sectors of the industry.
Understanding and interfacing critical socio-technical dimensions are important
to avoid the vulnerabilities and risks of the ongoing change processes (Health and
Safety Executive 1997; Perow 1999; Booher 2003). Table 24.3 illustrates some of
the known challenges that need to be overcome for fully integrated and fail-safe

Table 24.3. The challenges for the full-scale integration of e-operations–e-maintenance are
quite complex
Challenges for integrated e-operation–e-maintenance solutions
Liabilities of shared decisions and activities
Trust and openness between business partners and distinctive groups
Semantics and ontology for data integration
Security and reliability of digital infrastructure
Information quality, data filtering, common data exchange platforms
Incentives for and risk of knowledge-based industry integration
Standards and interfacing for work processes optimization
Human and organizational learning
Competence development programs for change absorption
Trade union matters

There is much to do to make sure that the new integrated
e-operations–e-maintenance setting is fully functional and fail-safe. Perhaps the greater
concern is that the marvel of the success brought by ad hoc technological solutions may
easily lead to miscalculation of the underlying risks of process re-engineering tasks.
With this realization, a major portion of the industry has begun to adapt along a more
cautious, synchronized, and incremental development path. Initiatives by
authorities (e.g. NPD, PSA, etc.) and by socio-political sources (e.g. OLF) are
critical to establish a more harmonized setting that ensures the necessary levels of
safety and security. Even though a systematic strategy may prolong the integration plan,
the argument is that such a systematic move will have a substantial long-term payback,
unlike a rapid solution that would eventually force major stakeholders to
deal with unforeseen events requiring ‘ad hoc solutions’ or ‘quick fixes’ that would
be too costly to bear.

24.9 Conclusion
Beginning in 2003–2004, the Norwegian O&G industry launched a
dedicated program to overcome obvious commercial risks on the NCS. This is
termed the third efficiency leap, and it has directly supported the implementation of
integrated e-operations–e-maintenance solutions for offshore assets in the North Sea.
This new practice greatly challenged the conventional practices of many
disciplines, particularly O&M, seeking a technological as well as a managerial
change. The new O&M practice places major emphasis on more active exploitation
of application technologies and new data and knowledge management techniques.
The change process has also begun to re-engineer the industry infrastructure to
actively integrate the O&M expertise of O&G producers with that of the external
knowledge-based industry. The large-scale ICT network called Secure Oil
Information Link and the onshore support centers are the main facilitators of rapid
development within the O&M process. The new setting has already brought major
commercial benefits by streamlining D2D and D2A processes, with substantial
improvements in work processes. However, some critical challenges still remain
to be addressed, and the socio-political organizations and authorities are keen on
ensuring fully functional and fail-safe operations. The demand and the interest are to
complete the rest of the journey through more cautious and systematic strategies
that sustain commercial benefits beyond the year 2050 without exposing the industry
to unwanted or hidden risks that would be too costly to bear.

24.10 References
Arnaiz, A., Arana, R., Maurtua, I., et al., (2005), Maintenance: future technologies,
Proceedings of the IMS (Intelligent Manufacturing System) International Forum IMS
Forum 2004 Como, Italy, May 17–19, pp. 300–307.
Bangemann, T., Rebeuf, X., Reboul, D., et al., (2006), PROTEUS-creating distributed
maintenance systems through an integration platform, Computers in Industry, 57(6),
pp. 539–551.
Banjevic, D., Jardine, A.K.S., Makis, V. and Ennis, M., (2001), A control-limit policy and
software for condition-based maintenance optimization, INFOR, 39, pp. 32–50.
Bonissone, G., (1995), Soft computing applications in equipment maintenance and service,
ISIE ’95, Proceedings of the IEEE International Symposium, 2, pp. 10–14.
Booher, HR. (ed.) (2003). Handbook of human systems integration, Wiley-Interscience.
Integrated e-Operations–e-Maintenance: Applications in North Sea Offshore Assets 607

Chande, A., Tokekar, R., (1998), Expert-based maintenance: a study of its effectiveness,
IEEE Transactions on Reliability 47, pp. 53–58.
Chang, Y.S., Makatsoris, H.C., Richards, H.D., (2004), Evolution of supply chain
management: symbiosis of adaptive value networks and ICT, Boston: Kluwer Academic
Djurdjanovic, D., Ni, J., Lee, J., (2002), Time-frequency based sensor fusion in the
assessment and monitoring of machine performance degradation, Proceedings of the
2002 ASME International Mechanical Engineering Congress and Exposition paper
number IMECE 2002-32032.
Djurdjanovic, D., Lee, J., Ni, J., (2003), Watchdog agent — an infotronics-based prognostics
approach for product performance degradation assessment and prediction, special issue
on intelligent maintenance systems, Engineering Informatics Journal 17 (3–4), pp. 107–
During, W., Oakey, R., et al. (ed.) (2004). New technology-based firms in the new
millennium. Elsevier.
Ellingssen, H.P., Liyanage, J.P., Ruså, R., (2006), Smart integrated operations and
maintenance solutions to manage offshore assets in North Sea, Proceedings of the 18th
EuroMaintenace, MM Support GmbH, pp, 319–324.
Emmanouilidis, C., MacIntyre, J., Cox, C., (1998), An integrated, soft computing approach
for machine condition diagnosis, Proceedings of the Sixth European Congress on
Intelligent Techniques & Soft Computing (EUFIT’98), vol. 2 Aachen, Germany, pp.
Emmanouilidis, C., Jantunen E., MacIntyre, J., (2006), Flexible software for condition
monitoring, Computers in Industry, 57(6), pp, 516–527.
García, M.C., Sanz-Bobi, M.A., (2002), Dynamic Scheduling of Industrial Maintenance
Using Genetic Algorithms, Proceedings of EuroMaintenance 2002, Helsinki, Finland.
Garcia, M.C., Sanz-Bobi, M.A., Pico, J., (2006), SIMAP: Intelligent systems for predictive
maintenance: Application to the health condition monitoring of a wind-turbine gearbox,
Computers in Industry, 7(6), pp, 552–568.
Han, T., Yang, B.S., (2006), Development of an e-maintenance system integrating advanced
techniques, Computers in Industry, 57(6), pp, 569–580.
Hansen, R., Hall, D., Kurtz, S., (1994), New approach to the challenge of machinery
prognostics, Proceedings of the International Gas Turbine and Aeroengine Congress and
Exposition American Society of Mechanical Engineers, pp. 1–8.
Health and Safety Executive (HSE). (1997). Human and organizational factors in offshore
safety. HSE, UK.
Hosni, Y.A., Khalil, T.M. (ed.) (2004). Management of technology. Elsevier.
Iung, B., (2003), From remote maintenance to MAS-based e-maintenance of an industrial
process, International Journal of Intelligent Manufacturing 14(1), pp. 59–82.
Jardine, A.K.S., Banjevic, D., Makis, V., (1997), Optimal replacement policy and the
structure of software for condition-based maintenance, Journal of Quality in Maintenance
Engineering, 3, pp. 109–119.
Jardine, A.K.S., Makis, V., Banjevic, D., et al., (1998), Decision optimization model for
condition-based maintenance, Journal of Quality in Maintenance Engineering 4 (2), pp.
Jardine, A.K.S. Lin, D., Banjevic, D., (2006) A review on machinery diagnostics and
prognostics implementing condition based maintenance, Mech. Syst. Signal Process. 20
(7), pp. 1483–1510.
Jantunen, E. Jokinen, H. Milne, R., (1996), Flexible expert system for automated on-line
diagnostics of tool condition, Integrated Monitoring & Diagnostics & Failure Prevention,
Technology Showcase, 50th MFPT Mobile, Alabama.
608 J. Liyange

Khatib, A.R., Dong, Z., Qiu, B., et al., (2000), Thoughts on future Internet based power
system information network architecture, in: Proceedings of the 2000 Power Engineering
Society Summer Meeting, vol. 1, Seattle, USA.
Koc, M., Lee, J., (2001), A system framework for next-generation e-maintenance system,
Proceeding of Second International Symposium on Environmentally Conscious Design
and Inverse Manufacturing Tokyo, Japan.
Lee, J. (1996), Measurement of machine performance degradation using a neural network
model, Computers in Industry 30, pp. 193–209.
Lee, J., (2004), Infotronics based intelligent maintenance system and its impacts to closed
loop product life cycle systems, Proceedings of the Proceedings of the IMS’2004
International Conference on Intelligent Maintenance Systems Arles, France.
Liao, H.T., Lin, D.M. Qiu, H., et al., (2005), A predictive tool for remaining useful life
estimation of rotating machinery components, ASME International 20th Biennial
Conference on Mechanical Vibration and Noise Long Beach, CA.
Liyanage, J.P., (2003), Operations and maintenance performance in oil and gas production
assets: Theoretical architecture and capital value theory in perspective, PhD Thesis,
Norwegian University of Science and Technology (NTNU), Norway.
Liyanage, J.P., Herbert, M., Harestad, J., (2006), Smart integrated e-operations for high-risk
and technologically complex assets: Operational networks and collaborative partnerships
in the digital environment, Wang, Y.C., et al., (ed.), Supply chain management: Issues in
the new era of collaboration and competition, Idea Group, USA, pp. 387–414.
Liyanage, J.P., Langeland, T., (2007), Smart assets through digital capabilities, Mehdi
Khosrow-Pour (ed.), Encyclopaedia of Information Science and Technology, Idea Group,
Liang, E., Rodriguez, R., Husseiny, A., (1988), Prognostics/diagnostics of mechanical
equipment by neural network, Neural Networks 1 (1), p. 33.
Marseguerra, M., Zio, E., Podofilini, L., (2002), Condition-based optimisation by means of
genetic algorithms and Monte Carlo simulation, Reliability Engineering and System
Safety 77, pp. 151–166.
Mezgaar, I., (2006), Integration of ICT in smart organizations, Hershey, PA: Idea Group Pub.
Moore, W.J., Starr, A.G., (2006), An intelligent maintenance system for continuous cost-
based prioritization of maintenance activities, Computers in Industry, 57(6), pp. 595–606.
OLF (Oljeindustriens landsforening / Norwegian Oil Industry Association), (2003). eDrift
for norsk sokkel: det tredje effektiviseringsspranget (eOperations in the Norwegian
continental shelf: The third efficiency leap), OLF ( (in Norwegian)
Palluat, N., Racoceanu, D., Zerhouni, N., (2006), A neuro-fuzzy monitoring system:
Application to flexible production systems, Computers in Industry, 57(6), pp. 528–538.
Perow, C. (1999). Normal accidents: Living with high-risk technologies, Pinceton University
Roemer, M. Kacprzynski, G., Orsagh, R. (2001), Assessment of data and knowledge fusion
strategies for prognostics and health management, IEEE Aerospace Conference
Proceedings, vol. 6, pp. 62979–62988
Russell, R.S., Taylor, B.W., (2006), Operations management: Quality and competitiveness
in a global environment, Hoboken, N.J.: Wiley
Sanz-Bobi, M.A., Toribio, M.A.D., (1999), Diagnosis of electrical motors using artificial
neural networks, IEEE International Symposium on Diagnostics for Electrical Machines,
Power Electronics and Drives (SDEMPED) Gijón, Spain, pp. 369–374.
Sanz-Bobi, M.A., Palacios, R. Munoz, A., et al., (2002), ISPMAT: Intelligent System for
Predictive Maintenance Applied to Trains, Proceedings of EuroMaitenance 2002,
Helsinki, Finland.
Swanson, L., (2001), Linking maintenance strategies to performances, International Journal
of Production Economics 70, pp. 237–244
Integrated e-Operations–e-Maintenance: Applications in North Sea Offshore Assets 609

van Oostendrep, H., Breure, L., Dillon, A., (2005), Creation, use, and deployment of digital
information, Mahwah, N.J. : Lawrence Erlbaum Associates.
Wang, W., (2002), A stochastic control model for on line condition based maintenance
decision support, Proceedings of the Sixth World Multiconference on Systemics,
Cybernetics and Informatics, Part 6, vol. 6, pp. 370–374
Wang, W.Y.C., Heng, M.S.H., Chau, P.Y.K., (2006), Supply chain management: Issues in
the new era of collaboration and competition, Idea Group Publishing.
Yager R., Zadeh, L., (1992), An Introduction to Fuzzy Logic Applications in Intelligent
Systems, Kluwer Academic Publishers.
Yang, B.S., Lim, D.S., Lee, C.M., (2000), Development of a case-based reasoning system
for abnormal vibration diagnosis of rotating machinery, Proceedings of the International
Symposium on Machine Condition Monitoring and Diagnosis Japan, pp. 42–48.
Yen, G.G., (2003), Online multiple-model-based fault diagnosis and accomodation, IEEE
Transaction on Industrial Electronics 50 (2).
Yu, R., Iung B., Panetto, H., (2003), A mutli-agents based e-maintenance system with case-
based reasoning decision support, Engineering Applications of Artificial Intelligence 16,
pp. 321–333.

Fault Detection and Identification for Longwall Machinery Using SCADA Data

Daniel R. Bongers and Hal Gurgenci

25.1 Introduction
Despite the most refined maintenance strategies, equipment failures do occur. The
degree to which an industrial process or system is affected by these depends on the
severity of the faults/failures, the time required to identify the faults and the time
required to rectify the faults. Real-time fault detection and identification (FDI)
offers maintenance personnel the ability to minimise, and potentially eliminate one
or more of these factors, thereby facilitating greater equipment utilisation and in-
creased system availability.
This case study describes, in some detail, the application of data-driven fault
detection to an underground mining operation. However specific this application
may be, the concept can be employed on any system of machines, with or without
complex machine-machine or machine-environment interactions, or to individual
machines.
In addition to detailing the implementation of an FDI system in real-time, we
propose a semi-autonomous approach to dealing with inaccurate and incomplete
records of equipment malfunction. Since past equipment performance is often the
principal information source for maintenance planning and evaluation, it is of
utmost importance that this information be as accurate as possible. The method
described allows for varying levels of confidence in the record keeping.
Section 25.2 introduces the longwall mining system, the most common form of
mining coal underground at the present. The availability of longwall equipment
systems is low compared to surface systems of similar complexity. The present
approaches towards reducing equipment downtime in longwall mining are summa-
rised in Section 25.3. Common FDI approaches are summarised in Section 25.4.
Two data-driven techniques are used in this study, namely artificial neural net-
works and multi-variate statistics. The availability of quality training data is of
critical importance for either one. The issue is addressed in Sections 25.5–25.7.
Once the training data set is constructed, the application of the selected FDI
techniques is reasonably straightforward. The application and the results are
summarised in Section 25.8 with concluding remarks in Section 25.9.

25.2 The Longwall Mining System

Longwall is an underground mining technique used to extract coal from relatively
flat coalbeds. The basic principle is simple. A coalbed is selected and blocked out
into panels averaging nearly 240 m wide, 2100 m in length, and several metres in
height, by excavating passageways around its perimeter. A panel of this size con-
tains millions of tonnes of coal, most of which is recovered. In the extraction
process, numerous pillars of coal are left untouched in certain parts of the mine in
order to support the overlying strata. The mined-out area is allowed to collapse. In
some instances this may cause some surface subsidence.

Figure 25.1. Longwall equipment layout

Extraction by longwall mining is an almost continuous operation involving
the use of self-advancing hydraulic roof supports, a sophisticated coal-shearing
machine, and an armoured conveyor parallel to the coal face (see Figure 25.1).
Working under the movable roof supports and riding on the conveyor frame, the
shearing machine cuts and spills coal onto the conveyor for transport out of the
mine. When the shearer has traversed the full length of the coal face, it reverses
direction (without turning) and travels back along the face taking the next cut. As
the shearer passes each roof support, the support is moved closer to the newly cut
face. The steel canopies of the roof supports protect the workers and equipment
located along the face, while the roof is allowed to collapse behind the supports as
they are advanced. Extraction continues in this manner until the entire panel of coal
is removed.

25.2.1 Longwall Maintenance – Current Practices

As with all forms of mining, production downtime represents massive losses in
potential revenue. An effective maintenance program is seen in the industry as the
key to a profitable and sustainable longwall. The purpose of this section is to
describe the current maintenance practices in most longwall mines.
Maintenance is broadly categorized as either planned or reactive. Typically,
one shift per week is assigned to routine planned maintenance, with major over-
hauls of machinery occurring at the end of each panel. Weekly maintenance is per-
formed by the regular longwall operators, and usually includes the following:
• Checking the levels and quality of lubricants and hydraulic fluids
• Visual inspection of electrical panels, pumps, motor, gearboxes and hoses
• Testing of all underground communication devices
• Start-up tests for all AC motors
• Replacement of worn shearer picks and faulty spray nozzles on the cutting
drums
Reactive (or breakdown) maintenance refers to the repair or replacement of
failed or faulty equipment which interrupts longwall production. It is these actions
which are reported in the maintenance log of all unscheduled downtime. This
maintenance is attended to by longwall operators, as each team typically includes
specialists in both mechanical and electrical repairs.

Daily Management Meetings

Senior mine management typically meets daily to discuss and update current
issues. A large portion of the meeting is spent discussing planned maintenance, and
any events from the previous day’s shifts that caused production delays. This
forum allows input from managers with an array of specialities, thereby deciding
on the actions that will most benefit longwall production as a whole.
Any urgent issues are relayed to the longwall team presently underground for
immediate attention. Other maintenance that can be postponed until a later
production shift or the weekly maintenance shift is requested in the form of a work
order. Work orders are given to a shift supervisor at the beginning of each shift,
and are carried out before production begins.

Performance Indicators

Although coal tonnage per unit cost is the best indication of revenue, it gives no
information about the performance of individual operations such as longwall pro-
duction or new panel development. For this reason, maintenance teams use many
performance indicators.
Examples of such indicators are:
• Number of failures (or faults) of a piece of machinery in an operational
month
• Total downtime of a piece of machinery in an operational month
• Tonnes of coal per unit of longwall operations
Of the many indicators (or measurements) of performance, longwall mines
universally define key performance indicators (KPIs) that are production related.
These KPIs are:

\[ \text{Longwall availability} = \frac{\text{Operating time}}{\text{Operating time} + \text{Maintenance delays}} \times 100\% \]

This KPI looks at the ability to have the machines operate for the time that they
are planned to operate. It is simply the percentage of the available planned time
that they do actually operate. The ‘maintenance delays’ refer to scheduled and
breakdown maintenance. Some sites include only the breakdown maintenance in
this statistic, which leads to an inflated value of the equipment availability. Such
confusion in terms makes it difficult to benchmark practices between sites. Typical
values for this KPI average between 40% and 60%.

\[ \text{Mean time between failures (MTBF)} = \frac{\text{Actual operating time}}{\text{Number of maintenance delays}} \]

This KPI looks at the ability to sustain the operation of machines over periods
of time. It is a measure of how long, on average, before machines stop due to a
maintenance problem. Typical values average around 1 h.

\[ \text{Mean time to repair (MTTR)} = \frac{\text{Actual delay time}}{\text{Number of maintenance delays}} \]

This KPI looks at the ability to diagnose and remedy maintenance delays once
they have occurred. It is a measure of how long, on average, before machines that
have faulted are returned to operation. Typical values average around 20 min.
KPIs are typically reviewed on a weekly basis.
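
The three KPI definitions above can be checked with a few lines of Python. The following is only a minimal sketch: the function name `longwall_kpis` and the weekly totals are hypothetical, chosen so that the MTBF and MTTR fall near the typical values quoted above.

```python
def longwall_kpis(operating_min, delay_min, n_delays):
    """Return (availability in %, MTBF in min, MTTR in min) from period totals.

    operating_min: actual operating time in the period (minutes)
    delay_min:     total maintenance delay time in the period (minutes)
    n_delays:      number of maintenance delays in the period
    """
    if n_delays <= 0:
        raise ValueError("at least one maintenance delay is needed for MTBF/MTTR")
    availability = 100.0 * operating_min / (operating_min + delay_min)
    mtbf = operating_min / n_delays
    mttr = delay_min / n_delays
    return availability, mtbf, mttr


# Hypothetical week: 3000 min operating, 1000 min of delays, 50 delays.
availability, mtbf, mttr = longwall_kpis(3000.0, 1000.0, 50)
print(f"availability={availability:.1f}%  MTBF={mtbf:.0f} min  MTTR={mttr:.0f} min")
```

Whether `delay_min` includes scheduled maintenance or only breakdowns changes the availability figure, which is the benchmarking difficulty noted above.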

25.2.2 Longwall Monitoring

Equipment monitoring is playing an increasingly significant role in the modern
longwall. Monitoring here means both the determination of the overall state of
plant and the measurement of individual properties, for example the running
temperature of a gearbox. This section discusses all forms of condition monitoring
undertaken in Australian longwalls.
Condition monitoring of longwall equipment takes on two main forms: on-line
and off-line. On-line monitoring includes all sensor measurements recorded and
transmitted to the surface using a PLC-driven SCADA network. All other
measurement and monitoring, including regular maintenance inspections, are
classed as off-line.

Off-line Monitoring

Oil is used in longwall machinery as both a lubricant and a cooling fluid. As such,
it is important to monitor the quality and quantity of all oils used to ensure
satisfactory operating conditions for machinery. A change in the quantity of oil in a
piece of machinery is typically due to some sort of leakage; however, changes in
the quality (or specific properties) of a particular oil may be due to any or all of the
following:
• Wearing away of machine part material will increase the amount of metal
inclusions in the oil
• Unsealed or cracked housings may allow water or dust to become mixed in
the oil
• The use of an incorrect oil may cause a change in the viscosity or PQ index
(the PQ or Particle Quantifier Index is a measure of the amount of entrained
debris in the oil)
Oil analysis is done at irregular intervals; however on average it is conducted
fortnightly. The properties usually measured are metal inclusions (Cr, Ni, Mo, Mn,
Pb, Zn, Cu, Sn, Fe, Si, Al, Mg), kinematic viscosity at 40 °C, PQ index, and water
content (% m/m).
The task of checking oil level, collecting and analysing oil samples, as well as
conducting vibration and thermographic analysis (see below) is typically contrac-
ted to one of a few condition monitoring companies. Results are returned to the
mine in around three days, and are classed as satisfactory, caution, or action for
each sample. If action is required, a facsimile is sent to the mine detailing the per-
ceived problem. An example of such action is that the measured viscosity of an oil
sample is significantly high or low and all other analysis normal, suggesting the
wrong lubricating fluid was used. The mine is immediately notified so that the
lubricant may be checked and/or changed in the next shift.
Vibration analysis is performed on all machinery with rotating parts, but is
focused primarily on the crusher rotor, where the largest forces exist. Accelero-
meters are attached in strategic locations to the casing of the machinery, and
measurements taken under operation for around 10 min. The results from the
analysis are almost immediate, and action can be taken if excessive play is found in
any plant.
Not all longwalls perform vibration analysis, as it is quite expensive and rarely
highlights any problems. This is also due to the fact that large machinery that is
rotating off-centre tends to generate massive noise, which is easily detected by
longwall operators.
Thermal analysis of various pieces of equipment is performed through
thermography, a form of photography in which different colours represent different
temperatures. Thermography is used both underground and on the surface to check
the temperatures of parts not monitored by the on-line system. Typical parts in-
spected are control boxes, sensor communication subsystems, and sometimes the
fluid coupling at the AFC drive.

On-line Monitoring

The on-line monitoring systems at longwall mines consist of a network of sensors
and switches, which may be monitored from the surface or by operators under-
ground. The interface allows users to view the overall operation, as well as monitor
individual pieces of machinery. The system is designed to indicate simple faults
such as an over-temperature trip by displaying messages and changing the colour
of the particular machine at fault.
This system is very expensive, both to purchase and maintain. At present, the
display is only looked at when production personnel indicate that the longwall has
stopped due to an unknown fault. Data that is recorded is rarely used, and
undergoes no more than a visual analysis.
Over 300 analogue sensors are typically monitored by these SCADA systems.
In the past, data storage was considered one of the major pitfalls of archiving all
that was sampled. Today, however, this has been mitigated by inexpensive,
very large capacity computer hard drives.
Having ready access to millions of measurements tends to cause an information
overload, which is where FDI technology fits in. The ability to process the data in
real time allows the vast number of data points to be summarized into a highly
accurate and concise history of equipment failures.

25.3 Reducing Equipment Downtime: The Case for FDI

Available techniques to minimise equipment/process downtime are examined first.
After all, any maintenance program has the principal goal of achieving the greatest
possible equipment utilisation. The discussion of various techniques will be based
on generalized plant, with no regard to the specific challenges of the longwall
industry. Only in conclusion will such consideration be given, so as to justify the
method selection for this particular application.

25.3.1 Available Techniques

Preventative Maintenance

The most common way that industries attempt to improve machine availability is
through preventative maintenance. This consists of regular inspections, regular
replacement of lubricant, filters etc. and occasional reconditioning of parts.
Preventative maintenance has proven very successful in prolonging the life of
machinery, as well as showing improvements in performance measures such as the
mean time between failures (MTBF) and mean time to repair (MTTR).
Although the particular preventative maintenance philosophy employed may
vary depending on the training and experience of the engineers responsible, regular
maintenance tends to prove very cost effective, and is employed in nearly all
industries.

Design Modification

It is possible to redesign machinery to alter its inherent reliability. That is, based on
industry experience of certain faults, parts or entire machines are redesigned to
reduce or stop the occurrence of the fault. Typically, machine design is the
responsibility of the manufacturer, and is a very time consuming and costly
operation.

Redundancy

The concept of redundancy is widely used on a component level to make machines
more reliable. The principle is rather simple – machine components such as relays,
hydraulic valves or electronic capacitors are unnecessarily duplicated in the design
in such a way that if one fails, another may take on its role. It may be possible in
some situations to extend this concept of redundancy to entire machinery, having
complete units on standby in case one should fail.
Although this method in no way improves the inherent reliability of the
individual plant, the system as a whole is effectively made more available. The use of this sort of
redundancy is therefore suitable in those situations where the time to commission
replacement machinery is relatively small in comparison to the fault repair time (or
maintenance time).
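
The effect of standby units on availability can be illustrated with a short calculation. The sketch below rests on two assumptions that the text does not make: units fail independently, and switch-over to a standby unit is instantaneous. The helper name `parallel_availability` is hypothetical.

```python
def parallel_availability(unit_availability, n_units):
    """Probability that at least one of n independent, identical units is available.

    Assumes independent failures and zero switch-over time (illustrative only).
    """
    if not 0.0 <= unit_availability <= 1.0:
        raise ValueError("unit_availability must lie between 0 and 1")
    if n_units < 1:
        raise ValueError("n_units must be at least 1")
    return 1.0 - (1.0 - unit_availability) ** n_units


# One unit at 90% availability versus the same unit with one standby.
single = parallel_availability(0.90, 1)
duplicated = parallel_availability(0.90, 2)
print(f"single={single:.3f}  duplicated={duplicated:.3f}")
```

The availability of each unit is unchanged; only the chance that all units are down at once falls. A non-negligible commissioning time for the standby unit would reduce the gain.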
Other factors that must be taken into account when considering the use of
redundant machinery are cost and storage. Many industries use machinery for
which spares are far too expensive to purchase, or too large to store economically.

Diagnosis and Repair Time

When a fault occurs causing a machine to cease operation, the process time lost is
that required to diagnose and resolve/repair the fault. If this time can be reduced,
machinery is then made more available. This may be achieved by improved opera-
tor training, or specialization of operator tasks.
Another way to reduce the time required to determine the nature of the fault is
to employ a diagnostic system that uses sensor and/or operational information to
detect and isolate the fault. Such a system would typically produce an instanta-
neous diagnosis; however it would not help to reduce the time required to repair
the fault, once determined.
The maintainability of the system is improved by reduction of the diagnosis and
repair time. A reliable FDI tool assists this by providing the operating and
maintenance staff with a clear indicator towards the failing component and
sometimes to the mode of failure.

Failure Prediction

Similar to weather forecasting, the knowledge of an imminent failure of a piece of
machinery would allow machine operators to cease or change operation to minimize
and potentially avoid any subsequent downtime. Such a predictive system would
rely on a continual stream of quality data.
There is an inherent risk in such a concept in that false or misleading
predictions could cause additional downtime and quickly erode operator confidence.
For a predictive system to be effective it must not only produce a prediction of an
imminent fault, but also provide a sensible reason for the prediction, so that
operators may make an informed decision whether to act on or ignore the suggestion.

25.3.2 Potential Benefits

When determining what approach should be taken to improve the availability of
longwall machinery it is important to weigh the cost of any changes against the
potential benefits, both financial and improvements in worker safety. Since just a
handful of lost time injuries (LTIs) are recorded internationally each year as a
result of machine failure, we only consider here the financial benefits.
Quite obviously, the financial benefit of a more available longwall is the
additional extraction of coal. To quantify this, we consider a single rip undertaken
by a shearer traversing from the maingate to the tailgate, and then returning to the
maingate. In an average-sized longwall, a rip can take between 18 and 26 min
given no interruption, and will yield approximately 1800 tons of run-of-mine
(ROM) coal. Assuming a 70% yield for export quality coal, sold at $50 per ton,
this equates to between $145,000 and $210,000 gross turnover for each continuous
hour of longwall production. Generously accounting for expenses incurred such as
water and electricity, additional longwall production could be valued at around
$100,000 per hour.
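
The arithmetic behind these figures can be reproduced as follows. This is an illustrative sketch using only the numbers quoted above (1800 tons of ROM coal per rip, 70% yield, $50 per ton, 18 to 26 min per rip); the function name `hourly_turnover` is an assumption.

```python
def hourly_turnover(rip_minutes, rom_tons_per_rip=1800.0,
                    export_yield=0.70, price_per_ton=50.0):
    """Gross turnover in dollars per continuous hour of longwall production."""
    rips_per_hour = 60.0 / rip_minutes
    saleable_tons_per_hour = rips_per_hour * rom_tons_per_rip * export_yield
    return saleable_tons_per_hour * price_per_ton


slow = hourly_turnover(26.0)  # slowest rip quoted in the text
fast = hourly_turnover(18.0)  # fastest rip quoted in the text
print(f"between ${slow:,.0f} and ${fast:,.0f} per hour")
```

Each rip is worth 1800 × 0.70 × $50 = $63,000, so the hourly range follows directly from the number of rips that fit into an hour.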
With such a large potential profit for increased machine availability, it is clear
that the benefits from just a few hours of additional production per month would
far outweigh the cost of employing any of the techniques mentioned, with the
possible exclusion of machine redesign.

25.3.3 Conclusions

All longwall mines employ preventative maintenance. It is one of the largest sub-
operations at any mine, and proves effective in that, when less attention is paid to
maintenance, more faults occur. The optimum level of planned maintenance is
difficult to determine because not all failures are age-related. In fact, many failures
follow an exponential distribution with a constant failure rate that is not related to
age. Redesign and re-engineering of major offenders has been effective in many
instances and it is believed that more improvements can be realized through such
efforts. Redundancy in design has not been fully explored by longwall machine
designers mainly due to the extra cost and the bulk associated with the redundant
It is the authors’ opinion that the future of longwall mining should include
intelligent predictive systems that rely on the currently unused monitoring data.
The possibility of such a system relies on answering the question: “does the
currently recorded condition monitoring data contain sufficient information regard-
ing imminent faults?” If such information exists in the data, then information must
also exist that the fault has occurred, and specifically which fault occurred.
The outcome of the work described in this case study, the detection and
isolation of major longwall faults, should therefore be seen as a stepping stone
towards a predictive system for longwall faults/failures. A detection system would
also act as a diagnostic tool, as described above, itself contributing to the goal of
improved longwall availability.

25.4 Fault Detection and Isolation Methodology

The fault detection and isolation problem can be viewed as a subset (or branch) of
the general classification problem. Analogous to detection, a distinct change in the
state of a dynamic system can indicate that a fault has been induced. The isolation
task is thereby reduced to the categorization (or classification) of the state of the
system, which locates the source of the fault; e.g. see Willsky (1976).
The development of a specific FDI system requires the engineer to capture the
relationship between the available information (input data) and the state of the
system (which includes the presence and nature of faults). This system, when
developed, can then be viewed as a mapping function, allowing the state of the
system to be determined at each measurement interval. The nature of this function
is left to the reasoning of the engineer, dependent on both the nature of the
faults/failures, and the available system information (including sensor data and
fault/failure history).

25.4.1 FDI Techniques

Perhaps the most commonly applied FDI technique is the informal, qualitative
opinion of the expert. Analogous to the diagnostic method applied by a car
mechanic, operators (experts) use typical indicators such as heat, noise, vibration
or poor performance to ascertain the presence and nature of the fault. Typically,
faults detected using this rather subjective FDI technique must be confirmed by
further investigation.
The most rigorous of the FDI approaches, qualitative expert systems, are rule
based methods usually relying on a large number of if-then relationships. Expert
systems truly require an expert, as they rely heavily on knowledge of the influence
of all faults on system behaviour. This approach can provide excellent FDI;
however, it is not robust to variations in system parameters or the occurrence of
unforeseen faults.
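
As an illustration of the if-then structure of such rule-based methods, a minimal sketch is given below. The sensor names, thresholds and fault labels are all hypothetical and are not taken from any longwall system; a real rule base would be far larger and supplied by an expert.

```python
# Each rule pairs a fault label with a condition on one sample of sensor readings.
# All names and limits below are hypothetical, for illustration only.
RULES = [
    ("gearbox over-temperature",
     lambda s: s["gearbox_temp_c"] > 95.0),
    ("low hydraulic pressure",
     lambda s: s["hydraulic_pressure_bar"] < 250.0),
    ("conveyor overload",
     lambda s: s["motor_current_a"] > 400.0 and s["conveyor_speed_mps"] < 0.5),
]


def diagnose(sample):
    """Return the labels of every rule whose condition holds for this sample."""
    return [label for label, condition in RULES if condition(sample)]


sample = {
    "gearbox_temp_c": 102.0,
    "hydraulic_pressure_bar": 310.0,
    "motor_current_a": 280.0,
    "conveyor_speed_mps": 1.2,
}
print(diagnose(sample))
```

The sketch also shows the weakness noted above: a fault that matches none of the listed conditions returns an empty list, and fixed thresholds do not adapt when system parameters drift.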
Model-based methods, as the name suggests, rely on a mathematical model of
the system of interest and/or a model of how system faults affect sensor measure-
ments; e.g. see Frank (1990). These techniques typically rely on analytical
redundancy. The principle behind analytical redundancy is simple: for a given
measured input, a mathematical model of the system may be used to generate
estimates of its output, the redundant measurements. Comparison of these with the
real output measurements allows inference to be made regarding the operating state
of the system. A commonly applied regime is that of the Kalman filter, an optimal
state estimator. The extended Kalman filter (EKF) is used when non-linearities are
dominant. In either case, the state representation can be chosen to be the one most
sensitive to fault induced behaviour. While originally developed for estimating
states in a control system, the Kalman filter has been applied in a wide range of
fields including control, communications, image processing, biomedical science,
meteorology, and geology. For more information on the Kalman filter and its appli-
cations, there are many excellent references available; e.g. Sorenson (1985); Gelb
(1974) and Grewal and Andrews (2001).
620 D. Bongers and H. Gurgenci

The inference to be drawn from the apparent difference between the model and
system outputs, referred to as the residual, often uses simple, statistical limits.
Assumptions enforced for model validity, including the random distribution of
sensor noise, allow chi-squared confidence limits, for example, to be determined for
each element of the residual vector. Expert knowledge is then employed to
establish which faults will be evident in each of these elements.
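As an illustration of the limit check described above, the following Python sketch tests each element of a residual vector against a chi-squared limit. It assumes zero-mean Gaussian noise of known variance on each element. The function name, the example values and the use of the 95% point for one degree of freedom (3.841) are illustrative assumptions, not details of the application described in this chapter.

```python
# 95% point of the chi-squared distribution with 1 degree of freedom.
CHI2_95_1DOF = 3.841


def flag_residuals(measured, predicted, noise_var, limit=CHI2_95_1DOF):
    """Return one boolean per residual element: True if it exceeds the limit.

    measured, predicted: sensor readings and the corresponding model estimates.
    noise_var: assumed variance of the residual on each element (hypothetical).
    """
    flags = []
    for y, y_hat, var in zip(measured, predicted, noise_var):
        residual = y - y_hat
        # With Gaussian noise and no fault, residual**2 / var follows a
        # chi-squared distribution with one degree of freedom.
        flags.append(residual ** 2 / var > limit)
    return flags


# Hypothetical values: only the second channel departs from the model.
print(flag_residuals([10.1, 55.0, 3.0], [10.0, 50.0, 3.1], [0.04, 1.0, 0.25]))
```

Expert knowledge would still be needed to decide which faults are expected to appear in which flagged element.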
In contrast to model-based approaches where a priori knowledge of the system
is required, process history based or data-driven methods require only the avail-
ability of a large amount of historical process data. These techniques attempt to
capture the relationship between system measurements and system behaviour, with
the goal to detect and identify fault-affected behaviour from future measurements.
By definition, a data driven approach to fault detection and isolation is one in
which the decision criteria are based primarily or wholly on example data. Essen-
tially, a sufficiently large, example dataset representative of each fault of interest is
used to generate an algorithm which ‘maps’ a single observation input to a single
fault classification output. As new or ‘unseen’ observations of the systems are pre-
sented, they are subsequently classified (using these mappings), which allows both
the detection and isolation of faults.
Data driven methods are typically applied to systems for which the develop-
ment of accurate state-space or other dynamical equations is not possible or practi-
cal. Difficulty in the determination of accurate dynamical equations is common in
engineering problems for one or more of the following reasons:

1. A lack of understanding of the dynamic interactions between system components
2. Unpredictable environmental conditions which significantly affect the
3. Non-linearities within the system for which suitable approximations have
not been determined

Whether applied to fault detection or other classification problems, data driven
methods are often referred to as black box solutions. This is because little or no
understanding of the particular system is required for their implementation. Al-
though often based on strict optimality criteria, little interpretation can be drawn
from the subsequent equations which map the input data to a system classification.
Regardless, these methods have proven useful tools for the detection and isolation
of faults in a wide variety of engineering problems.

25.4.2 Data Driven Techniques for FDI

Numerous journal and conference papers have been published describing the
application of data driven techniques to fault detection problems. Their popularity
is largely due to the fact that the established algorithms, namely principal compo-
nents analysis (PCA), partial least squares (PLS), linear discriminant analysis,
fuzzy logic discriminant analysis and neural networks, are simple and fast to apply
with little system knowledge. Venkatasubramanian et al. (2003) provide a compre-
Fault Detection and Identification for Longwall Machinery Using SCADA Data 621

hensive review of process history based methods applied to FDI, referencing over
140 such papers.
This section provides just a handful of brief descriptions of data driven FDI
applications, for the sole purpose of illustrating the methods by which the example
data classifications are typically determined.
McKay et al. (1996) described the use of an artificial neural network, or ANN
(see Section 25.4.4) to determine the acceptability of a polymer coating used to
coat copper wire. It was determined that the viscosity of the polymer as it exited
the extrusion process (during manufacture) was the most reliable indicator of
quality, short of destructive testing. A neural network was employed to estimate
this viscosity based on sensor measurements on the extrusion equipment and data
from an attached rheometer.
Network training data was developed over a period of time whereby laboratory
experiments were performed to accurately determine the viscosity of a number of
extruded polymer samples. This form of training data is manually generated, and
relies on a number of supervised sets of measurements.
Also described in McKay et al. (1996) is the use of a neural network integrated
as part of a model based predictive control scheme. In this case, a detailed model
of the process of mixing air and fuel in a combustion engine was developed, and
the model interrogated with a number of initial condition scenarios to generate a
predicted set of measurements. This set of conditions/artificial measurements
formed the training dataset for the neural network.
Chow (2000) describes the use of an ANN to detect and isolate simple faults in
a DC motor. In contrast to the two prior examples, the training process involved
expert diagnosis to classify faults/failures as they occurred. With each occurrence,
the network weights were updated. To expedite the process, faults were induced by
damaging components or changing the resistance of internal components.
The supervised approach to generating example data is typical of data-driven
FDI examples in the open literature. Such research focuses on new detection and
isolation regimes, and assumes that training data is both available and accurate.

25.4.3 Training Data Set

All data-driven FDI systems need to be trained first on known data before they are
applied on unknown data. Availability of quality training or example data is an
essential requirement whether one uses statistical FDI or artificial neural networks.
Example data are a sufficiently large dataset with the state of the system identified
for each observation. The identification process maps every observation to a
discrete state. Below is an augmented matrix, illustrating the form in which such a
training set with associated classifications, Y , would be assembled.

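\[
Y = \begin{bmatrix}
y_{11} & y_{12} & \cdots & y_{1p} & C_1 \\
y_{21} & y_{22} & \cdots & y_{2p} & C_2 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
y_{n1} & y_{n2} & \cdots & y_{np} & C_n
\end{bmatrix}
\]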

The last column in the above matrix includes the state descriptors assigned to
each observation vector (each row). Based on the assumption that the classifica-
tions accurately and discretely describe the state of the system, various algorithms
may be applied to generate rules (or equations) that map a single observation
vector input to a single classification output. Once generated from the training set,
these rules can be used to classify new observations of the system. As the state of
the system changes from ‘normal operation’ to a state indicative of the presence of
a particular fault, this may be recognized as a fault being both detected and isolated.
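To make the idea of a rule generated from the augmented matrix concrete, the sketch below builds a nearest-centroid classifier from rows of the form [y_1, ..., y_p, C]. The nearest-centroid rule is chosen here only because it is short; it is an assumption for illustration and not one of the algorithms used in this chapter. The data values and class names are hypothetical.

```python
from collections import defaultdict


def fit_centroids(training_rows):
    """Compute the mean observation vector of each class.

    training_rows: rows of the augmented matrix, [y_1, ..., y_p, C].
    """
    sums = {}
    counts = defaultdict(int)
    for row in training_rows:
        features, label = row[:-1], row[-1]
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + v for s, v in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def classify(observation, centroids):
    """Map a single observation vector to the class with the nearest centroid."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(observation, centroids[label]))


# Hypothetical training set with p = 2 monitored variables.
Y = [
    [1.0, 0.9, "normal"],
    [1.1, 1.0, "normal"],
    [3.0, 0.2, "fault A"],
    [3.2, 0.1, "fault A"],
]
centroids = fit_centroids(Y)
print(classify([2.9, 0.3], centroids))
```

A change of the returned class from "normal" to "fault A" would correspond to the fault being both detected and isolated.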
Various data-driven techniques for FDI were discussed in the previous section.
The most common of these are multivariate statistical analysis (linear and non-linear)
and artificial neural networks. Both approaches have proven to be valuable
data-driven tools for the classification of multivariate observations.
The performance of an FDI system generated from example data is a function
of both the observability of each fault within the monitored variables and the
quality of the example data collected. Since these techniques are typically applied
where mathematical modeling is not feasible, a rigorous study of the observability
of each fault in observation space is not possible. The successful detection of faults
implies observability, but failure to detect certain faults does not imply non-ob-
servability. Observable faults will not be detected if the FDI function is not
sensitive to the specific changes exhibited by a fault, or if the training data set is
not of good quality.
It is paramount that one endeavours to apply a complete, unbiased and repre-
sentative training dataset in order to achieve a robust and accurate fault detection
and isolation system.

25.4.4 Neural Networks for FDI

Inspired by the way the biological nervous system processes information, artificial
neural networks (ANNs) are a mathematical paradigm, composed of a large
number of interconnected elements operating in parallel. The function of the net-
work, influenced by a number of factors including its architecture, is however
largely determined by the connections between elements. Analogous to the ability
of the biological system to learn by example, particular functions can be developed
by adjusting the value of these connections, which are known as weights.
Essentially, neural networks are adjusted, or trained, so that a particular input
produces a specific target output. Based on a comparison of the output and the
target, network parameters are adjusted in an iterative process until the output
adequately matches the target. This process is known as supervised learning, which
typically involves a large number of input/target pairs.
During training, each output is set to be a binary indicator for each data
classification. Unlike linear discriminant FDI, however, the output of the network
using unseen data is not open to interpretation of the likelihood that the observation
belongs to a particular class.
Figure 25.2 shows the mathematical workings of the most basic neural network
element, often termed a neuron. Each element of the vector input x is multiplied by
a weight. These products are summed, together with the neuron bias b, to form the
net input, n. This net input is then applied to a transfer function to produce the
neuron output, z. The projection of the neuron element can be viewed as a discriminant
function g(x) given by

\[
g(\mathbf{x}) \equiv z = f\left( \sum_{i=1}^{n} x_i w_i + b \right)
\]

Figure 25.2. Single neuron with vector input

It is the transfer functions of a neural network that allow it to produce
highly non-linear relationships between the input and output.
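The single-neuron computation can be written in a few lines of Python. This is a minimal sketch: the hyperbolic tangent is used as the transfer function f only as an example, and the input, weight and bias values are hypothetical.

```python
import math


def neuron(x, w, b, f=math.tanh):
    """Output z of a single neuron with vector input x, weights w and bias b."""
    # Net input: weighted sum of the inputs plus the bias.
    n = sum(xi * wi for xi, wi in zip(x, w)) + b
    # The transfer function maps the net input to the neuron output.
    return f(n)


# Hypothetical three-element input vector.
print(neuron([0.5, -1.0, 2.0], [0.4, 0.3, 0.1], b=0.1))
```

Passing a different callable as `f` (for example a logistic function) changes the form of the non-linearity without altering the rest of the computation.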
Figure 25.3 illustrates a multilayer neural network. Such a network has a signi-
ficantly greater expressive power, and is able to map a vector input to a vector out-
put. In this case, the input is the set of measurements collected by the SCADA
system. The output of the network is the state classification of the longwall system,
which may indicate that everything is normal, or that a particular fault is present.

Figure 25.3. Multiple layered neural network with vector input

A number of software packages exist to implement neural networks of various
architectures for any classification task. As such, engineers tend not to focus on the
detailed and ambiguous task of governing the precise training process of the network.
It must be stated, however, that it is the details of the training
algorithm that will dictate the level of classification success achieved, second only
to the requirement for quality training data.
A detailed description of the algorithm used in this application can be found in
Bongers (2004), which also outlines the flexibility that the engineer has in varying
specific training parameters.

25.5 Longwall Mining FDI Training Set Development

The purpose of a training dataset is to provide an algorithm with sufficiently broad
examples of each classification to allow the generation of an FDI function with
high distinguishing power. It is important that the training set is not biased in its
ability to determine a particular class of observation, as this will lead to high rates
of misclassification of other classes. Most importantly, however, the training
dataset must provide the subsequent FDI function with sufficient information to
capture the underlying relationships between various distributions of observation
vectors and the associated state of the system.
In terms of classification bias, all data-driven techniques are affected in the
same way: an unequal number of each class of observation in the example dataset
will cause the resulting FDI function to be biased in its class assignment. Ironi-
cally, the class most often presented during the development stage (training) will
have a less than appropriate chance of being assigned to new observations. This is
a result of an overfitting of the FDI function to that particular class, effectively
placing a higher importance on variations from the class mean observation vector.
Clearly, this is undesirable, leading to above-average rates of misclassification.
Equally important in the quest for an accurate fault detection system is the
distribution of observations for each class. Since the data-driven approach does not
assume an understanding of the effect of individual faults on data properties, large amounts of
data must be collected. If possible, the data should span an operational time for
which a large number of example faults occur, and exhibit an ordinary proportion
of faults/failures per unit time. In situations where a large amount of data cannot be
collected, data-driven approaches may not be appropriate.
The goal of this section is to demonstrate the non-triviality of developing a
classification scheme for longwall fault detection and isolation. Also, links will be
formed to illustrate how this phenomenon is common to a large number of en-
gineering problems.
The first concern in processing data is, however, the estimation of missing
entries. Estimation must be accurate and efficient. The k-nearest-neighbour (k-NN)
algorithm (Todeschini 1990) is used in this study.
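The sketch below shows one common way of using nearest neighbours to estimate missing entries: each gap is filled with the mean of that variable over the k complete observations closest to the incomplete one. It is an illustrative assumption; the exact variant used in this study (distance measure, value of k, treatment of scaling) is not given here, and the data are hypothetical.

```python
def knn_impute(rows, k=3):
    """Fill None entries using the mean of the k nearest complete rows.

    Assumes at least one complete row exists and that variables are
    comparably scaled (both are assumptions of this sketch).
    """
    complete = [r for r in rows if None not in r]
    filled = []
    for row in rows:
        if None not in row:
            filled.append(list(row))
            continue
        known = [j for j, v in enumerate(row) if v is not None]

        def dist(other, row=row, known=known):
            # Squared distance over the variables present in the incomplete row.
            return sum((row[j] - other[j]) ** 2 for j in known)

        neighbours = sorted(complete, key=dist)[:k]
        filled.append([
            v if v is not None
            else sum(nb[j] for nb in neighbours) / len(neighbours)
            for j, v in enumerate(row)
        ])
    return filled


# Hypothetical two-channel data with one missing entry in the last row.
data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [9.0, 90.0], [2.1, None]]
print(knn_impute(data, k=2)[-1])
```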
A training dataset requires classifications associated with each observation, for
the purpose of generating an FDI function. In order to generate a list of classifi-
cations, one must attempt to determine the ‘state’ of the longwall at each obser-
vation. It should be noted that the development of an accurate FDI system relies on
the assumption that the state of the longwall system can be classified into a finite
number of categories. The only record of the activity of the longwall is the main-
tenance log, which details all unscheduled downtime at the longwall face.
Table 25.1 is an excerpt from the maintenance log corresponding to the
condition monitoring data discussed earlier. It records the time that the delay began
and the duration of downtime experienced. The plant responsible is also recorded,
as well as a description of the delay cause.
Figure 25.4 illustrates the inaccuracy of the maintenance records. It shows
traces of motor currents and the shearer position, which are centred on a time
corresponding to a documented delay. In this case, the maintenance records show
that a delay began at observation 9059, and that the longwall was inactive for 50
observations (25 min).

Table 25.1. Excerpt from the maintenance log

Date       DST    Dur. (min)  Major delay        Minor delay            Detail delay          Remark
06-May-01  21:35  5           Support services   Pumps                  –                     –
06-May-01  21:45  25          Support services   Power supply           –                     –
06-May-01  23:20  40          Support services   Power supply           –                     –
06-May-01  0:30   80          Support services   Pumps                  –                     –
06-May-01  2:00   10          Maingate drive     Drive assembly         Cooling water supply  Blown hose in pump station
06-May-01  3:40   5           Maingate drive     Drive assembly         Cooling water supply  –
06-May-01  4:45   104         Panel              Supplies               –                     –
07-May-01  6:30   20          Labour             Travel                 –                     Panel prepr.
07-May-01  6:50   30          Maingate drive     Drive assembly         Cooling water supply  Tripper belt slip
07-May-01  8:25   20          Shearer            Cutting drum assembly  Cutter shear shaft    Tripper belt slip
07-May-01  12:40  10          Shearer            Cutting drum assembly  Cutter shear shaft    –
07-May-01  13:15  10          Shearer            Electrical – Control   Display – Screen      ESR faults on remotes
07-May-01  13:58  45          Shearer            Cutting drum assembly  Cutter shear shaft    Intermittant loss of shearer position
07-May-01  14:58  2           Maingate drive     Drive assembly         Cooling water supply  –
07-May-01  15:08  7           Maingate drive     Drive assembly         Cooling water supply  Tripper CST trip
07-May-01  15:15  10          Mining conditions  Fall/clean up          –                     Tripper CST trip
07-May-01  16:05  5           Maingate drive     Drive assembly         Cooling water supply  Tripper CST pump fault

A stoppage in production is indicated by the shearer position remaining
constant (i.e. not moving) and all motor currents falling to zero. Figure 25.4 shows
two examples of longwall shutdown, neither of which coincides with the documented
event. The reason for this common discrepancy is simple. The shift supervisor
enters the details into the maintenance log. Values such as the delay start time and
duration are taken from his/her wristwatch, whereas the time associated with the
condition monitoring data is that of the computer clock. Additionally, there are
significantly more stoppages in production than the number of documented delays.
As a result, there is uncertainty as to which stoppage in longwall production
corresponds to each documented delay.
In addition to this uncertainty, there is no indication as to how long the fault
was present prior to the resulting shutdown. This information is necessary so that
fault-affected observations can be appropriately classified. The observations after
the shutdown are not generally useful for the purposes of fault isolation since all
shutdowns look similar, regardless of the triggering cause. Therefore, to generate a
training dataset, there are two distinct challenges:

1. To determine the event time; i.e. to determine which longwall stoppage
relates to each documented instance of a maintenance event
2. To determine the number of observations prior to shutdown that contain
information about the presence of a fault, and to develop a scheme to classify
observations based on the maintenance record

Figure 25.4. Example of fault at observation 9059

Where possible, the challenges are approached in a generic manner. This will
illustrate the applicability of this research to a large number of engineering prob-
lems where system modeling is highly complex, and discrete states of the system
are not immediately apparent.
All faults considered lead to a complete longwall shutdown. That is, one or
more parameters (examples include gearbox temperatures, AFC chain tension and
earth leakage current) measure outside preset safety limits, causing all major
longwall machinery to shut down. As such, all longwall stoppages represent candidates
for each documented maintenance event. This section describes the process
by which the start time and duration of all longwall stoppages were determined, as
well as the selection criteria for candidates for each maintenance event of interest.

25.6 Event Time Determination

No binary channel exists to indicate whether the longwall is operational or shut
down; therefore other channels must be used to make this simple decision. The most
obvious choice is the motor currents of the major equipment, namely the shearer,
AFC and BSL/Crusher, as shown in Figure 25.4. As alluded to, when the longwall
is shut down, each of these records a value of 0.01 A, which is the minimum
recordable level set in the monitoring software of the SCADA system.
Fig. 25.4 illustrates the idiosyncrasies of longwall stoppages. First, when the
motor currents resume typical operating values, there is often a delay prior to
shearer movement. This is most commonly a result of the 1–3-min period required
for the conveyor start-up sequence. Second, large spikes in the armature current of
both AFC drives are evident 30 s to 1 min before the face equipment is powered.
This is known as the inrush current, which is the initial current demand on start-
up of an AC drive before a load resistance or impedance increases to its normal
operating value. Third, the shearer position may change (the shearer may be
moved) although the remainder of the longwall face is inactive. This is simply due
to the operators moving the shearer to allow access for repairs.
Due to the large number of stoppages present in the data, the process of
stoppage detection and candidate selection must be automated. It is important,
therefore, that the duration of each longwall stoppage be clearly defined. Considering the
idiosyncrasies mentioned above, a single longwall stoppage is defined from the
time when all face equipment motor currents have a value of 0.01 A and the shearer
stops moving to the time the motor currents resume typical operating values,
ignoring any ‘non-zero’ values for current that occur for two observations (1 min)
or less.

25.6.1 Candidate Selection

We consider now the selection of candidate stoppages for each maintenance event.
It is of course likely that the true event time lies in the vicinity of the documented
delay start time (DDST), and most certainly within the same 8-h working shift.
Although not shown in Table 25.1, the maintenance log contains a ‘shift’ field,
which indicates day, afternoon or night shift. The shift schedule is known for the
mine from which the data was collected. Therefore, to establish a conservative
approach that will be adopted throughout this chapter, all longwall stoppages
within the same shift will be considered candidates for each fault occurrence of
interest.

Procedure

The process of determining candidate stoppages was automated using the
following procedure:
Step 1: Determine a list L of all observations for which the value of all face
equipment motor currents is 0.01 A.
Step 2: Determine the observation number for each observation in L.
Step 2 is required since the removal of sparse observations (a procedure
carried out during data preprocessing) disturbed the sequential
nature of observations in the data matrix Y.
Step 3: Beginning with the first entry in L, determine which successive
observation numbers have a difference greater than 2. Place the
latter observation number of each such pair in a new list L2.
Step 4: Using lists L and L2, create a two-column matrix S which lists the start
time and duration of each stoppage.
The following steps are repeated for each maintenance event of interest:
Step 5: Using the maintenance log (stored electronically) determine the observation
numbers spanned by the appropriate shift.
Step 6: Determine which stoppages listed in S have a start time within this
shift.
The stoppages listed in S that have a start time within the same shift as a
particular delay are the candidate stoppages for that delay.

Results

When this procedure was applied to data representing five months of longwall
operations, 2452 stoppages were determined. The average duration for each
stoppage is 69 observations or 34.5 min (the sampling rate is two observations per
min). On average, five candidates were selected for each maintenance event of
interest using the procedure described. As further testimony to the inaccuracy of
the maintenance log, analysis showed that two particular shifts had fewer longwall
stoppages than the number of catastrophic maintenance events documented for
each shift.
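Steps 1 to 6 of the candidate selection procedure can be sketched in Python as follows. The function names, the record layout and the example numbers are hypothetical. The shift length of 960 observations assumes an 8-h shift sampled at two observations per minute, and the duration of a stoppage is taken here as the number of observation numbers it spans, which is an interpretation rather than a stated definition.

```python
def zero_current_observations(records, floor=0.01):
    """Steps 1-2: observation numbers at which all motor currents sit at the floor.

    records: iterable of (observation_number, [motor currents in A]).
    """
    return [obs for obs, currents in records if all(c <= floor for c in currents)]


def find_stoppages(zero_current_obs, max_gap=2):
    """Steps 3-4: group the sorted list L into (start, duration) pairs (matrix S)."""
    stoppages = []
    if not zero_current_obs:
        return stoppages
    start = prev = zero_current_obs[0]
    for obs in zero_current_obs[1:]:
        if obs - prev > max_gap:
            # obs is the latter member of the pair, i.e. an entry of list L2:
            # it opens a new stoppage and closes the previous one.
            stoppages.append((start, prev - start + 1))
            start = obs
        prev = obs
    stoppages.append((start, prev - start + 1))
    return stoppages


def candidates_in_shift(stoppages, shift_start, shift_end):
    """Steps 5-6: keep stoppages that start within [shift_start, shift_end)."""
    return [s for s in stoppages if shift_start <= s[0] < shift_end]


# Hypothetical list L of observation numbers with all currents at 0.01 A.
L = [100, 101, 102, 104, 105, 300, 301, 302, 303, 1200, 1201]
S = find_stoppages(L)
print(S)
print(candidates_in_shift(S, shift_start=0, shift_end=960))
```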

25.6.2 Event Candidate Cost Function

A number of electronically-recorded stoppages in longwall production have been
identified as those associated with documented maintenance events. Furthermore,
for reasons discussed in the previous section, only a handful of these are con-
sidered candidates for each occurrence of a fault of interest. This section attempts
to discriminate further between candidates.
A two-stage process was adopted. In the first stage, a number of stoppages in
longwall production were identified as candidates for each documented delay of
interest. On average, five candidates were selected for each event.
In the second stage (presented in this and subsequent sections), these likely
candidates were compared against each other to identify the best match to the event
described in the production delay history. It was important that this was done in a
generic manner, i.e. with no consideration given to the nature of each specific
failure. Each step in the development of the training dataset had to be universally
applicable, allowing the generation of FDI systems for a variety of applications.
The research question that must now be posed is: ‘what information is available
that can be used to determine which candidate corresponds to the documented
event?’
To answer this, we look to the maintenance log. The only information available
is the difference between the delay start time and duration of each candidate and
those of the documented event. We define ∆DST as the difference between the
delay start time of a candidate and the documented delay start time. ∆DD is
similarly defined as the difference between the duration of each candidate stoppage
and that of the documented delay. Each candidate will have associated values of
∆DST and ∆DD, and these will initially be used to determine which candidate
corresponds to the documented downtime.
The discriminating metric is simply a weighted sum of the available discrimina-
tory information, in this case ∆DST and ∆DD. Commonly referred to as a cost
function, it provides a crude way of determining which stoppage relates to the
documented maintenance event. The form of the cost function is

Cost = α ∆DST + β ∆DD

where α and β are the (generally) unequal weights.

A cost of zero indicates a stoppage whose start time and duration are congruent
with the maintenance records. Conversely, a large cost shows that one or both of
the indicators is significantly different to those documented. Typically, the candidate
with the lowest cost would be selected.

Determining α and β

The task of assigning values to the cost coefficients is usually approached in an ad
hoc manner. One must rely on particular knowledge of the application and make an
educated guess as to the contribution of each indicator. In this case, we are trying
to answer the question: ‘how much confidence can we place in the operator to
correctly document the delay start time and delay duration?’ More specifically, ‘by
what factor do we believe ∆DD to be more/less accurate than ∆DST?’
Section 25.2 presented the key performance indicators that are universal to
Australian longwall mines. Clearly, longwall availability and consequently ma-
chine downtime are under the watchful eyes of the mine manager. As such, it is
expected that the documented delay duration is a reasonably accurate reflection of
the actual lost time.
Discrepancies arise, however, when a failure and repair immediately precede or
follow a meal break. The latter is more likely; a failure of face equipment is the
most logical time for workers to break, rather than interrupt continuous production.
A result of this is that the documented DD may not include the time of the meal
break. Therefore, the computer records used to detect candidates will show one
long production delay rather than two distinct events.
On the other hand, there is little advantage in correctly documenting the delay
start time. In fact, some longwall operations see documenting the DST as needless
bookkeeping and do not record it.
Our experience of underground operations showed that operators were diligent
and accurate in the documentation of the delay duration. For no apparent reason, the
DD usually included any crib break that immediately followed, negating the
problem previously described. Large variation, however, was noticeable in the DST.
Table 25.2 shows the maintenance log from a single shift we observed. Table
25.3 is our record of the events as they occurred at the longwall face. Clearly, there
are discrepancies in both the DST and DD. Analysis of these errors shows the
average discrepancy to be 8 min and 31 min for DD and DST respectively.
In line with the previous arguments, and the limited comparative data, it is
assumed that, on average, |∆DST| will be four times larger than |∆DD|. The more
reliable indicator, ∆DD, is therefore weighted four times as heavily, and
the cost function for initial candidate selection will be

Cost = ∆DST + 4 ∆DD
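The average discrepancies quoted above can be reproduced from the four events listed in Tables 25.2 and 25.3. The short Python check below assumes that the rows of the two tables are paired by fault description rather than by row order (the shearer picks and AFC tension events appear in a different order in the two tables); that pairing is an inference from the descriptions.

```python
def minutes(hhmm):
    """Convert an 'H:MM' clock time to minutes after midnight."""
    h, m = hhmm.split(":")
    return int(h) * 60 + int(m)


# (logged DST, logged duration, observed DST, observed duration), durations in min.
events = [
    ("15:10", 120, "14:23", 137),  # shearer hydraulic oil / pressure switch
    ("17:40", 45, "17:32", 44),    # stabilizer cylinder valve
    ("19:15", 20, "20:17", 16),    # shearer picks
    ("19:45", 25, "19:37", 34),    # AFC chain tension
]

dst_err = [abs(minutes(a) - minutes(c)) for a, _, c, _ in events]
dd_err = [abs(b - d) for _, b, _, d in events]
mean_dst = sum(dst_err) / len(events)
mean_dd = sum(dd_err) / len(events)

# Mean |dDST|, mean |dDD| and their ratio.
print(mean_dst, mean_dd, mean_dst / mean_dd)
```

Under this pairing the mean discrepancies are 31.25 min for the DST and 7.75 min for the DD, a ratio of about four, which is consistent with the weighting adopted in the cost function.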

Table 25.2. Maintenance records

15:10  120  M  Shearer out of hydraulic oil, pressure switch faulted
17:40  45   M  Hydraulics: change stabilizer cylinder valve
19:15  20   M  Replace LW shearer picks
19:45  25   M  AFC Chain overtension

Table 25.3. True equivalent of Table 25.2

14:23  137  M  Replace shearer hydraulic fluid pressure switch
17:32  44   M  Changeover stabilizer cylinder valve on support #62
19:37  34   M  Problem with AFC Tension - system reset
20:17  16   M  Shearer picks

Application of the Cost Function

Application of a simple cost function whose coefficients are based on ‘gut feeling’
and data from a single operating shift can provide misleading results. As such, a
candidate will only be selected if its cost is lower than that of every other
candidate within the 8-h window by a (nominal) factor of three.
When applied to all 89 fault occurrences of interest, a single candidate could be
selected for 11. It is the data immediately prior to these that will form the basis
of the work in Section 25.7.
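The conservative selection rule can be sketched as follows. The sketch assumes that the cost uses the magnitudes of ∆DST and ∆DD (so that a cost of zero means perfect agreement), that both are expressed in minutes, and that a lone candidate is accepted; the function names and example values are hypothetical.

```python
def cost(delta_dst, delta_dd, alpha=1.0, beta=4.0):
    """Weighted sum of the start-time and duration discrepancies (minutes)."""
    # Absolute values are an assumption of this sketch.
    return alpha * abs(delta_dst) + beta * abs(delta_dd)


def select_candidate(candidates, margin=3.0):
    """Return the lowest-cost candidate only if it wins by the required margin.

    candidates: list of (name, delta_dst, delta_dd).
    Returns None when no candidate is at least `margin` times cheaper than
    every other candidate.
    """
    if not candidates:
        return None
    scored = sorted((cost(d_dst, d_dd), name) for name, d_dst, d_dd in candidates)
    best_cost, best_name = scored[0]
    if len(scored) == 1:
        return best_name  # assumption: a single candidate is accepted as is
    runner_up = scored[1][0]
    if runner_up > best_cost and runner_up >= margin * best_cost:
        return best_name
    return None


# Hypothetical candidates for one documented delay.
print(select_candidate([("s1", 5, 2), ("s2", 40, 10), ("s3", -90, 30)]))
print(select_candidate([("a", 5, 2), ("b", 10, 4)]))
```

In the first call the costs are 13, 80 and 210, so the first candidate is selected; in the second the costs are 13 and 26, the margin is not met and no candidate is returned.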

25.6.3 Clustering Algorithm for Candidate Selection

The purpose of this section is to employ a clustering algorithm to select candidates
for faults where a single candidate could not be conservatively selected by the cost
function. The approach taken relies on two assumptions: first, that the candidates
selected by the cost function are, in fact, the actual stoppages corresponding to the
documented delay (which can be confirmed by the manual inspection of key,
individual channels prior to the selected candidate), and second, that the dynamics
of the longwall are slow enough that the observations immediately prior to these
instances of shutdown contain information indicative of the fault that is present.

Discriminant Analysis

Similar to PCA, discriminant analysis produces a new dataset, via linear or non-linear
projection, which is of the same dimensionality as the original set. The projections
are orthogonal, and are developed with criteria to maximize the separation (in
multivariate space) between classes of data, while minimizing the spread within
each class.
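The criterion of maximizing between-class separation while minimizing within-class spread is illustrated below with a two-class, two-variable Fisher linear discriminant. This is a deliberately reduced sketch that yields a single projection direction; it is not the multi-class procedure applied in this chapter, and the data values are hypothetical.

```python
def mean(rows):
    """Mean vector of a list of 2-D observations."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def scatter(rows, m):
    """2x2 within-class scatter matrix of `rows` about the mean `m`."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - m[0], r[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fisher_direction(class_a, class_b):
    """Direction w = Sw^-1 (mean_a - mean_b) for two classes of 2-D data."""
    ma, mb = mean(class_a), mean(class_b)
    sa, sb = scatter(class_a, ma), scatter(class_b, mb)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    # Closed-form inverse of the 2x2 pooled scatter matrix applied to diff.
    return [
        (sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
        (-sw[1][0] * diff[0] + sw[0][0] * diff[1]) / det,
    ]


def project(x, w):
    """Scalar projection of observation x onto direction w."""
    return x[0] * w[0] + x[1] * w[1]


# Hypothetical observations of two classes.
normal = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4], [0.9, 2.1]]
fault = [[3.0, 0.5], [3.4, 0.9], [2.8, 0.4], [3.1, 0.7]]
w = fisher_direction(normal, fault)
print([round(project(x, w), 2) for x in normal])
print([round(project(x, w), 2) for x in fault])
```

For these values the projected classes form two well separated groups, which is the clustering behaviour that the approach relies on.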
The hypothesis is that data representing similar (known) longwall behaviour can be
made to cluster. If data of an unknown class, under the same linear projections,
tends to join a particular cluster, it may then be classified accordingly.

Clustering Results

In line with the conservative approach in the previous sections, two observations
prior to each identified candidate were classified according to the type of fault they
represent. Other classes of data included were randomly selected observations
corresponding to longwall shutdown and normal operation. Some observations
were duplicated to ensure an equal number of each class. The clustering algorithm
was applied to this data.
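Duplicating observations to equalize class counts can be sketched as below. Random duplication with a fixed seed is an assumption of this sketch; how the duplicated observations were actually chosen is not stated. The labels and values are hypothetical.

```python
import random
from collections import defaultdict


def balance_by_duplication(rows, seed=0):
    """Duplicate observations of smaller classes until all classes are equal in size.

    rows: list of (observation, label) pairs.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for obs, label in rows:
        by_class[label].append(obs)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for label, members in by_class.items():
        # Draw the extra copies at random from the existing members.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        balanced.extend((obs, label) for obs in members + extra)
    return balanced


# Hypothetical data: three 'normal' observations and one of a fault class.
rows = [([1.0, 2.0], "normal"), ([1.1, 2.1], "normal"),
        ([0.9, 1.9], "normal"), ([3.0, 0.5], "fault A")]
balanced = balance_by_duplication(rows)
print(sum(1 for _, c in balanced if c == "normal"),
      sum(1 for _, c in balanced if c == "fault A"))
```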
The expectation is that observations prior to certain unassigned candidates,
projected under the same algorithm, would fall into the confidence interval defined
by data representative of a known fault type. This would allow the assignment of
candidates to the remaining faults, completing the event time determination process.
Figure 25.5 shows an example where the clustering algorithm projected candidates
in such a way that one alone could be identified as the stoppage of interest.
The selected candidate is projected within the 95% confidence interval of the test

Figure 25.5. Projections of candidate observations for classification

632 D. Bongers and H. Gurgenci

Similar results were seen for the majority of maintenance events, with the
exception of seven. Candidate selection was not possible for these because:
• No candidates were projected within the 95% confidence interval as defined
by the T2-statistic
• More than one candidate was projected within this confidence interval
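The selection rule implied by these two failure cases can be sketched as below. The sketch assumes that a T2 value for each projected candidate and the 95% limit have already been computed elsewhere; the function name `select_candidate` and the example values are hypothetical.

```python
def select_candidate(t2_by_candidate, t2_limit):
    """Return the single candidate whose T2 value lies within the limit,
    or None when no candidate or more than one candidate qualifies."""
    inside = [name for name, t2 in t2_by_candidate.items() if t2 <= t2_limit]
    return inside[0] if len(inside) == 1 else None

# Exactly one candidate inside the limit: it is selected
chosen = select_candidate({"stop_A": 2.1, "stop_B": 9.7, "stop_C": 14.0}, 6.0)
# None inside, or several inside: no selection is possible
none_inside = select_candidate({"stop_A": 8.0, "stop_B": 9.7}, 6.0)
several_inside = select_candidate({"stop_A": 2.1, "stop_B": 3.3}, 6.0)
```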

25.7 Classification of Observations

The work presented in the previous section established the one-to-one correspon-
dence between stoppages in longwall production and documented maintenance
events. A simple cost function of the discrepancies in the start time and duration of
candidate stoppages allowed the automated selection of a small number of candi-
dates. Slow longwall dynamics permitted the classification of observations prior to
these shutdowns, which were used to generate a discriminant function to select the
remaining candidates.
This section describes the assignment of classifications to a large number of
observation vectors, which will combine to form a training set for FDI develop-
ment. In this training set, every observation vector will be assigned to a class. The
label itself is of little consequence; it serves only to indicate which group or class
of data each observation represents. These classifications must be accurate, discrete
descriptions of the state of the longwall. As with the event time determination
problem, this process must have a reasonable level of automation and generality to
allow fast and accurate classification of observations in applications other than
longwall fault detection.
The majority of the observations will be assigned to one of the two main
classes: the class ‘normal’, representing fault-free, normal operation; and the class
‘longwall shutdown’, representing the state of complete stoppage. Particular types
of failure are identified by analyzing the observations while the longwall is in
transition from ‘normal’ to ‘shutdown’ state. This assumes that there is a transition.
In other words, it is assumed that the data prior to each resulting shutdown contains
information regarding the presence of the fault. Analysis of this data should then
reveal the transition from normal operation to fault-induced operation.
This section addresses the problem of determining the length of that transition
period; i.e. identifying the set of observations that are distinct to each shutdown.
Unfortunately, the system has the same shutdown signature regardless of the cause
of the shutdown. Therefore, the observations to be nominated as training set entries
for a particular shutdown should occur immediately before the shutdown. The
following questions need to be answered:

1. Is there a set of N observations $\{y_{k-N}, \ldots, y_{k-1}\}$ before the shutdown at $y_k$,
which are different from normal operation and can be used as an indicator
of the development of the fault that eventually causes the shutdown?
2. What is the value of N? That is, how many observations prior to shutdown
can be included in the training set for each fault class?
Analysis of individual trends prior to shutdown would be a laborious process.
Also, reliance on specific knowledge of each fault is undesirable, in order to retain
a certain level of generality. As such, bulk data properties, or metrics, are observed.
Numerous metrics were investigated; two of these, Hotelling’s T2 statistic
(Hotelling 1931) and the PCA-residual Q-statistic (Bongers 2004), seemed to em-
phasize clearly a change in the data properties (the relationships between variables)
prior to each of the failures of interest.
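For readers unfamiliar with these two metrics, a minimal sketch of how they are commonly computed from a PCA model is given here. It follows the standard textbook definitions rather than the exact procedure of the study; the function names, the standardization step and the synthetic data are assumptions.

```python
import numpy as np

def fit_pca(X_normal, n_components):
    """Fit a PCA model on data taken to represent normal operation."""
    mean = X_normal.mean(axis=0)
    std = X_normal.std(axis=0, ddof=1)
    Z = (X_normal - mean) / std
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    P = Vt[:n_components].T                         # retained loadings
    eigvals = s[:n_components] ** 2 / (len(Z) - 1)  # variances of the scores
    return mean, std, P, eigvals

def t2_and_q(X, mean, std, P, eigvals):
    """Hotelling's T2 (variation inside the PCA model) and the Q-statistic
    (squared residual left outside the model) for each observation."""
    Z = (np.asarray(X, dtype=float) - mean) / std
    scores = Z @ P
    t2 = np.sum(scores ** 2 / eigvals, axis=1)
    residual = Z - scores @ P.T
    q = np.sum(residual ** 2, axis=1)
    return t2, q

# Synthetic 'normal' data: five channels driven by two latent factors
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X_normal = latent @ mixing + 0.05 * rng.normal(size=(200, 5))
model = fit_pca(X_normal, n_components=2)
t2_train, q_train = t2_and_q(X_normal, *model)
```

T2 responds to unusually large excursions along the retained components, whereas Q responds to changes in the relationships between variables that the retained components do not describe.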

25.7.1 Distribution Prior to Shutdown

Figure 25.6 shows the trace of the T2-statistic around the time of a longwall
stoppage identified as an example of a maingate drive cooling fault. The values on
the horizontal axis of this and subsequent figures have been shifted so that
observation zero represents the first measurement of longwall shutdown. There is a
clear transition from normal operation to shutdown indicated by the values of this
statistic starting to rise a number of observations prior to shutdown.
The dashed lines represent the upper and lower confidence limits for data
representative of normal longwall operation. These were determined by conservatively
selecting data between a number of stoppages in production. The T2 values
for observations in the class normal are likely to stay between these limits. It is
the violation of these limits that can be used to test whether the system is behaving
in an abnormal manner.
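One simple way to obtain such limits and test for their violation is sketched below. The use of empirical percentiles of the statistic over normal data is an assumption made for illustration; the text does not state how the limits were calculated, and the function names and the synthetic values are hypothetical.

```python
import numpy as np

def empirical_limits(stat_normal, coverage=0.95):
    """Lower and upper limits containing `coverage` of the values of a
    statistic computed on data representing normal operation."""
    tail = 100.0 * (1.0 - coverage) / 2.0
    lower, upper = np.percentile(stat_normal, [tail, 100.0 - tail])
    return float(lower), float(upper)

def violates(stat, lower, upper):
    """True where an observation falls outside the limits."""
    stat = np.asarray(stat, dtype=float)
    return (stat < lower) | (stat > upper)

# Synthetic stand-in for T2 values during normal operation
rng = np.random.default_rng(2)
t2_normal = rng.chisquare(df=3, size=1000)
lower, upper = empirical_limits(t2_normal)
flags = violates([upper + 1.0, (lower + upper) / 2.0], lower, upper)
```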

Figure 25.6. T2 statistic prior to drive assembly cooling fault

This particular figure shows a distinct change from what is apparently normal
operation. The four observations prior to shutdown are clearly outside the 95%
confidence limit, which suggests that these represent operation with the fault

Figure 25.7. T2 statistic prior to maingate blockage

Figure 25.7 shows the T2 values in the vicinity of an AFC maingate blockage
fault. Once again, significant abnormal activity is observed prior to shutdown.

Figure 25.8. Q-statistic prior to BSL Dupline fault

The Q-statistic was able to illustrate abnormal system behaviour prior to
shutdown resulting from a BSL Dupline fault, as shown in Figure 25.8. This is an
encouraging result, as one would expect that a faulty Dupline controller would
contribute little detectable variation in the system. Also, the detection of this sort of
variation in the Q-statistic highlights the fact that it is sensitive to smaller effects
not captured by the lower-dimensional PCA representation.

25.7.2 Classification of Observations

The previous section highlighted distinct changes in data properties prior to
longwall shutdown. In the case of the maingate drive cooling fault, AFC blockage
(maingate) and BSL drive stall, this was represented by a gradual increase in the T2
value outside the confidence limits for normal longwall operation. A similar trend
was displayed by the Q-statistic values preceding a BSL Dupline fault. These will
be referred to as one-stage faults because the transition from the normal to the
shutdown class occurs over a constant-slope line.
Some instances of fault-affected behaviour showed two distinct, abnormal
periods prior to shutdown, which will be referred to as two-stage faults. Whether
one-stage or two-stage, it is these abnormal observations during longwall operation
that must be classified in a way that indicates the presence of a fault leading to a
particular cause for shutdown.
A separate class of data is required for each type of one-stage fault, and two
classes for each two-stage fault. The data corresponding to the first of the two
stages will be classified as “fault ‘x’ imminent” to indicate that it precedes the second
class of data named “fault ‘x’ present”, with ‘x’ corresponding to a name indicating
the nature of the subsequent shutdown. One-stage faults will be assigned to the latter
class, since it is the class of data which precedes longwall shutdown.
Figure 25.9 shows the metric values prior to shutdown for a specific fault with
the boundaries for each classification.

Figure 25.9. Classifications from a T2 trace

Other Classifications

Although not investigated here, a large number of other classes of data are present
in the longwall data. These would include fault-affected operation leading to
shutdown as a result of other failure modes not of interest here. As such, all
observations not clearly representative of fault-affected behaviour, longwall shutdown
or fault-free operation will be pooled into an alternative class, arbitrarily labelled
This will ensure that other fault-affected operation is not classified as
NORMAL, and may serve to indicate that abnormal behaviour has been detected.
Given this, it should be noted that less than 12% (approximately one-ninth) of
observations in the training dataset fall into this category.

Automation of Classification Process

In order to maintain the generic and semi-autonomous nature of the classification
scheme, the process of classifying observations must be automated. One question
that arises is whether the classifications must be determined on a case-by-case basis,
at each instance looking at either the T2 or Q-statistic, or whether classifications for
each fault type should be determined by an average number of affected observations
prior to shutdown.
In generating training data for fault detection and isolation, all observations
prior to a documented shutdown of interest that are continuously outside the 95%
confidence interval for normal longwall operation will initially be classified as
“fault ‘x’ present”. Of these, if the first four or more observations outside this limit
show a total variation of less than 10% of their average value, they will be
classified as “fault ‘x’ imminent”.
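The rule just stated, together with the two waiver conditions listed next, can be sketched as follows. This is one possible reading, not the authors' implementation: the function name `label_pre_shutdown` is hypothetical, only violations of the upper limit are considered, and the plateau is taken to be the longest leading stretch (of at least four observations) whose range stays below 10% of its mean.

```python
def label_pre_shutdown(stat, upper_limit, shutdown_idx, is_q_statistic=False,
                       min_plateau=4, max_rel_variation=0.10):
    """Label the run of observations immediately before a shutdown that lie
    continuously above the limit. Returns (start_index, labels)."""
    # Walk backwards from the shutdown while the statistic stays above the limit
    start = shutdown_idx
    while start > 0 and stat[start - 1] > upper_limit:
        start -= 1
    run = stat[start:shutdown_idx]
    labels = ["PRESENT"] * len(run)   # initial classification
    if is_q_statistic:                # waiver: Q-statistic case
        return start, labels
    # Longest leading stretch with total variation below 10% of its mean
    plateau = 0
    for n in range(min_plateau, len(run) + 1):
        prefix = run[:n]
        mean = sum(prefix) / n
        if max(prefix) - min(prefix) < max_rel_variation * mean:
            plateau = n
        else:
            break
    # Waiver: a plateau that runs right up to the shutdown stays PRESENT
    if 0 < plateau < len(run):
        labels[:plateau] = ["IMMINENT"] * plateau
    return start, labels

t2 = [1.0, 1.0, 1.0, 5.0, 5.1, 5.0, 5.2, 8.0, 12.0, 20.0, 0.0, 0.0]
start, labels = label_pre_shutdown(t2, upper_limit=3.0, shutdown_idx=10)
```

In this example the four nearly constant values are labelled IMMINENT and the three rising values before the shutdown are labelled PRESENT.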
This condition will be waived if either:
• These observations immediately precede the subsequent shutdown
• They represent measurements of the Q-statistic, which was utilized only for
the Dupline fault

Interpretation of Classifications

Names given to classifications are purely arbitrary; however they should provide a
description of the discrete state of the system. Specifically, the IMMINENT classi-
fication should be distinguished as always preceding the PRESENT classification.
Detection of the former class of data serves to predict both the presence of a fault
and the shutdown that may ultimately follow.
In keeping with the preferred passive rather than active nature of an on-line
fault detection system, it is important to remember that the real-time detection of
faults merely provides forewarning for longwall operators. Detection precedes
shutdown, allowing changes in the system to be initiated to avoid or reduce long-
wall downtime. Also, since the classifications described above indicate conditions
that may or may not lead to a system shutdown, their detection should be treated as
only one factor in deciding any corrective action.
The basic classifications determined earlier in this section are more accurately
described as follows:

1. Normal longwall operation: current measurements indicate that the longwall
is operating in a fault-free manner
2. Fault x imminent: current measurements are commensurate with those in
the early stages of fault-affected operation which may lead to the eventual
shutdown of the longwall as a result of an x-type fault
3. Fault x present: current measurements are commensurate with those associated
with fault-affected operation which suggests that a shutdown of the
longwall as a result of an x-type fault is imminent
4. Longwall shutdown: current measurements suggest that the longwall is in
full shutdown; that is, all major face equipment is not operational

25.7.3 Compilation of Training Set

To this point, a large number of observations have been classified as representing
normal operation, longwall shutdown, fault-affected behaviour, or other (non-
normal operation). This section describes the engineering judgment that must be
employed in the assembly of the training set that will be used for FDI function
development. Such judgment is required to obtain an unbiased training dataset that
best characterizes the various states of the longwall system.

Removing Bias from Training Set

An unbiased training set is one that contains an equal number of observations for
each class to be discriminated. Research has shown that an unequal quantity of
each class of data typically results in a discrimination function that is biased
towards classifying new observations as the more frequently presented class. As
such, some fault-affected observations were duplicated to ensure a large training
set with an equal proportion of each data class.

Transitional Observations

Ultimately the goal of the training data development is to have a set of observa-
tions and associated classifications that, when applied to an FDI development
algorithm, can produce an FDI function with the greatest distinguishing power.
Erroneous classifications in the training set may alter the decision criteria in a way
that either:
• Reduces the mutual exclusivity of the decision space
• Incorrectly loosens or tightens the decision criteria for one or more classes

Observations immediately preceding the IMMINENT or FAULT PRESENT
classifications that were classified as NORMAL OPERATION may represent
operation with the presence of a fault. As mentioned previously, the dynamics of
the longwall are slow, requiring time for the presence of a fault to become evident
in the sensor measurements. Also, they may correspond to observations taken
when the severity of the impending fault is low. These so-called ‘normal’
observations are therefore considered transitional, and may not adequately characterize
fault-free operation. In order to remain consistent with the stated goal of training
data development, these transitional observations were removed from the training
set.

25.8 FDI Results

A multilayer neural network was trained with the data described. Approximately
20% of the data was reserved for network validation, and was therefore not used in
the training process.
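A hold-out split of this kind can be sketched as below. The random shuffling, the fixed seed and the function name `holdout_split` are assumptions made for illustration; the text does not say how the validation observations were chosen.

```python
import numpy as np

def holdout_split(X, y, validation_fraction=0.2, seed=0):
    """Reserve a fraction of the labelled observations for validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(round(validation_fraction * len(X)))
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

# Illustrative data: 50 observations of 4 channels with alternating labels
X = np.arange(200, dtype=float).reshape(50, 4)
y = np.arange(50) % 2
X_train, y_train, X_val, y_val = holdout_split(X, y)
```

Where classes have been balanced by duplicating observations, splitting before duplication keeps copies of the same observation from appearing in both the training and the validation data.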
Figure 25.10 presents the output of the network in the vicinity of a maingate
drive cooling fault. The input data used was that reserved for validation, and hence
is ‘unseen’ data. Clearly, the output indicator of normal operation significantly
drops as the network indicates that the fault is present. After a time, the LONGWALL
SHUTDOWN classification is dominant, and the other outputs are essentially
zero. As with some figures shown previously, the horizontal axis has been shifted
so that observation zero indicates the first observation of shutdown.

Figure 25.10. Network output using ‘unseen’ data

In order to measure the overall FDI performance of the neural network, we
calculated the average recall and precision when applied to all ‘unseen’ faults.
These measures are defined as follows.
The recall(i) of a classification system for a given class of input, i, is defined as

$$\mathrm{recall}(i) = \frac{\left|\mathrm{output}(i) \cap \mathrm{correct}(i)\right|}{\left|\mathrm{correct}(i)\right|}$$

where output(i) refers to the set of all observations that the system classifies as
fault type i, correct(i) is the set of all observations in the input set that are
actually in fault class i, and |·| denotes the number of observations in a set. The
recall is then the fraction of the observations of type i that the system correctly
classifies. It is of course possible that |correct(i)| = 0 (when the system is
presented with an input set for which no correct classification exists). In such a
situation recall is defined as unity, regardless of the performance of the
classification system.
The precision(i) of a classification system for a given class of input, i, is
defined as

$$\mathrm{precision}(i) = \frac{\left|\mathrm{output}(i) \cap \mathrm{correct}(i)\right|}{\left|\mathrm{output}(i)\right|}$$

Subtly different from the concept of recall, precision is the fraction of
observations that the system classifies as type i that are actually of type i. In the
situation where |output(i)| = 0 (when the system never classifies an observation as
type i), it is defined as:

$$\mathrm{precision}(i) = \begin{cases} 1 & \text{if } |\mathrm{output}(i)| = 0 \text{ and } |\mathrm{correct}(i)| = 0 \\ 0 & \text{if } |\mathrm{output}(i)| = 0 \text{ and } |\mathrm{correct}(i)| \neq 0 \end{cases}$$
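The two measures, including their special cases, can be sketched in a few lines. The sketch represents output(i) and correct(i) as sets of observation indices and takes the second special case of precision to apply when correct(i) is not empty; the function names and the example sets are hypothetical.

```python
def recall(output, correct):
    """Fraction of the observations truly in class i that were labelled i."""
    output, correct = set(output), set(correct)
    if not correct:
        return 1.0  # defined as unity when no observation belongs to class i
    return len(output & correct) / len(correct)

def precision(output, correct):
    """Fraction of the observations labelled i that are truly in class i."""
    output, correct = set(output), set(correct)
    if not output:
        return 1.0 if not correct else 0.0
    return len(output & correct) / len(output)

# Observations 3..6 are truly fault i; the system labelled 4..7 as fault i
r = recall(output={4, 5, 6, 7}, correct={3, 4, 5, 6})
p = precision(output={4, 5, 6, 7}, correct={3, 4, 5, 6})
```

Here three of the four true observations are found and three of the four labelled observations are right, so both measures equal 0.75.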

Analogous to measures typically used in applied statistics, the recall and
precision are the complements of the probabilities of a Type I and a Type II error
respectively. That is, for each class of data,

$$P(\text{Type I Error}) = 1 - \mathrm{recall}(i)$$

$$P(\text{Type II Error}) = 1 - \mathrm{precision}(i)$$

Table 25.4 presents the overall FDI performance of the neural network. For all
faults, the values of precision and recall are higher than those for the linear
discriminant algorithm. All instances of faults were both detected and isolated,
although, again, detection occasionally occurred a few observations after the
FAULT PRESENT class of data had begun.

Table 25.4. FDI performance using neural network

Fault                      No. test examples   Recall   Precision
MG drive cooling fault     14                  0.929    0.813
AFC blockage (maingate)    21                  0.952    0.833
BSL drive stall            16                  0.934    0.789
BSL dupline fault          8                   0.750    0.857

The results presented in this section show the successful detection and isolation
of faults using both the linear discriminant algorithm and the two-layer neural
network. The improvements in FDI performance offered by the NN suggest that
there exists some non-linearity in the relationship between sensor measurements
and the determined classifications. This is typical of most mechanical systems,
largely due to the non-linear effect of damping.

25.9 Concluding Remarks

The work presented here illustrates the application of fault detection and isolation
to a longwall mine. Given the accurate and timely detection of faults, the equip-
ment operators can preempt a catastrophic failure, or more rapidly respond to a
longwall shutdown. In either case, the FDI has served its purpose which was to
reduce the system downtime associated with equipment faults.
Condition monitoring data representing five months of operation was collected
from a longwall mine. An error surface, in combination with analysis of
the distribution of missing entries per observation, was used to determine specific
limits, α and β, for the removal of rows and columns from the data matrix that
were deemed too sparse to allow sufficiently accurate missing entry estimation.
Missing values were estimated using the k-NN algorithm, which displayed a
smaller estimation error than other documented techniques.
The results show misclassification rates as low as 14.3%, which is considerably
better than the majority of documented performances of FDI systems using real
data. The two-layer neural network performed better than the linear discriminant
analysis, which revealed a level of non-linearity within the system. Overall, these
results were deemed largely successful, thereby verifying the validity of the
classification scheme.
Significant effort has been spent on correcting the maintenance logs, which
were the historical record of faults. In a number of industries, systems are in place
to ensure that this data is very accurate. As such, the implementation of FDI to
other systems may not require the degree of treatment presented here.
Finally, this study employed a data-driven approach for detection and isolation
of longwall face equipment faults. This was necessitated by the complexity of the
equipment that made a model-based approach impractical. However, it may be
possible to address at least subsets of the target fault lists by several model-based
approaches (e.g. Reid 2007). Although success has been shown using data-driven
techniques, any implemented FDI system would most likely be a hybrid system,
incorporating decision support from a number of FDI functions. As an example, a
sensor fault would cause an abnormal signal to be produced in a single channel
with a negligible effect on the overall relationships in the data. The approach taken
in this study would be insensitive to this sort of fault, which is, however, ideally
suited to detection by model-redundancy based algorithms.

25.10 References
Bongers, D., (2004) Development of a Classification System for Fault Detection in Longwall
Systems, PhD Thesis, The University of Queensland
Chow, M.Y., (2000) Guest Editorial: Special Section on Motor Fault Detection and
Diagnosis. IEEE Transactions on Industrial Electronics, 47(5):982–983
Frank, P.M., (1990) Fault diagnosis in dynamic systems using analytical and knowledge-
based redundancy – a survey and some new results, Automatica, 26(3): 459–474
Gelb, A., (1974) Applied Optimal Estimation, MIT Press, Cambridge, Massachusetts.
Grewal, M.S., Andrews, A.P., (2001) Kalman Filtering: Theory and practice using MATLAB,
John Wiley and Sons, New York
Hotelling, H., (1931) The generalization of Student's ratio. Annals of Mathematical Statistics,
McKay, B., Lennox, B., Willis, M., Barton, G., Montague, G., (1996) Extruder Modelling:
A Comparison of two Paradigms. UKACC International Conference on Control'96, 2:
734–739, Exeter, UK. Conference publication No. 427
Reid, A. (2007) Longwall Shearer Cutting Force Estimation, PhD Thesis, The University of
Sorenson, H.W., (1985) Kalman Filtering: Theory and Application, IEEE Press, New York
Todeschini, R., (1990) Weighted k-nearest neighbor method for the calculation of missing
values, Chemometrics and Intelligent Laboratory Systems, 9:201–205
Venkatasubramanian, V., Rengaswamy R, Yin K, Kavuri S, (2003) Review of Process Fault
Diagnosis – Parts I, II, III. Computers and Chem Eng, 27(3): 293–346
Willsky, A.S., (1976) A survey of design methods for failure detection in dynamic systems,
Automatica, 12:601–611
Contributor Biographies

Chapter 1

Khairy Kobbacy is the Professor of Management Science and Associate Head
(Research) of Salford Business School, Salford University, UK. He is also the
Director of the Management and Management Sciences Research Institute. Prof
Kobbacy has a BSc from Cairo, M.Sc. from Strathclyde and Ph.D. from Bath
University. He has sustained research interests in mathematical modelling in
maintenance, intelligent management systems in operations, and supply chain
management. He has over 40 refereed publications and edited 9 volumes including
conference proceedings, special issues of international journals and ORS 46
Keynote papers. He chaired the European Conference on Intelligent Management
Systems in Operations in 1997, 2001 and 2005 and the IBC Middle East Con-
ference: Superstrategies for Maintenance in 1998. He was elected Vice President of
the Operational Research Society (UK) 2001–2003.
Prabhakar Murthy obtained B.E. and M.E. degrees from Jabalpur University and
the Indian Institute of Science in India and M.S. and Ph.D. degrees from Harvard
University. He is currently Research Professor in the Division of Mechanical
Engineering at the University of Queensland. He has held visiting appointments at
several universities in the USA, Europe and Asia. His research interests include
various aspects of new product development, operations management (lot sizing,
quality, reliability, maintenance), and post-sale support (warranties, service con-
tracts). He has authored or coauthored 20 book chapters, 150 journal papers and
140 conference papers. He is a coauthor of five books and co-editor of two books.
He is on the editorial boards of eight international journals.

Chapter 2

Liliane Pintelon holds degrees in Chemical Engineering (1983) and Industrial
Management (1984) of the KULeuven (Catholic University of Leuven, Belgium).
In 1988–1989 she worked as a visiting research associate at the W. Simon Gradu-
ate Business School (University of Rochester, USA). She obtained her doctoral
degree in industrial management (maintenance management) from the KULeuven
in 1990. Currently, she is professor at the Centre for Industrial Management
(KULeuven); she is also Board Member of BEMAS (Belgian Maintenance
Society) and of IFRIM (International Foundation for Research in Maintenance).
Her research and teaching area is industrial engineering and logistics, with a
special interest in maintenance. In this area lies the majority of her academic
publications. She also has considerable experience as an industrial consultant in
this area.
Alejandro Parodi-Herz received his M.Sc. degree in Mechanical Engineering at the
Simon Bolivar University, Venezuela (2002), the degree in Master of Industrial
Management (2003) at the Katholieke Universiteit Leuven and the degree of
Master in Operations and Technology Management (2004) at the Universiteit Gent.
Currently he works with the Centre of Industrial Management at the Katholieke
Universiteit Leuven as research associate to pursue his Ph.D. degree. His research
interest is mainly focused on maintenance, spare parts demand categorisation and
inventory control.

Chapter 3

Jay Lee is Ohio Eminent Scholar and L.W. Scott Alter Chair Professor in
Advanced Manufacturing at the University of Cincinnati and is founding director
of National Science Foundation (NSF) Industry/University Cooperative Research
Centre (I/UCRC) on Intelligent Maintenance Systems. His current research focuses
on autonomic computing and smart prognostics technologies for predictive maintenance
and self-maintenance systems, as well as closed-loop product life cycle
service model studies. He has authored/co-authored over 100 technical publi-
cations, edited 2 books, contributed numerous book chapters, 3 U.S. patents and 2
trademarks. He received his B.S. degree from Taiwan, a M.S. in Mechanical
Engineering from the University of Wisconsin-Madison, a M.S. in Industrial
Management from the State University of New York at Stony Brook, and D.Sc. in
Mechanical Engineering from the George Washington University. He is a Fellow
of ASME and SME.
Haixia Wang is a postdoctoral researcher in the NSF Industry/University Co-
operative Research Centre (I/UCRC) on Intelligent Maintenance Systems (IMS)
Center headquartered at the University of Cincinnati. Her current research interest
focuses on data streamlining for machinery prognostics and health management,
manufacturing process performance and quality improvement, and design for pro-
duct reliability and serviceability. Haixia Wang received her B.S. degree in
Mechanical Engineering from Shandong University at China, a Ph.D. in Mechani-
cal Engineering from Southeast University at China, a M.S. and a Ph.D. in Indus-
trial and Systems Engineering from the University of Wisconsin-Madison.

Chapter 4

Marvin Rausand is Professor of Reliability Engineering at the Norwegian University
of Science and Technology (NTNU). He worked for the research institute
SINTEF for ten years, mostly related to offshore oil and gas activities. The last
four years of this period he was Director of SINTEF Department of Safety and
Reliability. In 1989 he joined NTNU as a full time professor. He was head of
NTNU’s Department of Machine Design for five years and vice-dean of the
Faculty of Mechanical Engineering for six years. In 1985–1986 he was visiting
professor at Heriot-Watt University in Scotland, and in 2002–2003 he was visiting
professor at Ecole des Mines de Nantes. Professor Rausand is a member of the
Norwegian Academy of Technical Sciences, and of the Royal Norwegian Society
of Letters and Science.
Jørn Vatn is Professor of Maintenance Optimisation at the Norwegian University
of Science and Technology (NTNU). He worked for the research institute SINTEF
for 15 years, mostly related to transportation, critical infrastructure, and offshore
oil and gas activities. He has developed several computerized tools for decision
support in safety, reliability and maintainability. For the last five years he has been
involved in implementing a new maintenance strategy in the Norwegian National
Railway Administration.

Chapter 5

Wenbin Wang is Chair of Operational Research at the Centre for OR and Applied
Statistics, Salford Business School, University of Salford, UK. Prof. Wang
received his B.Sc. (Harbin, China) in Mechanical Engineering in 1981, M.Sc.
(Xian, China) in Operations Management in 1984 and Ph.D. in OR and Applied
Statistics from Salford University (UK) in 1992. He has over 20 years experience
in OR modelling in general and maintenance and reliability modelling in particular.
He received 3 EPSRC projects in the past and has authored and co-authored over
80 research papers. Professor Wang is a fellow of the Royal Statistical Society, the
Operational Research Society and the Institute of Mathematics and its Applications,
and a chartered
mathematician. He is also a member of the International Foundation for Research
in Maintenance. Professor Wang holds a guest professorship at Harbin Institute of
Technology, China.

Chapter 6

David Percy gained a B.Sc. degree with first class honours in mathematics from
Loughborough University in 1985 and a Ph.D. degree in statistics from Liverpool
University in 1990. He is a reader in mathematics at the University of Salford and
his research into Bayesian inference, stochastic processes and multivariate analysis
has produced 40 refereed publications and many conference presentations. He is
actively involved in collaborative research for industrial applications, particularly
concerning maintenance scheduling problems for complex systems. Dave is a
chartered scientist, chartered mathematician and member of the governing Council
for the Institute of Mathematics and its Applications.

Chapter 7

Elsayed Elsayed is Professor of the Department of Industrial Engineering, Rutgers
University. He is also the Director of the NSF/ Industry/ University Co-operative
Research Centre for Quality and Reliability Engineering, Rutgers-Arizona State
University. His research interests are in the areas of quality and reliability
engineering and Production Planning and Control. He is a co-author of Quality
Engineering in Production Systems, McGraw Hill Book Company, 1989. He is also
the author of Reliability Engineering, Addison-Wesley, 1996. These two books
received the 1990 and 1997 IIE Joint Publishers Book-of-the-Year Award respec-
tively. He is a co-recipient of the 2005 Golomski Award for the outstanding paper.

Chapter 8

David Percy: See Chapter 6

Chapter 9

Khairy Kobbacy: See Chapter 1

Chapter 10

Bo Lindqvist is Professor in Statistics at the Department of Mathematical Sciences,
Norwegian University of Science and Technology, Trondheim (associate professor
since 1979, professor since 1988). He obtained the degree of Dr.Philos. in statistics
at the University of Oslo in 1982. Lindqvist's main research interest is in stochastic
modeling and statistical analysis related to reliability and survival analysis.
Lindqvist is Editor of Scandinavian Journal of Statistics (2007–). He is elected
member of The Royal Norwegian Society of Sciences and Letters and International
Statistical Institute.

Chapter 11

Robin Nicolai is a Ph.D. student at Tinbergen Institute Rotterdam. He is also
affiliated with the Econometric Institute at Erasmus University Rotterdam. His
research interests are maintenance optimization, in particular degradation model-
ling, discrete-event systems and simulation optimization. One of his papers has
been accepted for publication in Reliability Engineering and System Safety. Other
papers have appeared in proceedings of different international conferences.
Rommert Dekker is a full-time professor in Operations Research and Quantitative
Logistics at Erasmus University Rotterdam. His research interests are maintenance
optimization, inventory control, service and reverse logistics. He has published
over 100 papers in scientific journals and he has been involved in the development
of several decision support systems for maintenance planning.

Chapter 12

Philip Scarf is a lecturer at the University of Salford. He obtained his Ph.D. in
1989 from the University of Manchester. Among his research interests are capital
replacement, reliability and maintenance modelling, and extreme value theory. He
has worked on capital replacement problems with the UK NHS, Mass Transit Rail
Corporation of Hong Kong, Express National Berhad Malaysia, and Malaysia
Truck and Bus Berhad. He currently serves as co-editor of the IMA Journal of
Management Mathematics.
Joseph Hartman is an Associate Professor of Industrial and Systems Engineering
at Lehigh University in Bethlehem, PA, USA. He also serves as Department Chair
and holds the Kledaras Endowed Chair. He received his Ph.D. in 1996 from the
Georgia Institute of Technology and currently serves as Editor of The Engineering
Economist, a journal devoted to the problems of capital investment. His research
and teaching interests are in economic decision analysis, including equipment
replacement analysis and transportation logistics.

Chapter 13

Gabriella Budai is a Ph.D. student at Tinbergen Institute Rotterdam. She is also
affiliated with the Econometric Institute at Erasmus University Rotterdam. Her
research topic is railway maintenance optimization, in particular scheduling
preventive railway maintenance activities and rescheduling of the rolling stock
during track possession. Her papers have been published in Journal of the Operational
Research Society (JORS) and in proceedings of different international
conferences.
Rommert Dekker: See Chapter 11
Robin Nicolai: See Chapter 11

Chapter 14

Wenbin Wang: See Chapter 5

Chapter 15

Prabhakar Murthy: See Chapter 1

Nat Jack is a Lecturer in Operational Research and Statistics at the University of
Abertay Dundee and has more than 30 publications in refereed journals, books, and
conference proceedings. The present focus of his research deals with product
warranty, in collaboration with Professor D.N.P. Murthy from the University of
Queensland, and this research has resulted in a series of papers examining optimal
maintenance strategies for items sold with one- and two-dimensional warranties.
His latest project involves a study of extended warranty decision-making using a
game theoretic approach.

Chapter 16

Prabhakar Murthy: See Chapter 1

Jarumon Pongpech received her B.E. (IE) from Chiang Mai University, Thailand in
1993. She was awarded a scholarship from the Faculty of Engineering, Chiang Mai
University to pursue her master’s degree and graduated with an M.S. (EM) from The
George Washington University, USA in 1996. For her doctoral degree she received a
grant from Thailand’s Commission on Higher Education in 2000 to study at the
Department of Industrial Engineering, Chulalongkorn University in Thailand and to
conduct her research at the Division of Mechanical Engineering, The University of
Queensland in Brisbane, Australia. She was a lecturer at Chiang Mai University
until 1999 before moving to Thammasat University. Her research interests
are in the areas of system maintenance policy, service contracts, engineering
management, and industrial engineering.

Chapter 17

Ashraf Labib is Chair of Operations and Decision Analysis at Strategy and
Business Systems Department, Portsmouth Business School, University of Portsmouth.
He holds a B.Sc. in Production Engineering, an M.B.A., an M.Sc. in integrated
manufacturing systems and a Ph.D. in maintenance systems. His research work
focuses on asset management, manufacturing maintenance systems, best practice
and decision-making. In particular, he is concerned with the analysis of data related
to machine failures and design and to the development of computerised maintenance
management systems (CMMSs). He is a Fellow of the Operational Research Society
(ORS), a Fellow of the IEE and a Chartered Engineer. He has published over 80
refereed papers in professional journals and international conference proceedings.
He is currently the Associate Editor of IEEE Transactions SMC (Systems, Man, and
Cybernetics).

Chapter 18

Terje Aven is Professor of Risk Analysis and Risk Management at University of
Stavanger, Norway. He is also a Principal researcher at International Research
Institute of Stavanger (IRIS). He has been Professor II (adjunct professor) in reliability
and safety at University of Trondheim (Norwegian Institute of Technology) 1990–
1995 and Professor II in reliability and risk analysis at University of Oslo 1990–2000.
He was the Dean of the Faculty of Technology and Science, Stavanger University
College, 1994–1996. Dr. Aven has many years of experience from the petroleum
industry (The Norwegian State Oil Company, Statoil). He is the author of several
reliability and risk related books and he is an associate editor/area editor/member of
the editorial board of several international journals. He received his master's degree
(cand.real) and Ph.D. in mathematical statistics (reliability) at the University of Oslo
in 1980 and 1984, respectively.

Chapter 19

Uday Kumar is a Professor of Operation and Maintenance Engineering at Luleå
University of Technology, Sweden. He obtained his B. Tech. from India and a
Ph.D. degree in the field of reliability and maintenance from Luleå University of
Technology, Luleå, Sweden in 1990. He worked for six years in Indian mining
industries prior to joining the postgraduate program. His research interests include
equipment maintenance, reliability and maintainability analysis, product support, life
cycle costing, risk analysis and system analysis. He is also a member of the editorial
boards of, and a reviewer for, many international journals. He has published more than
100 papers in international journals and conference proceedings.
Aditya Parida obtained his Ph.D. in the area of maintenance performance measurement
and has taught operation and maintenance engineering at Luleå University of
Technology, Sweden since 2002. Prior to this, he taught the same subject at
a couple of institutes in India and was joint-director of NIILM Centre for Management
Studies, New Delhi. He has a bachelor’s degree in mechanical engineering and
a post-graduation qualification in industrial engineering from IIT, Kharagpur, India
and has more than two decades of experience in the area of operation and maintenance
engineering from the Indian Army, amongst others. He is actively involved in research
in the area of maintenance performance measurement and other related
issues. He has published a number of papers in this subject area and was the
co-editor for the proceedings of the COMADEM 2006.

Chapter 20

John Boylan is Professor of Management Science at Buckinghamshire Chilterns
University College. He holds degrees from Oxford and Warwick Universities and
has published papers on short-term forecasting in a variety of academic and
practitioner-oriented journals. In addition to his academic work, Professor Boylan
advises commercial organisations on forecasting processes and software. He also
leads a large project, funded by the European Union and the Learning and Skills
Council, facilitating the education and training of managers in small and medium
enterprises. His current research interests relate to demand forecasting in the
supply chain, with a particular emphasis on intermittent demand.
Aris Syntetos is a reader working with the Centre for Operational Research and
Applied Statistics (CORAS) at the University of Salford, UK. He holds a B.A.
degree from the University of Athens, an M.Sc. degree from Stirling University
and in 2001 he completed a Ph.D. at Brunel University – Buckinghamshire Busi-
ness School. His research interests relate primarily to intermittent demand fore-
casting and the interface between forecasting and stock control. Aris’s work has
appeared in the International Journal of Forecasting, International Journal of
Production Economics and Journal of the Operational Research Society. He
currently holds three research grants: two from the Engineering and Physical
Sciences Research Council (EPSRC, UK) and one from the Department of Trade
and Industry (DTI, UK).

Chapter 21

Jørn Vatn: See Chapter 4

Chapter 22

Renyan Jiang is a Professor and Director of the Quality, Reliability and
Maintenance Laboratory at Changsha University of Science and Technology, China.
He obtained his undergraduate and graduate degrees from Wuhan University of
Technology, China, and his Ph.D. from University of Queensland, Australia. He
held visiting appointments at City University of Hong Kong, University of
Saskatchewan, The Hong Kong Polytechnic University, and University of Toronto.
His research interests are in various aspects of quality, reliability and maintenance.
He is the author or co-author of three reliability related books, including Weibull
Models, Wiley, 2003. He has published 28 papers in international journals and a
number of other papers.
Xinping Yan is a Professor and Director of Reliability Engineering Institute at
Wuhan University of Technology, China. He obtained his undergraduate and
graduate degrees from Wuhan University of Technology, China, and his Ph.D.
from Xi’an Jiaotong University, China. He is a member of ISO/TC108/SC5 Com-
mittee and a member of the Council Committee of Tribology Institute of Chinese
Mechanical Engineering Society (CMES). He is an editorial member of Journal of
COMADEM (U.K.) and Journal of Maritime Environment (U.K.). His research
interests include condition monitoring and fault diagnosis, tribology and its
industrial application, and intelligent transport systems.

Chapter 23

Uday Kumar: See Chapter 19

Ulla Espling is deputy director at Luleå Railway Research Centre (JVTC) and a
researcher within “Framework for Maintenance Strategies for Railway Infra-
structure” dealing with a regulated administration, outsourced maintenance, high
demands on safety and yearly funding. She has an M.Sc. degree in mechanical
engineering and a Licentiate in operation and maintenance engineering. Her
railway background goes back to 1984. Within the railway
she has worked in traffic operation and planning, as a track engineer and
design leader, and as the head of a track area, giving her a broad and rich

Chapter 24

Jayantha Liyanage is an Associate Professor of Asset Operations, Maintenance
Technology, and Asset Management at the University of Stavanger (UiS), Norway.
He is also the Chair and a project advisor of Center for Industrial Asset Manage-
ment (CIAM), and a member of the R&D group of the Center for Risk Manage-
ment and Societal Safety (SEROS), at UiS. In addition, Dr Liyanage also serves as
the Co-Organiser and Coordinator of the European Research Network for Strategic
Engineering Asset Management (EURENSEAM). He has been appointed to
the Board of Directors of the Society of Petroleum Engineers (SPE) Stavanger
section, where he also serves as the Chairman of the Scholarship
committee. Dr Liyanage is actively involved in numerous joint industry
projects in advisory and managerial capacities. He has received a number of awards
for his excellent academic and research performance. He serves on the
editorial boards of a number of international journals and on the steering
committees of many international conferences.

Chapter 25

Daniel Bongers received his B.E. (1999) and Ph.D. (2004) from the University of
Queensland, Australia. He is currently a research fellow for the Australian Co-
operative Research Centre Mining, and is responsible for managing two late-stage
technology development projects. His current research interests include physiologi-
cal signal processing, fault detection and isolation, physiological fatigue detection
and signal measurement.
Hal Gurgenci received his B.Sc. (1976) and M.Sc. (1979) from the Middle East
Technical University, Turkey, and Ph.D. (1982) from the University of Miami. He
is currently a professor with the School of Engineering, The University of Queens-
land in Brisbane. Previously, he was a Vice President of the Australian Coopera-
tive Research Centre on Mining responsible for research and education activities of
the Centre. He was the principal investigator of several large projects in mining
equipment design, automation, reliability and maintenance. His current research
interests include energy generation and conservation.

