
PROCESS FAULT DIAGNOSIS USING ONE-CLASS SUPPORT VECTOR MACHINES

G.T. Jemwa and C. Aldrich*

Department of Process Engineering, University of Stellenbosch, Stellenbosch, South Africa
Fax +27(21)808 2059, *E-mail: ca1@sun.ac.za

Abstract: The process fault diagnosis problem is usually considered in a classification framework. Although widely used in diagnostic applications, artificial neural networks and other nonlinear classifiers perform poorly under the nonideal conditions encountered in practice, owing to the arbitrary placement of decision boundaries in empty regions of the input space and the unbounded normal class region. This is particularly problematic where only few, noisy data are available. In this paper, the use of one-class support vector machines for the diagnosis of process operations is proposed and their performance under practical conditions is assessed. One-class classifiers are shown to be superior to, and more robust than, competing approaches previously proposed for diagnostic applications. Copyright © 2007 IFAC

Keywords: Fault diagnosis; Process control; Neural networks; Pattern recognition; Machine learning

1. INTRODUCTION

Support vector machines (SVMs) are state-of-the-art learning algorithms inspired by insights from statistical learning theory. Instead of optimizing a model by only minimizing the training error, SVMs incorporate the idea of large margin maximization, which gives probabilistic guarantees of how well the trained classifier will perform when presented with data not used in training. SVMs also provide a method for controlling the capacity or flexibility of the resulting model by using kernel functions. Although a recent innovation, they have already been applied with success in many areas, such as image analysis, signal processing, text mining, and bioinformatics, among others (Shawe-Taylor and Cristianini, 2004). Originally formulated in a pattern recognition context, the ideas underlying SVMs have since been extended to other learning problems, such as regression, density estimation and transductive learning. The resulting group of learning tools is commonly referred to as kernel methods or algorithms (Schölkopf and Smola, 2002).

In this paper, the problem of fault diagnosis of process operations is investigated. Compared to fault detection and identification, diagnosis is a more important yet difficult problem, since knowledge of all faults that can possibly occur on a plant is impossible to obtain. However, given sufficient historical data of a certain class of faults, e.g. sensor faults, as well as fault-free data, it is possible, at least in principle, to use pattern recognition methods such as multilayer perceptrons to learn to predict what type of fault may have occurred when a fault is detected. Unfortunately, it has been observed that the performance of multilayer perceptrons degrades when they are deployed under the nonideal conditions often encountered on industrial plants (Kramer and Leonard, 1990). This paper proposes the use of one-class SVM classifiers as an alternative. Because of the large margin bias, SVMs can be expected to yield improved performance compared to other methods.
The paper is structured as follows. The next section discusses the problem of process fault diagnosis from a classification perspective. A brief review of the use of artificial neural networks in fault diagnosis is presented. The one-class classification framework is then considered, including a basic formulation of the relevant algorithm. An analysis of the performance of one-class SVMs in diagnostic situations is then presented using a simulated example representative of many chemical reactor models.

2. THE FAULT DIAGNOSIS PROBLEM

To identify the root cause(s) of a detected fault condition in a process, knowledge of all possible faults associated with the process must be available (Stephanopoulos and Han, 1996). Clearly, this is an ill-posed decision logic problem, since it is impossible to have a priori information about everything that can go wrong in a process. Although in some cases certain fault conditions, particularly those associated with sensors, can be simulated by use of a process model, the design of a diagnostic system must ideally include the possibility of novel conditions that have not yet been realized.
Generally, given known fault conditions associated with a process, including fault-free or normal operating conditions, the task of fault diagnosis involves fitting a model relating inputs and outputs in a training data set. The outputs typically take the form of an indicator matrix, such that for a process with N known faults, the output vectors lie in an (N+1)-dimensional space, with each column representing a specific fault. The extra column in the output vectors is used to accommodate the normal operating condition, say f0. Each column is assigned a unit entry if the input vector corresponds to that fault; otherwise it is assigned a zero entry. For a problem with N fault classes plus the normal class, the decision function generator builds a model fj for each j = 0, 1, 2, …, N. When presented with a new instance x (typically, a symptom or residual vector from a preceding fault detection and identification phase), the vector is fed to each model, which then computes an output fj(x). These outputs are evaluated in the fault decision logic, that is

    f*(x) = G( f0(x), f1(x), …, fN(x) )        (1)

where G is the decision logic evaluation function and f* the most probable fault condition, as indicated in Fig. 1.

Fig. 1. A schematic representation of fault diagnosis as a multiclass pattern recognition problem: a reference input x(t) is fed to the nominal model and the N fault models, whose outputs enter the fault decision logic f* = G(f0(x), f1(x), …, fN(x)), followed by fault analysis and interpretation.
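In code, the decision function generator and the decision logic G amount to evaluating a bank of per-class models and aggregating their scores. The Python sketch below illustrates one possible arrangement, with a winner-takes-all rule standing in for G; the function and variable names are illustrative and not taken from the paper:

```python
import numpy as np

def fault_decision(models, x):
    """Evaluate a bank of per-class models f_0, ..., f_N on a sample x and
    return the index of the most probable condition (cf. Equation 1).
    models: list of callables f_j(x) -> score; argmax is one simple choice
    of the decision logic evaluation function G."""
    scores = np.array([f_j(x) for f_j in models])
    return int(np.argmax(scores))  # 0 = normal, 1..N = fault classes
```

For the three-class problem studied later, `fault_decision([f0, f1, f2], x_new)` would return 0, 1 or 2.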
Since fault diagnosis is essentially a pattern recognition problem, pattern recognition algorithms are typically used to solve it. These include distance-based classifiers, multilayer perceptrons (MLPs), radial basis function networks, support vector machines, and so on. Of these, multilayer perceptron (MLP) networks are among the most widely used models in diagnostic applications (Sorsa and Koivo, 1991; Venkatasubramanian and Chan, 1989).

3. LIMITATIONS OF DIAGNOSIS WITH NEURAL NETWORKS

In a critical analysis of fault diagnosis using MLPs, Kramer and Leonard (1990) investigated the performance of neural classifiers under the non-ideal conditions often encountered in practice. The objective of their study was to identify process fault conditions not easily detected by MLPs. It was observed that the tendency of MLPs to extrapolate when a new sample falls outside the training data range led to some serious problems. Five such situations were identified in which extrapolation from the training data was required, viz.:

(i) small training sets,
(ii) changes in the parent distributions of the classes that occur after training,
(iii) corrupted data,
(iv) appearance of novel fault classes, and
(v) training of the network with synthetic data.

Through a series of investigations, it was concluded that distance-based classifiers should be used instead of MLPs in fault diagnosis problems, because of their greater reliability when dealing with non-representative training data.

4. ONE-CLASS CLASSIFICATION

The goal of one-class classification is to estimate the regions of data space in which the mass of the data is concentrated, sometimes referred to as the support of the distribution. Estimation of the support of the distribution is a considerably more tractable problem than density estimation, particularly for small-sized samples. Mathematically, the problem can be formulated as finding a function f that is positive in a "small" region capturing most of the data points and negative elsewhere (Schölkopf et al., 2001; Rätsch et al., 2002). In this sense, the function estimation problem can be considered as a quantile (or minimum volume) estimation problem (Polonik, 1997).
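As a concrete illustration of support estimation, the sketch below fits the standard one-class SVM of Schölkopf et al. (2001) to fault-free samples only, here via the scikit-learn implementation (the choice of library is an assumption of this sketch, not something prescribed by the paper). The fitted function is positive (+1) inside the estimated support and negative (−1) elsewhere:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 0.1, size=(200, 2))   # fault-free (normal) data only

# nu upper-bounds the fraction of training points treated as outliers;
# for a Gaussian kernel of width sigma, sklearn's gamma = 1/(2*sigma**2).
clf = OneClassSVM(kernel="rbf", nu=0.1, gamma=0.5).fit(X_train)

print(clf.predict(np.array([[0.0, 0.05], [1.0, 1.0]])))  # [ 1 -1] expected
```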
4.1 One-class support vector machines

Inspired by support vector learning theory, a few algorithms have been proposed for solving one-class classification problems. These include the one-class SVM, which seeks a hyperplane that separates the unlabelled data set X from the origin with maximum margin (Schölkopf et al., 2001), and the SV data descriptor, which finds an optimal enclosing hypersphere with minimal volume (Tax and Duin, 1999).

The above algorithms are restricted as far as the kind of 'abnormality' that can be detected is concerned. Prior information about what the abnormal class looks like can be encoded in the one-class SVM. Instead of finding an f that is maximally separated from the origin, one considers a large margin hyperplane w that maximizes the separation of the set X = {x1, x2, …, xm} from the centroid of another set Z = {z1, z2, …, zt} (Schölkopf et al., 2000; Hayton et al., 2000).
More formally, the decision function is obtained by minimizing a weighted sum of a regularizer and a training error term that depends on an overall margin ρ and training errors ξi:

    min_{w, ξ, ρ}  (1/2)‖w‖² + (1/(νm)) Σᵢ₌₁ᵐ ξᵢ − ρ
    subject to  w ⋅ ( xᵢ − (1/t) Σⱼ₌₁ᵗ zⱼ ) ≥ ρ − ξᵢ,   ξᵢ ≥ 0        (2)

with ν a trade-off parameter. The corresponding decision function is given by

    f(x) = sgn( w ⋅ ( x − (1/t) Σⱼ₌₁ᵗ zⱼ ) − ρ )        (3)

which takes positive values for most patterns in X while the regularizer ‖w‖² remains small. This is equivalent to a large margin separation from the centroid of set Z. This formulation will be referred to as the generalized one-class SVM in the subsequent sections.

The solution to Equation 2 is typically obtained by first transforming it to an equivalent dual formulation in terms of Lagrange multipliers αi, that is

    min_α  (1/2) Σᵢ,ⱼ αᵢαⱼ ( xᵢ ⋅ xⱼ − Qᵢ − Qⱼ + Q )        (4)
    subject to  0 ≤ αᵢ ≤ 1/(νm),   Σᵢ αᵢ = 1

where Q ≡ (1/t²) Σₙ,ₚ ( zₙ ⋅ zₚ ) and Qᵢ ≡ (1/t) Σₙ ( xᵢ ⋅ zₙ ). Finally, the function value of a test point x is evaluated according to

    f(x) = sgn( Σᵢ αᵢ ( xᵢ ⋅ x ) − (1/t) Σₙ ( zₙ ⋅ x ) − ρ )        (5)

The threshold parameter ρ is computed by use of the Karush-Kuhn-Tucker (KKT) conditions from optimization theory. Thus, the algorithm outputs large values for points similar to the image of X and small values for generic points from Z.

It turns out that only relatively few of the training patterns, namely those with non-zero αi, define the decision function f. These are termed support vectors. Also, the trade-off parameter ν has a special meaning: it is a lower bound on the fraction of support vectors, an upper bound on the fraction of outliers, and asymptotically equals both the fraction of support vectors and the fraction of outliers (Schölkopf et al., 2000).

The algorithm as formulated is restricted to linear decision functions. Noting that the data appear only as dot products in both the objective and evaluation functions (Equations 4 and 5), the "kernel trick" can be used to improve the flexibility or capacity of the model (Schölkopf and Smola, 2002). A kernel k is a mathematical function that implicitly computes the dot product of images of input space data induced by some transformation ϕ, that is

    k(x, z) = ⟨ϕ(x), ϕ(z)⟩        (6)

where the angle brackets denote the dot product in the transformed or feature space. An example of a commonly used kernel function is the Gaussian radial basis function

    k(x, z) = exp( −‖x − z‖² / (2σ²) ) ≡ ⟨ϕ(x), ϕ(z)⟩        (7)

with the width σ a kernel hyperparameter that must be specified by the user. Cross-validation is typically used to determine the optimal kernel width.

The advantage of using kernels is that it is not necessary to know the exact form of the mapping ϕ, as long as the kernel satisfies the so-called Mercer conditions. By using kernels one avoids the computational difficulties associated with working in a possibly high-dimensional feature space while retaining flexibility. Hence, by replacing all occurrences of dot products in the preceding formulations with kernels, a statistically simple yet rich class of models becomes possible.
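To make the formulation concrete, the following is a minimal NumPy/SciPy sketch of the kernelized dual (Equations 4 and 5) under the Gaussian kernel of Equation 7. A general-purpose SLSQP solver stands in for the specialized QP solvers normally used for SVMs, and all names are illustrative; setting the Q terms to zero recovers the standard one-class SVM that separates X from the origin:

```python
import numpy as np
from scipy.optimize import minimize

def rbf(A, B, sigma=1.0):
    """Gaussian kernel matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2)), Eq. 7."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_generalized_ocsvm(X, Z, nu=0.2, sigma=1.0):
    """Generalized one-class SVM: separate X from the centroid of Z (Eqs. 4-5)."""
    m, t = len(X), len(Z)
    Kxx, Kxz, Kzz = rbf(X, X, sigma), rbf(X, Z, sigma), rbf(Z, Z, sigma)
    Q = Kzz.sum() / t ** 2            # Q   = (1/t^2) sum_{n,p} k(z_n, z_p)
    Qi = Kxz.sum(axis=1) / t          # Q_i = (1/t)   sum_n     k(x_i, z_n)
    H = Kxx - Qi[:, None] - Qi[None, :] + Q   # Gram matrix centred on Z

    res = minimize(lambda a: 0.5 * a @ H @ a, np.full(m, 1.0 / m),
                   jac=lambda a: H @ a,
                   bounds=[(0.0, 1.0 / (nu * m))] * m,
                   constraints={"type": "eq", "fun": lambda a: a.sum() - 1.0},
                   method="SLSQP")
    alpha = res.x

    # rho from margin support vectors (0 < alpha_i < 1/(nu m), KKT conditions)
    sv = np.where((alpha > 1e-6) & (alpha < 1.0 / (nu * m) - 1e-6))[0]
    if sv.size == 0:
        sv = np.array([int(np.argmax(alpha))])
    rho = (alpha @ Kxx - Qi)[sv].mean()

    def decide(Xtest):
        # Eq. 5: sgn( sum_i a_i k(x_i, x) - (1/t) sum_n k(z_n, x) - rho )
        s = alpha @ rbf(X, Xtest, sigma) - rbf(Z, Xtest, sigma).mean(axis=0)
        return np.sign(s - rho)
    return decide
```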
4.2 Diagnosis Using One-class SVMs: An Empirical Analysis

To investigate the use of one-class SVMs for diagnosis, we consider the following generalized fault diagnosis problem previously considered in Kramer and Leonard (1990):

    x = x₀ + αp + v        (8)

where x is a two-dimensional vector of observed variables sampled from a static process with a nominal steady state x₀ = (0, 0), p is the fault parameter vector of size np, α = [1 1; 1 −1] is a linear operator incorporating the directional effect of the fault parameters on the measurements, and v ∼ N(0, 0.015) is a vector of Gaussian distributed measurement disturbances. Fault conditions are classified into classes Ck by defining inequalities on the failure parameters according to

    Normal (C0):  |p1| < 0.05,  |p2| < 0.05
    Fault 1 (C1):  |p1| > 0.05
    Fault 2 (C2):  |p2| > 0.05

Fault 1 causes both process variables x1 and x2 to deviate in the same direction, while Fault 2 results in deviations in opposite directions. The normal class (C0) occupies the intersection region, as indicated in Fig. 2.
Fig. 2. Scatter plot of the two-dimensional fault diagnosis problem, showing the Normal, Fault 1 and Fault 2 classes.
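A sketch of the simulation is given below. The sampling range for the fault parameters p, the reading of N(0, 0.015) as a variance, and the tie-breaking when both fault inequalities hold simultaneously are assumptions made for illustration:

```python
import numpy as np

def simulate(n, rng, p_max=0.15):
    """Draw n samples from the benchmark of Equation 8, x = x0 + alpha*p + v,
    with x0 = (0, 0), and label them using the class inequalities."""
    A = np.array([[1.0, 1.0], [1.0, -1.0]])           # directional operator alpha
    p = rng.uniform(-p_max, p_max, size=(n, 2))       # fault parameters (range assumed)
    v = rng.normal(0.0, np.sqrt(0.015), size=(n, 2))  # N(0, 0.015) read as variance
    x = p @ A.T + v
    y = np.zeros(n, dtype=int)                        # 0 = Normal (C0)
    y[np.abs(p[:, 0]) > 0.05] = 1                     # Fault 1 (C1)
    y[np.abs(p[:, 1]) > 0.05] = 2                     # Fault 2 (C2); wins ties (assumed)
    return x, y
```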
A generalized one-class SVM model was constructed, using Gaussian kernels with the following widths: σ = [0.25 0.5 1 2 4 8]. For comparative purposes, corresponding models for the standard one-class SVM (no abnormal class information) and a standard binary SVM classifier were also fitted using the same kernel and kernel parameters. Fig. 3 shows the resulting decision functions for the three classifiers for a kernel width σ = 4. The generalized one-class SVM identified the different process operating regions satisfactorily, as shown in Fig. 3(a). In contrast, the other two classifiers could not separate the different regions very well.

Fig. 3. Decision regions generated using (a) the generalized one-class SVM, (b) the standard one-class SVM and (c) a binary SVM classifier. The model representing each region is indicated by R0, R1, and R2 for the Normal, Fault 1, and Fault 2 conditions respectively.
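A hedged sketch of this comparison is given below, reusing the simulate and fit_generalized_ocsvm functions from the earlier sketches; the nu and C values are illustrative, and sklearn's gamma parameterization is converted from the kernel width σ:

```python
import numpy as np
from sklearn.svm import OneClassSVM, SVC

rng = np.random.default_rng(1)
X, y = simulate(500, rng)                 # generator sketched in Section 4.2
sigma = 4.0
gamma = 1.0 / (2.0 * sigma ** 2)          # sklearn gamma for kernel width sigma

# (a) generalized one-class SVM: normal set X vs abnormal set Z (Section 4.1 sketch)
gen = fit_generalized_ocsvm(X[y == 0], X[y != 0], nu=0.2, sigma=sigma)

# (b) standard one-class SVM: normal data only, no abnormal-class information
std = OneClassSVM(kernel="rbf", nu=0.2, gamma=gamma).fit(X[y == 0])

# (c) binary SVM: normal vs all faults pooled into a single abnormal class
svc = SVC(kernel="rbf", C=10.0, gamma=gamma).fit(X, (y != 0).astype(int))
```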
Fig. 4(a) summarizes the distribution of errors for 100 independent test sets, each with a sample size of 1000. The scaling of the figure corresponds with that of Fig. 2. As indicated, the classification error did not vary too widely over the range of kernel widths considered (0.25 to 8). Fig. 4(b) shows a scatter plot of a sample of misclassified data points from an independent test set observed in the simulations. The misclassified data are distributed in the overlapping regions between the three classes and are generally unavoidable. Very few (in some cases none) extrapolation errors were observed. This is a significant advantage over MLPs which, because of the arbitrary placement of the decision boundary in empty regions as well as the influence of extreme points, give models with extrapolation errors.

Hence, it can be concluded that the generalized one-class SVM is a better alternative to multilayer perceptrons, particularly for fault conditions characterized by the small sets of noisy samples commonly found in practice.

Fig. 4. (a) Variation of misclassification error with kernel width. (b) Typical distribution of misclassified patterns. The errors are associated with class overlap and are generally unavoidable.
4.3 Robustness Analysis

To investigate the effect of small changes in class distributions after the classifier has been trained, small and large shifts for different types of possible fault modes were simulated, as summarized in Table 1. These changes are similar to those investigated by Kramer and Leonard (1990) and were chosen to allow for a comparison with the 1-nearest neighbour classifier that had the best performance in their study.

Table 1. Simulated changes for robustness analysis of diagnosis with one-class SVMs

    Error Type       | Small Shift             | Large Shift
    Sensor Bias      | ±0.025                  | ±0.05
    Sensor Noise     | 2×N(0, 0.015)           | 3×N(0, 0.015)
    Fault Direction  | ±7.5°                   | ±15°
    Process Drift    | x0 = (±0.025, ±0.025)   | x0 = (±0.05, ±0.05)
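The Table 1 shifts can be simulated by perturbing the test data before classification. The sketch below is one way to do so, again treating 0.015 as the noise variance; the exact mechanics of the shifts in the original study are not fully specified, so this is an assumed reading:

```python
import numpy as np

def perturb(x, rng, bias=0.0, noise_factor=1.0, angle_deg=0.0, drift=(0.0, 0.0)):
    """Apply the Table 1 test-condition changes to benchmark data x."""
    th = np.deg2rad(angle_deg)                        # rotate the fault directions
    R = np.array([[np.cos(th), -np.sin(th)],
                  [np.sin(th),  np.cos(th)]])
    x = x @ R.T
    x = x + bias + np.asarray(drift)                  # sensor bias + process drift
    # inflate the total noise std by noise_factor via added independent noise
    extra_std = np.sqrt(max(noise_factor ** 2 - 1.0, 0.0) * 0.015)
    return x + rng.normal(0.0, extra_std, size=x.shape)

# e.g. the large-shift sensor-noise case, 3 x N(0, 0.015):
# x_shifted = perturb(x_test, rng, noise_factor=3.0)
```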
5. DISCUSSION

An additive change in the sensor bias resulted in moderate increases in the classification error, with small and large shifts increasing the error from a nominal 4% to 6% and 8% respectively. Doubling and tripling the sensor noise increased the error rate to 7% and 12% respectively. Small quantitative shifts in the fault directions can arise from process changes. To simulate these changes, small rotational changes in the fault directions were applied to the base case scenario (fault directions at ±45°). A small rotation increased the nominal error rate by 1%, while a large rotation resulted in a 5% change. Finally, changes to the nominal operating point incurred errors of 7% and 9% respectively for the two shifts.

With the exception of the rotational changes, the one-class SVM gave marginal to significant improvements in performance compared to the one-nearest neighbour distance-based classifier reported in Kramer and Leonard (1990). Fig. 5 summarizes the performance of the one-class SVM against the competitive distance-based one-nearest neighbour (1-NN) classifier. The one-class SVM results are for a kernel of width 4. Similar results were obtained for kernels with other widths for the same system.

Fig. 5. Results from the robustness analysis of the performance of the one-class SVM under changes in the underlying fault class distribution (sensor bias, sensor noise, fault direction and process drift, each at nominal, small-shift and large-shift levels). Also shown for comparison are results for a distance-based classifier (1-NN) as investigated in Kramer and Leonard (1990).
6. APPLICATION EXAMPLE: PGM FLOTATION PLANT

As an illustration, the proposed methodology is applied to textural information on froth structures extracted from image data obtained from a platinum group metals (PGM) flotation plant in South Africa (Aldrich et al., 2004). The original data are composed of the following features characterizing the froth: (a) small number emphasis (SNE), (b) an image local variations metric or entropy (ENTROPY), (c) a local homogeneity metric (LOCHOM), and (d) gray level dependencies (GLD). For visualization purposes, two-dimensional projections of the data (principal components) explaining 94% of the total variance are considered.

The data were grouped into three classes related to different operating conditions, with one group representing the desirable conditions (see Figures 6 and 7). This normal class is associated with ellipsoidal froth structures containing relatively more gangue minerals, whereas the other classes, with less gangue, had spherical bubbles as well as a darker colour. However, the stability of one of the abnormal classes was poor compared to that of the normal class. The fault diagnosis objective in this case would be to identify the specific abnormal condition when a fault is detected by, for example, statistical process monitoring techniques.
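The corresponding computation is a PCA projection followed by a one-class boundary around the normal froth class. The sketch below uses random stand-in data, since the plant data are not public; the kernel settings mirror Fig. 7 (unit-width Gaussian, ν = 0.2), with sklearn's gamma = 1/(2σ²) = 0.5 for σ = 1:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(2)
froth = rng.normal(size=(300, 4))       # stand-in for (SNE, ENTROPY, LOCHOM, GLD)
is_normal = np.arange(300) < 150        # stand-in class labels

scores = PCA(n_components=2).fit_transform(froth)   # 2-D principal component scores
c0_boundary = OneClassSVM(kernel="rbf", nu=0.2, gamma=0.5).fit(scores[is_normal])
novel = c0_boundary.predict(scores) == -1           # flag points outside C0
```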
The decision functions separating the normal class from the abnormal conditions, obtained using multilayer perceptron networks and generalized one-class SVMs, are shown in Figures 6 and 7 respectively. (Although not shown, a 1-NN classifier gave a more or less similar decision boundary to that of the MLP networks.) It can be seen that the presence of empty regions and the uneven distribution of the data in the measurement space lead to spurious placement of the decision function in the case of the MLP-based approach. Also, the absence of data from the negative classes results in a decision boundary extending outside the training data regions, as indicated in the top left of Figure 6.

Fig. 6. Fault diagnosis decision boundaries obtained by using a single hidden layer MLP with four sigmoidal transfer functions. Regions associated with normal and abnormal operating conditions are labelled C0 and C1,2 respectively.
Fig. 7. Fault diagnosis decision boundaries obtained by using a generalized one-class SVM with a Gaussian kernel of unit width and a nu-parameter value of 0.2. C0 and C1,2 denote normal and abnormal operating conditions respectively.

Using one-class SVMs, improved decision boundaries around a specific class are obtained, as shown for the normal class in Figure 7. An important property of the generalized one-class SVM is the exclusion of the spurious decision regions that result from a lack of sufficient data or from empty regions, as is the case for MLPs and k-NN classifiers. Furthermore, tighter boundaries are also useful in detecting novel operating regimes not previously encountered, limiting the risk of improper diagnostic decisions.
7. CONCLUSIONS

A framework for using one-class SVMs in fault diagnosis was discussed and critically analysed using a simple two-dimensional system. In spite of its simplicity, the system is representative of a number of industrial reactor systems. Compared to previously proposed nonlinear methods of fault diagnosis using artificial neural networks, the one-class SVM approach was shown to be robust to process changes, insensitive to the influence of data lying in extreme regions, and did not exhibit serious extrapolation errors. Moreover, the one-class approach gave better performance than a one-nearest neighbour classifier, previously proposed as a preferable alternative to artificial neural networks. Not only does the use of kernels allow for a larger class of functions (including distance-based classifiers), but it also induces appropriate regularization for the learning algorithm. The study will be extended to larger systems in a separate paper.

REFERENCES

Aldrich, C., N.J. Le Roux and S. Gardner (2004). Monitoring of metallurgical process plants by use of biplots. AIChE Journal, 50, 2167-2186.

Hayton, P., B. Schölkopf, L. Tarassenko and P. Anuzis (2000). Support vector novelty detection applied to jet engine vibration spectra. In: Advances in Neural Information Processing Systems, 13, 946-952.

Kramer, M.A. and J.A. Leonard (1990). Diagnosis using backpropagation neural networks – Analysis and criticism. Computers and Chemical Engineering, 14, 1323-1338.

Polonik, W. (1997). Minimum volume sets and generalized quantile processes. Stochastic Processes and their Applications, 69, 1-24.

Rätsch, G., S. Mika, B. Schölkopf and K.-R. Müller (2002). Constructing boosting algorithms from SVMs: An application to one-class classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 1184-1199.

Schölkopf, B., J.C. Platt and A.J. Smola (2000). Kernel method for percentile feature extraction. Technical Report MSR-TR-2000-22, Microsoft Research.

Schölkopf, B., J.C. Platt, J. Shawe-Taylor, A.J. Smola and R.C. Williamson (2001). Estimating the support of a high-dimensional distribution. Neural Computation, 13, 1443-1471.

Schölkopf, B. and A.J. Smola (2002). Learning with Kernels. MIT Press, Cambridge, MA.

Shawe-Taylor, J. and N. Cristianini (2004). Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK.

Sorsa, T. and H.N. Koivo (1991). Neural networks in process fault diagnosis. IEEE Transactions on Systems, Man, and Cybernetics, 21, 815-825.

Stephanopoulos, G. and C. Han (1996). Intelligent systems in process engineering: a review. Computers and Chemical Engineering, 20, 743-791.

Tax, D.M.J. and R.P.W. Duin (1999). Support vector domain description. Pattern Recognition Letters, 20, 1191-1199.

Venkatasubramanian, V. and K. Chan (1989). A neural network methodology for process fault diagnosis. AIChE Journal, 35, 1993-2002.
