PROCESS FAULT DIAGNOSIS USING ONE CLASS SUPPORT VECTOR MACHINES - 2007 - IFAC Proceedings Volumes
classification problems. These include the one-class SVM, which seeks a hyperplane that separates the unlabelled data set X from the origin with maximum margin (Schölkopf et al., 2001), and the SV data descriptor, which finds an optimal enclosing hypersphere with minimal volume (Tax and Duin, 1999).

The above algorithms are restricted as far as the kind of 'abnormality' that can be detected is concerned. Prior information on what the abnormal class looks like can be encoded in the one-class SVM. Instead of finding a decision function f that is maximally separated from the origin, one considers a large-margin hyperplane w that maximizes the separation of the set X = {x_1, x_2, ..., x_m} from the centroid of another set Z = {z_1, z_2, ..., z_t} (Schölkopf et al., 2000; Hayton et al., 2000). More formally, the decision function is obtained by minimizing a weighted sum of a regularizer and a training error term that depends on an overall margin ρ and training errors ξ_i:

    \min_{w,\xi,\rho} \; \frac{1}{2}\|w\|^2 + \frac{1}{\nu m}\sum_{i=1}^{m}\xi_i - \rho
    subject to \; w \cdot \Big( x_i - \frac{1}{t}\sum_{j=1}^{t} z_j \Big) \ge \rho - \xi_i, \quad \xi_i \ge 0,    (2)

with trade-off parameter ν. The corresponding decision function is given by

    f(x) = \operatorname{sgn}\Big( w \cdot \Big( x - \frac{1}{t}\sum_{j=1}^{t} z_j \Big) - \rho \Big)    (3)

which takes positive values for most patterns in X while the regularizer ||w||^2 is still small. This is equivalent to a large-margin separation from the centroid of the set Z. This formulation will be referred to as the generalized one-class SVM in the subsequent sections.

The solution to Equation 2 is typically obtained by first transforming it to an equivalent dual formulation in terms of Lagrange multipliers α_i, that is

    \min_{\alpha} \; \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j \big( x_i \cdot x_j + Q - Q_i - Q_j \big)
    subject to \; 0 \le \alpha_i \le \frac{1}{\nu m}, \quad \sum_{i}\alpha_i = 1,    (4)

where, expanding the centroid term of Equation 2, Q_i = \frac{1}{t}\sum_{j=1}^{t} x_i \cdot z_j and Q = \frac{1}{t^2}\sum_{j,k=1}^{t} z_j \cdot z_k. The threshold parameter ρ is computed by use of the Karush-Kuhn-Tucker (KKT) conditions from optimization theory. Thus, the algorithm outputs large values for points similar to the image of X and small values for generic points from Z.

It turns out that only relatively few of the training patterns, namely those with non-zero α_i, define the decision function f. These are termed support vectors. Also, the trade-off parameter ν has a special meaning: νm is a lower bound on the number of support vectors and an upper bound on the number of outliers, and ν asymptotically equals both the fraction of support vectors and the fraction of outliers (Schölkopf et al., 2000).

The algorithm as formulated is restricted to linear decision functions. Noting that the data appear only as dot products in both the objective and evaluation functions (Equations 4 and 5), the "kernel trick" can be used to improve the flexibility or capacity of the model (Schölkopf and Smola, 2002). A kernel k is a mathematical function that implicitly computes the dot product of the images of input-space data under some transformation ϕ, that is

    k(x, z) = \langle \phi(x), \phi(z) \rangle    (6)

where the angle brackets denote the dot product in the transformed or feature space. An example of a commonly used kernel function is the Gaussian radial basis function

    k(x, z) = \exp\big( -\|x - z\|^2 / (2\sigma^2) \big) \equiv \langle \phi(x), \phi(z) \rangle    (7)

with the width σ a kernel hyperparameter that must be specified by the user. Cross-validation is typically used to determine the optimal kernel width.

The advantage of using kernels is that it is not necessary to know the exact form of the mapping ϕ, as long as the kernel satisfies the so-called Mercer conditions. By using kernels one avoids the computational difficulties associated with working in a possibly high-dimensional feature space while retaining flexibility. Hence, by replacing all occurrences of dot products in the preceding formulations with kernels, a statistically simple yet rich class of models becomes possible.
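To make the Mercer conditions and the Gaussian kernel of Equations 6 and 7 concrete, the following minimal numpy sketch (not from the paper) evaluates the Gaussian radial basis kernel on a small random data set and checks the finite-sample form of the Mercer conditions: the kernel (Gram) matrix must be symmetric and positive semi-definite.

```python
import numpy as np

def gaussian_kernel(x, z, sigma):
    """Gaussian RBF kernel k(x, z) = exp(-||x - z||^2 / (2 sigma^2)), Eq. (7)."""
    diff = x - z
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))   # 30 random 2-D points (illustrative data)
sigma = 1.0                    # kernel width, a user-chosen hyperparameter

# Kernel (Gram) matrix K[i, j] = k(x_i, x_j)
K = np.array([[gaussian_kernel(xi, xj, sigma) for xj in X] for xi in X])

# Mercer conditions in finite-sample form: K is symmetric and PSD
# (smallest eigenvalue non-negative up to numerical tolerance).
assert np.allclose(K, K.T)
assert np.linalg.eigvalsh(K).min() > -1e-10

# For the Gaussian kernel, k(x, x) = 1: every point has unit self-similarity.
assert np.allclose(np.diag(K), 1.0)
```

Because only k(x, z) is ever evaluated, the (here infinite-dimensional) feature map ϕ never has to be written down explicitly; this is the kernel trick the text refers to.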
Fig. 2. Scatter plot of the two-dimensional fault diagnosis problem (regions labelled Normal, Fault 1 and Fault 2 in the x1-x2 plane, both axes spanning -1 to 1).
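The two-dimensional diagnosis problem of Fig. 2 can be emulated with synthetic data. The cluster locations and spreads below are invented for illustration (the paper does not specify the generating distributions); they only reproduce the overall layout of one normal and two fault regions inside the unit box.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical stand-in for the Fig. 2 data: three Gaussian clusters in
# [-1, 1]^2 representing the Normal, Fault 1 and Fault 2 operating regions.
centres = {"Normal": (-0.4, -0.4), "Fault 1": (0.4, 0.4), "Fault 2": (0.4, -0.4)}
data = {name: rng.normal(loc=c, scale=0.15, size=(100, 2))
        for name, c in centres.items()}

X = np.vstack(list(data.values()))   # pooled observations, shape (300, 2)
y = np.repeat(np.arange(3), 100)     # class labels 0 (Normal), 1, 2

assert X.shape == (300, 2) and y.shape == (300,)
```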
A generalized one-class SVM model was constructed using Gaussian kernels with the following widths: σ = [0.25 0.5 1 2 4 8]. For comparative purposes, corresponding models for the standard one-class SVM (no abnormal class information) and a standard binary SVM classifier were also fit using the same kernel and kernel parameters. Fig. 3 shows the resulting decision functions for the three classifiers for a kernel width of 4. The generalized one-class SVM identified the different process operating regions satisfactorily, as shown in Fig. 3(a). In contrast, the other two classifiers could not separate the different regions very well.

Fig. 3. Decision regions generated using (a) the generalized one-class SVM, (b) the standard one-class SVM and (c) a binary SVM classifier. The fault model representing each region is indicated by R0, R1 and R2 for the Normal, Fault 1 and Fault 2 conditions respectively.
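A sweep over the same kernel widths can be sketched with the standard one-class SVM (the comparison baseline that uses no abnormal-class information), for which scikit-learn's OneClassSVM is a readily available implementation; the generalized variant of Equation 2 is not available off the shelf, and the training cluster below is invented for illustration. Note that scikit-learn parameterizes the Gaussian kernel as exp(-γ‖x-z‖²), so γ = 1/(2σ²).

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
X_normal = rng.normal(loc=0.0, scale=0.2, size=(200, 2))  # stand-in "Normal" data

# Sweep the kernel widths used in the text: sigma in [0.25 ... 8].
train_error = {}
for sigma in [0.25, 0.5, 1, 2, 4, 8]:
    model = OneClassSVM(kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2), nu=0.05)
    preds = model.fit(X_normal).predict(X_normal)  # +1 inside boundary, -1 outside
    train_error[sigma] = np.mean(preds == -1)      # fraction flagged as outliers

# The nu-property keeps the training outlier fraction near nu at every width.
assert all(0.0 <= e <= 0.2 for e in train_error.values())
```

Model selection over σ would then proceed by cross-validation, as the text describes for the kernel width.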
Fig. 4(a) summarizes the distribution of errors for 100 independent test sets, each with a sample size of 1000. The scaling of the figure corresponds with that of Fig. 2. As indicated, the classification error did not vary too widely over the range of kernel widths considered (0.25 to 8). Fig. 4(b) shows a scatter plot of a sample of misclassified data points from an independent test set observed in the simulations. The misclassified data are distributed in the overlapping regions between the three classes and are generally unavoidable.

Fig. 4. (a) Variation of misclassification error with kernel width (classification error against kernel width σ = 0.25 to 8). (b) Typical distribution of misclassified patterns in the x1-x2 plane. The errors are associated with class overlap and generally are unavoidable.

4.3 Robustness Analysis

To investigate the effect of small changes in class distributions after the classifier has been trained, small and large shifts for different types of possible fault modes were simulated as summarized in Table 1. These changes are similar to those investigated by Kramer and Leonard (1990) and were chosen to allow for a comparison with the 1-nearest neighbour classifier that had the best performance in their study.

Fig. 5. Results from robustness analysis of the performance of the one-class SVM under changes in the underlying fault class distribution. Also shown for comparison are results for a distance-based classifier (1-NN) as investigated in Kramer and Leonard (1990). (Panels: One-class SVM and 1-NN Classifier; vertical axis: Mean Error; categories: BIAS, NOISE, DIREC, DRIFT; legend: Nominal, Small Shift, Large Shift.)

Table 1. Simulated changes for robustness analysis of diagnosis with one-class SVMs
Error Type         Small Shift            Large Shift
Sensor Bias        ±0.025                 ±0.05
Sensor Noise       2×N(0,0.015)           3×N(0,0.015)
Fault Direction    ±7.5°                  ±15°
Process Drift      x0=(±0.025,±0.025)     x0=(±0.05,±0.05)

5. DISCUSSION

An additive change in the sensor bias resulted in moderate increases in the classification error, with small and large shifts increasing the error from a nominal 4% to 6% and to 8% respectively. Doubling and tripling the sensor noise uncertainty increased the error rate to 7% and 12% respectively. Small quantitative shifts in the fault directions can arise from process changes. To simulate these changes, small rotational changes in the fault directions were induced to the base case scenario (angle = ±45°). A small rotation increased the nominal error rate by 1%, while a large rotation resulted in a 5% change. Finally, changes to the nominal operating point incurred errors of 7% and 9% respectively for the different shifts.

With the exception of the rotational changes, the one-class SVM gave marginal to significant improvements in performance compared to the one-nearest neighbour distance-based classifier reported in Kramer and Leonard (1990). Fig. 5 summarizes the performance of the one-class SVM against the competitive distance-based one-nearest neighbour (1-NN) classifier. The one-class SVM results are for a kernel of width 4. Similar results were obtained for kernels with other widths for the same system.

6. APPLICATION EXAMPLE: PGM FLOTATION PLANT

As an illustration, the proposed methodology is applied to textural information of froth structures from image data obtained from a platinum group metals (PGM) flotation plant in South Africa (Aldrich et al., 2004). The original data are composed of the following features characterizing the froth: (a) small number emphasis (SNE), (b) entropy (ENTROPY), (c) an image local variations metric, (d) a local homogeneity metric (LOCHOM) and (e) gray level dependencies (GLD). For visualization purposes, two-dimensional projections of the data (principal components) that explain 94% of the total variance are considered.

The data were grouped into three classes related to different operating conditions, with one group representing the desirable (normal) conditions (see Figures 6 and 7). One class is associated with ellipsoidal froth structures containing relatively more gangue minerals, whereas the other classes, with less gangue, had spherical bubbles as well as a darker colour. However, the stability of the '+' class was poor compared to the normal class. The fault diagnosis objective in this case would be to identify the specific abnormal condition when a fault is detected by, for example, statistical process monitoring techniques.

The decision functions separating the normal class from the abnormal conditions, obtained using multilayer perceptron (MLP) networks and generalized one-class SVMs, are shown in Figures 6 and 7 respectively. (Although not shown, a 1-NN classifier gave a more or less similar decision boundary to the MLP networks.) It can be seen that the presence of empty regions and the uneven distribution of the data in the measurement space lead to spurious placement of the decision function in the case of the MLP-based approach. Also, the absence of data from the negative classes results in a decision boundary extending outside the training data regions, as indicated in the top left of Figure 6.
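The visualization step described above, projecting the five froth features onto the two leading principal components, can be sketched as follows. The data here are random stand-ins (the plant measurements are not reproduced in the paper), built with two latent factors so that a two-component projection captures most of the variance, as the reported 94% suggests for the real data.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical stand-in for the five froth features (SNE, ENTROPY, local
# variations, LOCHOM, GLD): driven by two latent factors plus small noise.
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 5))

# PCA via SVD of the mean-centred data matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T                      # coordinates on the first two PCs
explained = (s[:2] ** 2).sum() / (s ** 2).sum()

# With two dominant latent factors, the first two PCs capture most variance.
assert scores.shape == (300, 2)
assert explained > 0.9
```

The `scores` array is what a plot like Figures 6 and 7 would display, with classifier decision boundaries drawn in the same two-dimensional score space.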
Fig. 6. Fault diagnosis decision boundaries obtained by using a single hidden layer MLP with four sigmoidal transfer functions. Regions associated with normal and abnormal operating conditions are labelled C0 and C1,2 respectively (axes X1 and X2, both spanning -5 to 5).

Fig. 7. Fault diagnosis decision boundaries obtained by using a generalized one-class SVM with a Gaussian kernel of unit width and a nu-parameter value of 0.2. C0 and C1,2 denote normal and abnormal operating conditions respectively (axes X1 and X2, both spanning -5 to 5).

Using one-class SVMs, improved decision boundaries around a specific class are obtained, as shown for the normal class in Figure 7. An important property of the generalized one-class SVM is the exclusion of spurious decision regions resulting from a lack of sufficient data or empty regions, as is the case for MLPs and k-NN classifiers. Furthermore, tighter boundaries are also useful in detecting novel operating regimes not previously encountered, limiting the risk of improper diagnostic decisions.

7. CONCLUSIONS

A framework for using one-class SVMs in fault diagnosis was discussed and critically analysed using a simple two-dimensional system. In spite of its simplicity, the system is representative of a number of industrial reactor systems. Compared to previously proposed nonlinear methods of fault diagnosis using artificial neural networks, the one-class SVM approach was shown to be robust to process changes, insensitive to the influence of data lying in extreme regions, and free of serious extrapolation errors. Moreover, the one-class approach gave better performance than a one-nearest neighbour classifier, previously proposed as a preferable alternative to artificial neural networks. Not only does the use of kernels allow for a larger class of functions (including distance-based classifiers), but it also induces appropriate regularization for the learning algorithm.

The study will be extended to larger systems in a separate paper.

REFERENCES

Aldrich, C., N.J. Le Roux and S. Gardner (2004). Monitoring of metallurgical process plants by use of biplots. AIChE Journal, 50, 2167-2186.
Hayton, P., B. Schölkopf, L. Tarassenko and P. Anuzis (2000). Support vector novelty detection applied to jet engine vibration spectra. In: Advances in Neural Information Processing Systems, 13, 946-952.
Kramer, M.A. and J.A. Leonard (1990). Diagnosis using backpropagation neural networks - Analysis and criticism. Computers and Chemical Engineering, 14, 1323-1338.
Polonik, W. (1997). Minimum volume sets and generalized quantile processes. Stochastic Processes and their Applications, 69, 1-24.
Rätsch, G., S. Mika, B. Schölkopf and K.-R. Müller (2002). Constructing boosting algorithms from SVMs: An application to one-class classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 1184-1199.
Schölkopf, B., J.C. Platt and A.J. Smola (2000). Kernel method for percentile feature extraction. Technical Report, Microsoft Research, 2000-22.
Schölkopf, B., J.C. Platt, J. Shawe-Taylor, A.J. Smola and R.C. Williamson (2001). Estimating the support of a high-dimensional distribution. Neural Computation, 13, 1443-1471.
Schölkopf, B. and A.J. Smola (2002). Learning with Kernels. MIT Press, Cambridge, MA.
Shawe-Taylor, J. and N. Cristianini (2002). Kernel Pattern Analysis. Cambridge University Press, Cambridge, UK.
Sorsa, T. and H.N. Koivo (1991). Neural networks in process fault diagnosis. IEEE Transactions on Systems, Man, and Cybernetics, 21, 815-825.
Stephanopoulos, G. and C. Han (1996). Intelligent systems in process engineering: a review. Computers and Chemical Engineering, 20, 743-791.
Tax, D.M.J. and R.P.W. Duin (1999). Support vector domain description. Pattern Recognition Letters, 20, 1191-1199.
Venkatasubramanian, V. and K. Chan (1989). A neural network methodology for process fault diagnosis. AIChE Journal, 35, 1993-2002.