Chemometrics and Intelligent Laboratory Systems 33 (1996) 47-61

Discriminant analysis of high-dimensional data: a comparison of principal components analysis and partial least squares data reduction methods

E.K. Kemsley
Institute of Food Research, Norwich Research Park, Colney, Norwich NR4 7UA, UK

Received 15 June 1995; accepted 29 October 1995

Abstract

Partial least squares (PLS) methods are presented as valuable alternatives to principal components analysis (PCA) for compressing high-dimensional data before performing linear discriminant analysis (LDA). It is shown that using PLS, considerable improvement in class separation and thus discriminant ability can be obtained. In general, fewer of the compressed dimensions are required to give the same level of prediction successes, and for some data sets, PLS methods yield higher prediction success rates than those obtainable using PCA scores. Results are presented for two experimental data sets, comprising mid-infrared spectra of edible oils and plant seeds. The potential dangers of PLS methods are also demonstrated, in particular their ability to introduce apparent groupings into data where there is no inherent class structure.

Keywords: Partial least squares; Principal components analysis; Linear discriminant analysis; Infrared spectroscopy

1. Introduction

Infrared, Raman and nuclear magnetic resonance (NMR) spectroscopies are powerful analytical techniques [1], used in research laboratories all over the world to provide qualitative information on the structure and composition of a diverse range of samples. Increasing interest is also being shown in addressing quantitative problems by spectroscopic methods, and in particular, by infrared spectroscopy. In recent years, considerable effort has been expended on exploiting the data compression or reduction methods of principal component analysis (PCA) and partial least squares (PLS). With the rapid expansion in affordable computing power that has taken place over the last decade, PCA and PLS have now moved from the chemometrician's development package to the spectroscopist's instrument-driver software. Increasingly, large data sets and complex applications are being treated by these methods. Since most infrared spectral data is high-dimensional, with a single spectrum containing several hundred or even several thousand variables, it is perhaps not surprising that compression methods have quickly become established as valuable tools for spectroscopic data analysis.

PCA is a well-known technique of multivariate analysis. It was first proposed in 1901 by Pearson [2], and developed independently some years later by Hotelling [3], but in common with many multivariate methods, was not widely used until the arrival of modern computing technology. Today, however, it is available in virtually every statistical computer package. The main goal of PCA is to reduce the dimensionality of a data set in which there are a large number of intercorrelated variables, whilst retaining as much as possible of the information present in the original data.
This reduction is achieved by a linear transformation to a new set of variables, the principal component (PC) scores, which are uncorrelated, and ordered such that the first few retain most of the variation present in all of the original variables. A subset comprising only a few of the transformed variables may then be used in further procedures of comparatively reduced complexity. PCA has already been used in conjunction with a range of discriminant analysis techniques to tackle classification problems [4-7]; for example, a popular approach is to use a subset of scores as variables in a linear discriminant analysis (LDA). An alternative strategy is to carry out principal component regression (PCR) on a dummy variable or variables that indicate class membership; however, this approach has been less widely adopted in the quantitative spectroscopy field, and thus the discussion in this paper is restricted to data compression followed by traditional LDA.

PLS is a generic term for a family of related multivariate modelling methods, derived from the concepts of iterative fitting developed by Wold around a decade ago [8]. These ideas arose as pragmatic solutions to some of the problems that are encountered when conventional maximum likelihood methods are applied to large, intercorrelated data sets. In its basic regression form, PLS models the relationship between two data sets using a series of local least-squares fits. Like PCA, it can be viewed as an axis rotation method, and there are many similarities between the two techniques. PLS regression has been used extensively by chemometricians to tackle calibration applications in the physical sciences, with considerable success [9-11].

In this paper, PLS methods are presented as alternatives to PCA for data reduction prior to LDA. Although in the past PLS has been used mostly for calibration, it is possible to stop short of the regression step, and use a suitably modified algorithm for data compression only. The transformed variables can then be used in an LDA. This procedure has been applied to data sets comprising mid-infrared spectra of extra virgin and refined olive oils, and of plant seeds in three different categories. It is shown that considerable improvement in class separation and discriminant ability can be obtained using PLS data reduction methods.

2. Methods

2.1. PCA and PLS data reductions

An infrared spectrum comprises measurements of absorbance at d different wavelengths, where d is typically several hundred. An experiment usually involves collecting n such spectra, and in practice it is almost always the case that n < d. The two reduction methods differ in the information they use: PCA operates on the (n x d) matrix X alone, whereas a PLS reduction makes use of one or more dependent variables. For a single dependent variable, an n-element y-vector is used (the PLS1 case); for s > 1 dependent variables, an (n x s) matrix Y is required. Again, X and y or Y are generally mean-centred as a first step. Various algorithms exist that provide related but different definitions of PLS. The original and computationally most simple algorithm, termed orthogonalised PLS1 for one y-variable, was devised by Wold et al. in 1983 [13]. An alternative definition, known as non-orthogonalised PLS, was developed by Martens et al. in 1987 [14]. In terms of modelling the dependent variable, the two methods are equivalent, yielding the same regression equation between y and X. However, when viewed as data rotation methods, there are some differences. In both formulations, vectors w_i are chosen such that the covariance of each score z_i = X_i w_i with y_i is maximised, where X_i and y_i are the residual variability in X and y at the beginning of the ith pass through the algorithm.
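The two reductions described in this section can be sketched in a few lines of code. The following is a minimal illustration only, assuming NumPy; the function names are illustrative, and the PLS routine follows the orthogonalised (deflation-based) PLS1 formulation outlined above rather than reproducing the author's original implementation.

```python
import numpy as np

def pca_scores(X, n_components):
    """PCA reduction: project mean-centred X onto its leading principal axes."""
    Xc = X - X.mean(axis=0)
    # Right singular vectors of the mean-centred data are the PC loadings
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T               # (n x n_components) score matrix

def pls1_scores(X, y, n_components):
    """Orthogonalised PLS1 reduction: each weight vector w_i maximises the
    covariance of the score z_i = X_i w_i with the current residual y_i."""
    Xi = X - X.mean(axis=0)
    yi = y - y.mean()
    scores = []
    for _ in range(n_components):
        w = Xi.T @ yi                             # direction of maximal covariance with y_i
        w /= np.linalg.norm(w)                    # set |w| = 1
        z = Xi @ w                                # score for this pass
        b = Xi.T @ z / (z @ z)                    # estimated loading b_i
        Xi = Xi - np.outer(z, b)                  # deflate X so later scores are uncorrelated
        yi = yi - z * (yi @ z) / (z @ z)          # deflate y
        scores.append(z)
    return np.column_stack(scores)                # (n x n_components) score matrix
```

Either score matrix can then be truncated to its first r columns and passed to the discriminant step described in Section 2.2.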
Up to i = min(n - 1, d) scores can be calculated. The first vector w_1 and score z_1 are equivalent in both methods; thereafter, the algorithms differ. In non-orthogonalised PLS, all vectors w_i are in fact orthogonal, and the linear transformation matrix W = P describes a rigid rotation. However, the scores obtained are not necessarily orthogonal. This definition of PLS is conceptually easy to understand: it is a rigid rotation of the original co-ordinate system, such that scores along the transformed axes have successively maximised covariance with y. In contrast, the orthogonalised PLS algorithm ensures that the scores obtained are uncorrelated, by using an alternative method of expressing the residual variability in X at each stage. This requires a set of additional vectors b_i, termed the estimated loadings, onto which the projections of X_i have maximum covariance with the scores z_i. The linear transformation matrix is given by P = W(B^T W)^{-1}, and in general, its columns are not orthogonal, so that this form of PLS does not describe a rigid rotation.

2.2. PCA and PLS in LDA

In this paper, PCA and PLS scores are used as variables in LDA. It is shown that PLS reductions, when performed with a y-vector or Y-matrix filled with suitable dummy variables, give improved discriminant ability in comparison with the use of PCA scores. The origin of this improvement is believed to be that PLS reductions yield scores that maximise the between-groups variance, as will now be shown.

Suppose the observations in the matrix X can be assigned to one of g = 2 groups, and are arranged such that the first n_1 rows belong to group 1, and the subsequent n_2 rows to group 2. A y-vector can be constructed containing n_1 entries of n_2/n, and n_2 entries of -n_1/n; this vector has column mean zero. The covariance of z_1 and y can be written:

$$ \mathrm{cov}(y, z_1) = \frac{y^T z_1}{n-1} = \frac{y^T X w_1}{n-1} $$

This quantity is clearly maximised when

$$ w_1 = \frac{X^T y}{\left( y^T X X^T y \right)^{1/2}} \qquad (1) $$

where the denominator is used to set |w_1| = 1. This vector defines the first PLS loading. Now consider the between-groups variance of the elements of z_1, defined as:

$$ \frac{1}{g-1} \sum_{j=1}^{g} n_j \left( \bar{z}_1^{(j)} - \bar{z}_1 \right)^2 \qquad (2) $$

where \bar{z}_1^{(j)} denotes the mean of the entries in z_1 assigned to group j. Since X is mean-centred, \bar{z}_1 = 0; furthermore, it follows that:

$$ y^T z_1 = \sum_{i=1}^{n} y_i z_{1i} = \frac{n_2}{n} \sum_{i \in \mathrm{group\,1}} z_{1i} - \frac{n_1}{n} \sum_{i \in \mathrm{group\,2}} z_{1i} \qquad (3) $$

where z_{1i} denotes the ith entry in z_1. Since \bar{z}_1 = 0, we can write n_1 \bar{z}_1^{(1)} = -n_2 \bar{z}_1^{(2)}, so that Eq. (3) becomes:

$$ y^T z_1 = n_1 \bar{z}_1^{(1)} = -n_2 \bar{z}_1^{(2)} $$

The between-groups variance (Eq. (2) with g = 2 and \bar{z}_1 = 0) can thus be written:

$$ n_1 \left( \bar{z}_1^{(1)} \right)^2 + n_2 \left( \bar{z}_1^{(2)} \right)^2 = \frac{n}{n_1 n_2} \left( y^T z_1 \right)^2 = \frac{n}{n_1 n_2} \left( y^T X w_1 \right)^2 $$

Clearly, when w_1 corresponds to the first PLS loading defined by Eq. (1), the between-groups variance will also be maximised. This indicates that the first PLS score represents the best single dimension for separating the two classes. Indeed, it is found that the between-groups variances of subsequently calculated scores are equal to zero, implying perhaps that these will be of minimal use for discrimination; however, as will be seen from the experimental data, this is not always the case, and sometimes their inclusion can improve the prediction success rate. This can be understood by considering that when LDA is applied to multiple scores, it is the multivariate analogues of the between-groups and within-groups variances, that is, the between-groups and within-groups covariance matrices, that influence the success of a model [15]; these are affected even by scores for which the univariate between-groups variance defined by Eq. (2) is zero.
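The two-group construction just described is easy to check numerically. The sketch below is illustrative only: it assumes NumPy, uses random data as a stand-in for spectra, builds the dummy y-vector with n_1 entries of n_2/n and n_2 entries of -n_1/n, computes the first PLS loading of Eq. (1), and compares the between-groups variance of the resulting score with that of the first PC score of the same data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a spectral matrix: n observations in two groups, d variables
n1, n2, d = 30, 20, 200
n = n1 + n2
X = rng.normal(size=(n, d))
X[:n1] += 0.3 * rng.normal(size=d)        # small group-1 offset so some class structure exists
X = X - X.mean(axis=0)                    # mean-centre

# Dummy y-vector: n1 entries of n2/n followed by n2 entries of -n1/n (column mean zero)
y = np.concatenate([np.full(n1, n2 / n), np.full(n2, -n1 / n)])

# First PLS loading, Eq. (1): w1 = X^T y / sqrt(y^T X X^T y), so that |w1| = 1
w1 = X.T @ y
w1 /= np.sqrt(y @ X @ X.T @ y)
z1 = X @ w1                               # first PLS score

def between_groups_variance(z):
    """Eq. (2) for g = 2 groups and mean-centred scores."""
    return n1 * z[:n1].mean() ** 2 + n2 * z[n1:].mean() ** 2

# For comparison, the first PC score of the same mean-centred data
pc1 = X @ np.linalg.svd(X, full_matrices=False)[2][0]

print(between_groups_variance(z1), between_groups_variance(pc1))
```

Because w_1 maximises (y^T X w)^2 over all unit vectors, the between-groups variance printed for z_1 is never smaller than that of the first PC score.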
When there are g > 2 classes, the matrix Y can be constructed from g columns of binary variables, which are then mean-centred as a preprocessing step. (Conventionally, g - 1 binary variables are used to indicate group membership, but some of the associated algebra is simplified if Y is of order (n x g), which will be assumed throughout this work.) In such cases, a PLS2 algorithm for multiple dependent variables is required. This differs from PLS1 in requiring the introduction of an additional vector u to summarise the residual variability in Y. It is the covariance of u with the scores that is successively maximised. For the first score, the relationship between u and Y can be written [16]:

$$ u = \frac{Y q}{q^T q} \qquad (4) $$

where

$$ q = \frac{Y^T z_1}{z_1^T z_1} \qquad (5) $$

Combining Eqs. (4) and (5) leads to:

$$ u = \frac{Y Y^T z_1}{c} $$

where c is a scalar given by:

$$ c = \frac{z_1^T Y Y^T z_1}{z_1^T z_1} $$

The covariance of u and z_1 can thus be written:

$$ \mathrm{cov}(u, z_1) = \frac{u^T z_1}{n-1} = \frac{z_1^T Y Y^T z_1}{c(n-1)} = \frac{w_1^T X^T Y Y^T X w_1}{c(n-1)} \qquad (6) $$

Maximisation of this quantity is brought about by an iterative procedure; a full description of the algorithm can be found in the text by Martens and Naes [14].

With the observations in X mean-centred and arranged groupwise as above, the between-groups variance of the first score can be written:

$$ \frac{z_1^T Y Q Y^T z_1}{g-1} = \frac{w_1^T X^T Y Q Y^T X w_1}{g-1} \qquad (7) $$

in which Q is a diagonal matrix with entries 1/n_1, ..., 1/n_g, where n_1, ..., n_g are the numbers of observations in each of the j = 1, ..., g groups. By examination of Eqs. (6) and (7), it is clear that maximising the covariance of z_1 and u simultaneously maximises the between-groups variance. Moreover, it is found that the first g - 1 scores have substantially non-zero between-groups variance; this is consistent with what one might expect, since to characterise the separations between g groups, g - 1 dimensions are required.

In this paper, the PCA and PLS data transformations described above are applied to two sets of experimental data. The scores obtained have been used in LDA. The procedure is summarised as follows:

(i) Preprocessing. Each observation is assigned to a class. An appropriate y-vector (or matrix) of dummy variables is constructed. The X- and y-data are mean-centred, and the X-data variance-scaled.

(ii) Data reduction. PCA or PLS data reduction is performed to yield a scores matrix Z. A subset of r of these scores is retained in a reduced matrix Z_r, of order (n x r), and the remainder discarded.

(iii) LDA applied to training set. The class mean scores are calculated. The Mahalanobis D² distance [17] of each observation's scores from each group mean is computed, and the observation re-assigned to the nearest group mean. The percentage correctly re-classified is examined. (The n observations and g group means are represented by (1 x r) row vectors. The Mahalanobis D² between the jth observation z_j and the kth group mean \bar{z}^{(k)} is given by (z_j - \bar{z}^{(k)}) S^{-1} (z_j - \bar{z}^{(k)})^T, where S is the pooled within-groups covariance matrix of the retained scores.)
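A compact sketch of steps (i) and (iii) for g groups is given below, again assuming NumPy. The helper names are hypothetical, the class labels are assumed to be an integer array with values 0 to g - 1, and the reduced score matrix Zr (step (ii)) could be the first r columns of either the PCA or the PLS score matrix computed from the mean-centred, variance-scaled X. The re-classification uses the pooled within-groups covariance matrix of the retained scores, in line with the Mahalanobis D² description above.

```python
import numpy as np

def dummy_Y(labels, g):
    """Step (i): (n x g) matrix of binary group indicators, mean-centred column-wise.
    labels: integer NumPy array with values 0..g-1."""
    Y = np.zeros((len(labels), g))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y - Y.mean(axis=0)

def lda_reclassify(Zr, labels, g):
    """Step (iii): re-assign each observation's scores to the nearest group mean
    in the Mahalanobis D^2 sense, and report the percentage correctly re-classified."""
    means = np.vstack([Zr[labels == j].mean(axis=0) for j in range(g)])
    resid = Zr - means[labels]                      # deviations from own group mean
    S = resid.T @ resid / (len(labels) - g)         # pooled within-groups covariance
    S_inv = np.linalg.inv(S)
    d2 = np.array([[(z - m) @ S_inv @ (z - m) for m in means] for z in Zr])
    assigned = d2.argmin(axis=1)
    return assigned, 100.0 * np.mean(assigned == labels)
```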
