
IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B: CYBERNETICS, VOL. 35, NO. 2, APRIL 2005

Identifying Critical Variables of Principal Components for Unsupervised Feature Selection

K. Z. Mao

Abstract—Principal components analysis (PCA) is probably the best-known approach to unsupervised dimensionality reduction. However, the axes of the lower-dimensional space, i.e., the principal components (PCs), are a set of new variables carrying no clear physical meaning. Thus, interpretation of results obtained in the lower-dimensional PCA space and data acquisition for test samples still involve all of the original measurements. To deal with this problem, we develop two algorithms that link the physically meaningless PCs back to a subset of the original measurements. The main idea of the algorithms is to evaluate and select feature subsets based on their capacity to reproduce sample projections on the principal axes. The strength of the new algorithms is that the computational complexity involved is significantly reduced compared with the data-structural-similarity-based feature evaluation of [20].

Index Terms—Backward elimination, forward selection, principal components analysis (PCA), unsupervised feature selection.

I. INTRODUCTION

DIMENSIONALITY reduction is a long-standing issue of interest in high-dimensional data analysis. Feature selection and feature extraction are two commonly used approaches to this problem. In the literature, the terms feature selection and feature extraction are sometimes used interchangeably; however, they have subtle distinctions. Feature selection refers to selecting features in the measurement space, while feature extraction selects features in a transformed space (see, for example, [1]–[3]). Features provided by a feature-selection technique are a subset of the original measurements, while features obtained through feature extraction are a set of new variables carrying no clear physical meaning. In applications where interpretable features are desired, feature selection should be the choice.

Quite often, feature selection is studied in the paradigm of supervised learning (see, for example, [1]–[10]). Feature selection in the paradigm of unsupervised learning is relatively rare. In the literature, unsupervised feature selection is often customized to a particular clustering algorithm, where intra-cluster scatter and inter-cluster distance are used as evaluation criteria [11], [12]. Other criteria that have been used in unsupervised feature selection include the entropy measure [13], expectation measure [14], dependency measure [15], and relevance measure [16]. Principal components analysis (PCA) is also frequently used for unsupervised dimensionality reduction, but PCA is generally considered a feature extraction technique because the features provided by PCA are a set of new variables carrying no clear physical meaning [2].

Principal components (PCs) are linear combinations of all available variables (also called features). However, these variables are not necessarily equally important to the formation of the PCs: some of the variables might be critical, while others might be redundant, irrelevant, or insignificant. Motivated by this fact, attempts have been made to link the physically meaningless PCs back to a subset of the original variables by selecting critical variables or eliminating redundant, irrelevant, or insignificant ones. The B2 and B4 algorithms are probably the best-known approaches of this kind [17], [18]. The B2 algorithm discards variables that are highly associated with the last few PCs, while the B4 algorithm selects variables that are highly associated with the first few PCs. Since the significance of the variables is evaluated individually, redundant features might not be removed by the B2 and B4 algorithms. Other PCA-based feature-selection algorithms include the principal variable method [19] and the data structure preserved (DSP) algorithm [20]. The main characteristic of the DSP algorithm is that it evaluates a feature subset based on its data-structural similarity to the full feature set. The comparison of data structure involves derivation of the principal components underlying the feature subset and matrix matching under translation, rotation, and dilation. The computational complexity of evaluating a single feature subset in this way depends on the number of training samples and can be quite high when the training set is large. In the present study, we propose to evaluate a feature subset based on its capacity to reproduce sample projections on the principal axes. The modification here is subtle, but it simplifies the evaluation of a feature subset to an ordinary least-square estimation (LSE) problem and hence reduces the computational complexity considerably. Experimental studies show that the LSE-based feature evaluation improves computational efficiency significantly, while the feature subsets provided lead to classification results similar to those of the data-structural-similarity-based feature evaluation.

The present study is organized as follows. In Section II, the basic idea of identifying critical variables of principal components is introduced, a forward feature-selection algorithm and a backward feature-elimination algorithm are developed, and their computational complexity is analyzed. Experimental studies are presented in Section III, and concluding remarks are given in Section IV.

Manuscript received July 7, 2003; revised January 31, 2004 and August 16, 2004. This paper was recommended by Associate Editor V. Govindaraju. The author is with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore (e-mail: ekzmao@ntu.edu.sg). Digital Object Identifier 10.1109/TSMCB.2004.843269

II. SELECTING FEATURE SUBSET THROUGH IDENTIFYING CRITICAL VARIABLES OF PRINCIPAL COMPONENTS

A. The Basic Idea

Principal components analysis can be considered as a linear transform that maps data from the original measurement space
to a new space spanned by a set of new variables. Assume the linear transform is denoted by the matrix W; a pattern x in the measurement space is then represented in the new space by

    z = W^T x                                                          (1)

where x = [x_1, x_2, ..., x_n]^T is the pattern in the original n-dimensional measurement space, z = [z_1, z_2, ..., z_d]^T is the pattern in the new space, W = [w_1, w_2, ..., w_d] is the n-by-d transform matrix whose columns are the principal axes, and quite often d << n. The projection of the data points onto the original variables gives the values of the original variables, and the projection of the data points onto the new variables gives the values of the new variables. The new variables z_1, z_2, ..., z_d are called PCs.

Consider the projection of the ith sample x(i) = [x_1(i), x_2(i), ..., x_n(i)]^T on the kth principal axis

    z_k(i) = w_{1k} x_1(i) + w_{2k} x_2(i) + ... + w_{nk} x_n(i)       (2)

As shown in (2), the projection of a sample on a principal axis is a linear combination of all variables. However, some of the variables might be redundant, irrelevant, or insignificant. This indicates that feature selection can be done by identifying a subset of variables whose roles are critical in determining the data projections on the principal axes.

The significance of variable x_j can be evaluated based on the value of the corresponding parameter w_{jk}. Variables whose corresponding parameters are smaller than a threshold can be considered irrelevant or insignificant. Other ways of selecting influential variables include discarding variables that are highly associated with the last few PCs or keeping variables that are highly associated with the first few PCs [17]. These methods are easy to implement, but none of them is able to remove redundant features. In the present study, we consider all features in a feature subset as a team and evaluate their significance in a collective manner, based on their capacity to reproduce the data projections on the principal axes. This idea of feature evaluation can be summarized as follows.

Given the projections z_k(i) and the training samples x(i), build a linear model

    z_k(i) ≈ ẑ_k(i) = θ_1 x_{s_1}(i) + θ_2 x_{s_2}(i) + ... + θ_l x_{s_l}(i)       (3)

where {x_{s_1}, ..., x_{s_l}} is a candidate subset of l original variables and θ_1, ..., θ_l are the model parameters, such that the following cost function is minimized:

    J_k = Σ_{i=1}^{N} [z_k(i) - ẑ_k(i)]^2                              (4)

where N is the number of training samples and x(i) denotes the ith sample. The outstanding characteristic of the above idea is that it converts the problem of identifying critical variables of principal components into a model-building problem, which is relatively easy to solve. The significance of a single variable can be measured based on the error reduction or inflation of (4) after adding the variable to, or deleting it from, the feature subset.

If multiple PCs are taken into account, the feature evaluation and selection procedure should be based on minimization of the following combined cost function:

    J = Σ_{k=1}^{d} J_k                                                (5)

where d is the number of PCs to be considered. An important issue here is the determination of a suitable value for d. This is a common problem whenever PCA is employed. A rule of thumb is to keep the eigenvalues accounting for 90% of the variance, or to keep the eigenvalues larger than the average of all the eigenvalues (see, for example, [18]).
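To make the criterion concrete, the following Python/NumPy sketch is one possible rendering of (3)-(5) together with the eigenvalue rule of thumb for choosing d. The helper names (choose_d, pc_projections, subset_error) and the use of NumPy routines are illustrative assumptions, not the paper's original Matlab implementation.

    import numpy as np

    def choose_d(X):
        """Rule-of-thumb choice of d: keep the eigenvalues of the covariance
        matrix of X that are larger than the average eigenvalue."""
        eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))
        return int(np.sum(eigvals > eigvals.mean()))

    def pc_projections(X, d):
        """Sample projections z_k(i) on the first d principal axes of the
        N x n data matrix X, returned as an N x d array."""
        Xc = X - X.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        W = Vt[:d].T                       # n x d matrix of principal axes
        return Xc @ W

    def subset_error(X, Z, subset):
        """Combined cost J in (5): for each of the d projections in Z, fit the
        linear model (3) on the candidate feature subset by ordinary least
        squares and accumulate the residual sum of squares (4)."""
        Xs = X[:, list(subset)]
        Xs = Xs - Xs.mean(axis=0)          # work with centered variables
        theta, _, _, _ = np.linalg.lstsq(Xs, Z, rcond=None)
        return float(np.sum((Z - Xs @ theta) ** 2))

    # Example usage on hypothetical data:
    # X = np.random.randn(200, 16)
    # Z = pc_projections(X, choose_d(X))
    # print(subset_error(X, Z, subset=[0, 1]))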
A typical feature-selection algorithm includes a feature evaluation criterion and a search algorithm. The role of the search algorithm is to explore the solution space and generate candidate feature subsets, while the evaluation criterion is used to assess the quality of the subsets. Equations (4) and (5) can be considered a feature evaluation criterion. This criterion must be combined with a search algorithm, such as forward selection or backward elimination, to constitute a feature-selection algorithm. Forward selection is a bottom-up method starting from an empty set; the significance of a feature is evaluated in terms of the error reduction after inclusion of the feature in (3). Backward elimination is a top-down method starting from the full feature set, where the significance of a feature is evaluated based on the error inflation after removal of the feature from (3). To select m features from a set of n features, the standard forward selection algorithm needs to evaluate n + (n-1) + ... + (n-m+1) feature subsets, while the standard backward elimination algorithm evaluates n + (n-1) + ... + (m+1) feature subsets.

As shown in (5), d linear least-square estimation operations are needed to evaluate one feature subset. Thus, the total number of linear least-square estimation operations performed in selecting m from n features is d[n + (n-1) + ... + (n-m+1)] for the forward selection and d[n + (n-1) + ... + (m+1)] for the backward elimination, respectively. These constitute the major computations in the feature-selection procedure.

The DSP algorithm [20] is also a PCA-based feature-selection method. The main idea of the DSP algorithm is to evaluate feature subsets based on their data-structural similarity to the full feature set. The feature evaluation process involves matrix comparison through Procrustes analysis, which translates, rotates, and dilates one matrix to match the other. Assume the matrices Z and Z̃ denote the sample projections on principal axes derived from the full feature set and from the feature subset, respectively. Comparison under translation is done through mean removal, while comparison under rotation and dilation is implemented by fixing the matrix Z as a reference and rotating and dilating Z̃ to match the reference Z. The rotation matrix Q and the dilation factor c are computed using

    Q = U V^T                                                          (6)

    c = tr(S) / tr(Z̃^T Z̃)                                             (7)

where U S V^T is the singular value decomposition of Z̃^T Z, and tr(·) denotes the trace of a matrix.
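By way of comparison, the matching step in (6) and (7) can be sketched as follows. This is a minimal rendering of standard orthogonal Procrustes analysis with dilation and may differ in detail from the exact implementation of [20]; both arguments are assumed to be mean-centered N x d projection matrices.

    import numpy as np

    def procrustes_error(Z_full, Z_sub):
        """Rotate and dilate the subset projections Z_sub (N x d) to match the
        reference projections Z_full (N x d) from the full feature set."""
        U, S, Vt = np.linalg.svd(Z_sub.T @ Z_full)
        Q = U @ Vt                                   # rotation matrix, cf. (6)
        c = S.sum() / np.trace(Z_sub.T @ Z_sub)      # dilation factor, cf. (7)
        Z_matched = c * (Z_sub @ Q)
        # Residual matching error, i.e., the structural dissimilarity.
        return float(np.sum((Z_full - Z_matched) ** 2))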
Since a backward elimination search procedure was employed in the DSP algorithm [20], the number of feature subsets to be evaluated is the same as in the LSE-based backward elimination algorithm. The difference in computational complexity between the LSE- and DSP-based algorithms is therefore attributed to the difference in the cost of evaluating a single feature subset. Assume the feature subset to be evaluated consists of l features and the number of training samples is N. To find Z̃, the sample projections on the principal axes underlying the subset, a singular value decomposition (SVD) of the l-by-l covariance matrix is usually performed, at a cost on the order of O(l^3) after the covariance matrix has been formed. Reference [20] employed a more efficient implementation, performing the SVD on the l-by-N data matrix instead of on the covariance matrix, at a cost on the order of O(l^2 N). Other computations in the feature evaluation include computing the rotation matrix and the dilation factor and matching the two matrices. In most pattern-classification applications, the number of samples is much larger than the number of features, i.e., N >> l and N >> d, so the total cost of the data-structural-similarity-based evaluation of a feature subset of l features is dominated by the SVD of the subset data matrix together with the Procrustes matching computations.

In contrast, the major computation of feature evaluation in our algorithms is the estimation of the parameters θ from the N training samples using a linear least-square estimation algorithm. The operations involved are the multiplication of an l-by-N matrix with an N-by-l matrix, the inversion of an l-by-l matrix, the multiplication of an l-by-l matrix with an l-by-N matrix, and the multiplication of an l-by-N matrix with an N-by-d matrix, with costs on the order of O(l^2 N), O(l^3), O(l^2 N), and O(lNd), respectively. Considering N >> l, evaluating a feature subset thus requires only a handful of ordinary matrix products and one small matrix inversion. In other words, the linear LSE-based feature evaluation is much more efficient than the feature evaluation based on data-structural similarity and Procrustes analysis.

Besides sequential forward selection and backward elimination, variants of the two search algorithms, such as the floating search [4] and the orthogonal forward-selection and backward-elimination algorithms [9], [10], can also be used. Only the forward feature-selection and the backward feature-elimination algorithms are presented as examples next.

B. The LSE-Based Forward Feature-Selection Algorithm

Finding the critical variables of the PCs consists of two steps. In the first step of the procedure, the PCs are derived from the full feature set using a principal components analysis procedure, and the sample projections on the PCs corresponding to the d largest eigenvalues are computed. In the second step, a sequential forward-selection algorithm is performed, with the objective of selecting a subset of features that has the greatest capacity to reproduce the data projections on the PCs. The pseudocode for selecting a feature subset through identifying critical variables of the PCs is summarized as follows (an illustrative Python rendering is given after the pseudocode):

% The LSE-based forward selection algorithm
(1) Initialize S to an empty set;
(2) Initialize R to the full feature set;
(3) Perform principal components analysis on the complete data;
(4) Calculate sample projections on the first d principal axes;
(5) For l = 1:m                  % m iterations to select m features
        h = number of features in set R;
        For j = 1:h              % evaluate each variable in R
            Take variable j from set R and add it temporarily to set S;
            For k = 1:d          % consider d PCs
                Fit a linear model using all the l variables in set S;
                Calculate the fitting error on PC k;
            End
            Calculate the total fitting error over the d PCs;
        End
        Add the variable that leads to the minimum total error to set S;
        Eliminate the selected variable from R;
    End
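The forward-selection loop above can be rendered compactly in Python as follows. The sketch reuses the hypothetical pc_projections and subset_error helpers introduced in Section II-A and is an illustration under those assumptions, not the author's original Matlab code.

    def lse_forward_selection(X, m, d):
        """Select m features from the N x n data matrix X by greedily adding,
        at each step, the feature whose inclusion most reduces the combined
        fitting error (5) with respect to the first d principal axes."""
        Z = pc_projections(X, d)                 # projections on the d principal axes
        selected, remaining = [], list(range(X.shape[1]))
        errors = []                              # approximation error after each addition
        for _ in range(m):
            scores = {j: subset_error(X, Z, selected + [j]) for j in remaining}
            best = min(scores, key=scores.get)   # feature giving minimum total error
            selected.append(best)
            remaining.remove(best)
            errors.append(scores[best])
        return selected, errors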
C. The LSE-Based Backward Feature-Elimination Algorithm

In applications where most of the variables are to be kept, an elimination algorithm is preferred. The elimination algorithm is the combination of the evaluation criteria (4) and (5) with a backward elimination search algorithm. The pseudocode of the LSE-based backward elimination algorithm is as follows (a corresponding Python sketch follows the pseudocode):

% The LSE-based backward elimination algorithm
(1) Initialize R to the full feature set;
(2) Perform PCA on the complete data and determine the d principal axes to be used;
(3) Calculate sample projections on the first d principal axes;
(4) For i = 1:(n - m)            % delete n - m features
        l = number of features in set R;
        For j = 1:l              % evaluate each variable in R
            Remove variable j temporarily from R;
            For k = 1:d          % consider d PCs
                Fit a linear model using all the variables remaining in set R;
                Calculate the fitting error on PC k;
            End
            Calculate the total fitting error over the d PCs;
        End
        Remove the variable leading to the minimum error inflation from R;
    End
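The backward counterpart can be sketched in the same style, again reusing the hypothetical helpers above. As discussed next, each least-squares fit requires more training samples than remaining features for the problem to stay well posed.

    def lse_backward_elimination(X, m, d):
        """Keep m features: starting from the full set, repeatedly remove the
        feature whose deletion inflates the combined fitting error (5) least."""
        Z = pc_projections(X, d)
        remaining = list(range(X.shape[1]))
        while len(remaining) > m:
            scores = {j: subset_error(X, Z, [f for f in remaining if f != j])
                      for j in remaining}
            least_useful = min(scores, key=scores.get)   # minimum error inflation
            remaining.remove(least_useful)
        return remaining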
Model fitting is part of both the forward-selection and the backward-elimination algorithms. The model fitting is based on the linear LSE algorithm, which demands more data points than variables in the model. In applications where there are more features than training samples, the backward-elimination algorithm is therefore not applicable. The forward-selection algorithm can still be applied in such applications, but the maximum number of features that can be selected is limited by the number of training samples.

An important issue in feature selection is the determination of the number of features to be selected. If the two algorithms are applied to select features for supervised learning, the number of
features to be selected can be determined in terms of cross-validation classification results. This is a common practice in supervised feature selection. If the algorithms are applied to unsupervised learning problems, the number of features to be selected can be determined based on the error of approximation (5). In general, the approximation error decreases as the number of features used increases. The selection can be stopped when the reduction in approximation error resulting from the addition of the next best feature falls below a pre-specified threshold.
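A minimal sketch of this stopping rule, applied to the per-step errors returned by the hypothetical lse_forward_selection function above (the threshold value is an application-dependent choice):

    def select_by_error_reduction(errors, threshold):
        """Stop when adding the next best feature reduces the approximation
        error (5) by less than the pre-specified threshold; return the number
        of features to keep."""
        for i in range(1, len(errors)):
            if errors[i - 1] - errors[i] < threshold:
                return i          # keep the first i features
        return len(errors)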

III. EXPERIMENTS

A. Experiment 1: An Unsupervised Learning Problem

In the first experiment, a synthetic problem was used to test our algorithms. The synthetic data, described by two features x_1 and x_2, consist of three Gaussian clusters, as shown in Fig. 1. The covariance matrices of the three clusters are all identity matrices, and the mean vectors of the three clusters are distinct. In the experiment, four additional features were generated by adding Gaussian random noise with zero mean and variance at levels 0.2 and 0.4 to x_1 and x_2. Obviously, among the six features, four are redundant. Besides the four redundant features, ten Gaussian random variables with zero mean and unit variance were generated and added to the dataset. The objective here is to identify and discard the ten irrelevant and four redundant features using the LSE-based forward selection and backward elimination algorithms.

Fig. 1. Three clusters among 600 samples for experiment 1.
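A dataset of this kind can be generated along the following lines. The cluster means below are placeholders chosen only for illustration (the exact mean vectors of the original experiment are not specified here), while the identity covariances, the two noisy copies of each informative feature at noise variances 0.2 and 0.4, and the ten pure-noise features follow the description above.

    import numpy as np

    def make_experiment1_data(n_per_cluster=200, seed=0):
        """Three identity-covariance Gaussian clusters in (x1, x2), plus four
        redundant features (noisy copies of x1 and x2 at noise variances 0.2
        and 0.4) and ten irrelevant N(0, 1) features: 16 features in total."""
        rng = np.random.default_rng(seed)
        means = [(0.0, 0.0), (4.0, 0.0), (2.0, 4.0)]   # placeholder cluster means
        X12, y = [], []
        for c, mu in enumerate(means):
            X12.append(rng.normal(loc=mu, scale=1.0, size=(n_per_cluster, 2)))
            y += [c] * n_per_cluster
        X12 = np.vstack(X12)
        redundant = np.hstack([X12 + rng.normal(scale=np.sqrt(v), size=X12.shape)
                               for v in (0.2, 0.4)])          # 4 redundant features
        irrelevant = rng.normal(size=(X12.shape[0], 10))      # 10 irrelevant features
        return np.hstack([X12, redundant, irrelevant]), np.array(y)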
In the experiment, tenfold cross validation was performed. In each of the ten repeats, PCA was first performed on 540 training samples, and the number of PCs to be used was determined based on the eigenvalues of the training data matrix. In all ten repeats, two principal axes were selected. Features were then selected based on their ability to reproduce the data projections on the two principal axes. The error of approximation to the data projections (averaged over the ten repeats) versus the number of features used is plotted in Fig. 2, which clearly shows that two features are sufficient to reproduce the data projections on the first two principal axes well. In the forward selection, the two features selected were x_1 and x_2, while in the backward elimination the ten irrelevant and four redundant features were removed. In both algorithms, the irrelevant and redundant features were thus identified and deleted.

Fig. 2. Approximation error versus number of features selected for experiment 1.

Since the data are synthetic, the correct number of clusters and the true cluster assignments are known, and the clustering performance was tested. In each of the ten repeats, 540 samples were used to train a two-dimensional self-organizing map (SOM) with three nodes, and the 90 test samples were then assigned to the nearest node. The error rate versus the number of features is shown in Fig. 3. The results again demonstrate that only two features are necessary and that the two features selected are correct.

Fig. 3. Classification results versus number of features selected for experiment 1.

In the above experiment, the number of PCs to be used was determined as the number of eigenvalues larger than the average of the 16 eigenvalues. In the experiment, the influence of the number of PCs on feature selection was also investigated. We found that the curve of approximation error versus number of features might differ when different numbers of PCs were used in feature selection, but this did not affect feature selection, because the error reduction was trivial after two features were selected in all cases.
B. Experiment 2: Supervised Learning Problems

In Experiment 1, the unsupervised feature-selection method was tested in the paradigm of unsupervised learning. As a matter of fact, this unsupervised feature-selection algorithm is applicable to supervised learning problems as well. In Experiment 2, we used three supervised learning problems to test the two algorithms. All three real-world problems are from the UCI Machine Learning Repository [23].

The first example is the Pima Indian diabetes problem. The dataset contains 768 samples from two classes, where 500 samples are from class 1 and the remaining 268 samples are from class 2. Each sample is represented by eight features. The problem posed is to predict whether a patient would test positive for diabetes according to World Health Organization criteria. To estimate the classification accuracy, 12-fold cross-validation was used in the experiment. The reason why 12-fold rather than tenfold cross-validation was employed is that 768 is divisible by 12, and 12 is close to the commonly used 10.

The second example is the Wisconsin breast cancer (WBC) problem. The dataset contains 699 samples, where 458 are benign samples and 241 are malignant samples. Each sample is described by nine features. The task here is to predict the diagnosis result (Benign or Malignant). In this example, tenfold cross validation was used to estimate the classification accuracy.

The third example is the Wisconsin diagnostic breast cancer (WDBC) problem. The WDBC dataset consists of 357 benign samples and 212 malignant samples, with 30 real-valued features. The task here is to predict the diagnosis result (Benign or Malignant). Again, tenfold cross validation was used to estimate the classification accuracy.

In the experiment, the k-nearest neighbor (k-NN) classifier was used to classify the samples, and both the LSE-based forward selection and the LSE-based backward elimination algorithms were used to select feature subsets. The results of feature selection and classification for the three datasets are summarized in Tables I–III, respectively, where the accuracy and the number of features are presented in the format of mean value ± standard deviation. For comparison, Procrustes analysis (PA)-based forward selection and backward elimination were also applied to the three datasets, and the results obtained are summarized in Tables IV–VI, respectively.
TABLE I. LSE-BASED RESULTS FOR DIABETES DATASET
TABLE II. LSE-BASED RESULTS FOR WBC DATASET
TABLE III. LSE-BASED RESULTS FOR WDBC DATASET
TABLE IV. PA-BASED RESULTS FOR DIABETES DATASET
TABLE V. PA-BASED RESULTS FOR WBC DATASET
TABLE VI. PA-BASED RESULTS FOR WDBC DATASET
TABLE VII. COMPARISON OF COMPUTATION TIME (IN SECONDS) OF DIFFERENT METHODS

Obviously, the results of the LSE-based algorithms are comparable with those of the PA-based counterpart. However, our LSE-based methods are computationally much more efficient than the PA-based methods, as shown in Table VII, where the computation time reported is the time needed to perform one iteration of the k-fold validation (k is 12 in the Diabetes problem and k is 10 in the remaining two problems). The implementation is based on an 800-MHz PC with 256-MB memory using Matlab.

The best results achieved by the LSE-based forward selection algorithm are 79.43%, 97.86%, and 98.07% for the three datasets, respectively, while the best results achieved by the LSE-based backward elimination algorithm are 78.39%, 98.43%, and 97.89%, respectively. Reference [24] lists results of many classifiers, where the best published results for the first two datasets are 77.7% and 97.5%, respectively. Obviously, our results are better.

IV. CONCLUDING REMARKS

In the present study, we have presented two unsupervised feature-selection algorithms based on identification of critical variables of principal components. The algorithms link the physically meaningless PCs back to a subset of the original measurements, which retain their original meanings. Computational complexity analysis and experimental studies have shown that the two algorithms are computationally much more efficient than the Procrustes analysis-based feature selection, while the feature subsets provided lead to similar classification results to those of the Procrustes analysis-based algorithm.

REFERENCES

[1] P. A. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach. London, U.K.: Prentice-Hall, 1982.
[2] A. K. Jain, R. P. W. Duin, and J. Mao, "Statistical pattern recognition: A review," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 1, pp. 4-37, Jan. 2000.
[3] A. R. Webb, Statistical Pattern Recognition, 2nd ed. New York: Wiley, 2002.
[4] P. Pudil, J. Novovicova, and J. Kittler, "Floating search methods in feature selection," Pattern Recognit. Lett., vol. 15, no. 11, pp. 1119-1125, 1994.
[5] P. Pudil and J. Novovicova, "Novel methods for subset selection with respect to problem knowledge," IEEE Intell. Syst., vol. 13, no. 2, pp. 66-74, Mar.-Apr. 1998.
[6] R. Kohavi and G. H. John, "Wrappers for feature subset selection," Artif. Intell., vol. 97, no. 1-2, pp. 273-324, 1997.
[7] J. Yang and V. Honavar, "Feature subset selection using a genetic algorithm," IEEE Intell. Syst., vol. 13, no. 2, pp. 44-49, Mar.-Apr. 1998.
[8] A. Jain and D. Zongker, "Feature selection: Evaluation, application and small sample performance," IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 2, pp. 153-158, Feb. 1997.
[9] K. Z. Mao, "Fast orthogonal forward selection algorithm for feature subset selection," IEEE Trans. Neural Netw., vol. 13, no. 5, pp. 1218-1224, Sep. 2002.
[10] K. Z. Mao, "Orthogonal forward selection and backward elimination algorithms for feature subset selection," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 34, no. 1, pp. 629-634, Feb. 2004.
[11] M. Devaney and A. Ram, "Efficient feature selection in conceptual clustering," in Proc. 14th Int. Conf. Machine Learning, 1997, pp. 92-97.
[12] L. Talavera, "Feature selection as a preprocessing step for hierarchical clustering," in Proc. 16th Int. Conf. Machine Learning, 1999, pp. 389-397.
[13] M. Dash and H. Liu, "Dimensionality reduction for unsupervised data," in Proc. 9th IEEE Int. Conf. Tools with AI, 1997, pp. 522-539.
[14] J. Dy and C. Brodley, "Feature subset selection and order identification for unsupervised learning," in Proc. 17th Int. Conf. Machine Learning, 2000, pp. 247-254.
[15] L. Talavera, "Dependency-based feature selection for clustering symbolic data," Intell. Data Anal., vol. 4, no. 1, 2000.
[16] J. M. Pena, J. A. Lozano, P. Larranaga, and I. Inza, "Dimensionality reduction in unsupervised learning of conditional Gaussian networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 6, pp. 590-603, Jun. 2001.
[17] I. T. Jolliffe, "Discarding variables in a principal component analysis I: Artificial data," Appl. Statist., vol. 21, no. 2, pp. 160-173, 1972.
[18] I. T. Jolliffe, "Discarding variables in a principal component analysis II: Real data," Appl. Statist., vol. 22, no. 1, pp. 21-31, 1973.
[19] G. P. McCabe, "Principal variables," Technometrics, vol. 26, no. 2, 1984.
[20] W. J. Krzanowski, "Selection of variables to preserve multivariate data structure using principal components," Appl. Statist., vol. 36, no. 1, pp. 22-33, 1987.
[21] I. T. Jolliffe, Principal Component Analysis. New York: Springer-Verlag, 1986.
[22] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander, "Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring," Science, vol. 286, pp. 531-537, 1999.
[23] C. L. Blake and C. J. Merz. (1998) UCI Repository of Machine Learning Databases. Univ. California, Dept. Inform. Comput. Sci., Irvine, CA. [Online]. Available: http://www.ics.uci.edu/mlearn/Machine-Learning.html
[24] W. Duch. Datasets Used for Classification: Comparison of Results. Faculty of Physics, Astronomy and Informatics, Nicolaus Copernicus Univ., Torun, Poland. [Online]. Available: http://www.phys.uni.torun.pl/kmk/projects/datasets.html

K. Z. Mao was born in Shandong, China, on March 11, 1967. He received the Ph.D. degree from the University of Sheffield, Sheffield, U.K., in 1998. He was a Research Associate at the University of Sheffield for six months, and then joined Nanyang Technological University (NTU), Singapore, as a Research Fellow. Since June 2001, he has been an Assistant Professor at the School of Electrical and Electronic Engineering, NTU. His current research interests include machine learning, data mining, biomedical engineering, and bioinformatics.
