Identifying Critical Variables of Principal Components For Unsupervised Feature Selection
Abstract—Principal components analysis (PCA) is probably the best-known approach to unsupervised dimensionality reduction. However, axes of the lower-dimensional space, i.e., principal components (PCs), are a set of new variables carrying no clear physical meanings. Thus, interpretation of results obtained in the lower-dimensional PCA space and data acquisition for test samples still involve all of the original measurements. To deal with this problem, we develop two algorithms to link the physically meaningless PCs back to a subset of the original measurements. The main idea of the algorithms is to evaluate and select feature subsets based on their capacities to reproduce sample projections on principal axes. The strength of the new algorithms is that the computation complexity involved is significantly reduced compared with the data-structural-similarity-based feature evaluation [20].

Index Terms—Backward elimination, forward selection, principal components analysis (PCA), unsupervised feature selection.

I. INTRODUCTION

DIMENSIONALITY reduction is a long-standing issue of interest in high-dimensional data analysis. Feature selection and feature extraction are two commonly used approaches to this problem. In the literature, the terms feature selection and feature extraction are sometimes used interchangeably; however, they have subtle distinctions. Feature selection refers to selecting features in the measurement space, while feature extraction selects features in a transformed space (see, for example, [1]–[3]). Features provided by the feature-selection technique are a subset of the original measurements, while features obtained through feature extraction are a set of new variables carrying no clear physical meanings. In applications where interpretable features are desired, feature selection should be the choice.

Quite often, feature selection is studied in the paradigm of supervised learning (see, for example, [1]–[10]). Feature selection in the paradigm of unsupervised learning is relatively rare. In the literature, unsupervised feature selection is often customized to a particular clustering algorithm, where intra-cluster scatter and inter-cluster distance are used as evaluation criteria [11], [12]. Other criteria that have been used in unsupervised feature selection include the entropy measure [13], expectation measure [14], dependency measure [15], and relevance measure [16]. Principal components analysis (PCA) is also frequently used in unsupervised dimensionality reduction, but PCA is generally considered a feature-extraction technique because the features provided by PCA are a set of new variables carrying no clear physical meanings [2].

Principal components (PCs) are linear combinations of all the variables (also called features) available. However, these variables are not necessarily equally important to the formation of the PCs: some of the variables might be critical, but some might be redundant, irrelevant, or insignificant. Motivated by this fact, attempts have been made to link the physically meaningless PCs back to a subset of the original variables through selecting critical variables or eliminating redundant, irrelevant, or insignificant variables. The B2 and B4 algorithms are probably the best-known approaches of this kind [17], [18]. The B2 algorithm discards variables that are highly associated with the last few PCs, while the B4 algorithm selects variables that are highly associated with the first few PCs. Since the significance of variables is evaluated individually, redundant features might not be removed by the B2 and B4 algorithms. Other PCA-based feature-selection algorithms include the principal variable method [19] and the data structure preserved (DSP) algorithm [20]. The main characteristic of the DSP algorithm is that it evaluates a feature subset based on its data-structural similarity to the full feature set. Comparison of data structure involves derivation of the principal components underlying the feature subset and matrix matching under translation, rotation, and dilation. The computation complexity of evaluating a single feature subset consisting of l features is O(N^2 l), where N is the number of training samples. This could be quite high if N is a large number. In the present study, we propose to evaluate a feature subset based on its capacity to reproduce sample projections on principal axes. The modification here is subtle, but it simplifies the evaluation of a feature subset to an ordinary least-square estimation (LSE) problem, and hence reduces the computation complexity to O(N l^2). Experimental studies show that the LSE-based feature evaluation improves computational efficiency significantly while the feature subset provided leads to classification results similar to those of the data-structural-similarity-based feature evaluation.

The present study is organized as follows. In Section II, the basic idea of identifying critical variables to principal components is introduced, a forward feature-selection algorithm and a backward feature-elimination algorithm are developed, and computation complexity is analyzed. Experimental studies are presented in Section III, and concluding remarks are given in Section IV.

II. SELECTING FEATURE SUBSET THROUGH IDENTIFYING CRITICAL VARIABLES OF PRINCIPAL COMPONENTS
Manuscript received July 7, 2003; revised January 31, 2004 and August 16, 2004. This paper was recommended by Associate Editor V. Govindaraju.
The author is with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore (e-mail: ekzmao@ntu.edu.sg).
Digital Object Identifier 10.1109/TSMCB.2004.843269
1083-4419/$20.00 © 2005 IEEE
340 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B: CYBERNETICS, VOL. 35, NO. 2, APRIL 2005

A. The Basic Idea

Principal components analysis can be considered as a linear transform that maps data from the original measurement space
to a new space spanned by a set of new variables. Assume the linear transform is denoted by matrix W; then a pattern x in the measurement space is represented in the new space by

y = W^T x    (1)

where x = [x1, x2, ..., xn]^T is a pattern in the original n-dimensional measurement space, W is the n-by-d transform matrix whose columns are the principal axes, and y is the d-dimensional projection of x. A feature subset can then be evaluated by its capacity to reproduce the sample projections on a principal axis: the projections are fitted, in the least-squares sense, by a linear model of the variables in the subset, and the fitting error serves as the evaluation criterion (4). If multiple PCs are taken into account, the feature evaluation and selection procedure should be based on minimization of the following combined cost function:

J(S) = J_1(S) + J_2(S) + ... + J_d(S)    (5)

where J_k(S) denotes the least-squares error of reproducing the sample projections on the kth principal axis from the feature subset S.

Since a backward elimination search procedure was employed in the DSP algorithm [20], the number of feature subsets to be evaluated is the same as in the LSE-based backward elimination algorithm. The computation complexity difference between the LSE- and DSP-based algorithms is therefore attributed to the difference in evaluating a feature subset. Assume the feature subset to be evaluated consists of l features, and the number of training samples is N. To find the principal components underlying the feature subset, a singular value decomposition (SVD) of the l-by-l covariance matrix is usually performed. The computation complexity of this SVD decomposition is O(l^3). Reference [20] employed a more efficient implementation, performing SVD on the N-by-l data matrix instead of on the covariance matrix. With this implementation, the computation complexity of deriving the principal components is O(N^2 l). Other computations in the feature evaluation include computing the rotation matrix and dilation factor and matching the two matrices. In most pattern-classification applications, the number of samples is much higher than the number of features, i.e., N >> l and N >> d; thus, the total complexity involved in the data-structure-similarity-based evaluation of a feature subset consisting of l features is given by

O(N^2 l).    (8)

In contrast, the major computation of feature evaluation in our algorithms is the estimation of the model parameters from the training samples using a linear least-square estimation algorithm. Operations involved are multiplication of l-by-N and N-by-l matrices, inversion of an l-by-l matrix, multiplication of l-by-l and l-by-N matrices, and multiplication of l-by-N and N-by-d matrices. The computation complexities of the above operations are O(N l^2), O(l^3), O(N l^2), and O(N l d), respectively. Considering N >> l and N >> d, the computation complexity of our algorithm in evaluating a feature subset consisting of l features is given by

O(N l^2).    (9)

Obviously, (8) is much higher than (9). In other words, the linear LSE-based feature evaluation is much more efficient than the data-structural-similarity- and Procrustes-analysis-based feature evaluation.

Besides sequential forward selection and backward elimination, variants of the two search algorithms, such as the floating search [4] and the orthogonal forward-selection and orthogonal backward-elimination algorithms [9], [10], can also be used. Only the forward feature-selection and the backward feature-elimination algorithms are presented as examples next.

B. The LSE-Based Forward Feature-Selection Algorithm

Finding critical variables to PCs consists of two steps. In the first step of the procedure, the PCs are derived from the full feature set using a principal components analysis procedure, and sample projections on the d PCs corresponding to the d largest eigenvalues are computed. In the second step, a sequential forward-selection algorithm is performed, with the objective of selecting a subset of features that has the most capacity to reproduce the data projections on the PCs. The pseudocode of selecting a feature subset through identifying critical variables of PCs is summarized as follows:

% The LSE-based forward selection algorithm
(1) Initialize S to an empty set;
(2) Initialize R to the full feature set;
(3) Perform principal components analysis on the complete data;
(4) Calculate sample projections on the first d principal axes;
(5) For l = 1 : m                 % m iterations to select m features
      h = number of features in set R;
      For j = 1 : h               % evaluate each variable in R
        Take variable j from set R and add it temporarily to set S;
        For k = 1 : d             % consider d PCs
          Fit a linear model using all the l variables in set S;
          Calculate the fitting error on PC k;
        End
        Calculate the total fitting error;
        Remove variable j from set S;
      End
      Add the variable that leads to the minimum total error to set S;
      Eliminate the variable selected from R;
    End

C. The LSE-Based Backward Feature-Elimination Algorithm

In applications where most of the variables are to be kept, an elimination algorithm is preferred. The elimination algorithm is the combination of the evaluation criteria (4) and (5) with a backward elimination search procedure. The pseudocode of the LSE-based backward elimination algorithm is as follows:

% The LSE-based backward elimination algorithm
(1) Initialize R to the full feature set;
(2) Perform PCA on the complete data and determine the d principal axes to be used;
(3) Calculate sample projections on the first d principal axes;
(4) For i = 1 : n - m            % delete n - m of the n features
      l = number of features in set R;
      For j = 1 : l              % evaluate each variable in R
        Remove variable j temporarily from R;
        For k = 1 : d            % consider d PCs
          Fit a linear model using all the variables in set R;
          Calculate the fitting error on PC k;
        End
        Calculate the total fitting error;
        Add variable j back to R;
      End
      Remove the variable leading to the minimum error inflation from R;
    End

Model fitting is part of the forward-selection and the backward-elimination algorithms. The model fitting is based on the linear LSE algorithm, which demands more data points than variables in the model. In applications where the features outnumber the training samples, the backward-elimination algorithm is not applicable. The forward-selection algorithm can still be applied to such applications, but the number of features that can be selected is limited by the number of training samples.

An important issue in feature selection is the determination of the number of features to be selected.
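As a concrete illustration, the LSE-based subset evaluation and the two search procedures can be sketched in Python. This is an illustrative sketch, not the paper's code: NumPy is assumed, the principal axes are obtained from an SVD of the centered data matrix, and all function names are ours.

```python
import numpy as np

def pc_projections(X, d):
    """Sample projections on the first d principal axes of the centered data."""
    Xc = X - X.mean(axis=0)
    # SVD of the data matrix avoids forming the covariance matrix explicitly
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T                     # shape (N, d)

def fit_error(Xc, subset, Z):
    """Total least-squares error of reproducing projections Z from a subset."""
    A = Xc[:, subset]
    coef, _, _, _ = np.linalg.lstsq(A, Z, rcond=None)
    return float(np.sum((A @ coef - Z) ** 2))

def forward_select(X, d, m):
    """LSE-based forward selection of m features."""
    Z = pc_projections(X, d)
    Xc = X - X.mean(axis=0)
    S, R = [], list(range(X.shape[1]))
    for _ in range(m):
        errs = [fit_error(Xc, S + [j], Z) for j in R]   # try each candidate
        best = R[int(np.argmin(errs))]
        S.append(best)
        R.remove(best)
    return S

def backward_eliminate(X, d, m):
    """LSE-based backward elimination down to m features."""
    Z = pc_projections(X, d)
    Xc = X - X.mean(axis=0)
    R = list(range(X.shape[1]))
    while len(R) > m:
        # remove the feature whose deletion inflates the fitting error least
        errs = [fit_error(Xc, [f for f in R if f != j], Z) for j in R]
        R.remove(R[int(np.argmin(errs))])
    return R
```

Each call to `fit_error` is a single least-squares solve on an N-by-l design matrix, which is the O(N l^2) per-subset cost discussed above; no per-subset SVD of the data is required.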
III. EXPERIMENTS

A. Experiment 1: An Unsupervised Learning Problem

In the first experiment, a synthetic problem was used to test our algorithms. The synthetic data, described by two features x1 and x2, consist of three Gaussian clusters, as shown in Fig. 1. The covariance matrices of the three clusters are all identity matrices, and the three clusters have distinct mean vectors (see Fig. 1). In the experiment, four additional features were generated by adding Gaussian random noise with zero mean and variance at levels 0.2 and 0.4 to x1 and x2. Obviously, among the six features, four are redundant. Besides the four redundant features, ten Gaussian random variables with mean zero and variance one were generated and added to the dataset. The objective here is to identify and discard the ten irrelevant and four redundant features using the LSE-based forward selection and backward elimination algorithms.

In the experiment, tenfold cross validation was performed. In each of the ten repeats, PCA was first performed on 540 training samples, and the number of PCs to be used was determined based on the eigenvalues of the training data matrix. In all ten repeats, two principal axes were selected. Features were then selected based on their abilities to reproduce the data projections on the two principal axes. The error of approximation to the data projections (averaged over the ten repeats) versus the number of features used is plotted in Fig. 2, which clearly shows that two features are sufficient to reproduce the data projections on the first two principal axes well. The forward selection algorithm selected the two features x1 and x2, and the backward elimination algorithm likewise reduced the feature set to these two features. In both algorithms, the irrelevant and redundant features were identified and deleted.

Since the data are synthetic, the correct number of clusters and the true cluster assignments are known, so the clustering performance could be tested. In each of the ten repeats, 540 samples were used to train a two-dimensional self-organizing map (SOM) with three nodes, and the 90 test samples were then assigned to the nearest node. The error rate versus the number of features is shown in Fig. 3. The results again demonstrate that only two features are necessary and that the two features selected are correct.

Fig. 3. Classification results versus number of features selected for Experiment 1.

In the above experiment, the number of PCs to be used was determined as the number of eigenvalues that are larger than the average of the 16 eigenvalues. The influence of the number of PCs on feature selection was also investigated. We found that the curve of approximation error versus number of features might differ when different numbers of PCs were used in feature selection, but this did not affect the features selected, because the error reduction was trivial after two features were selected in all cases.
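The data-generation procedure of Experiment 1, and the eigenvalue rule for choosing the number of PCs, can be reproduced with a short script. The sketch is illustrative: the three cluster means below are placeholders, since their exact values are not legible in this copy, and NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three Gaussian clusters in (x1, x2) with identity covariance.
# The mean vectors are placeholders, not the paper's exact values.
means = [(0.0, 0.0), (6.0, 0.0), (3.0, 6.0)]
per_cluster = 210                     # 3 x 210 = 630 = 540 train + 90 test
informative = np.vstack([rng.normal(loc=m, scale=1.0, size=(per_cluster, 2))
                         for m in means])

# Four redundant features: noisy copies of x1 and x2 (noise variance 0.2, 0.4).
redundant = np.hstack([informative + rng.normal(scale=np.sqrt(v),
                                                size=informative.shape)
                       for v in (0.2, 0.4)])

# Ten irrelevant features: standard Gaussian noise.
irrelevant = rng.normal(size=(informative.shape[0], 10))

X = np.hstack([informative, redundant, irrelevant])   # 16 features in total

# Number of PCs: count of eigenvalues larger than the average eigenvalue.
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))
d = int(np.sum(eigvals > eigvals.mean()))
```

With well-separated cluster means, only the two eigenvalues carrying the cluster structure exceed the average of the 16 eigenvalues, so the rule yields d = 2, matching the two principal axes reported above.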
MAO: IDENTIFYING CRITICAL VARIABLES OF PRINCIPAL COMPONENTS FOR UNSUPERVISED FEATURE SELECTION 343
TABLE II. LSE-BASED RESULTS FOR WBC DATASET
TABLE IV. PA-BASED RESULTS FOR DIABETES DATASET
TABLE V. PA-BASED RESULTS FOR WBC DATASET
B. Experiment 2: Supervised Learning Problems
In Experiment 1, the unsupervised feature selection method
is tested in the paradigm of unsupervised learning. As a matter
of fact, this unsupervised feature-selection algorithm is appli-
cable to supervised learning problems as well. In Experiment 2,
we used three supervised learning problems to test the two algorithms. All three real-world problems are from the UCI Machine Learning Repository [23].
The first example is the Pima Indian diabetes problem. The
dataset contains 768 samples from two classes, where 500
samples are from class 1 and the remaining 268 samples are
from class 2. Each sample is represented by eight features. The
problem posed is to predict whether a patient would test positive
for diabetes according to World Health Organization criteria.
To estimate the classification accuracy, 12-fold cross-validation
was used in the experiment. The reason why 12-fold rather than tenfold was employed is that 768 is divisible by 12, and 12 is close to the commonly used 10.

The second example is the Wisconsin breast cancer (WBC) problem. The dataset contains 699 samples, where 458 are benign samples and 241 are malignant samples. Each sample is described with nine features. The task here is to predict diagnosis results (Benign or Malignant). In this example, tenfold cross validation was used to estimate the classification accuracy.

The third example is the Wisconsin diagnostic breast cancer (WDBC) problem. The WDBC dataset consists of 357 benign samples and 212 malignant samples, with 30 real-valued features. The task here is to predict diagnosis results (Benign or Malignant). Again, tenfold cross validation was used to estimate the classification accuracy.

In the experiment, the k-nearest neighbor (k-NN) classifier was used to classify the samples, and both the LSE-based forward selection and the LSE-based backward elimination algorithms were used to select feature subsets. The results of feature selection and classification for the three datasets are summarized in Tables I–III, respectively, where the accuracy and the number of features are presented in the format of mean value ± standard deviation. For comparison, Procrustes analysis (PA)-based forward selection and backward elimination were also applied to the three datasets, and the results obtained are summarized in Tables IV–VI, respectively.
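The evaluation protocol of Experiment 2, cross-validation in which feature selection is performed on the training folds only and the held-out fold is classified by a nearest-neighbor rule, can be sketched as follows. The sketch is illustrative: a minimal 1-NN stands in for the paper's k-NN, and `select_features` is a placeholder for either of the Section II algorithms (it receives only the unlabeled training data).

```python
import numpy as np

def one_nn_predict(train_X, train_y, test_X):
    """Minimal 1-NN classifier (stands in for the paper's k-NN)."""
    d2 = ((test_X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=2)
    return train_y[np.argmin(d2, axis=1)]

def cross_validate(X, y, n_folds, select_features):
    """k-fold CV: unsupervised feature selection on the training folds only."""
    idx = np.arange(len(y))
    folds = np.array_split(idx, n_folds)
    accs = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        subset = select_features(X[train])          # no labels used here
        pred = one_nn_predict(X[train][:, subset], y[train], X[fold][:, subset])
        accs.append(np.mean(pred == y[fold]))
    # mean and standard deviation, as reported in Tables I-VI
    return float(np.mean(accs)), float(np.std(accs))
```

Using 12 folds for the diabetes data and 10 for the WBC and WDBC data reproduces the fold counts stated above.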