
Analytica Chimica Acta 767 (2013) 35–43

Identification of discriminatory variables in proteomics data analysis by clustering of variables☆

Sadegh Karimi a,b, Bahram Hemmateenejad a,b,*
a Medicinal and Natural Products Chemistry Research Center, Shiraz University of Medical Sciences, Shiraz, Iran
b Chemistry Department, Shiraz University, Shiraz, Iran

Highlights

- A new method was suggested for identification of discriminatory variables.
- The method works based on the clustering of variables (CLoVA).
- CLoVA was used as an efficient method in proteomics data analysis.
- The method was applied successfully in cancer detection.

Graphical abstract: A new method based on the clustering of variables (CLoVA) was proposed for identification of discriminatory variables in proteomic data analysis.

Article history: Received 5 August 2012; Received in revised form 24 December 2012; Accepted 28 December 2012; Available online 8 January 2013.

Keywords: Classification; Proteomics; Clustering of variables; Cancer; Discriminant analysis; Self-organization map

Abstract

This article presents a data analysis method for biomarker discovery in proteomics. In factor analysis-based discriminant models, the latent variables (LVs) are calculated from the response data measured at all employed instrument channels. Since some channels are irrelevant and their responses do not possess useful information, the extracted LVs carry mixed information from both useful and irrelevant channels. In this work, clustering of variables (CLoVA) based on unsupervised pattern recognition is suggested as an efficient method to identify the most informative spectral region, which is then used to construct a more predictive multivariate classification model. In the suggested method, the instrument channels (m/z values) are clustered into different clusters via a self-organizing map. Subsequently, the spectral data of each cluster are separately used as the input variables of classification methods such as partial least squares-discriminant analysis (PLS-DA) and extended canonical variate analysis (ECVA). The proposed method is evaluated by the analysis of two experimental data sets (ovarian and prostate cancer). It is found that the proposed method is able to distinguish cancerous from healthy samples with much higher sensitivity and specificity than conventional PLS-DA and ECVA.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

Ovarian cancer is the most common cancer and it causes more deaths than any other cancer of the female reproductive system [1]. It occurs most often in women aged 50-79; over 70% of the cancers occur after age 50. With early detection and quick treatment, the overall survival rate increases to 95%. However, early-stage diagnosis is unusual. At the present time, only 25% of all ovarian cancers are found at an early stage [2]; this fact confirms the importance of methods which can improve the early detection of ovarian cancer.

☆ Paper presented at the XIII Conference on Chemometrics in Analytical Chemistry (CAC 2012), Budapest, Hungary, 25-29 June 2012.
* Corresponding author at: Chemistry Department, Shiraz University, Shiraz, Iran. Tel.: +98 711 613 7360; fax: +98 711 228 6008. E-mail address: hemmatb@sums.ac.ir (B. Hemmateenejad).

http://dx.doi.org/10.1016/j.aca.2012.12.050

Prostate cancer is one of the most widespread malignancies in men and the second leading cause of death from cancer. Currently, prostate-specific antigen (PSA) is the best tumor marker available for the early detection of prostate cancer. However, PSA lacks specificity; it is detectable in benign cases as well as in those with cancer [3].

Since the development of large-scale mass spectrum profiling, which produces huge numbers of candidate biomarkers, modern statistical research has focused on selecting the important biomarkers related to diagnosable diseases such as prostate and ovarian cancers [4]. Analyzing such large data sets presents various difficulties. One of the most general problems concerns the high throughput of mass spectrometry methods such as SELDI-TOF (surface-enhanced laser desorption/ionization time-of-flight) and MALDI-TOF (matrix-assisted laser desorption/ionization time-of-flight) [5]. Each spectrum generated by these instruments contains tens of thousands of intensity measurements representing an unknown number of protein peaks. The problem with both methods for such studies is that the ratio of the number of spectral variables to the number of available samples may be larger than 1000 [6], which increases the risk of chance correlation. In addition, the unknown complexity of the relationship(s) between the measured mass spectrum profiles and the observed disease states is another problem which should be considered. On the other hand, these techniques efficiently interrogate a biological sample to determine the relative abundance of many protein/peptide sequences simultaneously [7]. Moreover, at present, these methods are the most popular techniques applied for detecting quantitative or qualitative changes of proteins in body fluids including serum, plasma, urine, etc. [8].

For finding potential biomarkers and revealing differences in complex mass spectral patterns of proteins with rather low signal-to-noise ratio, computational analysis is necessary [9]. In this context, pattern recognition analysis has been found to play a considerable role in proteomics for biomarker discovery. Recent publications on cancer classification using MALDI- or SELDI-TOF data sets have focused on identifying biomarkers in serum to distinguish between cancer, benign and normal samples [10-18]. Until now, different feature selection and learning algorithms have been reported in cancer studies (prostate [11,18] and ovarian [10,12,15] cancers). Genetic algorithm (GA) [10], AUC (area under the receiver operating characteristic curve) [11,18], ANOVA [12] and statistical tests [14] have been widely used for feature selection. Statistical methods such as the t-test simply give an estimate of the significance of variables in a regression model. However, in practical applications, they are associated with some limitations; e.g., the significance of a variable in a model, as measured by the t-test, depends on the absence or presence of the other variables in the model [19]. In this context, new and efficient methods for assessing the significance of variables are in demand [20]. As learning algorithms, different clustering and classification methods such as self-organizing map [10,11], decision tree [11,18], discriminant analysis [15], and support vector machine [12] are often employed. In each study, the best m/z values, denoted as discriminatory patterns or proteomic biomarkers, have been identified to categorize cancer samples with very high sensitivity and specificity. However, finding a small number of variables from the pool of thousands of instrumental channels is a difficult task. Furthermore, from these studies it has been found that some identified discriminatory patterns with high sensitivity and specificity are not biologically significant [10]. In addition, the lack of reproducibility of discriminatory patterns across different analysis methods is also a problem [12].

Very recently, we suggested a segmented principal component analysis and regression (SPCAR) approach for multivariate calibration and QSAR studies [21-23]. In those studies, instead of applying PCA to the whole data set, the variables were first clustered into different parts and then PCA was applied to each part separately. So, the main object was to improve the modeling ability of principal component regression. In the present research, we investigated the applicability of this method for variable selection in proteomics data analysis based on classification methods. Similar to SPCAR, the basic principle of the method is clustering of the variables before performing regression or classification analyses; we therefore call it "clustering of variables", with the abbreviation CLoVA. It should be noted that CLoVA is a new extension of SPCAR for use in variable selection. In the present study, CLoVA has been successfully applied for identification of the most significant variables (m/z values) in discrimination and then for obtaining more discriminating models. As classification methods, partial least squares-discriminant analysis (PLS-DA) and extended canonical variates-discriminant analysis (ECVA-DA) [24] have been employed. Successful application of the suggested method for cancer detection based on proteomics data confirms the superiority of our method with respect to the existing ones.

2. Theory

2.1. Notations

The standard chemometrics notations will be used. Capital and lowercase letters in boldface denote matrices and vectors, respectively. Lowercase italic letters denote scalars. The response data matrix X is of size (m x n), where m and n are the number of objects (samples) and variables (m/z values), respectively; each row is the mass spectrum of a studied sample. The column vector of class information (predictor variable) is denoted by y. The response data matrix is divided into different sub-matrices (clusters); each sub-matrix is denoted by Xi, and the symbol '|' is used to show clustering of X so that X = [X1|X2|...|Xq], where q is the number of clusters.

2.2. CLoVA algorithm

From the classification point of view, the X-variables can be partitioned into two parts: (i) those having information about the class variable (cancerous or healthy samples) and (ii) those that are irrelevant and whose information is not related to the class information. Various efforts have been made to propose algorithms for selecting the useful part of the data that is suitable for classification purposes, but it is difficult to separate these parts from each other completely. On the other hand, the informative variables, which possess high correlation with the y-variable, are also correlated with each other and can be considered as collinear variables. The irrelevant variables that are not correlated with the class information can form another set of collinear variables. Nevertheless, if PCA is applied to the original data matrix including all variables, the extracted PCs, which are used as input of the discriminant model, will carry mixed information from both irrelevant and relevant variables. CLoVA attempts to find clusters of variables based on the interrelations between them and to use the variables of each cluster as input of separate classification models.

The CLoVA method is composed of two main steps:
(1) Clustering of the variables (here m/z values) into q clusters (sub-matrices) using an unsupervised clustering method such as PCA, self-organizing map and so on:

X = [X1|X2|...|Xq]   (3)

Each sub-matrix Xi is composed of the subset of variables clustered in one cluster. The number of clusters should be optimized through model development.
(2) Using the variables of each cluster to build a separate classification model by, e.g., PLS-DA or ECVA.
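The two steps above can be sketched in code. The following is a minimal pure-Python illustration, not the authors' MATLAB implementation: a toy one-dimensional SOM clusters the columns (variables) of X, and each resulting sub-matrix Xi would then be fed to a separate classifier such as PLS-DA. The function names, the 1-D neighborhood, and the fixed learning schedule are all simplifying assumptions.

```python
import random

def som_cluster_variables(X, q, epochs=50, seed=0):
    """Step 1 of CLoVA: cluster the COLUMNS (variables) of X into q clusters
    with a toy one-dimensional SOM. X is a list of rows (samples x variables);
    each variable's column profile is the input pattern presented to the map."""
    rng = random.Random(seed)
    m = len(X)                                   # number of samples
    cols = [[row[j] for row in X] for j in range(len(X[0]))]
    # one weight vector per neuron, same dimension as a variable profile
    w = [[rng.random() for _ in range(m)] for _ in range(q)]

    def winner(c):
        # neuron whose weight vector is closest (squared Euclidean) to the pattern
        return min(range(q), key=lambda k: sum((w[k][i] - c[i]) ** 2 for i in range(m)))

    for t in range(epochs):
        lr = 0.5 * (1.0 - t / epochs)            # decaying learning rate
        for c in cols:
            win = winner(c)
            for k in (win - 1, win, win + 1):    # winner plus its 1-D neighbours
                if 0 <= k < q:
                    g = 1.0 if k == win else 0.5
                    for i in range(m):
                        w[k][i] += lr * g * (c[i] - w[k][i])

    clusters = [[] for _ in range(q)]
    for j, c in enumerate(cols):
        clusters[winner(c)].append(j)            # store variable indices
    return clusters

def submatrix(X, var_idx):
    """Step 2 input: the sub-matrix X_i holding one cluster's variables,
    to be passed to a separate classifier (PLS-DA or ECVA in the paper)."""
    return [[row[j] for j in var_idx] for row in X]
```

Note that the SOM here acts on variables, not samples; the paper uses a full (p x p) grid, so `q` would be p squared.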

Thus, in CLoVA, a clustering method is followed by a regression or a classification method. Here, the self-organizing map has been used as the clustering method, and PLS-DA and ECVA-DA have been employed as classification methods. For more details about this algorithm, interested readers can refer to our previous publications [21-23]. All the chemometric methods used in this work are well known; however, in order not to prolong the manuscript, descriptions of these algorithms are given in the supporting information.

Self-organizing map (SOM) [25] is an interactive and iterative technique used to map multivariate data. The maps, which consist of a layer of neurons arranged in a two-dimensional grid, should be able to reflect the relationships that exist between sample points in the original space. The interesting point is that, during the training step, the network adapts itself in such a way that similar inputs (here, variables) are associated with topologically close neurons in the network. This method can be considered an unsupervised clustering method. The self-organizing map projects the input data onto a bi-dimensional regular array (grid) of nodes of size (p x p), where p is defined as the network size. Each neuron in the grid has a weight vector of the same dimension as the input patterns. An input vector is presented to each neuron in the network, and the neuron whose weight vector is most similar to the input pattern is selected as the winner. The weight of the winning neuron is then adjusted so as to become more similar to the input pattern. The neurons neighboring the winner also adjust their weights, but to a lesser degree. Finally, when the process is completed, similar input vectors (variables) are clustered together based on their similarities. Note that we have applied the self-organizing map to cluster the variables, not the objects. Hence, the variables in the original data matrix carrying similar information (in our case, similar spectral information regarding the sample classes) are projected onto one node or neighboring nodes. During the computation of the map, the variables are projected onto the self-organizing map's array and the model vectors are adapted according to the learning rule. The output of the self-organizing map is a distribution pattern of the variables over a map of (p x p) nodes. The variables falling in each node can be collected to form a sub-matrix Xi of the original X matrix. Thus, the number of clusters equals the square of the network size (p^2).

2.3. Selectivity ratio (SR)

Among the different criteria suggested for identification of the most important discriminatory variables, the selectivity ratio (SR) plot has recently been reported by Rajalahti et al. as a promising method [6,20]. SR is closely connected to the ratio of inter- to intra-group variation and is a measure of a variable's ability to separate groups. SR can reveal regions in spectral profiles with both high explanatory and high predictive significance for the investigated response. The explained (v_expl,j) and residual (unexplained, v_res,j) variance for each variable j in the target projection (TP) model can be calculated from Eq. (1):

X = X_TP + E_TP = t_TP p_TP^T + E_TP   (1)

The ratio between the explained and residual variance (Eq. (2)) defines a selectivity ratio SR for each variable:

SR_j = v_expl,j / v_res,j,   j = 1, 2, 3, ...   (2)

A high SR value means that the variable in question has a strong (predictive) correlation with the response; that is, the variable is highly selective. The limit between spectral regions containing marker candidates and less important regions is chosen by the user. In this work, the F-test criterion has been chosen to select the marker candidates.

3. Experimental

3.1. Data sets

To illustrate the performance of the proposed algorithm, two cancer data sets (ovarian and prostate) have been analyzed. Both data sets were downloaded from the Clinical Proteomics Program Databank (http://home.ccr.cancer.gov/ncifdaproteomics/ppatterns.asp). For the ovarian data set, we have analyzed the third ovarian data set on this website (8-7-02). It is composed of the low-resolution MALDI-TOF mass spectral proteomic profiles of blood serum for 162 biopsy-proven ovarian cancer patients and 91 unaffected women (controls). In order to have a more balanced data set, only the first 100 numbered patients (of the cancer group) have been selected. Each proteomic profile, or spectrum, consists of 15,154 distinct m/z values ranging from 0 to 20,000.

The prostate data set is composed of mass spectral profiles consisting of 15,154 SELDI-TOF m/z ratios from 321 samples (69 patients diagnosed with malignant prostate cancer, 189 patients with benign prostate hyperplasia and 63 controls) [26].

3.2. Computation

Data analyses have been performed in the MATLAB environment (MathWorks, Inc., Natick, MA, USA, version 7.2). PLS-DA calibration is based on the PLS Toolbox version 4 from Eigenvector Research. ECVA has been performed with the MATLAB models available at http://www.models.life.ku.dk. The self-organizing map toolbox, provided by Todeschini and Ballabio, was downloaded from the website of the Milano Chemometrics and QSAR research group (http://michem.disat.unimib.it/chm/download/kohoneninfo.htm).

3.3. Training and test set selection

The data sets have been divided into training and external test sets by the sample set partitioning based on joint x-y distances (SPXY) algorithm, so that about 30% of each group has been selected as the test set and the remainder has been used as the training set to build the classification model. This algorithm, which considers the distances in both the dependent and independent variable spaces, has been explained in detail by Galvao et al. [27]. This procedure ensures that representative samples, internal to the data domain, are selected as test objects.

3.4. Preprocessing

PCA, PLS-DA and ECVA are scale-dependent methods. In addition, variable clustering by the self-organizing map is also affected by the scaling method, since the self-organizing map does not use simple correlation to measure the similarity between variables. The disadvantage of auto-scaling for optical spectroscopy data is the exaggeration of the importance of noisy wavelength regions, which may mask the effects of interest. Pareto scaling [28] has become prevalent for NMR and MS data sets. It can be regarded as a method between the extremes of no scaling and unit-variance scaling: each spectral variable is first mean-centered and then divided by the square root of its standard deviation. In our study, Pareto scaling has been applied to both data sets prior to the multivariate data analyses. We have also applied orthogonal signal correction to each cluster before application of the discriminant analysis methods.

4. Results and discussion

The MALDI-TOF spectra of the ovarian data set and the SELDI-TOF spectra of the prostate samples are represented in Figs. S1 and S2 of the supporting information, respectively. One can observe that all samples of each data set exhibit similar spectra, so that discrimination between healthy and cancerous samples by visual inspection of the spectra is impossible. In this section, the results of the analysis of these data sets by the CLoVA method followed by discriminant analysis methods (PLS-DA and ECVA) are explained separately and then compared with the results of conventional discriminant analysis methods. In the following sections, we will show how CLoVA helps us to discriminate the healthy and cancerous samples and to determine the most significant variables for biomarker discovery.

As the clustering method in CLoVA, we used the self-organizing map in this study, since in a previous study [21] we found that the self-organizing map outperformed PCA. The latter is considered a linear clustering method whereas the self-organizing map is a nonlinear one. Thus, variables having nonlinear relationships with others (such as quadratic, logarithmic, etc. relationships) are considered as neighboring variables in the self-organizing map pattern. In addition, according to a parallel study in our group on the comparison of the performance of 5 different clustering methods in CLoVA (the results of which will be published elsewhere), it was found that the self-organizing map usually resulted in better modeling power. So, in this study we preferred to use the self-organizing map as the inner clustering method in CLoVA.

4.1. Data set 1 (ovarian cancer data set)

4.1.1. CLoVA analysis of ovarian data set
The first step in CLoVA analysis is to find clusters of variables. The self-organizing map is unsupervised like PCA, but it is a nonlinear method in contrast to PCA; both can be employed in CLoVA [21-23]. Moreover, the results of the self-organizing map depend on the initial weights. Previously, we have shown that the self-organizing map, as a nonlinear clustering method, is more efficient than PCA; therefore, in the present study, we used the self-organizing map to find clusters of variables. The self-organizing map produces a distribution map of the variables in a (p x p) array of nodes; consequently, the total number of clusters is p^2. The number of clusters in CLoVA should be optimized. Theoretically, the number of clusters can vary between 1 and nV, where nV is the number of variables. If the number of clusters is set to 1, all variables are used as input of the classification method, which is the conventional discriminant analysis approach. On the other hand, a number of clusters equal to nV resembles stepwise variable selection, where all variables are examined separately. In real applications, the number of clusters can be optimized by sequentially increasing p, followed by discriminant analysis, to find a model with the desired level of acceptability (the least prediction error). Here, we have examined network sizes from 2 to 9. The distribution of variables in the self-organizing map of network size p = 2 is presented for the ovarian data set as an example (Fig. S3) in the supporting information. The clusters in the self-organizing map are arranged in a bi-dimensional (matrix-like) pattern. Each cluster is denoted as Si,j, where i and j are the row and column numbers of the cluster in the self-organizing map, respectively. The variables of each cluster have been used as input of separate discriminant analysis models (PLS-DA or ECVA-DA).

4.1.2. PCA of ovarian data set
PCA [29] has been performed to obtain an overview of the data. The spectra span the mass range from 0 to 20,000 Da (15,154 m/z channels). The results of applying PCA to the spectral data matrix of the ovarian samples (whole region) are given in Table S1 for the first 10 principal components (PCs). The two-dimensional scores plot of the first and second PCs, which explain 30.39% and 28.56% of the total spectral variation, respectively, is shown in Fig. S4a. This figure indicates the relative positions of the samples with respect to each other based on the similarity between their MALDI-TOF spectra. Due to the high similarity between the MALDI-TOF spectra of the ovarian samples, there is no evidence of separation between the two classes along the first two principal axes and the groups are severely overlapping. In other words, much of the spectral variation is unrelated to class differences and the model is not practical. Extraction of further principal components did not provide any separation either.

However, with CLoVA variable analysis before PCA, better class separation has been observed in the two-dimensional spaces of factor scores. One representative plot is shown in Fig. S4b. The scores of this plot have been calculated from cluster S1,2 of the self-organizing map model of p = 2. Accordingly, a much better separation of the classes is observed. This suggests that the variables clustered in cluster S1,2 possess better spectral information regarding the ovarian cancer biomarkers.

4.1.3. Classification by PLS-DA
Firstly, PLS-DA has been run on all m/z variables. To reduce the number of components resulting from the large variations in X which are not related to the Y space, the orthogonal signal correction (OSC) algorithm [30,31] suggested by Wold et al. has been used. The number of significant components has been determined using leave-many-out cross-validation (LMO-CV), with 1/8th of the data being excluded during each round. The changes in the classification error as a function of the number of PLS latent variables are shown in the supporting information (Fig. S5). This yielded a five-component PLS model as the optimum, which was then used to predict the class variable of the prediction samples. Fig. 1a shows the two-dimensional score plot of PLS-DA on the whole data set. It is observed that PLS-DA presents a partial discrimination between normal and cancerous samples, with some degree of overlap. The classification parameters for the calibration and prediction samples are shown in the first row of Table 1. From these results it is evident that conventional PLS-DA failed to completely differentiate between healthy and cancerous persons. Similar to PCA, it can be concluded that not all of the used m/z channels possess useful information about the class of the ovarian samples. On the other hand, many parts of the MALDI-TOF data are highly collinear and contain similar spectral information. Therefore, CLoVA analysis of the variables has been performed before running PLS-DA.

Eight self-organizing map networks with node sizes of 2-9 were checked (see Fig. S3 in the supporting information as an example). The numbers in the clusters, which represent the m/z values of the MALDI-TOF spectra, are shown in Fig. S1. The best classification results (lowest number of misclassified samples) have been obtained when the node number is 2, and hence the results of this network size will be presented and explained. The self-organizing map pattern (Fig. S3) indicates that the distribution of these m/z values is not homogeneous. Obviously, some clusters (S2,1 and S2,2 of the 2 x 2 network) contain a high population of m/z values whereas others (S1,1 and S1,2 of the 2 x 2 network) include limited numbers. The interesting aspect of clustering methods such as the self-organizing map is that m/z values with similar information are collected into specific clusters. The m/z values of each cluster form a sub-matrix (Xi) of the original variable data matrix (X). Each sub-matrix has been separately subjected to PLS-DA to find the most suitable m/z values, which are useful for classification. Some statistical parameters of the PLS-DA models obtained from the variables of each cluster are listed in Table 1. In addition, the two-dimensional PLS score plots of the different clusters are given in Fig. 1b-e. Apparently, the result of cluster S2,1 leads to superior

[Figure 1 appears here; each panel plots PLS component 1 vs. PLS component 2.]

Fig. 1. Distribution pattern of the ovarian samples in the two-dimensional PLS-DA-based factor space of their MALDI-TOF spectra: (a) whole spectral region and (b-e) CLoVA-based PLS-DA of different clusters of the (2 x 2) self-organizing map model: (b) cluster S1,1, (c) cluster S2,1, (d) cluster S1,2 and (e) cluster S2,2. Cancer and control samples are shown with different markers; filled markers are calibration and open markers are prediction samples.

discrimination information than the full data. Interestingly, a very clear separation is observed between the healthy and cancerous samples in the score plot of S1,2, which suggests that the variables of this cluster carry the largest discriminating information. The PLS-DA model of this cluster has a non-error rate of 1 for both the calibration and test samples (see Table 1), which means that this model gives perfect predictions for all samples.

In addition to producing a model with zero misclassification error, and similar to other variable selection methods, CLoVA offers two other interesting features: (1) using a much lower number of variables (902 m/z variables in cluster S1,2 compared to the total number of 15,154 variables), it is possible to build a faster pattern recognition model, and (2) it identifies the m/z channels that are relevant for the investigated classes of samples.
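The model-selection loop used above (network sizes p = 2-9, one discriminant model per cluster, retaining the cluster with the lowest classification error) can be sketched as follows. Here `cluster_fn` and `error_fn` are hypothetical stand-ins for the SOM clustering step and the cross-validated PLS-DA error, respectively; the paper's actual computations are done in MATLAB.

```python
def select_best_cluster(X, y, cluster_fn, error_fn, p_values=range(2, 10)):
    """Grid-search the SOM network size p (2-9 in the paper): for each p the
    variables are split into p*p clusters, one discriminant model is built per
    cluster, and the cluster giving the lowest error is retained.
    Returns (error, p, cluster_index, variable_indices)."""
    best = None
    for p in p_values:
        clusters = cluster_fn(X, p * p)          # stand-in for the SOM step
        for ci, var_idx in enumerate(clusters):
            if not var_idx:                      # skip empty SOM nodes
                continue
            # sub-matrix X_i containing only this cluster's variables
            Xi = [[row[j] for j in var_idx] for row in X]
            err = error_fn(Xi, y)                # stand-in for PLS-DA CV error
            if best is None or err < best[0]:
                best = (err, p, ci, var_idx)
    return best
```

With the ovarian data, this search selected the (2 x 2) network and cluster S1,2; any partitioning and error functions with the signatures above can be plugged in.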

Table 1
Number of m/z values (N), number of latent variables used in each model (LV)a, non-error rate (NER, i.e., percentage of correctly assigned samples)b,c, and class sensitivity (Sn)d and specificity (Sp)e achieved for the ovarian cancer data set.

Method                      N       LVa  NERcalb  NERprec  Normal Sn (Cal/Pred)  Normal Sp (Cal/Pred)  Cancer Sn (Cal/Pred)  Cancer Sp (Cal/Pred)
PLS-DA                      15,154  5    0.8557   0.8617   0.9000/0.9512         0.8085/0.8077         0.8085/0.8077         0.9512/0.9000
CLoVA based-PLSDA (S1,1)    233     2    0.8247   0.7979   0.8200/0.9024         0.7898/0.7308         0.7898/0.7308         0.9024/0.8200
CLoVA based-PLSDA (S1,2)    907     2    1.000    1.000    1.000/1.000           1.000/1.000           1.000/1.000           1.000/1.000
CLoVA based-PLSDA (S2,1)    2943    3    0.9381   0.9149   0.9800/0.9756         0.8936/0.8846         0.8936/0.8846         0.9756/0.9800
CLoVA based-PLSDA (S2,2)    11,076  5    0.8765   0.8723   0.8800/0.9024         0.8723/0.8654         0.8723/0.8654         0.9024/0.8800

a Number of PLS latent variables.
b NER for calibration set.
c NER for prediction set.
d Class sensitivity (Sn) describes the model's ability to correctly recognize samples belonging to the gth class, i.e., if all the samples belonging to g are correctly assigned, Sn is equal to 1.
e Class specificity (Sp) describes the model's ability to reject samples of all the other classes from class g, i.e., if samples not belonging to g are never assigned to g, Sp is equal to 1.
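The figures of merit reported in Table 1 follow the footnote definitions; for a two-class problem they can be computed from true and predicted class labels as in this minimal sketch (the function name is illustrative):

```python
def classification_merits(y_true, y_pred, positive):
    """Figures of merit used in Table 1 for a two-class problem:
    NER = fraction of correctly assigned samples,
    Sn  = ability to recognize members of the 'positive' class,
    Sp  = ability to reject samples of the other class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    ner = (tp + tn) / len(y_true)   # non-error rate
    sn = tp / (tp + fn)             # class sensitivity
    sp = tn / (tn + fp)             # class specificity
    return ner, sn, sp
```

For a binary problem the sensitivity of one class equals the specificity of the other, which is why the Normal and Cancer columns of Table 1 mirror each other.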

Table 2
Non-error rate (NER), i.e., percentage of correctly assigned samples, achieved with different classification methods for the prostate data set.

Method                             LVa  NERcalb  NERprec
PLS-DA                             8    0.7634   0.7320
CLoVA based-PLSDA (S3,2 of p = 3)  4    0.9141   0.8763
CLoVA based-ECVA (S3,3 of p = 5)   26   1.000    0.9287

a Number of PLS latent variables.
b NER for calibration set.
c NER for prediction set.

Once the classification of cancerous and healthy samples has been achieved by CLoVA-based PLS-DA, it has been used to identify the most significant variables, which is of special importance in biomarker discovery. Fig. 2 compares the SR plot of the PLS-DA analysis of the whole spectral region with that of the spectral variables included in cluster S1,2 of the (2 × 2) self-organizing map network. Noticeably, the number of significant variables identified by the SR plot for CLoVA-based PLS-DA (15; Table (S2)) is much lower than that found for conventional PLS-DA (872). The simpler pattern of the SR plot obtained by CLoVA-based PLS-DA is more convenient for biomarker discovery. Selection of important variables based on the SR plot and the CLoVA-based PLS-DA approach (Fig. 2b) provides one main m/z region for biomarker candidates (25–1000). It should be noted that, since there are 97 samples in the calibration set, the critical F-value at the 0.95 probability level is 1.41. With this criterion, many false biomarker candidates have been removed. The SR actually combines predictive power (regression coefficients) with explanatory power (variance/covariance between spectral variables). Noticeably, many of the largest coefficients are associated with m/z values below 1000, which is in good agreement with previous findings [32].

Fig. 2. SR plot for identification of discriminatory variables of the ovarian data set: (a) PLS-DA of the whole spectral region and (b) CLoVA-based PLS-DA (cluster S1,2 of the (2 × 2) self-organizing map model).

4.2. Data set 2 (prostate cancer data set)

This data set is more complex than the ovarian data set. The number of samples is much larger than in the previous case. In addition, there are three classes, comprising 63 normal, 69 cancer and 189 benign samples, which makes the prostate data set more complex than the ovarian one. Thus, finding the subset of m/z values related to the class information is not so simple.

In the same manner as in the ovarian data set study, the m/z variables have been subjected to CLoVA analysis by self-organizing map. Network sizes of 2–9 have been investigated, and the same convention has been applied to denote the clusters in the self-organizing map.

4.2.1. PCA of prostate data set

The results of the PCA of the prostate samples are given in Table (S3). The PCA has been applied to the whole region of the spectral data matrix, and the results are presented for the first 10 principal components (PCs). The eigenvalue (EV) of each PC, the percent of the data variance explained by each PC (PV) and the cumulative percent of variance (CPV) are reported. Table (S3) reveals that the first two principal components explain about 60% of the total spectral variation. In other words, by projecting the 15,154-dimensional spectra into the two-dimensional factor space, about 60% of the information is retained. As seen in Fig. S6a, severe overlap between the classes is observed. PCA is an unsupervised method: the PCs are calculated only from the data matrix (X) and do not use the class information (Y). Consequently, they are not necessarily the components relevant for discrimination. In addition, the whole region has been used for extracting the PCs and, as shown, the information of the whole region is not useful for discrimination. However, better separation of the data has been achieved by application of the CLoVA concept. As shown in Fig. S6b, by applying PCA to the variables in cluster S3,2 of network size 3, some degree of improvement in the scatter plot of the PCs can be seen.

4.2.2. Classification by PLS-DA

The PLS-DA model has been developed using 224 training samples selected by the SPXY algorithm. The changes in the classification error (using 10-segment contiguous-block CV) as a function of the number of PLS latent variables are depicted in the supporting information (Fig. S7). The lowest misclassification error of cross-validation has been obtained with 8 PLS latent variables, at which the errors of calibration and cross-validation are very similar. Thus, an 8-component PLS model has been used for class prediction of the test set samples. The non-error rates (NER, % of correctly assigned samples) for the calibration and prediction samples are shown in the first row of Table 2; the specificity and sensitivity achieved on all samples are shown in Table 3. The distribution of the samples in the three-dimensional PLS score space is shown in Fig. 3a. Whilst the class separations are better than those observed in the score space of the PCs, the classification quality is not high at all. There is a high degree of overlap between the benign and cancerous samples.

In the next step, the CLoVA method has been employed to increase the degree of classification by selecting the most informative m/z variables. PLS-DA analysis of the size-3 network clusters has produced the best classification results for both calibration and prediction. By analyzing the different clusters of the 3 × 3 self-organizing map, the cluster S3,2 (possessing 1893 m/z variables out of 15,154; about 90% reduction in the number of variables) resulted in better sensitivity and selectivity. The distribution of the sample points in the three-dimensional PLS score space of the S3,2 variables is presented in Fig. 3b. Clearly, there is an improvement in sample discrimination with respect to conventional PLS-DA (Fig. 3a). Also, the statistical parameters in the second row of Table 2 confirm the superiority of the CLoVA-based PLS-DA model in predicting the class variables of the prostate data set. However, the reported data in Table 2

Table 3
Specificity (Sp)a and sensitivity (Sn)b achieved by different classification methods for prostate data set.

Subset  Method                                Normal            Benign            Cancer
                                              Sn       Sp       Sn       Sp       Sn       Sp
Cal     PLS-DA                                0.8611   0.9181   0.7391   0.8720   0.7600   0.7931
        CLoVA-based PLS-DA (S3,2 of p = 3)    1.000    0.9894   0.9348   0.9302   0.8400   0.9483
        CLoVA-based ECVA (S3,3 of p = 5)      1.000    1.000    1.000    1.000    1.000    1.000
Pred    PLS-DA                                0.8889   0.9086   0.7255   0.7391   0.5263   0.7846
        CLoVA-based PLS-DA                    0.9630   0.900    0.8627   0.8913   0.7895   0.8620
        CLoVA-based ECVA                      1.000    1.000    0.9020   0.9565   0.8947   0.9359

a Class sensitivity (Sn) describes the model's ability to correctly recognize samples belonging to the gth class, i.e. if all the samples belonging to g are correctly assigned, Sn is equal to 1.
b Class specificity (Sp) describes the model's ability to reject samples of all other classes from class g, i.e. if samples not belonging to g are never assigned to g, Sp is equal to 1.

explain that, in spite of the significant improvement in classification accuracy achieved by CLoVA, the suggested model is still associated with small misclassification errors (especially for the test set samples) due to the complexity of the investigated data set.

4.2.3. Classification by ECVA

It has been shown previously that ECVA-DA performs better than PLS-DA in different classification problems [33–35]. So, to improve the classification results for the prostate data set, the combination of CLoVA and ECVA has been performed. In the CLoVA step, among the different numbers of nodes examined in the self-organizing map network (2–9 nodes), a node number of 5 led to the best classification results (the least classification error). It should be noted that the number of PLS components used in the inner relation of ECVA has been obtained by 10-segment contiguous-block CV, similar to that used for PLS-DA (see Fig. S8 of the supporting information). By ECVA-DA analysis of the variables in the different clusters of the (5 × 5) self-organizing map, the cluster S3,3 resulted in zero misclassification error for calibration and cross-validation. The variables of this cluster were also optimum for the test samples, for which the least misclassification error has been obtained.

The distribution of the samples in the two-dimensional extended canonical variates space (calculated from the variables of cluster S3,3 of the 5 × 5 self-organizing map) is shown in Fig. 4. A comparison with PLS-DA and CLoVA-based PLS-DA confirms the larger separations between the data points of the different groups obtained by CLoVA-based ECVA-DA. The statistical parameters of the suggested CLoVA-based ECVA-DA model are given in the last rows of Tables 2 and 3. The reported data suggest that this model has the best sensitivity and selectivity with respect to PLS-DA and CLoVA-based PLS-DA. However, some misclassification errors are still observed for the test samples. All normal samples have been correctly assigned, but a few benign and cancerous samples have been predicted interchangeably.

Fig. 3. Distribution pattern of the prostate samples in the three-dimensional PLS-DA factor space of their SELDI-TOF spectra: (a) whole region and (b) cluster S3,2. (, ) control, (夽, ⋆) benign and (▽, ) cancer; filled and open markers denote calibration and prediction samples, respectively.

Fig. 4. Distribution pattern of the prostate cancer samples in the two-dimensional extended canonical variate (ECVA) space of their SELDI-TOF spectra for cluster S3,3 of network size n = 5. The markers are the same as described in Fig. 3.

Interestingly, a close look at the extended canonical variates shows how these variables result in such class predictions. The number of canonical directions is always one less than the number of groups in the data set. For the 3-group case, this means that the solution is two-dimensional. As shown in Fig. S9, the canonical variates in the first direction clearly show the discrimination of the normal samples from the other groups, whereas those in the second direction show the discrimination of the cancer and benign samples. For example, samples that have positive variates in the first direction and small negative variates in the second direction are normal, and those that have negative variates in both directions are the most probable benign samples. On the other hand, samples whose canonical variates in the first and second directions are

positive can be considered as the most probable cancerous samples.

Now it is time to identify the discriminating m/z variables for prostate cancer. We base our discussion on the CLoVA-based ECVA-DA, from which the least classification errors have been obtained. In the same manner as for the ovarian data set, the SR plot and F-ratio statistics have been used to identify the significant variables. As mentioned previously, ECVA uses a PLS model in the inner relation in order to solve the collinearity in the data matrices. We used the regression vector of this PLS model for calculating the SR plot. The selectivity ratio plots for the whole region of the SELDI-TOF spectra of the prostate samples and for the selected cluster S3,3 are shown in Fig. 5. Based on the SR criteria for cluster S3,3, three main regions (450–460, 504–505 and 7160–7384 m/z) are responsible for the class separation of the prostate samples. Using the critical F-value (1.25), 172 m/z channels (out of the 1832 m/z channels of cluster S3,3 and the 15,154 m/z channels of the total variables) have been selected as the most significant discriminatory variables. The selected m/z values are reported in Table (S4) of the supporting information.

Fig. 5. The selectivity ratio plot of the CLoVA-ECVA model of the SELDI-TOF spectra of the prostate data set for (a) cluster S3,3 of network size 5 and (b) the whole region.

5. Conclusion

The results presented in the current study demonstrate that it is possible to split the information in the scores and canonical variates of PLS-DA and ECVA into informative and irrelevant variables. By selecting the informative parts, a multivariate pattern recognition model is obtained which has higher prediction ability with respect to conventional PLS-DA and ECVA. In this method, called CLoVA-based PLS-DA and CLoVA-based ECVA, the original variables are clustered into different clusters by self-organizing map, and PLS-DA and ECVA are then performed on the variables of each cluster separately. The advantage of the presented approach is that it uses a small part of the m/z values (based on the self-organizing map clustering) instead of the entire proteomics profile (15,154 m/z values) for the analysis. The performance of this method has been validated by analysis of the MALDI-TOF spectra of the ovarian data set (for classification of normal versus cancerous samples) and the prostate data set (for classification of normal versus benign and cancerous samples). For the first data set (ovarian cancer), CLoVA-based PLS-DA reveals better results than conventional PLS-DA, whilst for the prostate data set, which is more complex than the first one, CLoVA-based ECVA exhibits better prediction results than both PLS-DA and CLoVA-based PLS-DA. Identification of the most informative m/z regions by CLoVA-based PLS-DA and CLoVA-based ECVA is simpler and more straightforward than with conventional pattern recognition methods, because one can easily focus on the selected regions instead of the whole region. The proposed method can be considered as an alternative to variable selection methods for PLS-DA and ECVA-DA.

Acknowledgment

Financial support of this project by the Shiraz University Research Council is appreciated.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.aca.2012.12.050.

References

[1] F. Kong, C. Nicole White, X. Xiao, Y. Feng, C. Xu, D. He, Z. Zhang, Y. Yu, Oncology 100 (2006) 247–253.
[2] J.M. Schildkraut, W.D. Thompson, Am. J. Epidemiol. 128 (1988) 456–466.
[3] D.L. Meany, Z. Zhang, L.J. Sokoll, H. Zhang, D.W. Chan, J. Proteome Res. 8 (2008) 613–619.
[4] D. Donald, T. Hancock, D. Coomans, Y. Everingham, Chemom. Intell. Lab. Syst. 82 (2006) 2–7.
[5] L. Chen, Chemom. Intell. Lab. Syst. 84 (2008) 123–130.
[6] T. Rajalahti, R. Arneberg, F.S. Berven, K.M. Myhr, R.J. Ulvik, O.M. Kvalheim, Chemom. Intell. Lab. Syst. 95 (2009) 35–48.
[7] N. Jeffries, Bioinformatics 21 (2005) 3066–3077.
[8] R. Tibshirani, T. Hastie, B. Narasimhan, S. Soltys, G. Shi, A. Koong, Q.T. Le, Bioinformatics 20 (2004) 3034–3044.
[9] E.F. Petricoin, L.A. Liotta, Clin. Chem. 49 (2003) 533–534.
[10] E.F. Petricoin III, A.M. Ardekani, B.A. Hitt, P.J. Levine, V.A. Fusaro, S.M. Steinberg, G.B. Mills, C. Simone, D.A. Fishman, E.C. Kohn, Lancet 359 (2002) 572–577.
[11] Y. Qu, B.L. Adam, Y. Yasui, M.D. Ward, L.H. Cazares, P.F. Schellhammer, Z. Feng, O.J. Semmes, G.L. Wright Jr., Clin. Chem. 48 (2002) 1835–1843.
[12] G. Mor, I. Visintin, Y. Lai, H. Zhao, P. Schwartz, T. Rutherford, L. Yue, P. Bray-Ward, D.C. Ward, Proc. Natl. Acad. Sci. U.S.A. 102 (2005) 7677–7682.
[13] T.C.W. Poon, T.T. Yip, A.T.C. Chan, C. Yip, V. Yip, T.S.K. Mok, C.C.C.Y. Lee, T.W.T. Leung, S.K.W. Ho, P.J. Johnson, Clin. Chem. 49 (2003) 752–760.
[14] A. Valerio, D. Basso, S. Mazza, G. Baldo, A. Tiengo, S. Pedrazzoli, R. Seraglia, M. Plebani, Rapid Commun. Mass Spectrom. 15 (2001) 2420–2425.
[15] W. Zhu, X. Wang, Y. Ma, M. Rao, J. Glimm, J.S. Kovach, Proc. Natl. Acad. Sci. U.S.A. 100 (2003) 14666–14671.
[16] Y. Hu, S. Zhang, J. Yu, J. Liu, S. Zheng, Breast 14 (2005) 250–255.
[17] T.D. Veenstra, D.R.A. Prieto, T.P. Conrads, Drug Discov. Today 9 (2004) 889–897.
[18] B.L. Adam, Y. Qu, J.W. Davis, M.D. Ward, M.A. Clements, L.H. Cazares, O.J. Semmes, P.F. Schellhammer, Y. Yasui, Z. Feng, Cancer Res. 62 (2002) 3609–3614.
[19] D.L. Massart, B.G.M. Vandeginste, L.M.C. Buydens, Handbook of Chemometrics and Qualimetrics, Elsevier Science, Amsterdam, 1997.
[20] T. Rajalahti, R. Arneberg, A.C. Kroksveen, M. Berle, K.M. Myhr, O.M. Kvalheim, Anal. Chem. 81 (2009) 2581.
[21] B. Hemmateenejad, S. Karimi, J. Chemometr. 25 (2011) 139–150.
[22] B. Hemmateenejad, M. Elyasi, Anal. Chim. Acta 646 (2009) 30–38.
[23] B. Hemmateenejad, R. Miri, M. Elyasi, J. Theor. Biol. 305 (2012) 37.
[24] L. Nørgaard, G. Sölétormos, N. Harrit, M. Albrechtsen, O. Olsen, D. Nielsen, K. Kampmann, R. Bro, J. Chemometr. 21 (2007) 451–458.
[25] B.K. Lavine, C.E. Davidson, D.J. Westover, J. Chem. Inf. Comput. Sci. 44 (2004) 1056–1064.
[26] E.F. Petricoin, D.K. Ornstein, C.P. Paweletz, A. Ardekani, P.S. Hackett, B.A. Hitt, A. Velassco, C. Trucco, L. Wiegand, K. Wood, J. Natl. Cancer Inst. 94 (2002) 1576–1578.
[27] R.K.H. Galvão, M.C.U. Araujo, G.E. José, M.J.C. Pontes, E.C. Silva, T.C.B. Saldanha, Talanta 67 (2005) 736–740.
[28] Z. Ramadan, D. Jacobs, M. Grigorov, S. Kochhar, Talanta 68 (2006) 1683–1691.
[29] A.N. Zira, S.E. Theocharis, D. Mitropoulos, V. Migdalis, E. Mikros, J. Proteome Res. 9 (2010) 4038–4044.
[30] H. Yu, J.F. MacGregor, Chemom. Intell. Lab. Syst. 73 (2004) 199–205.
[31] S. Wold, H. Antti, F. Lindgren, J. Öhman, Chemom. Intell. Lab. Syst. 44 (1998) 175–185.
[32] O.P. Whelehan, M.E. Earll, E. Johansson, M. Toft, L. Eriksson, Chemom. Intell. Lab. Syst. 84 (2006) 82–87.
[33] C.L. Hansen, F. van den Berg, M.A. Rasmussen, S.B. Engelsen, S. Holroyd, Chemom. Intell. Lab. Syst. 104 (2010) 243–248.
[34] H. Winning, N. Viereck, T. Salomonsen, J. Larsen, S.B. Engelsen, Carbohydr. Res. 344 (2009) 1833–1841.
[35] N. Mobaraki, B. Hemmateenejad, Chemom. Intell. Lab. Syst. 109 (2011) 171–177.