
Anal Bioanal Chem (2004) 380: 419–429

DOI 10.1007/s00216-004-2783-y

REVIEW

Lennart Eriksson · Henrik Antti · Johan Gottfries · Elaine Holmes · Erik Johansson · Fredrik Lindgren · Ingrid Long · Torbjörn Lundstedt · Johan Trygg · Svante Wold

Using chemometrics for navigating in the large data sets of genomics, proteomics, and metabonomics (gpm)

Received: 25 June 2004 / Accepted: 22 July 2004 / Published online: 22 September 2004
© Springer-Verlag 2004

Abstract This article describes the applicability of multivariate projection techniques, such as principal-component analysis (PCA) and partial least-squares (PLS) projections to latent structures, to the large-volume high-density data structures obtained within genomics, proteomics, and metabonomics. PCA and PLS, and their extensions, derive their usefulness from their ability to analyze data with many, noisy, collinear, and even incomplete variables in both X and Y. Three examples are used as illustrations: the first example is a genomics data set and involves modeling of microarray data of cell cycle-regulated genes in the microorganism Saccharomyces cerevisiae. The second example contains NMR-metabonomics data, measured on urine samples of male rats treated with either of the drugs chloroquine or amiodarone. The third and last data set describes sequence-function classification studies in a set of G-protein-coupled receptors using hierarchical PCA.

Keywords PCA · PLS · Hierarchical modeling · Multivariate analysis · Omics data analysis

L. Eriksson (✉) · E. Johansson
Umetrics AB, POB 7960, 907 19 Umeå, Sweden

H. Antti · E. Holmes
Biological Chemistry, Biomedical Sciences Division, Faculty of Medicine, Imperial College of Science Technology and Medicine, Sir Alexander Fleming Building, South Kensington, London, SW7 2AZ, UK

J. Gottfries
AstraZeneca, R&D Mölndal, 431 83 Mölndal, Sweden

H. Antti · J. Gottfries · J. Trygg · S. Wold
Institute of Chemistry, Umeå University, 901 87 Umeå, Sweden

F. Lindgren
Umetrics AB, Malmö Office, Stortorget 21, 21134 Malmö, Sweden

I. Long · T. Lundstedt
Department of Pharmaceutical Chemistry, Uppsala University, Box 574, 741 23 Uppsala, Sweden

Introduction

The "omics" world

In today's competitive pharmaceutical industry the search for promising new biologically active compounds is intense, and to accelerate this discovery process even further, chemical, physical, biological, and "omics" data are compiled at an ever increasing pace. Interestingly, the Royal Swedish Academy of Sciences, in their Advanced information on the Nobel Prize in Chemistry 2002 (a prize which was awarded to the originators of MS and NMR of biosamples), recognizes that [1]: "The last five years have seen the appearance of the 'omics world' in life sciences, exemplified by concepts such as genomics, proteomics or metabonomics. The new aspect of these concepts is the global view and the large-scale investigations, in contrast to the problem-oriented reductionistic view prevailing in earlier studies. It is now possible to describe the whole genome of an organism. Similarly, the whole set of proteins that appear at a certain stage in a living cell can at least be considered, even if not quantitatively described, and the same should in principle hold for the total flow of metabolic products. These new possibilities are in part due to the development of new methodologies, of which mass spectrometry and NMR applied to biological macromolecules are important examples." Thus, "omics", including genomics, proteomics, and metabonomics (GPM), is a rapidly growing area with constantly growing data volumes [1–3].

Information recovery from omics data

The automated acquisition of large amounts of omics data results in exploratory and interpretative challenges. The abundance of data is not in itself a guarantee of obtaining useful information on major events taking place in an investigated system. On the contrary, data from the omics field need to be processed and analyzed in
order to highlight the useful information among the measurements. Since these data are highly multivariate in nature, one must use data analytical techniques which are able to cope with the challenges inherent in masses of data, notably noise, collinearities, and missing data. Only with careful data analysis will we be able to address central questions such as how to modify drug chemical structure in order to improve drug performance, or to understand why a certain test creature is a slow responder in a metabonomics assay.

This article describes a remarkably simple approach to the analysis of masses of omics data based on so-called multivariate projection methods. This approach represents the table of observations data, X, from body fluid samples, gene chips, or GPCR-sequences, etc., as a swarm of points in K-dimensional space (K = number of variables), and then projects the point swarm down on to a low-dimensional plane or hyper-plane, a low-dimensional subspace. The coordinates of the points on this hyper-plane provide a compressed representation of the observations, and the direction vectors of the hyper-plane provide the corresponding representation of the variables.

The projection approach can be adapted to a variety of data-analytical objectives, i.e. (1) summarizing and visualizing a data set, (2) classification and discriminant analysis, and (3) finding quantitative relationships among the variables. This applies to any shape of multivariate data set, with many or few variables, many or few observations, and complete or incomplete data tables. In particular, projections handle matrices with more variables than observations very well, and the data can be noisy and highly collinear.

Methods used in this context are principal-component analysis (PCA) [4–6] for projecting X down on to a few scores, also called latent variables, giving a summary of X, soft independent modeling of class analogy (SIMCA) [7] and partial least-squares discriminant analysis (PLS-DA) [8] for classification, and principal-component regression (PCR) [5, 9] and PLS [5, 10–12] for latent-variable regression.

Furthermore, hierarchical PLS and PCA are two recent modifications which simplify interpretation in applications involving very many variables [13–18]. In such a situation, plots and lists of loadings, weights, and coefficients tend to become messy and the results are often difficult to overview. Instead of reducing the number of variables, and thus reducing the validity of the modeling, a better alternative is often to divide the variables into conceptually meaningful blocks and apply hierarchical PCA or PLS.

Example data sets

Data set I (genomics): cell-cycle-regulating genes of Saccharomyces cerevisiae

The first example relates to expression levels of genes in the microorganism Saccharomyces cerevisiae (a yeast). The data are taken from the web-site http://cellcycle-www.stanford.edu (cited 25 March 2003) [19]. It is of interest to model genes with periodicity in expression levels, i.e. fluctuations which could be connected to the cell cycle (formation and division).

To investigate changes of mRNA levels during the cell cycle in the organism studied, cell cultures must be synchronized such that a population of cells is coherent and thus stays in the same phase of the cell cycle at the same time. This can be achieved by inhibiting various steps essential for cell-cycle progression.

This data set consists of three experiments in which different techniques were used to achieve synchronization [20, 21]. Observations from the three experiments are labeled alpha, cdc15, and cdc28, respectively. The time point at which samples were taken is indicated in the name of the observation. The total number of observations is 59 and the number of variables (genes) with a diagnosed periodic behavior is 800.

We note that this data set has previously been analyzed using PCA [22]. Only 104 genes were analyzed and data were column-scaled to unit variance. In the current study, however, the scope of the analysis is widened to encompass 800 genes with a diagnosed periodicity, and data are column-scaled using Pareto-scaling (see further discussion in the section PCA of 59 by 800 data set). The use of more genes is expected to bring about a more stable model of the cell-cycle phenomenon, and experience shows that Pareto-scaling is a viable alternative to unit-variance scaling because the risk of blowing up noisy patterns in data is greatly reduced.

Data set II (metabonomics): a metabolic investigation of phospholipidosis

The second example deals with male rats treated with the two drugs chloroquine and amiodarone ("c" or "a"), both of which are known to induce phospholipidosis [23]. The drugs were administered to two different strains of rat, i.e. Sprague–Dawley and Fisher ("s" or "f") [23]. Sprague–Dawley rats are a standard laboratory animal model, whereas Fisher rats are more susceptible to certain types of drug exposure and hence it is easier to detect drug effects. The objective was to investigate whether 1H NMR data measured on rat urine samples could be used to distinguish control rats from animals exposed to known toxic compounds.

The data set contains N=57 observations (rats) and K=194 variables (1H NMR chemical shift regions). The observations (rats) are divided into six groups ("classes"): "s" control Sprague–Dawley, ten rats; "sa" Sprague–Dawley treated with amiodarone, eight rats; "sc" Sprague–Dawley treated with chloroquine, ten rats; "f" control Fisher, ten rats; "fa" Fisher treated with amiodarone, ten rats; "fc" Fisher treated with chloroquine, nine rats.

The applicability of PCA and SIMCA for classification and outlier detection in a two-class problem has
already been reported [24]. In the current publication the three-group discrimination problem is addressed using PLS-DA and the first three groups. The three-group problem involves not only investigation of how drug-exposed animals relate to controls but also of how the two classes of toxicant-dosed rats relate to each other.

Example III (proteomics): hierarchical modeling of G-protein-coupled receptors

Our third illustration relates to a comparison of different classes of G-protein-coupled receptors (GPCR) using a physicochemical description of the amino acid sequences. GPCR are cell membrane proteins that are used for communication between the outside and inside of the cell. This is a very large and varied family of proteins consisting of seven transmembrane (TM) alpha-helices connected by intracellular and extracellular loops.

Many attempts have been made to determine the structure of various types of GPCR, but have at best achieved very limited results. In an attempt to learn more about GPCR, Gunnarsson et al. [25] applied multivariate data analysis to a set of 897 receptors, divided into 12 classes on the basis of their function.

To describe the sequence properties of the 897 GPCR, Gunnarsson et al. used the five zz-descriptors for amino acids, as reported by Sandberg et al. [26]. These descriptors were used to encode sequence variations in the seven TM-regions (i.e. the loops were ignored), here denoted A, B, C, D, E, F, and G. The number of parameterized amino acid positions is 134. Therefore, the total number of variables is 134·5=670.

In Ref. [25], the 897 GPCR were analyzed all together, and no distinction was made between a training set and a test set. Also, because of the rather varied size of the different classes (between 4 and 302 members), there was a substantial risk that not all classes would indeed have a fair chance to influence the multivariate model. Hence, we decided to modify the initial study reported in [25] by considering only the five largest classes and using pre-defined and well-balanced training and test sets.

In this example we will focus on the following five classes of GPCR: amine- (am), peptide- (pe), rhodopsin- (op), olfactory- (ol), and orphan- (or) GPCR. The training set consists of 200 receptors (5 classes·40 receptors) and the test set consists of 100 receptors (5 classes·20 receptors). Both of these sets were identified using statistical molecular design [27], i.e. through local PCA of each class, followed by selection of receptors showing a good spread in plots of the first few score vectors.

We will use this data set to exemplify the utility of hierarchical multivariate modeling. Each of the seven trans-membrane regions will be used as a separate block in the base-level PCA modeling. The top-level model will be based on the scores from the base-level models (see also the section Hierarchical PCA and PLS models, below).

Data analytical methods

In this paper we have used the software SIMCA-P, version 10 [28], and its implementation of standard and hierarchical PCA and PLS. To estimate the number of PCA or PLS components, cross-validation with seven exclusion groups was used [29]. The data were pre-processed by means of mean-centering and scaling to unit variance, unless otherwise stated.

Principal-component analysis

Principal-component analysis is a basic workhorse in chemometrics [4, 6, 7, 13]. Its objective is to summarize the variation in a data matrix X, consisting of N rows (observations) and K columns (variables), in terms of a few underlying and informative scores, or latent variables. The X-matrix is decomposed as the product of two matrices, the (N·A) score matrix T and the (A·K) loading matrix P′, where A is the number of principal components, plus a (N·K) "noise" matrix of residuals, E:

X = TP′ + E (1)

where T is the score matrix summarizing the X-variables, and P′ is the loading matrix showing the influence of the variables on the projection model. E is the residual matrix expressing the deviations between the original values and the projections. The residual standard deviation (RSD) can be computed for observations and variables. The RSD of an observation (a row in E) is also called the observation distance to the PC model (DModX), because the RSD can be seen as a distance measure [13].

Soft independent modeling of class analogy

Data that are observed on a class of similar observations can always be well approximated by a few-component PCA or PLS model, provided that most of the variables indeed express this similarity. This is the basis of a method known as soft independent modeling of class analogy, or SIMCA for short [7, 13]. The objective of SIMCA is to develop local PCA or PLS models, one for each class of similar observations, and later use these models for interpretation and classification of new observations. This allows probabilistic class boundaries to be defined, which helps uncover outliers, detect the importance of variables, etc. [7, 13].

Partial least-squares projections to latent structures

Partial least-squares is the regression extension of PCA [5, 7, 10, 13], working with two matrices, X and Y. It has two objectives, to well approximate X and Y and to
model the relationship between them. The predictor block (X) is summarized by the A X-scores, T, and the corresponding variation in the response block (Y) is described by the A Y-scores, U. As with PCA, the scores (T and U) express relationships among the observations (samples).

Basically, PLS maximizes the covariance between T and U [30]. For each model dimension, a weight vector, w, is computed, which reflects the partial contribution of each X-variable to the modeling of Y. The resulting (A·K) X-weight matrix, W, hence reflects the structure in X that maximizes the covariance between T and U. The corresponding matrix of Y-weights is designated C. Additionally, a matrix of X-loadings, P, is calculated in order to deflate X appropriately. It expresses the correlation structure between the X-variables.

The decomposition in PLS of X and Y can be described as:

X = TP′ + E;  Y = TC′ + F (2)

The set of PLS regression coefficients can be computed according to:

B = W(P′W)⁻¹C′ (3)

Subsequently, a prediction ŷ is obtained from the x-vector as:

ŷ = x′W(P′W)⁻¹C′ = x′B (4)

PLS discriminant analysis

Partial least-squares discriminant analysis [8, 13, 31] is a modification of PLS targeting classification and discrimination problems. Unlike SIMCA, however, where a separate model is made for each class, PLS-DA makes one model covering many classes. The resulting projection model gives latent variables that focus on maximum separation ("discrimination") rather than maximum variation ("optimal class modeling").

In PLS-DA, a Y-matrix of dummy variables describes the class membership of each observation in the training set. The dummy matrix Y has G columns (for G classes) with ones and zeros, such that the entry in the gth column is unity and the entries in other columns are zero for observations of class g. An interesting discussion of PLS and its use in discrimination modeling is found in [32].

With G well conditioned and well separated classes, one expects G−1 significant PLS components. If one finds more components this indicates the presence of some non-linearities such as sub-clusters. If one finds fewer components this suggests incomplete separation.

Hierarchical PCA and PLS models

The idea with hierarchical PCA and PLS is to block the variables in order to improve transparency and interpretability. Both these methods operate on two or more levels. On the lower level the details of each block are modeled. This analysis provides the block score vectors ("super variables"), which are used to construct the data matrices of the upper level. On the upper level, a relationship between relatively few super variables is developed.

On each level, "standard" PLS or PCA scores and loading plots, and residuals and their summaries such as DModX, are available for model interpretation. This enables an interpretation focused on pertinent blocks and their dominating variables. Further details are given in the literature [13–18].

Results for data set I

PCA of 59 by 800 data set

For scaling of "gene-spectra" we found Pareto scaling useful [13, 33]. Here the data matrix X is first centered by subtracting the mean of each column and then scaled by dividing by the square root of the standard deviation of each column. If the variables are on different scales (i.e. you are comparing chalk with cheese), scaling to unit variance is recommended.

A PCA model was fitted to all 59 observations. Using the first two components for overview purposes gives R2X=0.42 and Q2X=0.36. There is a uniform spread of the observations with no apparent separation among the three classes and no strong outliers (Fig. 1a). However, a cyclical pattern is apparent within each class (Fig. 1). The cdc15 class shows two cell cycles while the other two classes contain almost two. The score plots in Fig. 1 suggest that the cell cycle duration for each class is approximately 60, 110, and 80–90 min, respectively. To interpret this model further, we could look at a plot of the PCA loadings (no plot shown). Genes with similar expression profiles are grouped together [33].

Conclusions for example I

As shown by this example, analytical bioinformatics data can be visualized using PCA. This gives an overview of the data and highlights experimental variations, trends, and outliers. In this example the cyclic behavior of these data relates to cell-cycle duration.

Results for data set II

PLS-DA for Sprague–Dawley animals

To illustrate the utility of PLS-DA we are going to focus on the differences between the groups "s" (Sprague–Dawley controls), "sc" (SD rats treated with chloroquine), and "sa" (SD rats exposed to amiodarone). Before analysis, data were mean-centered and Pareto-scaled.
Fig. 1 Principal-component analysis t1/t2 score plot for data set I. Each observation point is one time point in a particular experiment. A jointed line can be used to connect each point according to time. By following such a time series trajectory it is possible to get an indication of the duration of a cell cycle. For instance, in the cdc-case it seems reasonable to expect that the duration is approximately 90 min, because of the pair-wise nearness of time points 0/90, 10/100, 20/110, etc. Top left: plot of all observations across all three microarray experiments. Top right: plot of alpha-experiment observations. Bottom left: plot of cdc15-experiment observations. Bottom right: plot of cdc28-experiment observations

The PLS-DA calculations yielded a strongly significant three-component model with R2X=0.73, R2Y=0.87, and Q2Y=0.79. Because three well-behaved classes would give two significant components, this indicates some peculiarities in the data. The score vectors in a 3D scatter graph are seen in Fig. 2. There is one atypical sc-rat (marked by the ellipse). The contribution plot in Fig. 3 reveals how the outlying animal (number 27) is different from a better representative of the sc-class (number 28). As seen, the deviation is chiefly in the chemical shift regions 1.30, 1.46, 2.46, 3.06, and 3.38 ppm (which relate to 2-oxoglutarate and alanine).

The first component separates animals according to which drug they received and the second component contrasts controls with drug-exposed creatures. The third component is dominated by the maverick sc-rat. Thus, there is really no doubt that the chemical treatment of the rats induces a substantial and characteristic change in their NMR profiles. Additional plots of PLS weights and coefficients might help uncover which spectral regions contribute to these systematic differences (no such plots are shown for reasons of brevity).

Discussion and conclusions for example II

This example shows the power of multivariate statistical methods to highlight information residing in large amounts of metabonomically interesting measurements. It is always a good habit to commence the data analysis with a PCA of the entire data set (we did not show
results from this phase). This will indicate groups, time trends, outliers, and other systematic structures.

To further focus the analysis toward classification or discrimination, techniques like PLS-DA and SIMCA are purposeful. A necessary condition for PLS-DA to work reliably is that each class is preferably "tight" and occupies a small and separate volume in the X-space. Also, the number of modeled classes must not be too high. Experience shows that PLS-DA is most practical with 2–4 classes, but with more classes the SIMCA method is more tractable [13].

In this exercise, we have focused on the differences between three classes, i.e. "s", "sc" and "sa" rats. This is an analysis that will pick up drug-related effects of the chloroquine and amiodarone treatments. It should be noted that other PLS-DA comparisons might be made apart from just "s" with "sc" and "sa". Other ways of focusing on drug effects are to compare two classes together, for instance "f"–"fa", "f"–"fc", "s"–"sa" and "s"–"sc". However, there are also other aspects of the data analysis which may reveal interesting information. For example, a comparison made between "f"–"s" would indicate strain differences and perhaps diet differences. And looking at "fa"–"sa" and "fc"–"sc" might suggest strain-dependent drug effects.

Fig. 2 Three-dimensional scatter plot of the PLS-DA model for data set II. Each point represents one rat. Open boxes are SD controls. Filled upside-down triangles are SD rats exposed to chloroquine. Gray dots are SD rats exposed to amiodarone. One serious outlier, captured by the third component, is marked by the ellipse

Fig. 3 Contribution plot indicating why and how the outlying SD rat is different from a normal, good SD rat. The difference is in the chemical shift regions 1.30, 1.46, 2.46, 3.06, 3.38, etc.

Results for data set III

Base-level PCA models

Using the training set of 200 GPCR, we fitted seven PCA models, one for each TM region. We wanted to avoid having too many components in the base-level models. Our reason for this approach was that we were after simplicity, i.e. a parsimonious top-level model which would be transparent and easy to interpret. In addition, we wanted to use the predictively most meaningful base-level components.

We decided to use the following two decision criteria: (1) only components with a positive contribution to the cumulative Q2X were used; (2) regardless of Q2X, the explained variation should exceed 30%, i.e. R2X>0.3. In total, the number of lower-level components used is 36. Careful examination of the distribution of receptors in the t1/t2 plane of each PCA model indicated some weak sub-groupings, but probably the most important conclusion was that there were no traces of strong outliers or other anomalies. Hence, there was no need to refine any local model by dropping strange receptors.
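Criterion (2) above can be illustrated with a short sketch: the cumulative R2X follows directly from the singular values of the centered data block. This is our own NumPy illustration; the cross-validated Q2X of criterion (1), computed in SIMCA-P with seven exclusion groups, is deliberately left out to keep the sketch small:

```python
import numpy as np

def cumulative_r2x(X):
    """Cumulative fraction of the variation of the centered matrix X
    explained by the first 1, 2, ... principal components."""
    Xc = np.asarray(X, dtype=float)
    Xc = Xc - Xc.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    return np.cumsum(s ** 2) / np.sum(s ** 2)

def n_components_r2x(X, r2x_min=0.3, a_max=10):
    """Smallest number of components whose cumulative R2X reaches
    r2x_min (criterion 2 in the text), capped at a_max components."""
    r2 = cumulative_r2x(X)
    a = int(np.searchsorted(r2, r2x_min)) + 1
    return min(a, a_max, len(r2))
```

Applied per TM block, a routine of this kind (combined with a Q2X check) would reproduce the kind of base-level component counts used here.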
Top-level PCA model

The top-level PCA model, based on the 36 base-level score vectors, gave a model with five significant components (R2X=0.60 and Q2X=0.32). Figure 4 shows a scatter plot of the t1/t2 plane of the hierarchical model, and Fig. 5 provides the corresponding plane of the overall PCA model based on the 670 original descriptors. In both these plots class membership is symbol-coded. The two score plots are strikingly similar, albeit with a sign inversion for t2. These plots demonstrate that no vital information is lost in the hierarchical approach.

The big difference between the hierarchical and overall PCA models lies in the variable-related parameters. The corresponding two p1/p2 loading plots are shown in Figs. 6 and 7. Whereas the "overall" loading plot is very cluttered (Fig. 7), the "hierarchical" loading plot is clean and easier to interpret (Fig. 6). Evidently the combination of high TM-E t2, TM-F t1, and TM-G t1, and low TM-A t2, TM-B t1, and TM-C t1 is characteristic of olfactory GPCR, and the combination of high TM-A t1, TM-C t2, TM-F t2, and TM-G t2, and low TM-E t1 (and to some extent low TM-B t3 and TM-D t1), indicates that GPCR of the rhodopsin-type are involved.
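The two-level structure used above can be sketched as follows: one base-level PCA per variable block (in the paper, one per TM region), with the pooled block scores, the "super variables", fed to a top-level PCA. The fixed number of components per block is a simplification of our own; in the paper the number varied per block (36 in total):

```python
import numpy as np

def pca_scores(X, a):
    """Scores of the first a principal components of the centered
    matrix X, computed via the singular value decomposition."""
    Xc = np.asarray(X, dtype=float)
    Xc = Xc - Xc.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :a] * s[:a]

def hierarchical_pca(blocks, a_block=2, a_top=2):
    """Base level: one PCA per variable block. Top level: PCA on the
    matrix of pooled block scores (the 'super variables')."""
    Z = np.hstack([pca_scores(Xb, a_block) for Xb in blocks])
    return pca_scores(Z, a_top), Z
```

For data set III this would mean seven blocks of zz-descriptor columns; the top-level scores then play the role of t1/t2 in Fig. 4.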
Fig. 4 Principal-component analysis t1/t2 score plot of the hierarchical PCA model of data set III. The five classes of GPCR (am, ol, op, or, pe) are indicated by different symbols, as seen in the top part of the plot

Fig. 5 Principal-component analysis t1/t2 score plot of the overall PCA model of data set III. The five classes of GPCR are indicated by different symbols, as seen in the top part of the plot. Note the sign inversion of t2 (on the vertical axis)

Fig. 6 Hierarchical PCA p1/p2 loading plot corresponding to Fig. 4. $M1.t1 means base-level model M1, for TM region A, and component 1, etc.

Fig. 7 Overall PCA p1/p2 loading plot corresponding to Fig. 5
426

Fig. 8 Score-contribution plot of the hierarchical PCA model (HI-GPCR Primary.M9, PCA-X, top model), showing how observation 123 is different from the average GPCR observation. The bars give Score Contrib(Obs 123 - Average), Weight=p[1]p[2], for the score variables $M1.t1 through $M7.t3. The score variable $M6.t1 (TM region F, component 1) seems relevant for this difference

In a hierarchical model it is just a matter of a double-click on any selected receptor point to create a score-contribution plot showing which variables are influential for that particular receptor. Figure 8 relates to receptor 123, a member of the olfactory class, and shows how it is different from the average receptor. One very important variable for this observation is TM-F t1 ($M6.t1).

A second double-click on the TM-F t1 ($M6.t1) bar will then open another score-contribution plot (Fig. 9), uncovering which amino acid positions have a strong impact on the localization of this receptor in the score plot. We can see that amino acid positions 1, 9, 10, 12, 14, and 15 are strong contributors to the positioning of this receptor. For instance, in position 1 of TM-region F, receptor 123 has higher values than the average receptor in four out of the five zz-scales.

Fig. 9 Contribution plot showing the variables underlying the score variable $M6.t1 (in Fig. 8), and which positions in TM region F and which amino acid property descriptors are characteristic for GPCR 123 compared with the average GPCR observation in the training set

Moreover, in a similar manner it is possible to analyze receptors that do not fit the hierarchical model well, i.e. those that are far out in a plot of DModX (not shown). It is possible to find which score variables contribute to the unexplained variation of an observation, and also which underlying amino acid positions and descriptor variables are problematic.

Predictions for the prediction set

In the next step, the hierarchical PCA model was used to classify the 100 receptors in the prediction set. The score plots of the training set (Fig. 4) and the prediction set (Fig. 10) are very similar. Hence, we infer that the predictive power is excellent.

Discussion and conclusion of data set III

The main objective of hierarchical modeling is to achieve simplicity. The results presented here show how the interpretability of complicated data sets can be facilitated (compare Figs. 6 and 7).
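The layered scheme described above, base-level PCA models for each trans-membrane region whose score vectors become the "super variables" of a top-level PCA, plus a score-contribution breakdown for a single receptor, can be sketched roughly as follows. This is only an illustration on synthetic data: the block sizes, random values, observation index, and the simple |p1|+|p2| contribution weighting are assumptions, not the actual GPCR data set or the exact weighting used by the SIMCA software.

```python
# Two-level (hierarchical) PCA sketch on synthetic data.
# Assumptions: 200 fake "receptors", seven blocks standing in for the
# trans-membrane regions A-G, 15 descriptor columns per block.
import numpy as np

rng = np.random.default_rng(0)
n_obs = 200
blocks = {f"TM-{c}": rng.normal(size=(n_obs, 15)) for c in "ABCDEFG"}

def pca(X, n_comp):
    """PCA via SVD of the mean-centered matrix; returns scores T and loadings P."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_comp] * s[:n_comp], Vt[:n_comp].T

# Base level: one PCA per block; the base-level scores become the
# "super variables" entering the top-level model.
base_scores = [pca(X, 2)[0] for X in blocks.values()]
X_top = np.hstack(base_scores)        # 200 x 14 matrix of super variables

T_top, P_top = pca(X_top, 2)          # top-level scores and loadings

# Score contribution for one observation: its deviation from the average
# observation in each super variable, weighted by that variable's share
# of the first two loadings (a simple stand-in for Weight=p[1]p[2]).
i = 123
deviation = X_top[i] - X_top.mean(axis=0)
contribution = deviation * np.abs(P_top).sum(axis=1)
```

The largest |contribution| entries point to the base-level score variables (such as $M6.t1 in Fig. 8) that place this observation away from the average; applying the same formula inside the corresponding block would then drill down to individual descriptor columns.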
A comparison of the hierarchical PCA model and the overall PCA model shows that there is practically no loss of information using the former approach (compare Figs. 4 and 5). Moreover, we used the test set to verify the predictive power of both models (compare Figs. 4 and 10), although results are only reported for the hierarchical PCA model.

An interesting result is that the base-level models do not need overly many components in order to capture the dominant patterns in a data set. The current data set can be understood as a hierarchical set-up in three layers. The top layer highlights lucid relationships and shows how the various trans-membrane regions influence the spread of the receptors. On the middle layer one can then identify which amino acid positions are responsible for the patterns seen among the receptors. Finally, on the lowest level, which we have not considered in this paper, inspection of the zz-scales will highlight which amino acid properties in the influential positions are most important. Reference [26] provides the zz-scales.

Fig. 10 Related to Fig. 4, but shows the predicted locations (tPS[1] versus tPS[2]) of the 100 GPCR of the test set, with the five classes (am, ol, op, or, pe) indicated

Discussion

The need for multivariate projection methods

Analytical and bioanalytical data-acquisition techniques such as cDNA arrays, 2D-electrophoresis-based proteomics, and metabonomics studies by HPLC, NMR, and/or mass spectrometry yield a wealth of data about every sample. The resulting high-density, high-volume data sets can readily exceed thousands of observations and variables, well beyond the reach of intuitive comprehension. For instance, typical "gene-spectra" obtained from a set of microarray experiments are difficult to overview and compare. For these and similar types of data set, data-driven analysis by means of multivariate projection methods greatly facilitates the understanding of complex data structures and thereby complements hypothesis-driven research.

An attractive property of PCA, PLS, and their extensions is that they apply to almost any type of data matrix, e.g. matrices with many variables (columns), many observations (rows), or both. The precision and reliability of the projection model parameters related to the observations (scores, DModX) improve with an increasing number of relevant variables. This is readily understood by realizing that the "new variables", the scores, are weighted averages of the X-variables. Any (weighted) average becomes more precise the more numerical values are used as its basis. Hence, multivariate projection methods work well with "short and wide" matrices, i.e. matrices with many more columns than rows. The data sets treated here are predominantly short and wide (the microarray and metabonomics examples), but occasionally long and lean (the hierarchical proteomics data set).

The great advantage of PCA, PLS, and similar methods is that they provide powerful views of the data, compressed to two or three dimensions. Initial inspection of score and loading plots might reveal groups in the data that were previously unknown or uncertain. For example, Figs. 4 and 5 unequivocally tell a story of strong groupings among the various classes of GPCR. Hence, there are systematic differences in the properties of these classes of GPCR, and some classes are inherently more similar than others.

To interpret the patterns of a score plot one can examine the corresponding loading plot. In PCA there is a direct score-loading correspondence, whereas the interpretation of a PLS model might be slightly more difficult if the X-matrix contains structured noise that is unrelated to Y [34]. Looking further at how each variable contributes to the separation in each dimension gives insight into the relative importance of each variable.

Moreover, DModX and other residual plots can uncover moderate outliers in the data, i.e. samples whose variable signatures differ from those of the majority of observations. Serious outliers have a more profound impact on the model and therefore show up as strong outliers in a score plot. The PLS-DA score plot in Fig. 2, for example, highlights the existence of one deviating SD rat subjected to chloroquine exposure. In this situation, contribution plotting (see Fig. 3) is a helpful approach for delineating why and how this outlier is different.

Plotting scores and residuals versus time, or against some other external order of data collection (e.g. geographical coordinates), is also informative. Such plots can reveal (unwanted) trends in the data.
The quartet of score plots presented in Fig. 1 demonstrates the periodicity in the microarray experiments; this, in turn, suggests the duration of the respective cell cycle. Color-coding of samples by sex or another group of interest indicates whether such factors affect grouping within the data set. Coding by analytical number or by sampling order might reveal analytical drift, which can be a serious problem with complex analytical and bioanalytical techniques. When samples of interest that "should" be grouped are not, PCA, PLS, etc. give warning of a problem in understanding, or of previously hidden complexity within the data.

One of the great assets of multivariate projection methods is the plethora of available model parameters and other diagnostic tools (and plots and lists thereof), which aid in obtaining fundamental insights into the data-generation process, even when there are substantial levels of missing data. These include (quoted from Ref. [35]):

– Discovering (often unexpected) groupings in the data.
– Seeing discontinuous relationships in the data.
– Seeing relationships between variables.
– Identifying variables separating two or more classes.
– Classifying unknown observations into known classes.
– Building mathematical models of large datasets.
– Compressing large datasets into smaller, more informative datasets.

Reliability, flexibility, versatility and scalability of multivariate projection methods

Many data-mining and analytical techniques are available for processing and overviewing multivariate data. However, we believe that latent-variable projection methods are particularly apt at handling the data-analytical challenges arising from analytical and bioanalytical omics data. Projection-based methods are designed to handle effectively the hugely multivariate nature of such data. In this paper we have presented PCA, SIMCA, PLS, PLS-DA, and hierarchical modeling for analysis of the three example data sets. However, as discussed below, there are other twists and aspects of these techniques that contribute to their general applicability and increasing popularity.

An often overlooked question is how to design an experiment to make sure it contains a maximum amount of information. Design of experiments (DOE) generates a set of representative, informative, and diverse experiments [36, 37]. Because one objective is to be restrictive with experiments, a DOE protocol is normally tailored toward the precise needs of the ongoing investigation. This means that in a screening application many factors are studied in a design with few experiments, whereas in an optimization study few factors are investigated in detail using rather many experimental trials. With analytical and bioanalytical data the number of samples ("experiments") is often not such a serious issue as in expensive experimentation; then other designs, such as onion designs, are relevant [38].

Irrespective of its origin and future use, any DOE protocol will benefit from analysis using multivariate projection methods, especially if the protocol has paved the way for extensive and multivariate measurements on its experimental trials. Thus DOE [37, 39] and its extension to design in molecular properties [5] are expeditious routes toward obtaining informative, reliable, and useful multivariate models.

Furthermore, there are many tools "surrounding" multivariate projection methods, jointly striving to improve their utility, flexibility, and versatility. Thus, there are many methods of pre-processing multivariate data, all trying to reshape the data to be better suited for the subsequent analysis. Common techniques for pre-processing of multivariate data include methods for scaling and centering, transformation, expansion, and signal correction and compression [13]. Here we merely mention that neglecting proper pre-processing may make the multivariate analysis fruitless. Methods of particular interest in the analysis of analytical and bioanalytical omics data are filters from the orthogonal signal correction family [34, 40] and compressors from the wavelet function family [41].

We would also like to point out that PCA, PLS, and the like are not restricted to the analysis of linear problems arising from two-way X–Y data matrices. On the contrary, great flexibility and versatility are ensured through a vast series of modifications and extensions of the basic algorithms. Among the many extensions is a whole set of methods for data that are not linear. These methods include quadratic PLS [42], spline PLS [43], GIFI-PLS [44], and implicit non-linear latent-variable regression [45].

Apart from the non-linearly oriented methods, techniques like bifocal PLS [46, 47], batch-wise PLS (and PCA) [48, 49], hierarchical PLS (and PCA) [13–18], and multiway PCA and PLS [50, 51] extend the multivariate projection approach to encompass three-way matrices and situations where more than two matrices are considered. Because of ever-growing databases we believe that the hierarchical problem architecture is most useful and has great promise. Hierarchical models can be set up in more than two layers, and this enables an interesting approach to scalability with which it is possible to zoom in on many different model levels and thereby address model interpretation at different levels of detail.
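As a concrete illustration of why the scaling and centering step mentioned above matters, the following sketch compares the first PCA loading computed with and without unit-variance ("auto") scaling. The data and the variance ratio between the two variable blocks are invented for the example; without scaling, the high-variance columns soak up essentially all of the first component.

```python
# Effect of mean-centering plus unit-variance ("auto") scaling on PCA.
# Synthetic data: five modest-variance columns next to five columns whose
# variance is 10^4 times larger but carries no special information.
import numpy as np

rng = np.random.default_rng(2)
n = 50
X = np.hstack([rng.normal(size=(n, 5)),            # ordinary columns
               100.0 * rng.normal(size=(n, 5))])   # high-variance columns

def autoscale(X):
    """Center each column to mean zero and scale it to unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def first_loading(X):
    """First PCA loading vector of the mean-centered matrix."""
    _, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    return Vt[0]

p1_raw = first_loading(X)
p1_scaled = first_loading(autoscale(X))

# Fraction of the first loading's absolute weight on the high-variance block:
raw_share = np.abs(p1_raw[5:]).sum() / np.abs(p1_raw).sum()
scaled_share = np.abs(p1_scaled[5:]).sum() / np.abs(p1_scaled).sum()
```

Here raw_share comes out close to 1 while scaled_share drops to roughly one half: after autoscaling every column competes on an equal footing, which is usually the desired starting point for omics data.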
Concluding remarks

Multivariate projection methods are a useful and versatile technology for modeling, monitoring, and prediction of the often complex problems and data structures encountered within the omics disciplines. The results can be displayed graphically in many different ways, and this all works because the methods capture the dominant, latent properties of the system under study. It is our belief that as multivariate chemometric methods evolve and develop, this will involve applications to omics data. Hence, we look forward to an interesting future for "multivariate omics data analysis" and the many new innovative ideas that will probably be seen in the near future.

References

1. http://www.nobel.se/chemistry/laureates/2002/chemadv02.pdf. Cited 5 December 2003
2. Lockhart DJ, Winzeler EA (2000) Nature 405:827–836
3. Nicholson JK, Connelly J, Lindon JC, Holmes E (2002) Metabonomics: a platform for studying drug toxicity and gene function. Nat Rev Drug Discov 1:153–161
4. Jackson JE (1991) A user's guide to principal components. Wiley, New York. ISBN 0-471-62267-2
5. Martens H, Naes T (1989) Multivariate calibration. Wiley, New York. ISBN 0-471-90979-3
6. Wold S, Esbensen K, Geladi P (1987) Chemom Intell Lab Syst 2:37–52
7. Wold S, Albano C, Dunn WJ, Edlund U, Esbensen K, Geladi P, Hellberg S, Johansson E, Lindberg W, Sjöström M (1984) In: Kowalski BR (ed) Chemometrics: mathematics and statistics in chemistry. D. Reidel Publishing Company, Dordrecht
8. Sjöström M, Wold S, Söderström B (1985) PLS discriminant plots. In: Proceedings of PARC in Practice, Amsterdam
9. Kalivas JH (1999) J Chemom 13:111–132
10. Wold S, Johansson E, Cocchi M (1993) In: Kubinyi H (ed) 3D-QSAR in drug design: theory, methods, and applications. ESCOM Science Publishers, Leiden, pp 523–550
11. Burnham AJ, Viveros R, MacGregor JF (1996) J Chemom 10:31–45
12. Burnham AJ, MacGregor JF, Viveros R (1999) Chemom Intell Lab Syst 48:167–180
13. Eriksson L, Johansson E, Kettaneh-Wold N, Wold S (2001) Multi- and megavariate data analysis: principles and applications. Umetrics AB. ISBN 91-973730-1-X
14. Berglund A, De Rosa MC, Wold S (1997) J Comput Aided Mol Des 11:601–612
15. Westerhuis J, Kourti T, MacGregor JF (1998) J Chemom 12:301–321
16. Wold S, Kettaneh N, Tjessem K (1996) J Chemom 10:463–482
17. Janné K, Pettersen J, Lindberg NO, Lundstedt T (2001) J Chemom 15:203–213
18. Eriksson L, Johansson E, Lindgren F, Sjöström M, Wold S (2002) J Comput Aided Mol Des 16:711–726
19. The data are taken from http://cellcycle-www.stanford.edu. Cited 25 March 2003
20. Spellman PT et al (1998) Mol Biol Cell 9:3273–3297
21. Cho RJ et al (1998) Mol Cell 2:65–73
22. Johansson D, Lindgren P (2002) Masters thesis in bioinformatics, Umeå University
23. Espina JR, Shockcor JP, Herron WJ, Car BD, Contel NR, Ciaccio PJ, Lindon JC, Holmes E, Nicholson JK (2001) Magn Reson Chem 39:559–565
24. Eriksson L, Antti H, Holmes E, Johansson E. Multi- and megavariate data analysis: finding and using regularities in metabonomics data. In: Robertson DG (ed) Toxicological metabonomics: the use of NMR spectroscopy and multivariate statistics in drug safety evaluation. Kluwer, Dordrecht
25. Gunnarsson I, Andersson PM, Wikberg J, Lundstedt T (2003) J Chemom 17:82–92
26. Sandberg M, Eriksson L, Jonsson J, Sjöström M, Wold S (1998) J Med Chem 41:2481–2491
27. Eriksson L, Andersson PM, Johansson E, Lundstedt T (2002) Statistical molecular design: a core concept in multivariate QSAR and combinatorial technologies. Part I: basic principles and application to lead optimization. Part II: QSAR applications. Part III: QSAR-directed virtual screening. Part IV: SMD, an integral part of combC and HTS. Part V: some extensions and recent developments. http://www.acc.umu.se/%7Etnkjtg/chemometrics/editorial/. Cited 19 December 2003
28. http://www.umetrics.com
29. Wold S (1978) Technometrics 20:397–405
30. Trygg J (2001) PhD thesis, Umeå University
31. Ståhle L, Wold S (1987) J Chemom 1:185–196
32. Barker M, Rayens W (2003) J Chemom 17:166–173
33. Atif U, Earll M, Eriksson L, Johansson E, Lord P, Margrett S (2002) Analysis of gene expression datasets using partial least-squares discriminant analysis and principal-component analysis. In: Ford M, Livingstone D, Dearden J, Van de Waterbeemd H (eds) EuroQSAR 2002 designing drugs and crop protectants: processes, problems and solutions. Blackwell, Oxford, pp 369–373. ISBN 1-4051-2561-0
34. Wold S, Trygg J, Berglund A, Antti H (2001) Chemom Intell Lab Syst 58:131–150
35. Kristal BS (2002) Practical considerations and approaches for entry-level megavariate analysis. http://mickey.utmem.edu/papers/bioinformatics_02/pdfs/Kristal.pdf. Cited 5 February 2004
36. Box GEP, Hunter WG, Hunter JS (1978) Statistics for experimenters. Wiley, New York
37. Eriksson L, Johansson E, Kettaneh-Wold N, Wikström C, Wold S (2000) Design of experiments: principles and applications. Umetrics AB. ISBN 91-973730-0-1
38. Olsson I, Gottfries J, Wold S (2004) D-optimal onion design (DOOD) in statistical molecular design. Chemom Intell Lab Syst 73:37–46
39. Eriksson L, Arnhold T, Beck B, Fox T, Johansson E, Kriegl JM (2004) Onion design and its application to a pharmaceutical QSAR problem. J Chemom 18:188–202
40. Wold S, Antti H, Lindgren F, Öhman J (1998) Chemom Intell Lab Syst 44:175–185
41. Trygg J, Wold S (1998) Chemom Intell Lab Syst 42:209–220
42. Wold S, Kettaneh-Wold N, Skagerberg B (1989) Chemom Intell Lab Syst 7:53–65
43. Wold S (1992) Chemom Intell Lab Syst 14:71–84
44. Eriksson L, Johansson E, Lindgren F, Wold S (2000) Quant Struct Act Relat 19:345–355
45. Berglund A, Wold S (1997) J Chemom 11:141–156
46. Wold S, Hellberg S, Lundstedt T, Sjöström M, Wold H (1987) PLS modeling with latent variables in two or more dimensions. In: Proceedings of the Frankfurt PLS meeting, September 1987
47. Eriksson L, Damborsky J, Earll M, Johansson E, Trygg J, Wold S (2004) SAR QSAR Environ Res 15 (in press)
48. Wold S, Kettaneh N, Fridén H, Holmberg A (1998) Chemom Intell Lab Syst 44:331–340
49. Antti H, Bollard ME, Ebbels T, Keun H, Lindon JC, Nicholson JK, Holmes E (2002) J Chemom 16:461–468
50. Wold S, Geladi P, Esbensen K, Öhman J (1987) J Chemom 1:41–56
51. Nomikos P, MacGregor JF (1995) Chemom Intell Lab Syst 30:97–108