Lecture 15 - Metabolite Data Analysis (March 18)

Metabolomics : SCBT 33312

G. Prabhakaran

Learning Objectives

• General Data Analysis Issues in Metabolomics

• Multi / Megavariate Analysis

• Two major Approaches in Data Analysis

• Data Analysis (Software) tools

• Metabolite Databases

• Directions and potential benefits of metabolomic research

The structural similarities between marketed drugs ('drugs') and endogenous natural human metabolites ('metabolites' or 'endogenites') can be assessed using 'fingerprint' methods in common use.
Different procedures for data analysis (exploratory):

Selection -> Sampling -> Abstracting -> Data reduction -> Data organisation -> Data explanation
General Data Analysis Issues and Approaches in Metabolomics

The Multivariate platform examines multiple variables to see how they relate to each other.

Statistics: what is the probability that what was observed occurred by chance?
Data -> Information
- convert data into information and knowledge, and
- explore the relationship between variables

Multi / Megavariate Analysis
1. Clustering
2. Principal components
3. Pattern recognition
Too Much Process Data…
Drowning in data!

A typical industrial plant has hundreds of control loops, and thousands of measured variables, many of which are updated every few seconds.

This situation generates tens of millions of new data points each day, and billions of data points each year. Obviously, this is far too much for a human brain to absorb. Because of the way we visualise things, we are basically limited to looking at only one or two variables at a time.
Data-Rich but Knowledge-Poor

As a result, we have become "data-rich but knowledge-poor". The biggest problem is that interesting, useful patterns and relationships which are not intuitively obvious lie hidden inside enormous, unwieldy databases. Also, many variables are correlated.

This has led to the creation of "data-mining" techniques, aimed at extracting this useful knowledge. Some examples are:
• Neural Networks
• Multiple Regression
• Decision Trees
• Genetic Algorithms
• Clustering
• MVA (Multi / Megavariate Analysis)
"Mining" data
Data -> Information -> Knowledge

The aim of data-mining can be illustrated graphically as a pyramid:
• Data: unrelated facts (raw numbers)
• Information: facts plus relations (observed associations)
• Knowledge: information plus patterns (connectedness, scientific principles, understanding)

[Figure: DATA -> INFORMATION -> KNOWLEDGE pyramid]
What is MVA?

Multivariate analysis (MVA) is defined as the simultaneous analysis of more than five variables. Some people use the term "megavariate" analysis to denote cases where there are more than a hundred variables.

MVA uses ALL available data to capture the most information possible. The basic principle is to boil hundreds of variables down to a mere handful.

Nature is simple: "Don't make things more complicated than they need to be."
Example: Apples and Oranges

A good example of these ideas is "apple versus orange". Clever scientists could easily come up with hundreds of different things to measure on apples and oranges, to tell them apart:
1. Colour, shape, firmness, reflectivity, …
2. Skin: smoothness, thickness, morphology, …
3. Juice: water content, pH, composition, …
4. Seeds: colour, weight, size distribution, …
– etc.

However, there will never be more than one underlying difference: is it an apple or an orange? In MVA parlance, we would say that there is only one latent attribute (a hidden variable, as opposed to the observable variables).
Graphical Representation of MVA

The main element of MVA is the reduction in dimensionality. Taken to its extreme, this can mean going from hundreds of dimensions (variables) down to just two, allowing us to create a 2-dimensional graph.

Using these graphs, which our eyes and brains can easily handle, we are able to 'peer' into the database and identify trends and correlations. This is illustrated on the next page…
Graphical representation of MVA

[Figure: a raw data table (treatments; X variables; replicates; Y responses "avec"/"sans") with hundreds of columns and thousands of rows is impossible to interpret directly. A statistical model, internal to the software, converts it into 2-D visual outputs that reveal the trends in X and Y.]
Illustrative Data Set: Food Consumption in European Countries

To illustrate these concepts, we take an easy-to-understand example involving food. Data on food preferences in 16 different European countries are considered, involving the consumption patterns for 18 different food groups.

Look at the table on the following page. Can you tell anything from the raw numbers? Of course not. No one could.
Data Table: Food Consumption in European Countries

Note that MVA can handle up to 10-20% missing data.

Courtesy of Umetrics corp.
Score Plot

The MVA software generates two main types of plots to represent the data: Score plots and Loadings plots.

The first of these, the Score plot, shows all the original data points (observations) in a new set of coordinates or components. Each score is the value of that data point on one of the new component dimensions. The Score plot is the projection of the original data points onto a plane defined by two new components.

A score plot shows how the observations are arranged in the new component space. The score plot for the food data is shown on the next page. Note how similar countries cluster together…
Score Plot for Food Example

[Figure: score plot of the observations, with a 95% confidence interval (analogous to a t-test)]
Loadings Plot

The second type of data plot generated by the MVA software is the Loadings plot. This is the equivalent of the score plot, only from the point of view of the original variables.

Each component has a set of loadings, or weights, which express the projection of each original variable onto each new component. Loadings show how strongly each variable is associated with each new component. The loadings plot for the food example is shown on the next page. The further a variable lies from the origin, the more significant the correlation.

Note that the quadrants are the same on each type of plot. Sweden and Denmark are in the top-right corner; so are frozen fish and vegetables. Using both plots, variables and observations can be correlated with one another.
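The score/loadings duality can be sketched numerically. The following is a minimal sketch using a plain SVD in Python/numpy rather than dedicated MVA software; the miniature "consumption" table and all its numbers are invented for illustration.

```python
import numpy as np

# Toy "consumption" table: 6 countries (rows) x 4 food groups (columns).
# Rows 0-1, 2-3 and 4-5 are deliberately similar pairs. Values are invented.
X = np.array([
    [30.0,  2.0, 14.0,  5.0],
    [28.0,  3.0, 15.0,  6.0],
    [10.0, 20.0,  4.0, 18.0],
    [12.0, 22.0,  5.0, 17.0],
    [20.0, 10.0, 10.0, 11.0],
    [21.0, 11.0,  9.0, 12.0],
])

# Mean-centre each variable (column), then factor with the SVD: Xc = U S Vt.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = U * S      # one row per observation (country)  -> Score plot
loadings = Vt.T     # one row per variable (food group)  -> Loadings plot

# Keep only the first two components, giving 2-D plot coordinates.
scores_2d = scores[:, :2]
loadings_2d = loadings[:, :2]
print(scores_2d.shape, loadings_2d.shape)  # (6, 2) (4, 2)
```

Multiplying scores by loadings recovers the centred data exactly, which is why the two plots share the same quadrants: an observation sitting in the same quadrant as a variable's loading is relatively high on that variable.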
Use of loadings (illustration)

[Figure: projection of the old variables onto the new components; the Loadings plot shows the variables.]
To MVA, Data Overload is Good!

One great advantage of MVA is that the more data are available, the less noise matters (assuming that the noise is normally distributed). This is one of the reasons MVA is used to mine huge amounts of data.

This is analogous to NMR measurements in a laboratory. The more trials there are, the clearer the spectrum becomes:

[Figure: a single trial looks random; after 1500 trials the accumulated spectrum is not random at all, because the +ve and -ve noise cancels out.]
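The noise-cancellation argument can be checked in a few lines of code. This is an illustrative sketch assuming zero-mean Gaussian noise on an invented "spectrum"; it is not an NMR simulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# An invented "true" spectrum, plus zero-mean Gaussian noise on each trial.
true_signal = np.sin(np.linspace(0.0, 6.0, 200))

def one_trial():
    return true_signal + rng.normal(0.0, 1.0, size=true_signal.size)

single = one_trial()
averaged = np.mean([one_trial() for _ in range(1500)], axis=0)

# Averaging cancels the +ve and -ve noise, pulling the trace onto the signal.
err_single = np.abs(single - true_signal).mean()
err_avg = np.abs(averaged - true_signal).mean()
print(err_avg < err_single)  # True: roughly 1/sqrt(1500) of the single-trial noise
```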
Too Much Data is Good!

METABOLITE ANALYSIS

– Metabolomics projects generate large sets of data.
– Analysis of these large data sets is done using multivariate tools.
– Interpreting differences between samples with multivariate tools depends on describing metabolic differences in visually simple ways, relating relative differences to known metabolic pathways to pinpoint the impacted steps.
– Independent variables vs dependent variables
Multivariate data analysis

Analytical methods globally describing systems constituted by many variables, calculated or measured for a set of individuals.

Multivariate data analysis is defined as a set of methods allowing the simultaneous treatment of numerous, approximated variables, each bearing limited information; discrete or continuous; heterogeneous; qualitative and quantitative.

Data analysis answers questions, but also gives direction for future data collection.
Multivariate Analysis Methods

Two general types of MVA technique:

1. Analysis of dependence
• One (or more) variables are dependent variables, to be explained or predicted by others
– E.g. multiple regression, PLS, MDA

2. Analysis of interdependence
• No variables are thought of as "dependent"
• Look at the relationships among variables, objects or cases
– E.g. cluster analysis, factor analysis
PLS: Partial least squares regression is a linear regression method
Multi / Megavariate Analysis
1. Clustering
2. Principal components
3. Pattern recognition
Can you group these?

Partitional Clustering

Can you group these?

Hierarchical Clustering

Family tree: Dendrogram
Cluster Analysis

• Techniques for identifying separate groups of similar cases
– Similarity of cases is either specified directly in a distance matrix, or defined in terms of some distance function
• Also used to summarise data by defining segments of similar cases in the data
– This use of cluster analysis is known as "dissection"
Cluster Analysis: Techniques for identifying separate groups of similar cases
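A minimal sketch of partitional clustering (a k-means-style procedure) follows, with Euclidean distance assumed as the distance function; the data points and parameter choices are hypothetical.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Minimal partitional clustering (k-means style).

    Similarity is defined by a Euclidean distance function: each case is
    assigned to the nearest cluster centre, then centres are recomputed.
    """
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # distance matrix: cases x centres
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):        # guard against empty clusters
                centres[j] = X[labels == j].mean(axis=0)
    return labels, centres

# Two well-separated groups of 2-D cases (invented data).
X = np.vstack([
    np.random.default_rng(1).normal(0.0, 0.3, size=(10, 2)),
    np.random.default_rng(2).normal(5.0, 0.3, size=(10, 2)),
])
labels, centres = kmeans(X, k=2)
print(len(set(labels[:10])), len(set(labels[10:])))  # 1 1 -- one label per group
```

Hierarchical clustering differs only in how groups are formed: instead of partitioning around centres, it repeatedly merges the closest pair of clusters, producing the dendrogram ("family tree") shown on the slides.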
Principal Component Analysis

Principal Components Analysis is a method that reduces data dimensionality by performing a covariance analysis between factors. PCA is a Multi / Megavariate Analysis.

When plotting the five samples, a 3-dimensional space is required, and the axes would be the three variables. The PCA describes the relationship between the variables.
Principal Components Analysis (PCA)
• An exploratory technique used to reduce the
dimensionality of the data set to 2D or 3D
• Can be used to:
– Reduce number of dimensions in data
– Find patterns in high-dimensional data
– Visualize data of high dimensionality
• Example applications:
– Face recognition
– Image compression
– Gene expression analysis
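As a sketch of the idea, the following numpy code performs PCA via an eigen-decomposition of the covariance matrix, reducing invented high-dimensional data (loosely mimicking the 34-patient gene-expression example, with made-up numbers) to 2 dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: 34 "patients" x 100 "genes", constructed so that the
# real structure lives in just 2 latent dimensions plus a little noise.
latent = rng.normal(size=(34, 2))
mixing = rng.normal(size=(2, 100))
X = latent @ mixing + 0.05 * rng.normal(size=(34, 100))

# PCA: eigen-decomposition of the covariance matrix of the centred data.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]              # sort components by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

X_2d = Xc @ eigvecs[:, :2]                     # project onto the top 2 PCs
explained = eigvals[:2].sum() / eigvals.sum()
print(X_2d.shape, explained > 0.9)             # (34, 2) True
```

Because the data were built from two latent dimensions, the first two principal components capture almost all the variance; real gene-expression data would need more components, which is why the slides also show PCA on only the top significant genes.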
PCA on all Genes
Leukemia data, precursor B and T
Plot of 34 patients, dimension of 8973 genes reduced to 2

PCA on 100 top significant genes
Leukemia data, precursor B and T
Plot of 34 patients, dimension of 100 genes reduced to 2

Given experience, what can we know about unknowns?

[Figure: unknowns classified as "Probably Sad" or "Probably Happy"]

Pattern Recognition
Multi / Megavariate Analysis

1. Clustering
2. Principal components
3. Pattern recognition

Diversity Analysis: grouping (clustering), the degree of relationship within a group, and the distance between clusters.
What is Multi / Megavariate Analysis?

• Simplifying large data sets for human consideration

– Clustering and Principal Components

• Pattern Recognition:

– Classifying unknowns into previously defined groups
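Classifying unknowns into previously defined groups can be sketched with a nearest-centroid rule. This is a hypothetical toy, not the method behind the slides: the feature values and group names are invented for illustration.

```python
import numpy as np

def nearest_centroid(train_X, train_labels, unknown):
    """Assign an unknown to the previously defined group whose
    centroid (mean point) is nearest in Euclidean distance."""
    train_labels = np.asarray(train_labels)
    groups = sorted(set(train_labels.tolist()))
    dists = {g: np.linalg.norm(unknown - train_X[train_labels == g].mean(axis=0))
             for g in groups}
    return min(dists, key=dists.get)

# Invented 2-D "features" for two known groups.
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],      # "sad" examples
              [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])     # "happy" examples
labels = ["sad", "sad", "sad", "happy", "happy", "happy"]

print(nearest_centroid(X, labels, np.array([0.3, 0.1])))   # sad
print(nearest_centroid(X, labels, np.array([4.8, 5.2])))   # happy
```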

DATA ANALYSIS (SOFTWARE) TOOLS

Thank you

WISH YOU A BRIGHT CAREER

Basic Statistics
It is assumed that the student is familiar with the following basic
statistical concepts:

• Mean / median / mode


• Standard deviation / variance
• Normality / symmetry
• Degree of association
– Correlation coefficients
• Degree of explanation
– R2, F-test
• Significance of differences
– t-test, Chi-square

If not, or if it's been a while, it is advisable to consult an introductory statistics text and do a cursory review.

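For a cursory review, the concepts listed above can be computed directly in numpy. The sample values are taken from the Y columns of the raw-data illustration earlier in the lecture; the t-test reference value of 2.5 is an arbitrary choice for the example.

```python
import numpy as np

# Sample values from the raw-data illustration earlier in the lecture.
x = np.array([2.51, 2.36, 2.45, 2.63, 2.55, 2.65])
y = np.array([2.74, 3.22, 2.56, 3.23, 2.47, 2.31])

mean = x.mean()
median = np.median(x)
sd = x.std(ddof=1)               # sample standard deviation (divides by n-1)
var = x.var(ddof=1)              # sample variance = sd**2

# Degree of association: Pearson correlation coefficient.
r = np.corrcoef(x, y)[0, 1]

# Significance of a difference: one-sample t statistic for H0: true mean == 2.5.
t = (mean - 2.5) / (sd / np.sqrt(len(x)))
print(round(mean, 3), round(median, 3))  # 2.525 2.53
```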
NAMP Module 17: “Introduction to Multivariate Analysis” Tier 1, Part 1, Rev.: 0


Statistical Tests

Classical statistics is severely hampered by certain assumptions about data:
- All values are accurate
- All variables are uncorrelated
- There are no missing data

For real process data, such assumptions are totally unrealistic.

Statistical tests help characterise an existing dataset. They do NOT enable you to make predictions about future data. For this we must turn to regression techniques…



Regression

Regression can be summarised as follows:

• Take a set of data points, each described by a vector of values (y, x1, x2, … xn)

• Find an algebraic equation

y = b1x1 + b2x2 + … + bnxn + e

that "best expresses" the relationship between y and the xi's.

• This equation can be used to predict a new y-value given new xi’s.
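A sketch of this in code, using numpy's least-squares solver on invented data generated from known coefficients (b1 = 2, b2 = -1, intercept 0.5):

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data generated from y = 2*x1 - 1*x2 + 0.5 + noise.
n = 50
X = rng.normal(size=(n, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 + 0.1 * rng.normal(size=n)

# Append a column of ones so the intercept is fitted along with b1, b2.
A = np.column_stack([X, np.ones(n)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
b1, b2, b0 = coef

# Predict a new y-value from new x-values.
x_new = np.array([1.0, 2.0])
y_pred = b1 * x_new[0] + b2 * x_new[1] + b0
print(np.round(coef, 1))  # close to [ 2.  -1.   0.5]
```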

Independent vs. Dependent Variables

• The xi's in the preceding equation are called independent variables. They are used to predict y.

• y is called the dependent variable because, the way the equation is written, its value depends on the xi's.
Simple vs. Multiple Regression

• Simple regression has only one x:

y = bx + e

• Multiple regression has more than one x:

y = b1x1 + b2x2 + … + bnxn + e
Linear vs. Nonlinear Regression

• Linear regression involves no powers of xi (square, cube, etc.) and no cross-product terms of the form xixj.

• If such terms are present, we are dealing with nonlinear regression.
The Error Term e
• The error term expresses the uncertainty in an empirical predictive
equation derived from imperfect observations.

• Factors contributing to the error term include:


– measurement error
– measurement noise
– unaccounted-for natural variations
– disturbances to the process being measured

The Least Squares Principle

• Regression tries to produce a "best fit" equation, but what is "best"?

• Criterion: minimize the sum of squared deviations of the data points from the regression line.
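The criterion can be verified directly in code: perturbing the fitted coefficients in any direction increases the sum of squared deviations. The data points here are hypothetical.

```python
import numpy as np

# Hypothetical points scattered around a straight line.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 1.9, 4.2, 5.8, 8.1])

def sse(b, a):
    """Sum of squared deviations of the points from the line y = b*x + a."""
    return float(np.sum((y - (b * x + a)) ** 2))

# Least-squares fit of y = b*x + a.
(b, a), *_ = np.linalg.lstsq(np.column_stack([x, np.ones_like(x)]), y, rcond=None)

# The fitted line beats any perturbed line on the least-squares criterion.
print(sse(b, a) < sse(b + 0.1, a), sse(b, a) < sse(b, a + 0.1))  # True True
```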
