Lecture 15 - Metabolite Data Analysis March 18 Taken
Lecture 15
G. Prabhakaran
Learning Objectives
• Metabolite Databases
The structural similarities between marketed drugs (‘drugs’) and endogenous natural human
metabolites (‘metabolites’ or ‘endogenites’), using ‘fingerprint’ methods in common use.
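As a rough illustration of a fingerprint comparison, the sketch below computes the Tanimoto coefficient between two bit fingerprints. The slide does not say which similarity measure was used, so Tanimoto is an assumption here, and the bit positions are made up for illustration rather than derived from real molecules.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of 'on' bit positions."""
    if not fp_a and not fp_b:
        return 1.0  # convention chosen for this sketch: two empty fingerprints are identical
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


# Hypothetical fingerprints (invented bit positions, not real compounds)
drug_fp = {1, 4, 7, 9, 15}
metabolite_fp = {1, 4, 8, 9, 20}

similarity = tanimoto(drug_fp, metabolite_fp)
print(f"Tanimoto similarity: {similarity:.3f}")
```

A value of 1 means the two fingerprints set exactly the same bits; a value of 0 means they share none.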
Different Procedures for data analysis
(Exploratory)
Selection -> Sampling -> Abstracting -> Data reduction -> Data organisation -> Data explanation
The Multivariate platform examines multiple variables to see how they relate to each other
General Data Analysis Issues and Approaches
in Metabolomics
Statistics
Information
- convert data into information and knowledge, and
- explore the relationship between variables
Data-Rich but Knowledge-Poor
As a result, we have become “data-rich but knowledge-poor”.
Raw Numbers
DATA
Understanding
What is MVA?
Multivariate analysis (MVA) is defined as the simultaneous analysis of
more than five variables. Some people use the term “megavariate”
analysis to denote cases where there are more than a hundred variables.
MVA uses ALL available data to capture the most information possible.
The basic principle is to boil hundreds of variables down to a
mere handful.
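One way to see this "boiling down" is principal component analysis computed with a singular value decomposition. The sketch below uses synthetic data (200 variables generated from 3 hidden factors, all invented for illustration) and checks how much of the variance a handful of components retains. The function name `pca` and the data sizes are assumptions, not something given in the lecture.

```python
import numpy as np


def pca(X: np.ndarray, n_components: int):
    """Return (scores, loadings, explained_variance_ratio) for the mean-centred data."""
    Xc = X - X.mean(axis=0)                       # centre each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components].T
    ratio = (s ** 2) / np.sum(s ** 2)             # share of total variance per component
    return scores, loadings, ratio[:n_components]


rng = np.random.default_rng(0)
# Synthetic data: 50 observations of 200 variables driven by 3 hidden factors plus small noise
hidden = rng.normal(size=(50, 3))
mixing = rng.normal(size=(3, 200))
X = hidden @ mixing + 0.1 * rng.normal(size=(50, 200))

scores, loadings, ratio = pca(X, n_components=3)
print("reduced shape:", scores.shape)
print("variance retained by 3 components:", round(float(ratio.sum()), 3))
```

Because the synthetic data really were generated from three factors, three components retain almost all of the variance. Real data rarely behave this cleanly.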
MVA
“Don’t make things more complicated than they need to be.” Nature is simple…
Example: Apples and Oranges
A good example of these ideas is “Apple versus Orange”.
+1 -1
However, there will never be more than one difference: is it an
apple or an orange? In MVA parlance, we would say that there is
only one latent attribute (a hidden variable, as opposed to an
observable variable).
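The idea of a single latent attribute can be sketched numerically. Below, 30 measured variables are all generated from one hidden +1/-1 attribute (coded here as +1 for apple and -1 for orange, which is an assumption about the slide's coding), and the first principal component recovers it. The weights and noise level are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hidden attribute: +1 for apple, -1 for orange (assumed coding), 20 fruits of each kind
fruit = np.repeat([1.0, -1.0], 20)

# Hypothetical measurements: each of 30 variables responds to the hidden attribute
# with its own weight and sign, plus random noise
weights = rng.uniform(0.5, 2.0, size=30) * rng.choice([-1.0, 1.0], size=30)
X = np.outer(fruit, weights) + 0.3 * rng.normal(size=(40, 30))

Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
t1 = U[:, 0] * s[0]                          # scores on the first component
explained = s[0] ** 2 / np.sum(s ** 2)       # variance share of the first component

# The sign of a component is arbitrary, so compare using the absolute correlation
corr = abs(np.corrcoef(t1, fruit)[0, 1])
print("variance explained by component 1:", round(float(explained), 3))
print("|correlation| with hidden attribute:", round(float(corr), 3))
```

Thirty observed columns collapse onto one component that tracks the hidden apple/orange label.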
Graphical Representation of MVA
The main element of MVA is the reduction in dimensionality. Taken to its extreme,
this can mean going from hundreds of dimensions (variables) down to just two,
allowing us to create a 2-dimensional graph.
Using these graphs, which our eyes and brains can easily handle, we
are able to ‘peer’ into the database and identify trends and
correlations.
This is illustrated on
the next page…
[Figure: raw data (hundreds of columns, thousands of rows), impossible to interpret, reduced to 2-D visual outputs in which trends in X and Y become visible]
Illustrative Data Set: Food Consumption
in European Countries
To illustrate these concepts, we take an easy-to-understand
example involving food.
Look at the table on the following page. Can you tell anything
from the raw numbers? Of course not. No one could.
Data Table: Food Consumption in
European Countries
Note that MVA can handle up to 10-20% missing data.
Courtesy of Umetrics corp.
Score Plot
The MVA software generates two main types of plots to represent
the data: Score plots and Loadings plots.
The first of these, the Score plot, shows all the original data points
(observations) in a new set of coordinates or components. Each
score is the value of that data point on one of the new component
dimensions:
The Score Plot is the projection of the original data points onto a plane
defined by two new components.
A score plot shows how the observations are arranged in the new
component space. The score plot for the food data is shown on
the next page. Note how similar countries cluster together…
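The scores can be computed as a plain projection. The sketch below uses a random stand-in table (the sizes are arbitrary and the numbers are not the food data) and takes the two new component directions from a singular value decomposition, which is one common way to obtain them. The commercial MVA software may use a different algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(16, 20))        # synthetic stand-in: 16 observations x 20 variables

mean = X.mean(axis=0)
Xc = X - mean                        # centre each variable
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

P = Vt[:2].T                         # directions of the two new components (20 x 2)
T = Xc @ P                           # scores: coordinates of each observation (16 x 2)

# A new observation is placed on the same plot by the same projection
new_obs = rng.normal(size=20)
new_score = (new_obs - mean) @ P

print("score matrix shape:", T.shape)
print("score of new observation:", np.round(new_score, 3))
```

Plotting the first column of `T` against the second gives the score plot; observations with similar rows in `X` land close together.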
Score Plot for Food Example
95% confidence interval (analogous to t-test)
Score Plot = observations
Loadings Plot
The second type of data plot generated by the MVA software is the
Loadings plot. This is the equivalent of the score plot, only from the
point of view of the original variables.
Note that the quadrants are the same on each type of plot. Sweden
and Denmark are in the top-right corner; so are frozen fish and
vegetables. Using both plots, variables and observations can be
correlated with one another.
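The link between the two plots can be checked on a small synthetic table. In this sketch (invented numbers, not the food data), the first five observations are high in variables 0 and 1 and the last five are high in variables 2 and 3. The observations and the variables that characterise them end up on the same side of the first component.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic table: 10 observations x 4 variables
base = np.zeros((10, 4))
base[:5, :2] = 3.0                   # group A is high in variables 0 and 1
base[5:, 2:] = 3.0                   # group B is high in variables 2 and 3
X = base + 0.2 * rng.normal(size=(10, 4))

Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores_pc1 = U[:, 0] * s[0]          # one value per observation (score plot axis)
loadings_pc1 = Vt[0]                 # one value per variable (loadings plot axis)

group_a_side = np.sign(scores_pc1[:5].mean())
print("group A sits on side:", group_a_side)
print("signs of loadings:", np.sign(loadings_pc1))
```

Variables 0 and 1 share the sign of group A and variables 2 and 3 share the sign of group B, which is why the quadrants of the two plots can be read together.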
Use of loadings (illustration)
Projection of old variables onto new components
Loadings Plot = variables
To MVA, Data Overload is Good!
One great advantage of MVA is that the more data are available,
the less noise matters (assuming that the noise is normally
distributed). This is one of the reasons MVA is used to mine huge
amounts of data.
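The effect of averaging out normally distributed noise can be simulated. The sketch below compares the typical error of a sample mean after 5 trials and after 1500 trials. The underlying value, the noise level and the number of repeats are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
true_signal = 2.5      # hypothetical underlying value
noise_sd = 1.0         # standard deviation of the normally distributed noise


def mean_error(n_trials: int, n_repeats: int = 500) -> float:
    """Average absolute error of the sample mean over many repeated experiments."""
    samples = true_signal + noise_sd * rng.normal(size=(n_repeats, n_trials))
    return float(np.mean(np.abs(samples.mean(axis=1) - true_signal)))


err_few = mean_error(5)
err_many = mean_error(1500)
print("typical error with 5 trials:   ", round(err_few, 4))
print("typical error with 1500 trials:", round(err_many, 4))
```

Positive and negative noise cancels as trials accumulate, so the error shrinks roughly with the square root of the number of trials.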
[Figure, three panels: data that looks random turns out, after 1500 trials, to be not random at all (+ve and –ve noise cancels out)]
Too Much Data is Good!
METABOLITE ANALYSIS
Data analysis answers but also gives direction for future data collection
Multivariate Analysis Methods
Two general types of MVA technique
1. Analysis of dependence
2. Analysis of interdependence
PLS: Partial least squares regression is a linear regression method
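A minimal sketch of single-response PLS (often called PLS1) is given below, using the usual NIPALS-style deflation. It is an illustration under stated assumptions, not the implementation used by any particular package: the function names `pls1_fit` and `pls1_predict` are hypothetical and the data are synthetic.

```python
import numpy as np


def pls1_fit(X: np.ndarray, y: np.ndarray, n_components: int):
    """Fit PLS1 by successive deflation. Returns (coef, x_mean, y_mean)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w = w / np.linalg.norm(w)        # weight: X direction that covaries most with y
        t = Xk @ w                       # scores of the observations on this component
        tt = t @ t
        p = Xk.T @ t / tt                # X loadings
        c = yk @ t / tt                  # y loading (regression of y on the scores)
        Xk = Xk - np.outer(t, p)         # remove what this component explains from X
        yk = yk - c * t                  # and from y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)   # coefficients in the original variables
    return coef, x_mean, y_mean


def pls1_predict(X_new: np.ndarray, coef: np.ndarray, x_mean: np.ndarray, y_mean: float):
    """Linear prediction from the fitted coefficients."""
    return (X_new - x_mean) @ coef + y_mean


rng = np.random.default_rng(5)
X = rng.normal(size=(40, 5))                                 # synthetic predictors
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=40)

coef, x_mean, y_mean = pls1_fit(X, y, n_components=2)
y_hat = pls1_predict(X, coef, x_mean, y_mean)
print("coefficients with 2 components:", np.round(coef, 2))
```

The final model is still a linear equation in the original variables; PLS differs from ordinary least squares in how it arrives at the coefficients, by way of a few components chosen to covary with y.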
Can you group these?
Hierarchical Clustering
Family tree: Dendrogram
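The family tree is built by repeatedly merging the two closest groups. The sketch below does this with single linkage (distance between the closest members), which is only one of several possible linkage rules and is an assumption here. The sample names and coordinates are invented.

```python
from itertools import combinations


def single_linkage(points: dict):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, distance) in order."""
    def dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5

    clusters = {name: [name] for name in points}   # cluster label -> member names
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(list(clusters), 2):
            # Single linkage: distance between the closest pair of members
            d = min(dist(points[i], points[j]) for i in clusters[a] for j in clusters[b])
            if best is None or d < best[2]:
                best = (a, b, d)
        a, b, d = best
        merges.append(best)
        clusters[f"({a},{b})"] = clusters.pop(a) + clusters.pop(b)
    return merges


# Hypothetical samples described by two measurements each
samples = {"s1": (0.0, 0.0), "s2": (0.0, 1.0), "s3": (5.0, 5.0), "s4": (5.0, 6.5)}
merges = single_linkage(samples)
for a, b, d in merges:
    print(f"merge {a} + {b} at distance {d:.2f}")
```

Each merge corresponds to one branch point of the dendrogram, and the merge distance gives the height at which the branches join.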
Cluster Analysis
Principal Component Analysis
PCA on 100 top significant genes
Leukemia data, precursor B and T
Plot of 34 patients, dimension of 100 genes reduced to 2
Given experience, what can we
know about unknowns
Multi / Megavariate Analysis
1. Clustering
2. Principal components
3. Pattern recognition
Diversity Analysis:
Grouping (clustering): degree of relationship within the group, distance between clusters
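Both quantities can be put into numbers. The sketch below measures the within-group relationship as the average distance of members to their own centroid, and the distance between clusters as the distance between centroids. These are just one possible pair of definitions, and the two groups are synthetic.

```python
import numpy as np


def within_spread(cluster: np.ndarray) -> float:
    """Average distance of the members to their own centroid (tightness of the group)."""
    return float(np.linalg.norm(cluster - cluster.mean(axis=0), axis=1).mean())


def between_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Distance between the centroids of two clusters."""
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))


rng = np.random.default_rng(6)
# Two synthetic groups of 20 samples with 3 measurements each
group_a = rng.normal(loc=0.0, scale=0.5, size=(20, 3))
group_b = rng.normal(loc=4.0, scale=0.5, size=(20, 3))

spread_a = within_spread(group_a)
spread_b = within_spread(group_b)
separation = between_distance(group_a, group_b)
print("within-group spread:", round(spread_a, 2), round(spread_b, 2))
print("between-cluster distance:", round(separation, 2))
```

A grouping is convincing when the between-cluster distance is large compared with the within-group spread.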
What is Multi / Megavariate Analysis?
• Pattern Recognition:
DATA ANALYSIS (SOFTWARE) TOOLS
Thank you
Basic Statistics
It is assumed that the student is familiar with the following basic
statistical concepts:
For real process data, such assumptions are totally unrealistic.
Statistical tests help characterise an
existing dataset. They do NOT enable
you to make predictions about future
data. For this we must turn to
regression techniques…
• This equation can be used to predict a new y-value given new xi’s.
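The equation itself did not survive on this slide, so the sketch below assumes the usual multiple linear regression form y = b0 + b1·x1 + b2·x2 + b3·x3. It fits the coefficients by least squares on synthetic data and then predicts y for a new set of x values. All coefficient values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic training data generated from assumed coefficients
X = rng.normal(size=(30, 3))
true_b0, true_b = 1.0, np.array([2.0, -1.0, 0.5])
y = true_b0 + X @ true_b + 0.05 * rng.normal(size=30)

# Prepend a column of ones so that the intercept b0 is estimated as well
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predict a new y-value from new x values
x_new = np.array([0.2, -0.4, 1.0])
y_new = coef[0] + x_new @ coef[1:]
print("fitted coefficients:", np.round(coef, 2))
print("predicted y for new x:", round(float(y_new), 2))
```

Once the coefficients are estimated, prediction is a single evaluation of the equation at the new x values.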
NAMP Module 17: “Introduction to Multivariate Analysis” Tier 1, Part 1, Rev.: 0
Linear vs. Nonlinear Regression
XiXj, X^2, X^3
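One common reading of these terms is that cross-products and powers of the x variables are added as extra columns, after which the model can still be fitted by least squares. The sketch below takes that reading as an assumption and compares a purely linear fit with a fit that includes an X1·X2 and an X2^2 term. The data and coefficients are synthetic.

```python
import numpy as np

rng = np.random.default_rng(8)
x1, x2 = rng.uniform(-2, 2, size=(2, 60))

# Synthetic response containing an interaction term and a squared term
y = 1.0 + 0.5 * x1 - 1.5 * x1 * x2 + 0.8 * x2 ** 2 + 0.05 * rng.normal(size=60)


def fit_rss(columns):
    """Least-squares fit with an intercept; returns (coefficients, residual sum of squares)."""
    A = np.column_stack([np.ones_like(x1), *columns])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return coef, float(resid @ resid)


_, rss_linear = fit_rss([x1, x2])
coef_nl, rss_nonlinear = fit_rss([x1, x2, x1 * x2, x2 ** 2])
print("residual sum of squares, linear terms only:", round(rss_linear, 2))
print("residual sum of squares, with X1*X2 and X2^2:", round(rss_nonlinear, 2))
```

The purely linear model cannot describe the curvature and the interaction, so its residuals stay large; adding the nonlinear terms removes most of them.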