Molecular Modeling: Statistical Analysis of Complex Data


Terminology

SAR (Structure-Activity Relationships)
Circa 19th century?

QSPR (Quantitative Structure-Property Relationships)
Relate structure to any physical-chemical property of a molecule

QSAR (Quantitative Structure-Activity Relationships)
Specific to some biological/pharmaceutical function of a molecule (Absorption, Distribution/Digestion, Metabolism, Excretion)
Crum Brown and Fraser (1868-9)
Constitution related to biological response

LogP

Statistical Models
Simple
Mean, median and variation
Regression

Advanced
Validation methods
Principal components, co-variance
QSAR,QSPR
Multiple Regression

Modern QSAR
Hansch et al. (1963)
Activity depends on travel through the body (partitioning between varied solvents)

C (minimum dosage required)
π (hydrophobicity)
σ (electronic)
Es (steric)
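
A commonly quoted linear form of the Hansch relation combines the hydrophobic, electronic and steric terms listed above. It is given here as an assumption about the general shape of the model, not as the exact equation from the original slide; the coefficients $k_1, \ldots, k_4$ are hypothetical fitted constants.

$$\log\frac{1}{C} = k_1\,\pi + k_2\,\sigma + k_3\,E_s + k_4$$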

Choosing Descriptors
Buffon's Problem

[Figure: five needles]

Length?
Color?
Composition?
Sheen?
Orientation?

Choosing Descriptors
Constitutional
MW, Natoms

Topological
Connectivity, Wiener index

Electrostatic
Polarity, polarizability, partial charges

Geometrical Descriptors
Length, width, Molecular volume

Quantum Chemical

HOMO and LUMO energies


Vibrational frequencies
Bond orders
Energy total

Choosing Descriptors
Constitutional
MW, Natoms of element

Topological
Connectivity, Wiener index (sums of bond distances)
2D Fingerprints (bit-strings)
3D topographical indices, pharmacophore keys

Electrostatic
Polarity, polarizability, partial charges

Geometrical Descriptors
Length, width, Molecular volume

Choosing Descriptors
Chemical

Hydrophobicity (LogP)
HOMO and LUMO energies
Vibrational frequencies
Bond orders
Energy total
GSH

Statistical Methods
1-D analysis
Large dimension sets require
decomposition techniques
Multiple Regression
PCA
PLS

Connecting a descriptor with a structural element so as to interpolate and extrapolate data

Simple Error Analysis (1D)

Given N data points
Mean
Variance

Regression

$$R = \frac{\mathrm{Cov}(X,Y)}{\mathrm{Std}(X)\,\mathrm{Std}(Y)}$$
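
The mean, variance and correlation coefficient R named on this slide can be sketched in a few lines of plain Python. This is a minimal illustration that assumes the 1/N (population) convention; the `log_p` and `activity` values are hypothetical and not taken from the slides.

```python
import math

def mean(values):
    return sum(values) / len(values)

def variance(values):
    """Population variance (divides by N)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def covariance(xs, ys):
    """Population covariance of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """R = Cov(X, Y) / (Std(X) * Std(Y))."""
    return covariance(xs, ys) / math.sqrt(variance(xs) * variance(ys))

# Hypothetical descriptor (LogP) and activity values for five compounds.
log_p = [0.5, 1.0, 1.5, 2.0, 2.5]
activity = [1.1, 1.9, 3.2, 3.9, 5.1]

print("mean:", mean(log_p))
print("variance:", variance(log_p))
print("R:", correlation(log_p, activity))
```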

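Simple Error Analysis (1D)

Given N data points
Regression

[Figure: regression line of y against x, showing calculated (calc) and observed (obs) values]

$$y_{\mathrm{residual},i} = y_{\mathrm{calc},i} - y_{\mathrm{obs},i}$$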

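Simple Error Analysis (1D)

Given N data points
(Poor) 0 < R^2 < 1 (Good)

$$R = \frac{\mathrm{Cov}(X,Y)}{\mathrm{Std}(X)\,\mathrm{Std}(Y)}, \qquad R^2 = \frac{SSR}{N\,\mathrm{Std}(Y)^2}$$

$$SSR = \sum_{i=1}^{N}\left(y_{\mathrm{calc},i} - \bar{y}\right)^2$$

$$\mathrm{Cov}(X,Y) = \frac{1}{N}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)$$
Correlation between fluctuations

$$\mathrm{Std}(Y) = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2}$$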
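
As a rough check on how these quantities fit together, the sketch below fits a straight line by least squares and compares R^2 computed from SSR with the square of R computed from the covariance. It assumes a single descriptor and the 1/N convention; the data values and function names are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mx) ** 2 for x in xs) / n
    slope = cov / var_x
    return slope, my - slope * mx

def r_squared_from_ssr(xs, ys):
    """R^2 = SSR / (N * Std(Y)^2), with SSR built from the calculated values."""
    n = len(ys)
    slope, intercept = fit_line(xs, ys)
    my = sum(ys) / n
    ssr = sum((slope * x + intercept - my) ** 2 for x in xs)
    n_var_y = sum((y - my) ** 2 for y in ys)  # equals N * Std(Y)^2
    return ssr / n_var_y

def r_from_covariance(xs, ys):
    """R = Cov(X, Y) / (Std(X) * Std(Y))."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)

# Hypothetical data: one descriptor and one observed property.
x = [0.5, 1.0, 1.5, 2.0, 2.5]
y = [1.1, 1.9, 3.2, 3.9, 5.1]

slope, intercept = fit_line(x, y)
print("slope, intercept:", slope, intercept)
print("R^2 from SSR:", r_squared_from_ssr(x, y))
print("R^2 from Cov:", r_from_covariance(x, y) ** 2)
```

For a one-descriptor least-squares line the two routes to R^2 agree, which is why both expressions can appear side by side.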

Correlation vs. Dependence?
Correlation
Two or more variables/descriptors may correlate to the same property of a system

Dependence
When the correlation can be shown to arise from one variable changing in response to a change in another

Ex. Elephant's head and legs
Correlation exists between size of head and legs
The size of one does not depend on the size of the other

Quantitative Structure
Activity/Property
Relationships (QSAR,QSPR)

Discern relationships between multiple variables (descriptors)
Identify connections between structural traits (type of substituents, bond angles, substituent locale) and descriptor values (e.g. activity, LogP, % denaturation)

Pre-Qualifications
Size
Minimum of FIVE samples per
descriptor

Verification
Variance
Scaling
Correlations

QSAR/QSPR
Pre-Qualifications
Variance
Coefficient of Variation

QSAR/QSPR
Pre-Qualifications
Scaling
Standardizing or normalizing
descriptors to ensure they have equal
weight (in terms of magnitude) in
subsequent analysis

QSAR/QSPR
Pre-Qualifications
Scaling
Unit Variance (Auto Scaling)
Ensures equal statistical weights
(initially)

Mean Centering
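
A minimal sketch of two of the pre-qualification steps: the coefficient of variation (taken here as standard deviation divided by mean, which is an assumption about the definition the slide intended) and autoscaling (mean centering followed by division by the standard deviation). The molecular-weight values are hypothetical.

```python
import math

def mean(values):
    return sum(values) / len(values)

def std(values):
    """Population standard deviation (divides by N)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def coefficient_of_variation(values):
    """Std / |mean|; undefined when the mean is zero."""
    return std(values) / abs(mean(values))

def autoscale(values):
    """Mean-center, then scale to unit variance."""
    m, s = mean(values), std(values)
    return [(v - m) / s for v in values]

# Hypothetical molecular weights for four compounds.
mol_weight = [78.11, 92.14, 106.17, 120.19]
scaled = autoscale(mol_weight)

print("CV:", coefficient_of_variation(mol_weight))
print("scaled:", scaled)
```

After autoscaling every descriptor has mean 0 and standard deviation 1, so no descriptor dominates later analysis purely because of its units.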

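QSAR/QSPR
Pre-Qualifications
Correlations

Correlation between the i-th and j-th descriptor (k runs over the M samples):

$$R_{i,j} = \frac{\mathrm{Cov}(X_i, X_j)}{\mathrm{Std}(X_i)\,\mathrm{Std}(X_j)}$$

$$R_{i,j} = \frac{\sum_{k=1}^{M}\left(X_{i,k} - \bar{X}_i\right)\left(X_{j,k} - \bar{X}_j\right)}{\sqrt{\sum_{k=1}^{M}\left(X_{i,k} - \bar{X}_i\right)^2\,\sum_{k=1}^{M}\left(X_{j,k} - \bar{X}_j\right)^2}}$$

Remove correlated descriptors
Keep correlated descriptors so as to reduce data set size
Apply math operation to remove correlation (PCR)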
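
One way to act on the descriptor correlation matrix is sketched below with NumPy: compute R for all descriptor pairs, then keep a descriptor only if it is not strongly correlated with one already kept. The 0.9 threshold, the function name and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def drop_correlated(X, threshold=0.9):
    """Return the column indices kept after removing correlated descriptors.

    X has one row per molecule and one column per descriptor.
    """
    R = np.corrcoef(X, rowvar=False)  # descriptor-by-descriptor correlations
    keep = []
    for j in range(X.shape[1]):
        if all(abs(R[j, k]) <= threshold for k in keep):
            keep.append(j)
    return keep

# Synthetic data: descriptor 1 is almost a multiple of descriptor 0.
rng = np.random.default_rng(0)
a = rng.normal(size=50)
b = rng.normal(size=50)
X = np.column_stack([a, 2.0 * a + 0.01 * rng.normal(size=50), b])

kept = drop_correlated(X)
print("kept descriptor columns:", kept)
```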

QSAR/QSPR
Pre-Qualifications
Correlations

QSAR/QSPR Scheme
Goal
Predict what happens next
(extrapolate)!
Predict what happens between data
points (interpolate)!

QSAR/QSPR Scheme
Types of Variable
Continuous
Concentration, occupied volume,
partition coefficient, hydrophobicity

Discrete
Structural (1-meta substituted, 0-no
meta substitution)

QSAR/QSPR-Principal
Components Analysis
Reduces dimensionality of
descriptors
Principal components are a set of
vectors representing the variance
in the original data

QSAR/QSPR-Principal
Components Analysis
Geometric Analogy (3-D to 2-D PCA)
[Figure: data plotted on x, y and z axes]

QSAR/QSPR-Principal
Components Analysis
Formulate matrix
Diagonalize matrix
Eigenvectors are the principal components
These principal components (new descriptors)
are a linear combination of the original
descriptors

Eigenvalues represent variance
Largest accounts for greatest % of data variance
Next corresponds to second greatest and so on

QSAR/QSPR-Principal
Components Analysis
Formulate matrix (Several types)
Correlation or covariance (N x P)
N is number of molecules
P is number of descriptors

Variance-Covariance matrix (N x N)

Diagonalize (Rotate) matrix
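
The formulate-then-diagonalize recipe can be sketched with NumPy as follows. This version assumes the matrix being diagonalized is the P x P covariance matrix of the mean-centered descriptors (one of the several choices mentioned above) and uses the 1/N convention; the function name and data are hypothetical.

```python
import numpy as np

def pca(X):
    """PCA by diagonalizing the descriptor covariance matrix.

    X has N rows (molecules) and P columns (descriptors).
    Returns eigenvalues (descending), loadings (columns) and scores.
    """
    Xc = X - X.mean(axis=0)                 # mean-center each descriptor
    cov = Xc.T @ Xc / X.shape[0]            # P x P covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # symmetric eigenproblem
    order = np.argsort(eigvals)[::-1]       # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs                   # molecules in PC coordinates
    return eigvals, eigvecs, scores

# Synthetic data: 50 molecules, 4 descriptors with different spreads.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4)) * np.array([3.0, 2.0, 1.0, 0.5])

eigvals, loadings, scores = pca(X)
print("fraction of variance per PC:", eigvals / eigvals.sum())
```

Each column of `loadings` gives the linear combination of original descriptors that defines one principal component, and the variance of each score column equals the matching eigenvalue.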

QSAR/QSPR-Principal
Components Analysis
Eigenvectors (Loadings)
Represents contribution from each original
descriptor to PC (new descriptor)
# columns = # of descriptors
# rows = # of descriptors OR # of molecules

Eigenvalues
Indicate which PC most important
(representative of original descriptors)
Benzene has 2 non-zero and 1 zero eigenvalue
(planar)

QSAR/QSPR-Principal
Components Analysis
Scores
Graphing each object/molecule
in space of 2 or more PCs
# rows = # of objects/molecules
# columns = # of descriptors OR #
of molecules
For benzene corresponds to graph in PC1 (x)
and PC2 (y) system
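
The benzene remark can be checked numerically. The sketch below places six carbon atoms on a regular hexagon, tilts the ring so it is not aligned with the coordinate axes, and runs PCA on the coordinates. The ring radius of 1.39 Å and the 30 degree tilt are assumed values chosen only for illustration.

```python
import numpy as np

r = 1.39  # assumed ring radius in angstroms (roughly the C-C distance)
angles = np.deg2rad(np.arange(0, 360, 60))
flat = np.column_stack([r * np.cos(angles), r * np.sin(angles), np.zeros(6)])

# Tilt the ring by 30 degrees about the x axis.
t = np.deg2rad(30)
rot_x = np.array([[1.0, 0.0, 0.0],
                  [0.0, np.cos(t), -np.sin(t)],
                  [0.0, np.sin(t), np.cos(t)]])
coords = flat @ rot_x.T

# PCA on the atomic coordinates (three "descriptors": x, y, z).
centered = coords - coords.mean(axis=0)
cov = centered.T @ centered / len(coords)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
scores = centered @ eigvecs  # 6 rows (atoms) x 3 columns (PCs)

print("eigenvalues:", eigvals)
print("PC1/PC2 scores:")
print(scores[:, :2])
```

Two eigenvalues are non-zero and equal (the ring is symmetric in its plane), the third is zero within rounding, and the first two score columns reproduce the hexagon in the PC1/PC2 plane.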

QSAR-PCA
SYBYL (Tripos Inc.)

10D to 3D

Eigenvalues: explanation of variance in data

Each point corresponds to a column (# points = # descriptors) in the original data
Proximity: correlation

Each point corresponds to a row of the original data (i.e. # points = # molecules), or a graph of molecules
Proximity: similarity in PC space
Small acting Big
[Figure labels: H2O, He, Naphthalene; axis: Molecular Size]

Outlier

QSAR/QSPR-Regression
Types
Principal
Component
Analysis

Non-Linear Mappings
Calculate distance between points in N-d descriptor/parameter space
Euclidean
City-block distances

Randomly assign compounds in set to points on a 2-D or 3-D space
Minimize difference between the N-d distances and the 2-D (or 3-D) distances (optimal N-d to 2-D plot)
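
The three steps can be sketched as a small NumPy routine. This is an illustrative simplification, not the specific algorithm used in any package: it uses Euclidean distances, an unweighted sum of squared distance differences as the quantity to minimize, and plain gradient descent with an assumed step size.

```python
import numpy as np

def pairwise_distances(P):
    """Euclidean distance between every pair of rows of P."""
    diff = P[:, None, :] - P[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def stress(D, Y):
    """Sum over pairs of (N-d distance - low-d distance)^2."""
    return ((D - pairwise_distances(Y)) ** 2).sum() / 2.0

def nonlinear_map(X, n_dims=2, n_steps=1000, step=0.005, seed=0):
    rng = np.random.default_rng(seed)
    D = pairwise_distances(X)                      # distances in N-d space
    Y = rng.normal(size=(X.shape[0], n_dims))      # random starting layout
    history = [stress(D, Y)]
    for _ in range(n_steps):
        diff = Y[:, None, :] - Y[None, :, :]
        d = np.sqrt((diff ** 2).sum(axis=-1))
        np.fill_diagonal(d, 1.0)                   # avoid dividing by zero
        coeff = (d - D) / d                        # diagonal terms multiply zero diff
        grad = 2.0 * (coeff[:, :, None] * diff).sum(axis=1)
        Y = Y - step * grad                        # gradient-descent update
        history.append(stress(D, Y))
    return Y, history

# Synthetic data: 20 compounds described by 5 descriptors.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))
Y, history = nonlinear_map(X)
print("stress before and after:", history[0], history[-1])
```

Because the starting layout is random, different seeds can give different final maps, which is the dependence on the initial guess that PCA scores are sometimes used to reduce.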

Non-Linear Mappings
Advantages
Non-linear
No assumptions!
Chance groupings unlikely (2D group
likely an N-D group)

Disadvantages
Dependence on initial guess (Use
PCA scores to improve)

QSAR/QSPR-Regression
Types
Multiple Regression
PCR
PLS

QSAR/QSPR-Regression
Types
Linear Regression
Minimize difference between
calculated and observed values
(residuals)
Multiple Regression
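
A minimal sketch of multiple linear regression by least squares, using `numpy.linalg.lstsq` to minimize the sum of squared residuals. The descriptor matrix and the coefficients used to generate the synthetic response are hypothetical.

```python
import numpy as np

def multiple_regression(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ... + bP*xP.

    Returns the coefficient vector [b0, b1, ..., bP].
    """
    A = np.column_stack([np.ones(X.shape[0]), X])  # add intercept column
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs

# Synthetic data generated from known coefficients (no noise).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

coeffs = multiple_regression(X, y)
residuals = y - np.column_stack([np.ones(20), X]) @ coeffs
print("coefficients:", coeffs)
```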

QSAR/QSPR-Regression
Types
Principal Component
Regression
Regression but with Principal
Components substituted for
original descriptors/variables
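
Principal component regression can be sketched by chaining the two steps: PCA on the descriptors, then least squares on the leading scores. The code below is an illustration under stated assumptions (covariance-based PCA, mean-centered data, hypothetical function names), not a reproduction of any package's implementation.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Regress y on the first n_components principal components of X."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / X.shape[0])
    order = np.argsort(eigvals)[::-1][:n_components]
    loadings = eigvecs[:, order]                   # P x n_components
    scores = Xc @ loadings                         # new descriptors
    gamma, *_ = np.linalg.lstsq(scores, y - y_mean, rcond=None)
    beta = loadings @ gamma                        # back to original descriptors
    intercept = y_mean - x_mean @ beta
    return intercept, beta

def pcr_predict(X, intercept, beta):
    return intercept + X @ beta

# Synthetic data with known coefficients (no noise).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = 1.0 + X @ np.array([2.0, -3.0, 0.5])

intercept, beta = pcr_fit(X, y, n_components=3)
print("intercept:", intercept, "beta:", beta)
```

With all components retained the result matches ordinary multiple regression; using fewer components trades some fit for a smaller, uncorrelated descriptor set.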

QSAR/QSPR-Regression
Types
Partial Least Squares
Cross-validation determines
number of
descriptors/components to use
Derive equation
Use bootstrapping and t-test to
test coefficients in QSAR
regression

QSAR/QSPR-Regression
Types
Partial Least Squares (a.k.a. Projection to Latent Structures)
Regression of a Regression
Provides insight into variation in x's (b_{i,j}'s, as in PCA) AND y's (a_i's)

The t_i's are orthogonal
M = (# of variables/descriptors OR # observations/molecules, whichever is smaller)
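
A compact sketch of PLS for a single response, written in the NIPALS style, may help make the latent variables concrete. The assumption here is that each latent variable t_i is a linear combination of the (deflated) descriptors and that y is then regressed on the t_i with coefficients a_i; variable names such as `W`, `P` and `a` are illustrative choices, not notation from the slides.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """NIPALS-style PLS for one response variable."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, a, T = [], [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w = w / np.linalg.norm(w)     # weight vector
        t = Xk @ w                    # latent variable t_i
        tt = t @ t
        p = Xk.T @ t / tt             # descriptor loadings for t_i
        a_i = yk @ t / tt             # coefficient of y on t_i
        Xk = Xk - np.outer(t, p)      # remove what t_i explains from X
        yk = yk - a_i * t             # and from y
        W.append(w)
        P.append(p)
        a.append(a_i)
        T.append(t)
    W, P, T = np.array(W).T, np.array(P).T, np.array(T).T
    beta = W @ np.linalg.solve(P.T @ W, np.array(a))
    intercept = y_mean - x_mean @ beta
    return intercept, beta, T

# Synthetic data: 30 molecules, 3 descriptors, noisy linear response.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
y = 1.0 + X @ np.array([2.0, -3.0, 0.5]) + 0.1 * rng.normal(size=30)

intercept, beta, T = pls1_fit(X, y, n_components=3)
print("t_i^T t_j matrix (off-diagonals near zero):")
print(np.round(T.T @ T, 6))
```

The off-diagonal entries of `T.T @ T` vanish within rounding, which is the orthogonality of the t_i noted above.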

QSAR/QSPR-Regression
Types
PLS is NOT MR or PCR in practice
PLS is MR w/cross-validation
PLS Faster
Couples the target representation (QSAR generation) and component generation, while PCA and PCR are separate

PLS well applied to multi-variate problems

QSAR/QSPR
Post-Qualifications
Confidence in Regression
TSS-Total Sum of Squares
ESS-Explained Sum of Squares
RSS-Residual Sum of Squares
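
The three sums of squares can be computed directly from observed and calculated values. The sketch below assumes the usual definitions (TSS about the observed mean, ESS from the calculated values, RSS from the residuals); the data are hypothetical, and the calculated values come from the least-squares line y = 2.0 x + 0.04 fitted to them.

```python
def sums_of_squares(y_obs, y_calc):
    """Return (TSS, ESS, RSS) for observed and calculated values."""
    mean_obs = sum(y_obs) / len(y_obs)
    tss = sum((o - mean_obs) ** 2 for o in y_obs)
    ess = sum((c - mean_obs) ** 2 for c in y_calc)
    rss = sum((o - c) ** 2 for o, c in zip(y_obs, y_calc))
    return tss, ess, rss

# Hypothetical observations at x = 0.5, 1.0, 1.5, 2.0, 2.5.
x = [0.5, 1.0, 1.5, 2.0, 2.5]
y_obs = [1.1, 1.9, 3.2, 3.9, 5.1]
y_calc = [2.0 * xi + 0.04 for xi in x]  # least-squares line for this data

tss, ess, rss = sums_of_squares(y_obs, y_calc)
print("TSS, ESS, RSS:", tss, ess, rss)
print("R^2 = ESS / TSS:", ess / tss)
```

For a least-squares fit that includes an intercept, TSS = ESS + RSS, so R^2 can be written either as ESS/TSS or as 1 - RSS/TSS.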

QSAR/QSPR
Post-Qualifications
Confidence in Prediction (Predictive
Error Sum of Squares)
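
A leave-one-out version of the predictive error sum of squares (PRESS) can be sketched as follows: each compound is left out in turn, the regression is refitted, and the squared error of predicting the omitted compound is accumulated. Leave-one-out is an assumption here, as are the function name and the synthetic data.

```python
import numpy as np

def press(X, y):
    """Leave-one-out predictive error sum of squares for linear regression."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])
    total = 0.0
    for i in range(n):
        mask = np.arange(n) != i                       # drop compound i
        coeffs = np.linalg.lstsq(A[mask], y[mask], rcond=None)[0]
        total += (y[i] - A[i] @ coeffs) ** 2           # error on omitted point
    return float(total)

# Synthetic data: 25 molecules, 2 descriptors, noisy linear response.
rng = np.random.default_rng(0)
X = rng.normal(size=(25, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.2 * rng.normal(size=25)

A_full = np.column_stack([np.ones(25), X])
full_fit = np.linalg.lstsq(A_full, y, rcond=None)[0]
rss = float(((y - A_full @ full_fit) ** 2).sum())

print("RSS:", rss)
print("PRESS:", press(X, y))
```

PRESS is never smaller than the RSS of the full fit, because each point is predicted by a model that did not see it.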

QSAR/QSPR
Post-Qualification
Bias?
Bootstrapping

Choosing best model?
Cross Validation

QSAR/QSPR
Post-Qualification
Bootstrapping

ASSUME calculated data is experimental/observed data
Randomly choose N data (allowing for multiple picks of the same data)
Regenerate parameters/regression
Repeat M times
Average over M bootstraps
Compare (calculate residual)
If close to zero then no bias
If large then bias exists
M is typically 50-100
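
The bootstrap loop can be sketched with NumPy as below. This version resamples whole (descriptor, response) rows with replacement and refits a linear regression each time, which is one common reading of the procedure; the function name, the choice of M = 100 and the synthetic data are assumptions.

```python
import numpy as np

def bootstrap_bias(X, y, n_boot=100, seed=0):
    """Average of bootstrap regression coefficients minus the original fit."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])
    original = np.linalg.lstsq(A, y, rcond=None)[0]
    samples = np.empty((n_boot, A.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)               # N picks, repeats allowed
        samples[b] = np.linalg.lstsq(A[idx], y[idx], rcond=None)[0]
    bias = samples.mean(axis=0) - original             # near zero means no bias
    return original, bias

# Synthetic data: 40 molecules, 2 descriptors, noisy linear response.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=40)

original, bias = bootstrap_bias(X, y)
print("original coefficients:", original)
print("bootstrap bias estimate:", bias)
```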

QSAR/QSPR
Post-Qualification
Cross-Validation (used in PLS)

Remove one or more pieces of input data
Rederive QSAR equation
Calculate omitted data
Compute root-mean-square error to evaluate
efficacy of model
Typically 20% of data is removed for each iteration
The model with the lowest RMS error has the optimal
number of components/descriptors
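
The cross-validation procedure can be sketched as a five-fold loop (20% of the data omitted per iteration) that scores models with different numbers of components. Principal component regression stands in for the model being tuned, purely as an assumption for the sake of a short example; the same loop would wrap a PLS fit. Function names and data are hypothetical.

```python
import numpy as np

def pcr_fit(X, y, k):
    """Regress y on the first k principal components of X."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / X.shape[0])
    loadings = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    gamma = np.linalg.lstsq(Xc @ loadings, y - y_mean, rcond=None)[0]
    beta = loadings @ gamma
    return y_mean - x_mean @ beta, beta

def cv_rmse(X, y, k, n_folds=5, seed=0):
    """Root-mean-square error of predictions for omitted data."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    sq_errors = []
    for fold in folds:
        train = np.setdiff1d(np.arange(len(y)), fold)   # remaining 80%
        intercept, beta = pcr_fit(X[train], y[train], k)
        predicted = intercept + X[fold] @ beta          # calculate omitted data
        sq_errors.extend((y[fold] - predicted) ** 2)
    return float(np.sqrt(np.mean(sq_errors)))

# Synthetic data: the response depends on all three descriptors.
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3)) * np.array([3.0, 2.0, 1.0])
y = 1.0 + X @ np.array([0.5, -1.0, 2.0]) + 0.05 * rng.normal(size=40)

rmse_by_k = {k: cv_rmse(X, y, k) for k in (1, 2, 3)}
best_k = min(rmse_by_k, key=rmse_by_k.get)
print("RMS error by number of components:", rmse_by_k)
print("chosen number of components:", best_k)
```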

QSPR Example
Narvaez, J. N., Lavine, B. K. and Jurs, P. C. Chemical Senses, 11, 145-156

Relation between musk odourant properties and benzenoid structure
Training set of 148 compounds (81 non-musk and 67
musk)
47 chemical descriptors initially
Pre-qualifications
Correlations (47-12=35)

Post-qualifications
Bootstrapping
Test-set
6/6 musks, 8/9 non-musks

Practical Issues
10 times as many compounds as
parameters fit
3-5 compounds per descriptor
Traditional QSAR
Good for activity prediction
Not good for whether activity is due
to binding or transport

Advanced Methods

Neural Networks
Genetic/Evolutionary Algorithms
Monte Carlo
Alternate descriptors
Reduced graphs
Molecular connectivity indices
Indicator variables (0 or 1)
Combinatorics (e.g. multiple substituent sites)

Tools Available
Sybyl (Tripos Inc.)
Insight II (Accelrys Inc.)
Pole Bio-Informatique Lyonnais
http://pbil.univ-lyon1.fr/

Molecular Biology
http://www.infobiogen.fr/services/deambulum/english/logiciels.html

Summary
QSAR/QSPR
Statistics connect structure/behavior w/
observables
Interpolate/Extrapolate

Multi-Variate Analysis
Pre-Qualification
Regression
PCA
PLS
MLS

Post-Qualification
