Molecular Modeling: Statistical Analysis of Complex Data

Terminology
SAR (Structure-Activity Relationships)

Circa 19th century?
QSPR (Quantitative Structure-Property Relationships)
Relate structure to any physical-chemical property of a molecule
QSAR (Quantitative Structure-Activity Relationships)
Specific to some biological/pharmaceutical function of a molecule (Absorption, Distribution/Digestion, Metabolism, Excretion)
Crum Brown and Fraser (1868-9)
Constitution related to biological response
LogP
Statistical Models
Simple
Mean, median and variation
Regression
Advanced
Validation methods
Principal components, co-variance
QSAR,QSPR
Multiple Regression
Modern QSAR
Hansch et al. (1963)
Activity travel through body
partitioning between varied solvents
C (minimum dosage required)
π (hydrophobicity)
σ (electronic)
Es (steric)
Choosing Descriptors
Buffon's Problem
[Figure: five needles]
Length?
Color?
Composition?
Sheen?
Orientation?
Constitutional
MW, Natoms
Topological
Connectivity, Wiener index
Electrostatic
Polarity, polarizability, partial charges
Geometrical Descriptors
Length, width, Molecular volume
Quantum Chemical
HOMO and LUMO energies

Vibrational frequencies
Bond orders
Energy total
Constitutional
MW, Natoms of element
Topological
Connectivity, Wiener index (sum of bond distances between all atom pairs)
2D Fingerprints (bit-strings)
3D topographical indices, pharmacophore keys
Electrostatic
Polarity, polarizability, partial charges
Geometrical Descriptors
Length, width, Molecular volume
Chemical
Hydrophobicity (LogP)
HOMO and LUMO energies
Vibrational frequencies
Bond orders
Energy total
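As an illustration of a topological descriptor, the Wiener index can be sketched as the sum of shortest-path bond counts over all atom pairs in the hydrogen-suppressed graph. The adjacency dictionaries and the function name `wiener_index` below are illustrative assumptions, not taken from any particular package.

```python
from collections import deque

def wiener_index(adjacency):
    """Sum of shortest-path bond counts over all unordered atom pairs."""
    total = 0
    for start in adjacency:
        # Breadth-first search gives the bond-count distance from `start`.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for nbr in adjacency[atom]:
                if nbr not in dist:
                    dist[nbr] = dist[atom] + 1
                    queue.append(nbr)
        total += sum(dist.values())
    return total // 2  # each pair was counted twice

# Hydrogen-suppressed carbon skeletons (hypothetical atom numbering).
butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

print(wiener_index(butane), wiener_index(isobutane))
```

The branched isomer gives the smaller value, which is why the index is used as a rough measure of molecular compactness.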
GSH
Statistical Methods
1-D analysis
Large dimension sets require
decomposition techniques
Multiple Regression
PCA
PLS
Connecting a descriptor with a structural element so as to interpolate and extrapolate data
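Simple Error Analysis (1D)
Given N data points
Mean
$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$
Variance
$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2$
Regression
[Figure: y vs. x, calculated and observed values]
$y_i^{residual} = y_i^{calc} - y_i^{obs}$
(Poor) 0 < R² < 1 (Good)
$R = \frac{\mathrm{Cov}(X,Y)}{\mathrm{Std}(X)\,\mathrm{Std}(Y)}, \qquad R^2 = \frac{SSR}{N\,\mathrm{Std}(Y)^2}$
$SSR = \sum_{i=1}^{N} (y_i^{calc} - \bar{y})^2$
$\mathrm{Cov}(X,Y) = \frac{1}{N}\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})$
Correlation between fluctuations
$\mathrm{Std}(Y) = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (Y_i - \bar{Y})^2}$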
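The quantities on this slide can be checked numerically. The sketch below assumes the 1/N convention for covariance and standard deviation and uses made-up x, y values; it computes R from the covariance and compares R² with the value obtained from SSR for a least-squares line.

```python
import numpy as np

def pearson_r(x, y):
    """R = Cov(X, Y) / (Std(X) Std(Y)), all with the 1/N convention."""
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (x.std() * y.std())  # np.std uses 1/N by default

# Made-up data for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

slope, intercept = np.polyfit(x, y, 1)   # least-squares line
y_calc = slope * x + intercept
residuals = y_calc - y

ssr = np.sum((y_calc - y.mean()) ** 2)   # sum of squares due to regression
r = pearson_r(x, y)
r2_from_ssr = ssr / (len(y) * y.std() ** 2)

print(r, r ** 2, r2_from_ssr)
```

For a straight-line fit with an intercept, the two routes to R² agree to rounding error.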
Correlation vs. Dependence?
Correlation
Two or more variables/descriptors may correlate
to the same property of a system
Dependence
When the correlation can be shown to arise because a change in one variable causes the change in the other
Ex. An elephant's head and legs

Correlation exists between size of head and legs
The size of one does not depend on the size of
the other
Quantitative Structure Activity/Property Relationships (QSAR, QSPR)
Discern relationships between multiple variables (descriptors)
Identify connections between structural traits (type of substituents, bond angles, substituent locale) and descriptor values (e.g. activity, LogP, % denaturation)
Pre-Qualifications
Size
Minimum of FIVE samples per
descriptor
Verification
Variance
Scaling
Correlations
QSAR/QSPR
Pre-Qualifications
Variance
Coefficient of Variation
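A minimal sketch of the variance check, assuming the usual definition CV = Std(X) / |mean(X)| per descriptor column. The descriptor values and the 0.01 cutoff are hypothetical.

```python
import numpy as np

def coefficient_of_variation(X):
    """Per-descriptor CV: standard deviation divided by |mean| (columns = descriptors)."""
    X = np.asarray(X, dtype=float)
    return X.std(axis=0) / np.abs(X.mean(axis=0))

# Rows = molecules, columns = descriptors (made-up values).
X = np.array([[180.2, 2.0, 1.0],
              [250.3, 2.0, 3.5],
              [310.4, 2.0, 2.2],
              [150.1, 2.0, 0.4]])

cv = coefficient_of_variation(X)
keep = cv > 0.01   # hypothetical threshold: drop near-constant descriptors
print(cv, keep)
```

The second descriptor never varies across the set, so it carries no information and is flagged for removal.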
QSAR/QSPR
Pre-Qualifications
Scaling
Standardizing or normalizing
descriptors to ensure they have equal
weight (in terms of magnitude) in
subsequent analysis
QSAR/QSPR
Pre-Qualifications
Scaling
Unit Variance (Auto Scaling)
Ensures equal statistical weights
(initially)
Mean Centering
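Both scalings are one-liners in NumPy. This sketch assumes rows are molecules and columns are descriptors, and that constant descriptors have already been removed (otherwise autoscaling divides by zero). The data are made up.

```python
import numpy as np

def mean_center(X):
    """Subtract each descriptor's mean so every column averages to zero."""
    return X - X.mean(axis=0)

def autoscale(X):
    """Mean-center, then divide by the standard deviation (unit variance)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Made-up descriptors on very different scales (e.g. MW and LogP).
X = np.array([[180.2, 1.0],
              [250.3, 3.5],
              [310.4, 2.2],
              [150.1, 0.4]])

Z = autoscale(X)
print(Z.mean(axis=0), Z.std(axis=0))
```

After autoscaling every descriptor has mean 0 and standard deviation 1, so none dominates purely because of its units.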
QSAR/QSPR
Pre-Qualifications
Correlations
Correlation between the i-th and j-th descriptors
$R_{i,j} = \frac{\mathrm{Cov}(X_i, X_j)}{\mathrm{Std}(X_i)\,\mathrm{Std}(X_j)}$
$R_{i,j} = \frac{\sum_{k}^{M} (X_{i,k} - \bar{X}_i)(X_{j,k} - \bar{X}_j)}{\sqrt{\sum_{k}^{M} (X_{i,k} - \bar{X}_i)^2 \, \sum_{k}^{M} (X_{j,k} - \bar{X}_j)^2}}$
Remove correlated descriptors
Keep correlated descriptors so as to reduce data set size
Apply math operation to remove correlation (PCR)
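The "remove correlated descriptors" option can be sketched as follows. The greedy filter, the 0.95 threshold, and the synthetic data are assumptions for illustration; rows are molecules (index k) and columns are descriptors (indices i, j).

```python
import numpy as np

def correlation_matrix(X):
    """R[i, j] between descriptor columns i and j, summing over molecules."""
    Xc = X - X.mean(axis=0)
    num = Xc.T @ Xc
    norm = np.sqrt(np.diag(num))
    return num / np.outer(norm, norm)

def drop_correlated(X, threshold=0.95):
    """Greedy filter: keep a descriptor only if |R| with all kept ones is below threshold."""
    R = np.abs(correlation_matrix(X))
    keep = []
    for j in range(X.shape[1]):
        if all(R[j, i] < threshold for i in keep):
            keep.append(j)
    return keep

rng = np.random.default_rng(0)
a = rng.normal(size=20)
b = rng.normal(size=20)
# Column 1 is almost a multiple of column 0; column 2 is independent.
X = np.column_stack([a, 2.0 * a + 0.01 * rng.normal(size=20), b])

kept = drop_correlated(X)
print(np.round(correlation_matrix(X), 3), kept)
```

Which member of a correlated pair to keep is a modelling choice; this sketch simply keeps the first one encountered.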
QSAR/QSPR Scheme
Goal
Predict what happens next
(extrapolate)!
Predict what happens between data
points (interpolate)!
QSAR/QSPR Scheme
Types of Variable
Continuous
Concentration, occupied volume,
partition coefficient, hydrophobicity
Discrete
Structural (1-meta substituted, 0-no
meta substitution)
QSAR/QSPR-Principal
Components Analysis
Reduces dimensionality of
descriptors
Principal components are a set of
vectors representing the variance
in the original data
QSAR/QSPR-Principal
Components Analysis
Geometric Analogy (3-D to 2-D PCA)
[Figure: x, y, z axes]
QSAR/QSPR-Principal
Components Analysis
Formulate matrix
Diagonalize matrix
Eigenvectors are the principal components
These principal components (new descriptors)
are a linear combination of the original
descriptors
Eigenvalues represent variance

Largest accounts for greatest % of data variance
Next corresponds to second greatest and so on
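The steps above can be sketched with NumPy. This is a minimal version assuming a covariance matrix built with the 1/N convention from mean-centered descriptors; the data are synthetic, with the fourth descriptor deliberately made redundant.

```python
import numpy as np

def pca(X):
    """PCA by diagonalizing the P x P covariance matrix of the descriptors."""
    Xc = X - X.mean(axis=0)                 # mean-center each descriptor
    C = Xc.T @ Xc / X.shape[0]              # formulate matrix
    eigvals, eigvecs = np.linalg.eigh(C)    # diagonalize (symmetric matrix)
    order = np.argsort(eigvals)[::-1]       # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs                   # molecules in PC space
    return eigvals, eigvecs, scores

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))                # 30 molecules, 4 descriptors
X[:, 3] = X[:, 0] + X[:, 1]                 # redundant descriptor

eigvals, loadings, scores = pca(X)
explained = eigvals / eigvals.sum()         # fraction of variance per PC
print(np.round(explained, 3))
```

Because one descriptor is a linear combination of two others, the last eigenvalue is zero to rounding error and three PCs describe the data completely.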
QSAR/QSPR-Principal
Components Analysis
Formulate matrix (Several types)
Correlation or covariance (N x P)
N is number of molecules
P is number of descriptors
Variance-Covariance matrix (N x N)
Diagonalize (Rotate) matrix
QSAR/QSPR-Principal
Components Analysis
Eigenvectors (Loadings)
Represents contribution from each original
descriptor to PC (new descriptor)
# columns = # of descriptors
# rows = # of descriptors OR # of molecules
Eigenvalues
Indicate which PC most important
(representative of original descriptors)
Benzene has 2 non-zero and 1 zero eigenvalue
(planar)
QSAR/QSPR-Principal
Components Analysis
Scores
Graphing each object/molecule
in space of 2 or more PCs
# rows = # of objects/molecules
# columns = # of descriptors OR #
of molecules
For benzene corresponds to graph in PC1 (x)
and PC2 (y) system
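The benzene remark can be reproduced by treating the x, y, z coordinates of the six carbons as three "descriptors". The sketch assumes an idealized regular hexagon with a C-C distance of about 1.39 Å, tilted by arbitrary angles so the plane is not aligned with any axis.

```python
import numpy as np

r = 1.39  # approximate C-C distance in angstroms (assumed); also the ring radius
angles = np.deg2rad(np.arange(6) * 60.0)
flat = np.column_stack([r * np.cos(angles), r * np.sin(angles), np.zeros(6)])

# Tilt the ring out of the xy-plane with two arbitrary rotations.
t, p = np.deg2rad(30.0), np.deg2rad(45.0)
rot_x = np.array([[1.0, 0.0, 0.0],
                  [0.0, np.cos(t), -np.sin(t)],
                  [0.0, np.sin(t), np.cos(t)]])
rot_z = np.array([[np.cos(p), -np.sin(p), 0.0],
                  [np.sin(p), np.cos(p), 0.0],
                  [0.0, 0.0, 1.0]])
coords = flat @ (rot_z @ rot_x).T

centered = coords - coords.mean(axis=0)
cov = centered.T @ centered / len(coords)
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # descending order

scores = centered @ eigvecs   # rows = atoms, columns = PC1, PC2, PC3
print(np.round(eigvals, 6))
print(np.round(scores[:, :2], 3))
```

Two eigenvalues are non-zero and equal, the third is zero because the ring is planar, and the PC1/PC2 scores recover the hexagon in its own plane.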
QSAR-PCA
[Screenshots: SYBYL (Tripos Inc.)]
10D → 3D
Eigenvalues → Explanation of variance in data
Each point corresponds to a column (# points = # descriptors) in original data
Proximity → correlation
Each point corresponds to a row of original data (i.e. # points = # molecules), or a graph of molecules
Proximity → similarity in PC space
Small acting Big
H2O
He
Naphthalene
Molecular Size
Outlier
QSAR/QSPR-Regression
Types
Principal Component Analysis
Non-Linear Mappings
Calculate distance between points in N-d descriptor/parameter space
Euclidean
City-block distances
Randomly assign compounds in set to points on a 2-D or 3-D space
Minimize Difference (Optimal N-d → 2D plot)
Non-Linear Mappings
Advantages
Non-linear
No assumptions!
Chance groupings unlikely (2D group
likely an N-D group)
Disadvantages
Dependence on initial guess (Use
PCA scores to improve)
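A rough sketch of a non-linear mapping, assuming Euclidean distances, an unweighted squared-difference stress, and plain gradient descent from a random start. The step size, iteration count, and data are arbitrary choices, and published variants (e.g. Sammon mapping) weight the stress differently.

```python
import numpy as np

def pairwise_distances(P):
    """Euclidean distance matrix between the rows of P."""
    diff = P[:, None, :] - P[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def stress(D_high, Y):
    """Sum over pairs of (N-d distance - low-d distance)^2."""
    return ((D_high - pairwise_distances(Y)) ** 2).sum() / 2.0

def nonlinear_map(X, n_dim=2, n_iter=300, step=0.01, seed=0):
    rng = np.random.default_rng(seed)
    D_high = pairwise_distances(X)
    Y = rng.normal(size=(X.shape[0], n_dim))    # random initial placement
    history = [stress(D_high, Y)]
    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]
        D_low = np.sqrt((diff ** 2).sum(axis=-1))
        np.fill_diagonal(D_low, 1.0)            # avoid 0/0; diff is zero there
        coeff = (D_low - D_high) / D_low
        grad = 2.0 * (coeff[:, :, None] * diff).sum(axis=1)
        Y -= step * grad                         # gradient-descent update
        history.append(stress(D_high, Y))
    return Y, history

rng = np.random.default_rng(2)
X = rng.normal(size=(12, 5))    # 12 compounds in a 5-d descriptor space
Y, history = nonlinear_map(X)
print(history[0], history[-1])
```

Replacing the random start with the first two PCA score columns is one way to reduce the dependence on the initial guess noted above.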
Types
Multiple Regression
PCR
PLS
Types
Linear Regression
Minimize difference between
calculated and observed values
(residuals)
Multiple Regression
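A minimal multiple-regression sketch using ordinary least squares, which minimizes the sum of squared residuals. The function names and the noise-free synthetic data are assumptions for illustration.

```python
import numpy as np

def fit_mlr(X, y):
    """Least-squares coefficients [intercept, b1, b2, ...]."""
    A = np.column_stack([np.ones(len(X)), X])   # column of ones for the intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_mlr(coef, X):
    return coef[0] + X @ coef[1:]

rng = np.random.default_rng(4)
X = rng.normal(size=(20, 2))                    # 20 molecules, 2 descriptors
y = 1.5 + 2.0 * X[:, 0] - 0.5 * X[:, 1]         # exact linear relation

coef = fit_mlr(X, y)
residuals = predict_mlr(coef, X) - y
print(np.round(coef, 6))
```

With noise-free data the fit recovers the generating coefficients and the residuals vanish; real activity data will not behave this cleanly.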
Types
Principal Component
Regression
Regression but with Principal
Components substituted for
original descriptors/variables
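Principal component regression can be sketched by regressing y on the leading PCA scores and mapping the result back to the original descriptors. This version obtains the components from an SVD of the mean-centered data, which is an implementation choice; the data are synthetic.

```python
import numpy as np

def fit_pcr(X, y, n_components):
    """Regress y on the first n_components PCA scores of X."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    # Rows of Vt are the principal components (loadings).
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T
    T = Xc @ V                                   # scores
    # Scores are orthogonal, so each coefficient is fitted independently.
    gamma = (T.T @ (y - y_mean)) / s[:n_components] ** 2
    beta = V @ gamma                             # back to descriptor space
    intercept = y_mean - x_mean @ beta
    return intercept, beta

rng = np.random.default_rng(5)
X = rng.normal(size=(25, 3))
y = 0.3 + X @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.normal(size=25)

b0_full, beta_full = fit_pcr(X, y, n_components=3)
b0_two, beta_two = fit_pcr(X, y, n_components=2)
print(np.round(beta_full, 3), np.round(beta_two, 3))
```

Using all components reproduces ordinary multiple regression; dropping low-variance components trades some fit for stability.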
Types
Partial Least Squares
Cross-validation determines
number of
descriptors/components to use
Derive equation
Use bootstrapping and t-test to
test coefficients in QSAR
regression
Types
Partial Least Squares (a.k.a.
Projection to Latent Structures)
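Regression of a Regression
$y = \sum_{i=1}^{M} a_i t_i, \qquad t_i = \sum_{j} b_{i,j} x_j$
Provides insight into variation in the x's (the b_{i,j}'s, as in PCA) AND the y's (the a_i's)
The t_i's are orthogonal
M = (# of variables/descriptors OR # of observations/molecules, whichever is smaller)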
Types
PLS is NOT MR or PCR in practice
PLS is MR w/cross-validation
PLS Faster
couples the target representation
(QSAR generation) and component
generation while PCA and PCR are
separate
PLS well applied to multi-variate problems
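For a single response, PLS can be sketched with the NIPALS-style PLS1 recipe below. Choosing this particular algorithm is an assumption (the slides do not name one), and the variable names are illustrative. Each latent variable t is built from the x's using weights chosen with reference to y, which is the coupling described above.

```python
import numpy as np

def fit_pls1(X, y, n_components):
    """PLS1 sketch: returns intercept, descriptor coefficients, and scores T."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E = X - x_mean                     # x residuals
    f = y - y_mean                     # y residuals
    W, P, T, a_coeffs = [], [], [], []
    for _ in range(n_components):
        w = E.T @ f                    # weights: direction of max covariance with y
        w /= np.linalg.norm(w)
        t = E @ w                      # latent variable (score)
        tt = t @ t
        p = E.T @ t / tt               # x loadings
        a = f @ t / tt                 # regression of y on t
        E = E - np.outer(t, p)         # deflate x
        f = f - a * t                  # deflate y
        W.append(w); P.append(p); T.append(t); a_coeffs.append(a)
    W, P, T = np.array(W).T, np.array(P).T, np.array(T).T
    beta = W @ np.linalg.solve(P.T @ W, np.array(a_coeffs))
    intercept = y_mean - x_mean @ beta
    return intercept, beta, T

rng = np.random.default_rng(6)
X = rng.normal(size=(25, 3))
y = 0.3 + X @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.normal(size=25)

b0, beta, T = fit_pls1(X, y, n_components=3)
gram = T.T @ T                         # off-diagonal terms vanish: scores are orthogonal
print(np.round(beta, 3))
```

In practice the number of components would be chosen by cross-validation rather than fixed at the maximum as done here.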
QSAR/QSPR
Post-Qualifications
Confidence in Regression
TSS-Total Sum of Squares
ESS-Explained Sum of Squares
RSS-Residual Sum of Squares
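The three sums of squares can be computed directly. The sketch assumes the common definitions (TSS about the mean of the observations, ESS from the calculated values, RSS from the residuals) and uses made-up data with a least-squares line.

```python
import numpy as np

def sums_of_squares(y_obs, y_calc):
    y_bar = y_obs.mean()
    tss = np.sum((y_obs - y_bar) ** 2)     # total
    ess = np.sum((y_calc - y_bar) ** 2)    # explained by the model
    rss = np.sum((y_obs - y_calc) ** 2)    # left in the residuals
    return tss, ess, rss

# Made-up data for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_obs = np.array([1.2, 1.9, 3.2, 3.8, 5.3, 5.9])
slope, intercept = np.polyfit(x, y_obs, 1)
y_calc = slope * x + intercept

tss, ess, rss = sums_of_squares(y_obs, y_calc)
r2 = ess / tss                              # equivalently 1 - rss / tss
print(tss, ess, rss, r2)
```

For a least-squares fit with an intercept, TSS = ESS + RSS, so R² is the fraction of the total variation explained by the regression.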
QSAR/QSPR
Post-Qualifications
Confidence in Prediction (Predictive
Error Sum of Squares)
QSAR/QSPR
Post-Qualification
Bias?
Bootstrapping
Choosing best model?

Cross Validation
QSAR/QSPR
Post-Qualification
Bootstrapping
ASSUME calculated data is experimental/observed data
Randomly choose N data (allowing multiple picks of the same data)
Regenerate parameters/regression
Repeat M times
Average over M bootstraps
Compare (calculate residual)
If close to zero then no bias
If large then bias exists
M is typically 50-100
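One simple reading of the bootstrap recipe is sketched below: resample the N data points with replacement, refit, repeat M times, and compare the averaged parameters with those from the full data set. The straight-line model, M = 100, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares [slope, intercept]."""
    return np.polyfit(x, y, 1)

def bootstrap_bias(x, y, n_boot=100, seed=0):
    rng = np.random.default_rng(seed)
    full = fit_line(x, y)                          # parameters from all N points
    boots = np.empty((n_boot, 2))
    for m in range(n_boot):
        # Draw N indices with replacement (same point may be picked repeatedly).
        idx = rng.integers(0, len(x), size=len(x))
        boots[m] = fit_line(x[idx], y[idx])
    bias = boots.mean(axis=0) - full               # average over M bootstraps vs. original
    spread = boots.std(axis=0)                     # variability of each parameter
    return full, bias, spread

rng = np.random.default_rng(7)
x = np.linspace(0.0, 10.0, 30)
y = 2.0 * x + 1.0 + 0.2 * rng.normal(size=30)

full, bias, spread = bootstrap_bias(x, y)
print(full, bias, spread)
```

A bias near zero relative to the spread suggests the fitted parameters are not being driven by a few particular data points.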
QSAR/QSPR
Post-Qualification
Cross-Validation (used in PLS)
Remove one or more pieces of input data

Rederive QSAR equation
Calculate omitted data
Compute root-mean-square error to evaluate
efficacy of model
Typically 20% of data is removed for each iteration
The model with the lowest RMS error has the optimal
number of components/descriptors
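The procedure can be sketched with five folds, so that about 20% of the data is left out on each iteration. Here the candidate models are multiple regressions on the first k descriptors; that choice of model family and the synthetic data are assumptions for illustration.

```python
import numpy as np

def fit_predict(X_train, y_train, X_test):
    """Multiple regression fitted on the training part, applied to the omitted part."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return coef[0] + X_test @ coef[1:]

def cross_validate(X, y, n_folds=5, seed=0):
    """Return (PRESS, RMS error) from leave-group-out cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    press = 0.0
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False                    # remove ~20% of the data
        y_pred = fit_predict(X[train], y[train], X[test_idx])
        press += np.sum((y[test_idx] - y_pred) ** 2)
    return press, np.sqrt(press / len(y))

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 5))
# Only the first two descriptors actually influence y.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=40)

results = {k: cross_validate(X[:, :k], y) for k in range(1, 6)}
best_k = min(results, key=lambda k: results[k][1])
print({k: round(v[1], 3) for k, v in results.items()}, best_k)
```

The RMS error drops sharply once both relevant descriptors are included; extra descriptors give little or no further improvement.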
QSPR Example
Narvaez, J. N., Lavine, B. K. and Jurs, P. C. Chemical Senses, 11, 145-156
Relation between musk odourant properties and benzenoid structure
Training set of 148 compounds (81 non-musk and 67
musk)
47 chemical descriptors initially
Pre-qualifications
Correlations (47-12=35)
Post-qualifications
Bootstrapping
Test-set
6/6 musks, 8/9 non-musks
Practical Issues
10 times as many compounds as
parameters fit
3-5 compounds per descriptor
Traditional QSAR
Good for activity prediction
Not good for whether activity is due
to binding or transport
Advanced Methods
Neural Networks
Genetic/Evolutionary Algorithms
Monte Carlo
Alternate descriptors
Reduced graphs
Molecular connectivity indices
Indicator variables (0 or 1)
Combinatorics (e.g. multiple substituent sites)
Tools Available
Sybyl (Tripos Inc.)
Insight II (Accelrys Inc.)
Pole Bio-Informatique Lyonnais
http://pbil.univ-lyon1.fr/
Molecular Biology
http://www.infobiogen.fr/services/deambulum/english/logiciels.html
Summary
QSAR/QSPR
Statistics connect structure/behavior w/
observables
Interpolate/Extrapolate
Multi-Variate Analysis
Pre-Qualification
Regression
PCA
PLS
MLS
Post-Qualification
