Molecular Modeling: Statistical Analysis of Complex Data
Statistical Analysis of
Complex Data
Terminology
LogP
Statistical Models
Simple
Mean, median and variation
Regression
Advanced
Validation methods
Principal components, co-variance
QSAR,QSPR
Multiple Regression
Modern QSAR
Hansch et al. (1963)
Activity depends on travel through the body
(partitioning between varied solvents)
Choosing Descriptors
Buffon's Problem
Needle
Length?
Color?
Composition?
Sheen?
Orientation?
Choosing Descriptors
Constitutional
MW, Natoms
Topological
Connectivity, Wiener index
Electrostatic
Polarity, polarizability, partial charges
Geometrical Descriptors
Length, width, Molecular volume
Quantum Chemical
Choosing Descriptors
Constitutional
MW, Natoms of element
Topological
Connectivity, Wiener index (sums of bond distances)
2D Fingerprints (bit-strings)
3D topographical indices, pharmacophore keys
Electrostatic
Polarity, polarizability, partial charges
Geometrical Descriptors
Length, width, Molecular volume
Choosing Descriptors
Chemical
Hydrophobicity (LogP)
HOMO and LUMO energies
Vibrational frequencies
Bond orders
Energy total
GSH
Statistical Methods
1-D analysis
Large dimension sets require
decomposition techniques
Multiple Regression
PCA
PLS
Regression
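$$R = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Std}(X)\,\mathrm{Std}(Y)}$$
[Figure: y plotted against x, with y_calc and y_obs marked]
$$y_{\mathrm{residual}} = y_{\mathrm{calc}} - y_{\mathrm{obs}}$$
$$SSR = \sum_{i} \left( y_{\mathrm{calc},i} - \bar{y} \right)^2$$
Correlation between fluctuations
$$\mathrm{Cov}(X, Y) = \frac{1}{N} \sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})$$
$$\mathrm{Std}(Y) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (Y_i - \bar{Y})^2}$$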
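As a rough illustration of the correlation coefficient R = Cov(X, Y) / (Std(X) Std(Y)), the sketch below computes it with population (1/N) statistics. The descriptor and activity values are made up for illustration and do not come from the slides.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(x, y):
    """Population covariance: (1/N) * sum((x_i - mean_x) * (y_i - mean_y))."""
    x_mean, y_mean = mean(x), mean(y)
    return sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / len(x)


def std(values):
    """Population standard deviation, i.e. sqrt(Cov(X, X))."""
    return math.sqrt(covariance(values, values))


def correlation(x, y):
    """R = Cov(X, Y) / (Std(X) * Std(Y))."""
    return covariance(x, y) / (std(x) * std(y))


# Hypothetical data: one descriptor (e.g. LogP) and one observed activity.
log_p = [1.0, 2.0, 3.0, 4.0, 5.0]
activity = [2.1, 3.9, 6.2, 7.8, 10.1]
r = correlation(log_p, activity)
print(f"R = {r:.4f}")
```

A value of R near +1 or -1 indicates that the two sets of fluctuations track each other; by itself it does not show dependence.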
Correlation vs.
Dependence?
Correlation
Two or more variables/descriptors may correlate
to the same property of a system
Dependence
When the correlation can be shown due to one
changing due to the change in another
Quantitative Structure
Activity/Property
Relationships (QSAR,QSPR)
Pre-Qualifications
Size
Minimum of FIVE samples per
descriptor
Verification
Variance
Scaling
Correlations
QSAR/QSPR
Pre-Qualifications
Variance
Coefficient of Variation
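A minimal sketch of a variance pre-check, assuming the usual definition of the coefficient of variation (standard deviation divided by the mean). The descriptor names, values, and the 0.05 threshold are illustrative assumptions, not values from the slides.

```python
import statistics


def coefficient_of_variation(values):
    """CV = population standard deviation / |mean| (undefined if the mean is 0)."""
    return statistics.pstdev(values) / abs(statistics.mean(values))


# Hypothetical descriptor table: one list of values per descriptor.
descriptors = {
    "mol_weight": [120.0, 180.0, 250.0, 310.0, 95.0],
    "n_rings": [1.0, 1.0, 1.0, 1.0, 1.0],  # constant, carries no information
}

cv = {name: coefficient_of_variation(col) for name, col in descriptors.items()}

threshold = 0.05  # assumed cutoff, chosen only for this example
informative = [name for name, value in cv.items() if value > threshold]
print(cv, informative)
```

A descriptor that barely varies across the data set cannot explain variation in activity, so it is a candidate for removal before regression.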
QSAR/QSPR
Pre-Qualifications
Scaling
Standardizing or normalizing
descriptors to ensure they have equal
weight (in terms of magnitude) in
subsequent analysis
QSAR/QSPR
Pre-Qualifications
Scaling
Unit Variance (Auto Scaling)
Ensures equal statistical weights
(initially)
Mean Centering
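The two scaling steps can be sketched as follows: mean centering subtracts the descriptor mean, and unit-variance (auto) scaling additionally divides by the standard deviation. The data values are hypothetical, and the population standard deviation is an assumed choice.

```python
import statistics


def mean_center(values):
    """Shift a descriptor so that its mean is zero."""
    m = statistics.mean(values)
    return [v - m for v in values]


def autoscale(values):
    """Mean-center, then divide by the standard deviation (unit variance)."""
    s = statistics.pstdev(values)
    return [v / s for v in mean_center(values)]


# Hypothetical descriptors with very different magnitudes.
mol_weight = [120.0, 180.0, 250.0, 310.0, 95.0]
log_p = [0.5, 1.2, 2.8, 3.1, 0.1]

scaled_mw = autoscale(mol_weight)
scaled_log_p = autoscale(log_p)
print(scaled_mw)
print(scaled_log_p)
```

After autoscaling, both descriptors have mean 0 and standard deviation 1, so neither dominates the analysis purely because of its units.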
QSAR/QSPR
Pre-Qualifications
Correlations
Correlation between the i-th and j-th descriptors:
$$R_{i,j} = \frac{\mathrm{Cov}(X_i, X_j)}{\mathrm{Std}(X_i)\,\mathrm{Std}(X_j)}$$
$$R_{i,j} = \frac{\sum_{k=1}^{M} (X_{i,k} - \bar{X}_i)(X_{j,k} - \bar{X}_j)}{\sqrt{\sum_{k=1}^{M} (X_{i,k} - \bar{X}_i)^2 \, \sum_{k=1}^{M} (X_{j,k} - \bar{X}_j)^2}}$$
Remove correlated
descriptors
Keep correlated
descriptors so as to
reduce data set size
Apply math operation
to remove correlation
(PCR)
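One way to act on descriptor-descriptor correlations is to drop any descriptor that correlates strongly with one already kept. The sketch below is a simple greedy filter, not a method taken from the slides; the descriptor values and the 0.9 cutoff are assumptions for illustration.

```python
import math


def pearson(x, y):
    """Correlation R_ij between two descriptor columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(descriptors, cutoff=0.9):
    """Keep a descriptor only if |R| with every kept descriptor is below cutoff."""
    kept = []
    for name, column in descriptors.items():
        if all(abs(pearson(column, descriptors[k])) < cutoff for k in kept):
            kept.append(name)
    return kept


# Hypothetical descriptor table (five molecules).
descriptors = {
    "mol_weight": [100.0, 150.0, 200.0, 250.0, 300.0],
    "n_atoms": [10.0, 15.0, 21.0, 25.0, 30.0],  # nearly proportional to mol_weight
    "log_p": [2.0, 0.5, 3.0, 1.0, 2.5],
}

kept = drop_correlated(descriptors)
print(kept)
```

The result depends on the order in which descriptors are visited; the first member of a correlated pair is the one retained.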
QSAR/QSPR
Pre-Qualifications
Correlations
QSAR/QSPR Scheme
Goal
Predict what happens next
(extrapolate)!
Predict what happens between data
points (interpolate)!
QSAR/QSPR Scheme
Types of Variable
Continuous
Concentration, occupied volume,
partition coefficient, hydrophobicity
Discrete
Structural (1-meta substituted, 0-no
meta substitution)
QSAR/QSPR-Principal
Components Analysis
Reduces dimensionality of
descriptors
Principal components are a set of
vectors representing the variance
in the original data
QSAR/QSPR-Principal
Components Analysis
Geometric Analogy (3-D to 2-D
PCA)
[Figure: x, y, z axes]
QSAR/QSPR-Principal
Components Analysis
Formulate matrix
Diagonalize matrix
Eigenvectors are the principal components
These principal components (new descriptors)
are a linear combination of the original
descriptors
QSAR/QSPR-Principal
Components Analysis
Formulate matrix (Several types)
Correlation or covariance (N x P)
N is number of molecules
P is number of descriptors
Variance-Covariance matrix (N x N)
QSAR/QSPR-Principal
Components Analysis
Eigenvectors (Loadings)
Represents contribution from each original
descriptor to PC (new descriptor)
# columns = # of descriptors
# rows = # of descriptors OR # of molecules
Eigenvalues
Indicate which PC most important
(representative of original descriptors)
Benzene has 2 non-zero and 1 zero eigenvalue
(planar)
QSAR/QSPR-Principal
Components Analysis
Scores
Graphing each object/molecule
in space of 2 or more PCs
# rows = # of objects/molecules
# columns = # of descriptors OR #
of molecules
For benzene corresponds to graph in PC1 (x)
and PC2 (y) system
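The benzene remark can be checked numerically. The sketch below treats the x, y, z coordinates of six ring carbons as three descriptors, forms the covariance matrix, diagonalizes it, and projects the atoms onto the principal components. The idealized hexagon (radius 1.39 Å), the tilt angle, and the use of NumPy are assumptions made for this example.

```python
import numpy as np

# Idealized benzene ring: six carbons on a regular hexagon (assumed geometry).
radius = 1.39  # approximate C-C distance in angstroms
angles = np.arange(6) * np.pi / 3
flat = np.column_stack(
    [radius * np.cos(angles), radius * np.sin(angles), np.zeros(6)]
)

# Tilt the ring about the x axis so the molecular plane is not axis-aligned.
theta = np.pi / 5
rot_x = np.array(
    [
        [1.0, 0.0, 0.0],
        [0.0, np.cos(theta), -np.sin(theta)],
        [0.0, np.sin(theta), np.cos(theta)],
    ]
)
coords = flat @ rot_x.T  # rows = atoms (objects), columns = x, y, z (descriptors)

# Formulate the matrix: covariance of the mean-centered descriptors (3 x 3).
centered = coords - coords.mean(axis=0)
cov = centered.T @ centered / len(centered)

# Diagonalize: eigenvectors are the loadings, eigenvalues rank the PCs.
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # returned in ascending order
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
loadings = eigenvectors[:, order]

# Scores: each atom expressed in the space of the principal components.
scores = centered @ loadings

print(np.round(eigenvalues, 6))
print(np.round(scores, 6))
```

Because the ring is planar, the third eigenvalue is zero to within rounding and the third score column vanishes, so the first two score columns reproduce the ring in the PC1/PC2 plane.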
QSAR-PCA
SYBYL (Tripos Inc.)
10D to 3D
Proximity correlation
H2O
He
Naphthalene
Molecular Size
QSAR/QSPR-Regression
Types
Principal
Component
Analysis
Non-Linear Mappings
Calculate distance between points in N-dimensional descriptor/parameter space
Euclidean
City-block distances
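The two distance measures can be sketched as below for points in descriptor space. The example descriptor vectors are hypothetical.

```python
import math


def euclidean(a, b):
    """Straight-line distance: sqrt(sum((a_i - b_i)^2))."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def city_block(a, b):
    """City-block (Manhattan) distance: sum(|a_i - b_i|)."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Two hypothetical molecules described by three (already scaled) descriptors.
mol_a = [0.0, 0.0, 0.0]
mol_b = [3.0, 4.0, 0.0]

print(euclidean(mol_a, mol_b), city_block(mol_a, mol_b))
```

A non-linear mapping then tries to place points in 2-D so that these inter-point distances are preserved as well as possible.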
Non-Linear Mappings
Advantages
Non-linear
No assumptions!
Chance groupings unlikely (2D group
likely an N-D group)
Disadvantages
Dependence on initial guess (Use
PCA scores to improve)
QSAR/QSPR-Regression
Types
Multiple Regression
PCR
PLS
QSAR/QSPR-Regression
Types
Linear Regression
Minimize difference between
calculated and observed values
(residuals)
Multiple Regression
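A minimal multiple-regression sketch, assuming ordinary least squares via NumPy's `lstsq`. The data are synthetic and noise-free: one continuous descriptor and one 0/1 indicator variable, with coefficients chosen arbitrarily for the example.

```python
import numpy as np

# Hypothetical descriptors for ten molecules:
# x1 is continuous, x2 is an indicator (1 = meta substituted, 0 = not).
x1 = np.arange(1.0, 11.0)
x2 = np.tile([0.0, 1.0], 5)
X = np.column_stack([x1, x2])

# Synthetic "observed" activity generated from known coefficients.
y_obs = 0.5 + 2.0 * x1 - 1.0 * x2

# Add a column of ones so the fit includes an intercept.
design = np.column_stack([np.ones(len(X)), X])

# Least squares: choose coefficients that minimize sum((y_calc - y_obs)^2).
coeffs, *_ = np.linalg.lstsq(design, y_obs, rcond=None)
y_calc = design @ coeffs
residuals = y_calc - y_obs

print(np.round(coeffs, 6))
```

With noise-free data the fit recovers the generating coefficients and the residuals vanish; real data leave non-zero residuals whose squared sum is what the regression minimizes.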
QSAR/QSPR-Regression
Types
Principal Component
Regression
Regression but with Principal
Components substituted for
original descriptors/variables
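Principal component regression can be sketched as PCA followed by least squares on the scores. The example below uses synthetic data in which two descriptors are nearly collinear; the data, the random seed, and the choice of two components are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30

# Synthetic descriptors: x2 is almost a copy of x1 (strongly correlated).
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 + 1.0 * x3 + 0.05 * rng.normal(size=n)

# Mean-center descriptors and response.
x_mean, y_mean = X.mean(axis=0), y.mean()
Xc, yc = X - x_mean, y - y_mean

# PCA step: diagonalize the covariance matrix and keep the two largest PCs.
eigenvalues, eigenvectors = np.linalg.eigh(Xc.T @ Xc / n)
order = np.argsort(eigenvalues)[::-1]
loadings = eigenvectors[:, order[:2]]
scores = Xc @ loadings  # uncorrelated new descriptors

# Regression step: ordinary least squares on the scores.
gamma, *_ = np.linalg.lstsq(scores, yc, rcond=None)

# Map the coefficients back to the original descriptors.
beta = loadings @ gamma
y_calc = y_mean + Xc @ beta
r2 = 1.0 - np.sum((y - y_calc) ** 2) / np.sum((y - y_mean) ** 2)

print(np.round(beta, 3), round(float(r2), 4))
```

Dropping the smallest component removes the direction in which x1 and x2 differ, which is where ordinary multiple regression would be numerically unstable.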
QSAR/QSPR-Regression
Types
Partial Least Squares
Cross-validation determines
number of
descriptors/components to use
Derive equation
Use bootstrapping and t-test to
test coefficients in QSAR
regression
QSAR/QSPR-Regression
Types
Partial Least Squares (a.k.a.
Projection to Latent Structures)
Regression of a Regression
Provides insight into variation in
x's (b_{i,j}'s as in PCA) AND y's (a_i's)
QSAR/QSPR-Regression
Types
PLS is NOT MR or PCR in practice
PLS is MR w/cross-validation
PLS Faster
couples the target representation
(QSAR generation) and component
generation while PCA and PCR are
separate
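To make the coupling of component generation and regression concrete, the sketch below implements a basic single-response PLS (the NIPALS-style PLS1 algorithm) in NumPy. This is one common textbook formulation, not necessarily the variant used by any particular package; the function name `pls1_fit` and the synthetic data are assumptions.

```python
import numpy as np


def pls1_fit(X, y, n_components):
    """Basic PLS1: returns regression coefficients and intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E = X - x_mean  # descriptor residual matrix
    f = y - y_mean  # response residual vector
    weights, loadings, y_loadings = [], [], []
    for _ in range(n_components):
        w = E.T @ f  # direction in X most covariant with y
        w /= np.linalg.norm(w)
        t = E @ w  # scores for this component
        tt = t @ t
        p = E.T @ t / tt  # X loadings
        c = f @ t / tt  # y loading (regression of y on t)
        E = E - np.outer(t, p)  # deflate X
        f = f - c * t  # deflate y
        weights.append(w)
        loadings.append(p)
        y_loadings.append(c)
    W = np.array(weights).T
    P = np.array(loadings).T
    q = np.array(y_loadings)
    beta = W @ np.linalg.solve(P.T @ W, q)
    return beta, y_mean - x_mean @ beta


# Synthetic example data (30 molecules, 3 descriptors).
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 3.0 + 0.1 * rng.normal(size=30)

beta_one, intercept_one = pls1_fit(X, y, n_components=1)
beta_full, intercept_full = pls1_fit(X, y, n_components=3)
print(np.round(beta_one, 3), np.round(beta_full, 3))
```

Each component is chosen using both X and y, unlike PCR where the components come from X alone. In practice the number of components is picked by cross-validation; with as many components as descriptors this sketch reduces to ordinary multiple regression.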
QSAR/QSPR
Post-Qualifications
Confidence in Regression
TSS-Total Sum of Squares
ESS-Explained Sum of Squares
RSS-Residual Sum of Squares
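The three sums of squares can be sketched for a simple one-descriptor least-squares line. The data are hypothetical; R² = ESS/TSS is the standard definition and is assumed here.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for one descriptor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return slope, my - slope * mx


# Hypothetical descriptor values and observed activities.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_obs = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(x, y_obs)
y_calc = [slope * xi + intercept for xi in x]
y_mean = sum(y_obs) / len(y_obs)

tss = sum((yo - y_mean) ** 2 for yo in y_obs)  # total
ess = sum((yc - y_mean) ** 2 for yc in y_calc)  # explained by the model
rss = sum((yc - yo) ** 2 for yc, yo in zip(y_calc, y_obs))  # residual

r_squared = ess / tss
print(tss, ess, rss, r_squared)
```

For a least-squares fit with an intercept, TSS = ESS + RSS, so R² can equivalently be written as 1 - RSS/TSS.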
QSAR/QSPR
Post-Qualifications
Confidence in Prediction (Predictive
Error Sum of Squares)
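A minimal leave-one-out sketch of the predictive error sum of squares: each compound is left out in turn, the model is refit, and the squared error of predicting the left-out compound is accumulated. The data are hypothetical, and the cross-validated q² = 1 - PRESS/TSS is a commonly used convention assumed here rather than taken from the slides.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for one descriptor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return slope, my - slope * mx


def press(x, y):
    """Leave-one-out predictive error sum of squares."""
    total = 0.0
    for i in range(len(x)):
        x_train, y_train = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        slope, intercept = fit_line(x_train, y_train)
        total += (y[i] - (slope * x[i] + intercept)) ** 2
    return total


# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.8]

slope, intercept = fit_line(x, y)
y_mean = sum(y) / len(y)
tss = sum((yi - y_mean) ** 2 for yi in y)
rss = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))

press_value = press(x, y)
r_squared = 1.0 - rss / tss
q_squared = 1.0 - press_value / tss
print(press_value, r_squared, q_squared)
```

PRESS is never smaller than RSS for a least-squares model, so q² is a more conservative measure of confidence than R².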
QSAR/QSPR
Post-Qualification
Bias?
Bootstrapping
QSAR/QSPR
Post-Qualification
Bootstrapping
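Bootstrapping can be sketched as repeatedly resampling the compounds with replacement, refitting the model, and looking at the spread of the refitted coefficient. The data, the number of resamples, and the fixed seed below are assumptions for illustration.

```python
import random
import statistics


def fit_slope(x, y):
    """Least-squares slope for one descriptor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )


def bootstrap_slopes(x, y, n_resamples=1000, seed=0):
    """Refit the slope on data sets resampled with replacement."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    while len(slopes) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        if len(set(xs)) < 2:
            continue  # degenerate resample: slope is undefined
        ys = [y[i] for i in idx]
        slopes.append(fit_slope(xs, ys))
    return slopes


# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.2, 3.9, 6.1, 8.3, 9.8, 12.1, 13.9, 16.2]

slopes = bootstrap_slopes(x, y)
slope_mean = statistics.mean(slopes)
slope_se = statistics.stdev(slopes)  # bootstrap estimate of the standard error
print(slope_mean, slope_se)
```

A coefficient whose bootstrap spread is large compared with its value is poorly determined; the ratio of the fitted coefficient to this standard error plays the role of a t-like statistic.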
QSAR/QSPR
Post-Qualification
Cross-Validation (used in PLS)
QSPR Example
Narvaez, J. N., Lavine, B. K. and Jurs, P. C. Chemical Senses, 11, 145-156
Post-qualifications
Bootstrapping
Test-set
6/6 musks, 8/9 non-musks
Practical Issues
10 times as many compounds as
parameters fit
3-5 compounds per descriptor
Traditional QSAR
Good for activity prediction
Not good for whether activity is due
to binding or transport
Advanced Methods
Neural Networks
Genetic/Evolutionary Algorithms
Monte Carlo
Alternate descriptors
Reduced graphs
Molecular connectivity indices
Indicator variables (0 or 1)
Combinatorics (e.g. multiple substituent sites)
Tools Available
Sybyl (Tripos Inc.)
Insight II (Accelrys Inc.)
Pole Bio-Informatique Lyonnais
http://pbil.univ-lyon1.fr/
Molecular Biology
http://www.infobiogen.fr/services/deambulum/english/logiciels.html
Summary
QSAR/QSPR
Statistics connect structure/behavior w/
observables
Interpolate/Extrapolate
Multi-Variate Analysis
Pre-Qualification
Regression
PCA
PLS
MLS
Post-Qualification