Efficient Exploration of Chemical Space

Presented by Erick Tavares
Active Learning
Active learning is a machine learning approach where the model is
trained on a subset of data initially and then selectively chooses
additional data points for training based on the current model's
uncertainties or areas of difficulty. This iterative process aims to
improve model performance with fewer labeled examples, making the
learning process more efficient.
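[Figure: active-learning example with cat images. From 10000 images, a subset of 100 labeled images trains the model; the model's uncertainty on the remaining 9900 guides which images you label next (cats or non-cats).]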
1. Introduction
2. Materials and methods
3. Results and discussion
4. Conclusions
5. Q&A session
Introduction
A huge number of compounds is now available for virtual screening
The ZINC database has passed 1 billion readily synthesizable compounds
448 M of them are in the lead-like range
This article: 3 ultra-large libraries
Size and diversity: larger libraries contain more ligands that can bind to a target
Empirical docking functions are used to prioritize the better ligands
Ultra-large screens are still uncommon: the computational cost is high and is becoming impractical
Introduction
Deep Learning
Applications: prediction of compound properties, design of
chemical structures, reaction prediction,
retrosynthetic analysis, and prediction of protein-ligand
interactions (convolutional neural networks)
Machine learning has been combined with molecular docking to learn
docking scores, but comparisons of hit rates between
traditional and active-learning protocols are missing
Materials and methods
Datasets: three protein systems were used
AmpC beta-lactamase (AmpC): 99 459 562 docking results (prefers anionic molecules)
D4 dopamine receptor: 138 312 677 molecules (favors cationic and neutral molecules)
Blind test, MT1 melatonin receptor: 150 927 915 molecules (favors cationic and neutral molecules)
Machine learning: AutoQSAR/DeepChem (AQ/DC)
Docking: Glide SP and DOCK3.7
User-adjustable parameter: model search time (the length of the search), with a random
search over hyperparameters (number of layers, number of training epochs, and
normalization scheme)
5-fold cross-validation
The best-performing models are combined into an ensemble of 15 models; their scores
are averaged and the standard deviation is reported
Yu et al., 2023
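
The ensemble averaging described on this slide can be sketched in a few lines. This is only an illustration, not the AutoQSAR/DeepChem implementation: `ensemble_predict` and the toy models are hypothetical names, each model is assumed to be a callable that returns a predicted docking score, and the population standard deviation is an assumed choice.

```python
from statistics import mean, pstdev

def ensemble_predict(models, compound):
    """Return (mean, standard deviation) of the ensemble's predicted scores."""
    scores = [model(compound) for model in models]
    return mean(scores), pstdev(scores)

# Three hypothetical constant models standing in for the 15-model ensemble.
toy_models = [lambda x: -8.0, lambda x: -7.0, lambda x: -9.0]
avg, spread = ensemble_predict(toy_models, "CCO")
```

The mean serves as the predicted docking score, and the spread is the per-compound uncertainty that an active-learning selection rule can use.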
Docking calculations
Glide and DOCK3.7, using prepared 3D structures
Chemical libraries
100 M subsets of the ZINC15 library of commercially available compounds: smaller, but
every compound is in stock
ZINC15: built by enumeration of building blocks and a reaction database
Combinatorial design is well suited to covering chemical space and allows more robust training
Null model based on 2D chemical similarity
Measures effectiveness at recovering virtual hit compounds
Figure 1: the top compounds by docking score serve as probe molecules; the final score for
each compound is its maximum fingerprint similarity to the probes
Fingerprints are processed on GPU
Yang et al., 2023
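
A minimal sketch of the similarity-based null model, assuming fingerprints are represented as plain Python sets of on-bit indices (a real workflow would use a cheminformatics toolkit, and the slides mention GPU acceleration). The names `tanimoto`, `null_model_score`, and the toy compounds are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)

def null_model_score(fingerprint, probe_fingerprints):
    """Score a compound by its maximum similarity to any probe molecule."""
    return max(tanimoto(fingerprint, probe) for probe in probe_fingerprints)

# Toy probes (top docked compounds) and a toy library.
probes = [{1, 2, 3, 4}, {10, 11, 12}]
library = {"cpd_a": {1, 2, 3, 5}, "cpd_b": {10, 11, 20, 21}, "cpd_c": {7, 8}}

scores = {name: null_model_score(fp, probes) for name, fp in library.items()}
ranked = sorted(scores, key=scores.get, reverse=True)  # most probe-like first
```

Ranking the library by this score gives a baseline that uses only 2D similarity, against which the ML model can be compared.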

Results and Discussion
Recovery of virtual hits
Top 10k virtual hits from DOCK3.7 for D4 and AmpC: panels A and B
Experimentally verified hit compounds: panels C and D
Top 10k from Glide SP: panels 2E and 2F
Docking the top 5% of the library as ranked by AQ/DC recovers
80% and 98% of the virtual hits (DOCK3.7 and Glide SP)
for the D4 screen,
and 97% and 70% for AmpC.
How to improve the model?
Yang et al., 2023
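
The recovery metric used on this slide can be illustrated as follows. This is a sketch under stated assumptions: scores are stored in dictionaries keyed by compound id, lower scores are better for both docking and ML, and `hit_recovery` is a hypothetical helper, not a function from the paper.

```python
def hit_recovery(ml_scores, docking_scores, n_hits, top_fraction):
    """Fraction of the n_hits best-docking compounds that appear in the top
    `top_fraction` of the library when ranked by ML score.
    Lower (more negative) scores are treated as better for both rankings."""
    by_dock = sorted(docking_scores, key=docking_scores.get)
    hits = set(by_dock[:n_hits])
    by_ml = sorted(ml_scores, key=ml_scores.get)
    n_selected = max(1, int(len(by_ml) * top_fraction))
    selected = set(by_ml[:n_selected])
    return len(hits & selected) / len(hits)

# Toy library of 10 compounds; c0 docks best, c9 worst.
docking = {f"c{i}": -10 + i for i in range(10)}
ml = dict(docking)
ml["c1"] = -3.5  # the toy model underestimates one true hit

recovery = hit_recovery(ml, docking, n_hits=2, top_fraction=0.2)
```

In the slide's terms, `n_hits` would be 10 000 and `top_fraction` 0.05.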

Improve the model
Increase the information content

Selection rule: how the active-learning scheme selects the
compounds for model training
Size of the training set
Increase the size of the training set from 0.1% to 0.2% and 0.5% of the
ZINC screening subset
Yang et al., 2023

Improve the model
Model: 0.1% selected and used for training, then 0.1%
selected and docked
Selection rule
Top 0.1% by ML score: almost no
improvement
Most uncertain 0.1% according to the ML
model (uncertainty defined as the standard deviation of
the ensemble predictions for a given
compound): decreases recall
0.1% randomly selected compounds from the
top 10% by ML score: similar to training on 0.2% of
the data set
Most uncertain 0.1% from the top 5% of
compounds by ML score: best recovery
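
The best-performing rule above (most uncertain compounds from the top-ranked fraction) can be sketched like this. The function name, the data layout (compound id mapped to a mean score and an ensemble standard deviation), and the toy numbers are assumptions for illustration only.

```python
def select_most_uncertain_from_top(predictions, top_fraction, n_select):
    """predictions maps compound id -> (mean_score, std_dev) from the ensemble.
    Keep the best `top_fraction` by mean score (lower is better), then return
    the `n_select` compounds with the largest standard deviation among them."""
    ranked = sorted(predictions, key=lambda c: predictions[c][0])
    n_top = max(1, int(len(ranked) * top_fraction))
    top = ranked[:n_top]
    by_uncertainty = sorted(top, key=lambda c: predictions[c][1], reverse=True)
    return by_uncertainty[:n_select]

# Toy predictions: (mean predicted score, ensemble standard deviation).
preds = {
    "a": (-10.0, 0.1), "b": (-9.5, 0.9), "c": (-9.0, 0.4), "d": (-8.5, 0.2),
    "e": (-5.0, 2.0), "f": (-4.0, 0.3), "g": (-3.0, 1.5), "h": (-2.0, 0.1),
    "i": (-1.0, 0.2), "j": (0.0, 0.5),
}
chosen = select_most_uncertain_from_top(preds, top_fraction=0.4, n_select=2)
```

Compound "e" has the largest uncertainty overall but is not selected, because it falls outside the top-ranked fraction. Restricting to well-scored compounds first is what separates this rule from plain uncertainty sampling, which the slide reports as decreasing recall.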
Improve the model
3A and B: with the smaller training subset, 50% of the compounds are
retrieved within the top 2% of the
compounds scored by the ML model
3E and F: recovery of Glide SP virtual hits, where the effect for
AmpC is very large
Number of docking calculations: AL at 0.2% docks 200 000
of 100 000 000 compounds; training on 0.5% docks 500 000,
so AL saves 300 000 docking calculations
This does not hold for experimental hits, where
the active-learning approach performs almost the same
as the larger subset
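
The docking-cost arithmetic on this slide, worked through as a short check. `LIBRARY_SIZE` and `n_docked` are illustrative names; percentages are passed as strings so that `Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction

LIBRARY_SIZE = 100_000_000  # compounds in the screening subset

def n_docked(percent):
    """Compounds docked when `percent` of the library (e.g. "0.2") is used."""
    return int(LIBRARY_SIZE * Fraction(percent) / 100)

active_learning = n_docked("0.2")  # 0.1% initial set + 0.1% selected set
single_round = n_docked("0.5")     # one larger training set
saved = single_round - active_learning
```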
Improve the model
Possible explanation: perhaps because the model was
trained on only 0.1% of the ZINC database
Yang et al., 2023

Chemical Diversity Analysis of Recovered Hits from the AQ/DC Model
Models are based on fingerprints (SMILES)
Goal: ensure they were not simply learning a simple
measure of ligand similarity
The virtual hit compounds from both Glide and DOCK3.7
were clustered
Clustering: Tanimoto coefficient from 0.3 to 0.6
Cluster recovery: a cluster counts as recovered if any ligand in it is among
the ML-prioritized ligands
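
The cluster-recovery definition above translates directly into code. This is a sketch in which clusters are assumed to be lists of compound ids produced by an earlier Tanimoto-based clustering step that is not shown; `cluster_recovery` is a hypothetical helper.

```python
def cluster_recovery(clusters, prioritized):
    """Fraction of clusters with at least one member among the prioritized ligands."""
    prioritized = set(prioritized)
    recovered = sum(1 for members in clusters if prioritized & set(members))
    return recovered / len(clusters)

# Toy clusters of virtual hits and a toy ML-prioritized set.
clusters = [["h1", "h2", "h3"], ["h4"], ["h5", "h6"], ["h7"]]
prioritized = ["h2", "h6", "x9"]
fraction_recovered = cluster_recovery(clusters, prioritized)
```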
Chemical Diversity Analysis of Recovered Hits from the AQ/DC Model
D4 screening: 60% and 80% of the clusters were
recovered when docking the top 2% and
5% of predictions
Chemical Diversity Analysis of Recovered Hits from the AQ/DC Model
Counting unique Bemis-Murcko
scaffolds: the number of scaffolds goes from 4515 for
the top 10k by DOCK to 1595 for
machine learning
Redocking the top 2% and 5% scored
by the AQ/DC model gives much better
overlap of common scaffolds: 2M-5M compounds
are needed to recover similar diversity
This is also seen with Glide SP
Strong benefit from ML followed by
rescoring the top-ranked compounds
with explicit docking
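
A small sketch of the scaffold comparison, assuming the Bemis-Murcko scaffolds have already been computed with a cheminformatics toolkit and are available as SMILES strings. The helper name and the toy scaffolds are hypothetical.

```python
def scaffold_overlap(docking_scaffolds, ml_scaffolds):
    """Return (unique scaffolds from docking, unique scaffolds from ML, shared)."""
    dock = set(docking_scaffolds)
    ml = set(ml_scaffolds)
    return len(dock), len(ml), len(dock & ml)

# One scaffold SMILES per hit compound (toy data, precomputed elsewhere).
dock_hits = ["c1ccccc1", "c1ccncc1", "C1CCNCC1", "c1ccccc1"]
ml_hits = ["c1ccccc1", "c1ccccc1", "c1ccc2ccccc2c1"]
n_dock, n_ml, n_shared = scaffold_overlap(dock_hits, ml_hits)
```

Comparing on exact SMILES strings only works if both lists were canonicalized the same way, which is assumed here.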
Additional Rounds of Active Learning
Recovery improves up to 3 iterations; later rounds give only a small increase

Null Model
Is the ML model gaining information beyond learning the training set?
A random selection of 0.2% of the library was docked
Machine learning with 2 iterations: the model learns features of
docking itself
Comparison of libraries
ZINC vs Sigma-Aldrich MARKET (8M unique compounds)
AmpC, redocking the top 2%: 40%
recovered for ZINC, only 20% recovered for
Sigma
D4: from 80% to 40%
Conclusion: the library matters for the
effectiveness of the workflow
Blind test on MT1 system
DOCK3.7 on the MT1 system

150 927 915 molecules
Three models: AQ/DC trained on 0.1% docked molecules, AQ/DC trained on 0.5%, and
the active-learning protocol (0.1% from the top 5%
of the first-round ML ranking was added to the original
randomly selected 0.1% training set)
80% recovered by the 0.5% model and by AL when the top 5% of ML-ranked
compounds is evaluated
The advantage is obvious in early recovery and is similar
for the experimental hits (104)
Comparison of computational cost
Conclusions
Ultra-large libraries will soon become intractable for
atomistic docking, which limits docking campaigns
Active learning minimizes the number of docking
calculations
Loss of diversity can be reduced by adding a
redocking step at the end
Expands the scope of compounds in typical virtual
screening, democratizing access to ultra-large
chemical libraries
Any questions?
