Efficient Exploration of Chemical Space


Presented by Erick Tavares
Active Learning

Active learning is a machine learning approach where the model is
trained on a subset of data initially and then selectively chooses
additional data points for training based on the current model's
uncertainties or areas of difficulty. This iterative process aims to
improve model performance with fewer labeled examples, making the
learning process more efficient.
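[Figure: active-learning loop. Out of 10 000 images, a subset of 100 is labeled (cats or non-cats) and used to train the model; the model's uncertainty on the remaining 9 900 decides which images you label next.]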
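The loop described above can be sketched in a few lines. This is only a toy illustration, not the method used in the article: each "image" is reduced to a single number, the hypothetical `oracle` function plays the role of the person labeling cats, and the "model" is a simple threshold. Uncertainty is taken to be closeness to the current decision boundary.

```python
import random

def oracle(x):
    """Hypothetical labeler standing in for the human: 1 = cat, 0 = non-cat."""
    return 1 if x > 0.62 else 0

def fit_threshold(labeled):
    """Toy model: decision boundary halfway between the two labeled classes."""
    neg = [x for x, y in labeled if y == 0]
    pos = [x for x, y in labeled if y == 1]
    lo = max(neg) if neg else 0.0
    hi = min(pos) if pos else 1.0
    return (lo + hi) / 2

def active_learning(pool, n_initial=5, n_rounds=10, seed=0):
    """Label a small random subset, then repeatedly label the most uncertain point."""
    rng = random.Random(seed)
    pool = list(pool)
    rng.shuffle(pool)
    labeled = [(x, oracle(x)) for x in pool[:n_initial]]
    unlabeled = pool[n_initial:]
    for _ in range(n_rounds):
        boundary = fit_threshold(labeled)
        # Most uncertain point = the one closest to the current boundary.
        x = min(unlabeled, key=lambda v: abs(v - boundary))
        unlabeled.remove(x)
        labeled.append((x, oracle(x)))
    return fit_threshold(labeled), len(labeled)

if __name__ == "__main__":
    pool = [i / 10000 for i in range(10000)]  # 10 000 unlabeled items
    boundary, n_labels = active_learning(pool)
    print(f"estimated boundary {boundary:.4f} after {n_labels} labels")
```

With only 15 labels out of 10 000 items, the estimated boundary ends up close to the hidden one, because every new label is spent where the model is least sure.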
1. Introduction
2. Materials and methods
3. Results and discussion
4. Conclusions
5. Q&A session
Introduction

A huge number of compounds is available for virtual screening
ZINC database: past 1 billion readily synthesizable compounds
448 M in the lead-like range
This article: 3 ultra-large libraries
Their size and diversity mean more ligands that can bind to a target
Empirical docking functions are used to prioritize the better ligands
Ultra-large screens are still uncommon: the computational cost is high and is becoming impractical
Introduction

Deep learning
Prediction of compound properties, design of chemical structures, reaction prediction, retrosynthetic analysis, and prediction of protein-ligand interactions (convolutional neural networks)
Machine learning has been combined with molecular docking to learn docking scores, but comparisons of hit rates between traditional and active-learning protocols are missing
Materials and methods

DATASETS: three protein systems used
AmpC beta-lactamase (AmpC): 99 459 562 docking results (prefers anionic molecules)
D4 dopamine receptor: 138 312 677 molecules (favors cationic and neutral)
Blind test: MT1 melatonin receptor, 150 927 915 molecules (favors cationic and neutral)
AutoQSAR/DeepChem (AQ/DC)
Glide SP and DOCK3.7
User-adjustable parameter: model search time (length of the search); random search of hyperparameters (number of layers, number of training epochs, and normalization scheme)
5-fold cross-validation
Best-performing models combined: 15 models, scores averaged, standard deviation provided
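As a rough illustration of the last point, an ensemble prediction can be summarized by the mean and the spread of the individual model outputs. The sketch below is an assumption-laden toy: `make_toy_ensemble` and its linear "models" are hypothetical stand-ins for the trained networks, and the population standard deviation is used because the slides do not say which estimator is applied.

```python
from statistics import mean, pstdev

def make_toy_ensemble(n_models=15):
    """Hypothetical stand-ins for trained models: each maps a feature to a score."""
    return [lambda x, k=k: -8.0 + 0.1 * k * x for k in range(n_models)]

def ensemble_predict(models, x):
    """Return (average score, standard deviation) over all models for one input."""
    predictions = [model(x) for model in models]
    return mean(predictions), pstdev(predictions)

if __name__ == "__main__":
    models = make_toy_ensemble()
    score, uncertainty = ensemble_predict(models, 1.0)
    print(f"predicted score {score:.2f} +/- {uncertainty:.2f}")
```

The standard deviation is what later serves as the uncertainty measure in the selection rules.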

Yang et al., 2023
Materials and methods

Docking calculations
Glide and DOCK3.7, with prepared 3D structures
Chemical libraries
100 M subsets of the ZINC15 library of commercially available compounds: smaller, but every compound is in stock
ZINC15: enumeration of building blocks and reaction database
Combinatorial design is well suited to exploring chemical space and allows more robust training
Null model based on 2D chemical similarity
Effectiveness for recovering virtual hit compounds
Figure 1: the top compounds by docking score serve as probe molecules; the final score for each compound is its maximum fingerprint similarity to the probes
GPU used for the fingerprint calculations
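The null model can be sketched as follows, assuming fingerprints are represented as sets of on-bit indices and similarity is the Tanimoto coefficient (the coefficient named later in the diversity analysis). The compound names and fingerprints are made up for illustration; real fingerprints would come from a cheminformatics toolkit.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on bits."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)

def null_model_scores(library, probes):
    """Score each library compound by its maximum similarity to any probe."""
    return {
        name: max(tanimoto(fp, probe) for probe in probes)
        for name, fp in library.items()
    }

if __name__ == "__main__":
    # Hypothetical probe (a top-scoring docked compound) and tiny library.
    probes = [{1, 2, 3, 4}]
    library = {"cpd_a": {1, 2, 3, 4}, "cpd_b": {1, 2}, "cpd_c": {9}}
    for name, score in null_model_scores(library, probes).items():
        print(name, score)
```

Ranking the library by this score gives a baseline that uses 2D similarity alone, with no learned model.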

Yang et al., 2023


Materials and methods

Yang et al., 2023


Results and Discussion
Recovery of virtual hits
Top 10k virtual hits from DOCK3.7 for D4 and AmpC (A and B)
Experimentally verified hit compounds (C and D)
Top 10k from Glide SP (2E and 2F)
Docking the top 5% of the library ranked by AQ/DC recovers 80% and 98% of the virtual hits (DOCK3.7 and Glide SP) for the D4 screen
97% and 70% for AmpC.
How to improve the model?
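The recovery percentages on this slide are a recall: the share of the top-N compounds by docking score that also fall inside the top fraction of the ML ranking. A minimal sketch, assuming lower scores are better for both docking and ML, with hypothetical dictionaries mapping compound IDs to scores:

```python
def hit_recovery(docking_scores, ml_scores, n_hits, fraction):
    """Share of the top n_hits docking compounds found in the top `fraction` by ML.

    Assumption: lower (more negative) scores are better in both dictionaries.
    """
    virtual_hits = set(sorted(docking_scores, key=docking_scores.get)[:n_hits])
    n_keep = round(len(ml_scores) * fraction)
    prioritized = set(sorted(ml_scores, key=ml_scores.get)[:n_keep])
    return len(virtual_hits & prioritized) / len(virtual_hits)

if __name__ == "__main__":
    # Toy library of 100 compounds; the ML model misranks compound 1.
    docking = {i: float(i) for i in range(100)}
    ml = dict(docking)
    ml[1] = 50.0
    print(hit_recovery(docking, ml, n_hits=4, fraction=0.05))
```

In the toy data, three of the four virtual hits land in the top 5% of the ML ranking, so the recovery is 0.75.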

Yang et al., 2023


Results and Discussion
Improve the model

Increase information content
Selection rule in the active-learning scheme used to select the compounds for model training
Size of the training set
Increase the size of the training set from 0.1% to 0.2% and 0.5% of the ZINC screening subset

Yang et al., 2023


Results and Discussion
Improve the model
Model: 0.1% selected and trained, then a further 0.1% selected and docked
Selection rule
Top 0.1% by ML score: almost no improvement
Most uncertain 0.1% according to the ML model, as defined by the standard deviation of the ensemble predictions for a given compound: decreases recall
0.1% randomly selected compounds from the top 10% by ML: similar to training on 0.2% of the data set
Most uncertain 0.1% from the top 5% of compounds by ML: best recovery
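The four selection rules can be written side by side, which makes the difference between them easier to see. This is a sketch under stated assumptions: `ml_mean` and `ml_std` are hypothetical dictionaries holding the ensemble mean and standard deviation per compound, lower mean scores are taken as better, and the rule names are invented here for illustration.

```python
import random

def select_for_docking(ml_mean, ml_std, budget, rule, seed=0):
    """Pick `budget` compounds to dock next, according to one selection rule."""
    ranked = sorted(ml_mean, key=ml_mean.get)  # best predicted score first
    n = len(ranked)
    if rule == "top":  # top compounds by ML score
        return ranked[:budget]
    if rule == "uncertain":  # most uncertain in the whole library
        return sorted(ml_std, key=ml_std.get, reverse=True)[:budget]
    if rule == "random_top10":  # random picks from the top 10% by ML
        return random.Random(seed).sample(ranked[: n // 10], budget)
    if rule == "uncertain_top5":  # most uncertain within the top 5% by ML
        top5 = ranked[: n // 20]
        return sorted(top5, key=lambda c: ml_std[c], reverse=True)[:budget]
    raise ValueError(f"unknown rule: {rule}")

if __name__ == "__main__":
    # Toy library of 1000 compounds; uncertainty grows as the score gets worse.
    ml_mean = {i: float(i) for i in range(1000)}
    ml_std = {i: i / 1000 for i in range(1000)}
    for rule in ("top", "uncertain", "random_top10", "uncertain_top5"):
        print(rule, select_for_docking(ml_mean, ml_std, 2, rule))
```

The last rule combines both signals: it stays inside the region the model already considers promising, but spends the docking budget where the ensemble disagrees most.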
Results and Discussion
Improve the model

3A and B: with the smaller training subset, 50% of the compounds are retrieved within the top 2% of compounds scored by the ML model
3E and F: recovery of Glide SP virtual hits, where the effect for AmpC is large
Number of docking calculations: active learning at 0.2% means 200 000 out of 100 000 000; using 0.5% means 500 000, so active learning saves 300 000 docking calculations
This does not hold for experimental hits, where the active-learning approach performs almost the same as the larger training subset
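The docking-count arithmetic above can be checked with a short helper. The function name is hypothetical; percentages are passed as strings so that `Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction

def n_docked(library_size, percent):
    """Number of compounds docked when `percent` of the library is docked."""
    return int(library_size * Fraction(percent) / 100)

if __name__ == "__main__":
    library_size = 100_000_000
    active_learning = n_docked(library_size, "0.2")  # 0.1% initial + 0.1% selected
    larger_subset = n_docked(library_size, "0.5")
    print(active_learning, larger_subset, larger_subset - active_learning)
```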
Results and Discussion
Improve the model

Possible explanation: maybe because the model was trained on 0.1% of the ZINC database

Yang et al., 2023


Results and Discussion
Chemical Diversity Analysis of Recovered Hits from the AQ/DC Model

Models based on fingerprints (SMILES)
Ensure they were not simply learning a simple measure of ligand similarity
Clustered the virtual hit compounds according to both Glide and DOCK3.7
Clustering: Tanimoto coefficient from 0.3 to 0.6
Cluster recovery = any ligand in the cluster was among the ML-prioritized ligands
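The cluster-recovery definition translates directly into code. In this sketch the clustering itself is assumed to be done already: `cluster_of` is a hypothetical mapping from each virtual hit to its cluster ID, and `prioritized` is the set of compounds the ML model selected.

```python
def cluster_recovery(cluster_of, prioritized):
    """Fraction of hit clusters with at least one member among the prioritized set."""
    all_clusters = set(cluster_of.values())
    recovered = {
        cluster for hit, cluster in cluster_of.items() if hit in prioritized
    }
    return len(recovered) / len(all_clusters)

if __name__ == "__main__":
    # Five virtual hits in three clusters; the ML model prioritized two of them.
    cluster_of = {"h1": 0, "h2": 0, "h3": 1, "h4": 2, "h5": 2}
    prioritized = {"h2", "h5", "other"}
    print(cluster_recovery(cluster_of, prioritized))
```

Counting clusters rather than individual hits is what separates chemical diversity from raw recall: recovering many hits from a single cluster still counts as one.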
Results and Discussion
Chemical Diversity Analysis of Recovered Hits from the AQ/DC Model

D4 screening: 60% and 80% of the clusters were recovered if we dock the top 2% and 5% of predictions
Results and Discussion
Chemical Diversity Analysis of Recovered Hits from the AQ/DC Model

Count unique Bemis-Murcko scaffolds: the number of scaffolds for the top 10k by DOCK compared to machine learning goes from 4515 to 1595
If we redock the top 2% and 5% scored by the AQ/DC model, the overlap of common scaffolds is much better: 2M-5M compounds to recover similar diversity
This is also seen with Glide SP
Strong benefits from ML followed by rescoring the top-ranked compounds with explicit docking
Results and Discussion
Additional Rounds of Active Learning

Good gains up to 3 iterations; after that the increase is small


Results and Discussion
Null Model
Is the ML gaining information besides learning the training set?
Random selection of 0.2% of the library was docked
Machine learning with 2 iterations: learning features of docking itself
Results and Discussion
Comparison of libraries

ZINC vs Sigma-Aldrich MARKET (8M unique compounds)

AmpC: redocking the top 2% recovers 40% for ZINC, but only 20% for Sigma
D4: from 80% to 40%
Conclusion: the library is important for the effectiveness of the workflow
Results and Discussion
Blind test on MT1 system

DOCK3.7 on the MT1 system
150 927 915 molecules
AQ/DC trained on 0.1% docked molecules, on 0.5%, and with the active-learning protocol (0.1% from the top 5% of the first-round ML was added to the original randomly selected 0.1% training set)
80% recovered by the 0.5% model and by AL within the top 5% of compounds evaluated by ML
The advantage is obvious in early recovery; similar for the experimental hits (104)
Results and Discussion
Comparison of computational cost
Conclusions

Ultra-large libraries will soon become intractable for atomistic docking, limiting its use
Active learning minimizes the number of docking calculations
Loss of diversity can be reduced by adding a redocking step at the end
Expands the scope of compounds covered by typical virtual screening, democratizing access to ultra-large chemical libraries
Any questions?
1. Introduction
2. The need for active learning in virtual screening
3. Impact of active learning on computational cost
4. Evaluation of the active-learning protocol
5. Chemical diversity analysis and recovery of hit compounds
6. Application and blind test
7. Conclusion and future directions
8. Q&A session