Accelerating Chemical Design and Synthesis Using Artificial Intelligence RISE Open Workshop 2020-05-29 Conference Binder
Some fun statistics: Throughout the day we had 90-110 participants viewing the presentations and participating actively in the Q&A sessions. The audience
was spread all around the globe (Europe, US, Far East, Australia, South America) and we hope that the day was valuable to you all, and worth either getting
up really early, or staying up really late!!
This workshop was a combined effort by several cross-functional departments at RISE, which jointly defined the contents: the units Process Chemistry, Toxicology/Safety Assessment and AI/Digital Systems. The day addressed how Machine Learning and AI are rapidly reshaping how chemicals are designed and manufactured with optimal functional characteristics. Achieving optimal functionality is always a balance between the desired physico-chemical properties of a molecule, its hazard/risk potential and the practicality of scaling up production. In each of these areas, Machine Learning/AI techniques present great opportunities for achieving this balance in a more proactive and effective manner.
To conclude, please get in touch with any of the speakers at RISE if you have a question or a request that you think any of us can solve or assist you with.
Ulf.Tedebark@ri.se (Biomolecules)
Sverker.Janson@ri.se (Computer Science)
Ian.Cotgreave@ri.se (Chemical and Pharmaceutical Toxicology)
Alexander.Minidis@ri.se (Medicinal/Process Chemistry and Data-Science)
Erik.Ylipaa@ri.se (Deep Learning)
Fernando.Huerta@ri.se (Medicinal/Process Chemistry and Data-Science)
Swapnil.Chavan@ri.se (Computational Toxicology)
Andreas.Thore@ri.se (Materials Science and Computational Chemistry)
Martin.Nilsson@ri.se (Algorithms and Mathematics)
Agenda
(times are CEST, Stockholm time zone)
• 16.20 – 16.30 Concluding remarks. Ian Cotgreave, RISE, Sweden
09.00 – 09.15
09.15 – 09.45
Abstract
Machine Learning and Chemistry
The last couple of years have seen a shift in how machine learning is used in chemoinformatics: from methods with an emphasis on feature engineering to deep neural networks that allow molecules to be modelled directly. This talk gives a brief overview of the past, present and future of machine learning in chemistry, with a focus on deep learning. We will also look at how contemporary neural networks for processing mathematical sets are ideally suited to working with small molecules, and how lessons learned from the field of natural language processing could catalyse the field of chemoinformatics.
Machine Learning and Chemistry
RISE — Research Institutes of Sweden
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2017. https://www.deeplearningbook.org/
Machine Learning and Learning Algorithms
• Learning algorithms improve their performance on a task with experience.
• This is in contrast with a rule-based algorithm, whose performance does not change with more data.
● In the case of deep learning, the mathematical expression used to approximate f(x) has some free parameters.
● We call this function g(x; w), where w are the parameters which determine what the function computes.
○ We can think of the parameters w as an index into a set of possible functions.
● Learning is then a search problem: find the member of this set (the setting of w) which best approximates f(x).
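As a minimal, invented illustration of this search: gradient descent adjusts the free parameter w of a toy g(x; w) = w·x to reduce the squared error against sampled data (all numbers here are made up):

```python
# Fit g(x; w) = w * x to noisy samples of f(x) = 3x by gradient descent:
# each step moves w against the gradient of the mean squared error.
data = [(1.0, 3.1), (2.0, 5.9), (3.0, 9.2), (4.0, 11.8)]  # (x, f(x) + noise)
w = 0.0                                   # initial parameter setting
lr = 0.01                                 # learning rate
for _ in range(500):
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad
# w is now close to 3, the slope underlying the data
```

Each value of w indexes one member of the function family; the loop walks through that family toward a good approximation of f(x).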
Generalization
● We want the function to be general, to learn the underlying phenomenon.
● The goal is not to directly minimize the error on training data, but on new data not seen during training.
● A flexible g(x; w) can have many settings which minimize the training error without being good in general.
● Machine learning as a practice is about finding function families which perform well in general as they minimize their error on training data.
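A toy contrast (entirely invented numbers) between memorizing the training data and fitting a parametric g(x; w): the lookup table reaches zero training error but cannot answer for unseen x, while even a crude parametric fit generalizes:

```python
# A lookup table memorizes training data perfectly (zero training error)
# but cannot generalize to unseen inputs, unlike a parametric g(x; w).
train = {0: 0.0, 1: 2.1, 2: 3.9, 3: 6.2}     # noisy samples of f(x) = 2x
lookup = dict(train)                          # "model" = memorized pairs
linear = lambda x, w=2.05: w * x              # crude parametric fit g(x; w)

train_err_lookup = sum((lookup[x] - y) ** 2 for x, y in train.items())
# train_err_lookup is exactly 0.0, yet lookup has no answer for x = 10,
# while linear(10) = 20.5 is close to the underlying f(10) = 20.
```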
– Molecular graphs
– String representations, e.g. SMILES and InChI:
C1=CC(=C(C=C1CCN)O)O
InChI=1S/C6H11N2O4PS3/c1-10-5-7-8(6(9)16-5)4-15-13(14,11-2)12-3/h4H2,1-3H3
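As a small illustration that string representations are plain text a program can process directly, here is a minimal SMILES tokenizer applied to the dopamine SMILES from the slide (a sketch, not a full SMILES parser):

```python
import re

# Minimal SMILES tokenizer (illustrative only): splits a SMILES string into
# two-letter atoms, bracket atoms, ring-closure digits, bonds and branches.
TOKEN = re.compile(r"Cl|Br|\[[^\]]+\]|%\d{2}|[A-Za-z]|\d|[=#()/\\+@-]")

smiles = "C1=CC(=C(C=C1CCN)O)O"   # dopamine, from the slide
tokens = TOKEN.findall(smiles)
atoms = [t for t in tokens
         if t in ("Cl", "Br") or t.startswith("[")
         or (len(t) == 1 and t.isalpha())]
# 8 carbons, 1 nitrogen, 2 oxygens: the 11 heavy atoms of dopamine
```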
Jha, D., Ward, L., Paul, A. et al. ElemNet: Deep Learning the Chemistry of Materials From Only Elemental Composition. Sci Rep 8, 17593 (2018). doi:10.1038/s41598-018-35934-y
Neural Networks
[Figure: an encoder embeds a source input (e.g. a speech waveform) into a learnt embedding that feeds several decoders]
• The encoder-decoder setup allows for multitask learning and transfer learning.
• The same encoder can embed the input data in a space which is useful for several target tasks.
Mahé, Pierre, et al. "Graph kernels for molecular structure–activity relationship analysis with support vector machines." Journal of Chemical Information and Modeling 45.4 (2005): 939-951.
Neural Networks for molecules
• There have been four main directions for dealing with molecules and neural networks:
– Traditional machine learning – traditional neural network with dense layers fed
with molecular descriptors and/or structural fingerprints. Tree ensembles often
perform very well with these representations.
– SMILES-based – a sequence neural network reads the molecule description as a
string
– Graph-based – neural networks for mathematical sets process the molecule as a
graph
– Geometrical – geometric neural networks process the molecule as a point cloud represented in 3D space
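As a hedged sketch of the first, "traditional" direction (everything here is invented for illustration): a dense layer with a ReLU nonlinearity fed a binary structural fingerprint. A real model would learn the weights from labelled data.

```python
import random

# Sketch: a dense layer fed a toy binary structural fingerprint.
# Weights are random placeholders, not trained values.
random.seed(0)
fingerprint = [1, 0, 1, 1, 0, 0, 1, 0]        # toy 8-bit fingerprint
n_hidden = 4
weights = [[random.uniform(-1, 1) for _ in fingerprint]
           for _ in range(n_hidden)]
biases = [0.0] * n_hidden
hidden = [max(0.0, b + sum(w * x for w, x in zip(row, fingerprint)))
          for row, b in zip(weights, biases)]   # ReLU activations
```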
SMILES-based
• While SMILES "hides" the graph behind a linearization, it completely encodes the graph structure together with bond information.
– Graph neural networks often have to add specialized architecture to encode the graph structure, especially for edge attributes such as different bond types.
• Generating graphs with GNNs is difficult: we need to generate the edge set and can run into set-matching problems. We typically need to induce some ordering on the edge set, which can get complicated. See Jin, Wengong, Regina Barzilay, and Tommi Jaakkola. "Hierarchical Generation of Molecular Graphs using Structural Motifs." arXiv preprint arXiv:2002.03230 (2020).
• For a transformer, the graph information needs to be added in the reduction and feature functions, typically as additional pair-wise information.
Schwaller, Philippe, et al. "Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction." ACS Cent. Sci. 2019, 5 (9), 1572–1583.
Nogueira, Rodrigo, Zhiying Jiang, and Jimmy Lin. "Document ranking with a pretrained
sequence-to-sequence model." arXiv preprint arXiv:2003.06713 (2020).
Transfer Learning dominates NLP
• Natural Language Processing is shifting from task-specific solutions to reusing general language models.
• A single huge model (hundreds of millions of parameters) is trained on a source task with huge amounts of data (tens to hundreds of gigabytes of text).
• The same model is re-used as a frontend to obtain state-of-the-art results on a very diverse set of target tasks.
[Chart comparing Pre-OpenAI SOTA, BiLSTM+ELMo+Attn, OpenAI GPT, BERT_large and RoBERTa]
Howard, Jeremy, and Sebastian Ruder. "Universal Language Model Fine-tuning for Text Classification." arXiv preprint arXiv:1801.06146 (2018).
Transfer learning in neural networks
1. Train a deep network (e.g. an image encoder) on a source task where vast amounts of data are available.
2. Remove the parts specific to the first task.
3. Train a new task-specific part producing the output for the target task.
Limits of supervised transfer learning
Abstract
Molecular de-novo design and synthesis prediction
Drug discovery is the earliest part of developing new medicines for unmet medical needs of patients. Transforming early hits into clinical candidates is a years-long project with multiple rounds of designing new compounds with proposed better properties, synthesizing the compounds in the laboratory, testing their properties and analyzing the results to feed new rounds of this design-make-test-analyze cycle. Our aim is to utilize artificial intelligence to speed up these processes and allow us to create new medicines faster. In the molecular AI group, we develop tools to answer two questions: what molecules to make, and how to make them.
Part 1 will cover examples from the design phase, where we use a sequential molecular notation format, SMILES, in combination with deep neural networks inspired by natural language processing. Using recurrent neural network architectures, we can read in molecules in the SMILES format but also, and more importantly, read out molecules in SMILES format. The latter property is what enables us to handle massive molecular spaces by probabilistic sampling. The generation can be steered in directions of interest by different techniques, such as transfer learning, autoencoders, reinforcement learning and conditional recurrent networks. Alternative architectures allow us to generate molecular series or to generate subtle changes in molecules to induce adjustments of existing compounds.
Part 2 will address the second question: how to make the suggested molecules. Using large databases of known reactions pooled from public, in-licensed and internal sources, we extract reaction templates which cover the core of the reaction. To search for potential synthetic routes, we use Monte Carlo tree search coupled to neural network policies. The neural networks are trained to prioritize our template library and predict the most likely templates to be applied. Thus, the search breadth of the tree search is reduced by orders of magnitude and the search can be performed in seconds to minutes. However, the policy networks prioritize frequently used reactions, due to the imbalance of the reaction databases, and overlook more seldom-used and specialized reactions. By training policy networks on selected reaction classes, such as ring-forming reactions, special attention is given to these reactions, which can be injected into the tree search or used as alternative options in interactive path planning. Interestingly, SMILES-based translation can also play a role in reaction informatics.
AstraZeneca: Global dimensions
• Total revenue: $22.1bn (down 2% over 2017); product sales: $21.1bn (up 4% over 2017)
• $5.9bn invested in R&D, with research across five countries
• 64.6k employees
• 149 projects in clinical development and eight NMEs in late-stage development
• 45% of our senior roles are filled by women
• Three global strategic R&D centres: Gothenburg, Cambridge and Gaithersburg; additional laboratories in Boston, California, Osaka and Shanghai
Drug Discovery is just the beginning…
Life-cycle of a medicine: we are one of only a handful of companies to span the entire life-cycle of a medicine, from research and development to manufacturing and supply, and the global commercialisation of primary care and speciality care medicines.
The design-make-test-analyse cycle, ~3 years from hit to candidate:
• Drug target → Design → Make → Test → Analyse
• Chemical starting point ("hit"), found through HTS, DEL, fragment screening or knowledge: weakly active, target unselective, toxicity risk, low metabolic stability
• Candidate drug: highly potent, effective in in vivo models, metabolically stable, no toxicity issues
Training and Sampling using Recurrent Neural Networks
• The same computation is applied at each time step.
• Previous computations influence the current computation.
[Figure: the unrolled network during training and during sampling]
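A sketch of the sampling loop: at each step the model outputs a distribution over the next token given the tokens so far. Here a toy bigram table stands in for the trained RNN (tokens and probabilities are invented); `^` and `$` mark start and end of sequence:

```python
import random

# Autoregressive sampling sketch: a toy bigram table replaces the RNN's
# learned next-token distribution (illustration only).
random.seed(1)
probs = {"^": {"C": 0.8, "O": 0.2},
         "C": {"C": 0.5, "O": 0.2, "$": 0.3},
         "O": {"C": 0.6, "$": 0.4}}

tokens, cur = [], "^"
while True:
    nxt = random.choices(list(probs[cur]),
                         weights=list(probs[cur].values()))[0]
    if nxt == "$":            # end-of-sequence token: stop sampling
        break
    tokens.append(nxt)
    cur = nxt
smiles = "".join(tokens)      # a sampled carbon/oxygen chain
```

Because sampling is probabilistic, repeated runs draw different strings, which is what lets the model cover a massive molecular space.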
Useful as an expansion of libraries for in silico HTS? (Solid: training set; dashed: generated molecules.)
Bjerrum, E. J.; Threlfall, R. Molecular Generation with Recurrent Neural Networks (RNNs). 2017. http://arxiv.org/abs/1705.04612
Why? Generation of novel compounds in the 10^60 chemical space, compared with the roughly 10^10–10^12 molecules of existing libraries!
Segler, M. H. S.; Kogej, T.; Tyrchan, C.; Waller, M. P. Generating Focused Molecule Libraries for Drug Discovery with Recurrent Neural Networks. ACS Cent. Sci. 2018. https://doi.org/10.1021/acscentsci.7b00512
Assessing the quality of molecular generative models
1) Arús-Pous J, Blaschke T, Ulander S, et al (2019) Exploring the GDB-13 chemical space using deep generative models. J Cheminform 11:20. https://doi.org/10.1186/s13321-019-0341-z
2) Arús-Pous J, Johansson SV, Prykhodko O, et al (2019) Randomized SMILES strings improve the quality of molecular generative models. J Cheminform 11:71. https://doi.org/10.1186/s13321-019-0393-0
Data augmentation: SMILES enumeration increases chemical space coverage. One molecule (e.g. benzene, c1ccccc1) can be written as many different but equivalent SMILES strings.
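A minimal sketch of the idea (real enumeration would use a cheminformatics toolkit such as RDKit with randomized atom ordering): for an unbranched chain, either end can serve as the SMILES root atom, so one molecule yields several equivalent strings:

```python
# Randomized-SMILES sketch for a simple unbranched chain: traversal can
# start from either end, giving two equivalent strings for one molecule.
chain = ["C", "C", "O"]                       # ethanol as an atom chain
variants = {"".join(chain), "".join(reversed(chain))}
# variants == {"CCO", "OCC"}: two SMILES, one molecule. Training on such
# enumerated variants augments the data and improves generative models.
```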
HeteroEncoders
Latent vectors serve as a base for quantitative structure-activity relationship (QSAR) models, and the latent space becomes more chemically relevant.
Figure adapted from: Bjerrum, Esben Jannik, and Boris Sattarov. 2018. "Improving Chemical Autoencoder Latent Space and Molecular De Novo Generation Diversity with Heteroencoders." Biomolecules.
REINVENT: An In Silico mini-DMTA cycle
The Design step is based on a generative RNN decoder; we currently use REINVENT.
Blaschke, Thomas; Arús-Pous, Josep; Chen, Hongming; Margreitter, Christian; Tyrchan, Christian; Engkvist, Ola; et al. (2020): REINVENT 2.0 – an AI Tool for De Novo Drug Design. ChemRxiv. Preprint. https://doi.org/10.26434/chemrxiv.12058026.v2
Further options in molecular generation and design: optimizations on REINVENT, and better molecular optimization via graph-based molecular representations.
Return of the encoder architectures: conditional RNNs
An encoder-decoder setup in which the decoder is conditioned on property descriptors (LogP, TPSA, RDKit MolWeight, HBA, HBD) gives control of the properties of the generated molecules.
Kotsias, P.-C.; Arús-Pous, J.; Chen, H.; Engkvist, O.; Tyrchan, C.; Bjerrum, E. J. Direct Steering of de Novo Molecular Generation with Descriptor Conditional Recurrent Neural Networks. Nat. Mach. Intell. 2020, 2 (May).
Molecular optimization
• Goal
– Given a starting molecule, generate molecules with desired property changes while maintaining similarity to the original molecule
– Capture chemists' intuition with respect to chemical transformations that change the properties of molecules (matched molecular pairs)
• Methods
– Seq2Seq
– Transformer
– Graph-based
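A toy sketch of the matched-molecular-pair idea (real MMP analysis fragments molecules at chemical bonds with proper cheminformatics tooling; this simple string comparison is only illustrative):

```python
# Matched-molecular-pair sketch: two molecules whose SMILES share a common
# prefix and differ in the final substituent. The recorded transformation,
# collected over a database, captures which substituent swaps tend to
# shift a given property.
a, b = "CCO", "CCN"                   # ethanol vs ethylamine
i = 0
while i < min(len(a), len(b)) and a[i] == b[i]:
    i += 1
core = a[:i]                          # shared core: "CC"
pair = (a[i:], b[i:])                 # the O -> N swap: ("O", "N")
```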
Scaffold-based decoration using SMILES
1) Arús-Pous J, Patronov A, Bjerrum EJ, et al (2020) SMILES-based deep generative scaffold decorator for de novo drug design. ChemRxiv
Conclusions part 1
Part 2: How to make the compounds? Synthesis Prediction
From Design to Compound: the Make step
In the design-make-test-analyze cycle, the Make step turns a designed structure into a real compound: R1 + R2 → P under conditions C with yield Y (e.g. NMP, MeCN, 93%).
Different Objectives for Synthesis Prediction
• Condition prediction: given R1 + R2 → P, predict the conditions C (and yield Y)
• Forward prediction: given R1 + R2 and conditions C, predict the product P
• Reaction feasibility: will R1 + R2 actually give P?
• Retro-synthesis, 1-step (backward): P → ? + ?
• Retro-synthetic planning: apply 1-step retro-synthesis recursively, building a search tree from the product P down to available starting materials
Chemistry Reaction Data
Reaction data are pooled from internal electronic lab notebooks (MedChem ELN, PharmSci ELN) and from external sources (Reaxys, Pistachio, USPTO), often delivered as flat files, into ChemConnect / ReactionConnect. These feed the predictive reaction models used in the Design and Test steps (iLab, MedChem, PharmDev).
Template extraction: dataset size and number of templates extracted per source.
AlphaGo-style architecture
Segler, M. H. S.; Preuss, M.; Waller, M. P. Planning Chemical Syntheses with Deep Neural Networks and Symbolic AI. Nature 2018, 555 (7698), 604–610. https://doi.org/10.1038/nature25978
Results in seconds to minutes
Model: USPTO. Time taken: 3.26 s.
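A toy recursive route search in the same spirit (the "templates" and the stock set below are invented string rewrites; the real system ranks extracted reaction templates with a neural policy inside a Monte Carlo tree search):

```python
# Toy retrosynthesis: apply "templates" backwards from the product until
# every fragment is a purchasable building block. All data are invented.
templates = {"CCON": ["CCO", "N"],    # hypothetical disconnection rules
             "CCO":  ["CC", "O"]}
stock = {"CC", "O", "N"}              # hypothetical purchasable compounds

def solve(target):
    """Return a flattened route to stock compounds, or None if stuck."""
    if target in stock:
        return [target]
    if target not in templates:
        return None                   # dead end: no template applies
    route = [target]
    for precursor in templates[target]:
        sub = solve(precursor)
        if sub is None:
            return None
        route += sub
    return route

print(solve("CCON"))  # ['CCON', 'CCO', 'CC', 'O', 'N']
```

Because the policy prunes which templates are even tried, the real tree search explores orders of magnitude fewer branches than exhaustive application of the template library.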
Do more reactions equal better performance?
Overall performance as measured for each dataset.
Thakkar, A.; Kogej, T.; Reymond, J.-L.; Engkvist, O.; Bjerrum, E. J. Datasets and Their Influence on the Development of Computer Assisted Synthesis Planning Tools in the Pharmaceutical Domain. Chem. Sci. 2019, 11 (1). https://doi.org/10.1039/C9SC04944D
Making the tool available
The Web GUI is based on MIT MLDPS consortium tools. The value: chemists can quickly get suggested routes and ideas leading to purchasable compounds, and cheminformaticians can filter datasets into "synthesizable / not-synthesizable".
(1) Schwaller, P.; Laino, T.; Gaudin, T.; Bolgar, P.; Hunter, C. A.; Bekas, C.; Lee, A. A. Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction. ACS Cent. Sci. 2019, 5 (9), 1572–1583. https://doi.org/10.1021/acscentsci.9b00576
Automating the DMTA cycle
• REINVENT is automated design
• AiZynthFinder is automated route planning
• iLab is automated synthesis
• Increased automation is key to speed
Conclusions part 2
Toolkits – Source code – Links
ReInvent: https://github.com/MolecularAI/Reinvent
Molvecgen: https://github.com/EBjerrum/molvecgen
Deep Drug Coder: https://github.com/pcko1/Deep-Drug-Coder
AiZynthFinder: TBR: https://github.com/MolecularAI
Blogposts: www.cheminformania.com
Acknowledgements
Molecular AI group:
Ola Engkvist, Associate Director, Molecular AI
Panagiotis-Christos Kotsias, Graduate Scientist, Graduate Programme
Josep Arus Pous, Ph.D student, BIGCHEM
Jiazhen He, Post.doc, Molecular AI
Amol Thakkar, Ph.D student, BIGCHEM
Dean Sumner, Graduate Scientist, Graduate Programme
Veronika Chadimova, Graduate Scientist, Graduate Programme
Samuel Genheden, Data Scientist/Software Engineer
Atanas Patronov, Associate Principal Scientist
Isabella Feierberg, Associate Principal Scientist
Thierry Kogej, Associate Principal Scientist
Preeti Lyer, Machine Learning and Cheminformatics Expert
Christian Margreitter, Data Scientist
Kostas Papadopoulos, Associate Principal Scientist
Lewis Mervin, Machine Learning and Cheminformatics Expert
Christos Kannas, Machine Learning/Cheminformatics Expert
Alexey Voronov, Data Scientist/Software Engineer
Rocio Mercado, Post.doc
Tomas Bastys, Post.doc
Simon Johansson, Ph.D Student, WASP
Hampus Gummesson Svensson, Ph.D Student, WASP
Sebastian Nilsson, Tobias Rastemo, Emil Sandström, Jonathan Sundkvist, Huifang You, Carl Blomgren, Master Students

Collaborators:
Prof. Dr. Jean-Louis Reymond, Dept. of Chemistry & Biochemistry, University of Berne
Christian Tyrchan, Team Leader, Computational Chemistry
Boris Sattarov, Informatics Programmer, Science Data Software LLC
Hongming Chen, Professor, Centre of Chemistry and Chemical Biology, Guangzhou, China
Nidhal Selmi, Research Outsourcing Specialist, Hit Discovery
Peter Varkonyi, Senior Research Scientist, Computational Chemistry
Questions
Confidentiality Notice
This file is private and may contain confidential and proprietary information. If you have received this file in error, please notify us and remove it from your system, and note that you must not copy, distribute or take any action in reliance on it. Any unauthorized use or disclosure of the contents of this file is not permitted and may be unlawful. AstraZeneca PLC, 1 Francis Crick Avenue, Cambridge Biomedical Campus, Cambridge, CB2 0AA, UK, T: +44(0)203 749 5000, www.astrazeneca.com
GDB and the Chemical Space
11.20 – 11.50
Abstract
Chemical space is a concept to organize molecular diversity by postulating that different molecules occupy different regions of a mathematical
space where the position of each molecule is defined by its properties. Our aim is to develop methods to explicitly explore chemical space in
the area of drug discovery. We have enumerated all possible molecules following simple rules of chemical stability and synthetic feasibility to
form the Generated DataBases (GDB). Exploring GDB in comparison to known molecules reveals that vast areas of chemical space are still
entirely unknown yet are accessible for experimental exploration by straightforward synthetic methods. I will discuss how to visualize chemical
space and exemplify the discovery and synthesis of new scaffolds for drug discovery, and how we use machine learning methods to address
target predictions and synthesis predictions of GDB molecules. http://gdb.unibe.ch
GDB and the Chemical Space
Jean-Louis Reymond
29 May 2020, RISE AI workshop
http://gdb.unibe.ch
Philippe Schwaller
David Kreutter
Josep Arus-Pous
Daniel Probst
Sven Bühlmann
Alice Capecchi
Sacha Javor
http://gdb.unibe.ch
Finton Sirockin (Novartis)
Ola Engkvist (Astra Zeneca)
Matthias Hediger (UniBE) Florian Hollfelder (Cambridge UK)
Pierre Gönczy (EPFL) Anne Imberty (Grenoble)
Amol Thakkar
Roch-Philippe Charles (UniBE) Achim Stocker (UniBE)
Jürg Gertsch (UniBE) Luc Patiny (EPFL)
Dirk Trauner (New York) Andrea Endimiani (UniBE)
Daniel Bertrand (HiQScreen) Christian van Delden (Geneva)
Hugues Abriel (UniBE) Runze He (Space Peptides)
1. The GDB project
Enumeration pipeline, with the database fraction at each step:
1) Ring strain / topologies: graphs (114 B)
2) Unsaturations: hydrocarbons (5.4 M) and skeletons (1.3 B)
3) Heteroatoms: GDB-17 molecules (166.4 B)
5) Uniform sampling: 26.2 B subset; ChEMBL-like GDBChEMBL (10 M)
(Structures shown on the slide: Claus benzol, 1867; trinorbornane, 2007.)
Tobias Fink et al., Angew. Chem. Int. Ed. 2005, 44, 1504-1508; J. Chem. Inf. Model. 2007, 47, 342-353 (GDB-11)
Lorenz C. Blum et al., J. Am. Chem. Soc. 2009, 131, 8732-3 (GDB-13)
Lars Ruddigkeit et al., J. Chem. Inf. Model. 2012, 52, 2864-2875 (GDB-17)
Ricardo Visini et al., J. Chem. Inf. Model. 2017, 57, 700-709 (FDB17); J. Chem. Inf. Model. 2017, 57, 2707-2718 (GDB4c)
Mahendra Awale et al., Mol. Inf. 2019, 38, 1900031 (GDBMedChem)
Sven Bühlmann et al., Front. Chem. 2020, doi:10.3389/fchem.2020.00046
Molecular quantum numbers (42D), values given as morphine / penicillin G:
Atoms: carbon 17/16; fluorine 0/0; chlorine 0/0; bromine 0/0; iodine 0/0; sulphur 0/1; phosphorus 0/0; acyclic nitrogen 0/1; cyclic nitrogen 1/1; acyclic oxygen 2/4; cyclic oxygen 1/0; heavy atom count 21/23.
Polar groups: H-bond donor atoms 3/1; H-bond donor sites 3/1; H-bond acceptor atoms 3/4; H-bond acceptor sites 3/7; positive charges 1/0; negative charges 0/1.
Bonds: acyclic single bonds 3/8; acyclic double bonds 0/3; acyclic triple bonds 0/0; cyclic single bonds 18/11; cyclic double bonds 4/3; cyclic triple bonds 0/0; rotatable bonds 0/4.
Topology: acyclic monovalent nodes 3/6; acyclic divalent nodes 0/2; acyclic trivalent nodes 0/2; acyclic tetravalent nodes 0/0; cyclic divalent nodes 8/6; cyclic trivalent nodes 9/6; cyclic tetravalent nodes 1/1; 3-membered rings 0/0; 4-membered rings 0/1; 5-membered rings 1/1; 6-membered rings 4/1; 7- to 10-membered rings 0/0; atoms shared by fused rings 7/2; bonds shared by fused rings 6/1.
Kong Thong Nguyen et al., ChemMedChem 2009, 4, 1803-1805
Daniel Probst et al., Bioinformatics, 2018, 34, 1433-1435, J. Chem. Inf. Model. 2018, 58, 1–7
Mahendra Awale et al., Mol. Inf.. 2019, 38, 1900031
Ring systems
Starting from GENG: 93,463 graphs (≤ 4 cycles, ≤ 16 nodes).
1) Ring enlargement + filters: 728,391 saturated carbocyclic ring systems (≤ 4 cycles, ≤ 30 atoms)
2) Aromatization: GDB4c, 916,130 carbocyclic ring systems (as SMILES)
3) Stereoisomers: GDB4c3D, 6,555,929 carbocyclic ring systems (as sdf files / SMILES)
For comparison, RDB contains 12,536 known ring systems.
Ricardo Visini, Josep Arùs-Pous et al., J. Chem. Inf. Model. 2017, 57, 2707-2718
Deep learning
1) Training: a generative LSTM neural network is trained on ChEMBL, DrugBank, FDB17 or commercial fragments.
2) Transfer learning: fine-tune on one known drug.
3) Generate new SMILES: retain correct SMILES, remove duplicates, remove undesirable functional groups.
4) Select high-similarity analogs → new drug analogs.

Table 2. Number of unique/total high-similarity drug analogs produced by the different LSTM neural networks.

Neural network:   LSTM1     LSTM2     LSTM3     LSTM4       LSTM5     LSTM6
Source database:  ChEMBL    ChEMBLs   DrugBank  Commercial  FDB17     All databases
                                                Fragments
Training cpds.:   344,319   40,000    5,104     40,986      500,000   890,409

Drug            LSTM1   LSTM2     LSTM3    LSTM4     LSTM5      LSTM6      Unique across LSTMs
Nicotine        0/23    32/82     1/32     32/93     9/47       16/67      166
Fencamfamine    15/42   126/218   40/96    130/231   92/164     41/114     580
Aminophenazone  5/26    34/96     23/71    38/99     22/66      19/65      223
Sulfadiazine    6/27    19/59     11/37    28/74     8/30       2/25       124
Miconazole      2/10    301/500   268/438  174/336   0/0        153/256    1134
Roflumilast     8/15    319/557   117/283  351/585   0/0        45/166     1126
Lovastatin      0/1     631/986   460/757  352/625   487/728    289/530    2729
Epothilone D    0/1     911/1301  561/831  807/1160  1595/2039  1163/1511  5707
Nilotinib       0/1     506/666   180/321  218/381   0/0        243/355    1362
Erythromycin    0/2     832/1042  174/243  524/709   1243/1444  1105/1288  4190
[Figure: MQN fingerprint bit values (D1–D10) for adamantane and related compounds]
Mahendra Awale et al., J. Chem. Inf. Model. 2014, 54, 1892-1907
Target prediction
For a query molecule, targets are predicted by four approaches:
• Nearest-neighbour (NN) search (MQN / Xfp / ECfp4) in ChEMBL
• ECfp4-NB model built on the 2,000 nearest neighbours (NN+NB)
• ECfp4-NB model built on ChEMBL (NB)
• ECfp4-DNN model built on ChEMBL (DNN)
Mahendra Awale et al., J. Chem. Inf. Model. 2019, 59, 10-17
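A sketch of the nearest-neighbour variant (toy fingerprints and target annotations, with Jaccard similarity standing in for MQN / Xfp / ECfp4 similarity on ChEMBL):

```python
# Nearest-neighbour target prediction sketch: the query molecule inherits
# the annotated targets of its most similar reference compound.
# Reference data below are invented for illustration.
refs = {"cpd_A": ({1, 4, 7, 9},  {"5-HT2A"}),
        "cpd_B": ({2, 4, 8},     {"COX-2"}),
        "cpd_C": ({1, 4, 7, 12}, {"5-HT2A", "D2"})}
query = {1, 4, 7, 9}                  # fingerprint of the query molecule

def jaccard(a, b):
    return len(a & b) / len(a | b)

best = max(refs, key=lambda name: jaccard(refs[name][0], query))
predicted_targets = refs[best][1]     # targets inherited from cpd_A
```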
PPB2 with Xfp similarity
Marion Poirier et al., ChemMedChem 2019, 14, 224-236
Alice Capecchi et al., Mol. Inf. 2019, doi:10.1002/minf.201900016
Antimicrobial peptides: peptide dendrimers
Enumerate analogs of G3KL (Lys, Leu, deletion: 52,530 analogs), then cluster (10 clusters).
Thissa N. Siriwardena et al., Angew. Chem. Int. Ed. 2018, 57, 8483-8487
Peptide Design Genetic Algorithm (PDGA)
Start from 50 random sequences. Target: tyrocidine A, cyclo[D-Phe-Pro-Phe-D-Phe-Asn-Gln-Tyr-Val-Orn-Leu] (XfPFfNQYVOL), given as a sequence or as an MXFP fingerprint. Each generation is evaluated by 1) building the SMILES, 2) computing the MXFP, and 3) computing the city-block distance (CBD) to the target MXFP. If CBD_MXFP = 0 or the run time reaches 24 h, the run exits; sequences with CBD_MXFP ≤ 300 are stored in the analogs database, and otherwise a new generation is produced. Found analogs include tyrocidine B (CBD_MXFP = 67) and retro-loloatin A (CBD_MXFP = 157).
Daniel Probst et al., J. Cheminf. 2018, 10, 66
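A minimal genetic-algorithm sketch in the spirit of PDGA (heavily simplified: Hamming distance to a hypothetical target sequence stands in for the CBD in MXFP fingerprint space, and every parameter here is invented):

```python
import random

# GA sketch: evolve random sequences toward a target by point mutation,
# keeping the closest ones each generation (elitist selection).
random.seed(0)
ALPHABET = "ACDEFGHIKLMNOPQRSTVWY"     # 20 amino acids plus O (ornithine)
target = "FPFFNQYVOL"                  # hypothetical target sequence

def dist(s):                           # Hamming distance to the target
    return sum(a != b for a, b in zip(s, target))

def mutate(s):                         # point mutation at a random position
    i = random.randrange(len(s))
    return s[:i] + random.choice(ALPHABET) + s[i + 1:]

pop = ["".join(random.choice(ALPHABET) for _ in target) for _ in range(60)]
for _ in range(1000):                  # new generations until convergence
    pop.sort(key=dist)
    if dist(pop[0]) == 0:
        break
    pop = pop[:20] + [mutate(random.choice(pop[:20])) for _ in range(40)]
best = min(pop, key=dist)
```

The real PDGA additionally uses crossover and a fingerprint-based fitness, so it can find analogs that are structurally different yet close in property space.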
TMAP of the natural products atlas (24,594 × 232)
Daniel Probst et al., J. Cheminf. 2020, doi:10.1186/s13321-020-0416-x
Atom-pair shingles and MAP4 encoding
Example shingles for an atom pair at topological distance 15:
r1: O=c |15| c(c)c
r2: O=c(c)[nH]|15|c(cc)cc
http://tm.gdb.tools/map4/
Summary and Outlook
> GDB
– GDBChEMBL
– MQN (visualize)
– Deep learning
– GDBScaffold
> Atom-pairs
– Target prediction
– Beyond Lipinski
– Peptide discovery
Machine Learning in the Regulatory Landscape
13.10 – 13.20
Abstract
N/A
EU agencies
• ECDC
• ECHA
• EMA
• EFSA
WHAT EFSA DOES
THE SCIENTIFIC PANELS
Plant protection
Plant health
GMO
Nutrition
Animal feed
Food Packaging
Food additives
Biological hazards
Chemical contaminants
In vivo: ADME studies
Data and evidence to support risk assessment are drawn from previous evaluations, reports and papers; from biological studies (e.g. ADME) and toxicological studies (e.g. genotoxicity, 90-day, developmental); and from in-depth or systematic literature searches.
Impact on regulatory risk assessment:
Opportunities for machine learning
▪ Data and evidence from literature
➢ Transparency and reproducibility
➢ In depth and systematic literature searches
➢ Controversial substances – thousands of publications
➢ Tools for literature screening (Distiller SR) and study quality evaluation (e.g.
SciRAP)
▪ Big data
➢ Whole-genome sequencing data
➢ OMICs data
▪ In silico predictions
➢ (Q)SAR
➢ RAx
➢ Need for better predictive models
Emerging risks in the food chain
Opportunities for machine learning
Does their release in the environment lead to accumulation in the food chain?
How to identify emerging issues? Alerts through text mining of scientific literature?
Abstract
Why, Where and When Machine Learning Works in Predictive Toxicology – And When it Doesn’t….
Mark Cronin, School of Pharmacy and Biomolecular Sciences, Liverpool John Moores University, UK.
E-mail: m.t.cronin@ljmu.ac.uk
We have been using multivariate statistics in predictive toxicology for many decades now; as soon as the computational power was sufficient, in the 1980s, we used supervised and unsupervised techniques to analyse what would now be thought of as rather trivial data sets. Neural networks became relatively commonplace in QSAR in the early 1990s, with some notorious early studies only proving that it was easier to overfit a model than to provide a reliable predictive technique! Whilst the data resources, hardware and machine learning techniques currently available to us, as well as the possibilities they provide, could not have been imagined 30 years ago, the uses of computational toxicology have largely remained unchanged, e.g. supporting product development, screening and prioritisation, and safety assessment. A key use is also in regulatory assessment, and this is an area where there has potentially been the greatest resistance to the use of Machine Learning QSARs. The purpose of this talk is not to discourage the use of Machine Learning in computational toxicology (precisely the opposite), but to say that we need to use it correctly and in the right place, understanding when other techniques, based on a more mechanistic approach, may be preferred and/or acceptable.
Why, Where and When Machine Learning Works in Predictive Toxicology – And When it Doesn't….
Mark Cronin
With thanks to Dr James Firman, LJMU

Machine Learning is applied in many areas: social media, epidemiology, adverse drug reactions. Key considerations:
• What do you want to learn?
• Size of data set
• Homogeneity of mechanisms
• Appropriate descriptors
• Overfitting
• Understand experimental error
• Structure of data
Main Message:
Example data sources: eTOX 25/10/16; DrugBank Approved v5.0.3
Some Freely Available Toxicity Data Resources
• OECD QSAR Toolbox (https://www.qsartoolbox.org/)
• Requires download and some expertise!
• ChEMBL (https://www.ebi.ac.uk/chembl/)
• Targets: 13,382 - Activities: 16,066,124
• Compounds: 1,961,462 - Documents: 76,086
• PubChem (https://pubchem.ncbi.nlm.nih.gov/)
• Compounds: 102.7 million - Substances: 253 million
• Bioactivities: 268 million
• And there are many others….
Why does this happen? The hypothesis is organised into an Adverse Outcome Pathway (AOP); AOP networks can be quantified with NAMs.
Adverse Outcome Pathway network for neurotoxicity (including cytotoxicity).
Spinu N. et al. (2019). Development and analysis of an adverse outcome pathway network for human neurotoxicity. Archives of Toxicology 93, 2759–2772. (Slide adapted from Dr Andrew Worth, EC JRC.)
• Opportunities:
• To update assessment / validation
• Utilise knowledge of uncertainties
• Develop frameworks for regulatory use
Inspirations
13 types of uncertainty, variability and bias of QSARs, with 49 assessment criteria (e.g. definition of chemical structures: 2; biological data: 7; ADME effects: 2).
Details in: Cronin MTD et al (2019) Reg. Toxicol. Pharmacol. 106: 90-104
Making it usable, making it useful
Key question: what is an acceptable level of uncertainty? As uncertainty goes from low to high, confidence (and acceptability) goes from high to low.
Fit-for-purpose in silico models for risk assessment.
Mining datasets
– Searching for patterns
– Finding analogues
• Simplify – explain
• Make even simpler – explain again
• Make really, really simple…
Predictions with Confidence using Conformal Prediction
14.15 – 14.45
Abstract
Predictions with Confidence using Conformal Prediction.
The presentation will cover the utility of confidence predictors such as Conformal Prediction: an in silico modelling framework for obtaining predictions with known, and mathematically proven, error rates set by the user, and for gracefully handling the highly imbalanced datasets typical in toxicology without the need for balancing measures such as under- and/or oversampling.
Predictions with Confidence
using Conformal Prediction
Ulf Norinder*
Dept of Computer and Systems Sciences
Stockholm University, Sweden
* Most of the work performed at: Swedish Toxicology Sciences Research Centre (Swetox),
Unit of Toxicology Sciences, Karolinska Institutet, Sweden
as part of the EU-ToxRisk project (Horizon 2020, grant agreement No 681002)
Question of (un)certainty
* = compounds
Question of (un)certainty
Desired:
Mathematical proof
Classes
Active
Inactive
Both {active, inactive}
Empty {null}
Binary classification
In conformal prediction:
If a classification contains the correct class it is correct
both = always correct, empty = always erroneous
Validity = % of correct classifications (for each class)
Efficiency = % of single label classifications (right or wrong)
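These two metrics can be computed directly from a list of prediction sets; a minimal sketch with hypothetical prediction sets (the data are illustrative, not from the talk):

```python
# Conformal-prediction validity and efficiency from prediction sets.
# "both" = {active, inactive} is always correct; "empty" = set() is always wrong.

def validity(pred_sets, true_labels):
    """Fraction of prediction sets that contain the true label."""
    return sum(t in s for s, t in zip(pred_sets, true_labels)) / len(pred_sets)

def efficiency(pred_sets):
    """Fraction of single-label prediction sets (right or wrong)."""
    return sum(len(s) == 1 for s in pred_sets) / len(pred_sets)

# Hypothetical predictions for four compounds:
sets = [{"active"}, {"inactive"}, {"active", "inactive"}, set()]
truth = ["active", "active", "active", "inactive"]
# validity: sets 1 and 3 contain the truth -> 2/4 = 0.5
# efficiency: sets 1 and 2 are single-label -> 2/4 = 0.5
```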
Conformal Prediction
Why Conformal Prediction?
• Win situation
• Statistical guarantees (on validity)
• CP is instance-based
• The risk is known up-front for the decision taken
• Applicability domain closely linked to model development
CP strictly defines the level of similarity (conformity) needed
No ambiguity anymore
• Gracefully handles (severely) imbalanced datasets
Ratios of 1:100 – 1:1000
No need for over- or undersampling
• CP is a framework (almost any ML algorithm will work)
Conformal Prediction
How does this work?
Data is split into a proper training set and a calibration set
Imbalanced dataset
(toxic minority class)
New compound to predict (is toxic): 0/8 = 0.0 < 0.2, therefore the compound is not assigned to the non-toxic class
32 trees: toxic
68 trees: non-toxic
Conformal Prediction
Example: Predicting Toxicity
Calibration set, 6 toxic, 7 non-toxic compounds
N trees predicting correct class
Several p-values
(for each class):
New compound to predict (is toxic): use the median p-value
32 trees: toxic
68 trees: non-toxic
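The procedure above (proper training set, calibration set, per-class p-values) can be sketched as a Mondrian inductive conformal predictor; a minimal illustration assuming scikit-learn, on a synthetic imbalanced dataset (not the data from the talk):

```python
# Minimal sketch of a Mondrian (class-conditional) inductive conformal
# predictor on a synthetic imbalanced dataset. Settings are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)
X_cal, X_te, y_cal, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
cal_proba = clf.predict_proba(X_cal)  # nonconformity score = 1 - P(class)

def predict_set(x, eps=0.2):
    """Return every class whose Mondrian p-value exceeds the significance eps."""
    proba = clf.predict_proba(x.reshape(1, -1))[0]
    labels = set()
    for c in clf.classes_:
        # Calibration nonconformity scores for examples of class c only;
        # this per-class conditioning is what makes the predictor Mondrian.
        alphas = 1.0 - cal_proba[y_cal == c, c]
        alpha_new = 1.0 - proba[c]
        p_value = (np.sum(alphas >= alpha_new) + 1) / (len(alphas) + 1)
        if p_value > eps:
            labels.add(c)
    return labels  # empty, single-label, or {0, 1}
```

At eps = 0.2 the prediction sets contain the true class for roughly 80% or more of new compounds of each class, with no over- or undersampling of the minority class.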
PubChem Cytotox Assays
• Results from 16 high throughput cell viability (tox) screens from
PubChem
• RDKit descriptors
If, however, the new predictions are not valid…
15.15 – 15.40
Abstract:
Molecular descriptors for organic reactions.
Reactions are complex processes: the factors that can affect the outcome of a reaction range from collision theory, kinetics, enthalpy,
and entropy to the use of catalysts or additives to vary the reaction mechanism. Different approaches have been reported in the literature to
describe organic reactions as vectors in order to enable data mining or machine learning. However, the big question still remains: do we have
enough molecular descriptors to develop predictive models?
MOLECULAR DESCRIPTORS FOR ORGANIC REACTIONS
Fernando F. Huerta
Bioscience & Materials Division (RISE)
fernando.huerta@ri.se
MOLECULAR DESCRIPTORS FOR ORGANIC REACTIONS
why?
a) Coley, C. W., Green, W. H., & Jensen, K. F. Accounts of Chemical Research, 51(5), 1281–1289 (2018). https://doi.org/10.1021/acs.accounts.8b00087.
b) Grzybowski, B. A., Scientific Reports, 7(1), 1–9 (2017). https://doi.org/10.1038/s41598-017-02303-0.
c) Warr, W. A. Molecular Informatics, 33(6–7), 469–476 (2014). https://doi.org/10.1002/minf.201400052.
MOLECULAR DESCRIPTORS FOR ORGANIC REACTIONS
how?
reaction matrix (D_A^n D_B^n D_C^n) → (a and/or y)
Molecular descriptors for organic reactions: 1. Definition & Examples; 2. Reaction Matrix
What type of molecules are A, B, C → unique reaction matrix (D_A^n D_B^n D_C^n)
constitutional isomers can be described by means of a unique array of molecular descriptors
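As an illustration of the idea (with hypothetical, hand-picked descriptor values, not the descriptor set used in this work), stacking one descriptor vector per participating molecule gives a reaction matrix, and flattening it gives a reaction vector for ML input:

```python
import numpy as np

# Hypothetical per-molecule descriptor vectors (e.g. MW, logP, TPSA) for
# starting materials A, B and product C of one reaction. Values are
# illustrative only.
D_A = np.array([157.0, 2.1, 0.0])
D_B = np.array([122.0, 1.0, 40.5])
D_C = np.array([199.0, 2.9, 20.2])

# Reaction matrix: one row per participating molecule.
reaction_matrix = np.vstack([D_A, D_B, D_C])  # shape (3, n_descriptors)

# Flat reaction vector for ML input; constitutional isomers of A, B or C
# give different descriptor values and hence a different vector.
reaction_vector = reaction_matrix.ravel()
```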
MOLECULAR DESCRIPTORS FOR ORGANIC REACTIONS
however?
Suzuki-Miyaura reaction
Reaction product ≠ f(reaction conditions)
*ACS Omega submitted
Reaction product ≠ f(reaction conditions)
Buchwald-Hartwig coupling
90% (AUC)
Negishi coupling
83% (AUC)
other reactions
Reaction product = f(reaction conditions)
• Selectivity depending on reaction conditions (reagents used): 75%; 74–86%
MOLECULAR DESCRIPTORS FOR ORGANIC REACTIONS
Matrix difference lies only on the product molecular properties
Reaction product = f(reaction concentration)
Reaction product = f(reaction conditions)
• Not reliable
• Lack of data
Are all organic reactions as simple?
A + B →(a) C (yield y)
a = set of reaction conditions
y = yield
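The scheme A + B →(a) C with yield y maps naturally onto a small reaction record; a sketch in which all field names and example values are my own illustration, not from the talk:

```python
from dataclasses import dataclass, field

@dataclass
class Reaction:
    """One reaction record: A + B --(a)--> C with yield y."""
    reactants: tuple                                 # SMILES of A and B
    product: str                                     # SMILES of C
    conditions: dict = field(default_factory=dict)   # a: catalyst, solvent, T, ...
    yield_pct: float = 0.0                           # y, in percent

# Illustrative Suzuki-type coupling record (values are made up):
rxn = Reaction(
    reactants=("c1ccccc1Br", "OB(O)c1ccccc1"),
    product="c1ccc(-c2ccccc2)cc1",
    conditions={"catalyst": "Pd(PPh3)4", "solvent": "dioxane", "T_C": 80},
    yield_pct=92.0,
)
```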
I’M STILL A CHEMIST
Molecular descriptors for organic reactions: 3. Organic Reactions Understanding
Reaction Thermodynamics
• Catalyst and/or additives
• Solvent
• Temperature
• Etc…
Reaction Kinetics
• Concentration
• Polar surface area
• Temperature
• Etc…
MOLECULAR DESCRIPTORS FOR ORGANIC REACTIONS
HYPOTHESIS
• reaction classification
• key reagents
• analysis of bad
• identification of “bad”
• generation of “descriptor/s”
HYPOTHESIS PROBLEMS
• no bad reactions reported
• curation
• lack of data
• number of reaction clusters
• and many more…
ACKNOWLEDGEMENTS
ALEXANDER MINIDIS
SAMUEL HALLINDER
ULF TEDEBARK
ULF NORINDER
fernando.huerta@ri.se
THANK YOU
Pitfalls when applying machine learning
to chemistry
15.50 – 16.10
Abstract:
Pitfalls when applying machine learning to chemistry
Despite its recent popularity, machine learning (ML) is not a free lunch. Obnoxious problems often pop up when applying ML in general and no
less when applied to chemistry. I will discuss a selection of such problems that may require more attention, including computational efficiency,
data quality, debugging, descriptors/representations, and evaluation.
Pitfalls when applying machine
learning to chemistry
•Scaling
•Data quality
•Model problems
1. Scaling
Parameter distributions in chemistry
are often exponential (or worse),
while linear-algebra-based algorithms
prefer normal distributions.
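One common mitigation is to log-transform heavy-tailed features before modelling; a minimal sketch with synthetic data (the feature and thresholds are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic heavy-tailed "chemistry-like" feature, e.g. a rate constant
# spanning many orders of magnitude.
x = rng.lognormal(mean=0.0, sigma=3.0, size=10_000)

def skewness(v):
    """Sample skewness: third standardized moment."""
    v = np.asarray(v, dtype=float)
    return float(np.mean((v - v.mean()) ** 3) / v.std() ** 3)

raw_skew = skewness(x)          # huge: the raw feature is far from normal
log_skew = skewness(np.log(x))  # near zero: log(x) is normally distributed
# All values here are positive, so a plain log works; use log1p or a
# signed transform for features that can be zero or negative.
```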
Conclusions
• Three problems that deserve attention:
Scaling, Data Quality, and Model Problems
16.20 – 16.30
Ulf.Tedebark@ri.se (Biomolecules)
Sverker.Janson@ri.se (Computer Science)
Ian.Cotgreave@ri.se (Chemical and Pharmaceutical Toxicology)
Alexander.Minidis@ri.se (Medicinal/Process Chemistry and Data-Science)
Erik.Ylipaa@ri.se (Deep Learning)
Fernando.Huerta@ri.se (Medicinal/Process Chemistry and Data-Science)
Swapnil.Chavan@ri.se (Computational Toxicology)
Andreas.Thore@ri.se (Materials Science and Computational Chemistry)
Martin.Nilsson@ri.se (Algorithms and Mathematics)
RISE is Sweden’s research institute and innovation partner. Through our international collaboration programs with industry, academia and the
public sector, we ensure the competitiveness of the Swedish business community on an international level and contribute to a sustainable
society. Our 2,800 employees engage in and support all types of innovation processes. RISE is an independent, state-owned research institute,
which offers unique expertise and over 100 testbeds and demonstration environments for future-proof technologies, products and services.
ISBN: 978-91-89167-42-1