
Accelerating chemical design and synthesis using artificial intelligence
Open Workshop, May 29, 2020

RISE — Research Institutes of Sweden


ISBN: 978-91-89167-42-1
Dear Speakers and Participants,
We (RISE) thank all participants for making this Open Workshop a very interactive and interesting day! We are especially grateful to all the speakers, who deserve extra credit for putting together impressive presentations covering a wide range of areas!

Some fun statistics: Throughout the day we had 90-110 participants viewing the presentations and participating actively in the Q&A sessions. The audience
was spread all around the globe (Europe, US, Far East, Australia, South America) and we hope that the day was valuable to you all, and worth either getting
up really early, or staying up really late!!

This workshop was a combined effort by several cross-functional departments at RISE, which jointly defined the contents: the units Process Chemistry, Toxicology/Safety Assessment, and AI/Digital Systems. The day addressed how Machine Learning and AI are rapidly reshaping how chemicals are designed and manufactured with optimal functional characteristics. Achieving optimal functionality is always a balance between the desired physico-chemical properties of a molecule, its hazard/risk potential, and the practicality of scaling up production. In each of these areas, Machine Learning/AI techniques present great opportunities for achieving this balance in a more proactive and effective manner.

To conclude, please get in touch with any of the speakers or with RISE if you have a question or a request that you think any of us can solve or assist you with.

Take care and stay safe!

RISE organizing committee

Ulf.Tedebark@ri.se (Biomolecules)
Sverker.Janson@ri.se (Computer Science)
Ian.Cotgreave@ri.se (Chemical and Pharmaceutical Toxicology)
Alexander.Minidis@ri.se (Medicinal/Process Chemistry and Data-Science)
Erik.Ylipaa@ri.se (Deep Learning)
Fernando.Huerta@ri.se (Medicinal/Process Chemistry and Data-Science)
Swapnil.Chavan@ri.se (Computational Toxicology)
Andreas.Thore@ri.se (Materials Science and Computational Chemistry)
Martin.Nilsson@ri.se (Algorithms and Mathematics)
Agenda
(times are CEST (Stockholm time zone))
Moderator: Ian Cotgreave, RISE

• 09.00 – 09.15 Welcome and introduction. Ian Cotgreave, RISE, Sweden
• 09.15 – 09.45 Machine Learning and Chemistry. Erik Ylipää, RISE, Sweden
• 10.00 – 11.00 Molecular de-novo design and synthesis prediction. Esben Jannik Bjerrum, AstraZeneca, Sweden
• Coffee break
• 11.20 – 11.50 GDB and the chemical space. Jean-Louis Reymond, Univ of Bern, Switzerland
• Lunch break
• 13.10 – 13.20 Machine Learning in the Regulatory landscape. George Kass, EFSA, Italy
• 13.30 – 14.00 Why, Where and When Machine Learning Works in Predictive Toxicology – And When it Doesn’t…. Mark Cronin, LJMU, UK
• 14.15 – 14.45 Predictions with Confidence using Conformal Prediction. Ulf Norinder, Stockholm University, Sweden
• Coffee break
• 15.15 – 15.40 Molecular descriptors for organic reactions. Fernando Huerta, RISE, Sweden
• 15.50 – 16.10 Pitfalls when applying machine learning to chemistry. Martin Nilsson, RISE, Sweden
• 16.20 – 16.30 Concluding remarks. Ian Cotgreave, RISE, Sweden


Welcome and Introduction
09.00 – 09.15
Ian Cotgreave, RISE, Sweden


Machine Learning and Chemistry
09.15 – 09.45
Erik Ylipää, RISE, Sweden


Erik Ylipää
Bio
Erik Ylipää is a deep learning researcher at RISE. His main area of interest is neural network architectures, in particular for arbitrarily structured data such as sequences and sets. Other research areas of interest are sparsely connected neural networks, representation learning, and natural language processing.

Abstract
Machine Learning and Chemistry
The last couple of years have seen a shift in how machine learning is used in chemoinformatics: from methods with an emphasis on feature engineering to deep neural networks which allow molecules to be modelled directly. This talk will be a brief overview of the past, present and future of machine learning in chemistry, with a focus on deep learning. We will also look at how contemporary neural networks for processing mathematical sets are ideally suited for working with small molecules, and how lessons learned from the field of natural language processing could catalyse the field of chemoinformatics.
Machine Learning and Chemistry
Erik Ylipää (erik.ylipaa@ri.se), 2020-05-29

The AI and Deep Learning onion
• Artificial Intelligence — example: knowledge bases
• Machine Learning — example: Random Forest
• Representation Learning — example: shallow word vectors (CBoW)
• Deep Learning — example: Recurrent Neural Networks

• Deep Neural Networks are the focus of this presentation
• They are a deep learning method, similar to how Random Forest is an ensemble learning method

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. https://www.deeplearningbook.org/
Machine Learning and Learning Algorithms
• Learning algorithms improve their performance on a task with experience
• This is in contrast with a rule-based algorithm, whose performance does not change with more data


Supervised machine learning
● In many cases the data we have can be organized as a table
○ Each row is an observation
○ Each column is a different variable
● The task is often to predict the value of one variable (e.g. cardio) based on the others
● We often call the column to predict our target, denoted by y, and the other columns our features, often gathered as a vector x

age    gender  Height  Weight  ap_hi  ap_lo  cholesterol  gluc  smoke  alco  active  cardio
18393  2       168     62      110    80     1            1     0      0     1       0
20228  1       156     85      140    90     3            1     0      0     1       1
18857  1       165     64      130    70     3            1     0      0     0       1
17623  2       169     82      150    100    1            1     0      0     1       1
17474  1       156     56      100    60     1            1     0      0     0       0
21914  1       151     67      120    80     2            2     0      0     0       0
22113  1       157     93      130    80     3            1     0      0     1       0
22584  2       178     95      130    90     3            3     0      0     1       1
17668  1       158     71      110    70     1            1     0      0     1       0

https://www.kaggle.com/sulianova/cardiovascular-disease-dataset/version/1
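The split into features x and target y can be sketched in plain Python; the rows below are a small excerpt of the table above (no library assumed).

```python
# Minimal sketch: splitting tabular observations into feature vectors X
# and a target vector y. Rows are an excerpt of the cardiovascular table.
rows = [
    # age, gender, height, weight, ap_hi, ap_lo, chol, gluc, smoke, alco, active, cardio
    [18393, 2, 168, 62, 110, 80, 1, 1, 0, 0, 1, 0],
    [20228, 1, 156, 85, 140, 90, 3, 1, 0, 0, 1, 1],
    [18857, 1, 165, 64, 130, 70, 3, 1, 0, 0, 0, 1],
]

# The last column ("cardio") is the target y; the rest are the features x.
X = [row[:-1] for row in rows]
y = [row[-1] for row in rows]

print(len(X[0]))  # 11 features per observation
print(y)          # [0, 1, 1]
```

Any supervised learner is then fitted on (X, y) pairs in exactly this shape.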
Supervised machine learning as function approximation
● We typically make the assumption that there exists an underlying functional relationship between x and y:
○ y = f(x) + noise
● The goal of supervised machine learning is to learn an approximation g(x) of f(x) from data, which can be used to predict y for new observations

(Figure: a single observation row — age 18393, gender 2, height 168, weight 62, ap_hi 110, ap_lo 80, cholesterol 1, gluc 1, smoke 0, alco 0, active 1 — is mapped by g(x) to the prediction cardio = 0.)
Parameterized models
● In the case of deep learning, the mathematical expression used to approximate f(x) has some free parameters
● We call this function g(x; w), where w are the parameters which determine what the function computes
○ We can think of the parameters w as an index into a set of possible functions
● Learning is then a search problem: find the member of this set (the setting of w) which best approximates f(x)

(Figure: the set of all functions parameterized by w; each element corresponds to one setting of the parameters, and some settings correspond to good approximations of f(x).)

Parameterized models (continued)
• A parametric model like a neural network can be thought of as a filter
• It has an input and an output
• The parameters correspond to the dials we can turn to decide how the input is transformed to produce the output
Generalization
● We want the function to be general — to learn the underlying phenomenon
● The goal is not to directly minimize the error on the training data, but on new data not seen during training
● A flexible g(x; w) can have many settings which minimize the training error without being good in general
● Machine learning as a practice is about finding function families which perform well in general as they minimize their error on the training data

(Figure: in parameter space, the settings corresponding to good approximations of f(x) on the training dataset only partially overlap the settings that are good in general.)
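"Learning as search over parameter settings" can be made concrete with a toy model family g(x; w) = w·x; the data and grid below are illustrative, not from the talk.

```python
# Toy illustration of learning as search: the model family is
# g(x; w) = w * x, and we search for the setting of w that minimizes
# squared error on training data generated from f(x) = 2x plus noise.
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # (x, y) pairs, y ~ 2x

def loss(w):
    return sum((y - w * x) ** 2 for x, y in train)

# Search over a grid of candidate parameter settings (the "index" into
# the set of possible functions).
candidates = [i / 10 for i in range(0, 50)]
best_w = min(candidates, key=loss)
print(best_w)  # 2.0 — the setting closest to the underlying slope
```

A real deep network searches the same way in principle, but over millions of parameters with gradient descent instead of a grid.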
Representing molecules for machine learning
How to represent molecules
• Many different representations exist for working with computational methods on molecules:
– Molecular graphs
– String representations, e.g. SMILES (C1=CC(=C(C=C1CCN)O)O) or InChI (InChI=1S/C6H11N2O4PS3/c1-10-5-7-8(6(9)16-5)4-15-13(14,11-2)12-3/h4H2,1-3H3)
– Geometrical representations


How to represent molecules (continued)
• These representations vary in size
• Traditional machine learning methods typically require fixed-dimensional features


Historical representation of molecules for machine learning
• Historically, the machine learning methods used in cheminformatics have relied on:
– Fixed-dimensional molecular descriptors and structural fingerprints
– Similarities induced by graph kernels (for use in e.g. Support Vector Machines)


Molecular descriptors and fingerprints
• Molecular descriptors are properties either determined by experiment or calculated from the molecule:
– Molecular weight
– Number of atoms of different types, or number of H-bond donors
– Quantitative features of substructures
– Solubility coefficients
• Fingerprints are binary indicator variables for the presence of substructures, similar to bag-of-words or bag-of-n-grams in NLP
– Each substructure is determined by a manually crafted rule, based on insight into which substructures are important for predicting activity
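The bag-of-substructures idea can be sketched in a few lines. Real fingerprints (MACCS, ECFP, …) match graph substructures with chemistry-aware rules; as a simplified stand-in, this toy version matches substrings of the SMILES text, and the pattern list is hypothetical.

```python
# Toy "fingerprint": binary indicators for the presence of substructure
# patterns, analogous to bag-of-words in NLP. Matching substrings of the
# SMILES is a simplification of matching graph substructures.
PATTERNS = ["O", "N", "c1ccccc1", "C(=O)"]  # hypothetical pattern list

def fingerprint(smiles):
    return [1 if p in smiles else 0 for p in PATTERNS]

print(fingerprint("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin: [1, 0, 1, 1]
```

Each molecule, whatever its size, is mapped to the same fixed-length binary vector — which is exactly what traditional fixed-dimensional learners need.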


Feature learning instead of engineering
• Learn the important features from the molecular structure
• How do we perform learning on unbounded inputs?

Jha, D., Ward, L., Paul, A., et al. "ElemNet: Deep Learning the Chemistry of Materials From Only Elemental Composition." Sci Rep 8, 17593 (2018). doi:10.1038/s41598-018-35934-y
Neural Networks
Neural networks learn representations

(Figure: a learnt encoder maps a source speech waveform to an embedding; a learnt decoder maps the embedding to the target text string "It is a truth universally acknowledged".)

• Learn an encoder which embeds the data point in a high-dimensional vector space
• Distances between embeddings of different inputs should make solving tasks easier (e.g. audio waveform embeddings close together should correspond to similar transcriptions)

Transfer learning and multitask learning

(Figure: one learnt encoder embeds the source speech waveform; separate learnt decoders produce "It is a truth universally acknowledged" for speech recognition, "Mrs. Bennet" for speaker identification, and "ingratiating" for tone analysis.)

• The encoder–decoder setup allows for multitask learning and transfer learning
• The same encoder can embed the input data in a space which is useful for many different decoders


Anatomy of a deep neural network

(Figure: adaptive functions tailored to our data, e.g. convolutional layers for images → a general adaptive function (dense layer) → a linear prediction, e.g. logistic regression.)


Image representations in neural nets

(Figure: a neural image embedding followed by a simple linear classifier, i.e. logistic regression.)


Molecular embeddings

Mahé, Pierre, et al. "Graph kernels for molecular structure–activity relationship analysis with support vector machines." Journal of Chemical Information and Modeling 45.4 (2005): 939–951.


Neural Networks for Molecules
Neural Networks for molecules
• There have been four main directions for dealing with molecules and neural networks:
– Traditional machine learning – a traditional neural network with dense layers fed with molecular descriptors and/or structural fingerprints. Tree ensembles often perform very well with these representations.
– SMILES-based – a sequence neural network reads the molecule description as a string
– Graph-based – neural networks for mathematical sets process the molecule as a graph
– Geometrical – geometrical neural networks process the molecule as a point cloud represented in 3D space
SMILES-based
• A sequence neural network (recurrent or self-attention) reads in a SMILES representation of the molecule
• Properties are predicted from the latent-space embedding of the molecule
• Often, pretraining is done by reconstructing the input string using a decoder network with the latent embedding as input
• Has worked really well for design-space exploration (drug design)

Gómez-Bombarelli, Rafael, et al. "Automatic chemical design using a data-driven continuous representation of molecules." arXiv preprint arXiv:1610.02415 (2016).
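Before a sequence network can read a SMILES string, it must be split into tokens; a common convention is a regex that keeps bracket atoms and two-letter elements together. This is a pared-down sketch of such a tokenizer, not the exact regex used by any particular model.

```python
import re

# Simplified SMILES tokenizer: bracket atoms ([C@@H], [nH], ...) and the
# two-letter elements Br/Cl are kept as single tokens; everything else is
# split character by character. (Real tokenizers also handle %nn ring
# closures and more elements.)
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|[A-Za-z]|[0-9]|[=#()+\-]")

def tokenize(smiles):
    return TOKEN_RE.findall(smiles)

print(tokenize("CC(=O)N[C@@H]1CCl"))
# ['C', 'C', '(', '=', 'O', ')', 'N', '[C@@H]', '1', 'C', 'Cl']
```

The resulting token sequence is what the recurrent or self-attention network actually consumes, one embedding per token.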
Problems with SMILES
• The same molecule can be represented by many different SMILES strings
• Since SMILES linearizes the graph, locality can be harder to learn (two adjacent atoms can be far apart in the SMILES string)
• While it captures the same information as the mathematical graph, a lot of it is implicit
• The neural network needs to learn an implicit SMILES parser to solve the main predictive task
• There are neural networks designed for graphs, which can be made to work directly with molecules as graphs

Chen, Benson, et al. "Learning to Make Generalizable and Diverse Predictions for Retrosynthesis." arXiv preprint arXiv:1910.09688 (2019).


Mathematical graphs for neural networks
• In a neural network, each node in a graph is represented as a mathematical vector
• The graph is represented as a set of vectors; additionally, the edges might also have attributes represented by vectors


Graph Neural Networks for molecules
• For each node in a graph, we aggregate the vectors of its neighbourhood and transform them with a transition function
– Each such application is a layer of the neural network
• This is also referred to as message passing neural networks
– Information between two nodes which are not direct neighbours must be passed through the nodes on a path between them
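One round of the aggregate-then-transform step can be sketched on a toy graph; the "transition function" here (average the neighbour sum with the node's own vector) is a stand-in for what would be a learned layer in a real GNN.

```python
# One round of message passing on a 3-node path graph 0-1-2, following
# the slide: for each node, aggregate (sum) the vectors of its
# neighbours, then transform. The transform is a hand-picked stand-in
# for a learned neural network layer.
neighbours = {0: [1], 1: [0, 2], 2: [1]}
h = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}  # node feature vectors

def step(h):
    new_h = {}
    for node, nbrs in neighbours.items():
        agg = [sum(h[n][i] for n in nbrs) for i in range(2)]   # aggregate
        new_h[node] = [(a + s) / 2 for a, s in zip(agg, h[node])]  # transform
    return new_h

h1 = step(h)
print(h1[1])  # [1.0, 1.0] — node 1 now mixes information from 0 and 2
```

Stacking several such steps (layers) lets information travel further along the graph, one hop per layer.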
Graph neural network

(Figure: the neural network is applied to each node and its context.)

Graph Neural Network – parallel view

(Figure: the same computation viewed as being applied to all nodes in parallel.)


Graph Neural Networks for molecules
• For molecules, only using the direct neighbourhood according to bond structure is likely overly restrictive
– The model is forced to focus on local interactions, and has difficulty learning properties of larger structures
• The number of layers is a ceiling on the path length we can learn relationships over
– The longer the path, the harder the relationship is to learn
– Transporting information through a nested function gives vanishing-gradient issues
• It has been noted that learning global molecular features such as molecular weight is difficult for Graph Neural Networks
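The layer-count ceiling is easy to see by tracking how far a node can "see" after k rounds of message passing; the 6-atom chain below is an illustrative toy graph (think of a hexane backbone).

```python
# After k message-passing layers, a node has only received information
# from nodes at most k edges away. Toy sketch on a 6-node chain.
neighbours = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 5] for i in range(6)}

def receptive_field(node, layers):
    seen = {node}
    for _ in range(layers):
        seen |= {n for v in seen for n in neighbours[v]}
    return seen

print(receptive_field(0, 2))  # {0, 1, 2}: two layers reach two bonds out
print(receptive_field(0, 5))  # all six atoms
```

A global property such as molecular weight depends on every atom, so a network with too few layers cannot, even in principle, collect the required information at any single node.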
A case for SMILES
• While SMILES “hides” the graph behind a linearization, it completely encodes the graph structure together with bond information
– Graph Neural Networks often have to add specialized architecture to encode the graph structure, especially for edge attributes like different types of bonds
• The non-uniqueness of the SMILES representation can even be a positive: it allows for straightforward data augmentation (e.g. Bjerrum, Esben Jannik. "SMILES enumeration as data augmentation for neural network modeling of molecules." arXiv preprint arXiv:1703.07076 (2017).)


A case for SMILES (continued)
• One important application of machine learning for chemistry is generative models, e.g. for de novo drug design or chemical reaction prediction
• Generating graphs with GNNs is difficult: we need to generate the edge set, and can have trouble with set matching. We typically need to induce some ordering on the edge set, which can get complicated. See Jin, Wengong, Regina Barzilay, and Tommi Jaakkola. "Hierarchical Generation of Molecular Graphs using Structural Motifs." arXiv preprint arXiv:2002.03230 (2020).
• A SMILES representation introduces an ordering of the generated elements, which makes learning much easier. It doesn’t require any specialized neural architecture.


Transformers are like GNNs where all nodes are the context
• For a transformer, the graph information needs to be added in the reduction and feature functions
• Typically as additional pair-wise information


Transformers for molecules

Chen, Benson, Regina Barzilay, and Tommi Jaakkola. "Path-augmented graph transformer network." arXiv preprint arXiv:1905.12712 (2019).


Molecular Transformer

Schwaller, Philippe, et al. "Molecular transformer: A model for uncertainty-calibrated chemical reaction prediction." ACS Central Science 5.9 (2019): 1572–1583.


Pretraining and Self-supervised learning
Performance of deep networks given compute, data and parameters

Nogueira, Rodrigo, Zhiying Jiang, and Jimmy Lin. "Document ranking with a pretrained sequence-to-sequence model." arXiv preprint arXiv:2003.06713 (2020).
Transfer Learning dominates NLP
• Natural Language Processing is shifting from task-specific solutions to reusing general language models
• A single huge model (hundreds of millions of parameters) is trained on a source task with huge amounts of data (tens to hundreds of gigabytes of text)
• The same model is re-used as a frontend to obtain state-of-the-art results on a very diverse set of target tasks


Transfer Learning and Transformers in NLP

(Chart: NLP progress on the GLUE tasks — MNLI-mm, QQP, QNLI, SST-2, CoLA, STS-B, MRPC, RTE — comparing pre-OpenAI SOTA, BiLSTM+ELMo+Attn, OpenAI GPT, BERT_large and RoBERTa.)

Howard, Jeremy, and Sebastian Ruder. "Universal language model fine-tuning for text classification." arXiv preprint arXiv:1801.06146 (2018).


Pretraining huge Transformer models tailored for molecules will do for cheminformatics what BERT has done for NLP
Transfer Learning

L. Torrey and J. Shavlik, "Transfer learning," Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, vol. 1, pp. 242–264, 2009.

Ruder, Sebastian. Neural Transfer Learning for Natural Language Processing. Diss. NUI Galway, 2019.
Transfer learning and setting space

(Figure: in the space of parameter settings, the settings corresponding to good approximations of the target task on its training set overlap both with those for the source task on its training set and with those good for the target task in general.)

Transfer learning in neural networks
1. Train a deep network on a task where vast amounts of data are available
2. Remove the parts specific to the first task
3. Train a new task-specific part for the new task

(Figure: an image encoder feeding an output head for the source task; the head is replaced by a new output head for the target task.)
Limits of supervised transfer learning
• A model trained on a supervised task only learns a representation suitable for solving that task
– Learning to separate leopards from lions could be solved by only learning to recognize leopard spots
– Such a representation would not be useful for more general image understanding
– Pretraining with this task could lead to worse performance than no pretraining at all: negative transfer


Reducing negative transfer
• Each point is performance on a fine-grained protein function prediction task
• Adding additional self-supervised tasks improves transfer

Hu, Weihua, et al. "Strategies for Pre-training Graph Neural Networks." ICLR (2020).
Self-supervised learning

Yann LeCun, https://www.youtube.com/watch?v=7I0Qt7GALVk


Opportunities for molecular neural networks
• Neural networks for molecules are at the start of a Cambrian explosion
– Lots of new ideas are flooding in from other fields, in particular Natural Language Processing
• With regard to transfer learning, one interesting question stands out: how should we design strong self-supervised tasks that
– force the model to retain as much information as possible about the molecule, preferably on many scales (local to global)
– force the model to learn something about chemistry?
• What are good representations of molecules which allow for strong self-supervised tasks? How useful is geometrical information?
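One candidate self-supervised task, borrowed from BERT-style masked language modelling in NLP, is to hide a fraction of the tokens of a SMILES string and train the model to reconstruct them. This sketch only builds the (input, targets) training pairs; the model itself, and the character-level tokenization, are simplifications.

```python
import random

# Build masked-token training pairs from a SMILES string: each masked
# position becomes a reconstruction target for the model.
def mask_tokens(tokens, fraction=0.15, rng=None):
    rng = rng or random.Random(0)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < fraction:
            masked.append("[MASK]")
            targets[i] = tok       # the model must recover this token
        else:
            masked.append(tok)
    return masked, targets

tokens = list("CC(=O)Oc1ccccc1")   # character-level tokens for simplicity
masked, targets = mask_tokens(tokens)
print(masked, targets)
```

Whether such string-level masking forces the model to learn chemistry, rather than SMILES syntax, is exactly the open question posed above.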


Where are we on the scaling line?
• Is there still room for improvement on the benchmark datasets we have?
• What is the irreducible error in the datasets we evaluate our models on?
• If we are close to it, any progress we see is just tighter overfitting to the test set

Hestness, Joel, et al. "Deep learning scaling is predictable, empirically." arXiv preprint arXiv:1712.00409 (2017).


Erik Ylipää
erik.ylipaa@ri.se


RISE – Research Institutes of Sweden AB · info@ri.se · ri.se


Molecular de-novo design and synthesis prediction
10.00 – 11.00
Esben Jannik Bjerrum, AstraZeneca, Sweden


Esben Jannik Bjerrum
Bio
Esben Jannik Bjerrum completed his PhD in Computational Chemistry at Copenhagen University in 2008. He has since worked in academia as a postdoc, in industry as an IT specialist, and as a self-employed IT consultant. In 2017, his independent research resulted in several contributions to the deep-learning-for-chemistry renaissance. He joined AstraZeneca in 2018, where he currently works on the development of de novo design algorithms and deep-learning-assisted retrosynthetic planning. He is the lead blogger of cheminformania.com.

Abstract
Molecular de-novo design and synthesis prediction
Drug discovery is the earliest part of developing new medicines for unmet medical needs of patients. Transforming early hits into clinical candidates is a years-long project with multiple rounds of designing new compounds with proposed better properties, synthesizing the compounds in the laboratory, testing their properties, and analyzing the results for new rounds of this design-make-test-analyze cycle. Our aim is to utilize artificial intelligence to speed up these processes and allow us to create new medicines faster. In the Molecular AI group, we develop tools to answer two questions: what molecules to make, and how to make them.
Part 1 will cover examples from the design phase, where we use a sequential molecular notation format, SMILES, in combination with deep neural networks inspired by natural language processing. Using recurrent neural network architectures, we can read in molecules in the SMILES format, but also, and more importantly, read out molecules in SMILES format. The latter property is what enables us to handle massive molecular spaces by probabilistic sampling. The generation can be steered in directions of interest by different techniques, such as transfer learning, autoencoders, reinforcement learning and conditional recurrent networks. Alternative architectures allow us to generate molecular series or generate subtle changes in molecules to induce adjustments of existing compounds.
Part 2 will address the second question: how to make the suggested molecules. Using large databases of known reactions pooled from public, in-licensed and internal sources, we extract reaction templates which cover the core of the reaction. To search for potential synthetic routes, we use Monte Carlo tree search coupled to neural network policies. The neural networks are trained to prioritize our template library and predict the most likely templates to be applied. Thus, the search breadth of the tree search is reduced by orders of magnitude, and the search can be performed in seconds to minutes. However, the policy networks prioritize often-used reactions, due to the imbalance of the reaction databases, and overlook more seldom-used and specialized reactions. By training policy networks on selected reaction classes, such as ring-forming reactions, special attention is given to these reactions, which can be injected into the tree search or used as alternative options in interactive path-planning. Interestingly, SMILES-based translation can also play a role in reaction informatics.
Molecular de-novo design and synthesis prediction
Esben Jannik Bjerrum, Principal Scientist
Machine Learning in Chemical Designs and Functions, RISE workshop, 29 May 2020
Agenda
• Introduction to AstraZeneca and Drug Design and Development

Part 1: De-novo design
• SMILES – a chemical language
• Recurrent neural networks
• Data augmentation strategies
• Unbiased molecular generation
• Controlling the molecular generation
• Next generation algorithms

Part 2: Reaction prediction
• Retrosynthesis planning with MCTS tree search
• Scoped prediction models
• Artificial labels
• Condition prediction
• Transformer models
AstraZeneca: Global dimensions

(Infographic, 2018 figures: total revenue $22.1bn, down 2% over 2017; product sales $21.1bn, up 4% over 2017; externalisation revenue $1bn; $5.9bn invested in R&D, with research across five countries; 64.6k employees; 149 projects in clinical development and eight NMEs in late-stage development; 45% of senior roles filled by women; 23 NME approvals in 2018, and 71 since 2014; 30 manufacturing sites in 17 countries.)
Gothenburg: one of AstraZeneca’s three global R&D centres

(Map: strategic R&D centres in Gothenburg, Cambridge and Gaithersburg; additional laboratories in Boston, California, Osaka and Shanghai.)
Drug Discovery is just the beginning…

Life-cycle of a medicine
We are one of only a handful of companies to span the entire life-cycle of a medicine, from research and development to manufacturing and supply, and the global commercialisation of primary care and speciality care medicines.


Design-Make-Test-Analyse (DMTA) cycles in Drug Discovery

(Figure: the cycle Design → Make → Test → Analyse, starting from a drug target.)

• Chemical starting point (“Hit”), found through HTS, DEL, fragment screening or knowledge: weakly active, target-unselective, toxicity risk, low metabolic stability
• Candidate drug, reached after ~3 years: highly potent, effective in in vivo models, metabolically stable, no toxicity issues
• Multiple DMTA cycles, 4–6 weeks per cycle, with hand-overs between multiple labs
• The challenge: find ways to speed up and improve the process using AI
Drug Design
The Molecular AI group provides tools for the projects:
• What to make next? De novo design, with a multi-parameter scoring function
• How to make it? Retrosynthesis
Simplified Molecular Input Line Entry Specification (SMILES)
• A sequence format for molecules
• Allows us to use the progress made in natural language processing in recent years ☺
Training and Sampling using Recurrent Neural Networks
• Do the same computation for different time steps
• Let previous computations influence the current computation

(Figures: the training setup and the sampling setup of the recurrent network.)
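The sampling loop can be shown in miniature: starting from a begin token, repeatedly draw the next character from the model's predicted distribution until an end token appears. Here the "model" is a hypothetical fixed table of next-character probabilities rather than a trained RNN, and the two-character alphabet is purely illustrative.

```python
import random

# Hypothetical next-character distributions standing in for the RNN's
# softmax output. "^" is the begin token, "$" the end token.
probs = {
    "^": [("C", 0.8), ("O", 0.2)],
    "C": [("C", 0.5), ("O", 0.2), ("$", 0.3)],
    "O": [("C", 0.7), ("$", 0.3)],
}

def sample(rng, max_len=20):
    out, tok = [], "^"
    while len(out) < max_len:
        chars, weights = zip(*probs[tok])
        tok = rng.choices(chars, weights=weights)[0]  # probabilistic draw
        if tok == "$":           # end-of-sequence token
            break
        out.append(tok)
    return "".join(out)

rng = random.Random(42)
print([sample(rng) for _ in range(3)])  # e.g. short C/O chains
```

Because each draw is probabilistic, repeated sampling yields different strings — which is exactly what lets the trained network explore a huge molecular space.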

De novo compound generation using recurrent neural networks
• The generative process allows us to generate novel chemical structures in silico
• Molecules generated follow the properties of the training set
• Useful as expansion of libraries for in silico HTS?
• Needs fairly large datasets, or refocusing

(Figure: property distributions; blue — ZINC fragment-like, green — ZINC drug-like; solid — training set, dashed — generated.)

Bjerrum, E. J.; Threlfall, R. Molecular Generation with Recurrent Neural Networks (RNNs). 2017. http://arxiv.org/abs/1705.04612
Why? Generation of novel compounds in the 10^60 chemical space!
• Spaces of 10^10–10^12 molecules can be handled explicitly in databases; the 10^60 chemical space must be handled probabilistically

Where’s the impact?
• Use for de novo molecular design
• Scaffold hopping
• Novelty
• Virtual screening
• Library design
Transfer learning
1. Train on a large dataset of molecules
2. Retrain on a smaller dataset of molecules with desired properties

Segler, M. H. S.; Kogej, T.; Tyrchan, C.; Waller, M. P. Generating Focused Molecule Libraries for Drug Discovery with Recurrent Neural Networks. ACS Cent. Sci. 2018. https://doi.org/10.1021/acscentsci.7b00512
Assessing the quality of molecular generative models

1) Arús-Pous J, Blaschke T, Ulander S, et al (2019) Exploring the GDB-13 chemical space using deep generative models. J Cheminform 11:20. https://doi.org/10.1186/s13321-019-0341-z
2) Arús-Pous J, Johansson SV, Prykhodko O, et al (2019) Randomized SMILES strings improve the quality of molecular generative models. J Cheminform 11:71. https://doi.org/10.1186/s13321-019-0393-0
Data augmentation
• In image models: zooming, cropping, mirroring, flipping, rotation, hue, color, contrast, etc., and combinations thereof
• Canonical SMILES ensures a 1:1 relationship between molecule and SMILES
• SMILES enumeration generates multiple SMILES for the same molecule

Bjerrum, Esben Jannik. 2017. "SMILES Enumeration as Data Augmentation for Neural Network Modeling of Molecules." http://arxiv.org/abs/1703.07076
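Why enumeration works: a molecular graph can be written out starting from any atom, and each starting point yields a different string for the same molecule. Real enumeration randomizes the atom order and lets a cheminformatics toolkit such as RDKit write a non-canonical SMILES; this toy DFS serializer over a plain graph (a C-C-C-O chain) only shows the principle and does not produce valid SMILES.

```python
# Toy graph serializer: the same 4-atom chain written from different
# starting atoms gives different strings, all denoting one molecule.
atoms = ["C", "C", "C", "O"]
bonds = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

def serialize(start):
    out, stack, seen = [], [start], set()
    while stack:                     # depth-first traversal
        a = stack.pop()
        if a in seen:
            continue
        seen.add(a)
        out.append(atoms[a])
        stack.extend(bonds[a])
    return "".join(out)

strings = {serialize(a) for a in range(4)}
print(strings)  # several distinct strings, one molecule
```

Training on many such variants of each molecule is the augmentation: the model sees the same chemistry under many surface forms, much like image models see rotated and cropped copies of each photo.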
Random SMILES in practice

21
SMILES enumeration increases chemical space coverage
(More uniform, more complete; GDB-13 is 975 million molecules.)

Set   SMILES       Validity  Completeness
1M    Canonical    0.994     0.836
1M    Randomized   0.999     0.953
10K   Canonical    0.905     0.445
10K   Randomized   0.974     0.715
1K    Canonical    0.504     0.167
1K    Randomized   0.812     0.392

Arús-Pous, J.; Johansson, S. V.; Prykhodko, O.; Bjerrum, E. J.; Tyrchan, C.; Reymond, J.-L.; Chen, H.; Engkvist, O. Randomized SMILES Strings Improve the Quality of Molecular Generative Models. J. Cheminform. 2019, 11 (1), 71. https://doi.org/10.1186/s13321-019-0393-0
SMILES-based autoencoders

Example: Gómez-Bombarelli, Rafael, et al. 2018. "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules." ACS Central Science 4 (2): 268–76. (Preprint from 2016.)
A SMILES string is not a molecule
The string c1ccccc1 and the molecule it denotes (benzene) are different objects; the string is just one serialization of the molecular graph.
HeteroEncoders
• Also possible with InChIs and CAS names: Winter et al. 2018
• From chemical images to SMILES: Bjerrum & Sattarov 2018
Latent vectors as a base for quantitative structure–activity (QSAR) models

Figure adapted from: Bjerrum, Esben Jannik, and Boris Sattarov. 2018. "Improving Chemical Autoencoder Latent Space and Molecular De Novo Generation Diversity with Heteroencoders." Biomolecules.
Latent Space gets more chemicaly relevant

RMSEP of 5 datasets modelled using deep neural networks

             IGC50   LD50   BCF    Solubility   MP    Norm. mean
                                                      improvement
Enum2Enum    0.43    0.54   0.71   0.65         37    0.75
Can2Enum     0.46    0.54   0.69   0.69         37    0.77
Enum2Can     0.46    0.57   0.71   0.66         38    0.78
Can2Can      0.53    0.62   0.79   0.87         43    0.89
ECFP4        0.62    0.59   0.94   1.21         43    1.00

ECFP4 performance low when compared to literature, Enum2Enum close

27
Bjerrum, Esben Jannik, and Boris Sattarov. 2018. "Improving Chemical Autoencoder Latent Space
and Molecular De Novo Generation Diversity with Heteroencoders." Biomolecules.
REINVENT: An In Silico mini-DMTA cycle

Generative RNN

Design

Reinforcement The Value:


Learning Optimizes Make Molecules for DMTA cycle
the RNN Analyse
Produces novel scaffolds and
Test improved compound
suggestions for drug discovery
projects
Predicted Compound
properties (e.g. QSAR) Less real world DMTA cycles
=> Saved time
Open Source:
https://github.com/MolecularAI/Reinvent
34
Optimization of molecular properties

Decoder
based:

We
currently
use
REINVENT

Blaschke, Thomas; Arús-Pous, Josep; Chen, Hongming; Margreitter, Christian; Tyrchan, Christian;
Engkvist, Ola; et al. (2020): REINVENT 2.0 – an AI Tool for De Novo Drug Design. ChemRxiv. Preprint.
35
https://doi.org/10.26434/chemrxiv.12058026.v2
Further options in molecular generation and design
Optimizations on REINVENT Better molecular optimizations via
Graph based molecular representations

Novel neural network architectures


AI Library Generation
Directly Steered Optimization Chemical intuition via MMP Transcoder

*
36
Return of the encoder architectures: Conditional RNN’s

Encoder Decoder

LogP
TPSA
RDKit MolWeight Decoder
HBA
HBD
Control of Properties

40
Kotsias, P.-C.; Arús-Pous, J.; Chen, H.; Engkvist, O.; Tyrchan, C.; Bjerrum, E. J. Direct Steering of de Novo
Molecular Generation with Descriptor Conditional Recurrent Neural Networks. Nat. Mach. Intell. 2020, 2 (May).
Molecular optimization
• Goal
– Given a starting molecule, generate molecules with desired property
changes while maintaining similarity to the original molecule
• Capturing chemist intuition in respect to chemical transformations that change
the property of molecules
– Matched molecular pairs
• Methods
– Seq2Seq
– Transformer
– Graph-based

47
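Matched-molecular-pair style transformations can be sketched at the string level. Real MMP tools cut single bonds on the molecular graph and mine the pairs from data; the fragment-swap rules below are purely hypothetical illustrations:

```python
# lhs >> rhs fragment swaps (hypothetical rules, not mined from real MMP data)
RULES = [("Cl", "F"), ("C(=O)O", "C(=O)N")]

def apply_rules(smiles):
    """Return one transformed SMILES per applicable rule (first match only)."""
    out = []
    for lhs, rhs in RULES:
        if lhs in smiles:
            out.append(smiles.replace(lhs, rhs, 1))
    return out

print(apply_rules("CC(=O)Oc1ccccc1Cl"))
# ['CC(=O)Oc1ccccc1F', 'CC(=O)Nc1ccccc1Cl']
```

The Seq2Seq/Transformer approaches listed above learn such transformations implicitly from pairs of (starting molecule, optimized molecule) instead of applying explicit rules.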
Scaffold-based decoration using SMILES

1) Arús-Pous J, Patronov A, Bjerrum EJ, et al (2020) SMILES-based deep generative scaffold decorator for de novo
drug design. ChemRxiv
48
Conclusions part 1

• SMILES + NLP => Fast Results


• Data augmentation tricks improve performance
• We can read-in molecules:
– e.g. Property prediction
• We can read-out molecules:
– Probabilistic handling of huge databases
– Steered molecular generation with several
options:
• Reinforcement learning
• Conditional RNNs
• Encoder-decoder architectures
– More finetuned tasks also doable:
• Ligand series generation
• MMP generation

50
Part 2: How to make the compounds?

Synthesis Prediction

52
From Design to Compound: Make step

Design
NMP, MeCN, 93%

Make
Analyze

Test C Y
R1 + R2 P

53
Different Objectives for Synthesis Prediction

Condition Prediction:            R1 + R2 --?--> P   (conditions and yield unknown)
Forward / Reaction Feasibility:  R1 + R2 --C--> ?   (product and yield unknown)
Backward:
  Retro-synthesis, 1-step:       P => ? + ?
  Retrosynthetic planning:       P => ? => ? => ... (full tree of disconnections)
54
Chemistry Reaction Data

MedChem PharmSci
ELN ELN
Flatfiles
Reaxys
Flat
Flat files
files Flatfiles
Pistachio files
Flat
Flat files
Flat Files USPTO
Flat Files

ChemConnect

ReactionConnect
Design

Analyse DMTA Make

Test

Predictive Reaction
55 iLab, MedChem, PharmDev Models
Template Extraction

Dataset                  Size         Templates Extracted
Pistachio (incl. PGs)    6,839,427    308,951
USPTO 1976-2016          3,748,191    252,877
Pistachio + USPTO        10,587,618   315,286
All Data                 >1,000,000

Explicit handling of protection groups increases template quality

Retro Reaction Templates Extracted


Searching the possible reaction tree
Monte Carlo Tree Search

Branching Factors: a Neural Network selects and prioritizes
=> More Manageable Problem

                 Chess   Go     Retrosynthesis
Search Breadth   ~35     ~250   > 1,000,000
Search Depth     ~80     ~150   ~12

Alpha Go architecture
Segler, M. H. S.; Preuss, M.; Waller, M. P. Planning Chemical
58 Syntheses with Deep Neural Networks and Symbolic AI. Nature 2018,
555 (7698), 604–610. https://doi.org/10.1038/nature25978 .
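The tree search can be illustrated with a toy example in which strings stand in for molecules and a lookup table stands in for retro-templates (the real tools use a neural-network policy with Monte Carlo Tree Search over hundreds of thousands of templates; the rules, stock set, and "molecules" below are all invented):

```python
# Toy retrosynthetic search: strings stand in for molecules, and "templates"
# split a target into simpler precursors.
TEMPLATES = {          # hypothetical disconnection rules, not real chemistry
    "ABCD": [("AB", "CD")],
    "AB":   [("A", "B")],
    "CD":   [("C", "D")],
}
STOCK = {"A", "B", "CD"}   # purchasable building blocks

def solve(target, depth=5):
    """Depth-limited search: return a route (list of disconnections) or None."""
    if target in STOCK:
        return []                       # nothing to do, already purchasable
    if depth == 0:
        return None                     # search horizon reached
    for precursors in TEMPLATES.get(target, []):
        sub_routes = [solve(p, depth - 1) for p in precursors]
        if all(r is not None for r in sub_routes):
            route = [(target, precursors)]
            for r in sub_routes:
                route.extend(r)
            return route
    return None                         # no applicable template led to stock

print(solve("ABCD"))  # [('ABCD', ('AB', 'CD')), ('AB', ('A', 'B'))]
```

MCTS replaces this exhaustive descent with guided sampling: the policy network proposes which few of the >1,000,000 possible disconnections are worth expanding.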
Results in Seconds to Minutes
Model: USPTO
Time taken: 3.26 s

59
Do more reactions equal better performance?

61
Overall
Performance as
measured for each
dataset

Thakkar, A.; Kogej, T.; Reymond, J.-L. L.; Engkvist, O.; Bjerrum,
E. J. Datasets and Their Influence on the Development of
Computer Assisted Synthesis Planning Tools in the
Pharmaceutical Domain. Chem. Sci. 2019, 11 (1).
https://doi.org/10.1039/C9SC04944D .
62
Making the tool available

The Value: Chemists can quickly get suggested routes/ideas to purchasable compounds.
Cheminformaticians can filter datasets into "synthesizable / not-synthesizable"

Web-GUI based on MIT MLDPS consortium tools

Scripting access via Python Objects

Jupyter based GUI

Open-sourced next week (+ Chemrxiv)


63
Improving the route-finding: Ring Breaker

• Extract templates corresponding to ring formations

• Create model specific to ring formations


• Model learns mapping to ring formations
without “distractions” from other reactions

Thakkar, A.; Selmi, N.; Reymond, J.-L.; Engkvist, O.; Bjerrum, E. J. "Ring Breaker": Neural
64 Network Driven Synthesis Prediction of the Ring System Chemical Space. J. Med. Chem. 2020.
https://doi.org/10.1021/acs.jmedchem.9b01919
Improving the route finding 2: Artificial Labels for filtering

T1
P R1 + R2

T2
P

Bjerrum, Esben Jannik; Thakkar, Amol; Engkvist, Ola (2020):


Artificial Applicability Labels for Improving Policies in
Retrosynthesis Prediction. ChemRxiv. Preprint.
65 960 000 000 000 REACTION MATCHES https://doi.org/10.26434/chemrxiv.12249458.v1
Template-free NLP approach to reaction prediction

• Using SMILES, a chemical reaction can be formulated as a translation
• Direct learning from data
• No templates with arbitrary cutoffs need to be extracted
• Good performance

(1) Schwaller, P.; Laino, T.; Gaudin, T.; Bolgar, P.; Hunter, C. A.; Bekas, C.; Lee, A. A. Molecular Transformer: A
69
Model for Uncertainty-Calibrated Chemical Reaction Prediction. ACS Cent. Sci. 2019, 5 (9), 1572–1583.
https://doi.org/10.1021/acscentsci.9b00576 .
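Before a transformer can treat reaction prediction as translation, reaction SMILES are split into tokens. A regex in the spirit of the Molecular Transformer preprocessing (the exact pattern below is a simplified assumption, not the published one):

```python
import re

# Splits SMILES into atom/bond/ring tokens; bracket atoms like [nH] stay whole.
PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])")

def tokenize(smiles):
    """Tokenize a (reaction) SMILES string for a sequence model."""
    return PATTERN.findall(smiles)

# A reaction SMILES: reactants >> product, translated token by token
print(tokenize("CC(=O)O.OCC>>CC(=O)OCC"))
```

Each token becomes one vocabulary entry, so the model never has to invent characters inside an atom bracket.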
Automating the DMTA cycle
• REINVENT is automated
AI REINVENT
design
• AiZynthFinder is
automated route
planning
Design • iLab is automated
synthesis

Make AiZynthFinder • 2018 De-novo design


Analyze
• 2019 Retrosynthesis
Test
• 2020 Combine with
Automation

• Increased automation is
iLab key to speed
72
Conclusions part 2

• Reaction Prediction is many different tasks


• Data driven retro-synthetic algorithms are
performant
• Public data perform nearly on-par with
proprietary data-sources
• Specialized neural networks can improve route
search performance
• Calculation of artificial labels can improve route
search performance
• SMILES based approaches are an interesting
alternative to extracted reaction core templates

• The increased automation in de novo design


and retrosynthetic planning will pave the way
for automating the DMTA cycle.

73
Toolkits – Source code - Links

ReInvent: https://github.com/MolecularAI/Reinvent
Molvecgen: https://github.com/EBjerrum/molvecgen
Deep Drug Coder:https://github.com/pcko1/Deep-Drug-Coder
AiZynthFinder: TBR: https://github.com/MolecularAI
Blogposts: www.cheminformania.com
74
Acknowledgements
Rocio Mercado, Post.doc
Molecular AI group: Tomas Bastys, Post.doc
Ola Engkvist, Associate Director, Molecular AI Simon Johansson, Ph.D Student WASP
Panagiotis-Christos Kotsias, Graduate Scientist, Hampus Gummesson Svensson, Ph.D Student WASP
Graduate Programme Sebastian Nilsson, Master Student
Josep Arus Pous, Ph.D student, BIGCHEM Tobias Rastemo, Master Student
Jiazhen He, post.doc. Molecular AI Emil Sandström, Master Student
Amol Thakkar, Ph.D student, BIGCHEM Jonathan Sundkvist, Master Student
Dean Sumner, Graduate Scientist, Graduate Programme Huifang You, Master Student
Veronika Chadimova, Graduate Scientist, Graduate Carl Blomgren, Master Student
Programme
Samuel Genheden, Data Scientist/Software Engineer Collaborators:
Atanas Patronov, Associate Principal Scientist Prof. Dr. Jean-Louis Reymond · Dept. of Chemistry &
Isabella Feierberg, Associate Principal Scientist Biochemistry University of Berne
Thierry Kogej, Associate Principal Scientist Christian Tyrchan, Team Leader - Computational
Preeti Iyer, Machine Learning and Cheminformatics
Experts Boris Sattarov, Informatics Programmer, Science Data
Christian Margreitter, Data Scientist Software LLC
Papadopoulos, Kostas, Associate Principal Scientist Hongming Chen, Professor, Centre of Chemistry and
Lewis Mervin, Machine Learning and Cheminformatics Chemical Biology, Guangzhou, China
Expert Nidhal Selmi, Research Outsourcing Specialist, Hit
Christos Kannas Machine Learning/Cheminformatics Discovery
Expert Peter Varkonyi, Senior Research Scientist |
Alexey Voronov, Data Scientist/Software Engineer
75 Computational Chemistry
Questions

76
Confidentiality Notice
This file is private and may contain confidential and proprietary information. If you have received this file in error, please notify us and remove
it from your system and note that you must not copy, distribute or take any action in reliance on it. Any unauthorized use or disclosure of the
contents of this file is not permitted and may be unlawful. AstraZeneca PLC, 1 Francis Crick Avenue, Cambridge Biomedical Campus,
Cambridge, CB2 0AA, UK, T: +44(0)203 749 5000, www.astrazeneca.com

77
GDB and the chemical space

11.20 – 11.50

Jean-Louis Reymond, University of Bern, Switzerland

7 RISE — Research Institutes of Sweden


Jean-Louis Reymond
Bio
https://orcid.org/0000-0003-2724-2942
Jean-Louis Reymond is a Professor of Chemistry at the University of Bern, Switzerland. He studied Chemistry and Biochemistry at the ETH
Zürich and obtained his Ph.D. at the University of Lausanne on natural products synthesis (1989). After a Post-Doc and Assistant Professorship
at the Scripps Research Institute, he joined the University of Bern (1997). His research focuses on the enumeration and visualization of
chemical space for small molecule drug discovery, the synthesis of new molecules from GDB (http://gdb.unibe.ch), and the design and
synthesis of peptide dendrimers and polycyclic peptides as antimicrobials and for nucleic acids delivery. He is the author of > 300 scientific
publications and reviews.

Abstract
GDB and the Chemical Space
Chemical space is a concept to organize molecular diversity by postulating that different molecules occupy different regions of a mathematical
space where the position of each molecule is defined by its properties. Our aim is to develop methods to explicitly explore chemical space in
the area of drug discovery. We have enumerated all possible molecules following simple rules of chemical stability and synthetic feasibility to
form the Generated DataBases (GDB). Exploring GDB in comparison to known molecules reveals that vast areas of chemical space are still
entirely unknown yet are accessible for experimental exploration by straightforward synthetic methods. I will discuss how to visualize chemical
space and exemplify the discovery and synthesis of new scaffolds for drug discovery, and how we use machine learning methods to address
target predictions and synthesis predictions of GDB molecules. http://gdb.unibe.ch
GDB and the Chemical Space

Jean-Louis Reymond
29 May 2020, RISE AI workshop
http://gdb.unibe.ch

1. The GDB project


2. Atom-pairs
3. Molecular shingles

1
Philippe Schwaller
David Kreutter

Josep Arus-Pous
Daniel Probst

Sven Bühlmann

Alice Capecchi
Sacha Javor
http://gdb.unibe.ch
Finton Sirockin (Novartis)
Ola Engkvist (Astra Zeneca)
Matthias Hediger (UniBE) Florian Hollfelder (Cambridge UK)
Pierre Gönczy (EPFL) Anne Imberty (Grenoble)
Amol Thakkar
Roch-Philippe Charles (UniBE) Achim Stocker (UniBE)
Jürg Gertsch (UniBE) Luc Patiny (EPFL)
Dirk Trauner (New York) Andrea Endimiani (UniBE)
Daniel Bertrand (HiQScreen) Christian van Delden (Geneva)
2
Hugues Abriel (UniBE) Runze He (Space Peptides)
1. The GDB project

DB fraction
1) Ring strain / topologies
Graphs
114 B
2) Unsaturations Hydrocarbons
5.4 M

3) Heteroatoms
Skeletons
Claus Benzol 1.3 B
(1867)
GDB-17: Molecules
166.4 B

4) ChEMBL-likeness score ≥ 3.3 CLscore

5) Uniform sampling
26.2 B Subset
ChEMBL-like

GDBChEMBL
10 M

Tobias Fink et al. Angew. Chem. Int. Ed. 2005, 44, 1504-1508, J. Chem. Inf. Model. 2007, 47, 342-353 (GDB-11)
Lorenz C. Blum et al., J. Am. Chem. Soc. 2009, 131, 8732-3 (GDB-13);
Lars Ruddigkeit et al., J. Chem. Inf. Model. 2012, 52, 2864-2875 (GDB-17) Trinorbornane
Ricardo Visini et al., J. Chem. Inf. Model. 2017, 57, 700-709 (FDB17), J. Chem. Inf. Model. 2017, 57, 2707-2718 (GDB4c) (2007)
Mahendra Awale et al., Mol. Inf. 2019, 38, 1900031 (GDBMedChem)
Sven Bühlmann et al., Front. Chem. 2020, doi:10.3389/fchem.2020.00046 3
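The enumerate-then-filter idea behind GDB (graphs, then skeletons, then molecules, with stability and feasibility filters at each stage) can be caricatured in a few lines; the skeleton, element set, and "stability" rule below are all toy assumptions:

```python
from itertools import product

SKELETON_BONDS = [(0, 1), (1, 2)]   # a fixed 3-atom chain skeleton
ELEMENTS = ["C", "N", "O"]

def enumerate_molecules():
    """Decorate the skeleton with elements, filtering a peroxide-like motif."""
    for atoms in product(ELEMENTS, repeat=3):
        if any(atoms[i] == atoms[j] == "O" for i, j in SKELETON_BONDS):
            continue   # toy stability rule: no O-O bond
        yield atoms

mols = list(enumerate_molecules())
print(len(mols))   # 22 of the 27 assignments survive the filter
```

GDB-17 applies the same strategy at scale: 114 billion graphs are reduced to 166.4 billion molecules only after ring-strain, unsaturation, and heteroatom rules prune the combinatorial explosion.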
Molecular quantum numbers (42D)
[Table: the 42 Molecular Quantum Number (MQN) descriptors, illustrated for morphine and
penicillin G. The counts cover atoms (C, F, Cl, Br, I, S, P, acyclic/cyclic N and O, heavy
atom count), bonds (acyclic/cyclic single, double and triple bonds, rotatable bonds), polar
groups (H-bond donor/acceptor atoms and sites, positive and negative charges) and topology
(acyclic and cyclic node valencies, 3- to >=10-membered rings, atoms and bonds shared by
fused rings).]
4
Kong Thong Nguyen et al., ChemMedChem 2009, 4, 1803-1805
5
Daniel Probst et al., Bioinformatics, 2018, 34, 1433-1435, J. Chem. Inf. Model. 2018, 58, 1–7
6
Mahendra Awale et al., Mol. Inf.. 2019, 38, 1900031
Ring systems

GENG
93,463 graphs
≤ 4 cycles, ≤ 16 nodes
1) Ring enlargement
+ filters

2) Aromatization
728,391 saturated
carbocyclic ring systems
≤ 4 cycles, ≤ 30 atoms

GDB4c 3) Stereoisomers
916,130 carbocyclic
ring systems
(as SMILES)

RDB
12,536 known GDB4c3D
ring systems 6,555,929 carbocyclic
ring systems
(as sdf files /SMILES)

Ricardo Visini, Josep Arùs-Pous et al., J. Chem. Inf. Model. 2017, 57, 2707-2718 7
Deep learning

Josep Arus-Pous et al., J. Cheminf. 2019, 11, 20 8


Deep learning

ChEMBL 1) Training
DrugBank
2) Transfer learning
FDB17
commercial fragments One known drug

LSTM
Generative Neural Network

3) Generate new SMILES
- retain correct SMILES
- remove duplicates
- remove undesirable functional groups
4) Select high similarity analogs
New drug analogs

Table 2. Number of unique/total high similarity drug analogs produced by the different LSTM neural networks.

Neural Network    LSTM1    LSTM2     LSTM3     LSTM4       LSTM5      LSTM6      Unique
Source database   ChEMBL   ChEMBLs   DrugBank  Commercial  FDB17      All        across
                                               fragments              databases  LSTMs
Training cpds.    344,319  40,000    5,104     40,986      500,000    890,409
Nicotine          0/23     32/82     1/32      32/93       9/47       16/67      166
Fencamfamine      15/42    126/218   40/96     130/231     92/164     41/114     580
Aminophenazone    5/26     34/96     23/71     38/99       22/66      19/65      223
Sulfadiazine      6/27     19/59     11/37     28/74       8/30       2/25       124
Miconazole        2/10     301/500   268/438   174/336     0/0        153/256    1134
Roflumilast       8/15     319/557   117/283   351/585     0/0        45/166     1126
Lovastatin        0/1      631/986   460/757   352/625     487/728    289/530    2729
Epothilone D      0/1      911/1301  561/831   807/1160    1595/2039  1163/1511  5707
Nilotinib         0/1      506/666   180/321   218/381     0/0        243/355    1362
Erythromycin      0/2      832/1042  174/243   524/709     1243/1444  1105/1288  4190

Mahendra Awale et al., J. Chem. Inf. Model. 2019, 59, 1347-1356 9


2. Atom-pairs
[Figures: atom-pair fingerprint bit values (distance categories D1-D10) for example
hydrocarbons (decane, cyclodecane, adamantane, 2,7-dimethyloctane, decalin,
1,5-dimethylcyclooctane, bis-homocubane, iPrEt3CH, iPr3CH, Pr3CH); AUC values for recovery
of ROCS Color Tanimoto analogues by the Sfp, ECfp4, Xfp10, Xfp20 and CATS fingerprints; and
average bit values (with SD) across the ChEMBL, ZINC, GDB-17 and CSD databases.]

Mahendra Awale et al., J. Chem. Inf. Model. 2014, 54, 1892-1907
10
Target prediction
From a query molecule, four parallel prediction streams:
• NN search (MQN / Xfp / ECfp4) in ChEMBL -> predicted targets (NN)
• ECfp4-NB model built on 2,000 NN -> predicted targets (NN+NB)
• ECfp4-NB model built on ChEMBL -> predicted targets (NB)
• ECfp4-DNN model built on ChEMBL -> predicted targets (DNN)

11
Mahendra Awale et al., J. Chem. Inf. Model. 2019, 59, 10-17
12
Mahendra Awale et al., J. Chem. Inf. Model. 2019, 59, 10-17
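The nearest-neighbour stream reduces to a similarity search over an annotated library; a minimal sketch in which the fingerprints (sets of on-bits) and target annotations are invented stand-ins for ECfp4/Xfp/MQN data:

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two fingerprint bit sets."""
    return len(a & b) / len(a | b)

# Hypothetical annotated library; real entries come from ChEMBL bioactivities
LIBRARY = {
    "ligand_1": ({1, 4, 7, 9}, "Target A"),
    "ligand_2": ({2, 4, 8}, "Target B"),
}

def predict_targets(query_fp, k=1):
    """Rank library ligands by similarity and return the top-k annotations."""
    ranked = sorted(LIBRARY.values(),
                    key=lambda item: tanimoto(query_fp, item[0]),
                    reverse=True)
    return [target for _, target in ranked[:k]]

print(predict_targets({1, 4, 9}))   # ['Target A']
```

The NB and DNN streams replace this lookup with models trained on the same library, and PPB2 combines the streams.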
PPB2 with Xfp similarity

angiogenesis inhibitor from phenotypic screen


known LPAAT-β inhibitor

13
Marion Poirier et al. ChemMedChem 2019, 14, 224-236
14
Alice Capecchi et al., Mol. Inf. 2019, doi: 10.1002/minf.201900016
15
Antimicrobial peptides

Stéphane Baeriswyl et al., ACS Chem. Biol. 2019, DOI: 10.1021/acschembio.9b00047 16


3

Library (50k) – PCHRG atoms (0%-10%)

Peptide Dendrimers

Enumerate analogs
G3KL (Lys, Leu, deletion: 52,530)

VS top 200 / First Round Synthesis


NN-T7 / NN-T5 / NN-T4

Property space (2DP)

Select nearest neighbours


Synthesis and testing (200)

Cluster
(10)

17
Thissa N. Siriwardena et al., Angew. Chem. Int. Ed. 2018, 57, 8483-8487
Peptide Design Genetic Algorithm (PDGA)
50 random Target sequence Target: tyrocidine A
sequences or MXFP cyclo[D-Phe-Pro-Phe-D-Phe-Asn-Gln-Tyr-Val-Orn-Leu]

XfPFfNQYVOL

evaluation:
1. SMILES SMILES
2. MXFP
3. CBD from target MXFP
MXFP

CBDMXFP= 0 YES
OR Exit
run time = 24h

NO tyrocidine B CBDMXFP = 67
YES
CBDMXFP ≤ 300 ANALOGS
DATABASE retro-loloatin A CBDMXFP = 157

retro-tyrocidine C CBDMXFP = 188


10 survivors 40 new individuals: crossover,
+
selection random generation, and mutation

new generation

A. Capecchi et al., J. Chem. Inf. Model., 2020, doi:10.1021/acs.jcim.9b01014 18


3. Molecular shingles encoding precise atom environments

19
Daniel Probst et al., J. Cheminf. 2018, 10, 66
TMAP of the natural products atlas (24,594×232)

20
Daniel Probst et al., J. Cheminf. 2020, doi:10.1186/s13321-020-0416-x
Atom Pair shingles: MAP4 encoding of an atom pair (j, k)
r1: O=c |15| c(c)c
r2: O=c(c)[nH]|15|c(cc)cc

a) DUD, MUV, ChEMBL b) Mutated and scrambled peptides c) All datasets

A. Capecchi et al, ChemRXiv., 2020, doi: 10.26434/chemrxiv.11994630.v1 21
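MHFP6 and MAP4 compare molecules by MinHashing their shingle sets; the core trick fits in a few lines (the hash construction and shingle strings below are simplified assumptions, not the published implementation):

```python
import hashlib

def _h(seed, s):
    """Deterministic per-seed hash, standing in for a MinHash permutation."""
    return int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)

def minhash(shingles, n_perm=16):
    """MinHash signature: for each hash function, keep the smallest value.
    Equal signature positions estimate the Jaccard similarity of the sets."""
    return [min(_h(seed, s) for s in shingles) for seed in range(n_perm)]

def jaccard_estimate(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Shingle strings loosely modelled on the slide's atom-pair shingles
a = {"O=c|15|c(c)c", "c(c)c", "O=c"}
b = {"O=c|15|c(c)c", "c(c)c", "N=c"}
print(jaccard_estimate(minhash(a), minhash(b)))  # noisy estimate of the true Jaccard 2/4
```

Because signatures are short fixed-length vectors, similarity search and TMAP layout scale to millions of molecules.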


TMAPs of the Metabolome MHFP6

OH count
MAP4

http://tm.gdb.tools/map4/ 22
Summary and Outlook
> GDB
– GDBChEMBL
– MQN (visualize)
– Deep learning
– GDBScaffold

> Atom-pairs
– Target prediction
– Beyond Lipinski
– Peptide discovery

> Molecular shingles


– MHFP6 (MinHash)
– TMAP
– MAP4

23
Machine Learning in the Regulatory
landscape
13.10 – 13.20

George Kass, EFSA, Italy

9 RISE — Research Institutes of Sweden


George Kass
Bio
George Kass was trained as a biochemist. He received his PhD in biochemical toxicology from the Karolinska Institute in Stockholm in 1990.
After a post-doc at the Swiss Federal Institute of Technology in Zurich he returned to the Karolinska Institute as Assistant Professor. In 1994 he
moved to the University of Surrey in the UK as Professor of Toxicology. He moved to the European Food Safety Authority in 2009. And he
knows nothing about machine learning or artificial intelligence.

Abstract
N/A
Machine Learning in the
Regulatory landscape

George Kass & Caroline Merten


European Food Safety Authority
Disclaimer

The views, thoughts and opinions presented are not


necessarily those of EFSA

2
EU agencies

ECDC
ECHA

EMA

EFSA
EFSA IS

The reference body for risk assessment of food


and feed in the European Union. Its work covers
the entire food chain – from field to fork

One of the number of bodies that are responsible


for food safety in Europe

4
WHAT EFSA DOES

Provides independent scientific advice and support for EU


risk managers and policy makers on food and feed safety

Provides independent, timely risk communication

Promotes scientific cooperation

5
THE SCIENTIFIC PANELS
Plant protection
Plant health

GMO

Nutrition
Animal feed

Food Packaging

Animal health & welfare

Food additives
Biological hazards

Chemical contaminants
6
In vivo biological studies
• ADME studies
• Following OECD TG and GLP criteria
• Traditional TK parameters (Tmax, t1/2, AUC, analytical data, etc...)

In vivo toxicological studies
• Sub-chronic, chronic, repro-dev studies
• Following OECD TG and GLP criteria
• Traditional Tox parameters (biochemistry, histopathology, weight, food consumption, etc...)

In vitro studies
• Mainly for genotoxicity and metabolism
• Following OECD TG and GLP criteria
• Traditional parameters (biochemistry, markers for mutagenesis and chromosomal aberrations, etc..)

7
Previous
Evaluations
Reports, papers and
evaluations
Reports
Biological studies In depth or
(e.g. ADME) and systematic
toxicological studies literature
(e.g. genotoxicity, searches
90-day, Papers and reports
developmental etc..)

Data and
evidence to
support
risk
assessment

Quantity of information is small. Risk assessor evaluates evidence manually.


8
The Challenges for EFSA

Transparency and openness

Need to accelerate pace of risk assessment

New methodologies and tools for risk assessment

One health approach to food and feed safety

9
Impact on regulatory risk assessment:
Opportunities for machine learning
▪ Data and evidence from literature
➢ Transparency and reproducibility
➢ In depth and systematic literature searches
➢ Controversial substances – thousands of publications
➢ Tools for literature screening (Distiller SR) and study quality evaluation (e.g.
SciRAP)
▪ Big data
➢ Whole-genome sequencing data
➢ OMICs data

▪ In silico predictions
➢ (Q)SAR
➢ RAx
➢ Need for better predictive models

Opportunities for machine learning and AI!


10
Machine learning and AI
could lead to better
decisions on chemicals

BUT we must get it right!

11
Emerging risks in the food chain
Opportunities for machine learning

REACH database contains 25,405 unique substances

Only small fraction extensively characterised

Does their release in the environment lead to accumulation in the food chain?

How to identify emerging issues? Alerts through text mining of scientific literature?

Alerts through screening social media and media reports?

Opportunities for machine learning and AI?


12
Why, Where and When Machine Learning
Works in Predictive Toxicology – And When it
Doesn’t….
13.30 – 14.00

Mark Cronin, LJMU, UK

10 RISE — Research Institutes of Sweden


Mark Cronin
Bio
Mark Cronin is Professor of Predictive Toxicology at Liverpool John Moores University. He has been interested in in silico methods to predict toxicity and
ADME endpoints for over 30 years, having experience in both human health and environmental endpoints. As a biologist he is keen to find patterns within
data, but also to ensure models have mechanistic meaning and interpretation and are fit for purpose.

Abstract
Why, Where and When Machine Learning Works in Predictive Toxicology – And When it Doesn’t….
Mark Cronin, School of Pharmacy and Biomolecular Sciences, Liverpool John Moores University, UK.
E-mail: m.t.cronin@ljmu.ac.uk

We've been using multivariate statistics in predictive toxicology for many decades now, as soon as the computational power was sufficient in the 1980s we
used supervised and unsupervised techniques to analyse what would now be thought of as rather trivial data sets. Neural Networks became relatively
commonplace in QSAR in the early 1990s, with some notorious early studies only proving that it was easier to overfit a model than provide a reliable
predictive technique! Whilst the data resources, hardware and machine learning techniques currently available to us, as well as the possibilities they provide,
could not have been imagined 30 years ago, the uses of computational toxicology have largely remained unchanged e.g. supporting product development,
screening and prioritisation, safety assessment. A key use is also in regulatory assessment, and this is an area where potentially there has been the greatest
resistance to the use of Machine Learning QSARs. The purpose of this talk is not to discourage the use of Machine Learning in computational toxicology,
precisely the opposite, but to say we need to use it correctly and in the right place, understanding when other techniques, based on a more mechanistic
approach, may be preferred and/ or acceptable.
Why, Where and When Machine Learning
Works in Predictive Toxicology
– And When it Doesn’t….

Or: From Luddite to Deep Learning

Mark Cronin

Liverpool John Moores University


A Two Minute Rant on ML and AI !!
Machine Learning
Machine Learning

With Thanks to Dr
James Firman, LJMU
Machine Learning
(figure examples: social media, epidemiology, adverse drug reactions)
• What do you want to learn?
• Size of data set
• Homogeneity of mechanisms
• Appropriate descriptors
• Overfitting
• Understand experimental error
• Structure of data
Main Message:

Use the Right


Tool for the Job!
http://researchonline.ljmu.ac.uk/id/eprint/4989/
Interest in
Computational
Toxicology

1980s 1990s 2000s 2020s


What I Learned From My PhD

• You can model complex toxicological effects


…and the models are beautiful

• QSAR datasets have a structure


…random numbers don’t

• There are many ways to understand data


…and they all tell you the same thing
fish toxicity
log P
Structure-Toxicity
Relationships
The “Good Old Days”
of Data Collection
And then
came
neural
networks...
Data Capture and Sharing

eTOX 25/10/16
DrugBank Approved v5.0.3
Some Freely Available Toxicity Data Resources
• OECD QSAR Toolbox (https://www.qsartoolbox.org/)
• Requires download and some expertise!
• ChEMBL (https://www.ebi.ac.uk/chembl/)
• Targets: 13,382 - Activities: 16,066,124
• Compounds: 1,961,462 - Documents: 76,086
• PubChem (https://pubchem.ncbi.nlm.nih.gov/)
• Compounds: 102.7 million - Substances: 253 million
• Bioactivities: 268 million
• And there are many others….
Why Does
This Happen?
The Hypothesis
Organised into An Adverse Outcome Pathway
AOPs Networks:
Quantify With NAMs

Spinu N. et al. (2019). Development and analysis of an adverse outcome pathway Slide adapted from Dr
network for human neurotoxicity. Archives of Toxicology 93, 2759–2772 Andrew Worth, EC JRC
Adverse Outcome Pathway Network for Neurotoxicity

Spinu N et al. (2020) Archives of Toxicology


https://doi.org/10.1007/s00204-020-02774-7
Weight of Evidence – A Bayesian Dream?
• Different streams of evidence to predict acute toxicity classification

Cytotoxicity

• When do we have enough “certainty” to make a decision

With thanks to Dr John Doe, for really


making me think about this!
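The "enough certainty" question can be framed as Bayesian updating over evidence streams; a minimal sketch in which the prior probability, assay sensitivity and specificity values are invented for illustration:

```python
def posterior(prior, sensitivity, specificity, positive=True):
    """P(toxic | one assay result) via Bayes' rule."""
    if positive:
        num = sensitivity * prior
        den = num + (1 - specificity) * (1 - prior)
    else:
        num = (1 - sensitivity) * prior
        den = num + specificity * (1 - prior)
    return num / den

p = 0.10                       # assumed prior: 10 % of screened compounds toxic
p = posterior(p, 0.80, 0.90)   # positive cytotoxicity result
print(round(p, 3))             # 0.471
p = posterior(p, 0.70, 0.85)   # a second, independent positive evidence stream
print(round(p, 3))             # 0.806
```

A decision rule then amounts to acting once the posterior crosses a pre-agreed threshold, which is one way to make "enough certainty" explicit.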
Product Development Regulatory Use

• Screening What are • Prioritisation and


• Lead Optimisation the use Screening
• Risk assessment cases? • Classification and
• Registration Labelling
• Risk Assessment
Regulatory Use of Predictions from In Silico Tools:
Validation and Acceptance

• Opportunities:
• To update assessment / validation
• Utilise knowledge of uncertainties
• Develop frameworks for regulatory use
Inspirations
13 Types of Uncertainty, Variability and Bias of QSARs
49 Assessment Criteria
Definition of Chemical Structures → 2 Biological Data → 7

Creation Physico-Chemical Properties and Structural Descriptors → 5

Compilation of the Data Set → 5 Modelling Approach → 1

Description of Model → 3 Statistical Performance → 2

Characteristics Applicability Domains → 3 Mechanistic Relevance → 3

ADME Effects → 2

Documentation and Reproducibility → 2


Application
Usability → 9 Relevance → 5

Details in: Cronin MTD et al (2019) Reg. Toxicol. Pharmacol. 106: 90-104
Making it Usable
Making it Useful
Uncertainties Confidence
Low High

Key Question:
What is
acceptable level
of uncertainty?

Acceptable
High Low
Fit for Purpose In Silico Models

Risk Assessment

Classification and Labelling

Prioritisation and Screening


Challenges of Predicting
Toxicology
Applications of Machine Learning in Predictive Toxicology

Mining datasets
– Searching for patterns
– Finding analogues

Tools for Screening


– Predicting hazard

Supporting regulatory decision making


– In silico profiling and evidence
– qAOPs, QST
Main Message:

Use the Right Tool for the Job!

• Formulate the Problem


• Understand the Data
• Assess Uncertainty
• Be Transparent

• Simplify – Explain
• Make Even Simpler - Explain Again
• Make Really, Really Simple…
Predictions with Confidence using
Conformal Prediction
14.15 – 14.45

Ulf Norinder, Stockholm University, Sweden

11 RISE — Research Institutes of Sweden


Ulf Norinder
Bio
Ulf Norinder received his Ph.D. degree in organic chemistry from Chalmers University of Technology (CTH) 1984. From 1985 – 2014 he worked
as computational chemist, senior principal scientist and research fellow in the pharmaceutical industry (Karo Bio AB, AstraZeneca AB and H.
Lundbeck A/S) and from 2015 - 2018 as senior research specialist at Swetox Södertälje (Karolinska Institutet).
He is currently affiliate professor at MTM Research Centre, School of Science and Technology, Örebro University, visiting researcher at the Dept
of Pharmaceutical Biosciences, Uppsala University and affiliate researcher at the Dept of Computer and Systems Sciences, Stockholm
University. His areas of expertise include computer-assisted drug design and pattern recognition with special emphasis on multivariate data
analysis and machine learning.

Abstract
Predictions with Confidence using Conformal Prediction.
The presentation will cover the utility of confidence predictors such as Conformal Prediction as an in silico modelling framework for obtaining
predictions with known, and mathematically proven, error rates set by the user as well as the graceful handling of highly imbalanced datasets,
typical in toxicology, without the need for balancing measures such as under- and/or oversampling.
Predictions with Confidence
using Conformal Prediction

Ulf Norinder*
Dept of Computer and Systems Sciences
Stockholm University, Sweden

* Most of the work performed at: Swedish Toxicology Sciences Research Centre (Swetox),
Unit of Toxicology Sciences, Karolinska Institutet, Sweden
as part of the EU-ToxRisk project (Horizon 2020, grant agreement No 681002)
Question of (un)certainty

A prediction without a quantified uncertainty is less useful


Question of (un)certainty

• Many methods to assess the average uncertainty


• Probability distributions
• Bayesian models
• Reliability-density neighbourhoods
• Ensemble model variance

However, none of these methods guarantee the error rate


for new instances*!

* = compounds
Question of (un)certainty

Desired:

• A method that delivers predictions with well-defined
uncertainties on an instance-by-instance basis
Question of (un)certainty

Why Conformal Prediction?


• Win situation
• Statistical guarantees (on validity)
Conformal Prediction

If {Exchangeability} then {conformal predictors are always valid}

Mathematical proof

Vovk V, Gammerman A, Shafer G


(2005) Algorithmic learning in a
random world, Springer, New York

If 20 % prediction errors on validity acceptable --->


CP will give, at most, 20 % errors!!
Conformal Prediction

“What we ideally would like to know is in fact that a particular


prediction is derived from an area of property space from which
reliable predictions are to be expected”
Conformal Prediction
validity

Classes
Active
Inactive
Both {active, inactive}
Empty {null}

Binary classification

In conformal prediction:
If a classification contains the correct class it is correct
both = always correct, empty = always erroneous
Validity = % of correct classifications (for each class)
Efficiency = % of single label classifications (right or wrong)
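These definitions can be turned into a small self-check (an illustrative sketch, not code from the talk; the prediction sets and labels below are invented):

```python
def validity_and_efficiency(prediction_sets, true_labels):
    """Validity = fraction of prediction sets containing the true class
    (computed per class); Efficiency = fraction of single-label sets."""
    classes = sorted(set(true_labels))
    validity = {}
    for c in classes:
        idx = [i for i, y in enumerate(true_labels) if y == c]
        correct = sum(1 for i in idx if true_labels[i] in prediction_sets[i])
        validity[c] = correct / len(idx)
    efficiency = sum(1 for s in prediction_sets if len(s) == 1) / len(prediction_sets)
    return validity, efficiency

# "both" = {active, inactive} is always correct; "empty" = set() is always erroneous
sets = [{"active"}, {"active", "inactive"}, set(), {"inactive"}]
labels = ["active", "active", "inactive", "inactive"]
validity, efficiency = validity_and_efficiency(sets, labels)
# active: both of its sets contain "active" -> validity 1.0
# inactive: empty set misses, single "inactive" hits -> validity 0.5
# two single-label sets out of four -> efficiency 0.5
```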
Conformal Prediction
Why Conformal Prediction?
• Win situation
• Statistical guarantees (on validity)
• CP is instance-based
• The risk is known up-front for the decision taken
• Applicability domain closely linked to model development
CP strictly defines the level of similarity (conformity) needed
No ambiguity anymore
• Gracefully handles (severely) imbalanced datasets
Ratios of 1:100 – 1:1000
No need for over- or undersampling
• CP is a framework (almost any ML algorithm will work)
Conformal Prediction
How does this work?

Data → Train set + Test set
Train set → Proper train set + Calibration set
Proper train set → Model; Calibration set → CP p-values (for each class), used for classification

U. Norinder, L. Carlsson, S. Boyer, M. Eklund, Introducing Conformal Prediction in Predictive Modeling. A Transparent and Flexible Alternative to Applicability Domain Determination, J. Chem. Inf. Model., 2014, 54, 1596–1603
CP is a framework (almost any ML algorithm will work)

• ML algorithm must provide a ranking


• Use current models, descriptors, algorithms
• Add a calibration set (new examples in time)
Conformal Prediction
Example: Predicting Toxicity

Imbalanced dataset
(toxic minority class)

A binary RF classifier (100 trees) gives the following output for a new compound to predict (which is toxic):
32 trees: toxic
68 trees: non-toxic
Conformal Prediction
Example: Predicting Toxicity
Calibration set, 6 toxic, 7 non-toxic compounds
Number of trees predicting the correct class:
Toxic: 45, 42, 36, 30, 28, 27
Non-toxic: 91, 88, 85, 82, 79, 79, 77
New compound to predict (is toxic): 32 trees toxic, 68 trees non-toxic

Mondrian Conformal Prediction


Conformal Prediction
Example: Predicting Toxicity
Based on the similarity to the known examples in the calibration set:
Position toxic: 3/7
Position non-toxic: 0/8
Calibration set (number of trees predicting the correct class):
Toxic: 45, 42, 36, 30, 28, 27
Non-toxic: 91, 88, 85, 82, 79, 79, 77
New compound to predict (is toxic): 32 trees toxic, 68 trees non-toxic

Mondrian Conformal Prediction


Conformal Prediction
Example: Predicting Toxicity

Using 80% confidence level (0.2 significance level):

3/7 = 0.43 > 0.2, therefore the compound is assigned to the toxic class
0/8 = 0.0 < 0.2, therefore the compound is not assigned to the non-toxic class

New compound to predict (is toxic): 32 trees toxic, 68 trees non-toxic
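The p-value arithmetic on this slide can be reproduced in a few lines (an illustrative sketch using the slide's numbers; the nonconformity score is simply the number of trees voting for the class in question):

```python
def mondrian_p_value(calibration_scores, new_score):
    """Mondrian CP p-value for one class: rank of the new compound's score
    among the calibration scores of that class (higher score = more conforming)."""
    n = len(calibration_scores)
    return sum(1 for s in calibration_scores if s <= new_score) / (n + 1)

# Calibration: number of trees predicting the correct class
toxic_cal     = [45, 42, 36, 30, 28, 27]          # 6 toxic compounds
non_toxic_cal = [91, 88, 85, 82, 79, 79, 77]      # 7 non-toxic compounds

# New compound: 32 trees vote toxic, 68 vote non-toxic
p_toxic = mondrian_p_value(toxic_cal, 32)          # 3/7, approx. 0.43
p_non_toxic = mondrian_p_value(non_toxic_cal, 68)  # 0/8 = 0.0

significance = 0.2  # 80% confidence level
prediction_set = {c for c, p in [("toxic", p_toxic), ("non-toxic", p_non_toxic)]
                  if p > significance}
# p_toxic > 0.2 and p_non_toxic < 0.2, so only "toxic" enters the set
```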
Conformal Prediction
Example: Predicting Toxicity
Calibration set, 6 toxic, 7 non-toxic compounds; number of trees predicting the correct class.
Several pairs of proper train and calibration sets give several p-values (for each class): use the median p-value.
New compound to predict (is toxic): 32 trees toxic, 68 trees non-toxic

Aggregated Mondrian Conformal Prediction


Mondrian Cross-Conformal Prediction
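The aggregation over several train/calibration splits can be sketched as follows (illustrative only; the resampled calibration score lists are invented):

```python
from statistics import median

def p_value(cal_scores, new_score):
    # Mondrian CP p-value for one class and one calibration set
    return sum(1 for s in cal_scores if s <= new_score) / (len(cal_scores) + 1)

# Several pairs of proper train/calibration sets -> several calibration
# score lists for the toxic class (hypothetical resampled values)
toxic_calibrations = [
    [45, 42, 36, 30, 28, 27],
    [44, 40, 35, 33, 29, 26],
    [46, 41, 37, 30, 27, 25],
]
new_score = 32  # trees voting "toxic" for the new compound

p_values = [p_value(cal, new_score) for cal in toxic_calibrations]
p_toxic = median(p_values)  # aggregated (Mondrian cross-conformal) p-value
```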
Binary Mondrian Conformal Prediction p-values

In conformal prediction:
If a classification contains the correct class it is correct
both = always correct, empty = always erroneous
Validity = % of correct classifications (for each class)
Efficiency = % of single label classifications (right or wrong)
PubChem Cytotox Assays
• Results from 16 high throughput cell viability (tox) screens from
PubChem

• On average 0.8% toxic compounds

AID     Tested compounds  Toxic compounds  % active  Ratio non-tox/tox
624418  386 360           524              0.14      736.3
504648  367 995           600              0.16      612.3
602141  359 040           1302             0.36      274.8
620     86 701            364              0.42      237.2
847     41 152            194              0.47      211.1
903     52 783            338              0.64      155.2
2275    29 938            193              0.64      154.1
588856  404 016           3018             0.75      132.9
1825    290 605           2259             0.78      127.6
2717    299 957           3181             1.06      93.3
648     86 121            924              1.07      92.2
719     84 841            937              1.10      89.5
1486    217 851           2408             1.11      89.5
463     56 465            706              1.25      79.0
430     62 627            1121             1.79      54.9
598     85 162            5139             6.03      15.6
PubChem Cytotox Assays
• Results from 16 high throughput cell viability (tox) screens from
PubChem

• On average 0.8% toxic compounds

• RDKit descriptors

• Random Forest, 500 trees, ensemble of 100 models

• 80 % training set, 20 % external test set


Validity of the predictions (test sets) at the 80% confidence
level. Models are valid for both classes.
PubChem & Hansen Datasets
• Four datasets of different sizes and class imbalances

• 10 % randomly selected training sets

• Signature descriptors of heights 0–2 for chemical structure characterization

• Support vector machines (SVM): C-SVC, RBF kernel, parameters C = 50, gamma = 0.002

• Ensemble of 100 SVM models


Four datasets of different sizes and class imbalances (ratio inactive:active compounds: 0.9, 4.1, 39.6, 911.2). Size and imbalance differ considerably between the datasets.

Fraction of predicted active and inactive compounds: results are similar across the datasets despite the varying imbalance.
Number of compounds in the "both" and "empty" classes at an acceptable significance level (decided by the user). Results from new data:
• Many predictions in the empty class → outside the AD of the current model → measure and update the model
• Many predictions in the both class → inside the AD of the current model → lack of information → add new information (features), develop a better model (classifier, algorithm)
Not over-optimistic models: validity for the minority and majority classes, training set vs test set, at significance level 0.2.
Conformal Prediction

If {Exchangeability} then {conformal predictors are always valid}

• Training and test data: the relationship between input and output in training and test data is the same.

If, however, the new predictions are not valid:
• Training and test data: the relationship between input and output in training and test data is different.
• Trigger to update the current model
Molecular descriptors for organic reactions

15.15 – 15.40

Fernando Huerta, RISE, Sweden

13 RISE — Research Institutes of Sweden


Fernando F. Huerta
Bio
Fernando F. Huerta received his BSc degree (1993) and his PhD degree (1998) in chemistry from the University of Alicante under the
supervision of Professor Miguel Yus. After a postdoctoral period at Stockholm University with Professor Jan-E. Bäckvall (1998-2000)
working on dynamic kinetic resolution of secondary alcohols, he took an associate professor position at the University of Alicante (2000-2002).
In 2002 he moved to Stockholm where he started his career as medicinal chemist at AstraZeneca (2002-2012). During this period, Fernando
was involved in a team for the evaluation of different synthesis planning software packages. In particular, he became part of the AZ team
collaborating with InfoChem GmbH in the development of ICSynth. Together with two former AstraZeneca colleagues, Fernando founded
Chemnotia AB, Stockholm, Sweden, in 2012. In this new role and in collaboration with InfoChem GmbH (proprietary owner) Fernando was a
key player in the development of IC Forward Reaction Prediction. Since 2016, Fernando has worked as Senior Researcher/Project Leader at Research Institutes of Sweden (RISE), Bioscience and Materials Division, Stockholm, Sweden.

Abstract:
Molecular descriptors for organic reactions.
Reactions are complex processes; the number of factors that can affect the outcome of a reaction ranges from collision theory, kinetics, enthalpy, and entropy to the use of catalysts or additives that alter the reaction mechanism. Different approaches have been reported in the literature to describe organic reactions as vectors to enable data mining or machine learning. However, the big question remains: do we have
enough molecular descriptors to develop predictive models?
MOLECULAR DESCRIPTORS FOR ORGANIC REACTIONS
Fernando F. Huerta
Bioscience & Materials Division (RISE)
fernando.huerta@ri.se
MOLECULAR DESCRIPTORS FOR ORGANIC REACTIONS

• Easy to calculate molecular properties


• Accessible with open access software
MOLECULAR DESCRIPTORS FOR ORGANIC REACTIONS

1. Definition & Examples


2. Reaction Matrix
3. Organic Reactions Understanding
4. Work in Progress (RISE)
5. Goal
I’M A CHEMIST
MOLECULAR DESCRIPTORS FOR ORGANIC REACTIONS

Molecular descriptors are formally mathematical representations of a molecule obtained by a well-specified algorithm applied to a defined molecular representation or a well-specified experimental procedure.

1. Definition & Examples


what type of descriptors are we talking about?
Mannhold LogP; Moreau-Broto Autocorrelation (mass) descriptors; Atomic Polarizabilities; Bond Count; Moreau-Broto Autocorrelation (charge) descriptors; Moreau-Broto Autocorrelation (polarizability) descriptors; Charged Partial Surface Areas (3D); Eccentric Connectivity Index; Fragment Complexity; VABC Volume Descriptor; Largest Chain; Largest Pi Chain; Petitjean Number; Rotatable Bonds Count; Topological Polar Surface Area; Molecular Weight; XLogP; Zagreb Index; Molar Mass; SP3 Character; Rotatable Bonds Count (non-terminal); … up to 200 descriptors

Indigo, CDK and RDKit nodes in KNIME
MOLECULAR DESCRIPTORS FOR ORGANIC REACTIONS

why?



MOLECULAR DESCRIPTORS FOR ORGANIC REACTIONS

1. Definition & Examples


2. Reaction Matrix
3. Organic Reactions Understanding
4. Work in Progress (RISE)
5. Goal

5. Goal

• Predict the outcome of a chemical reaction

• Predict the starting materials for the synthesis of


a complex molecule (retrosynthetic analysis)

a) Coley, C. W., Green, W. H., & Jensen, K. F. Accounts of Chemical Research, 51(5), 1281–1289 (2018). https://doi.org/10.1021/acs.accounts.8b00087.
b) Grzybowski, B. A., Scientific Reports, 7(1), 1–9 (2017). https://doi.org/10.1038/s41598-017-02303-0.
c) Warr, W. A. Molecular Informatics, 33(6–7), 469–476 (2014). https://doi.org/10.1002/minf.201400052.
MOLECULAR DESCRIPTORS FOR ORGANIC REACTIONS

how?



MOLECULAR DESCRIPTORS FOR ORGANIC REACTIONS

1. Definition & Examples


2. Reaction Matrix
3. Organic Reactions Understanding
4. Work in Progress (RISE)
5. Goal

Each reaction has to be described by a UNIQUE reaction matrix.
In a typical reaction A + B → C (under conditions a, with yield y):
a = set of reaction conditions
y = yield

Reaction representation using molecular descriptors: the reaction matrix (DAn, DBn, DCn) together with (a and/or y), where DXn denotes the vector of n molecular descriptors of molecule X.

Molecular descriptors for organic reactions (2. Reaction Matrix)
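The reaction matrix idea can be sketched as a simple vector concatenation (a hypothetical illustration; the descriptor names MW, logP, TPSA and all numeric values below are invented, not taken from the talk):

```python
# Sketch of a reaction matrix: concatenate the descriptor vectors of the
# starting materials A, B and the product C, then append conditions and/or yield.

def reaction_vector(desc_A, desc_B, desc_C, conditions=(), y=None):
    """Build (feature vector, target) for one reaction A + B -> C."""
    vec = list(desc_A) + list(desc_B) + list(desc_C) + list(conditions)
    return vec, y

# n = 3 hypothetical descriptors per molecule, e.g. (MW, logP, TPSA)
D_A = (122.1, 1.6, 37.3)
D_B = (157.0, 2.1, 0.0)
D_C = (214.3, 2.9, 37.3)

# One condition value (e.g. temperature) and the yield as target
features, target = reaction_vector(D_A, D_B, D_C, conditions=(80.0,), y=0.91)
# features has 3*3 + 1 = 10 entries; target is the yield
```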
What type of molecules are A, B, C?

Unique reaction matrix (DAn, DBn, DCn)
constitutional isomers can be described by means of a unique array of molecular descriptors

MOLECULAR DESCRIPTORS FOR ORGANIC REACTIONS

however?

Stereoisomers cannot be distinguished by a unique reaction matrix.

No unique reaction matrix for stereoselective reactions.
5. Goal

• Predict the outcome of a chemical reaction

• Predict the starting materials for the synthesis of


a complex molecule (retrosynthetic analysis)

…for reactions with no stereoselective issues


Reaction product ≠ f(reaction conditions)

Suzuki-Miyaura reaction

• No chemical selectivity issues


• Reaction conditions will only affect yield (hypothesis)

Reaction product ≠ f(reaction conditions)

Reaction matrix based on starting materials and product → yield class.
Yield classification model using ML with up to 95% accuracy*

*ACS Omega submitted
Reaction product ≠ f(reaction conditions)

Buchwald-Hartwig coupling

90% (AUC)

Negishi coupling

83% (AUC)

*ACS Omega submitted
other reactions

Reaction product = f(reaction conditions)

N- versus O-alkylation reactions (75% vs 74–86% yield)

• Selectivity depends on reaction conditions (reagents used).
MOLECULAR DESCRIPTORS FOR ORGANIC REACTIONS
The matrix difference lies only in the product molecular properties (P); the starting materials SM1 and SM2 are the same.
Reaction product = f(reaction concentration)

Lactamization versus amide coupling

Product depends on the reaction concentration

Reaction product = f(reaction conditions)

• Predictive models plausible (unique reaction matrix)

• Not reliable

• Lack of data

5. Goal

• Predict the outcome of a chemical reaction

• Predict the starting materials for the synthesis of


a complex molecule (retrosynthetic analysis)

are all organic reactions as simple?

In a typical reaction A + B → C (under conditions a, with yield y): a = set of reaction conditions, y = yield

• ca 446 named reactions

I’M STILL A CHEMIST
are all organic reactions as simple?

In a typical reaction A + B → C (under conditions a, with yield y): a = set of reaction conditions, y = yield

• ca 446 named reactions

• chemists have full knowledge of the reaction mechanism before setting up an experiment
MOLECULAR DESCRIPTORS FOR ORGANIC REACTIONS

1. Definition & Examples


2. Reaction Matrix
3. Organic Reactions Understanding
4. Work in Progress (RISE)
5. Goal



Reaction Thermodynamics
• Catalyst and/or additives
• Solvent
• Temperature
• Etc…

Reaction Thermodynamics:
• Catalyst and/or additives
• Solvent
• Temperature
• Etc…

Reaction Kinetics:
• Concentration
• Polar surface area
• Temperature
• Etc…

MOLECULAR DESCRIPTORS FOR ORGANIC REACTIONS

1. Definition & Examples


2. Reaction Matrix
3. Organic Reactions Understanding
4. Work in Progress (RISE)
5. Goal
need for different descriptors

HYPOTHESIS
• reaction classification
• key reagents
• analysis of bad
• identification of “bad”
• generation of “descriptor/s”

4. Work in Progress (RISE)

Work in progress with the Suzuki reactions!!!
need for different descriptors

PROBLEMS
• no bad reactions reported
• curation
• lack of data
• number of reaction clusters
• and many more…

4. Work in Progress (RISE)

Work in progress with the Suzuki reactions!!!
MOLECULAR DESCRIPTORS FOR ORGANIC REACTIONS

1. Definition & Examples


2. Reaction Matrix
3. Organic Reactions Understanding
4. Work in Progress (RISE)
5. Goal
5. Goal

• Predict the outcome of a chemical reaction

• Predict the starting materials for the synthesis of


a complex molecule (retrosynthetic analysis)

5. Goal

• Predict the outcome of a chemical reaction

1. Normalization of organic reactions does not help models


2. Relatively good results for specific transformations have been reported

5. Goal

• Predict the starting materials for the synthesis of


a complex molecule (retrosynthetic analysis)
1. Some attempts published with good theoretical results
2. Best practical results using LHASA, ICSYNTH, ChemPlanner

ACKNOWLEDGEMENTS

ALEXANDER MINIDIS

SAMUEL HALLINDER

ULF TEDEBARK

ULF NORINDER

fernando.huerta@ri.se
THANK YOU
Pitfalls when applying machine learning
to chemistry
15.50 – 16.10

Martin Nilsson, RISE, Sweden

14 RISE — Research Institutes of Sweden


Martin Nilsson
Bio
Martin Nilsson, RISE, Assoc. Prof., Mathematical physicist. M.Sc. at KTH, Stockholm, 1983; PhD at Tokyo University, 1989. Primary research
interests are mathematical aspects of machine learning and biologically inspired approaches to representation learning.

Abstract:
Pitfalls when applying machine learning to chemistry
Despite its recent popularity, machine learning (ML) is not a free lunch. Obnoxious problems often pop up when applying ML in general and no
less when applied to chemistry. I will discuss a selection of such problems that may require more attention, including computational efficiency,
data quality, debugging, descriptors/representations, and evaluation.
Pitfalls when applying machine
learning to chemistry

Martin Nilsson, RISE 2020-05-29


Frequently Overlooked Features of
Machine Learning

(Image credits: xkcd.com)


Three subtopics:

•Scaling

•Data quality

•Model problems
1. Scaling
Parameter distributions in chemistry
are often exponential (or worse),
while linear-algebra-based algorithms
prefer normal distributions.

Use a transform (e.g., log) –


Don’t trust your ML package
to do this automatically!
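A minimal illustration of the point (the values below are invented, e.g. rate constants spanning many orders of magnitude):

```python
import math

# Hypothetical chemistry parameter with an exponential-like spread
values = [1e-3, 0.1, 10.0, 1e3]

# On the raw scale, a single large value dominates any least-squares /
# linear-algebra step
spread_raw = max(values) / min(values)    # ~1e6

# After a log transform the values are roughly evenly spaced
logged = [math.log10(v) for v in values]  # approx. [-3, -1, 1, 3]
spread_log = max(logged) - min(logged)    # ~6
```

The transform has to be applied explicitly before model fitting; as the slide warns, the ML package will not do it for you.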
2. Data Quality
Some factors influencing yield:

• Rate and order of adding reactants
• Temperature
• Quality of reactants
• Number of equiv. to drive reaction
• Solvent and its quality
• Reactant surplus over time
• Humidity
• ...

The problem is: available datasets rarely provide all that data!
So then: How well can we learn?
What is a lower bound on the error?

Bayes’ error = best possible error for any learning algorithm

• Hard to compute exactly, but


• can be estimated, using, e.g.,
• Friedman-Rafsky statistic (Skoraczynski et al. 2017)
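The Friedman-Rafsky idea can be sketched in a few lines (a toy illustration, not the estimator from the cited paper: build a minimum spanning tree over the pooled samples and count edges joining points from different samples; few cross-sample edges suggest the two distributions separate easily, many suggest heavy overlap and hence a high Bayes error):

```python
def mst_edges(points):
    """Prim's algorithm on squared Euclidean distances; returns tree edges."""
    n = len(points)
    in_tree = {0}
    edges = []
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    while len(in_tree) < n:
        # cheapest edge from the tree to a node outside it
        i, j = min(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda e: d2(points[e[0]], points[e[1]]))
        edges.append((i, j))
        in_tree.add(j)
    return edges

def cross_sample_edges(sample_a, sample_b):
    """Friedman-Rafsky style count: MST edges joining the two samples."""
    points = sample_a + sample_b
    labels = [0] * len(sample_a) + [1] * len(sample_b)
    return sum(1 for i, j in mst_edges(points) if labels[i] != labels[j])

# Two well-separated toy samples: almost no cross edges
a = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1)]
b = [(5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
separated = cross_sample_edges(a, b)   # only the single bridge edge crosses

# Interleaved samples: many cross edges -> heavy overlap, high Bayes error
c = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
d = [(0.5, 0.5), (1.5, 1.5), (2.5, 2.5)]
mixed = cross_sample_edges(c, d)       # every MST edge crosses
```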
3. Model problems
• Toy problems
• Known solutions
• Solvable by classical methods
• Almost any ML algorithm applicable
• Not requiring huge amounts of data
• Debuggable & explainable

For image recognition: MNIST and CIFAR-10. What is the equivalent for chemistry?
Conclusions
• Three problems that deserve attention:
Scaling, Data Quality, and Model Problems

• References:
Skoraczynski et al.: “Predicting the outcomes of organic reactions via machine learning: are current descriptors sufficient?”, Scientific Reports 7:3582 (2017). (Don’t miss the Supplementary Info!)
Friedman & Rafsky: “Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests”, Ann. Stat. 7, 697-717 (1979).

• Credits to RISE colleagues: Andreas Thore, Ulf Tedebark, Fernando Huerta, Alexander Minidis, Sverker Janson, Erik Ylipää, Swapnil Chavan
Concluding remarks

16.20 – 16.30

Ian Cotgreave, RISE, Sweden

15 RISE — Research Institutes of Sweden


RISE organizing committee

Ulf.Tedebark@ri.se (Biomolecules)
Sverker.Janson@ri.se (Computer Science)
Ian.Cotgreave@ri.se (Chemical and Pharmaceutical Toxicology)
Alexander.Minidis@ri.se (Medicinal/Process Chemistry and Data-Science)
Erik.Ylipaa@ri.se (Deep Learning)
Fernando.Huerta@ri.se (Medicinal/Process Chemistry and Data-Science)
Swapnil.Chavan@ri.se (Computational Toxicology)
Andreas.Thore@ri.se (Materials Science and Computational Chemistry)
Martin.Nilsson@ri.se (Algorithms and Mathematics)

RISE is Sweden’s research institute and innovation partner. Through our international collaboration programs with industry, academia and the
public sector, we ensure the competitiveness of the Swedish business community on an international level and contribute to a sustainable
society. Our 2,800 employees engage in and support all types of innovation processes. RISE is an independent, state-owned research institute,
which offers unique expertise and over 100 testbeds and demonstration environments for future-proof technologies, products and services.

ISBN: 978-91-89167-42-1
