
Accelerating chemical design and synthesis using artificial intelligence
Open Workshop, May 29, 2020

RISE — Research Institutes of Sweden


ISBN: 978-91-89167-42-1
Dear Speakers and Participants,
We (RISE) thank all participants for making this Open Workshop a very interactive and interesting day! We are especially grateful to all the speakers, who deserve extra credit for putting together impressive presentations covering a wide range of areas!

Some fun statistics: Throughout the day we had 90-110 participants viewing the presentations and participating actively in the Q&A sessions. The audience
was spread all around the globe (Europe, US, Far East, Australia, South America) and we hope that the day was valuable to you all, and worth either getting
up really early, or staying up really late!!

This workshop was a combined effort by several cross-functional departments at RISE, which jointly defined the contents: the units Process Chemistry, Toxicology/Safety Assessment, and AI/Digital Systems. The day addressed how Machine Learning and AI are rapidly reshaping how chemicals are designed and manufactured with optimal functional characteristics. Achieving optimal functionality is always a balance between the desired physico-chemical properties of a molecule, its hazard/risk potential, and the practicality of scaling up production. In each of these areas, Machine Learning/AI techniques present great opportunities for achieving this balance in a more proactive and effective manner.

To conclude, please get in touch with any of the speakers or with RISE if you have a question or a request that you think any of us can solve or assist you with.

Take care and stay safe!

RISE organizing committee

Ulf.Tedebark@ri.se (Biomolecules)
Sverker.Janson@ri.se (Computer Science)
Ian.Cotgreave@ri.se (Chemical and Pharmaceutical Toxicology)
Alexander.Minidis@ri.se (Medicinal/Process Chemistry and Data-Science)
Erik.Ylipaa@ri.se (Deep Learning)
Fernando.Huerta@ri.se (Medicinal/Process Chemistry and Data-Science)
Swapnil.Chavan@ri.se (Computational Toxicology)
Andreas.Thore@ri.se (Materials Science and Computational Chemistry)
Martin.Nilsson@ri.se (Algorithms and Mathematics)
Agenda
(times are CEST (Stockholm time zone))
Moderator: Ian Cotgreave, RISE

• 09.00 – 09.15 Welcome and introduction. Ian Cotgreave, RISE, Sweden
• 09.15 – 09.45 Machine Learning and Chemistry. Erik Ylipää, RISE, Sweden
• 10.00 – 11.00 Molecular de-novo design and synthesis prediction. Esben Jannik Bjerrum, AstraZeneca, Sweden
• Coffee break
• 11.20 – 11.50 GDB and the chemical space. Jean-Louis Reymond, Univ of Bern, Switzerland
• Lunch break
• 13.10 – 13.20 Machine Learning in the Regulatory landscape. George Kass, EFSA, Italy
• 13.30 – 14.00 Why, Where and When Machine Learning Works in Predictive Toxicology – And When it Doesn’t…. Mark Cronin, LJMU, UK
• 14.15 – 14.45 Predictions with Confidence using Conformal Prediction. Ulf Norinder, Stockholm University, Sweden
• Coffee break
• 15.15 – 15.40 Molecular descriptors for organic reactions. Fernando Huerta, RISE, Sweden
• 15.50 – 16.10 Pitfalls when applying machine learning to chemistry. Martin Nilsson, RISE, Sweden
• 16.20 – 16.30 Concluding remarks. Ian Cotgreave, RISE, Sweden


Welcome and Introduction
09.00 – 09.15
Ian Cotgreave, RISE, Sweden


Machine Learning and Chemistry
09.15 – 09.45
Erik Ylipää, RISE, Sweden


Erik Ylipää
Bio
Erik Ylipää is a deep learning researcher at RISE. His main area of interest is neural network architectures, in particular for arbitrarily structured data such as sequences and sets. Other research areas of interest are sparsely connected neural networks, representation learning, and natural language processing.

Abstract
Machine Learning and Chemistry
The last couple of years have seen a shift in how machine learning is used in chemoinformatics: from methods with an emphasis on feature engineering to deep neural networks which allow molecules to be modelled directly. This talk will be a brief overview of the past, present and future of machine learning in chemistry, with a focus on deep learning. We will also look at how contemporary neural networks for processing mathematical sets are ideally suited for working with small molecules, and how lessons learned from the field of natural language processing could catalyse the field of chemoinformatics.
Machine Learning and Chemistry
Erik Ylipää (erik.ylipaa@ri.se), 2020-05-29

The AI and Deep Learning onion
• Artificial Intelligence — example: knowledge bases
• Machine Learning — example: Random Forest
• Representation Learning — example: shallow word vectors (CBoW)
• Deep Learning — example: Recurrent Neural Networks

• Deep Neural Networks are the focus of this presentation
• They are a deep learning method, similar to how Random Forest is an ensemble learning method

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. https://www.deeplearningbook.org/
Machine Learning and Learning Algorithms
• Learning algorithms improve their performance on a task with experience
• This is in contrast with a rule-based algorithm, whose performance does not change with more data


Supervised machine learning
● In many cases the data we have can be organized as a table
○ Each row is an observation
○ Each column is a different variable
● The task is often to predict the value of one variable (e.g. cardio) based on the others
● We often call the column to predict our target, denoted by y, and the other columns our features, often gathered as a vector x

age    gender  Height  Weight  ap_hi  ap_lo  cholesterol  gluc  smoke  alco  active  cardio
18393  2       168     62      110    80     1            1     0      0     1       0
20228  1       156     85      140    90     3            1     0      0     1       1
18857  1       165     64      130    70     3            1     0      0     0       1
17623  2       169     82      150    100    1            1     0      0     1       1
17474  1       156     56      100    60     1            1     0      0     0       0
21914  1       151     67      120    80     2            2     0      0     0       0
22113  1       157     93      130    80     3            1     0      0     1       0
22584  2       178     95      130    90     3            3     0      0     1       1
17668  1       158     71      110    70     1            1     0      0     1       0

https://www.kaggle.com/sulianova/cardiovascular-disease-dataset/version/1
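The split into features x and target y can be sketched in plain Python; the rows below are a small excerpt of the table above (no library assumed).

```python
# Minimal sketch: splitting tabular observations into feature vectors X
# and a target vector y. Rows are an excerpt of the cardiovascular table.
rows = [
    # age, gender, height, weight, ap_hi, ap_lo, chol, gluc, smoke, alco, active, cardio
    [18393, 2, 168, 62, 110, 80, 1, 1, 0, 0, 1, 0],
    [20228, 1, 156, 85, 140, 90, 3, 1, 0, 0, 1, 1],
    [18857, 1, 165, 64, 130, 70, 3, 1, 0, 0, 0, 1],
]

# The last column ("cardio") is the target y; the rest are the features x.
X = [row[:-1] for row in rows]
y = [row[-1] for row in rows]

print(len(X[0]))  # 11 features per observation
print(y)          # [0, 1, 1]
```

Any supervised learner is then fitted on (X, y) pairs in exactly this shape.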
Supervised machine learning as function approximation
● We typically make the assumption that there exists an underlying functional relationship between x and y:
○ y = f(x) + noise
● The goal of supervised machine learning is to learn an approximation g(x) of f(x) from data, which can be used to predict y for new observations

(Figure: a single observation row — age 18393, gender 2, height 168, weight 62, ap_hi 110, ap_lo 80, cholesterol 1, gluc 1, smoke 0, alco 0, active 1 — is mapped by g(x) to the prediction cardio = 0.)
Parameterized models
● In the case of deep learning, the mathematical expression used to approximate f(x) has some free parameters
● We call this function g(x; w), where w are the parameters which determine what the function computes
○ We can think of the parameters w as an index into a set of possible functions
● Learning is then a search problem: find the member of this set (the setting of w) which best approximates f(x)

(Figure: the set of all functions parameterized by w; each element corresponds to one setting of the parameters, and some settings correspond to good approximations of f(x).)

Parameterized models (continued)
• A parametric model like a neural network can be thought of as a filter
• It has an input and an output
• The parameters correspond to the dials we can turn to decide how the input is transformed to produce the output
Generalization
● We want the function to be general — to learn the underlying phenomenon
● The goal is not to directly minimize the error on the training data, but on new data not seen during training
● A flexible g(x; w) can have many settings which minimize the training error without being good in general
● Machine learning as a practice is about finding function families which perform well in general as they minimize their error on the training data

(Figure: in parameter space, the settings corresponding to good approximations of f(x) on the training dataset only partially overlap the settings that are good in general.)
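"Learning as search over parameter settings" can be made concrete with a toy model family g(x; w) = w·x; the data and grid below are illustrative, not from the talk.

```python
# Toy illustration of learning as search: the model family is
# g(x; w) = w * x, and we search for the setting of w that minimizes
# squared error on training data generated from f(x) = 2x plus noise.
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # (x, y) pairs, y ~ 2x

def loss(w):
    return sum((y - w * x) ** 2 for x, y in train)

# Search over a grid of candidate parameter settings (the "index" into
# the set of possible functions).
candidates = [i / 10 for i in range(0, 50)]
best_w = min(candidates, key=loss)
print(best_w)  # 2.0 — the setting closest to the underlying slope
```

A real deep network searches the same way in principle, but over millions of parameters with gradient descent instead of a grid.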
Representing molecules for machine learning
How to represent molecules
• Many different representations exist for working with computational methods on molecules:
– Molecular graphs
– String representations, e.g. SMILES (C1=CC(=C(C=C1CCN)O)O) or InChI (InChI=1S/C6H11N2O4PS3/c1-10-5-7-8(6(9)16-5)4-15-13(14,11-2)12-3/h4H2,1-3H3)
– Geometrical representations


How to represent molecules (continued)
• These representations vary in size
• Traditional machine learning methods typically require fixed-dimensional features


Historical representation of molecules for machine learning
• Historically, the machine learning methods used in cheminformatics have relied on:
– Fixed-dimensional molecular descriptors and structural fingerprints
– Similarities induced by graph kernels (for use in e.g. Support Vector Machines)


Molecular descriptors and fingerprints
• Molecular descriptors are properties either determined by experiment or calculated from the molecule:
– Molecular weight
– Number of atoms of different types, or number of H-bond donors
– Quantitative features of substructures
– Solubility coefficients
• Fingerprints are binary indicator variables for the presence of substructures, similar to bag-of-words or bag-of-n-grams in NLP
– Each substructure is determined by a manually crafted rule, based on insight into which substructures are important for predicting activity
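The bag-of-substructures idea can be sketched in a few lines. Real fingerprints (MACCS, ECFP, …) match graph substructures with chemistry-aware rules; as a simplified stand-in, this toy version matches substrings of the SMILES text, and the pattern list is hypothetical.

```python
# Toy "fingerprint": binary indicators for the presence of substructure
# patterns, analogous to bag-of-words in NLP. Matching substrings of the
# SMILES is a simplification of matching graph substructures.
PATTERNS = ["O", "N", "c1ccccc1", "C(=O)"]  # hypothetical pattern list

def fingerprint(smiles):
    return [1 if p in smiles else 0 for p in PATTERNS]

print(fingerprint("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin: [1, 0, 1, 1]
```

Each molecule, whatever its size, is mapped to the same fixed-length binary vector — which is exactly what traditional fixed-dimensional learners need.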


Feature learning instead of engineering
• Learn the important features from the molecular structure
• How do we perform learning on unbounded inputs?

Jha, D., Ward, L., Paul, A., et al. "ElemNet: Deep Learning the Chemistry of Materials From Only Elemental Composition." Sci Rep 8, 17593 (2018). doi:10.1038/s41598-018-35934-y
Neural Networks
Neural networks learn representations

(Figure: a learnt encoder maps a source speech waveform to an embedding; a learnt decoder maps the embedding to the target text string "It is a truth universally acknowledged".)

• Learn an encoder which embeds the data point in a high-dimensional vector space
• Distances between embeddings of different inputs should make solving tasks easier (e.g. audio waveform embeddings close together should correspond to similar transcriptions)

Transfer learning and multitask learning

(Figure: one learnt encoder embeds the source speech waveform; separate learnt decoders produce "It is a truth universally acknowledged" for speech recognition, "Mrs. Bennet" for speaker identification, and "ingratiating" for tone analysis.)

• The encoder–decoder setup allows for multitask learning and transfer learning
• The same encoder can embed the input data in a space which is useful for many different decoders


Anatomy of a deep neural network

(Figure: adaptive functions tailored to our data, e.g. convolutional layers for images → a general adaptive function (dense layer) → a linear prediction, e.g. logistic regression.)


Image representations in neural nets

(Figure: a neural image embedding followed by a simple linear classifier, i.e. logistic regression.)


Molecular embeddings

Mahé, Pierre, et al. "Graph kernels for molecular structure–activity relationship analysis with support vector machines." Journal of Chemical Information and Modeling 45.4 (2005): 939–951.


Neural Networks for Molecules
Neural Networks for molecules
• There have been four main directions for dealing with molecules and neural networks:
– Traditional machine learning – a traditional neural network with dense layers fed with molecular descriptors and/or structural fingerprints. Tree ensembles often perform very well with these representations.
– SMILES-based – a sequence neural network reads the molecule description as a string
– Graph-based – neural networks for mathematical sets process the molecule as a graph
– Geometrical – geometrical neural networks process the molecule as a point cloud represented in 3D space
SMILES-based
• A sequence neural network (recurrent or self-attention) reads in a SMILES representation of the molecule
• Properties are predicted from the latent-space embedding of the molecule
• Often, pretraining is done by reconstructing the input string using a decoder network with the latent embedding as input
• Has worked really well for design-space exploration (drug design)

Gómez-Bombarelli, Rafael, et al. "Automatic chemical design using a data-driven continuous representation of molecules." arXiv preprint arXiv:1610.02415 (2016).
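Before a sequence network can read a SMILES string, it must be split into tokens; a common convention is a regex that keeps bracket atoms and two-letter elements together. This is a pared-down sketch of such a tokenizer, not the exact regex used by any particular model.

```python
import re

# Simplified SMILES tokenizer: bracket atoms ([C@@H], [nH], ...) and the
# two-letter elements Br/Cl are kept as single tokens; everything else is
# split character by character. (Real tokenizers also handle %nn ring
# closures and more elements.)
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|[A-Za-z]|[0-9]|[=#()+\-]")

def tokenize(smiles):
    return TOKEN_RE.findall(smiles)

print(tokenize("CC(=O)N[C@@H]1CCl"))
# ['C', 'C', '(', '=', 'O', ')', 'N', '[C@@H]', '1', 'C', 'Cl']
```

The resulting token sequence is what the recurrent or self-attention network actually consumes, one embedding per token.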
Problems with SMILES
• The same molecule can be represented by many different SMILES strings
• Since SMILES linearizes the graph, locality can be harder to learn (two adjacent atoms can be far apart in the SMILES string)
• While it captures the same information as the mathematical graph, a lot of it is implicit
• The neural network needs to learn an implicit SMILES parser to solve the main predictive task
• There are neural networks designed for graphs, which can be made to work directly with molecules as graphs

Chen, Benson, et al. "Learning to Make Generalizable and Diverse Predictions for Retrosynthesis." arXiv preprint arXiv:1910.09688 (2019).


Mathematical graphs for neural networks
• In a neural network, each node in a graph is represented as a mathematical vector
• The graph is represented as a set of vectors; additionally, the edges might also have attributes represented by vectors


Graph Neural Networks for molecules
• For each node in a graph, we aggregate the vectors of its neighbourhood and transform them with a transition function
– Each such application is a layer of the neural network
• This is also referred to as message passing neural networks
– Information between two nodes which are not direct neighbours must be passed through the nodes on a path between them
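One round of the aggregate-then-transform step can be sketched on a toy graph; the "transition function" here (average the neighbour sum with the node's own vector) is a stand-in for what would be a learned layer in a real GNN.

```python
# One round of message passing on a 3-node path graph 0-1-2, following
# the slide: for each node, aggregate (sum) the vectors of its
# neighbours, then transform. The transform is a hand-picked stand-in
# for a learned neural network layer.
neighbours = {0: [1], 1: [0, 2], 2: [1]}
h = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}  # node feature vectors

def step(h):
    new_h = {}
    for node, nbrs in neighbours.items():
        agg = [sum(h[n][i] for n in nbrs) for i in range(2)]   # aggregate
        new_h[node] = [(a + s) / 2 for a, s in zip(agg, h[node])]  # transform
    return new_h

h1 = step(h)
print(h1[1])  # [1.0, 1.0] — node 1 now mixes information from 0 and 2
```

Stacking several such steps (layers) lets information travel further along the graph, one hop per layer.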
Graph neural network

(Figure: the neural network is applied to each node and its context.)

Graph Neural Network – parallel view

(Figure: the same computation viewed as being applied to all nodes in parallel.)


Graph Neural Networks for molecules
• For molecules, only using the direct neighbourhood according to bond structure is likely overly restrictive
– The model is forced to focus on local interactions, and has difficulty learning properties of larger structures
• The number of layers is a ceiling on the path length we can learn relationships over
– The longer the path, the harder the relationship is to learn
– Transporting information through a nested function gives vanishing-gradient issues
• It has been noted that learning global molecular features such as molecular weight is difficult for Graph Neural Networks
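The layer-count ceiling is easy to see by tracking how far a node can "see" after k rounds of message passing; the 6-atom chain below is an illustrative toy graph (think of a hexane backbone).

```python
# After k message-passing layers, a node has only received information
# from nodes at most k edges away. Toy sketch on a 6-node chain.
neighbours = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 5] for i in range(6)}

def receptive_field(node, layers):
    seen = {node}
    for _ in range(layers):
        seen |= {n for v in seen for n in neighbours[v]}
    return seen

print(receptive_field(0, 2))  # {0, 1, 2}: two layers reach two bonds out
print(receptive_field(0, 5))  # all six atoms
```

A global property such as molecular weight depends on every atom, so a network with too few layers cannot, even in principle, collect the required information at any single node.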
A case for SMILES
• While SMILES “hides” the graph behind a linearization, it completely encodes the graph structure together with bond information
– Graph Neural Networks often have to add specialized architecture to encode the graph structure, especially for edge attributes like different types of bonds
• The non-uniqueness of the SMILES representation can even be a positive: it allows for straightforward data augmentation (e.g. Bjerrum, Esben Jannik. "SMILES enumeration as data augmentation for neural network modeling of molecules." arXiv preprint arXiv:1703.07076 (2017).)


A case for SMILES (continued)
• One important application of machine learning for chemistry is generative models, e.g. for de novo drug design or chemical reaction prediction
• Generating graphs with GNNs is difficult: we need to generate the edge set, and can have trouble with set matching. We typically need to induce some ordering on the edge set, which can get complicated. See Jin, Wengong, Regina Barzilay, and Tommi Jaakkola. "Hierarchical Generation of Molecular Graphs using Structural Motifs." arXiv preprint arXiv:2002.03230 (2020).
• A SMILES representation introduces an ordering of the generated elements, which makes learning much easier. It doesn’t require any specialized neural architecture.


Transformers are like GNNs where all nodes are the context
• For a transformer, the graph information needs to be added in the reduction and feature functions
• Typically as additional pair-wise information


Transformers for molecules

Chen, Benson, Regina Barzilay, and Tommi Jaakkola. "Path-augmented graph transformer network." arXiv preprint arXiv:1905.12712 (2019).


Molecular Transformer

Schwaller, Philippe, et al. "Molecular transformer: A model for uncertainty-calibrated chemical reaction prediction." ACS Central Science 5.9 (2019): 1572–1583.


Pretraining and Self-supervised learning
Performance of deep networks given compute, data and parameters

Nogueira, Rodrigo, Zhiying Jiang, and Jimmy Lin. "Document ranking with a pretrained sequence-to-sequence model." arXiv preprint arXiv:2003.06713 (2020).
Transfer Learning dominates NLP
• Natural Language Processing is shifting from task-specific solutions to reusing general language models
• A single huge model (hundreds of millions of parameters) is trained on a source task with huge amounts of data (tens to hundreds of gigabytes of text)
• The same model is re-used as a frontend to obtain state-of-the-art results on a very diverse set of target tasks


Transfer Learning and Transformers in NLP

(Chart: NLP progress on the GLUE tasks — MNLI-mm, QQP, QNLI, SST-2, CoLA, STS-B, MRPC, RTE — comparing pre-OpenAI SOTA, BiLSTM+ELMo+Attn, OpenAI GPT, BERT_large and RoBERTa.)

Howard, Jeremy, and Sebastian Ruder. "Universal language model fine-tuning for text classification." arXiv preprint arXiv:1801.06146 (2018).


Pretraining huge Transformer models tailored for molecules will do for cheminformatics what BERT has done for NLP
Transfer Learning

L. Torrey and J. Shavlik, "Transfer learning," Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, vol. 1, pp. 242–264, 2009.

Ruder, Sebastian. Neural Transfer Learning for Natural Language Processing. Diss. NUI Galway, 2019.
Transfer learning and setting space

(Figure: in the space of parameter settings, the settings corresponding to good approximations of the target task on its training set overlap both with those for the source task on its training set and with those good for the target task in general.)

Transfer learning in neural networks
1. Train a deep network on a task where vast amounts of data are available
2. Remove the parts specific to the first task
3. Train a new task-specific part for the new task

(Figure: an image encoder feeding an output head for the source task; the head is replaced by a new output head for the target task.)
Limits of supervised transfer learning
• A model trained on a supervised task only learns a representation suitable for solving that task
– Learning to separate leopards from lions could be solved by only learning to recognize leopard spots
– Such a representation would not be useful for more general image understanding
– Pretraining with this task could lead to worse performance than no pretraining at all: negative transfer


Reducing negative transfer
• Each point is performance on a fine-grained protein function prediction task
• Adding additional self-supervised tasks improves transfer

Hu, Weihua, et al. "Strategies for Pre-training Graph Neural Networks." ICLR (2020).
Self-supervised learning

Yann LeCun, https://www.youtube.com/watch?v=7I0Qt7GALVk


Opportunities for molecular neural networks
• Neural networks for molecules are at the start of a Cambrian explosion
– Lots of new ideas are flooding in from other fields, in particular Natural Language Processing
• With regard to transfer learning, one interesting question stands out: how should we design strong self-supervised tasks that
– force the model to retain as much information as possible about the molecule, preferably on many scales (local to global)
– force the model to learn something about chemistry?
• What are good representations of molecules which allow for strong self-supervised tasks? How useful is geometrical information?
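One candidate self-supervised task, borrowed from BERT-style masked language modelling in NLP, is to hide a fraction of the tokens of a SMILES string and train the model to reconstruct them. This sketch only builds the (input, targets) training pairs; the model itself, and the character-level tokenization, are simplifications.

```python
import random

# Build masked-token training pairs from a SMILES string: each masked
# position becomes a reconstruction target for the model.
def mask_tokens(tokens, fraction=0.15, rng=None):
    rng = rng or random.Random(0)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < fraction:
            masked.append("[MASK]")
            targets[i] = tok       # the model must recover this token
        else:
            masked.append(tok)
    return masked, targets

tokens = list("CC(=O)Oc1ccccc1")   # character-level tokens for simplicity
masked, targets = mask_tokens(tokens)
print(masked, targets)
```

Whether such string-level masking forces the model to learn chemistry, rather than SMILES syntax, is exactly the open question posed above.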


Where are we on the scaling line?
• Is there still room for improvement on the benchmark datasets we have?
• What is the irreducible error in the datasets we evaluate our models on?
• If we are close to it, any progress we see is just tighter overfitting to the test set

Hestness, Joel, et al. "Deep learning scaling is predictable, empirically." arXiv preprint arXiv:1712.00409 (2017).


Erik Ylipää
erik.ylipaa@ri.se


RISE – Research Institutes of Sweden AB · info@ri.se · ri.se


Molecular de-novo design and synthesis prediction
10.00 – 11.00
Esben Jannik Bjerrum, AstraZeneca, Sweden


Esben Jannik Bjerrum
Bio
Esben Jannik Bjerrum completed his PhD in Computational Chemistry at Copenhagen University in 2008. He has since worked in academia as a postdoc, in industry as an IT specialist, and as a self-employed IT consultant. In 2017, his independent research resulted in several contributions to the deep-learning-for-chemistry renaissance. He joined AstraZeneca in 2018, where he currently works on the development of de novo design algorithms and deep-learning-assisted retrosynthetic planning. He is the lead blogger of cheminformania.com.

Abstract
Molecular de-novo design and synthesis prediction
Drug discovery is the earliest part of developing new medicines for unmet medical needs of patients. Transforming early hits into clinical candidates is a years-long project with multiple rounds of designing new compounds with proposed better properties, synthesizing the compounds in the laboratory, testing their properties, and analyzing the results for new rounds of this design-make-test-analyze cycle. Our aim is to utilize artificial intelligence to speed up these processes and allow us to create new medicines faster. In the Molecular AI group, we develop tools to answer two questions: what molecules to make, and how to make them.
Part 1 will cover examples from the design phase, where we use a sequential molecular notation format, SMILES, in combination with deep neural networks inspired by natural language processing. Using recurrent neural network architectures, we can read in molecules in the SMILES format, but also, and more importantly, read out molecules in SMILES format. The latter property is what enables us to handle massive molecular spaces by probabilistic sampling. The generation can be steered in directions of interest by different techniques, such as transfer learning, autoencoders, reinforcement learning and conditional recurrent networks. Alternative architectures allow us to generate molecular series or generate subtle changes in molecules to induce adjustments of existing compounds.
Part 2 will address the second question: how to make the suggested molecules. Using large databases of known reactions pooled from public, in-licensed and internal sources, we extract reaction templates which cover the core of the reaction. To search for potential synthetic routes, we use Monte Carlo tree search coupled to neural network policies. The neural networks are trained to prioritize our template library and predict the most likely templates to be applied. Thus, the search breadth of the tree search is reduced by orders of magnitude, and the search can be performed in seconds to minutes. However, the policy networks prioritize often-used reactions, due to the imbalance of the reaction databases, and overlook more seldom-used and specialized reactions. By training policy networks on selected reaction classes, such as ring-forming reactions, special attention is given to these reactions, which can be injected into the tree search or used as alternative options in interactive path-planning. Interestingly, SMILES-based translation can also play a role in reaction informatics.
Molecular de-novo design and synthesis prediction
Esben Jannik Bjerrum, Principal Scientist
Machine Learning in Chemical Designs and Functions, RISE workshop, 29 May 2020
Agenda
• Introduction to AstraZeneca and Drug Design and Development

Part 1: De-novo design
• SMILES – a chemical language
• Recurrent neural networks
• Data augmentation strategies
• Unbiased molecular generation
• Controlling the molecular generation
• Next generation algorithms

Part 2: Reaction prediction
• Retrosynthesis planning with MCTS tree search
• Scoped prediction models
• Artificial labels
• Condition prediction
• Transformer models
AstraZeneca: Global dimensions

(Infographic, 2018 figures: total revenue $22.1bn, down 2% over 2017; product sales $21.1bn, up 4% over 2017; externalisation revenue $1bn; $5.9bn invested in R&D, with research across five countries; 64.6k employees; 149 projects in clinical development and eight NMEs in late-stage development; 45% of senior roles filled by women; 23 NME approvals in 2018, and 71 since 2014; 30 manufacturing sites in 17 countries.)
Gothenburg: one of AstraZeneca’s three global R&D centres

(Map: strategic R&D centres in Gothenburg, Cambridge and Gaithersburg; additional laboratories in Boston, California, Osaka and Shanghai.)
Drug Discovery is just the beginning…

Life-cycle of a medicine
We are one of only a handful of companies to span the entire life-cycle of a medicine, from research and development to manufacturing and supply, and the global commercialisation of primary care and speciality care medicines.


Design-Make-Test-Analyse (DMTA) cycles in Drug Discovery

(Figure: the cycle Design → Make → Test → Analyse, starting from a drug target.)

• Chemical starting point (“Hit”), found through HTS, DEL, fragment screening or knowledge: weakly active, target-unselective, toxicity risk, low metabolic stability
• Candidate drug, reached after ~3 years: highly potent, effective in in vivo models, metabolically stable, no toxicity issues
• Multiple DMTA cycles, 4–6 weeks per cycle, with hand-overs between multiple labs
• The challenge: find ways to speed up and improve the process using AI
Drug Design
The Molecular AI group provides tools for the projects:
• What to make next? De novo design, with a multi-parameter scoring function
• How to make it? Retrosynthesis
Simplified Molecular Input Line Entry Specification (SMILES)
• A sequence format for molecules
• Allows us to use the progress made in natural language processing in recent years ☺
Training and Sampling using Recurrent Neural Networks
• Do the same computation for different time steps
• Let previous computations influence the current computation

(Figures: the training setup and the sampling setup of the recurrent network.)
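The sampling loop can be shown in miniature: starting from a begin token, repeatedly draw the next character from the model's predicted distribution until an end token appears. Here the "model" is a hypothetical fixed table of next-character probabilities rather than a trained RNN, and the two-character alphabet is purely illustrative.

```python
import random

# Hypothetical next-character distributions standing in for the RNN's
# softmax output. "^" is the begin token, "$" the end token.
probs = {
    "^": [("C", 0.8), ("O", 0.2)],
    "C": [("C", 0.5), ("O", 0.2), ("$", 0.3)],
    "O": [("C", 0.7), ("$", 0.3)],
}

def sample(rng, max_len=20):
    out, tok = [], "^"
    while len(out) < max_len:
        chars, weights = zip(*probs[tok])
        tok = rng.choices(chars, weights=weights)[0]  # probabilistic draw
        if tok == "$":           # end-of-sequence token
            break
        out.append(tok)
    return "".join(out)

rng = random.Random(42)
print([sample(rng) for _ in range(3)])  # e.g. short C/O chains
```

Because each draw is probabilistic, repeated sampling yields different strings — which is exactly what lets the trained network explore a huge molecular space.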

De novo compound generation using recurrent neural networks
• The generative process allows us to generate novel chemical structures in silico
• Molecules generated follow the properties of the training set
• Useful as expansion of libraries for in silico HTS?
• Needs fairly large datasets, or refocusing

(Figure: property distributions; blue — ZINC fragment-like, green — ZINC drug-like; solid — training set, dashed — generated.)

Bjerrum, E. J.; Threlfall, R. Molecular Generation with Recurrent Neural Networks (RNNs). 2017. http://arxiv.org/abs/1705.04612
Why? Generation of novel compounds in the 10^60 chemical space!
• Spaces of 10^10–10^12 molecules can be handled explicitly in databases; the 10^60 chemical space must be handled probabilistically

Where’s the impact?
• Use for de novo molecular design
• Scaffold hopping
• Novelty
• Virtual screening
• Library design
Transfer learning
1. Train on a large dataset of molecules
2. Retrain on a smaller dataset of molecules with desired properties

Segler, M. H. S.; Kogej, T.; Tyrchan, C.; Waller, M. P. Generating Focused Molecule Libraries for Drug Discovery with Recurrent Neural Networks. ACS Cent. Sci. 2018. https://doi.org/10.1021/acscentsci.7b00512
Assessing the quality of molecular generative models

1) Arús-Pous J, Blaschke T, Ulander S, et al (2019) Exploring the GDB-13 chemical space using deep generative models. J Cheminform 11:20. https://doi.org/10.1186/s13321-019-0341-z
2) Arús-Pous J, Johansson SV, Prykhodko O, et al (2019) Randomized SMILES strings improve the quality of molecular generative models. J Cheminform 11:71. https://doi.org/10.1186/s13321-019-0393-0
Data augmentation
• In image models: zooming, cropping, mirroring, flipping, rotation, hue, color, contrast, etc., and combinations thereof
• Canonical SMILES ensures a 1:1 relationship between molecule and SMILES
• SMILES enumeration generates multiple SMILES for the same molecule

Bjerrum, Esben Jannik. 2017. "SMILES Enumeration as Data Augmentation for Neural Network Modeling of Molecules." http://arxiv.org/abs/1703.07076
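Why enumeration works: a molecular graph can be written out starting from any atom, and each starting point yields a different string for the same molecule. Real enumeration randomizes the atom order and lets a cheminformatics toolkit such as RDKit write a non-canonical SMILES; this toy DFS serializer over a plain graph (a C-C-C-O chain) only shows the principle and does not produce valid SMILES.

```python
# Toy graph serializer: the same 4-atom chain written from different
# starting atoms gives different strings, all denoting one molecule.
atoms = ["C", "C", "C", "O"]
bonds = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

def serialize(start):
    out, stack, seen = [], [start], set()
    while stack:                     # depth-first traversal
        a = stack.pop()
        if a in seen:
            continue
        seen.add(a)
        out.append(atoms[a])
        stack.extend(bonds[a])
    return "".join(out)

strings = {serialize(a) for a in range(4)}
print(strings)  # several distinct strings, one molecule
```

Training on many such variants of each molecule is the augmentation: the model sees the same chemistry under many surface forms, much like image models see rotated and cropped copies of each photo.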
Random SMILES in practice

21
SMILES enumeration increases chemical space coverage
(More uniform, more complete; GDB-13 is 975 million molecules.)

Set   SMILES       Validity  Completeness
1M    Canonical    0.994     0.836
1M    Randomized   0.999     0.953
10K   Canonical    0.905     0.445
10K   Randomized   0.974     0.715
1K    Canonical    0.504     0.167
1K    Randomized   0.812     0.392

Arús-Pous, J.; Johansson, S. V.; Prykhodko, O.; Bjerrum, E. J.; Tyrchan, C.; Reymond, J.-L.; Chen, H.; Engkvist, O. Randomized SMILES Strings Improve the Quality of Molecular Generative Models. J. Cheminform. 2019, 11 (1), 71. https://doi.org/10.1186/s13321-019-0393-0
SMILES-based autoencoders

Example: Gómez-Bombarelli, Rafael, et al. 2018. "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules." ACS Central Science 4 (2): 268–76. (Preprint from 2016.)
A SMILES string is not a molecule
The string c1ccccc1 and the molecule it denotes (benzene) are different objects; the string is just one serialization of the molecular graph.
HeteroEncoders
• Also possible with InChIs and CAS names: Winter et al. 2018
• From chemical images to SMILES: Bjerrum & Sattarov 2018
Latent vectors as a base for quantitative structure–activity (QSAR) models

Figure adapted from: Bjerrum, Esben Jannik, and Boris Sattarov. 2018. "Improving Chemical Autoencoder Latent Space and Molecular De Novo Generation Diversity with Heteroencoders." Biomolecules.
Latent Space gets more chemicaly relevant

RMSEP of 5 datasets modelled using deep neural networks

             IGC50   LD50   BCF    Solubility   MP    Norm. mean
                                                      improvement
Enum2Enum    0.43    0.54   0.71   0.65         37    0.75
Can2Enum     0.46    0.54   0.69   0.69         37    0.77
Enum2Can     0.46    0.57   0.71   0.66         38    0.78
Can2Can      0.53    0.62   0.79   0.87         43    0.89
ECFP4        0.62    0.59   0.94   1.21         43    1.00

ECFP4 performance low when compared to literature, Enum2Enum close

27
Bjerrum, Esben Jannik, and Boris Sattarov. 2018. "Improving Chemical Autoencoder Latent Space
and Molecular De Novo Generation Diversity with Heteroencoders." Biomolecules.
REINVENT: An In Silico mini-DMTA cycle

Generative RNN

Design

Reinforcement The Value:


Learning Optimizes Make Molecules for DMTA cycle
the RNN Analyse
Produces novel scaffolds and
Test improved compound
suggestions for drug discovery
projects
Predicted Compound
properties (e.g. QSAR) Less real world DMTA cycles
=> Saved time
Open Source:
https://github.com/MolecularAI/Reinvent
34
Optimization of molecular properties

Decoder
based:

We
currently
use
REINVENT

Blaschke, Thomas; Arús-Pous, Josep; Chen, Hongming; Margreitter, Christian; Tyrchan, Christian;
Engkvist, Ola; et al. (2020): REINVENT 2.0 – an AI Tool for De Novo Drug Design. ChemRxiv. Preprint.
35
https://doi.org/10.26434/chemrxiv.12058026.v2
Further options in molecular generation and design
Optimizations on REINVENT Better molecular optimizations via
Graph based molecular representations

Novel neural network architectures


AI Library Generation
Directly Steered Optimization Chemical intuition via MMP Transcoder

*
36
Return of the encoder architectures: Conditional RNN’s

Encoder Decoder

LogP
TPSA
RDKit MolWeight Decoder
HBA
HBD
Control of Properties

40
Kotsias, P.-C.; Arús-Pous, J.; Chen, H.; Engkvist, O.; Tyrchan, C.; Bjerrum, E. J. Direct Steering of de Novo
Molecular Generation with Descriptor Conditional Recurrent Neural Networks. Nat. Mach. Intell. 2020, 2 (May).
Molecular optimization
• Goal
– Given a starting molecule, generate molecules with desired property
changes while maintaining similarity to the original molecule
• Capturing chemist intuition in respect to chemical transformations that change
the property of molecules
– Matched molecular pairs
• Methods
– Seq2Seq
– Transformer
– Graph-based

47
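Matched-molecular-pair style transformations can be sketched at the string level. Real MMP tools cut single bonds on the molecular graph and mine the pairs from data; the fragment-swap rules below are purely hypothetical illustrations:

```python
# lhs >> rhs fragment swaps (hypothetical rules, not mined from real MMP data)
RULES = [("Cl", "F"), ("C(=O)O", "C(=O)N")]

def apply_rules(smiles):
    """Return one transformed SMILES per applicable rule (first match only)."""
    out = []
    for lhs, rhs in RULES:
        if lhs in smiles:
            out.append(smiles.replace(lhs, rhs, 1))
    return out

print(apply_rules("CC(=O)Oc1ccccc1Cl"))
# ['CC(=O)Oc1ccccc1F', 'CC(=O)Nc1ccccc1Cl']
```

The Seq2Seq/Transformer approaches listed above learn such transformations implicitly from pairs of (starting molecule, optimized molecule) instead of applying explicit rules.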
Scaffold-based decoration using SMILES

1) Arús-Pous J, Patronov A, Bjerrum EJ, et al (2020) SMILES-based deep generative scaffold decorator for de novo
drug design. ChemRxiv
48
Conclusions part 1

• SMILES + NLP => Fast Results


• Data augmentation tricks improve performance
• We can read-in molecules:
– e.g. Property prediction
• We can read-out molecules:
– Probabilistic handling of huge databases
– Steered molecular generation with several
options:
• Reinforcement learning
• Conditional RNNs
• Encoder-decoder architectures
– More finetuned tasks also doable:
• Ligand series generation
• MMP generation

50
Part 2: How to make the compounds?

Synthesis Prediction

52
From Design to Compound: Make step

Design
NMP, MeCN, 93%

Make
Analyze

Test C Y
R1 + R2 P

53
Different Objectives for Synthesis Prediction

Condition Prediction:            R1 + R2 --?--> P   (conditions and yield unknown)
Forward / Reaction Feasibility:  R1 + R2 --C--> ?   (product and yield unknown)
Backward:
  Retro-synthesis, 1-step:       P => ? + ?
  Retrosynthetic planning:       P => ? => ? => ... (full tree of disconnections)
54
Chemistry Reaction Data

MedChem PharmSci
ELN ELN
Flatfiles
Reaxys
Flat
Flat files
files Flatfiles
Pistachio files
Flat
Flat files
Flat Files USPTO
Flat Files

ChemConnect

ReactionConnect
Design

Analyse DMTA Make

Test

Predictive Reaction
55 iLab, MedChem, PharmDev Models
Template Extraction

Dataset                  Size         Templates Extracted
Pistachio (incl. PGs)    6,839,427    308,951
USPTO 1976-2016          3,748,191    252,877
Pistachio + USPTO        10,587,618   315,286
All Data                 >1,000,000

Explicit handling of protection groups increases template quality

Retro Reaction Templates Extracted


Searching the possible reaction tree
Monte Carlo Tree Search

Branching Factors: a Neural Network selects and prioritizes
=> More Manageable Problem

                 Chess   Go     Retrosynthesis
Search Breadth   ~35     ~250   > 1,000,000
Search Depth     ~80     ~150   ~12

Alpha Go architecture
Segler, M. H. S.; Preuss, M.; Waller, M. P. Planning Chemical
58 Syntheses with Deep Neural Networks and Symbolic AI. Nature 2018,
555 (7698), 604–610. https://doi.org/10.1038/nature25978 .
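The tree search can be illustrated with a toy example in which strings stand in for molecules and a lookup table stands in for retro-templates (the real tools use a neural-network policy with Monte Carlo Tree Search over hundreds of thousands of templates; the rules, stock set, and "molecules" below are all invented):

```python
# Toy retrosynthetic search: strings stand in for molecules, and "templates"
# split a target into simpler precursors.
TEMPLATES = {          # hypothetical disconnection rules, not real chemistry
    "ABCD": [("AB", "CD")],
    "AB":   [("A", "B")],
    "CD":   [("C", "D")],
}
STOCK = {"A", "B", "CD"}   # purchasable building blocks

def solve(target, depth=5):
    """Depth-limited search: return a route (list of disconnections) or None."""
    if target in STOCK:
        return []                       # nothing to do, already purchasable
    if depth == 0:
        return None                     # search horizon reached
    for precursors in TEMPLATES.get(target, []):
        sub_routes = [solve(p, depth - 1) for p in precursors]
        if all(r is not None for r in sub_routes):
            route = [(target, precursors)]
            for r in sub_routes:
                route.extend(r)
            return route
    return None                         # no applicable template led to stock

print(solve("ABCD"))  # [('ABCD', ('AB', 'CD')), ('AB', ('A', 'B'))]
```

MCTS replaces this exhaustive descent with guided sampling: the policy network proposes which few of the >1,000,000 possible disconnections are worth expanding.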
Results in Seconds to Minutes
Model: USPTO
Time taken: 3.26 s

59
Do more reactions equal better performance?

61
Overall
Performance as
measured for each
dataset

Thakkar, A.; Kogej, T.; Reymond, J.-L. L.; Engkvist, O.; Bjerrum,
E. J. Datasets and Their Influence on the Development of
Computer Assisted Synthesis Planning Tools in the
Pharmaceutical Domain. Chem. Sci. 2019, 11 (1).
https://doi.org/10.1039/C9SC04944D .
62
Making the tool available

The Value: Chemists can quickly get suggested routes/ideas to purchasable compounds.
Cheminformaticians can filter datasets into "synthesizable / not-synthesizable"

Web-GUI based on MIT MLDPS consortium tools

Scripting access via Python Objects

Jupyter based GUI

Open-sourced next week (+ Chemrxiv)


63
Improving the route-finding: Ring Breaker

• Extract templates corresponding to ring formations

• Create model specific to ring formations


• Model learns mapping to ring formations
without “distractions” from other reactions

Thakkar, A.; Selmi, N.; Reymond, J.-L.; Engkvist, O.; Bjerrum, E. J. "Ring Breaker": Neural
64 Network Driven Synthesis Prediction of the Ring System Chemical Space. J. Med. Chem. 2020.
https://doi.org/10.1021/acs.jmedchem.9b01919
Improving the route finding 2: Artificial Labels for filtering

T1
P R1 + R2

T2
P

Bjerrum, Esben Jannik; Thakkar, Amol; Engkvist, Ola (2020):


Artificial Applicability Labels for Improving Policies in
Retrosynthesis Prediction. ChemRxiv. Preprint.
65 960 000 000 000 REACTION MATCHES https://doi.org/10.26434/chemrxiv.12249458.v1
Template-free NLP approach to reaction prediction

• Using SMILES, a chemical reaction can be formulated as a translation
• Direct learning from data
• No templates with arbitrary cutoffs need to be extracted
• Good performance

(1) Schwaller, P.; Laino, T.; Gaudin, T.; Bolgar, P.; Hunter, C. A.; Bekas, C.; Lee, A. A. Molecular Transformer: A
69
Model for Uncertainty-Calibrated Chemical Reaction Prediction. ACS Cent. Sci. 2019, 5 (9), 1572–1583.
https://doi.org/10.1021/acscentsci.9b00576 .
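Before a transformer can treat reaction prediction as translation, reaction SMILES are split into tokens. A regex in the spirit of the Molecular Transformer preprocessing (the exact pattern below is a simplified assumption, not the published one):

```python
import re

# Splits SMILES into atom/bond/ring tokens; bracket atoms like [nH] stay whole.
PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])")

def tokenize(smiles):
    """Tokenize a (reaction) SMILES string for a sequence model."""
    return PATTERN.findall(smiles)

# A reaction SMILES: reactants >> product, translated token by token
print(tokenize("CC(=O)O.OCC>>CC(=O)OCC"))
```

Each token becomes one vocabulary entry, so the model never has to invent characters inside an atom bracket.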
Automating the DMTA cycle
• REINVENT is automated
AI REINVENT
design
• AiZynthFinder is
automated route
planning
Design • iLab is automated
synthesis

Make AiZynthFinder • 2018 De-novo design


Analyze
• 2019 Retrosynthesis
Test
• 2020 Combine with
Automation

• Increased automation is
iLab key to speed
72
Conclusions part 2

• Reaction Prediction is many different tasks


• Data driven retro-synthetic algorithms are
performant
• Public data perform nearly on-par with
proprietary data-sources
• Specialized neural networks can improve route
search performance
• Calculation of artificial labels can improve route
search performance
• SMILES based approaches are an interesting
alternative to extracted reaction core templates

• The increased automation in de novo design


and retrosynthetic planning will pave the way
for automating the DMTA cycle.

73
Toolkits – Source code - Links

ReInvent: https://github.com/MolecularAI/Reinvent
Molvecgen: https://github.com/EBjerrum/molvecgen
Deep Drug Coder:https://github.com/pcko1/Deep-Drug-Coder
AiZynthFinder: TBR: https://github.com/MolecularAI
Blogposts: www.cheminformania.com
74
Acknowledgements
Rocio Mercado, Post.doc
Molecular AI group: Tomas Bastys, Post.doc
Ola Engkvist, Associate Director, Molecular AI Simon Johansson, Ph.D Student WASP
Panagiotis-Christos Kotsias, Graduate Scientist, Hampus Gummesson Svensson, Ph.D Student WASP
Graduate Programme Sebastian Nilsson, Master Student
Josep Arus Pous, Ph.D student, BIGCHEM Tobias Rastemo, Master Student
Jiazhen He, post.doc. Molecular AI Emil Sandström, Master Student
Amol Thakkar, Ph.D student, BIGCHEM Jonathan Sundkvist, Master Student
Dean Sumner, Graduate Scientist, Graduate Programme Huifang You, Master Student
Veronika Chadimova, Graduate Scientist, Graduate Carl Blomgren, Master Student
Programme
Samuel Genheden, Data Scientist/Software Engineer Collaborators:
Atanas Patronov, Associate Principal Scientist Prof. Dr. Jean-Louis Reymond · Dept. of Chemistry &
Isabella Feierberg, Associate Principal Scientist Biochemistry University of Berne
Thierry Kogej, Associate Principal Scientist Christian Tyrchan, Team Leader - Computational
Preeti Iyer, Machine Learning and Cheminformatics
Experts Boris Sattarov, Informatics Programmer, Science Data
Christian Margreitter, Data Scientist Software LLC
Papadopoulos, Kostas, Associate Principal Scientist Hongming Chen, Professor, Centre of Chemistry and
Lewis Mervin, Machine Learning and Cheminformatics Chemical Biology, Guangzhou, China
Expert Nidhal Selmi, Research Outsourcing Specialist, Hit
Christos Kannas Machine Learning/Cheminformatics Discovery
Expert Peter Varkonyi, Senior Research Scientist |
Alexey Voronov, Data Scientist/Software Engineer
75 Computational Chemistry
Questions

76
Confidentiality Notice
This file is private and may contain confidential and proprietary information. If you have received this file in error, please notify us and remove
it from your system and note that you must not copy, distribute or take any action in reliance on it. Any unauthorized use or disclosure of the
contents of this file is not permitted and may be unlawful. AstraZeneca PLC, 1 Francis Crick Avenue, Cambridge Biomedical Campus,
Cambridge, CB2 0AA, UK, T: +44(0)203 749 5000, www.astrazeneca.com

77
GDB and the chemical space

11.20 – 11.50

Jean-Louis Reymond, University of Bern, Switzerland

7 RISE — Research Institutes of Sweden


Jean-Louis Reymond
Bio
https://orcid.org/0000-0003-2724-2942
Jean-Louis Reymond is a Professor of Chemistry at the University of Bern, Switzerland. He studied Chemistry and Biochemistry at the ETH
Zürich and obtained his Ph.D. at the University of Lausanne on natural products synthesis (1989). After a Post-Doc and Assistant Professorship
at the Scripps Research Institute, he joined the University of Bern (1997). His research focuses on the enumeration and visualization of
chemical space for small molecule drug discovery, the synthesis of new molecules from GDB (http://gdb.unibe.ch), and the design and
synthesis of peptide dendrimers and polycyclic peptides as antimicrobials and for nucleic acids delivery. He is the author of > 300 scientific
publications and reviews.

Abstract
GDB and the Chemical Space
Chemical space is a concept to organize molecular diversity by postulating that different molecules occupy different regions of a mathematical
space where the position of each molecule is defined by its properties. Our aim is to develop methods to explicitly explore chemical space in
the area of drug discovery. We have enumerated all possible molecules following simple rules of chemical stability and synthetic feasibility to
form the Generated DataBases (GDB). Exploring GDB in comparison to known molecules reveals that vast areas of chemical space are still
entirely unknown yet are accessible for experimental exploration by straightforward synthetic methods. I will discuss how to visualize chemical
space and exemplify the discovery and synthesis of new scaffolds for drug discovery, and how we use machine learning methods to address
target predictions and synthesis predictions of GDB molecules. http://gdb.unibe.ch
GDB and the Chemical Space

Jean-Louis Reymond
29 May 2020, RISE AI workshop
http://gdb.unibe.ch

1. The GDB project


2. Atom-pairs
3. Molecular shingles

1
Philippe Schwaller
David Kreutter

Josep Arus-Pous
Daniel Probst

Sven Bühlmann

Alice Capecchi
Sacha Javor
http://gdb.unibe.ch
Finton Sirockin (Novartis)
Ola Engkvist (Astra Zeneca)
Matthias Hediger (UniBE) Florian Hollfelder (Cambridge UK)
Pierre Gönczy (EPFL) Anne Imberty (Grenoble)
Amol Thakkar
Roch-Philippe Charles (UniBE) Achim Stocker (UniBE)
Jürg Gertsch (UniBE) Luc Patiny (EPFL)
Dirk Trauner (New York) Andrea Endimiani (UniBE)
Daniel Bertrand (HiQScreen) Christian van Delden (Geneva)
2
Hugues Abriel (UniBE) Runze He (Space Peptides)
1. The GDB project

DB fraction
1) Ring strain / topologies
Graphs
114 B
2) Unsaturations Hydrocarbons
5.4 M

3) Heteroatoms
Skeletons
Claus Benzol 1.3 B
(1867)
GDB-17: Molecules
166.4 B

4) ChEMBL-likeness score ≥ 3.3 CLscore

5) Uniform sampling
26.2 B Subset
ChEMBL-like

GDBChEMBL
10 M

Tobias Fink et al. Angew. Chem. Int. Ed. 2005, 44, 1504-1508, J. Chem. Inf. Model. 2007, 47, 342-353 (GDB-11)
Lorenz C. Blum et al., J. Am. Chem. Soc. 2009, 131, 8732-3 (GDB-13);
Lars Ruddigkeit et al., J. Chem. Inf. Model. 2012, 52, 2864-2875 (GDB-17) Trinorbornane
Ricardo Visini et al., J. Chem. Inf. Model. 2017, 57, 700-709 (FDB17), J. Chem. Inf. Model. 2017, 57, 2707-2718 (GDB4c) (2007)
Mahendra Awale et al., Mol. Inf. 2019, 38, 1900031 (GDBMedChem)
Sven Bühlmann et al., Front. Chem. 2020, doi:10.3389/fchem.2020.00046 3
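The enumerate-then-filter idea behind GDB (graphs, then skeletons, then molecules, with stability and feasibility filters at each stage) can be caricatured in a few lines; the skeleton, element set, and "stability" rule below are all toy assumptions:

```python
from itertools import product

SKELETON_BONDS = [(0, 1), (1, 2)]   # a fixed 3-atom chain skeleton
ELEMENTS = ["C", "N", "O"]

def enumerate_molecules():
    """Decorate the skeleton with elements, filtering a peroxide-like motif."""
    for atoms in product(ELEMENTS, repeat=3):
        if any(atoms[i] == atoms[j] == "O" for i, j in SKELETON_BONDS):
            continue   # toy stability rule: no O-O bond
        yield atoms

mols = list(enumerate_molecules())
print(len(mols))   # 22 of the 27 assignments survive the filter
```

GDB-17 applies the same strategy at scale: 114 billion graphs are reduced to 166.4 billion molecules only after ring-strain, unsaturation, and heteroatom rules prune the combinatorial explosion.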
Molecular quantum numbers (42D)
[Table: the 42 Molecular Quantum Number (MQN) descriptors, illustrated for morphine and
penicillin G. The counts cover atoms (C, F, Cl, Br, I, S, P, acyclic/cyclic N and O, heavy
atom count), bonds (acyclic/cyclic single, double and triple bonds, rotatable bonds), polar
groups (H-bond donor/acceptor atoms and sites, positive and negative charges) and topology
(acyclic and cyclic node valencies, 3- to >=10-membered rings, atoms and bonds shared by
fused rings).]
4
Kong Thong Nguyen et al., ChemMedChem 2009, 4, 1803-1805
5
Daniel Probst et al., Bioinformatics, 2018, 34, 1433-1435, J. Chem. Inf. Model. 2018, 58, 1–7
6
Mahendra Awale et al., Mol. Inf.. 2019, 38, 1900031
Ring systems

GENG
93,463 graphs
≤ 4 cycles, ≤ 16 nodes
1) Ring enlargement
+ filters

2) Aromatization
728,391 saturated
carbocyclic ring systems
≤ 4 cycles, ≤ 30 atoms

GDB4c 3) Stereoisomers
916,130 carbocyclic
ring systems
(as SMILES)

RDB
12,536 known GDB4c3D
ring systems 6,555,929 carbocyclic
ring systems
(as sdf files /SMILES)

Ricardo Visini, Josep Arùs-Pous et al., J. Chem. Inf. Model. 2017, 57, 2707-2718 7
Deep learning

Josep Arus-Pous et al., J. Cheminf. 2019, 11, 20 8


Deep learning

ChEMBL 1) Training
DrugBank
2) Transfer learning
FDB17
commercial fragments One known drug

LSTM
Generative Neural Network

3) Generate new SMILES
- retain correct SMILES
- remove duplicates
- remove undesirable functional groups
4) Select high similarity analogs
New drug analogs

Table 2. Number of unique/total high similarity drug analogs produced by the different LSTM neural networks.

Neural Network    LSTM1    LSTM2     LSTM3     LSTM4       LSTM5      LSTM6      Unique
Source database   ChEMBL   ChEMBLs   DrugBank  Commercial  FDB17      All        across
                                               fragments              databases  LSTMs
Training cpds.    344,319  40,000    5,104     40,986      500,000    890,409
Nicotine          0/23     32/82     1/32      32/93       9/47       16/67      166
Fencamfamine      15/42    126/218   40/96     130/231     92/164     41/114     580
Aminophenazone    5/26     34/96     23/71     38/99       22/66      19/65      223
Sulfadiazine      6/27     19/59     11/37     28/74       8/30       2/25       124
Miconazole        2/10     301/500   268/438   174/336     0/0        153/256    1134
Roflumilast       8/15     319/557   117/283   351/585     0/0        45/166     1126
Lovastatin        0/1      631/986   460/757   352/625     487/728    289/530    2729
Epothilone D      0/1      911/1301  561/831   807/1160    1595/2039  1163/1511  5707
Nilotinib         0/1      506/666   180/321   218/381     0/0        243/355    1362
Erythromycin      0/2      832/1042  174/243   524/709     1243/1444  1105/1288  4190

Mahendra Awale et al., J. Chem. Inf. Model. 2019, 59, 1347-1356 9


2. Atom-pairs
[Figures: atom-pair fingerprint bit values (distance categories D1-D10) for example
hydrocarbons (decane, cyclodecane, adamantane, 2,7-dimethyloctane, decalin,
1,5-dimethylcyclooctane, bis-homocubane, iPrEt3CH, iPr3CH, Pr3CH); AUC values for recovery
of ROCS Color Tanimoto analogues by the Sfp, ECfp4, Xfp10, Xfp20 and CATS fingerprints; and
average bit values (with SD) across the ChEMBL, ZINC, GDB-17 and CSD databases.]

Mahendra Awale et al., J. Chem. Inf. Model. 2014, 54, 1892-1907
10
Target prediction
From a query molecule, four parallel prediction streams:
• NN search (MQN / Xfp / ECfp4) in ChEMBL -> predicted targets (NN)
• ECfp4-NB model built on 2,000 NN -> predicted targets (NN+NB)
• ECfp4-NB model built on ChEMBL -> predicted targets (NB)
• ECfp4-DNN model built on ChEMBL -> predicted targets (DNN)

11
Mahendra Awale et al., J. Chem. Inf. Model. 2019, 59, 10-17
12
Mahendra Awale et al., J. Chem. Inf. Model. 2019, 59, 10-17
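The nearest-neighbour stream reduces to a similarity search over an annotated library; a minimal sketch in which the fingerprints (sets of on-bits) and target annotations are invented stand-ins for ECfp4/Xfp/MQN data:

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two fingerprint bit sets."""
    return len(a & b) / len(a | b)

# Hypothetical annotated library; real entries come from ChEMBL bioactivities
LIBRARY = {
    "ligand_1": ({1, 4, 7, 9}, "Target A"),
    "ligand_2": ({2, 4, 8}, "Target B"),
}

def predict_targets(query_fp, k=1):
    """Rank library ligands by similarity and return the top-k annotations."""
    ranked = sorted(LIBRARY.values(),
                    key=lambda item: tanimoto(query_fp, item[0]),
                    reverse=True)
    return [target for _, target in ranked[:k]]

print(predict_targets({1, 4, 9}))   # ['Target A']
```

The NB and DNN streams replace this lookup with models trained on the same library, and PPB2 combines the streams.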
PPB2 with Xfp similarity

angiogenesis inhibitor from phenotypic screen


known LPAAT-β inhibitor

13
Marion Poirier et al. ChemMedChem 2019, 14, 224-236
14
Alice Capecchi et al., Mol. Inf. 2019, doi: 10.1002/minf.201900016
15
Antimicrobial peptides

Stéphane Baeriswyl et al., ACS Chem. Biol. 2019, DOI: 10.1021/acschembio.9b00047 16


3

Library (50k) – PCHRG atoms (0%-10%)

Peptide Dendrimers

Enumerate analogs
G3KL (Lys, Leu, deletion: 52,530)

VS top 200 / First Round Synthesis


NN-T7 / NN-T5 / NN-T4

Property space (2DP)

Select nearest neighbours


Synthesis and testing (200)

Cluster
(10)

17
Thissa N. Siriwardena et al., Angew. Chem. Int. Ed. 2018, 57, 8483-8487
Peptide Design Genetic Algorithm (PDGA)
50 random Target sequence Target: tyrocidine A
sequences or MXFP cyclo[D-Phe-Pro-Phe-D-Phe-Asn-Gln-Tyr-Val-Orn-Leu]

XfPFfNQYVOL

evaluation:
1. SMILES SMILES
2. MXFP
3. CBD from target MXFP
MXFP

CBDMXFP= 0 YES
OR Exit
run time = 24h

NO tyrocidine B CBDMXFP = 67
YES
CBDMXFP ≤ 300 ANALOGS
DATABASE retro-loloatin A CBDMXFP = 157

retro-tyrocidine C CBDMXFP = 188


10 survivors 40 new individuals: crossover,
+
selection random generation, and mutation

new generation

A. Capecchi et al., J. Chem. Inf. Model., 2020, doi:10.1021/acs.jcim.9b01014 18


3. Molecular shingles encoding precise atom environments

19
Daniel Probst et al., J. Cheminf. 2018, 10, 66
TMAP of the natural products atlas (24,594×232)

20
Daniel Probst et al., J. Cheminf. 2020, doi:10.1186/s13321-020-0416-x
Atom Pair shingles: MAP4 encoding of an atom pair (j, k)
r1: O=c |15| c(c)c
r2: O=c(c)[nH]|15|c(cc)cc

a) DUD, MUV, ChEMBL b) Mutated and scrambled peptides c) All datasets

A. Capecchi et al, ChemRXiv., 2020, doi: 10.26434/chemrxiv.11994630.v1 21
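MHFP6 and MAP4 compare molecules by MinHashing their shingle sets; the core trick fits in a few lines (the hash construction and shingle strings below are simplified assumptions, not the published implementation):

```python
import hashlib

def _h(seed, s):
    """Deterministic per-seed hash, standing in for a MinHash permutation."""
    return int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)

def minhash(shingles, n_perm=16):
    """MinHash signature: for each hash function, keep the smallest value.
    Equal signature positions estimate the Jaccard similarity of the sets."""
    return [min(_h(seed, s) for s in shingles) for seed in range(n_perm)]

def jaccard_estimate(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Shingle strings loosely modelled on the slide's atom-pair shingles
a = {"O=c|15|c(c)c", "c(c)c", "O=c"}
b = {"O=c|15|c(c)c", "c(c)c", "N=c"}
print(jaccard_estimate(minhash(a), minhash(b)))  # noisy estimate of the true Jaccard 2/4
```

Because signatures are short fixed-length vectors, similarity search and TMAP layout scale to millions of molecules.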


TMAPs of the Metabolome MHFP6

OH count
MAP4

http://tm.gdb.tools/map4/ 22
Summary and Outlook
> GDB
– GDBChEMBL
– MQN (visualize)
– Deep learning
– GDBScaffold

> Atom-pairs
– Target prediction
– Beyond Lipinski
– Peptide discovery

> Molecular shingles


– MHFP6 (MinHash)
– TMAP
– MAP4

23
Machine Learning in the Regulatory
landscape
13.10 – 13.20

George Kass, EFSA, Italy

9 RISE — Research Institutes of Sweden


George Kass
Bio
George Kass was trained as a biochemist. He received his PhD in biochemical toxicology from the Karolinska Institute in Stockholm in 1990.
After a post-doc at the Swiss Federal Institute of Technology in Zurich he returned to the Karolinska Institute as Assistant Professor. In 1994 he
moved to the University of Surrey in the UK as Professor of Toxicology. He moved to the European Food Safety Authority in 2009. And he
knows nothing about machine learning or artificial intelligence.

Abstract
N/A
Machine Learning in the
Regulatory landscape

George Kass & Caroline Merten


European Food Safety Authority
Disclaimer

The views, thoughts and opinions presented are not


necessarily those of EFSA

2
EU agencies

ECDC
ECHA

EMA

EFSA
EFSA IS

The reference body for risk assessment of food


and feed in the European Union. Its work covers
the entire food chain – from field to fork

One of the number of bodies that are responsible


for food safety in Europe

4
WHAT EFSA DOES

Provides independent scientific advice and support for EU


risk managers and policy makers on food and feed safety

Provides independent, timely risk communication

Promotes scientific cooperation

5
THE SCIENTIFIC PANELS
Plant protection
Plant health

GMO

Nutrition
Animal feed

Food Packaging

Animal health & welfare

Food additives
Biological hazards

Chemical contaminants
6
In vivo biological studies
• ADME studies
• Following OECD TG and GLP criteria
• Traditional TK parameters (Tmax, t1/2, AUC, analytical data, etc...)

In vivo toxicological studies
• Sub-chronic, chronic, repro-dev studies
• Following OECD TG and GLP criteria
• Traditional Tox parameters (biochemistry, histopathology, weight, food consumption, etc...)

In vitro studies
• Mainly for genotoxicity and metabolism
• Following OECD TG and GLP criteria
• Traditional parameters (biochemistry, markers for mutagenesis and chromosomal aberrations, etc..)

7
Previous
Evaluations
Reports, papers and
evaluations
Reports
Biological studies In depth or
(e.g. ADME) and systematic
toxicological studies literature
(e.g. genotoxicity, searches
90-day, Papers and reports
developmental etc..)

Data and
evidence to
support
risk
assessment

Quantity of information is small. Risk assessor evaluates evidence manually.


8
The Challenges for EFSA

Transparency and openness

Need to accelerate pace of risk assessment

New methodologies and tools for risk assessment

One health approach to food and feed safety

9
Impact on regulatory risk assessment:
Opportunities for machine learning
▪ Data and evidence from literature
➢ Transparency and reproducibility
➢ In depth and systematic literature searches
➢ Controversial substances – thousands of publications
➢ Tools for literature screening (Distiller SR) and study quality evaluation (e.g.
SciRAP)
▪ Big data
➢ Whole-genome sequencing data
➢ OMICs data

▪ In silico predictions
➢ (Q)SAR
➢ RAx
➢ Need for better predictive models

Opportunities for machine learning and AI!


10
Machine learning and AI
could lead to better
decisions on chemicals

BUT we must get it right!

11
Emerging risks in the food chain
Opportunities for machine learning

REACH database contains 25,405 unique substances

Only small fraction extensively characterised

Does their release in the environment lead to accumulation in the food chain?

How to identify emerging issues? Alerts through text mining of scientific literature?

Alerts through screening social media and media reports?

Opportunities for machine learning and AI?


12
Why, Where and When Machine Learning
Works in Predictive Toxicology – And When it
Doesn’t….
13.30 – 14.00

Mark Cronin, LJMU, UK

10 RISE — Research Institutes of Sweden


Mark Cronin
Bio
Mark Cronin is Professor of Predictive Toxicology at Liverpool John Moores University. He has been interested in in silico methods to predict toxicity and
ADME endpoints for over 30 years, having experience in both human health and environmental endpoints. As a biologist he is keen to find patterns within
data, but also to ensure models have mechanistic meaning and interpretation and are fit for purpose.

Abstract
Why, Where and When Machine Learning Works in Predictive Toxicology – And When it Doesn’t….
Mark Cronin, School of Pharmacy and Biomolecular Sciences, Liverpool John Moores University, UK.
E-mail: m.t.cronin@ljmu.ac.uk

We've been using multivariate statistics in predictive toxicology for many decades now, as soon as the computational power was sufficient in the 1980s we
used supervised and unsupervised techniques to analyse what would now be thought of as rather trivial data sets. Neural Networks became relatively
commonplace in QSAR in the early 1990s, with some notorious early studies only proving that it was easier to overfit a model than provide a reliable
predictive technique! Whilst the data resources, hardware and machine learning techniques currently available to us, as well as the possibilities they provide,
could not have been imagined 30 years ago, the uses of computational toxicology have largely remained unchanged e.g. supporting product development,
screening and prioritisation, safety assessment. A key use is also in regulatory assessment, and this is an area where potentially there has been the greatest
resistance to the use of Machine Learning QSARs. The purpose of this talk is not to discourage the use of Machine Learning in computational toxicology,
precisely the opposite, but to say we need to use it correctly and in the right place, understanding when other techniques, based on a more mechanistic
approach, may be preferred and/ or acceptable.
Why, Where and When Machine Learning
Works in Predictive Toxicology
– And When it Doesn’t….

Or: From Luddite to Deep Learning

Mark Cronin

Liverpool John Moores University


A Two Minute Rant on ML and AI !!
Machine Learning
Machine Learning

With Thanks to Dr
James Firman, LJMU
Machine Learning
(figure examples: social media, epidemiology, adverse drug reactions)
• What do you want to learn?
• Size of data set
• Homogeneity of mechanisms
• Appropriate descriptors
• Overfitting
• Understand experimental error
• Structure of data
Main Message:

Use the Right


Tool for the Job!
http://researchonline.ljmu.ac.uk/id/eprint/4989/
Interest in
Computational
Toxicology

1980s 1990s 2000s 2020s


What I Learned From My PhD

• You can model complex toxicological effects


…and the models are beautiful

• QSAR datasets have a structure


…random numbers don’t

• There are many ways to understand data


…and they all tell you the same thing
fish toxicity
log P
Structure-Toxicity
Relationships
The “Good Old Days”
of Data Collection
And then
came
neural
networks...
Data Capture and Sharing

eTOX 25/10/16
DrugBank Approved v5.0.3
Some Freely Available Toxicity Data Resources
• OECD QSAR Toolbox (https://www.qsartoolbox.org/)
• Requires download and some expertise!
• ChEMBL (https://www.ebi.ac.uk/chembl/)
• Targets: 13,382 - Activities: 16,066,124
• Compounds: 1,961,462 - Documents: 76,086
• PubChem (https://pubchem.ncbi.nlm.nih.gov/)
• Compounds: 102.7 million - Substances: 253 million
• Bioactivities: 268 million
• And there are many others….
Why Does
This Happen?
The Hypothesis
Organised into An Adverse Outcome Pathway
AOPs Networks:
Quantify With NAMs

Spinu N. et al. (2019). Development and analysis of an adverse outcome pathway Slide adapted from Dr
network for human neurotoxicity. Archives of Toxicology 93, 2759–2772 Andrew Worth, EC JRC
Adverse Outcome Pathway Network for Neurotoxicity

Spinu N et al. (2020) Archives of Toxicology


https://doi.org/10.1007/s00204-020-02774-7
Weight of Evidence – A Bayesian Dream?
• Different streams of evidence to predict acute toxicity classification

Cytotoxicity

• When do we have enough “certainty” to make a decision

With thanks to Dr John Doe, for really


making me think about this!
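The "enough certainty" question can be framed as Bayesian updating over evidence streams; a minimal sketch in which the prior probability, assay sensitivity and specificity values are invented for illustration:

```python
def posterior(prior, sensitivity, specificity, positive=True):
    """P(toxic | one assay result) via Bayes' rule."""
    if positive:
        num = sensitivity * prior
        den = num + (1 - specificity) * (1 - prior)
    else:
        num = (1 - sensitivity) * prior
        den = num + specificity * (1 - prior)
    return num / den

p = 0.10                       # assumed prior: 10 % of screened compounds toxic
p = posterior(p, 0.80, 0.90)   # positive cytotoxicity result
print(round(p, 3))             # 0.471
p = posterior(p, 0.70, 0.85)   # a second, independent positive evidence stream
print(round(p, 3))             # 0.806
```

A decision rule then amounts to acting once the posterior crosses a pre-agreed threshold, which is one way to make "enough certainty" explicit.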
Product Development Regulatory Use

• Screening What are • Prioritisation and


• Lead Optimisation the use Screening
• Risk assessment cases? • Classification and
• Registration Labelling
• Risk Assessment
Regulatory Use of Predictions from In Silico Tools:
Validation and Acceptance

• Opportunities:
• To update assessment / validation
• Utilise knowledge of uncertainties
• Develop frameworks for regulatory use
Inspirations
13 Types of Uncertainty, Variability and Bias of QSARs
49 Assessment Criteria
Definition of Chemical Structures → 2 Biological Data → 7

Creation Physico-Chemical Properties and Structural Descriptors → 5

Compilation of the Data Set → 5 Modelling Approach → 1

Description of Model → 3 Statistical Performance → 2

Characteristics Applicability Domains → 3 Mechanistic Relevance → 3

ADME Effects → 2

Documentation and Reproducibility → 2


Application
Usability → 9 Relevance → 5

Details in: Cronin MTD et al (2019) Reg. Toxicol. Pharmacol. 106: 90-104
Making it Usable
Making it Useful
Uncertainties Confidence
Low High

Key Question:
What is
acceptable level
of uncertainty?

Acceptable
High Low
Fit for Purpose In Silico Models

Risk Assessment

Classification and Labelling

Prioritisation and Screening


Challenges of Predicting
Toxicology
Applications of Machine Learning in Predictive Toxicology

Mining datasets
– Searching for patterns
– Finding analogues

Tools for Screening


– Predicting hazard

Supporting regulatory decision making


– In silico profiling and evidence
– qAOPs, QST
Main Message:

Use the Right Tool for the Job!

• Formulate the Problem


• Understand the Data
• Assess Uncertainty
• Be Transparent

• Simplify – Explain
• Make Even Simpler - Explain Again
• Make Really, Really Simple…
Predictions with Confidence using
Conformal Prediction
14.15 – 14.45

Ulf Norinder, Stockholm University, Sweden

11 RISE — Research Institutes of Sweden


Ulf Norinder
Bio
Ulf Norinder received his Ph.D. degree in organic chemistry from Chalmers University of Technology (CTH) 1984. From 1985 – 2014 he worked
as computational chemist, senior principal scientist and research fellow in the pharmaceutical industry (Karo Bio AB, AstraZeneca AB and H.
Lundbeck A/S) and from 2015 - 2018 as senior research specialist at Swetox Södertälje (Karolinska Institutet).
He is currently affiliate professor at MTM Research Centre, School of Science and Technology, Örebro University, visiting researcher at the Dept
of Pharmaceutical Biosciences, Uppsala University and affiliate researcher at the Dept of Computer and Systems Sciences, Stockholm
University. His areas of expertise include computer-assisted drug design and pattern recognition with special emphasis on multivariate data
analysis and machine learning.

Abstract
Predictions with Confidence using Conformal Prediction.
The presentation will cover the utility of confidence predictors such as Conformal Prediction as an in silico modelling framework for obtaining
predictions with known, and mathematically proven, error rates set by the user as well as the graceful handling of highly imbalanced datasets,
typical in toxicology, without the need for balancing measures such as under- and/or oversampling.
Predictions with Confidence
using Conformal Prediction

Ulf Norinder*
Dept of Computer and Systems Sciences
Stockholm University, Sweden

* Most of the work performed at: Swedish Toxicology Sciences Research Centre (Swetox),
Unit of Toxicology Sciences, Karolinska Institutet, Sweden
as part of the EU-ToxRisk project (Horizon 2020, grant agreement No 681002)
Question of (un)certainty

A prediction without a quantified uncertainty is less useful


Question of (un)certainty

• Many methods to assess the average uncertainty


• Probability distributions
• Bayesian models
• Reliability-density neighbourhoods
• Ensemble model variance

However, none of these methods guarantee the error rate


for new instances*!

* = compounds
Question of (un)certainty

Desired:

• A method that delivers predictions with well-defined
uncertainties on an instance-by-instance basis
Question of (un)certainty

Why Conformal Prediction?


• Win situation
• Statistical guarantees (on validity)
Conformal Prediction

If {Exchangeability} then {conformal predictors are always valid}

Mathematical proof

Vovk V, Gammerman A, Shafer G


(2005) Algorithmic learning in a
random world, Springer, New York

If 20 % prediction errors on validity acceptable --->


CP will give, at most, 20 % errors!!
Conformal Prediction

“What we ideally would like to know is in fact that a particular


prediction is derived from an area of property space from which
reliable predictions are to be expected”
Conformal Prediction
validity

Classes
Active
Inactive
Both {active, inactive}
Empty {null}

Binary classification

In conformal prediction:
If a classification contains the correct class it is correct
both = always correct, empty = always erroneous
Validity = % of correct classifications (for each class)
Efficiency = % of single label classifications (right or wrong)
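These definitions can be turned into a small self-check (an illustrative sketch, not code from the talk; the prediction sets and labels below are invented):

```python
def validity_and_efficiency(prediction_sets, true_labels):
    """Validity = fraction of prediction sets containing the true class
    (computed per class); Efficiency = fraction of single-label sets."""
    classes = sorted(set(true_labels))
    validity = {}
    for c in classes:
        idx = [i for i, y in enumerate(true_labels) if y == c]
        correct = sum(1 for i in idx if true_labels[i] in prediction_sets[i])
        validity[c] = correct / len(idx)
    efficiency = sum(1 for s in prediction_sets if len(s) == 1) / len(prediction_sets)
    return validity, efficiency

# "both" = {active, inactive} is always correct; "empty" = set() is always erroneous
sets = [{"active"}, {"active", "inactive"}, set(), {"inactive"}]
labels = ["active", "active", "inactive", "inactive"]
validity, efficiency = validity_and_efficiency(sets, labels)
# active: both of its sets contain "active" -> validity 1.0
# inactive: empty set misses, single "inactive" hits -> validity 0.5
# two single-label sets out of four -> efficiency 0.5
```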
Conformal Prediction
Why Conformal Prediction?
• Win situation
• Statistical guarantees (on validity)
• CP is instance-based
• The risk is known up-front for the decision taken
• Applicability domain closely linked to model development
CP strictly defines the level of similarity (conformity) needed
No ambiguity anymore
• Gracefully handles (severely) imbalanced datasets
Ratios of 1:100 – 1:1000
No need for over- or undersampling
• CP is a framework (almost any ML algorithm will work)
Conformal Prediction
How does this work?

Data → Train set + Test set
Train set → Proper train set + Calibration set
Proper train set → Model; Calibration set → CP p-values (for each class), used for classification

U. Norinder, L. Carlsson, S. Boyer, M. Eklund, Introducing Conformal Prediction in Predictive Modeling. A Transparent and Flexible Alternative to Applicability Domain Determination, J. Chem. Inf. Model., 2014, 54, 1596–1603
CP is a framework (almost any ML algorithm will work)

• ML algorithm must provide a ranking


• Use current models, descriptors, algorithms
• Add a calibration set (new examples in time)
Conformal Prediction
Example: Predicting Toxicity

Imbalanced dataset
(toxic minority class)

A binary RF classifier (100 trees) gives the following output for a new compound to predict (which is toxic):
32 trees: toxic
68 trees: non-toxic
Conformal Prediction
Example: Predicting Toxicity
Calibration set, 6 toxic, 7 non-toxic compounds
Number of trees predicting the correct class:
Toxic: 45, 42, 36, 30, 28, 27
Non-toxic: 91, 88, 85, 82, 79, 79, 77
New compound to predict (is toxic): 32 trees toxic, 68 trees non-toxic

Mondrian Conformal Prediction


Conformal Prediction
Example: Predicting Toxicity
Based on the similarity to the known examples in the calibration set:
Position toxic: 3/7
Position non-toxic: 0/8
Calibration set (number of trees predicting the correct class):
Toxic: 45, 42, 36, 30, 28, 27
Non-toxic: 91, 88, 85, 82, 79, 79, 77
New compound to predict (is toxic): 32 trees toxic, 68 trees non-toxic

Mondrian Conformal Prediction


Conformal Prediction
Example: Predicting Toxicity

Using 80% confidence level (0.2 significance level):

3/7 = 0.43 > 0.2, therefore the compound is assigned to the toxic class
0/8 = 0.0 < 0.2, therefore the compound is not assigned to the non-toxic class

New compound to predict (is toxic): 32 trees toxic, 68 trees non-toxic
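The p-value arithmetic on this slide can be reproduced in a few lines (an illustrative sketch using the slide's numbers; the nonconformity score is simply the number of trees voting for the class in question):

```python
def mondrian_p_value(calibration_scores, new_score):
    """Mondrian CP p-value for one class: rank of the new compound's score
    among the calibration scores of that class (higher score = more conforming)."""
    n = len(calibration_scores)
    return sum(1 for s in calibration_scores if s <= new_score) / (n + 1)

# Calibration: number of trees predicting the correct class
toxic_cal     = [45, 42, 36, 30, 28, 27]          # 6 toxic compounds
non_toxic_cal = [91, 88, 85, 82, 79, 79, 77]      # 7 non-toxic compounds

# New compound: 32 trees vote toxic, 68 vote non-toxic
p_toxic = mondrian_p_value(toxic_cal, 32)          # 3/7, approx. 0.43
p_non_toxic = mondrian_p_value(non_toxic_cal, 68)  # 0/8 = 0.0

significance = 0.2  # 80% confidence level
prediction_set = {c for c, p in [("toxic", p_toxic), ("non-toxic", p_non_toxic)]
                  if p > significance}
# p_toxic > 0.2 and p_non_toxic < 0.2, so only "toxic" enters the set
```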
Conformal Prediction
Example: Predicting Toxicity
Calibration set, 6 toxic, 7 non-toxic compounds; number of trees predicting the correct class.
Several pairs of proper train and calibration sets give several p-values (for each class): use the median p-value.
New compound to predict (is toxic): 32 trees toxic, 68 trees non-toxic

Aggregated Mondrian Conformal Prediction


Mondrian Cross-Conformal Prediction
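The aggregation over several train/calibration splits can be sketched as follows (illustrative only; the resampled calibration score lists are invented):

```python
from statistics import median

def p_value(cal_scores, new_score):
    # Mondrian CP p-value for one class and one calibration set
    return sum(1 for s in cal_scores if s <= new_score) / (len(cal_scores) + 1)

# Several pairs of proper train/calibration sets -> several calibration
# score lists for the toxic class (hypothetical resampled values)
toxic_calibrations = [
    [45, 42, 36, 30, 28, 27],
    [44, 40, 35, 33, 29, 26],
    [46, 41, 37, 30, 27, 25],
]
new_score = 32  # trees voting "toxic" for the new compound

p_values = [p_value(cal, new_score) for cal in toxic_calibrations]
p_toxic = median(p_values)  # aggregated (Mondrian cross-conformal) p-value
```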
Binary Mondrian Conformal Prediction p-values

In conformal prediction:
If a classification contains the correct class it is correct
both = always correct, empty = always erroneous
Validity = % of correct classifications (for each class)
Efficiency = % of single label classifications (right or wrong)
PubChem Cytotox Assays
• Results from 16 high throughput cell viability (tox) screens from
PubChem

• On average 0.8% toxic compounds

AID     Tested compounds  Toxic compounds  % active  Ratio non-tox/tox
624418  386 360           524              0.14      736.3
504648  367 995           600              0.16      612.3
602141  359 040           1302             0.36      274.8
620     86 701            364              0.42      237.2
847     41 152            194              0.47      211.1
903     52 783            338              0.64      155.2
2275    29 938            193              0.64      154.1
588856  404 016           3018             0.75      132.9
1825    290 605           2259             0.78      127.6
2717    299 957           3181             1.06      93.3
648     86 121            924              1.07      92.2
719     84 841            937              1.10      89.5
1486    217 851           2408             1.11      89.5
463     56 465            706              1.25      79.0
430     62 627            1121             1.79      54.9
598     85 162            5139             6.03      15.6
PubChem Cytotox Assays
• Results from 16 high throughput cell viability (tox) screens from
PubChem

• On average 0.8% toxic compounds

• RDKit descriptors

• Random Forest, 500 trees, ensemble of 100 models

• 80 % training set, 20 % external test set


Validity of the predictions (test sets) at the 80% confidence
level. Models are valid for both classes.
PubChem & Hansen Datasets
• Four datasets of different sizes and class imbalances

• 10 % randomly selected training sets

• Signature descriptors of heights 0–2 for chemical structure characterization

• Support vector machines (SVM): C-SVC, RBF kernel, parameters C = 50, gamma = 0.002

• Ensemble of 100 SVM models


Four datasets of different sizes and class imbalances (ratio inactive:active compounds: 0.9, 4.1, 39.6, 911.2). Size and imbalance differ considerably between the datasets.

Fraction of predicted active and inactive compounds: results are similar across the datasets despite the varying imbalance.
Number of compounds in the "both" and "empty" classes at an acceptable significance level (decided by the user). Results from new data:
• Many predictions in the empty class → outside the AD of the current model → measure and update the model
• Many predictions in the both class → inside the AD of the current model → lack of information → add new information (features), develop a better model (classifier, algorithm)
Not over-optimistic models: validity for the minority and majority classes, training set vs test set, at significance level 0.2.
Conformal Prediction

If {Exchangeability} then {conformal predictors are always valid}

• Training and test data: the relationship between input and output in training and test data is the same.

If, however, the new predictions are not valid:
• Training and test data: the relationship between input and output in training and test data is different.
• Trigger to update the current model
Molecular descriptors for organic reactions

15.15 – 15.40

Fernando Huerta, RISE, Sweden

13 RISE — Research Institutes of Sweden


Fernando F. Huerta
Bio
Fernando F. Huerta received his BSc degree (1993) and his PhD degree (1998) in chemistry from the University of Alicante under the
supervision of Professor Miguel Yus. After a postdoctoral period at Stockholm University with Professor Jan-E. Bäckvall (1998-2000)
working on dynamic kinetic resolution of secondary alcohols, he took an associate professor position at the University of Alicante (2000-2002).
In 2002 he moved to Stockholm where he started his career as medicinal chemist at AstraZeneca (2002-2012). During this period, Fernando
was involved in a team for the evaluation of different synthesis planning software packages. In particular, he became part of the AZ team
collaborating with InfoChem GmbH in the development of ICSynth. Together with two former AstraZeneca colleagues, Fernando founded
Chemnotia AB, Stockholm, Sweden, in 2012. In this new role and in collaboration with InfoChem GmbH (proprietary owner) Fernando was a
key player in the development of IC Forward Reaction Prediction. Since 2016, Fernando has worked as Senior Researcher/Project Leader at Research Institutes of Sweden (RISE), Bioscience and Materials Division, Stockholm, Sweden.

Abstract:
Molecular descriptors for organic reactions.
Reactions are complex processes; the number of factors that can affect the outcome of a reaction ranges from collision theory, kinetics, enthalpy, and entropy to the use of catalysts or additives that alter the reaction mechanism. Different approaches have been reported in the literature to describe organic reactions as vectors to enable data mining or machine learning. However, the big question remains: do we have
enough molecular descriptors to develop predictive models?
MOLECULAR DESCRIPTORS FOR ORGANIC REACTIONS
Fernando F. Huerta
Bioscience & Materials Division (RISE)
fernando.huerta@ri.se
MOLECULAR DESCRIPTORS FOR ORGANIC REACTIONS

• Easy to calculate molecular properties


• Accessible with open access software
MOLECULAR DESCRIPTORS FOR ORGANIC REACTIONS

1. Definition & Examples


2. Reaction Matrix
3. Organic Reactions Understanding
4. Work in Progress (RISE)
5. Goal
I’M A CHEMIST
MOLECULAR DESCRIPTORS FOR ORGANIC REACTIONS

Molecular descriptors are formally mathematical representations of a molecule obtained by a well-specified algorithm applied to a defined molecular representation or a well-specified experimental procedure.

1. Definition & Examples


what type of descriptors are we talking about?
Mannhold LogP; Moreau-Broto Autocorrelation (mass) descriptors; Atomic Polarizabilities; Bond Count; Moreau-Broto Autocorrelation (charge) descriptors; Moreau-Broto Autocorrelation (polarizability) descriptors; Charged Partial Surface Areas (3D); Eccentric Connectivity Index; Fragment Complexity; VABC Volume Descriptor; Largest Chain; Largest Pi Chain; Petitjean Number; Rotatable Bonds Count; Topological Polar Surface Area; Molecular Weight; XLogP; Zagreb Index; Molar Mass; SP3 Character; Rotatable Bonds Count (non-terminal); … up to 200 descriptors

Indigo, CDK and RDKit nodes in KNIME
MOLECULAR DESCRIPTORS FOR ORGANIC REACTIONS

why?



MOLECULAR DESCRIPTORS FOR ORGANIC REACTIONS

1. Definition & Examples


2. Reaction Matrix
3. Organic Reactions Understanding
4. Work in Progress (RISE)
5. Goal

5. Goal

• Predict the outcome of a chemical reaction

• Predict the starting materials for the synthesis of


a complex molecule (retrosynthetic analysis)

a) Coley, C. W., Green, W. H., & Jensen, K. F. Accounts of Chemical Research, 51(5), 1281–1289 (2018). https://doi.org/10.1021/acs.accounts.8b00087.
b) Grzybowski, B. A., Scientific Reports, 7(1), 1–9 (2017). https://doi.org/10.1038/s41598-017-02303-0.
c) Warr, W. A. Molecular Informatics, 33(6–7), 469–476 (2014). https://doi.org/10.1002/minf.201400052.
MOLECULAR DESCRIPTORS FOR ORGANIC REACTIONS

how?



MOLECULAR DESCRIPTORS FOR ORGANIC REACTIONS

1. Definition & Examples


2. Reaction Matrix
3. Organic Reactions Understanding
4. Work in Progress (RISE)
5. Goal

Each reaction has to be described by a UNIQUE reaction matrix.
In a typical reaction A + B → C (under conditions a, with yield y):
a = set of reaction conditions
y = yield

Reaction representation using molecular descriptors: the reaction matrix (DAn, DBn, DCn) together with (a and/or y), where DXn denotes the vector of n molecular descriptors of molecule X.

Molecular descriptors for organic reactions (2. Reaction Matrix)
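The reaction matrix idea can be sketched as a simple vector concatenation (a hypothetical illustration; the descriptor names MW, logP, TPSA and all numeric values below are invented, not taken from the talk):

```python
# Sketch of a reaction matrix: concatenate the descriptor vectors of the
# starting materials A, B and the product C, then append conditions and/or yield.

def reaction_vector(desc_A, desc_B, desc_C, conditions=(), y=None):
    """Build (feature vector, target) for one reaction A + B -> C."""
    vec = list(desc_A) + list(desc_B) + list(desc_C) + list(conditions)
    return vec, y

# n = 3 hypothetical descriptors per molecule, e.g. (MW, logP, TPSA)
D_A = (122.1, 1.6, 37.3)
D_B = (157.0, 2.1, 0.0)
D_C = (214.3, 2.9, 37.3)

# One condition value (e.g. temperature) and the yield as target
features, target = reaction_vector(D_A, D_B, D_C, conditions=(80.0,), y=0.91)
# features has 3*3 + 1 = 10 entries; target is the yield
```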
What type of molecules are A, B, C?

Unique reaction matrix (DAn, DBn, DCn)
constitutional isomers can be described by means of a unique array of molecular descriptors

MOLECULAR DESCRIPTORS FOR ORGANIC REACTIONS

however?

Stereoisomers cannot be distinguished by a unique reaction matrix.

No unique reaction matrix for stereoselective reactions.
5. Goal

• Predict the outcome of a chemical reaction

• Predict the starting materials for the synthesis of


a complex molecule (retrosynthetic analysis)

…for reactions with no stereoselective issues


Reaction product ≠ f(reaction conditions)

Suzuki-Miyaura reaction

• No chemical selectivity issues


• Reaction conditions will only affect yield (hypothesis)

Reaction product ≠ f(reaction conditions)

Reaction matrix based on starting materials and product → yield class.
Yield classification model using ML with up to 95% accuracy*

*ACS Omega submitted
Reaction product ≠ f(reaction conditions)

Buchwald-Hartwig coupling

90% (AUC)

Negishi coupling

83% (AUC)

*ACS Omega submitted
other reactions

Reaction product = f(reaction conditions)

N- versus O-alkylation reactions (75% vs 74–86% yield)

• Selectivity depends on reaction conditions (reagents used).
MOLECULAR DESCRIPTORS FOR ORGANIC REACTIONS
The matrix difference lies only in the product molecular properties (P); the starting materials SM1 and SM2 are the same.
Reaction product = f(reaction concentration)

Lactamization versus amide coupling

Product depends on the reaction concentration

Reaction product = f(reaction conditions)

• Predictive models plausible (unique reaction matrix)

• Not reliable

• Lack of data

5. Goal

• Predict the outcome of a chemical reaction

• Predict the starting materials for the synthesis of


a complex molecule (retrosynthetic analysis)

are all organic reactions as simple?

In a typical reaction A + B → C (under conditions a, with yield y): a = set of reaction conditions, y = yield

• ca 446 named reactions

I’M STILL A CHEMIST
are all organic reactions as simple?

In a typical reaction A + B → C (under conditions a, with yield y): a = set of reaction conditions, y = yield

• ca 446 named reactions

• chemists have full knowledge of the reaction mechanism before setting up an experiment
MOLECULAR DESCRIPTORS FOR ORGANIC REACTIONS

1. Definition & Examples


2. Reaction Matrix
3. Organic Reactions Understanding
4. Work in Progress (RISE)
5. Goal



Reaction Thermodynamics
• Catalyst and/or additives
• Solvent
• Temperature
• Etc…

Reaction Thermodynamics:
• Catalyst and/or additives
• Solvent
• Temperature
• Etc…

Reaction Kinetics:
• Concentration
• Polar surface area
• Temperature
• Etc…

MOLECULAR DESCRIPTORS FOR ORGANIC REACTIONS

1. Definition & Examples


2. Reaction Matrix
3. Organic Reactions Understanding
4. Work in Progress (RISE)
5. Goal
need for different descriptors

HYPOTHESIS
• reaction classification
• key reagents
• analysis of bad
• identification of “bad”
• generation of “descriptor/s”

4. Work in Progress (RISE)

Work in progress with the Suzuki reactions!!!
need for different descriptors

PROBLEMS
• no bad reactions reported
• curation
• lack of data
• number of reaction clusters
• and many more…

4. Work in Progress (RISE)

Work in progress with the Suzuki reactions!!!
MOLECULAR DESCRIPTORS FOR ORGANIC REACTIONS

1. Definition & Examples


2. Reaction Matrix
3. Organic Reactions Understanding
4. Work in Progress (RISE)
5. Goal
5. Goal

• Predict the outcome of a chemical reaction

• Predict the starting materials for the synthesis of


a complex molecule (retrosynthetic analysis)

5. Goal

• Predict the outcome of a chemical reaction

1. Normalization of organic reactions does not help models


2. Relatively good results for specific transformations have been reported

5. Goal

• Predict the starting materials for the synthesis of


a complex molecule (retrosynthetic analysis)
1. Some attempts published with good theoretical results
2. Best practical results using LHASA, ICSYNTH, ChemPlanner

ACKNOWLEDGEMENTS

ALEXANDER MINIDIS

SAMUEL HALLINDER

ULF TEDEBARK

ULF NORINDER

fernando.huerta@ri.se
THANK YOU
Pitfalls when applying machine learning
to chemistry
15.50 – 16.10

Martin Nilsson, RISE, Sweden

14 RISE — Research Institutes of Sweden


Martin Nilsson
Bio
Martin Nilsson, RISE, Assoc. Prof., Mathematical physicist. M.Sc. at KTH, Stockholm, 1983; PhD at Tokyo University, 1989. Primary research
interests are mathematical aspects of machine learning and biologically inspired approaches to representation learning.

Abstract:
Pitfalls when applying machine learning to chemistry
Despite its recent popularity, machine learning (ML) is not a free lunch. Obnoxious problems often pop up when applying ML in general and no
less when applied to chemistry. I will discuss a selection of such problems that may require more attention, including computational efficiency,
data quality, debugging, descriptors/representations, and evaluation.
Pitfalls when applying machine
learning to chemistry

Martin Nilsson, RISE 2020-05-29


Frequently Overlooked Features of
Machine Learning

(Image credits: xkcd.com)


Three subtopics:

•Scaling

•Data quality

•Model problems
1. Scaling
Parameter distributions in chemistry
are often exponential (or worse),
while linear-algebra-based algorithms
prefer normal distributions.

Use a transform (e.g., log) –


Don’t trust your ML package
to do this automatically!
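A minimal illustration of the point (the values below are invented, e.g. rate constants spanning many orders of magnitude):

```python
import math

# Hypothetical chemistry parameter with an exponential-like spread
values = [1e-3, 0.1, 10.0, 1e3]

# On the raw scale, a single large value dominates any least-squares /
# linear-algebra step
spread_raw = max(values) / min(values)    # ~1e6

# After a log transform the values are roughly evenly spaced
logged = [math.log10(v) for v in values]  # approx. [-3, -1, 1, 3]
spread_log = max(logged) - min(logged)    # ~6
```

The transform has to be applied explicitly before model fitting; as the slide warns, the ML package will not do it for you.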
2. Data Quality
Some factors influencing yield:

• Rate and order of adding reactants
• Temperature
• Quality of reactants
• Number of equiv. to drive reaction
• Solvent and its quality
• Reactant surplus over time
• Humidity
• ...

The problem is: available datasets rarely provide all that data!
So then: How well can we learn?
What is a lower bound on the error?

Bayes’ error = best possible error for any learning algorithm

• Hard to compute exactly, but


• can be estimated, using, e.g.,
• Friedman-Rafsky statistic (Skoraczynski et al. 2017)
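The Friedman-Rafsky idea can be sketched in a few lines (a toy illustration, not the estimator from the cited paper: build a minimum spanning tree over the pooled samples and count edges joining points from different samples; few cross-sample edges suggest the two distributions separate easily, many suggest heavy overlap and hence a high Bayes error):

```python
def mst_edges(points):
    """Prim's algorithm on squared Euclidean distances; returns tree edges."""
    n = len(points)
    in_tree = {0}
    edges = []
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    while len(in_tree) < n:
        # cheapest edge from the tree to a node outside it
        i, j = min(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda e: d2(points[e[0]], points[e[1]]))
        edges.append((i, j))
        in_tree.add(j)
    return edges

def cross_sample_edges(sample_a, sample_b):
    """Friedman-Rafsky style count: MST edges joining the two samples."""
    points = sample_a + sample_b
    labels = [0] * len(sample_a) + [1] * len(sample_b)
    return sum(1 for i, j in mst_edges(points) if labels[i] != labels[j])

# Two well-separated toy samples: almost no cross edges
a = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1)]
b = [(5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
separated = cross_sample_edges(a, b)   # only the single bridge edge crosses

# Interleaved samples: many cross edges -> heavy overlap, high Bayes error
c = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
d = [(0.5, 0.5), (1.5, 1.5), (2.5, 2.5)]
mixed = cross_sample_edges(c, d)       # every MST edge crosses
```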
3. Model problems
• Toy problems
• Known solutions
• Solvable by classical methods
• Almost any ML algorithm applicable
• Not requiring huge amounts of data
• Debuggable & explainable

For image recognition: MNIST and CIFAR-10. What is the equivalent for chemistry?
Conclusions
• Three problems that deserve attention:
Scaling, Data Quality, and Model Problems

• References:
Skoraczynski et al.: “Predicting the outcomes of organic reactions via machine learning: are current descriptors sufficient?”, Scientific Reports 7:3582 (2017). (Don’t miss the Supplementary Info!)
Friedman & Rafsky: “Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests”, Ann. Stat. 7, 697-717 (1979).

• Credits to RISE colleagues: Andreas Thore, Ulf Tedebark, Fernando Huerta, Alexander Minidis, Sverker Janson, Erik Ylipää, Swapnil Chavan
Concluding remarks

16.20 – 16.30

Ian Cotgreave, RISE, Sweden

15 RISE — Research Institutes of Sweden


RISE organizing committee

Ulf.Tedebark@ri.se (Biomolecules)
Sverker.Janson@ri.se (Computer Science)
Ian.Cotgreave@ri.se (Chemical and Pharmaceutical Toxicology)
Alexander.Minidis@ri.se (Medicinal/Process Chemistry and Data-Science)
Erik.Ylipaa@ri.se (Deep Learning)
Fernando.Huerta@ri.se (Medicinal/Process Chemistry and Data-Science)
Swapnil.Chavan@ri.se (Computational Toxicology)
Andreas.Thore@ri.se (Materials Science and Computational Chemistry)
Martin.Nilsson@ri.se (Algorithms and Mathematics)

RISE is Sweden’s research institute and innovation partner. Through our international collaboration programs with industry, academia and the
public sector, we ensure the competitiveness of the Swedish business community on an international level and contribute to a sustainable
society. Our 2,800 employees engage in and support all types of innovation processes. RISE is an independent, state-owned research institute,
which offers unique expertise and over 100 testbeds and demonstration environments for future-proof technologies, products and services.

ISBN: 978-91-89167-42-1
