Article
A Structure-Based Platform for Predicting Chemical Reactivity
Frederik Sandfort, Felix Strieth-Kalthoff, Marius Kühnemund, Christian Beecks,
Frank Glorius

glorius@uni-muenster.de

HIGHLIGHTS
Quantitative modeling of reaction outcomes via machine learning

Prediction of properties, yields, stereoselectivities, and relative conversion

Multiple fingerprint features as a versatile and robust molecular representation

Readily applicable machine learning tool, directly starting from molecular structures

Although machine learning has a long-standing history in chemical research with
respect to the prediction of molecular properties and biological activities, the
quantitative modeling of reactivity has only been approached recently, and current
models suggest that complex and specific parameterization is inevitable. As
opposed to this, we report a simple machine learning model for predicting various
reaction outcomes, such as yields and stereoselectivities. Being based on a solely
structural input, our model should be transferable to diverse problems related to
organic molecules.

Sandfort et al., Chem 6, 1379–1390
June 11, 2020 © 2020 Elsevier Inc.
https://doi.org/10.1016/j.chempr.2020.02.017

A Structure-Based Platform
for Predicting Chemical Reactivity
Frederik Sandfort,1,4 Felix Strieth-Kalthoff,1,4 Marius Kühnemund,2,3,4 Christian Beecks,2
and Frank Glorius1,5,*

SUMMARY
Despite their enormous potential, machine learning methods have only found limited
application in predicting reaction outcomes, because current models are often highly
complex and, most importantly, are not transferable to different problem sets. Here,
we present a structure-based machine learning platform for diverse applications in
organic chemistry. Therefore, an input based on multiple fingerprint features (MFFs)
as a versatile molecular representation was developed that was shown to be applicable
over a range of diverse problem sets. First, molecular properties across a diverse
array of molecules could be predicted accurately. Next, reaction outcomes such as
stereoselectivities and yields were predicted for experimental datasets that were
previously evaluated using (complex) problem-oriented descriptor models. As a final
application, a systematic high-throughput dataset was investigated as a “real-world
problem,” and good correlation was observed when using the structure-based model.

The Bigger Picture
Statistical data-based prediction models have found widespread application in nearly
all areas of science, including chemistry. In this context, the prediction of
molecular properties or biological activities for a target molecule (quantitative
structure-property relationships [QSPRs]) has been widely investigated, with great
focus on developing new and general molecular representations. However, the
underlying fundamentals have not been transferred to the prediction of chemical
reactivity. In contrast, although recent progress in high-throughput data generation
has enabled the generation of uniform reaction-based datasets, current prediction
models suggest that complex parameterization is required for each individual case to
achieve good results. Applying universal (structure-based) molecular representations
to the prediction of chemical reactivity could provide a readily applicable model,
which could potentially decrease the barriers to apply machine learning techniques in
organic synthesis.

INTRODUCTION
Although chemical intuition, based on experience, expertise, and mechanistic
understanding, has driven the discovery of new transformations in organic synthesis,
the accurate prediction of the outcome of a single chemical reaction remains a major
challenge for models based on human instinct and computers.1 In this regard, the
optimization of an organic transformation therefore requires the collection of large
amounts of empirical data. Even for well-established methodologies, experienced
chemists frequently fail to predict whether a (complex) substrate might undergo
the desired transformation, making the field of chemical synthesis highly challenging
and laborious.2,3 Although qualitative estimations based on mechanistic understanding
can be accurate, the quantitative prediction of chemical reactivity is almost
impossible with chemical intuition alone, mainly because of the enormously
complicated correlation between structure and reactivity.

Because of this complexity, chemists have tried to simplify the overall problem by
correlating (known or easily accessible) molecular properties with a compound’s
reactivity. In this quantitative approach toward reactivity prediction, linear free en-
ergy relationships (LFERs) are identified and solved using multivariate linear regres-
sion (MLR) models.4,5 This statistical approach, which has been established by
Sigman and co-workers, relies on physically meaningful parameters to represent
the structures of interest.6–8 These molecular descriptors are electronic or steric pa-
rameters, which can be determined by experimental or computational means
(mainly density functional theory [DFT] calculations). Parameter selection and model
development are typically carried out in a lengthy iterative workflow until good cor-
relation is achieved. The major strong point of this process is that in many cases,
valuable information on the underlying mechanism can be obtained.9,10 Further-
more, after the successful identification of mechanistically relevant descriptors,
the amount of data points required for an MLR prediction model can be compara-
tively small. However, this requires a representative set of training data, which has
to be carefully considered.6
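
As a minimal illustration of the LFER idea described above, the following sketch fits a single-descriptor linear relationship by ordinary least squares in pure Python. The descriptor values and responses are made-up placeholders rather than data from any of the cited studies, and `fit_line` is a hypothetical helper; an actual MLR study would combine several electronic and steric descriptors.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical descriptor values (e.g., a substituent constant) and
# hypothetical responses (e.g., the log of a relative rate); illustrative only.
descriptor = [-0.2, 0.0, 0.2, 0.4, 0.8]
response = [-0.4, 0.0, 0.4, 0.8, 1.6]

slope, intercept = fit_line(descriptor, response)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

The fitted slope plays the role of a sensitivity constant: it quantifies how strongly the response changes per unit of the chosen descriptor.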

A more general, statistical approach to predict reactivity building on a given dataset
is machine learning. Because of their ability to recognize complex patterns, machine
learning algorithms have been widely used in many scientific fields.11 In the context
of chemoinformatics, they were applied for use in drug discovery,12,13 computer-
aided synthesis planning (CASP),14–16 the prediction of possible organic reaction
products,17–20 and molecular design.21,22 Within these fields, the quantitative
prediction of properties or biological activities for a given molecule has been
investigated extensively, referred to as quantitative structure-activity or property
relationships (QSARs or QSPRs) (Figure 1A).23,24 The fundamental concept is that
all parameters and molecular characteristics can eventually be traced back to a com-
pound’s Lewis structure, which contains the connectivity of all atoms in a simplified
molecular topology. Although string representations such as simplified molecular
input line entry specification (SMILES) accurately represent the (2D) molecular graph,
their direct use as inputs for machine learning models has only been of limited suc-
cess.25,26 Thus, a great focus of scientific research was the development and use of
general and readily applicable representations to allow for algorithm-based pattern
recognition in molecular structures.27,28 In many fields, so-called molecular
fingerprints are well established.29,30 These bit vectors were originally designed
for substructure and similarity searches in virtual screening,31–33 and modern
implementations such as extended-connectivity fingerprints (ECFPs) have
proven well suited as inputs for machine learning models (Figure 1A).34 Furthermore,
the generation of fingerprints from the 2D molecular graph (referred to as encoding)
can be carried out efficiently on a subsecond timescale.
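
The similarity searches for which fingerprints were designed are commonly based on the Tanimoto coefficient, i.e., the number of shared on-bits divided by the number of distinct on-bits. The sketch below shows this in pure Python; representing a fingerprint as a set of on-bit indices is a simplification chosen here, and the bit indices are arbitrary placeholders rather than real fingerprints.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


# Arbitrary on-bit indices, for illustration only.
fp_a = {1, 5, 9, 42}
fp_b = {1, 5, 17, 42, 77}

similarity = tanimoto(fp_a, fp_b)  # 3 shared bits / 6 distinct bits = 0.5
```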

In contrast to (physical) properties or biological activities, the quantitative modeling
of chemical reactivity (e.g., yield and selectivity) with machine learning algorithms
has not been approached until very recently, primarily due to the lack of available
data.39 In general, the generation of data points in chemistry is traditionally rather
expensive. Although molecular properties can usually be obtained from DFT
calculations, there is no cheaper way to generate reaction-specific data than
physically going to the laboratory and running the experiment.
Thus, to generate these data, technical solutions for high-throughput experimenta-
tion under batch and flow conditions have been developed to carry out thousands of
reactions in a short time by using just a few milligrams of material.40–42 Such tools,
combined with in vitro and in silico compound libraries, have recently opened the
field of reaction development to machine learning models.35,43,44

In pioneering work, Doyle and co-workers were able to predict the reaction yields of
C–N cross coupling reactions by using a dataset of more than 4,000 reactions.37
Furthermore, Denmark and co-workers could predict enantioselectivities by using
chiral phosphoric acid (CPA) catalysts based on a dataset with more than 1,000
experiments.38 Similar to MLR studies, these methods rely on physically meaningful
parameters (Figure 1B). In a thorough analysis, steric and electronic parameters
are selected on the basis of the underlying mechanism of the transformation and
relevant properties of each reactant. In a next step, these parameters are
determined by means of DFT calculations, as uniform sets of experimental values are
not usually available for such a large number of molecules. However, although DFT
calculation of properties is considered fast for a specific compound, the generation
of multiple descriptors for a library of substrates can be time consuming, especially
so in cases where more complex structures are involved. Thus, DFT computations
using established methods can easily become time limiting, as these methods typically
scale with a computation time complexity of O(N⁴) depending on the size (N) of the
system.45 It should be noted that certain descriptors for a variety of functional
groups can either be found in the literature or calculated in a simple way. In this
regard, Grzybowski and co-workers could predict the major isomers formed in
Diels-Alder reactions by using Hammett constants and topological steric effect
indices (TSEIs) as inputs.46

Figure 1. Quantitative Modeling in Organic Chemistry Based on Machine Learning
(A) Quantitative prediction of molecular properties or biological activities. Molecular fingerprints
as a structural representation for machine learning models.
(B) Quantitative modeling of reaction outcomes via machine learning. Previous models rely on
one-hot encoding 35,36 or (physical) parameters 37,38 to represent chemical reactions. The aim of this work
was to apply a generally applicable structural representation such as molecular fingerprints.

1Organisch-Chemisches Institut, Westfälische Wilhelms-Universität Münster, Corrensstraße 40,
48149 Münster, Germany
2Institut für Informatik, Westfälische Wilhelms-Universität Münster, Einsteinstraße 62,
48149 Münster, Germany
3Institut für Wirtschaftsinformatik, Westfälische Wilhelms-Universität Münster,
Leonardo-Campus 3, 48149 Münster, Germany
4These authors contributed equally
5Lead Contact
*Correspondence: glorius@uni-muenster.de
https://doi.org/10.1016/j.chempr.2020.02.017

Contrary to QSPRs, the quantitative modeling of reaction outcomes such as
selectivities and yields can be considered a multi-dimensional problem, as each
data point relies on multiple molecules as inputs. In such combinatorial datasets,
the actual amount of data points can outnumber the total number of molecules by
several orders of magnitude (Figure 1B). A chemical reaction can thus be
represented by a one-hot encoded vector, which only contains the information
whether a specific compound is included in the reaction. Although not chemically
meaningful, such models can reveal inherent statistical correlation in combinatorial
datasets and should thus be used to validate chemically inspired approaches. In this
regard, Chuang and Keiser showed that at least in some cases, a one-hot encoded
model was as good in predicting yields as the original descriptor model by Doyle
and co-workers mentioned above.36,47 The use of one-hot encoding for successful
pattern recognition in chemical reactions has also been utilized by Cronin and co-
workers.35
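
The one-hot representation described above can be sketched in a few lines of pure Python. The component labels and the helper `one_hot_reaction` are hypothetical and only serve to show the shape of the encoding: one bit per known compound, set to 1 if that compound takes part in the reaction.

```python
def one_hot_reaction(components, vocabularies):
    """Encode a reaction as one bit per known compound, grouped by role.

    components:   dict mapping each role to the label used in this reaction
    vocabularies: dict mapping each role to the ordered list of known labels
    """
    vector = []
    for role, labels in vocabularies.items():
        vector.extend(1 if components[role] == label else 0 for label in labels)
    return vector


# Hypothetical component labels, for illustration only.
vocabularies = {
    "aryl_halide": ["ArCl", "ArBr", "ArI"],
    "base": ["base_1", "base_2"],
    "additive": ["add_1", "add_2", "add_3", "add_4"],
}
reaction = {"aryl_halide": "ArBr", "base": "base_2", "additive": "add_3"}

encoded = one_hot_reaction(reaction, vocabularies)
# encoded == [0, 1, 0, 0, 1, 0, 0, 1, 0]
```

Because each bit only identifies a compound and carries no structural information, a compound that is absent from the vocabulary cannot be encoded, which is why such models serve as a statistical probe rather than a predictive tool for new molecules.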

In contrast to one-hot encoding, the use of physicochemical parameters or descriptors
includes chemical information and therefore allows for the prediction of new
molecules that have not been included in the training set. However, (biased) manual
selection of experimental or calculated descriptors can oversimplify the problem,
which can introduce systematic errors through the loss of seemingly unimportant in-
formation. Moreover, as every molecule and reaction is unique, the selection of a
consistent and general set of physical properties as universal molecular descriptors
is highly challenging.

These problems could potentially be circumvented by applying structural representations,
such as molecular fingerprints, for each individual reaction component
(Figure 1B). In terms of quantitative prediction of reaction outcomes, fingerprint-
based models have so far only been applied for classification of reactions and with
limited success.48 Still, QSAR/QSPR studies suggest that all chemical information
can eventually be traced back to a compound’s (2D) Lewis structure.33 In fact, human
chemical intuition has often been based on understanding and rationalizing 2D
connectivity. However, although human receptivity is limited, machine learning
algorithms bear the potential of identifying new and unknown patterns within
molecular structures.11 Thus, we hypothesized that a concatenation of structural
representations, inspired by QSAR/QSPR studies,49 could as well serve as inputs for
the quantitative modeling of reaction outcomes. Such a model would be both chem-
ically meaningful and readily applicable while at the same time significantly reducing
the computational cost compared with models based on DFT-generated descriptors.

RESULTS AND DISCUSSION


Here, we present a structure-based machine learning platform for property and
reactivity prediction in (organic) chemistry. This approach can be easily adopted
and applied to existing problem sets because it relies only on SMILES representa-
tions of all involved molecules as inputs, which are automatically converted into
the corresponding molecular fingerprints. Specifically, for each molecule an array
of 24 diversely configured fingerprints is generated using the Python package
RDKit.50,51 A concatenation of multiple fingerprint features (MFFs) (one MFF for
each [reaction] component), matched to the observed experimental data, is used
to train a machine learning model, which is eventually capable of predicting proper-
ties or reactivity beyond the training dataset (Figure 2).
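
A minimal sketch of how such a concatenated fingerprint input could be assembled with RDKit is given below. The four fingerprint configurations are illustrative assumptions and not the 24 settings used in this work; `multi_fingerprint` and `reaction_features` are hypothetical helper names, and the two SMILES strings (3-bromopyridine and 4-methylaniline) are arbitrary example inputs.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator as rfg

# Illustrative fingerprint configurations (assumed here); these are NOT the
# 24 configurations of the published MFF.
GENERATORS = [
    rfg.GetMorganGenerator(radius=1, fpSize=512),
    rfg.GetMorganGenerator(radius=2, fpSize=512),
    rfg.GetRDKitFPGenerator(maxPath=5, fpSize=512),
    rfg.GetAtomPairGenerator(fpSize=512),
]


def multi_fingerprint(smiles):
    """Concatenate several differently configured bit fingerprints of one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return np.concatenate([gen.GetFingerprintAsNumPy(mol) for gen in GENERATORS])


def reaction_features(component_smiles):
    """One feature vector per reaction: the per-component vectors joined end to end."""
    return np.concatenate([multi_fingerprint(s) for s in component_smiles])


# 3-bromopyridine and 4-methylaniline as example reaction components.
x = reaction_features(["Brc1cccnc1", "Nc1ccc(C)cc1"])
print(x.shape)  # 2 components x 4 fingerprints x 512 bits
```

The order of the components must be the same for every reaction so that each block of the feature vector always describes the same role.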

A vast number of machine learning algorithms have been developed over the last
decades, which can be loosely categorized into distance-based and non-distance-
based methods.52 Distance-based approaches build on the assumption that similar
input generates similar output, and vice versa. However, in organic chemistry,
structural similarity does not necessarily correlate with similar reactivity. Thus, we
assumed that non-distance-based algorithms, such as random forests or neural net-
works that rely on (complex) decision trees or networks, were best suited for our

Figure 2. The MFF Model


(A) Use of fingerprints as a molecular representation for machine learning models. The MFF can be
applied to diverse problem sets.
(B) Standard workflow for the prediction model based on MFF. Only features are depicted, and
targets are left out for clarity. Model selection is performed by 5-fold nested cross validation (CV).
See also Figures S1 and S2; Tables S1–S16. ML, machine learning.

predictive tool. Preliminary results indicated that both random forest and neural
network algorithms can deliver good regression models via the MFF input. We
selected the random forest model53 for further investigations, as it was shown to
be more robust over a variety of different problem sets and less prone to overfitting,
i.e., a perfect modeling of the training data (R²(train) = 1), which typically results in poor
accuracy for predicting the test set.54 This can be problematic, especially when tack-
ling small (reaction-based) datasets using a large number of features. Although a
high degree of overfitting is observed when using neural networks trained with
the MFF input, the random forest model appears to be less sensitive, which is in
agreement with previous literature reports.55 Further indication is obtained by the
fact that generally no trees with a high tree depth are built (a high tree depth is usu-
ally considered as an indicator for overfitting).54 Despite the seemingly low sensi-
tivity to overfitting, it should be noted that the extrapolative properties of a random
forest beyond the trained target space are inherently limited. Moreover, hyperpara-
meter optimization for model selection of the machine learning algorithms was
based solely on the training data in any case (nested cross-validation, Figure 2B).
The implementation was conducted using the Python package Scikit-learn.56
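
The nested cross-validation scheme can be sketched with scikit-learn as follows. The data are synthetic random bits standing in for fingerprint features, and the hyperparameter grid and number of trees are assumptions chosen for a quick run, not the settings used in this work. The inner loop selects hyperparameters using training folds only, and the outer loop estimates performance on data that played no part in that selection.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic stand-in data: random bits as "fingerprints" and a target that
# depends on the first five bits plus noise.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(120, 64)).astype(float)
y = X[:, :5].sum(axis=1) + rng.normal(0.0, 0.1, size=120)

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)  # model selection
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # performance estimate

# Assumed hyperparameter grid, for illustration only.
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid={"max_depth": [4, None], "max_features": ["sqrt", 1.0]},
    cv=inner_cv,
    scoring="r2",
)

# Each outer fold refits the whole grid search on its own training part.
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="r2")
print(outer_scores.mean())
```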

At the outset of our investigations, we observed that every single fingerprint per-
formed differently for each problem set and that no universally applicable fingerprint
existed (Figure 2A). Even within a single specific dataset, the best-performing finger-
print differed for different train-test splittings. Furthermore, the selection of the best
single fingerprint for the prediction of an unknown (out-of-sample) test set, solely
based on the training data, could not be achieved in many cases.54 To circumvent
this problem, we sought to use an array of multiple fingerprints in order to generate
an input that would provide a more accurate and robust representation of the Lewis
structure. In the context of reactivity prediction, this model did also outperform
learnable representations, such as a graph-convolutional neural network
(GCNN),54,57 which have recently gained a lot of attention.20,28,58

Figure 3. Prediction of Orbital Energies


(A) Calculated HOMO-LUMO gaps as an explicit molecular property. Calculated HOMO and LUMO
geometries shown for 2-fluoro-5-nitroaniline.
(B) Performance evaluation of the MFF model on the QM9 dataset 59 and our group’s inventory.
Correlation measure for ECFP (diameter 4, ECFP4, using a random forest regressor) was published
by von Lilienfeld et al. 60 MFF correlation measures were averaged over five (QM9) and ten (group
inventory) random divisions. Plots are given for the MFF model for one explicit division. ΔE and
MAE (mean absolute error) in kcal/mol.
See also Figures S3–S27; Tables S17 and S18.

Prediction of HOMO-LUMO Gaps


Current machine learning models for the quantitative modeling of reaction outcomes
based on DFT-calculated descriptors require molecules that have at least one struc-
tural motif, atom, or functional group—a reactive center—in common.37,38 Thus, a
model, which was originally designed to predict yields for reactions of aryl halides,
will—without major modifications—not be able to assess, e.g., chiral catalyst systems
(and vice versa). Our fingerprint-based model, however, was developed to be appli-
cable to a variety of organic chemical prediction problems. Therefore, as a first step,
its applicability to a series of structurally different molecules was demonstrated in a
QSPR study. We decided to investigate the highest occupied molecular orbital-lowest
unoccupied molecular orbital (HOMO-LUMO) gap, as obtained from DFT calculations
(Figure 3A). Such predictions of DFT-calculated properties are often performed on
large datasets with more than 100,000 molecules, e.g., the QM9 dataset reported
by von Lilienfeld and co-workers.59 For predicting HOMO-LUMO gaps from this data-
set, the MFF model resulted in lower prediction errors than previously investigated
molecular representations that are only based on the 2D molecular graph (R² =
0.97).60 As reaction-based datasets usually contain significantly fewer data points
(<5,000), we further decided to investigate a database of around 2,900 small organic
molecules from our group’s chemical inventory. Here, we were pleased to find that the
MFF model was still capable of predicting HOMO-LUMO gaps: for ten random 70/30
splits of the abovementioned dataset, an average R² of 0.89 between observed and
predicted HOMO-LUMO gaps was obtained (Figure 3B).
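
The evaluation protocol used here (repeated random 70/30 divisions, scored by R² and MAE) can be sketched in pure Python. This sketch assumes that R² denotes the coefficient of determination; the dataset size of 2,900 is the approximate figure quoted above, and the observed and predicted values are placeholders, not results.

```python
import random


def random_split(n_samples, test_fraction=0.3, seed=0):
    """Return (train_indices, test_indices) for one random division."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Ten random 70/30 divisions of a dataset with 2,900 entries.
splits = [random_split(2900, seed=s) for s in range(10)]

# Placeholder observed and predicted values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
mae = mean_absolute_error(observed, predicted)  # 0.25
r2 = r_squared(observed, predicted)             # 0.9
```

In practice, a model would be trained on each training part, scored on the matching test part, and the ten scores averaged.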

Given that the HOMO-LUMO gap is a property of the overall molecule, the MFF
model seems to represent and compare not only local substructures of a molecule
but also global molecular characteristics. This result served to reinforce the original
hypothesis that (computed) molecular properties can eventually be traced back to
patterns in the (2D) Lewis structure.

It should also be noted that orbital energies have commonly been used as descrip-
tors in MLR and machine learning models.6,37 This indicates that our model should
also be applicable to more complex reactivity-based datasets. More precisely, we

Figure 4. Prediction of Enantioselectivities


(A) Asymmetric N,S-acetal formation using CPA catalysts by Denmark et al.38
(B) Comparison of the original model, a one-hot encoded model as statistical probe and the MFF
model (MAE in kcal/mol given as correlation measure). Plots for the MFF model are given. The MFF
model uses a concatenation of MFF vectors for imine, thiol, and CPA catalyst as input.
See also Figures S28–S55; Table S19.

aimed to investigate diverse datasets of increasing complexity available in the
literature that had successfully been applied in machine learning studies. It should be
emphasized that we do not intend to question the relevance and importance of
the reported models but rather wish to present an alternative approach in order
to provide a simple and readily applicable prediction model.

Prediction of Enantioselectivities
The prediction of stereoselectivities of catalytic reactions has been of great interest to
the chemical community and a major focus of many MLR models.4–10 Recently,
Denmark and co-workers described a machine-learning-based approach for the pre-
diction of enantioselectivity by using CPA catalysts.38 The possibility to predict parts
of these data via MLR models trained with other nucleophiles was later demonstrated
by Sigman and co-workers.61 In the initial work, Denmark et al. chose an asymmetric
N,S-acetal formation as a model reaction. The training set included combinatorial
variations of 43 CPA catalysts, five N-acyl imines, and five thiols, resulting in a total
of 1,075 reactions (Figure 4A). A new steric parameter, the average steric occupancy
(ASO), based on DFT-computed 3D representations of multiple conformers was
developed to represent the catalysts. Weighted grid point occupancies in combina-
tion with calculated electronic parameters were used to train a machine learning
model in order to predict enantioselectivity (ΔΔG in kcal/mol). A distance-based
support vector machine algorithm was found to perform best on a random 600/475 split
of training and test data (<MAE> = 0.152 kcal/mol, average over ten random
divisions).38 When applying a similar random splitting of the dataset, we found that
our model performed with slightly higher accuracy (<MAE> = 0.144 kcal/mol,
average over ten random divisions) by using a random forest algorithm, whereas a
one-hot encoded model as statistical probe resulted in lower correlation
(<MAE> = 0.163 kcal/mol, average over ten random divisions) (Figure 4B). In their
original work, the authors divided the data for out-of-sample prediction into a com-
mon training set, a test set for substrates (sub), a test set for catalysts (cat), and one for

Chem 6, 1379–1390, June 11, 2020 1385


ll
Article

both (sub-cat). The same division was analyzed by using our MFF model, which
showed good accuracy in all three test sets. In particular, the performance for the
most challenging catalyst out-of-sample predictions (cat, sub-cat) stands out.
Although a one-hot encoded model resulted in low correlation measures, the simple
MFF model performed nearly as well as the original complex descriptor model.
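
To put errors in ΔΔG into chemical terms, the standard relation ΔΔG = RT ln(er), with er = (1 + ee)/(1 − ee), converts between enantiomeric excess and the free energy difference. The sketch below assumes a temperature of 298.15 K, which is an assumption made for illustration and not necessarily the temperature of the reactions discussed; the function names are hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def ddg_from_ee(ee, temperature=298.15):
    """Free energy difference (kcal/mol) from enantiomeric excess as a fraction."""
    if not 0.0 <= ee < 1.0:
        raise ValueError("ee must satisfy 0 <= ee < 1")
    er = (1.0 + ee) / (1.0 - ee)  # enantiomeric ratio
    return R_KCAL * temperature * math.log(er)


def ee_from_ddg(ddg, temperature=298.15):
    """Inverse relation: ee = tanh(ddG / (2 R T))."""
    return math.tanh(ddg / (2.0 * R_KCAL * temperature))


ddg_90 = ddg_from_ee(0.90)  # about 1.74 kcal/mol at the assumed 298.15 K
```

Because the relation is logarithmic, a fixed error in ΔΔG corresponds to a smaller error in ee at high selectivity than near the racemic limit.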

Visualization and interpretation of prediction data obtained by machine learning
methods is not straightforward and represents an ongoing challenge in computer
research.62 Accordingly, an in-depth chemical interpretation and comparison of
the structure-based approach with the original descriptor model is difficult and
might not allow general conclusions. However, the observed trends can give hints
on the strengths and weaknesses of each method. The original descriptor model
is mainly focused on representing the CPA catalysts. Thus, a slightly higher accuracy
for predicting catalyst out-of-sample test sets (cat, sub-cat) can be observed. In
contrast, the structure-based MFF model represents all reactants and CPA catalysts
in an equal fashion, which might be the reason for the superior performance
regarding the random and the substrate out-of-sample test sets.

Prediction of Yields
In comparison to stereoselectivities, the quantitative prediction of yields can be
even more demanding, because they are influenced by many parameters and do
not rely on one elementary step alone. In a recent report, Doyle and co-workers
described a machine learning approach to predict reaction performance in C–N
cross coupling reactions.37 The training data, including combinatorial variation of
four reaction components applying an additive-based approach,63 were collected
by using high-throughput experimentation. All possible combinations of 15 aryl ha-
lides, four ligands, three bases, and 23 isoxazole additives were evaluated in a total
of 4,140 reactions (Figure 5A). The molecules were represented by electronic,
atomic, and vibrational descriptors that were extracted from DFT calculations. A va-
riety of regression models was subjected to a random 70/30 split into training and
test data, and a random forest model was found to show the best performance in
predicting product yields (R² = 0.92). Moreover, we found that our simplified model
based on a structure-based input resulted in comparable correlation (Figure 5B,
<R²> = 0.93, average over ten random divisions).

However, such a random 70/30 split of the entire combinatorial data, consisting of
only 46 different molecules, results in a training set that is likely to contain all mole-
cules at least once. Consequently, a one-hot encoded model as statistical probe
showed slightly lower but still very good performance (<R²> = 0.89, average over
ten random divisions).36 The appropriate test to prove the relevance of chemical
features in such models is out-of-sample prediction, i.e., the prediction of reactivity
for molecules that were not included in the training dataset. Thus, the authors split
the isoxazole additives into a variety of representative training and test sets and
could prove good performance of the chemical feature model in these cases.47
The same division for out-of-sample prediction with the MFF as input showed
comparable correlation in three of four test sets.64 In contrast, one-hot encoded models were
less accurate (Figure 5B). As both the original descriptor model and the
structure-based MFF approach gave similar trends regarding the additive test sets, they
seem to represent the isoxazole additives equally well.
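
The difference between a random split and an out-of-sample split can be made concrete with a short pure-Python sketch. The component labels are placeholders, `split_by_additive` is a hypothetical helper, and the choice of six held-out additives is arbitrary and does not reproduce the partition used in the cited work.

```python
from itertools import product

# Placeholder labels for the four varied reaction components.
aryl_halides = [f"aryl_halide_{i}" for i in range(15)]
ligands = [f"ligand_{i}" for i in range(4)]
bases = [f"base_{i}" for i in range(3)]
additives = [f"additive_{i}" for i in range(23)]

# Full combinatorial grid: 15 * 4 * 3 * 23 = 4,140 reactions.
reactions = list(product(aryl_halides, ligands, bases, additives))


def split_by_additive(reactions, held_out):
    """Out-of-sample split: held-out additives never appear in the training set."""
    held_out = set(held_out)
    train = [r for r in reactions if r[3] not in held_out]
    test = [r for r in reactions if r[3] in held_out]
    return train, test


# Arbitrary choice of held-out additives, for illustration only.
train, test = split_by_additive(reactions, additives[-6:])
print(len(reactions), len(train), len(test))
```

In a random 70/30 split of the same grid, almost every molecule appears in both parts, so only a split of this kind tests whether the representation generalizes to unseen structures.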

Prediction of Relative Conversion


As a last—and most demanding—application, we aimed to use an experimental
dataset that had neither been specifically designed nor used for machine learning.

Figure 5. Prediction of Yields


(A) C–N cross coupling reactions of 4-methylaniline with various aryl halides by Doyle and
co-workers.37
(B) Comparison of the original model, a one-hot encoded model as statistical probe and the MFF
model (R² given as correlation measure). Plots for the MFF model are given. The MFF model uses a
concatenation of MFF vectors for aryl halide, Pd-catalyst, base, and isoxazole additive as input.
See also Figures S56–S91; Table S20.

In 2015, Cernak, Dreher et al. performed an automated high-throughput screening
on a nanomole scale in order to find suitable coupling conditions for C–heteroatom
bond forming reactions.40 In a palladium-catalyzed reaction, coupling of one elec-
trophile, 3-bromopyridine, with 16 different nitrogen, oxygen, carbon, phosphorus,
and sulfur nucleophiles was evaluated (Figure 6A). For this, 16 catalysts and six bases
were investigated, giving a total of 1,536 reactions that were carried out on nano-
mole scale by using around 0.2 mg of material per reaction in less than 1 day. The
relative conversion, determined by liquid chromatography-mass spectrometry
(LC-MS) analysis, was used for quantification. This exemplifies a “real-world
problem,” as for unknown complex molecules an accurate yield determination can hardly
be carried out before the synthesis of the compound. Thus, we aimed to directly pre-
dict the relative conversion by using our MFF model in a similar manner to the pre-
viously reported yield prediction. Encouragingly, a random 70/30 split of the dataset
resulted in good correlation for reactivity prediction (Figure 6B, <R²> = 0.76,
average over ten random divisions), whereas a one-hot encoded model as the
statistical probe showed significantly lower performance (<R²> = 0.59, average over
ten random divisions). Moreover, out-of-sample prediction for catalysts could be
achieved. To this end, the data of twelve catalysts were used to predict the reaction
outcomes of the remaining four catalysts (test cat). Although a one-hot encoded
model showed no statistical correlation (R² = 0.17), the MFF model gave
satisfactory performance (R² = 0.64), further underlining its ability to learn from chemical
structures.

In summary, we report a machine-learning-based prediction platform for diverse
applications in organic chemistry. A numerical representation of 2D Lewis structures,
the MFF, was developed, which can be computed efficiently within seconds for a
large number of molecules. Because this model is solely based on the assumption
that reactivity can be directly derived from molecular structures, it should be

Figure 6. Prediction of Reactivity


(A) Nanomole-scale reactivity evaluation of C–heteroatom coupling reactions by Cernak,
Dreher et al.40
(B) Comparison of a one-hot encoded model as statistical probe and the MFF model (R² given as
correlation measure). Plots for the MFF model are given. The MFF model uses a concatenation of
MFF vectors for base, nucleophile, and Pd catalyst as input.
See also Figures S92–S113; Table S21.

transferable to any problem related to (small) organic molecules. In this work, we
demonstrate its applicability to four examples of increasing complexity by using a
random forest algorithm. First, the ability of the MFF input to represent and compare
diverse molecular structures was demonstrated by the prediction of HOMO-LUMO
gaps as an explicit molecular property. Furthermore, prediction of reaction perfor-
mance, enantioselectivities, and yields could be achieved with similar accuracy to es-
tablished descriptor-based models, which rely on problem-oriented parameter se-
lection. It should be noted that the generation of the simple and intuitive
structure-based model is several orders of magnitude faster. Finally, relative conver-
sion was predicted based on a dataset (obtained from high-throughput experimen-
tation), which had not been used for machine learning before. To aid the rapid up-
take of this approach, we provide a readily applicable software tool, and the
development of an extended software package is ongoing in our group. In the light
of rapid development of improved machine learning algorithms and new molecular
representations such as graph-convolutional neural networks, we believe that this
structure-based approach will further encourage synthetic chemists to adopt ma-
chine-learning-based prediction models.
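
As an illustration of how such a structure-based input might be assembled, the sketch below concatenates per-component bit vectors (base, nucleophile, and Pd catalyst, as in Figure 6) and drops columns that are identical across the whole dataset, since constant features cannot influence a random forest. The toy 4-bit vectors and the function names are hypothetical; in practice the bits would come from a fingerprinting toolkit, and this is not the authors' implementation.

```python
from typing import Dict, List, Sequence, Tuple

BitVector = List[int]


def concatenate(components: Sequence[BitVector]) -> BitVector:
    """Join the fingerprints of all reaction components into one feature vector."""
    return [bit for fp in components for bit in fp]


def drop_constant_columns(
    rows: Sequence[BitVector],
) -> Tuple[List[BitVector], List[int]]:
    """Remove features that have the same value in every row.

    Returns the reduced rows and the indices of the columns that were kept.
    """
    n_cols = len(rows[0])
    kept = [j for j in range(n_cols) if len({row[j] for row in rows}) > 1]
    reduced = [[row[j] for j in kept] for row in rows]
    return reduced, kept


# Hypothetical 4-bit fingerprints for each component of three reactions.
reactions: List[Dict[str, BitVector]] = [
    {"base": [1, 0, 0, 1], "nucleophile": [0, 1, 0, 0], "catalyst": [1, 1, 0, 0]},
    {"base": [1, 0, 0, 1], "nucleophile": [0, 0, 0, 1], "catalyst": [1, 0, 0, 1]},
    {"base": [1, 0, 1, 1], "nucleophile": [0, 1, 0, 1], "catalyst": [1, 0, 0, 0]},
]

# One 12-bit input row per reaction, then prune the columns that never vary.
features = [
    concatenate([r["base"], r["nucleophile"], r["catalyst"]]) for r in reactions
]
reduced, kept = drop_constant_columns(features)
```

The reduced rows could then be passed to any regressor, for example a random forest; the same kept indices would have to be applied to new molecules at prediction time so that training and prediction inputs share one column layout.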

DATA AND CODE AVAILABILITY


The entire computational code used within this study is provided as .zip files and is
accessible online at https://zivgitlab.uni-muenster.de/m_kueh11/fp-dm-tool.

SUPPLEMENTAL INFORMATION
Supplemental Information can be found online at
https://doi.org/10.1016/j.chempr.2020.02.017.

ACKNOWLEDGMENTS
Financial support from the Fonds der Chemischen Industrie (fellowship to F.S.), the
Deutsche Forschungsgemeinschaft / SPP2102 (F.S.-K.), and the Leibniz Award (F.G.) is
gratefully acknowledged. The authors thank Tiffany Paulisch, Philipp Pflüger, Dr.
Eloisa Serrano, Dr. Michael Teders, Dr. Michael J. James, Professor Dr. Herbert
Kuchen (all WWU Münster), Dr. Tobias Gensch (University of Utah), and Professor
Dr. Bartosz A. Grzybowski (Polish Academy of Sciences) for helpful discussions. All
computations were carried out using the high-performance computing system
"Palma II" of the WWU Münster.

AUTHOR CONTRIBUTIONS
The underlying concept was developed by all authors. The software package was
developed by M.K. and C.B. and coded by M.K.; all datasets were prepared and
evaluated by F.S., F.S.-K., and M.K.; the final manuscript was prepared by all authors.

DECLARATION OF INTERESTS
The authors declare no competing interests.

Received: October 21, 2019
Revised: January 10, 2020
Accepted: February 20, 2020
Published: March 17, 2020

REFERENCES
1. Davies, I.W. (2019). The digitization of organic synthesis. Nature 570, 175–181.
2. Markó, I.E. (2001). The art of total synthesis. Science 294, 1842–1843.
3. Wender, P.A., and Miller, B.L. (2009). Synthesis at the molecular frontier. Nature 460, 197–201.
4. Sigman, M.S., Harper, K.C., Bess, E.N., and Milo, A. (2016). The development of multidimensional analysis tools for asymmetric catalysis and beyond. Acc. Chem. Res. 49, 1292–1301.
5. Denmark, S.E., Gould, N.D., and Wolf, L.M. (2011). A systematic investigation of quaternary ammonium ions as asymmetric phase-transfer catalysts. Application of quantitative structure activity/selectivity relationships. J. Org. Chem. 76, 4337–4357.
6. Santiago, C.B., Guo, J.Y., and Sigman, M.S. (2018). Predictive and mechanistic multivariate linear regression models for reaction development. Chem. Sci. 9, 2398–2412.
7. Milo, A., Bess, E.N., and Sigman, M.S. (2014). Interrogating selectivity in catalysis using molecular vibrations. Nature 507, 210–214.
8. Harper, K.C., and Sigman, M.S. (2011). Three-dimensional correlation of steric and electronic free energy relationships guides asymmetric propargylation. Science 333, 1875–1878.
9. Milo, A., Neel, A.J., Toste, F.D., and Sigman, M.S. (2015). A data-intensive approach to mechanistic elucidation applied to chiral anion catalysis. Science 347, 737–743.
10. Bess, E.N., Bischoff, A.J., and Sigman, M.S. (2014). Designer substrate library for quantitative, predictive modeling of reaction performance. Proc. Natl. Acad. Sci. USA 111, 14698–14703.
11. Jordan, M.I., and Mitchell, T.M. (2015). Machine learning: trends, perspectives, and prospects. Science 349, 255–260.
12. Lavecchia, A. (2015). Machine-learning approaches in drug discovery: methods and applications. Drug Discov. Today 20, 318–331.
13. Chen, H., Engkvist, O., Wang, Y., Olivecrona, M., and Blaschke, T. (2018). The rise of deep learning in drug discovery. Drug Discov. Today 23, 1241–1250.
14. Coley, C.W., Green, W.H., and Jensen, K.F. (2018). Machine learning in computer-aided synthesis planning. Acc. Chem. Res. 51, 1281–1289.
15. Segler, M.H.S., Preuss, M., and Waller, M.P. (2018). Planning chemical syntheses with deep neural networks and symbolic AI. Nature 555, 604–610.
16. Liu, B., Ramsundar, B., Kawthekar, P., Shi, J., Gomes, J., Luu Nguyen, Q., Ho, S., Sloane, J., Wender, P., and Pande, V. (2017). Retrosynthetic reaction prediction using neural sequence-to-sequence models. ACS Cent. Sci. 3, 1103–1113.
17. Kayala, M.A., Azencott, C.A., Chen, J.H., and Baldi, P. (2011). Learning to predict chemical reactions. J. Chem. Inf. Model. 51, 2209–2222.
18. Wei, J.N., Duvenaud, D., and Aspuru-Guzik, A. (2016). Neural networks for the prediction of organic chemistry reactions. ACS Cent. Sci. 2, 725–732.
19. Coley, C.W., Barzilay, R., Jaakkola, T.S., Green, W.H., and Jensen, K.F. (2017). Prediction of organic reaction outcomes using machine learning. ACS Cent. Sci. 3, 434–443.
20. Coley, C.W., Jin, W., Rogers, L., Jamison, T.F., Jaakkola, T.S., Green, W.H., Barzilay, R., and Jensen, K.F. (2019). A graph-convolutional neural network model for the prediction of chemical reactivity. Chem. Sci. 10, 370–377.
21. Gómez-Bombarelli, R., Wei, J.N., Duvenaud, D., Hernández-Lobato, J.M., Sánchez-Lengeling, B., Sheberla, D., Aguilera-Iparraguirre, J., Hirzel, T.D., Adams, R.P., and Aspuru-Guzik, A. (2018). Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4, 268–276.
22. Elton, D.C., Boukouvalas, Z., Fuge, M.D., and Chung, P.W. (2019). Deep learning for molecular design–a review of the state of the art. Mol. Syst. Des. Eng. 4, 828–849.
23. Ma, J., Sheridan, R.P., Liaw, A., Dahl, G.E., and Svetnik, V. (2015). Deep neural nets as a method for quantitative structure activity relationships. J. Chem. Inf. Model. 55, 263–274.
24. Lo, Y.C., Rensi, S.E., Torng, W., and Altman, R.B. (2018). Machine learning in chemoinformatics and drug discovery. Drug Discov. Today 23, 1538–1546.
25. Weininger, D. (1988). SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36.
26. O'Boyle, N., and Dalke, A. (2018). DeepSMILES: an adaptation of SMILES for use in machine-learning of chemical structures. ChemRxiv. https://doi.org/10.26434/chemrxiv.7097960.v1.
27. Senese, C.L., Duca, J., Pan, D., Hopfinger, A.J., and Tseng, Y.J. (2004). 4D-fingerprints, universal QSAR and QSPR descriptors. J. Chem. Inf. Comput. Sci. 44, 1526–1539.
28. Duvenaud, D., Maclaurin, D., Aguilera-Iparraguirre, J., Gómez-Bombarelli, R., Hirzel, T., Aspuru-Guzik, A., and Adams, R.P. (2015). Convolutional networks on graphs for learning molecular fingerprints. arXiv. https://arxiv.org/abs/1509.09292v2.
29. Myint, K.Z., Wang, L., Tong, Q., and Xie, X.Q. (2012). Molecular fingerprint-based artificial neural networks QSAR for ligand biological activity predictions. Mol. Pharm. 9, 2912–2923.
30. Liu, R., and Zhou, D. (2008). Using molecular fingerprint as descriptors in the QSPR study of lipophilicity. J. Chem. Inf. Model. 48, 542–549.
31. Cereto-Massagué, A., Ojeda, M.J., Valls, C., Mulero, M., Garcia-Vallvé, S., and Pujadas, G. (2015). Molecular fingerprint similarity search in virtual screening. Methods 71, 58–63.
32. Melville, J.L., Burke, E.K., and Hirst, J.D. (2009). Machine learning in virtual screening. Comb. Chem. High Throughput Screen. 12, 332–343.
33. Venkatraman, V., Pérez-Nueno, V.I., Mavridis, L., and Ritchie, D.W. (2010). Comprehensive comparison of ligand-based virtual screening tools against the DUD data set reveals limitations of current 3D methods. J. Chem. Inf. Model. 50, 2079–2093.
34. Rogers, D., and Hahn, M. (2010). Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754.
35. Granda, J.M., Donina, L., Dragone, V., Long, D.L., and Cronin, L. (2018). Controlling an organic synthesis robot with machine learning to search for new reactivity. Nature 559, 377–381.
36. Chuang, K.V., and Keiser, M.J. (2018). Comment on "Predicting reaction performance in C–N cross-coupling using machine learning". Science 362, eaat8603.
37. Ahneman, D.T., Estrada, J.G., Lin, S., Dreher, S.D., and Doyle, A.G. (2018). Predicting reaction performance in C–N cross-coupling using machine learning. Science 360, 186–190.
38. Zahrt, A.F., Henle, J.J., Rose, B.T., Wang, Y., Darrow, W.T., and Denmark, S.E. (2019). Prediction of higher-selectivity catalysts by computer-driven workflow and machine learning. Science 363, eaau5631.
39. Raccuglia, P., Elbert, K.C., Adler, P.D.F., Falk, C., Wenny, M.B., Mollo, A., Zeller, M., Friedler, S.A., Schrier, J., and Norquist, A.J. (2016). Machine-learning-assisted materials discovery using failed experiments. Nature 533, 73–76.
40. Buitrago Santanilla, A., Regalado, E.L., Pereira, T., Shevlin, M., Bateman, K., Campeau, L.C., et al. (2015). Nanomole-scale high-throughput chemistry for the synthesis of complex molecules. Science 347, 49–53.
41. Bédard, A.C., Adamo, A., Aroh, K.C., Russell, M.G., Bedermann, A.A., Torosian, J., Yue, B., Jensen, K.F., and Jamison, T.F. (2018). Reconfigurable system for automated optimization of diverse chemical reactions. Science 361, 1220–1225.
42. Perera, D., Tucker, J.W., Brahmbhatt, S., Helal, C.J., Chong, A., Farrell, W., Richardson, P., and Sach, N.W. (2018). A platform for automated nanomole-scale reaction screening and micromole-scale synthesis in flow. Science 359, 429–434.
43. Macarron, R., Banks, M.N., Bojanic, D., Burns, D.J., Cirovic, D.A., Garyantes, T., Green, D.V.S., Hertzberg, R.P., Janzen, W.P., Paslay, J.W., et al. (2011). Impact of high-throughput screening in biomedical research. Nat. Rev. Drug Discov. 10, 188–195.
44. Awale, M., Sirockin, F., Stiefl, N., and Reymond, J.L. (2019). Medicinal chemistry database GDBMedChem. ChemRxiv. https://doi.org/10.26434/chemrxiv.7770809.v1.
45. Jensen, F. (2017). Introduction to Computational Chemistry (Wiley VCH Verlag).
46. Beker, W., Gajewska, E.P., Badowski, T., and Grzybowski, B.A. (2019). Prediction of major regio-, site-, and diastereoisomers in Diels-Alder reactions by using machine-learning: the importance of physically meaningful descriptors. Angew. Chem. Int. Ed. 58, 4515–4519.
47. Estrada, J.G., Ahneman, D.T., Sheridan, R.P., Dreher, S.D., and Doyle, A.G. (2018). Response to comment on "Predicting reaction performance in C–N cross-coupling using machine learning". Science 362, eaat8763.
48. Skoraczynski, G., Dittwald, P., Miasojedow, B., Szymkuc, S., Gajewska, E.P., Grzybowski, B.A., and Gambin, A. (2017). Predicting the outcomes of organic reactions via machine learning: are current descriptors sufficient? Sci. Rep. 7, 3582.
49. Elton, D.C., Boukouvalas, Z., Butrico, M.S., Fuge, M.D., and Chung, P.W. (2018). Applying machine learning techniques to predict the properties of energetic materials. Sci. Rep. 8, 9059.
50. RDKit: open-source chemoinformatics and machine learning. http://www.rdkit.org.
51. Using a concatenation of 24 different fingerprints results in a rather long bit array which might contain redundant information. Therefore, to increase computational efficiency, features which are identical throughout the whole data set, and therefore cannot influence the predictive performance of the random forest model, are removed prior to training the model. This can decrease the size of the feature array by up to 90% (see the supporting information for further details).
52. Mitchell, J.B.O. (2014). Machine learning methods in chemoinformatics. Wiley Interdiscip. Rev. Comput. Mol. Sci. 4, 468–481.
53. Breiman, L. (2001). Random forests. Mach. Learn. 45, 5–32.
54. See the supporting information for exemplified studies.
55. Svetnik, V., Liaw, A., Tong, C., Culberson, J.C., Sheridan, R.P., and Feuston, B.P. (2003). Random forest: a classification and regression tool for compound classification and QSAR modeling. J. Chem. Inf. Comput. Sci. 43, 1947–1958.
56. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830.
57. Wu, Z., Ramsundar, B., Feinberg, E.N., Gomes, J., Geniesse, C., Pappu, A.S., Leswing, K., and Pande, V. (2018). MoleculeNet: a benchmark for molecular machine learning. Chem. Sci. 9, 513–530.
58. Roszak, R., Beker, W., Molga, K., and Grzybowski, B.A. (2019). Rapid and accurate prediction of pKa values of C–H acids using graph convolutional neural networks. J. Am. Chem. Soc. 141, 17142–17149.
59. Ramakrishnan, R., Dral, P.O., Rupp, M., and von Lilienfeld, O.A. (2014). Quantum chemistry structures and properties of 134 kilo molecules. Sci. Data 1, 140022.
60. Faber, F.A., Hutchison, L., Huang, B., Gilmer, J., Schoenholz, S.S., Dahl, G.E., et al. (2017). Prediction errors of molecular machine learning models lower than hybrid DFT error. J. Chem. Theory Comput. 13, 5255–5264.
61. Reid, J.P., and Sigman, M.S. (2019). Holistic prediction of enantioselectivity in asymmetric catalysis. Nature 571, 343–348.
62. Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer).
63. Collins, K.D., and Glorius, F. (2013). A robustness screen for the rapid assessment of chemical reactions. Nat. Chem. 5, 597–601.
64. The authors performed the selection of the additive test sets after thorough analysis of the results and learning about the most influential parameters for the descriptor-based model.42,47 Thus, these splits are, to some extent, unbalanced, and the comparable performance of the unbiased MFF model on most of these test sets is remarkable.
