Article
A Structure-Based Platform for Predicting Chemical Reactivity
Frederik Sandfort, Felix Strieth-Kalthoff, Marius Kühnemund, Christian Beecks, Frank Glorius
glorius@uni-muenster.de
HIGHLIGHTS
Quantitative modeling of reaction
outcomes via machine learning
Frederik Sandfort,1,4 Felix Strieth-Kalthoff,1,4 Marius Kühnemund,2,3,4 Christian Beecks,2
and Frank Glorius1,5,*
development are typically carried out in a lengthy iterative workflow until good
correlation is achieved. The major strength of this process is that in many cases,
valuable information on the underlying mechanism can be obtained.9,10 Furthermore,
after the successful identification of mechanistically relevant descriptors,
the number of data points required for an MLR prediction model can be comparatively
small. However, this requires a representative set of training data, which has
to be carefully considered.6
In pioneering work, Doyle and co-workers were able to predict the reaction yields of
C–N cross coupling reactions by using a dataset of more than 4,000 reactions.37
Furthermore, Denmark and co-workers could predict enantioselectivities by using
chiral phosphoric acid (CPA) catalysts based on a dataset with more than 1,000
experiments.38 Similar to MLR studies, these methods rely on physically meaningful
parameters (Figure 1B). In a thorough analysis, steric and electronic parameters
are selected on the basis of the underlying mechanism of the transformation and
relevant properties of each reactant. In a next step, these parameters are
Wilhelms-Universität Münster, Einsteinstraße 62, 48149 Münster, Germany
3Institut für Wirtschaftsinformatik, Westfälische Wilhelms-Universität Münster, Leonardo-Campus 3, 48149 Münster, Germany
4These authors contributed equally
5Lead Contact
*Correspondence: glorius@uni-muenster.de
https://doi.org/10.1016/j.chempr.2020.02.017
A vast number of machine learning algorithms have been developed over the last
decades, which can be loosely categorized into distance-based and non-distance-based
methods.52 Distance-based approaches build on the assumption that similar
input generates similar output, and vice versa. However, in organic chemistry,
structural similarity does not necessarily correlate with similar reactivity. Thus, we
assumed that non-distance-based algorithms, such as random forests or neural networks
that rely on (complex) decision trees or networks, were best suited for our
predictive tool. Preliminary results indicated that both random forest and neural
network algorithms can deliver good regression models via the MFF input. We
selected the random forest model53 for further investigations, as it was shown to
be more robust over a variety of different problem sets and less prone to overfitting,
i.e., a perfect modeling of the training data (R²(train) = 1), which typically results in poor
accuracy for predicting the test set.54 This can be problematic, especially when tackling
small (reaction-based) datasets using a large number of features. Although a
high degree of overfitting is observed when using neural networks trained with
the MFF input, the random forest model appears to be less sensitive, which is in
agreement with previous literature reports.55 A further indication is that trees
with a high depth are generally not built (a high tree depth is usually
considered an indicator of overfitting).54 Despite the seemingly low sensitivity
to overfitting, it should be noted that the extrapolative properties of a random
forest beyond the trained target space are inherently limited. Moreover, hyperparameter
optimization for model selection of the machine learning algorithms was
based solely on the training data in all cases (nested cross-validation, Figure 2B).
The implementation was conducted using the Python package Scikit-learn.56
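The nested cross-validation mentioned above can be sketched with Scikit-learn as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature matrix and target are synthetic stand-ins for fingerprint bits and reaction outcomes, and the hyperparameter grid, fold counts, and number of trees are arbitrary choices made only to keep the example small.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for a fingerprint bit matrix (120 samples x 64 bits).
# These are NOT data from the paper.
X = rng.integers(0, 2, size=(120, 64))
# Hypothetical target that depends on a few bits plus noise.
y = 2.0 * X[:, 0] + X[:, 1] * X[:, 2] + rng.normal(0.0, 0.1, size=120)

# Inner loop: hyperparameters are chosen using training folds only.
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
# Outer loop: each held-out fold is used only to score the tuned model.
outer_cv = KFold(n_splits=4, shuffle=True, random_state=1)

search = GridSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=0),
    param_grid={"max_depth": [4, None], "max_features": ["sqrt", 1.0]},  # assumed grid
    cv=inner_cv,
    scoring="r2",
)

# cross_val_score refits the whole grid search inside every outer training fold,
# so no outer test fold ever influences hyperparameter selection.
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="r2")
print(outer_scores.mean())
```

The design point is that the grid search is itself the estimator passed to the outer loop, which keeps model selection strictly inside the training data of each outer fold.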
At the outset of our investigations, we observed that every single fingerprint performed
differently for each problem set and that no universally applicable fingerprint
existed (Figure 2A). Even within a single specific dataset, the best-performing fingerprint
differed for different train-test splits. Furthermore, the selection of the best
single fingerprint for the prediction of an unknown (out-of-sample) test set, solely
based on the training data, could not be achieved in many cases.54 To circumvent
this problem, we sought to use an array of multiple fingerprints in order to generate
an input that would provide a more accurate and robust representation of the Lewis
structure. In the context of reactivity prediction, this model also outperformed
learnable representations, such as a graph-convolutional neural network
(GCNN),54,57 which have recently gained a lot of attention.20,28,58
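The idea of combining several fingerprints into one input can be illustrated with a small sketch. It assumes that each fingerprint has already been computed as a list of bits (for example with RDKit, ref 50). The toy molecules and bit values below are made up, and the helper names are hypothetical. The second helper mirrors the remark in note 51 that features identical throughout the whole dataset are removed before training.

```python
def concatenate_fingerprints(per_fingerprint_bits):
    """Join several bit vectors of one molecule into a single feature vector."""
    combined = []
    for bits in per_fingerprint_bits:
        combined.extend(bits)
    return combined


def drop_constant_columns(rows):
    """Remove columns whose value is identical in every row.

    Returns the reduced rows and the indices of the columns that were kept.
    """
    n_columns = len(rows[0])
    keep = [j for j in range(n_columns) if len({row[j] for row in rows}) > 1]
    reduced = [[row[j] for j in keep] for row in rows]
    return reduced, keep


# Three made-up molecules, each described by two made-up fingerprints
# (4 bits and 3 bits); real fingerprints are far longer.
toy_molecules = [
    [[1, 0, 0, 1], [0, 1, 1]],
    [[1, 1, 0, 1], [0, 0, 1]],
    [[1, 0, 0, 0], [0, 1, 1]],
]

features = [concatenate_fingerprints(m) for m in toy_molecules]
reduced, kept = drop_constant_columns(features)
print(kept)
print(reduced)
```

Constant columns carry no information for a tree-based model, so dropping them shortens the input without changing what the model can learn.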
Given that the HOMO-LUMO gap is a property of the overall molecule, the MFF
model seems to represent and compare not only local substructures of a molecule
but also global molecular characteristics. This result served to reinforce the original
hypothesis that (computed) molecular properties can eventually be traced back to
patterns in the (2D) Lewis structure.
It should also be noted that orbital energies have commonly been used as descriptors
in MLR and machine learning models.6,37 This indicates that our model should
also be applicable to more complex reactivity-based datasets. More precisely, we
Prediction of Enantioselectivities
The prediction of stereoselectivities of catalytic reactions has been of great interest to
the chemical community and a major focus of many MLR models.4–10 Recently,
Denmark and co-workers described a machine-learning-based approach for the prediction
of enantioselectivity by using CPA catalysts.38 The possibility of predicting parts
of these data via MLR models trained with other nucleophiles was later demonstrated
by Sigman and co-workers.61 In the initial work, Denmark et al. chose an asymmetric
N,S-acetal formation as a model reaction. The training set included combinatorial
variations of 43 CPA catalysts, five N-acyl imines, and five thiols, resulting in a total
of 1,075 reactions (Figure 4A). A new steric parameter, the average steric occupancy
(ASO), based on DFT-computed 3D representations of multiple conformers, was
developed to represent the catalysts. Weighted grid point occupancies in combination
with calculated electronic parameters were used to train a machine learning
model in order to predict enantioselectivity (ΔΔG in kcal/mol). A distance-based support
vector machine algorithm was found to perform best on a random 600/475 split
of training and test data (<MAE> = 0.152 kcal/mol, average over ten random
divisions).38 When applying a similar random splitting of the dataset, we found that
our model performed with slightly higher accuracy (<MAE> = 0.144 kcal/mol,
average over ten random divisions) by using a random forest algorithm, whereas a
one-hot encoded model as statistical probe resulted in a higher error
(<MAE> = 0.163 kcal/mol, average over ten random divisions) (Figure 4B). In their
original work, the authors divided the data for out-of-sample prediction into a common
training set, a test set for substrates (sub), a test set for catalysts (cat), and one for
both (sub-cat). The same division was analyzed by using our MFF model, which
showed good accuracy in all three test sets. In particular, the performance for the
most challenging catalyst out-of-sample predictions (cat, sub-cat) stands out.
Although a one-hot encoded model resulted in low correlation measures, the simple
MFF model performed nearly as well as the original complex descriptor model.
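Two quantities in this section lend themselves to a short worked sketch: the free-energy scale used for enantioselectivity, and the mean absolute error averaged over ten random 600/475 divisions. The code below is illustrative only. It assumes the standard relation ΔΔG = RT ln(er) at an assumed temperature of 298.15 K (the reaction temperature is not given in this text), uses integer placeholders instead of the real catalysts and substrates, draws synthetic targets, and substitutes a trivial mean-value baseline for the actual regression model.

```python
import itertools
import math
import random

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def ddg_from_er(er, temperature_k=298.15):
    """Free-energy difference (kcal/mol) for an enantiomeric ratio major/minor.

    The default temperature is an assumption made for this sketch.
    """
    return R_KCAL * temperature_k * math.log(er)


def mean_absolute_error(y_true, y_pred):
    """Average absolute deviation between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def averaged_mae(targets, n_train=600, n_repeats=10):
    """Average test MAE over repeated random splits.

    A mean-value baseline stands in for the regression model here.
    """
    maes = []
    for seed in range(n_repeats):
        order = list(range(len(targets)))
        random.Random(seed).shuffle(order)
        train_idx, test_idx = order[:n_train], order[n_train:]
        baseline = sum(targets[i] for i in train_idx) / n_train
        observed = [targets[i] for i in test_idx]
        maes.append(mean_absolute_error(observed, [baseline] * len(observed)))
    return sum(maes) / n_repeats


# 43 catalysts x 5 imines x 5 thiols, as placeholder indices.
reactions = list(itertools.product(range(43), range(5), range(5)))

# Synthetic selectivities in kcal/mol; NOT the published data.
rng = random.Random(42)
targets = [rng.uniform(0.0, 3.0) for _ in reactions]

print(len(reactions))
print(round(ddg_from_er(19.0), 2))  # er of 95:5
print(round(averaged_mae(targets), 3))
```

With 1,075 reactions and 600 used for training, each repeat leaves 475 reactions for testing, which matches the split size quoted in the text.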
Prediction of Yields
In comparison to stereoselectivities, the quantitative prediction of yields can be
even more demanding, because they are influenced by many parameters and do
not rely on one elementary step alone. In a recent report, Doyle and co-workers
described a machine learning approach to predict reaction performance in C–N
cross coupling reactions.37 The training data, including combinatorial variation of
four reaction components applying an additive-based approach,63 were collected
by using high-throughput experimentation. All possible combinations of 15 aryl halides,
four ligands, three bases, and 23 isoxazole additives were evaluated in a total
of 4,140 reactions (Figure 5A). The molecules were represented by electronic,
atomic, and vibrational descriptors that were extracted from DFT calculations. A variety
of regression models was evaluated on a random 70/30 split into training and
test data, and a random forest model was found to show the best performance in
predicting product yields (R² = 0.92). We found that our simplified model
based on a structure-based input resulted in comparable correlation (Figure 5B,
<R²> = 0.93, average over ten random divisions).
However, such a random 70/30 split of the entire combinatorial data, consisting of
only 46 different molecules, results in a training set that is likely to contain all molecules
at least once. Consequently, a one-hot encoded model as statistical probe
showed slightly lower but still very good performance (<R²> = 0.89, average over
ten random divisions).36 The appropriate test to prove the relevance of chemical
features in such models is out-of-sample prediction, i.e., the prediction of reactivity
for molecules that were not included in the training dataset. Thus, the authors split
the isoxazole additives into a variety of representative training and test sets and
could prove good performance of the chemical feature model in these cases.47
The same division for out-of-sample prediction with MFF as input showed comparable
correlation in three of four test sets.64 In contrast, one-hot encoded models were
less accurate (Figure 5B). As both the original descriptor model and the structure-based
MFF approach gave similar trends regarding the additive test sets, they
seem to represent the isoxazole additives equally well.
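The difference between a random split and an out-of-sample split on this combinatorial dataset can be made concrete with a short sketch. All labels are placeholders rather than the actual molecules, the helper names are hypothetical, and the choice of which additives to hold out is arbitrary and does not reproduce the splits used by the authors. The four varied component lists give 45 placeholder labels, and any further fixed reaction partner is left out of this sketch.

```python
import itertools
import random

# Placeholder labels only; the actual molecules of the screen are not listed here.
components = {
    "aryl_halide": [f"ArX{i}" for i in range(15)],
    "ligand": [f"L{i}" for i in range(4)],
    "base": [f"B{i}" for i in range(3)],
    "additive": [f"A{i}" for i in range(23)],
}
ADDITIVE_SLOT = 3  # position of the additive in each reaction tuple
reactions = list(itertools.product(*components.values()))


def one_hot(reaction):
    """Concatenate one indicator vector per component slot (no chemical information)."""
    vector = []
    for chosen, options in zip(reaction, components.values()):
        vector.extend(1 if option == chosen else 0 for option in options)
    return vector


def random_split(rows, train_percent, seed):
    """Shuffle all reactions and cut them into a training and a test part."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_percent // 100
    return shuffled[:n_train], shuffled[n_train:]


def additive_holdout_split(rows, held_out):
    """Put every reaction that contains a held-out additive into the test set."""
    held_out = set(held_out)
    train = [r for r in rows if r[ADDITIVE_SLOT] not in held_out]
    test = [r for r in rows if r[ADDITIVE_SLOT] in held_out]
    return train, test


def unseen_additives(train, test):
    """Additives that occur in the test set but never in the training set."""
    return {r[ADDITIVE_SLOT] for r in test} - {r[ADDITIVE_SLOT] for r in train}


rand_train, rand_test = random_split(reactions, 70, seed=0)
# Arbitrary illustrative choice of held-out additives.
oos_train, oos_test = additive_holdout_split(reactions, components["additive"][-8:])

print(len(reactions), len(rand_train), len(rand_test))
print(len(unseen_additives(rand_train, rand_test)))
print(len(unseen_additives(oos_train, oos_test)))
```

Under the random split, every additive in the test set has also been seen during training, so a one-hot label is enough to identify it. Under the additive hold-out, the test additives are new to the model, and a one-hot vector has no way to relate them to the training additives.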
SUPPLEMENTAL INFORMATION
Supplemental Information can be found online at https://doi.org/10.1016/j.chempr.2020.02.017.
ACKNOWLEDGMENTS
Financial support by Fonds der Chemischen Industrie (fellowship to F.S.) and Deutsche
Forschungsgemeinschaft / SPP2102 (F.S.-K.) and Leibniz Award (F.G.) is
gratefully acknowledged. The authors thank Tiffany Paulisch, Philipp Pflüger, Dr.
Eloisa Serrano, Dr. Michael Teders, Dr. Michael J. James, Professor Dr. Herbert
Kuchen (all WWU Münster), Dr. Tobias Gensch (University of Utah), and Professor
Dr. Bartosz A. Grzybowski (Polish Academy of Sciences) for helpful discussions. All
computations were carried out using the high-performance computing system
“Palma II” of the WWU Münster.
AUTHOR CONTRIBUTIONS
The underlying concept was developed by all authors. The software package was
developed by M.K. and C.B. and coded by M.K.; all datasets were prepared and
evaluated by F.S., F.S.-K., and M.K.; the final manuscript was prepared by all authors.
DECLARATION OF INTERESTS
The authors declare no competing interests.
REFERENCES
1. Davies, I.W. (2019). The digitization of organic synthesis. Nature 570, 175–181.
2. Markó, I.E. (2001). The art of total synthesis. Science 294, 1842–1843.
3. Wender, P.A., and Miller, B.L. (2009). Synthesis at the molecular frontier. Nature 460, 197–201.
4. Sigman, M.S., Harper, K.C., Bess, E.N., and Milo, A. (2016). The development of multidimensional analysis tools for asymmetric catalysis and beyond. Acc. Chem. Res. 49, 1292–1301.
5. Denmark, S.E., Gould, N.D., and Wolf, L.M. (2011). A systematic investigation of quaternary ammonium ions as asymmetric phase-transfer catalysts. Application of quantitative structure activity/selectivity relationships. J. Org. Chem. 76, 4337–4357.
6. Santiago, C.B., Guo, J.Y., and Sigman, M.S. (2018). Predictive and mechanistic multivariate linear regression models for reaction development. Chem. Sci. 9, 2398–2412.
7. Milo, A., Bess, E.N., and Sigman, M.S. (2014). Interrogating selectivity in catalysis using molecular vibrations. Nature 507, 210–214.
8. Harper, K.C., and Sigman, M.S. (2011). Three-dimensional correlation of steric and electronic free energy relationships guides asymmetric propargylation. Science 333, 1875–1878.
9. Milo, A., Neel, A.J., Toste, F.D., and Sigman, M.S. (2015). A data-intensive approach to mechanistic elucidation applied to chiral anion catalysis. Science 347, 737–743.
10. Bess, E.N., Bischoff, A.J., and Sigman, M.S. (2014). Designer substrate library for quantitative, predictive modeling of reaction performance. Proc. Natl. Acad. Sci. USA 111, 14698–14703.
11. Jordan, M.I., and Mitchell, T.M. (2015). Machine learning: trends, perspectives, and prospects. Science 349, 255–260.
12. Lavecchia, A. (2015). Machine-learning approaches in drug discovery: methods and applications. Drug Discov. Today 20, 318–331.
13. Chen, H., Engkvist, O., Wang, Y., Olivecrona, M., and Blaschke, T. (2018). The rise of deep learning in drug discovery. Drug Discov. Today 23, 1241–1250.
14. Coley, C.W., Green, W.H., and Jensen, K.F. (2018). Machine learning in computer-aided synthesis planning. Acc. Chem. Res. 51, 1281–1289.
15. Segler, M.H.S., Preuss, M., and Waller, M.P. (2018). Planning chemical syntheses with deep neural networks and symbolic AI. Nature 555, 604–610.
16. Liu, B., Ramsundar, B., Kawthekar, P., Shi, J., Gomes, J., Luu Nguyen, Q., Ho, S., Sloane, J., Wender, P., and Pande, V. (2017). Retrosynthetic reaction prediction using neural sequence-to-sequence models. ACS Cent. Sci. 3, 1103–1113.
17. Kayala, M.A., Azencott, C.A., Chen, J.H., and Baldi, P. (2011). Learning to predict chemical reactions. J. Chem. Inf. Model. 51, 2209–2222.
18. Wei, J.N., Duvenaud, D., and Aspuru-Guzik, A. (2016). Neural networks for the prediction of organic chemistry reactions. ACS Cent. Sci. 2, 725–732.
19. Coley, C.W., Barzilay, R., Jaakkola, T.S., Green, W.H., and Jensen, K.F. (2017). Prediction of organic reaction outcomes using machine learning. ACS Cent. Sci. 3, 434–443.
20. Coley, C.W., Jin, W., Rogers, L., Jamison, T.F., Jaakkola, T.S., Green, W.H., Barzilay, R., and Jensen, K.F. (2019). A graph-convolutional neural network model for the prediction of chemical reactivity. Chem. Sci. 10, 370–377.
21. Gómez-Bombarelli, R., Wei, J.N., Duvenaud, D., Hernández-Lobato, J.M., Sánchez-Lengeling, B., Sheberla, D., Aguilera-Iparraguirre, J., Hirzel, T.D., Adams, R.P., and Aspuru-Guzik, A. (2018). Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4, 268–276.
22. Elton, D.C., Boukouvalas, Z., Fuge, M.D., and Chung, P.W. (2019). Deep learning for molecular design–a review of the state of the art. Mol. Syst. Des. Eng. 4, 828–849.
23. Ma, J., Sheridan, R.P., Liaw, A., Dahl, G.E., and Svetnik, V. (2015). Deep neural nets as a method for quantitative structure activity relationships. J. Chem. Inf. Model. 55, 263–274.
24. Lo, Y.C., Rensi, S.E., Torng, W., and Altman, R.B. (2018). Machine learning in chemoinformatics and drug discovery. Drug Discov. Today 23, 1538–1546.
25. Weininger, D. (1988). SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36.
26. O’Boyle, N., and Dalke, A. (2018). DeepSMILES: an adaptation of SMILES for use in machine-learning of chemical structures. ChemRxiv. https://doi.org/10.26434/chemrxiv.7097960.v1.
27. Senese, C.L., Duca, J., Pan, D., Hopfinger, A.J., and Tseng, Y.J. (2004). 4D-fingerprints, universal QSAR and QSPR descriptors. J. Chem. Inf. Comput. Sci. 44, 1526–1539.
28. Duvenaud, D., Maclaurin, D., Aguilera-Iparraguirre, J., Gómez-Bombarelli, R., Hirzel, T., Aspuru-Guzik, A., and Adams, R.P. (2015). Convolutional networks on graphs for learning molecular fingerprints. arXiv. https://arxiv.org/abs/1509.09292v2.
29. Myint, K.Z., Wang, L., Tong, Q., and Xie, X.Q. (2012). Molecular fingerprint-based artificial neural networks QSAR for ligand biological activity predictions. Mol. Pharm. 9, 2912–2923.
30. Liu, R., and Zhou, D. (2008). Using molecular fingerprint as descriptors in the QSPR study of lipophilicity. J. Chem. Inf. Model. 48, 542–549.
31. Cereto-Massagué, A., Ojeda, M.J., Valls, C., Mulero, M., Garcia-Vallvé, S., and Pujadas, G. (2015). Molecular fingerprint similarity search in virtual screening. Methods 71, 58–63.
32. Melville, J.L., Burke, E.K., and Hirst, J.D. (2009). Machine learning in virtual screening. Comb. Chem. High Throughput Screen. 12, 332–343.
33. Venkatraman, V., Pérez-Nueno, V.I., Mavridis, L., and Ritchie, D.W. (2010). Comprehensive comparison of ligand-based virtual screening tools against the DUD data set reveals limitations of current 3D methods. J. Chem. Inf. Model. 50, 2079–2093.
34. Rogers, D., and Hahn, M. (2010). Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754.
35. Granda, J.M., Donina, L., Dragone, V., Long, D.L., and Cronin, L. (2018). Controlling an organic synthesis robot with machine learning to search for new reactivity. Nature 559, 377–381.
36. Chuang, K.V., and Keiser, M.J. (2018). Comment on “Predicting reaction performance in C–N cross-coupling using machine learning”. Science 362, eaat8603.
37. Ahneman, D.T., Estrada, J.G., Lin, S., Dreher, S.D., and Doyle, A.G. (2018). Predicting reaction performance in C–N cross-coupling using machine learning. Science 360, 186–190.
38. Zahrt, A.F., Henle, J.J., Rose, B.T., Wang, Y., Darrow, W.T., and Denmark, S.E. (2019). Prediction of higher-selectivity catalysts by computer-driven workflow and machine learning. Science 363, eaau5631.
39. Raccuglia, P., Elbert, K.C., Adler, P.D.F., Falk, C., Wenny, M.B., Mollo, A., Zeller, M., Friedler, S.A., Schrier, J., and Norquist, A.J. (2016). Machine-learning-assisted materials discovery using failed experiments. Nature 533, 73–76.
40. Buitrago Santanilla, A., Regalado, E.L., Pereira, T., Shevlin, M., Bateman, K., Campeau, L.C., et al. (2015). Nanomole-scale high-throughput chemistry for the synthesis of complex molecules. Science 347, 49–53.
41. Bédard, A.C., Adamo, A., Aroh, K.C., Russell, M.G., Bedermann, A.A., Torosian, J., Yue, B., Jensen, K.F., and Jamison, T.F. (2018). Reconfigurable system for automated optimization of diverse chemical reactions. Science 361, 1220–1225.
42. Perera, D., Tucker, J.W., Brahmbhatt, S., Helal, C.J., Chong, A., Farrell, W., Richardson, P., and Sach, N.W. (2018). A platform for automated nanomole-scale reaction screening and micromole-scale synthesis in flow. Science 359, 429–434.
43. Macarron, R., Banks, M.N., Bojanic, D., Burns, D.J., Cirovic, D.A., Garyantes, T., Green, D.V.S., Hertzberg, R.P., Janzen, W.P., Paslay, J.W., et al. (2011). Impact of high-throughput screening in biomedical research. Nat. Rev. Drug Discov. 10, 188–195.
44. Awale, M., Sirockin, F., Stiefl, N., and Reymond, J.L. (2019). Medicinal chemistry database GDBMedChem. ChemRxiv. https://doi.org/10.26434/chemrxiv.7770809.v1.
45. Jensen, F. (2017). Introduction to Computational Chemistry (Wiley VCH Verlag).
46. Beker, W., Gajewska, E.P., Badowski, T., and Grzybowski, B.A. (2019). Prediction of major regio-, site-, and diastereoisomers in Diels-Alder reactions by using machine-learning: the importance of physically meaningful descriptors. Angew. Chem. Int. Ed. 58, 4515–4519.
47. Estrada, J.G., Ahneman, D.T., Sheridan, R.P., Dreher, S.D., and Doyle, A.G. (2018). Response to comment on “Predicting reaction performance in C–N cross-coupling using machine learning”. Science 362, eaat8763.
48. Skoraczynski, G., Dittwald, P., Miasojedow, B., Szymkuc, S., Gajewska, E.P., Grzybowski, B.A., and Gambin, A. (2017). Predicting the outcomes of organic reactions via machine learning: are current descriptors sufficient? Sci. Rep. 7, 3582.
49. Elton, D.C., Boukouvalas, Z., Butrico, M.S., Fuge, M.D., and Chung, P.W. (2018). Applying machine learning techniques to predict the properties of energetic materials. Sci. Rep. 8, 9059.
50. RDKit: open-source chemoinformatics and machine learning. http://www.rdkit.org.
51. Using a concatenation of 24 different fingerprints results in a rather long bit array which might contain redundant information. Therefore, to increase computational efficiency, features which are identical throughout the whole data set, and therefore cannot influence the predictive performance of the random forest model, are removed prior to training the model. This can decrease the size of the feature array by up to 90% (see the supporting information for further details).
52. Mitchell, J.B.O. (2014). Machine learning methods in chemoinformatics. Wiley Interdiscip. Rev. Comput. Mol. Sci. 4, 468–481.
53. Breiman, L. (2001). Random forests. Mach. Learn. 45, 5–32.
54. See the supporting information for exemplified studies.
55. Svetnik, V., Liaw, A., Tong, C., Culberson, J.C., Sheridan, R.P., and Feuston, B.P. (2003). Random forest: a classification and regression tool for compound classification and QSAR modeling. J. Chem. Inf. Comput. Sci. 43, 1947–1958.
56. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830.
57. Wu, Z., Ramsundar, B., Feinberg, E.N., Gomes, J., Geniesse, C., Pappu, A.S., Leswing, K., and Pande, V. (2018). MoleculeNet: a benchmark for molecular machine learning. Chem. Sci. 9, 513–530.
58. Roszak, R., Beker, W., Molga, K., and Grzybowski, B.A. (2019). Rapid and accurate prediction of pKa values of C–H acids using graph convolutional neural networks. J. Am. Chem. Soc. 141, 17142–17149.
59. Ramakrishnan, R., Dral, P.O., Rupp, M., and von Lilienfeld, O.A. (2014). Quantum chemistry structures and properties of 134 kilo molecules. Sci. Data 1, 140022.
60. Faber, F.A., Hutchison, L., Huang, B., Gilmer, J., Schoenholz, S.S., Dahl, G.E., et al. (2017). Prediction errors of molecular machine learning models lower than hybrid DFT error. J. Chem. Theory Comput. 13, 5255–5264.
61. Reid, J.P., and Sigman, M.S. (2019). Holistic prediction of enantioselectivity in asymmetric catalysis. Nature 571, 343–348.
62. Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer).
63. Collins, K.D., and Glorius, F. (2013). A robustness screen for the rapid assessment of chemical reactions. Nat. Chem. 5, 597–601.
64. The authors performed the selection of the additive test sets after thorough analysis of the results and learning about the most influential parameters for the descriptor-based model.42,47 Thus, these splits are, to some extent, unbalanced, and the comparable performance of the unbiased MFF model on most of these test sets is remarkable.