(W-10321) Uncertainty Quantification in Molecular Simulations With Dropout Neural Network Potentials
AFFILIATIONS
1 Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, 518055 Shenzhen, China
2 Department of Mathematics and Computer Science, Freie Universität Berlin, Arnimallee 6, 14195 Berlin, Germany
3 Beijing National Laboratory for Molecular Sciences, College of Chemistry and Molecular Engineering, Peking University, 100871 Beijing, China
4 Beijing Advanced Innovation Center for Genomics, Peking University, 100871 Beijing, China
5 Biomedical Pioneering Innovation Center, Peking University, 100871 Beijing, China
a) Authors to whom correspondence should be addressed: yangyi@szbl.ac.cn and gaoyq@pku.edu.cn
ABSTRACT

I. INTRODUCTION

Molecular simulations, particularly all-atom and ab initio molecular dynamics (MD), have furthered our understanding of many chemical and bio-physical processes.1,2 In molecular simulations, interactions between particles (e.g., atoms, residues, or molecules) are described by a (potential) energy function U(R) of configuration R. To investigate such systems, one is often not interested in the exact potential or kinetic energy but in the free energy (or the equilibrium distribution) of some reduced descriptors, e.g., collective variables (CV)3 or CG variables,4 s(R),

  p(s) = [∫ e^(−βU(R)) δ(s − s(R)) dR] / [∫ e^(−βU(R)) dR] = e^(−βF(s)) / Z,  (1)

  F(s) = −(1/β) [log p(s) + log Z],  (2)

where Z = ∫ e^(−βU(R)) dR is the partition function, β is the inverse temperature, and δ denotes the Dirac-delta function. Equation (2) holds up to an arbitrary additive constant. Usually, s is selected to be slowly changing variables governing the process of interest, and the rest of the degrees of freedom can be treated in a mean-field fashion.5,6 Under this setting, F(s) becomes a CG description of the original thermodynamic system, and simulations performed under F(s) are generally much faster than those run on the FG potential [i.e., U(R)] because Dim(s) ≪ Dim(R) [where Dim(⋅) denotes the dimensionality], despite the loss of finer details. Therefore, Eqs. (1) and (2) are also known as the principle of thermodynamic consistency for coarse graining.5 However, practical implementation of Eqs. (1) and (2) is hindered by two critical issues: (1) How can one approximate a reliable analytical form for the CG potential F(s) given access to samples drawn from U(R)? (2) How can one draw equilibrium samples from U(R), which are further used to infer F(s)? The former is known as the coarse-graining problem, while the latter is known as the importance sampling problem, and both are of particular importance in physics, chemistry, and biology.

extension to the well-established energy-based CG methods, including the inversed Monte Carlo and relative entropy approaches, and delivered a new solution to the sampling problem in the EBMs by virtue of deep learning. On the other hand, from the aspect of machine learning, our approach generalizes the well-known Generative Adversarial Networks (GANs),28 and we overcome some severe issues (e.g., mode-dropping and inability for density estimation) of GANs. Furthermore, based on reinforcement learning, our approach allows simulations on different scales to benefit from each other, so that the inferred CG potential can, in turn, help enhance the sampling of the FG model in a reinforced manner.
Conventionally, if s is low-dimensional [say, Dim(s) ≤ 3], non-parametric methods such as kernel density estimation (KDE)7 can be adopted to infer F(s). However, due to the "curse of dimensionality,"8 they quickly become infeasible as Dim(s) increases. For instance, the number of samples required by KDE to achieve convergence is known to grow exponentially as Dim(s) increases.9
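For a concrete sense of how KDE turns samples into a free-energy estimate via Eq. (2), the following one-dimensional sketch may help; the grid, bandwidth, and Gaussian test density are illustrative choices, not taken from the paper:

```python
import numpy as np

def kde_free_energy(samples, grid, bandwidth, beta=1.0):
    """Estimate F(s) = -(1/beta) log p(s) from samples via Gaussian KDE.

    The additive constant log Z is dropped, consistent with Eq. (2)
    holding only up to an arbitrary additive constant.
    """
    diff = (grid[:, None] - samples[None, :]) / bandwidth
    # Gaussian kernel density estimate evaluated on the grid
    p = np.exp(-0.5 * diff**2).sum(axis=1) / (
        len(samples) * bandwidth * np.sqrt(2.0 * np.pi)
    )
    return -np.log(p) / beta

rng = np.random.default_rng(0)
s = rng.normal(0.0, 1.0, size=50_000)   # samples from a known unit Gaussian
grid = np.linspace(-2.0, 2.0, 81)
F = kde_free_energy(s, grid, bandwidth=0.15)
F -= F.min()                            # fix the arbitrary constant
# for a unit Gaussian, F(s) - F(min) should be close to s**2 / 2
```

In higher dimensions the same estimator needs exponentially more samples, which is exactly the curse of dimensionality the text refers to.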
Artificial neural networks (ANNs) and deep learning may offer extra flexibility and expressivity to this end.10,11 For instance, several recent studies proposed supervised learning approaches (e.g., force-matching12,13) to fit F(s) by ANNs. However, computing the mean force F(s) requires a large number of samples from U(R) in the neighborhood of s, which may be computationally prohibitive if Dim(s) is large. Worse still, in many cases (particularly in experiments), measurement of the mean forces is hardly practical, thus negating the possibility of fitting F(s) as a regression problem.

An alternative view of inferring p(s) is provided by statistical modeling and machine learning, where density estimation of high-dimensional data has been a long-standing goal.11,14 Particu-

II. METHODS

A. Variational adversarial density estimation (VADE)

We propose a deep learning approach to approximate the (possibly high-dimensional) FES, F(s), by using parametric models. Specifically, we denote the approximate free energy function and the associated probability distribution as Fθ(s) and pθ(s) ∝ exp(−βFθ(s)), respectively (where θ are optimizable parameters), and define a strict divergence D(p∣∣pθ) between p and pθ. A strict divergence, including but not limited to the Kullback–Leibler divergence DKL(p∣∣pθ)29 and the Earth Mover's distance DW(p∣∣pθ),30 satisfies the condition that D(p∣∣pθ) ≥ 0, where the equality holds if and only if p = pθ; hence, it can be used as a variational objective.26,31,32
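A strict divergence such as DKL(p∣∣pθ) can be minimized with the generic energy-based-model gradient, β(⟨∇θFθ⟩_p − ⟨∇θFθ⟩_pθ), estimated from positive (data) samples and negative samples drawn from the current model. The paper's actual objective [Eq. (3)] is not reproduced in this excerpt, so the sketch below uses that generic estimator with an assumed quadratic energy Fθ(s) = θ s²/2 and Langevin negative sampling:

```python
import numpy as np

rng = np.random.default_rng(1)
beta = 1.0

def dF_dtheta(s):
    """For the assumed energy F_theta(s) = 0.5 * theta * s**2."""
    return 0.5 * s**2

def langevin_negatives(theta, n=1000, steps=150, dt=0.05):
    """Negative samples from p_theta ∝ exp(-beta*F_theta) via overdamped Langevin."""
    s = rng.normal(0.0, 1.0, size=n)
    for _ in range(steps):
        s += -dt * theta * s + np.sqrt(2.0 * dt / beta) * rng.normal(size=n)
    return s

# "FG" data with variance 2, so the optimal theta is 1/2
data = rng.normal(0.0, np.sqrt(2.0), size=20_000)
theta = 2.0
for _ in range(200):
    neg = langevin_negatives(theta)
    # KL gradient: beta * ( <dF/dtheta>_data - <dF/dtheta>_model )
    grad = beta * (dF_dtheta(data).mean() - dF_dtheta(neg).mean())
    theta -= 0.05 * grad
# theta should approach 0.5, matching the data distribution
```

The stationary point is reached when the model's negative samples reproduce the data statistics that enter the gradient, which is the adversarial balance the method name alludes to.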
FIG. 1. Workflow of variational adversarial density estimation (VADE). (a) VADE: the free energy approximator F θ simultaneously incorporates positive samples [s ∼ pFG (s)]
and negative samples [s ∼ pθ (s)]. The negative samples are generated by Monte Carlo (MC) or Langevin Dynamics (LD) simulations over F θ . Red arrows indicate the
direction of data flow or the computational graph. (b) VADE with a neural sampler (VADE-NS): identical to VADE, except that the negative samples are drawn by using a
neural sampler fψ . (c) Reinforced VADE (RE-VADE): Given simulation samples from fine-grained (FG) models (pFG ), a coarse-grained (CG) potential F θ can be approximated
through VADE. In turn, a target distribution pT (s;θ) can be defined based on F θ , according to which a bias potential V ϕ can be variationally optimized to boost the FG
simulation.
low-dimensional, the CG simulations are usually computationally economic and converge relatively fast; hence, brute-force simulations via Monte Carlo (MC) or Langevin dynamics (LD) generally suffice.

distribution q(z), we can compute qψ(s) according to the change-of-variable formula [Eq. (5)],18

  log qψ(s) = log q(z) − log∣det(∂fψ/∂z)∣,  (5)
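The change-of-variable formula in Eq. (5) can be checked on the simplest possible bijector, an affine map standing in for the neural sampler fψ (a and b below are hypothetical flow parameters, not from the paper):

```python
import numpy as np

a, b = 2.0, -1.0   # hypothetical parameters of an affine "flow" f_psi(z) = a*z + b

def log_q_base(z):
    """Standard-normal base density q(z)."""
    return -0.5 * z**2 - 0.5 * np.log(2.0 * np.pi)

def sample_with_log_density(n, rng):
    z = rng.normal(size=n)
    s = a * z + b                          # push z through f_psi
    # Eq. (5): log q_psi(s) = log q(z) - log|det(d f_psi / d z)|
    log_q = log_q_base(z) - np.log(abs(a))
    return s, log_q

rng = np.random.default_rng(0)
s, log_q = sample_with_log_density(5, rng)
# analytic check: s is Normal(b, a**2), so both expressions must agree
analytic = -0.5 * ((s - b) / a) ** 2 - 0.5 * np.log(2.0 * np.pi * a**2)
```

For a deep flow, the log-determinant term is accumulated layer by layer, but the bookkeeping is identical.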
and optimizing Fθ directly through Eq. (3). More details about the training and model choices for VADE-NS can be found in the supplementary material, Part I, Sec. II.

Equations (3), (6), and (7) are the main optimization objectives of VADE. Equation (3) dictates the nature of VADE to be energy-based; thus, one can account for the invariance of molecular systems and pre-condition on various physics restraints by designing proper functional forms for the VADE potential Fθ. Compared to other energy-based CG approaches, VADE is able to deal with high-dimensional input without suffering the sampling issues, owing to the neural sampler. Moreover, implementation of Eqs. (3), (6), and (7) requires only samples of molecular configurations, although additional information such as forces could also be incorporated [Eq. (S10)]. This attribute makes VADE particularly useful in the many CG cases where no information other than molecular configurations, such as forces or dynamics, is accessible. In contrast, force-matching and dynamics-based CG approaches would fail in these cases.

C. Reinforced VADE (RE-VADE)

In a more common setting where FG samples (pFG) are not available beforehand, we have to perform sampling over U(R) from scratch. Since Dim(R) is usually very large, i.e., Dim(R) ≫ Dim(s), we can model Fθ(s) with deep bijective models for simplicity [Eq. (8)]. Besides, to attack the issue of irregular gradients while maximally harnessing the expressivity of ANNs as energy functions, the architecture of the ANNs should be carefully designed, and special regularization techniques may be needed to smooth the gradients (see the supplementary material, Part I, Sec. IV, for more details).

where γ > 1 is known as the well-tempered factor, so that the less visited regions are more encouraged by the target distribution. We then aim to optimize a bias potential Vϕ(s), which can drive the sampling toward pT, and this can be done by minimizing a strict divergence, for instance, DKL(pT∣∣pϕ),

  ∇ϕ DKL(pT∣∣pϕ) = ⟨β∇ϕ Vϕ(s)⟩_pT − ⟨β∇ϕ Vϕ(s)⟩_pϕ,  (10)

where pϕ denotes the Boltzmann distribution induced by (U + Vϕ), which can be approximated through MD sampling. The separate parameterization allows us to use an imbalanced learning schedule for the free energy function Fθ and the bias potential Vϕ. Particularly, we can train Fθ(s) at a higher rate based on the latest samples generated by the biased MD simulations and update Vϕ(s) in a more conservative manner. In terms of imitation learning, Fθ(s) plays the role of a leader that coins a moving target based on the current density estimation, while Vϕ(s) learns to tune the policy in order to catch up [Fig. 1(c)]. A noteworthy advantage of such a "leader–chaser" scheme lies in the fact that Fθ, along with pT(s;θ), is constructed based on the samples drawn from pϕ (i.e., MD simulations under Vϕ), so pT and pϕ always share substantial overlap; otherwise, DKL would fall victim to the notorious vanishing gradient issue.30 Such
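The gradient in Eq. (10) needs only samples from the target pT and from the biased ensemble pϕ. The sketch below applies it to a toy one-dimensional double well with a Gaussian-basis bias; the basis, step sizes, and the uniform stand-in for pT are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(2)
beta = 1.0
centers = np.linspace(-1.5, 1.5, 7)
sigma = 0.5

def dU(s):
    """Toy double-well U(s) = 8*(s**2 - 1)**2; this is its derivative."""
    return 32.0 * s * (s**2 - 1.0)

def basis(s):
    """g_i(s) = dV_phi/dphi_i for the linear bias V_phi(s) = sum_i phi_i g_i(s)."""
    return np.exp(-0.5 * ((s[:, None] - centers[None, :]) / sigma) ** 2)

def dV(s, phi):
    """Derivative of V_phi with respect to s."""
    g = basis(s)
    return (g * (-(s[:, None] - centers[None, :]) / sigma**2)) @ phi

def sample_biased(phi, n=1000, steps=300, dt=0.005):
    """Overdamped Langevin under U + V_phi, started from the two wells."""
    s = rng.choice([-1.0, 1.0], size=n) + 0.05 * rng.normal(size=n)
    for _ in range(steps):
        s += -dt * (dU(s) + dV(s, phi)) + np.sqrt(2.0 * dt / beta) * rng.normal(size=n)
    return s

phi = np.zeros(len(centers))
for _ in range(30):
    pos = rng.uniform(-1.5, 1.5, size=1000)   # samples from a uniform target p_T
    neg = sample_biased(phi)                  # samples from p_phi
    grad = beta * (basis(pos).mean(axis=0) - basis(neg).mean(axis=0))  # Eq. (10)
    phi -= 1.0 * grad
# after training, V_phi should have filled in the barrier at s = 0
```

Because both expectations are estimated from the latest samples, the update naturally keeps pT and pϕ overlapping, which is the point the "leader–chaser" discussion makes.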
but also appears as smooth as the original PES, thanks to the gradient regularizations we imposed over Fθ (see the supplementary material for more details). We also showed the negative samples generated by MC sampling over Fθ* [Fig. 2(c)], which, indeed, cannot be visually distinguished from those of the training set [Fig. 2(b)].

Furthermore, we chose a continuous normalizing-flow model with a free-form Jacobian40 as the neural sampler and performed VADE-NS over the same training set (see the supplementary material, Part III, Sec. I, for more details about the model setup and training). For a fair comparison, VADE-NS was trained under the same settings as VADE. Tracking the optimization process, we found that DKL between pθ (yielded by VADE-NS) and pFG also quickly diminished and even dropped more rapidly than for VADE in the early stage, signifying a higher initial training efficiency [Fig. 2(f)]. However, as the training proceeded, VADE-NS did not reach the same minimum level in DKL as achieved by VADE, possibly due to the insufficient expressivity commonly suffered by deep bijective models. Intriguingly, the distributions of Fθ (yielded by VADE-NS) over positive and negative samples still overlapped well [Fig. 2(e), panel 2], indicating that VADE-NS also found a minimum of the optimization problem in Eq. (3), but the solution is sub-optimal due to the insufficient variational flexibility of the bijective model. The optimized Fθ*(x, y) obtained by VADE-NS is plotted in Fig. 2(d). Fθ* obtained by VADE-NS was not as smooth as the model obtained by VADE because it is non-trivial to impose the gradient regularizations over the bijective model in

B. Coarse graining a mini-protein by VADE based on contributed simulation data

Particle-based CG models are widely adopted in bio-molecular modeling.55 Conventionally, such models are premised on some empirical forms of force fields and fitted with respect to experimental and/or high-level calculation data. Since the force fields are generally designed to reflect certain physics restraints, the number of parameters is relatively limited and bottlenecks the expressivity and flexibility of the resulting models. In contrast, it is also possible to construct a CG potential entirely through an ANN in order to improve the expressivity; however, the model would require an excessive amount of samples for training and might not generalize well due to the large parameter space. VADE can help combine the best of the two worlds if the CG potential takes a hybrid form [Eq. (11)],

  Fθ(s) = FΘ(s) + Fprior(s),  (11)

where FΘ(s) is a trainable parametric model, while Fprior(s) is a force-field-like term accounting for some a priori knowledge or physics restraints. Equation (11) can be interpreted from two perspectives: On the one hand, FΘ(s) can be regarded as a correction term for the traditional force fields. On the other hand, Fprior(s) serves as a reasonable prior distribution, which effectively restricts the hypothesis space of Fθ, thus expediting the training.

We tested this idea on the mini-protein chignolin using long all-
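The division of labor in Eq. (11) can be sketched in one dimension: a stiff harmonic term plays Fprior, while a small trainable model learns only the residual. The bond parameters, Gaussian features, and "reference" free energy below are all hypothetical, chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# F_prior: a force-field-like harmonic bond term (hypothetical parameters)
b0, kb = 1.0, 50.0
def F_prior(s):
    return 0.5 * kb * (s - b0) ** 2

def features(s):
    """Gaussian features for the trainable correction F_Theta."""
    c = np.linspace(0.7, 1.3, 8)
    return np.exp(-0.5 * ((s[:, None] - c[None, :]) / 0.1) ** 2)

def F_ref(s):
    """Stand-in 'reference' free energy with an anharmonic part the prior misses."""
    return 0.5 * kb * (s - b0) ** 2 + 20.0 * (s - b0) ** 3

# fit F_Theta to the residual only; the prior already carries the stiff part
s_train = rng.uniform(0.8, 1.2, size=400)
residual = F_ref(s_train) - F_prior(s_train)
w, *_ = np.linalg.lstsq(features(s_train), residual, rcond=None)

def F_theta(s):
    """Eq. (11): hybrid CG potential F_theta = F_Theta + F_prior."""
    return features(s) @ w + F_prior(s)
```

Because the correction only has to represent a small residual, its hypothesis space (eight features here) stays tiny, which is the expedited training the text describes.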
FIG. 3. Coarse-grained chignolin by VADE. (a) FG protein structure of chignolin. The CG particles are the 10 Cα atoms, which are highlighted by colored beads and indexed
accordingly. (b) Distributions of VADE potential F θ for FG MD simulation samples (black solid line) and for CG structures sampled from F θ (red dotted line). (c) 2D FES
spanned by RMSD and Rg of FG MD simulation samples. (d) 2D FES spanned by RMSD and the radius of gyration (Rg) of VADE CG structures. (e) Distributions of the bond
length between Cα 4–Cα 5 (upper panel), the angle between Cα 4–Cα 5–Cα 6 (middle), and the torsion between Cα 4–Cα 5–Cα 6–Cα 7 (bottom). Black histograms correspond to
FG samples and white histograms to VADE samples. Statistics were performed over 10 000 FG MD samples and VADE CG samples, respectively. (f) VADE CG structures
are shown in beads [colored according to Fig. 3(a)] and purple bonds, superimposed with their best-aligned FG counterparts (shown in yellow ribbons) selected from the MD
trajectories.
[Fig. 3(a)]. We compared the distributions of these local struc-

simulation over Fθ(s) to generate negative samples from pθ(s). Vϕ(s)
FIG. 4. RE-VADE sampling of Ala2 in explicit water. (a) Chemical structure of Ala2 and the coarse-graining atomic sites (annotated by green symbols). Φ (red symbols) is
the torsion defined by the C0–N1–Cα–C1 dihedral angle, and Ψ (blue symbols) is the torsion defined by the N1–Cα–C1–N2 dihedral angle. The torsions Φ and Ψ along with
five non-bonded pair-wise distances [d 1 = d(C0 − Cα), d 2 = d(C0 – C1), d 3 = d(C0 – N2), d 4 = d(Cα − N2), and d 5 = d(N1 – N2); d(A − B) denotes the distance between
the A and B atoms] between the coarse-graining sites are collectively chosen as a seven-dimensional CG variable s. (b) Distributions of F θ (s) for all-atom MD simulation
samples [pFG (s), black solid line] and for CG samples generated by using the neural sampler [pθ (s), red dotted line]. (c) The averaged bias force (ABF) over the seven
components of the CG variable s. Statistical uncertainties are shown in bars. (d) 1-ns simulation trajectories projected on torsion Φ. Blue squares correspond to vanilla MD,
and red dots correspond to MD biased by V ϕ . (e) The contour map of V̄ϕ (Φ, Ψ), which is obtained by effectively integrating out other dimensions in the optimized V ϕ (s) (see
the supplementary material for more details). Superimposed black dots are representative samples produced by the enhanced MD simulation under V ϕ . (f) The optimized
VADE potential F θ (s) obtained by RE-VADE was marginalized onto (Φ, Ψ) by effectively integrating out other dimensions in s. Superimposed black dots are random samples
produced by using the neural sampler.
the motion of the corresponding component, which needs to be overcome by the exerted bias forces. As shown in Fig. 4(c), the two torsional angles, (Φ, Ψ), which are known to be slowly chang-

Ala2 with respect to (Φ, Ψ) and quantitatively agrees well with the reference FES [Fig. S3(a)], except that one of the least populated metastable conformations [highlighted by the purple circles in

turn in the folded structure [residues 4–7 in Fig. 3(a)] exhibited slightly larger ABF, implying that the hairpin formation may involve a high free energy barrier, which was overcome by Vϕ(s). Besides, unlike Ala2, where the dominant components were rather localized, almost all of the 18 backbone torsions of chignolin non-negligibly contributed to the roughness of Vϕ in light of the ABF [Fig. 5(d)], suggesting strong non-local correlations and collective motions among these backbone torsions. We then projected the 18-dimensional RE-VADE bias potential Vϕ onto the reduced space Ω(s) [V̄ϕ(Ω(s)) in Fig. 5(g)] and found that the marginalized Vϕ appeared roughly complementary to the reference [Fig. 5(e)], indicating that the collective motions between torsions can be facilitated by the optimized Vϕ; hence, the sampling was expedited. In contrast, in a bias-exchange metadynamics simulation of chignolin, where 18 one-dimensional bias potentials were added to each torsion locally and separately, we did not observe accelerated folding/unfolding dynamics [Fig. S6(b)], in line with the fact that more refined or delicately designed CVs that can capture the collective behavior of the local torsions are required for metadynamics to work.62 This comparison reveals the advantage that, as an enhanced sampling approach, RE-VADE is able to optimize a possibly high-dimensional bias potential directly as a function of the local CG variables (e.g., backbone torsions of proteins) in order to boost rare events governed by the collective motions of these CG variables.

based merely on equilibrium FG samples, thus involving fewer artifacts and less computational cost than existing methods. This feature allows researchers to fully exploit the available atomistic simulations in order to construct transferrable CG models.

We note here that VADE can be viewed as a general extension of GANs, where the VADE potential serves as a "discriminator" to tell the difference between real samples and those faked by the "generator" (i.e., the neural sampler). One merit of VADE over vanilla GANs is that the VADE potential actually yields a density estimate of the input variables, whereas GANs cannot. The mode-dropping issue suffered by GANs can also be avoided through active mode-exploration [Eq. (S12)] in VADE. In addition, VADE is closely connected to some well-established energy-based CG methods developed in statistical physics, and we overcome the sampling issues suffered by these existing EBMs (such as the inversed Monte Carlo and relative entropy approaches) by virtue of the neural sampler. Consequently, VADE can now be exploited to optimize existing CG force fields in a computationally efficient way. The neural samplers presented in this paper are all deep bijective models, which may invoke time-consuming training in practice. Further development of more efficient and even training-free neural samplers is highly desired. Since VADE is energy-based and requires only samples of molecular configurations from the reference distribution, without the need to know the (mean) forces, it allows one to construct CG models or guide atomistic simulations according to experimental observations, which is worthy of investigation. Moreover, RE-VADE allows one to construct CG models even without access to
GANs, derivation of (RE-)VADE training objectives, neural samplers, model setups, and simulation details.

AUTHORS' CONTRIBUTIONS

J.Z., Y.K.L., Y.I.Y., and Y.Q.G. designed the research; J.Z., Y.K.L., and Y.I.Y. performed the research; J.Z., Y.K.L., Y.I.Y., and Y.Q.G. analyzed the data; J.Z., Y.K.L., Y.I.Y., and Y.Q.G. wrote the paper.

The authors declare no conflict of interest.

ACKNOWLEDGMENTS

The authors thank Xing Che, Frank Noé, Lijiang Yang, and Weinan E for useful discussions. This research was supported by the National Natural Science Foundation of China (Grant Nos. 21927901, 21821004, and 21873007 to Y.Q.G.), the National Key Research and Development Program of China (Grant No. 2017YFA0204702 to Y.Q.G.), and the Guangdong Basic and Applied Basic Research Foundation (Grant No. 2019A1515110278 to Y.I.Y.). J.Z. thanks the Alexander von Humboldt Foundation for supporting part of this research.

DATA AVAILABILITY

18 D. J. Rezende and S. Mohamed, "Variational inference with normalizing flows," in International Conference on Machine Learning (ICML, 2015), pp. 1530–1538.
19 M. E. Abbasnejad, Q. Shi, A. v. d. Hengel, and L. Liu, presented at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
20 R. Salakhutdinov, A. Mnih, and G. Hinton, in Proceedings of the 24th International Conference on Machine Learning, 2007.
21 R. Salakhutdinov and G. Hinton, in Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, edited by D. van Dyk and M. Welling (PMLR, Proceedings of Machine Learning Research, 2009), Vol. 5, pp. 448–455.
22 J. J. Hopfield, Proc. Natl. Acad. Sci. U. S. A. 79(8), 2554–2558 (1982).
23 D. Wales, Energy Landscapes: Applications to Clusters, Biomolecules and Glasses (Cambridge University Press, 2003).
24 Y. Du and I. Mordatch, preprint arXiv:1903.08689 (2019).
25 A. P. Lyubartsev, Eur. Biophys. J. 35(1), 53 (2005).
26 M. S. Shell, J. Chem. Phys. 129(14), 144108 (2008).
27 D. Reith, M. Pütz, and F. Müller-Plathe, J. Comput. Chem. 24(13), 1624–1636 (2003).
28 I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, presented at Advances in Neural Information Processing Systems, 2014.
29 S. Kullback and R. A. Leibler, Ann. Math. Stat. 22(1), 79–86 (1951).
30 M. Arjovsky, S. Chintala, and L. Bottou, presented at the International Conference on Machine Learning, 2017.
31 A. Chaimovich and M. S. Shell, J. Chem. Phys. 134(9), 094112 (2011).
32 O. Valsson and M. Parrinello, Phys. Rev. Lett. 113(9), 090601 (2014).
33 J. Zhang, Y. I. Yang, and F. Noé, J. Phys. Chem. Lett. 10(19), 5791–5797 (2019).
34 D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, J. Am. Stat. Assoc. 112(518),
52 L. Bonati, Y.-Y. Zhang, and M. Parrinello, Proc. Natl. Acad. Sci. U. S. A. 116(36), 17641–17647 (2019).
53 P. Tiwary and B. J. Berne, J. Chem. Phys. 147(15), 152701 (2017).
54 S. Singer and J. Nelder, Scholarpedia 4(7), 2928 (2009).
55 S. Kmiecik, D. Gront, M. Kolinski, L. Wieteska, A. E. Dawid, and A. Kolinski, Chem. Rev. 116(14), 7898–7936 (2016).
56 K. Lindorff-Larsen, S. Piana, R. O. Dror, and D. E. Shaw, Science 334(6055), 517–520 (2011).
57 S. Piana and A. Laio, J. Phys. Chem. B 111(17), 4553–4559 (2007).
58 C. M. Davis, S. Xiao, D. P. Raleigh, and R. B. Dyer, J. Am. Chem. Soc. 134(35), 14476–14482 (2012).
59 H. Nguyen, J. Maier, H. Huang, V. Perrone, and C. Simmerling, J. Am. Chem. Soc. 136(40), 13959–13962 (2014).
60 P. Shaffer, O. Valsson, and M. Parrinello, Proc. Natl. Acad. Sci. U. S. A. 113(5), 1150–1155 (2016).
61 J. Zhang, Y.-K. Lei, X. Che, Z. Zhang, Y. I. Yang, and Y. Q. Gao, J. Phys. Chem. Lett. 10(18), 5571–5576 (2019).
62 F. Palazzesi, O. Valsson, and M. Parrinello, J. Phys. Chem. Lett. 8(19), 4752–4756 (2017).
63 H. Wu, A. Mardt, L. Pasquali, and F. Noé, presented at Advances in Neural Information Processing Systems, 2018.
64 H. Sidky, W. Chen, and A. L. Ferguson, Chem. Sci. 11(35), 9459–9467 (2020).
65 J. M. L. Ribeiro, P. Bravo, Y. Wang, and P. Tiwary, J. Chem. Phys. 149(7), 072301 (2018).
66 L. Zhang, H. Wang, and W. E, J. Chem. Phys. 148(12), 124113 (2018).