(W-10321) Uncertainty Quantification in Molecular Simulations With Dropout Neural Network Potentials
AFFILIATIONS
1 Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, 518055 Shenzhen, China
2 Department of Mathematics and Computer Science, Freie Universität Berlin, Arnimallee 6, 14195 Berlin, Germany
3 Beijing National Laboratory for Molecular Sciences, College of Chemistry and Molecular Engineering, Peking University, 100871 Beijing, China
4 Beijing Advanced Innovation Center for Genomics, Peking University, 100871 Beijing, China
5 Biomedical Pioneering Innovation Center, Peking University, 100871 Beijing, China
a) Authors to whom correspondence should be addressed: yangyi@szbl.ac.cn and gaoyq@pku.edu.cn
ABSTRACT

I. INTRODUCTION

Molecular simulations, particularly all-atom and ab initio molecular dynamics (MD), have furthered our understanding of many chemical and bio-physical processes.1,2 In molecular simulations, interactions between particles (e.g., atoms, residues, or molecules) are described by a (potential) energy function U(R) of configuration R. To investigate such systems, one is often not interested in the exact potential or kinetic energy but in the free energy (or the equilibrium distribution) of some reduced descriptors, e.g., collective variables (CV)3 or CG variables,4 s(R),

  p(s) = [∫ e^(−βU(R)) δ(s − s(R)) dR] / [∫ e^(−βU(R)) dR] = e^(−βF(s)) / Z,  (1)

  F(s) = −(1/β) [log p(s) + log Z],  (2)

where Z = ∫ e^(−βU(R)) dR is the partition function, β is the inverse temperature, and δ denotes the Dirac-delta function. Equation (2) holds up to an arbitrary additive constant. Usually, s is selected to be slowly changing variables governing the process of interest, and the rest of the degrees of freedom can be treated in a mean-field fashion.5,6 Under this setting, F(s) becomes a CG description of the original thermodynamic system, and simulations performed under F(s) are generally much faster than those run on the FG potential [i.e., U(R)] because Dim(s) ≪ Dim(R) [where Dim(⋅) denotes the dimensionality], despite the loss of finer details. Therefore, Eqs. (1) and (2) are also known as the principle of thermodynamic consistency for coarse graining.5 However, practical implementation of Eqs. (1) and (2) is hindered by two critical issues: (1) How can one approximate a reliable analytical form for the CG potential F(s) given access to samples drawn from U(R)? (2) How can one draw equilibrium samples from U(R), which are further used to infer F(s)? The former is known as the coarse-graining problem, while the latter is known as the importance sampling problem, and both are of particular importance in physics, chemistry, and biology.

extension to the well-established energy-based CG methods, including the inversed Monte Carlo and relative entropy approaches, and delivered a new solution to the sampling problem in the EBMs by virtue of deep learning. On the other hand, from the aspect of machine learning, our approach generalizes the well-known Generative Adversarial Networks (GANs),28 and we overcome some severe issues (e.g., mode-dropping and inability for density estimation) of GANs. Furthermore, based on reinforcement learning, our approach allows simulations on different scales to benefit from each other, so that the inferred CG potential can, in turn, help enhance the sampling of the FG model in a reinforced manner.
Conventionally, if s is low-dimensional [say, Dim(s) ≤ 3], non-parametric methods such as kernel density estimation (KDE)7 can be adopted to infer F(s). However, due to the "curse of dimensionality,"8 they quickly become infeasible as Dim(s) increases. For instance, the number of samples required by KDE to achieve convergence is known to grow exponentially as Dim(s) increases.9
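For a concrete sense of how KDE turns samples into a free-energy estimate via Eq. (2), the following one-dimensional sketch may help; the grid, bandwidth, and Gaussian test density are illustrative choices, not taken from the paper:

```python
import numpy as np

def kde_free_energy(samples, grid, bandwidth, beta=1.0):
    """Estimate F(s) = -(1/beta) log p(s) from samples via Gaussian KDE.

    The additive constant log Z is dropped, consistent with Eq. (2)
    holding only up to an arbitrary additive constant.
    """
    diff = (grid[:, None] - samples[None, :]) / bandwidth
    # Gaussian kernel density estimate evaluated on the grid
    p = np.exp(-0.5 * diff**2).sum(axis=1) / (
        len(samples) * bandwidth * np.sqrt(2.0 * np.pi)
    )
    return -np.log(p) / beta

rng = np.random.default_rng(0)
s = rng.normal(0.0, 1.0, size=50_000)   # samples from a known unit Gaussian
grid = np.linspace(-2.0, 2.0, 81)
F = kde_free_energy(s, grid, bandwidth=0.15)
F -= F.min()                            # fix the arbitrary constant
# for a unit Gaussian, F(s) - F(min) should be close to s**2 / 2
```

In higher dimensions the same estimator needs exponentially more samples, which is exactly the curse of dimensionality the text refers to.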
Artificial neural networks (ANNs) and deep learning may offer extra flexibility and expressivity to this end.10,11 For instance, several recent studies proposed supervised learning approaches (e.g., force-matching12,13) to fit F(s) by ANNs. However, computing the mean force F(s) requires a large number of samples from U(R) in the neighborhood of s, which may be computationally prohibitive if Dim(s) is large. Worse still, in many cases (particularly in experiments), measurement of the mean forces is hardly practical, thus negating the possibility of fitting F(s) as a regression problem.

An alternative view of inferring p(s) is provided by statistical modeling and machine learning, where density estimation of high-dimensional data has been a long-standing goal.11,14 Particu-

II. METHODS

A. Variational adversarial density estimation (VADE)

We propose a deep learning approach to approximate the (possibly high-dimensional) FES, F(s), by using parametric models. Specifically, we denote the approximate free energy function and the associated probability distribution as Fθ(s) and pθ(s) ∝ exp(−βFθ(s)), respectively (where θ are optimizable parameters), and define a strict divergence D(p∣∣pθ) between p and pθ. A strict divergence, including but not limited to the Kullback–Leibler divergence DKL(p∣∣pθ)29 and the Earth Mover's distance DW(p∣∣pθ),30 satisfies the condition that D(p∣∣pθ) ≥ 0, where the equality holds if and only if p = pθ; hence, it can be used as a variational objective.26,31,32
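A strict divergence such as DKL(p∣∣pθ) can be minimized with the generic energy-based-model gradient, β(⟨∇θFθ⟩_p − ⟨∇θFθ⟩_pθ), estimated from positive (data) samples and negative samples drawn from the current model. The paper's actual objective [Eq. (3)] is not reproduced in this excerpt, so the sketch below uses that generic estimator with an assumed quadratic energy Fθ(s) = θ s²/2 and Langevin negative sampling:

```python
import numpy as np

rng = np.random.default_rng(1)
beta = 1.0

def dF_dtheta(s):
    """For the assumed energy F_theta(s) = 0.5 * theta * s**2."""
    return 0.5 * s**2

def langevin_negatives(theta, n=1000, steps=150, dt=0.05):
    """Negative samples from p_theta ∝ exp(-beta*F_theta) via overdamped Langevin."""
    s = rng.normal(0.0, 1.0, size=n)
    for _ in range(steps):
        s += -dt * theta * s + np.sqrt(2.0 * dt / beta) * rng.normal(size=n)
    return s

# "FG" data with variance 2, so the optimal theta is 1/2
data = rng.normal(0.0, np.sqrt(2.0), size=20_000)
theta = 2.0
for _ in range(200):
    neg = langevin_negatives(theta)
    # KL gradient: beta * ( <dF/dtheta>_data - <dF/dtheta>_model )
    grad = beta * (dF_dtheta(data).mean() - dF_dtheta(neg).mean())
    theta -= 0.05 * grad
# theta should approach 0.5, matching the data distribution
```

The stationary point is reached when the model's negative samples reproduce the data statistics that enter the gradient, which is the adversarial balance the method name alludes to.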
FIG. 1. Workflow of variational adversarial density estimation (VADE). (a) VADE: the free energy approximator F θ simultaneously incorporates positive samples [s ∼ pFG (s)]
and negative samples [s ∼ pθ (s)]. The negative samples are generated by Monte Carlo (MC) or Langevin Dynamics (LD) simulations over F θ . Red arrows indicate the
direction of data flow or the computational graph. (b) VADE with a neural sampler (VADE-NS): identical to VADE, except that the negative samples are drawn by using a
neural sampler fψ . (c) Reinforced VADE (RE-VADE): Given simulation samples from fine-grained (FG) models (pFG ), a coarse-grained (CG) potential F θ can be approximated
through VADE. In turn, a target distribution pT (s;θ) can be defined based on F θ , according to which a bias potential V ϕ can be variationally optimized to boost the FG
simulation.
low-dimensional, the CG simulations are usually computationally economic and converge relatively fast; hence, brute-force simulations via Monte Carlo (MC) or Langevin dynamics (LD) generally suffice.

distribution q(z), we can compute qψ(s) according to the change-of-variable formula [Eq. (5)],18

  log qψ(s) = log q(z) − log∣det(∂fψ/∂z)∣,  (5)
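The change-of-variable formula in Eq. (5) can be checked on the simplest possible bijector, an affine map standing in for the neural sampler fψ (a and b below are hypothetical flow parameters, not from the paper):

```python
import numpy as np

a, b = 2.0, -1.0   # hypothetical parameters of an affine "flow" f_psi(z) = a*z + b

def log_q_base(z):
    """Standard-normal base density q(z)."""
    return -0.5 * z**2 - 0.5 * np.log(2.0 * np.pi)

def sample_with_log_density(n, rng):
    z = rng.normal(size=n)
    s = a * z + b                          # push z through f_psi
    # Eq. (5): log q_psi(s) = log q(z) - log|det(d f_psi / d z)|
    log_q = log_q_base(z) - np.log(abs(a))
    return s, log_q

rng = np.random.default_rng(0)
s, log_q = sample_with_log_density(5, rng)
# analytic check: s is Normal(b, a**2), so both expressions must agree
analytic = -0.5 * ((s - b) / a) ** 2 - 0.5 * np.log(2.0 * np.pi * a**2)
```

For a deep flow, the log-determinant term is accumulated layer by layer, but the bookkeeping is identical.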
and optimizing Fθ directly through Eq. (3). More details about the training and model choices for VADE-NS can be found in the supplementary material, Part I, Sec. II.

Equations (3), (6), and (7) are the main optimization objectives of VADE. Equation (3) dictates the nature of VADE to be energy-based; thus, one can account for the invariance of molecular systems and pre-condition on various physics restraints by designing proper functional forms for the VADE potential Fθ. Compared to other energy-based CG approaches, VADE is able to deal with high-dimensional input without suffering the sampling issues, owing to the neural sampler. Moreover, implementation of Eqs. (3), (6), and (7) requires only samples of molecular configurations, although additional information such as forces could also be incorporated [Eq. (S10)]. This attribute makes VADE particularly useful in the many CG cases where no information other than molecular configurations, such as forces or dynamics, is accessible. In contrast, force-matching and dynamics-based CG approaches would fail in these cases.

C. Reinforced VADE (RE-VADE)

In a more common setting where FG samples (pFG) are not available beforehand, we have to perform sampling over U(R) from scratch. Since Dim(R) is usually very large, i.e., Dim(R) ≫ Dim(s), we can model Fθ(s) with deep bijective models for simplicity [Eq. (8)]. Besides, to attack the issue of irregular gradients while maximally harnessing the expressivity of ANNs as energy functions, the architecture of the ANNs should be carefully designed, and special regularization techniques may be needed to smooth the gradients (see the supplementary material, Part I, Sec. IV, for more details).

where γ > 1 is known as the well-tempered factor, so that the less visited regions are more encouraged by the target distribution. We then aim to optimize a bias potential Vϕ(s), which can drive the sampling toward pT, and this can be done by minimizing a strict divergence, for instance, DKL(pT∣∣pϕ),

  ∇ϕ DKL(pT∣∣pϕ) = ⟨β∇ϕ Vϕ(s)⟩_pT − ⟨β∇ϕ Vϕ(s)⟩_pϕ,  (10)

where pϕ denotes the Boltzmann distribution induced by (U + Vϕ), which can be approximated through MD sampling. The separate parameterization allows us to use an imbalanced learning schedule for the free energy function Fθ and the bias potential Vϕ. Particularly, we can train Fθ(s) at a higher rate based on the latest samples generated by the biased MD simulations and update Vϕ(s) in a more conservative manner. In terms of imitation learning, Fθ(s) plays the role of a leader that coins a moving target based on the current density estimation, while Vϕ(s) learns to tune the policy in order to catch up [Fig. 1(c)]. A noteworthy advantage of such a "leader–chaser" scheme lies in the fact that Fθ, along with pT(s;θ), is constructed based on the samples drawn from pϕ (i.e., MD simulations under Vϕ), so pT and pϕ always share substantial overlap; otherwise, DKL would fall victim to the notorious vanishing gradient issue.30 Such
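The gradient in Eq. (10) needs only samples from the target pT and from the biased ensemble pϕ. The sketch below applies it to a toy one-dimensional double well with a Gaussian-basis bias; the basis, step sizes, and the uniform stand-in for pT are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(2)
beta = 1.0
centers = np.linspace(-1.5, 1.5, 7)
sigma = 0.5

def dU(s):
    """Toy double-well U(s) = 8*(s**2 - 1)**2; this is its derivative."""
    return 32.0 * s * (s**2 - 1.0)

def basis(s):
    """g_i(s) = dV_phi/dphi_i for the linear bias V_phi(s) = sum_i phi_i g_i(s)."""
    return np.exp(-0.5 * ((s[:, None] - centers[None, :]) / sigma) ** 2)

def dV(s, phi):
    """Derivative of V_phi with respect to s."""
    g = basis(s)
    return (g * (-(s[:, None] - centers[None, :]) / sigma**2)) @ phi

def sample_biased(phi, n=1000, steps=300, dt=0.005):
    """Overdamped Langevin under U + V_phi, started from the two wells."""
    s = rng.choice([-1.0, 1.0], size=n) + 0.05 * rng.normal(size=n)
    for _ in range(steps):
        s += -dt * (dU(s) + dV(s, phi)) + np.sqrt(2.0 * dt / beta) * rng.normal(size=n)
    return s

phi = np.zeros(len(centers))
for _ in range(30):
    pos = rng.uniform(-1.5, 1.5, size=1000)   # samples from a uniform target p_T
    neg = sample_biased(phi)                  # samples from p_phi
    grad = beta * (basis(pos).mean(axis=0) - basis(neg).mean(axis=0))  # Eq. (10)
    phi -= 1.0 * grad
# after training, V_phi should have filled in the barrier at s = 0
```

Because both expectations are estimated from the latest samples, the update naturally keeps pT and pϕ overlapping, which is the point the "leader–chaser" discussion makes.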
but also appears as smooth as the original PES, thanks to the gradient regularizations we imposed over Fθ (see the supplementary material for more details). We also showed the negative samples generated by MC sampling over Fθ* [Fig. 2(c)], which, indeed, cannot be visually distinguished from those of the training set [Fig. 2(b)].

Furthermore, we chose a continuous normalizing-flow model with a free-form Jacobian40 as the neural sampler and performed VADE-NS over the same training set (see the supplementary material, Part III, Sec. I, for more details about the model setup and training). For a fair comparison, VADE-NS was trained under the same settings as VADE. Tracking the optimization process, we found that DKL between pθ (yielded by VADE-NS) and pFG also quickly diminished and even dropped more rapidly than for VADE in the early stage, signifying a higher initial training efficiency [Fig. 2(f)]. However, as the training proceeded, VADE-NS did not reach the same minimum level in DKL as achieved by VADE, possibly due to the insufficient expressivity commonly suffered by deep bijective models. Intriguingly, the distributions of Fθ (yielded by VADE-NS) over positive and negative samples still overlapped well [Fig. 2(e), panel 2], indicating that VADE-NS also found a minimum of the optimization problem in Eq. (3), but the solution is sub-optimal due to the insufficient variational flexibility of the bijective model. The optimized Fθ*(x, y) obtained by VADE-NS is plotted in Fig. 2(d). Fθ* obtained by VADE-NS was not as smooth as the model obtained by VADE because it is non-trivial to impose the gradient regularizations over the bijective model in

B. Coarse graining a mini-protein by VADE based on contributed simulation data

Particle-based CG models are widely adopted in bio-molecular modeling.55 Conventionally, such models are premised on some empirical forms of force fields and fitted with respect to experimental and/or high-level calculation data. Since the force fields are generally designed to reflect certain physics restraints, the number of parameters is relatively limited and bottlenecks the expressivity and flexibility of the resulting models. In contrast, it is also possible to construct a CG potential entirely through an ANN in order to improve the expressivity; however, the model would require an excessive amount of samples for training and might not generalize well due to the large parameter space. VADE can help combine the best of the two worlds if the CG potential takes a hybrid form [Eq. (11)],

  Fθ(s) = FΘ(s) + Fprior(s),  (11)

where FΘ(s) is a trainable parametric model, while Fprior(s) is a force-field-like term accounting for some a priori knowledge or physics restraints. Equation (11) can be interpreted from two perspectives: On the one hand, FΘ(s) can be regarded as a correction term for the traditional force fields. On the other hand, Fprior(s) serves as a reasonable prior distribution, which effectively restricts the hypothesis space of Fθ, thus expediting the training.

We tested this idea on the mini-protein chignolin using long all-
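The division of labor in Eq. (11) can be sketched in one dimension: a stiff harmonic term plays Fprior, while a small trainable model learns only the residual. The bond parameters, Gaussian features, and "reference" free energy below are all hypothetical, chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# F_prior: a force-field-like harmonic bond term (hypothetical parameters)
b0, kb = 1.0, 50.0
def F_prior(s):
    return 0.5 * kb * (s - b0) ** 2

def features(s):
    """Gaussian features for the trainable correction F_Theta."""
    c = np.linspace(0.7, 1.3, 8)
    return np.exp(-0.5 * ((s[:, None] - c[None, :]) / 0.1) ** 2)

def F_ref(s):
    """Stand-in 'reference' free energy with an anharmonic part the prior misses."""
    return 0.5 * kb * (s - b0) ** 2 + 20.0 * (s - b0) ** 3

# fit F_Theta to the residual only; the prior already carries the stiff part
s_train = rng.uniform(0.8, 1.2, size=400)
residual = F_ref(s_train) - F_prior(s_train)
w, *_ = np.linalg.lstsq(features(s_train), residual, rcond=None)

def F_theta(s):
    """Eq. (11): hybrid CG potential F_theta = F_Theta + F_prior."""
    return features(s) @ w + F_prior(s)
```

Because the correction only has to represent a small residual, its hypothesis space (eight features here) stays tiny, which is the expedited training the text describes.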
FIG. 3. Coarse-grained chignolin by VADE. (a) FG protein structure of chignolin. The CG particles are the 10 Cα atoms, which are highlighted by colored beads and indexed
accordingly. (b) Distributions of VADE potential F θ for FG MD simulation samples (black solid line) and for CG structures sampled from F θ (red dotted line). (c) 2D FES
spanned by RMSD and Rg of FG MD simulation samples. (d) 2D FES spanned by RMSD and the radius of gyration (Rg) of VADE CG structures. (e) Distributions of the bond
length between Cα 4–Cα 5 (upper panel), the angle between Cα 4–Cα 5–Cα 6 (middle), and the torsion between Cα 4–Cα 5–Cα 6–Cα 7 (bottom). Black histograms correspond to
FG samples and white histograms to VADE samples. Statistics were performed over 10 000 FG MD samples and VADE CG samples, respectively. (f) VADE CG structures
are shown in beads [colored according to Fig. 3(a)] and purple bonds, superimposed with their best-aligned FG counterparts (shown in yellow ribbons) selected from the MD
trajectories.
[Fig. 3(a)]. We compared the distributions of these local struc-

simulation over Fθ(s) to generate negative samples from pθ(s). Vϕ(s)
FIG. 4. RE-VADE sampling of Ala2 in explicit water. (a) Chemical structure of Ala2 and the coarse-graining atomic sites (annotated by green symbols). Φ (red symbols) is
the torsion defined by the C0–N1–Cα–C1 dihedral angle, and Ψ (blue symbols) is the torsion defined by the N1–Cα–C1–N2 dihedral angle. The torsions Φ and Ψ along with
five non-bonded pair-wise distances [d 1 = d(C0 − Cα), d 2 = d(C0 – C1), d 3 = d(C0 – N2), d 4 = d(Cα − N2), and d 5 = d(N1 – N2); d(A − B) denotes the distance between
the A and B atoms] between the coarse-graining sites are collectively chosen as a seven-dimensional CG variable s. (b) Distributions of F θ (s) for all-atom MD simulation
samples [pFG (s), black solid line] and for CG samples generated by using the neural sampler [pθ (s), red dotted line]. (c) The averaged bias force (ABF) over the seven
components of the CG variable s. Statistical uncertainties are shown in bars. (d) 1-ns simulation trajectories projected on torsion Φ. Blue squares correspond to vanilla MD,
and red dots correspond to MD biased by V ϕ . (e) The contour map of V̄ϕ (Φ, Ψ), which is obtained by effectively integrating out other dimensions in the optimized V ϕ (s) (see
the supplementary material for more details). Superimposed black dots are representative samples produced by the enhanced MD simulation under V ϕ . (f) The optimized
VADE potential F θ (s) obtained by RE-VADE was marginalized onto (Φ, Ψ) by effectively integrating out other dimensions in s. Superimposed black dots are random samples
produced by using the neural sampler.
the motion of the corresponding component, which needs to be overcome by the exerted bias forces. As shown in Fig. 4(c), the two torsional angles, (Φ, Ψ), which are known to be slowly chang-

Ala2 with respect to (Φ, Ψ) and quantitatively agrees well with the reference FES [Fig. S3(a)], except that one of the least populated metastable conformations [highlighted by the purple circles in

turn in the folded structure [residues 4–7 in Fig. 3(a)] exhibited slightly larger ABF, implying that the hairpin formation may involve a high free energy barrier, which was overcome by Vϕ(s). Besides, unlike Ala2, where the dominant components were rather localized, almost all of the 18 backbone torsions of chignolin non-negligibly contributed to the roughness of Vϕ in light of the ABF [Fig. 5(d)], suggesting strong non-local correlations and collective motions among these backbone torsions. We then projected the 18-dimensional RE-VADE bias potential Vϕ onto the reduced space Ω(s) [V̄ϕ(Ω(s)) in Fig. 5(g)] and found that the marginalized Vϕ appeared roughly complementary to the reference [Fig. 5(e)], indicating that the collective motions between torsions can be facilitated by the optimized Vϕ; hence, the sampling was expedited. In contrast, in a bias-exchange metadynamics simulation of chignolin, where 18 one-dimensional bias potentials were added to each torsion locally and separately, we did not observe accelerated folding/unfolding dynamics [Fig. S6(b)], in line with the fact that more refined or delicately designed CVs that can capture the collective behavior of the local torsions are required for metadynamics to work.62 This comparison reveals the advantage that, as an enhanced sampling approach, RE-VADE is able to optimize a possibly high-dimensional bias potential directly as a function of the local CG variables (e.g., backbone torsions of proteins) in order to boost rare events governed by the collective motions of these CG variables.

based merely on equilibrium FG samples, thus involving fewer artifacts and less computational cost than existing methods. This feature allows researchers to fully exploit the available atomistic simulations in order to construct transferrable CG models.

We note here that VADE can be viewed as a general extension of GANs, where the VADE potential serves as a "discriminator" to tell the difference between real samples and those faked by the "generator" (i.e., the neural sampler). One merit of VADE over vanilla GANs is that the VADE potential actually yields a density estimate of the input variables, whereas GANs cannot. The mode-dropping issue suffered by GANs can also be avoided through active mode-exploration [Eq. (S12)] in VADE. In addition, VADE is closely connected to some well-established energy-based CG methods developed in statistical physics, and we overcome the sampling issues suffered by these existing EBMs (such as the inversed Monte Carlo and relative entropy approaches) by virtue of the neural sampler. Consequently, VADE can now be exploited to optimize existing CG force fields in a computationally efficient way. The neural samplers presented in this paper are all deep bijective models, which may invoke time-consuming training in practice. Further development of more efficient and even training-free neural samplers is highly desired. Since VADE is energy-based and requires only samples of molecular configurations from the reference distribution, without the need to know the (mean) forces, it allows one to construct CG models or guide atomistic simulations according to experimental observations, which is worthy of investigation. Moreover, RE-VADE allows one to construct CG models even without access to
GANs, derivation of (RE-)VADE training objectives, neural samplers, model setups, and simulation details.

AUTHORS' CONTRIBUTIONS

J.Z., Y.K.L., Y.I.Y., and Y.Q.G. designed the research; J.Z., Y.K.L., and Y.I.Y. performed the research; J.Z., Y.K.L., Y.I.Y., and Y.Q.G. analyzed the data; J.Z., Y.K.L., Y.I.Y., and Y.Q.G. wrote the paper.

The authors declare no conflict of interest.

ACKNOWLEDGMENTS

The authors thank Xing Che, Frank Noé, Lijiang Yang, and Weinan E for useful discussions. This research was supported by the National Natural Science Foundation of China (Grant Nos. 21927901, 21821004, and 21873007 to Y.Q.G.), the National Key Research and Development Program of China (Grant No. 2017YFA0204702 to Y.Q.G.), and the Guangdong Basic and Applied Basic Research Foundation (Grant No. 2019A1515110278 to Y.I.Y.). J.Z. thanks the Alexander von Humboldt Foundation for supporting part of this research.

DATA AVAILABILITY

18 D. J. Rezende and S. Mohamed, "Variational inference with normalizing flows," in International Conference on Machine Learning (ICML, 2015), pp. 1530–1538.
19 M. E. Abbasnejad, Q. Shi, A. v. d. Hengel, and L. Liu, presented at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
20 R. Salakhutdinov, A. Mnih, and G. Hinton, in Proceedings of the 24th International Conference on Machine Learning, 2007.
21 R. Salakhutdinov and G. Hinton, in Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, edited by D. van Dyk and M. Welling (PMLR, Proceedings of Machine Learning Research, 2009), Vol. 5, pp. 448–455.
22 J. J. Hopfield, Proc. Natl. Acad. Sci. U. S. A. 79(8), 2554–2558 (1982).
23 D. Wales, Energy Landscapes: Applications to Clusters, Biomolecules and Glasses (Cambridge University Press, 2003).
24 Y. Du and I. Mordatch, preprint arXiv:1903.08689 (2019).
25 A. P. Lyubartsev, Eur. Biophys. J. 35(1), 53 (2005).
26 M. S. Shell, J. Chem. Phys. 129(14), 144108 (2008).
27 D. Reith, M. Pütz, and F. Müller-Plathe, J. Comput. Chem. 24(13), 1624–1636 (2003).
28 I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, presented at Advances in Neural Information Processing Systems, 2014.
29 S. Kullback and R. A. Leibler, Ann. Math. Stat. 22(1), 79–86 (1951).
30 M. Arjovsky, S. Chintala, and L. Bottou, presented at the International Conference on Machine Learning, 2017.
31 A. Chaimovich and M. S. Shell, J. Chem. Phys. 134(9), 094112 (2011).
32 O. Valsson and M. Parrinello, Phys. Rev. Lett. 113(9), 090601 (2014).
33 J. Zhang, Y. I. Yang, and F. Noé, J. Phys. Chem. Lett. 10(19), 5791–5797 (2019).
34 D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, J. Am. Stat. Assoc. 112(518),
52 L. Bonati, Y.-Y. Zhang, and M. Parrinello, Proc. Natl. Acad. Sci. U. S. A. 116(36), 17641–17647 (2019).
53 P. Tiwary and B. J. Berne, J. Chem. Phys. 147(15), 152701 (2017).
54 S. Singer and J. Nelder, Scholarpedia 4(7), 2928 (2009).
55 S. Kmiecik, D. Gront, M. Kolinski, L. Wieteska, A. E. Dawid, and A. Kolinski, Chem. Rev. 116(14), 7898–7936 (2016).
56 K. Lindorff-Larsen, S. Piana, R. O. Dror, and D. E. Shaw, Science 334(6055), 517–520 (2011).
57 S. Piana and A. Laio, J. Phys. Chem. B 111(17), 4553–4559 (2007).
58 C. M. Davis, S. Xiao, D. P. Raleigh, and R. B. Dyer, J. Am. Chem. Soc. 134(35), 14476–14482 (2012).
59 H. Nguyen, J. Maier, H. Huang, V. Perrone, and C. Simmerling, J. Am. Chem. Soc. 136(40), 13959–13962 (2014).
60 P. Shaffer, O. Valsson, and M. Parrinello, Proc. Natl. Acad. Sci. U. S. A. 113(5), 1150–1155 (2016).
61 J. Zhang, Y.-K. Lei, X. Che, Z. Zhang, Y. I. Yang, and Y. Q. Gao, J. Phys. Chem. Lett. 10(18), 5571–5576 (2019).
62 F. Palazzesi, O. Valsson, and M. Parrinello, J. Phys. Chem. Lett. 8(19), 4752–4756 (2017).
63 H. Wu, A. Mardt, L. Pasquali, and F. Noé, presented at Advances in Neural Information Processing Systems, 2018.
64 H. Sidky, W. Chen, and A. L. Ferguson, Chem. Sci. 11(35), 9459–9467 (2020).
65 J. M. L. Ribeiro, P. Bravo, Y. Wang, and P. Tiwary, J. Chem. Phys. 149(7), 072301 (2018).
66 L. Zhang, H. Wang, and W. E, J. Chem. Phys. 148(12), 124113 (2018).