Minimalist Models for Proteins: A Comparative Analysis
Valentina Tozzini*
NEST, Istituto Nanoscienze – CNR Scuola Normale Superiore, Piazza San Silvestro 12, I-56127 Pisa, Italy
Abstract. The last decade has witnessed renewed interest in coarse-grained (CG) models for biopolymers, stimulated in part by the needs of modern molecular biology, which deals with nano- to micro-sized bio-molecular systems and timescales longer than a microsecond. This combination of size and timescale is hard to access with atom-based simulations. Coarse graining the system is one route to overcome these limits, but it can be implemented in many different ways, which makes the landscape of CG models vast and complex.
In this paper, the CG models are reviewed and their features, applications and performance compared. The analysis is restricted to proteins and focuses on the minimalist models, namely those that reduce the number of degrees of freedom to a minimum without losing the ability to describe the secondary structures explicitly. This class includes models using a single or a few interacting centers (beads) for each amino acid.
Several issues emerge from this analysis. The difficulty in building these models lies in the need to combine transferability and predictive power with the capability of accurately reproducing the structures. It is shown that these aspects can be optimized by carefully choosing the force field (FF) terms and functional forms, and by combining different parameterization procedures. In addition, in spite of the variety of minimalist models, regularities can be found in the parameter values and in the FF terms. These are outlined and schematically presented with the aid of a generic phase diagram of the polypeptide in the parameter space and, hopefully, can serve as guidelines for the development of minimalist models incorporating the maximum possible level of predictive power and structural accuracy.
* Email : tozzini@nest.sns.it
1. Introduction
The activity of a living cell consists of a complex network of interactions among bio-molecules
exchanging information and energy through biochemical processes (Russell et al. 2009). These
occur on different scales, spanning about 10 orders of magnitude in the space domain and 15 in
the time domain and requiring the use of very many different modeling techniques, often
combined in the so-called multi-scale approaches (Ayton et al. 2007 ; Cascella & Peraro, 2008 ;
Sherwood et al. 2008 ; Tozzini, 2010). The methods used for the atomic-level descriptions,
namely the quantum mechanics approaches and the force field (FF)-based molecular dynamics
(MD) simulations, are well-established techniques that have reached a satisfactory level of standardization and accuracy (Tozzini, 2010). However, even taking into account the current growth in computer power, atomistic simulations are unlikely to reach the biologically interesting scales for a long time. This is especially true for the time domain: while large macromolecular assemblies are currently addressable with all-atom (AA) MD simulations up to the sub-ns timescale, the ms scale is a hard limit even for simulations of single proteins. This excludes a large portion of the biological processes, which generally involve macromolecular aggregates (>10 nm) and timescales of ~10 ms or more.
In order to overcome these limits, the idea of considering simplified models at less than atomic resolution arises quite naturally. Reducing the number of internal variables used to describe the system (the ‘coarse graining’) saves computational cost and therefore allows larger systems to be simulated for longer times, in principle without limitation, because the upper limit of the run length depends on the level of coarse graining. However, after its first appearance several decades ago, this idea underwent a long period of latency. It has been reconsidered in recent years, probably triggered by the development of new experimental techniques suited to investigating bio-systems on the nano–micro scale.
Coarse graining can be done at many different levels (Tozzini, 2005). The coarser the
description, the larger the saving in computational cost. But the elimination of internal degrees of
freedom implies that their effect must be taken into account implicitly in the effective forces
acting among the explicit degrees of freedom. This task becomes harder as the level of coarse
graining is made stronger (Tozzini & McCammon, 2008). Different recipes were proposed to
solve the related problems, and a large variety of different CG models, differing by the level of
coarse graining and by the philosophy of the parameterization of the FFs, are available, making
the CG models landscape very complex.
This paper focuses on a sub-class of CG models for proteins, also called ‘minimalist’ (Tozzini & McCammon, 2008). Although this term is generally used with different meanings, in this paper ‘minimalist’ denotes the models that implement the maximum level of coarsening that still allows an explicit representation of some fundamental features of the bio-molecule, such as the secondary structure. Among these, particularly interesting are the models representing an amino acid with a single interacting center (bead), i.e. the one-bead (OB) models. Coarser representations cannot easily describe the secondary structure transitions. In addition, the OB CG models are the most ‘natural’ representation, because the amino acid is the ‘building block’ of proteins. In this paper, however, the two- and three-bead models that explicitly represent the side chain of the amino acid are also considered, because they share with the OB CG models a similar description of the backbone and similar parameterization-related problems. Conversely, the CG models that explicitly represent the backbone atoms (models with 4–6 or more beads) and, at the other extreme, the coarser models that group more than one amino acid into a bead are excluded from the present paper, because they display very different features.
The advantages of minimalist models come at the cost of a number of problems in the parameterization. Combining accuracy and predictive power in a few parameters proves to be a hard task that has been tackled with different strategies (Tozzini & McCammon, 2008; Tozzini, 2010), giving rise to a variety of models and parameterization recipes. In this paper, these are reviewed and classified according to the number and location of the beads, the type and form of the FF terms, and the parameterization strategy. Some technicalities and subtleties underlying specific parameterization methods are addressed in particular, with the aim of placing these methods within a rigorous theoretical framework. The performance and applicability of the different models are also compared, and critical issues outlined. Advantages and disadvantages of the parameterization methods emerge, together with regularities in the relationship between FF terms and parameter values. Overall, this analysis outlines a possible global strategy to build an optimal minimalist model and to assign the parameter values systematically, which is illustrated with the aid of a schematic phase diagram for polypeptides in the parameter space.
Table 1. Classification of the minimalist models for proteins according to their component beads. The main internal variables are indicated in the third column. The main references are indicated. CM, center of mass.
5. Zacharias (2003): 1–3 beads (Cα, plus 0–2 for the side chain)
(sums over subscripts are implied), where U_bond is the term describing the pseudo-peptide bond energy (often replaced by a constraint), U_back describes the conformational energy and U_nb the non-bonded interactions. In OB models the latter term must account for many different effects: hydrogen bonding, excluded volume and hydrophobic interactions, and electrostatics. Consequently, it can be very complex and is often separated into sub-terms describing each effect. For instance, the excluded volume interaction is intrinsically anisotropic, because the Cα is not located at the center of the amino acid. Models in class 2, i.e. those with the bead placed on the Cβ (or on the ‘centroid’ of the amino acid), were proposed to reduce this problem, and possess a more isotropic excluded volume term. However, additional problems arise, related to the more difficult physical interpretation of the internal variables and to backbone reconstruction. In particular, the equilibrium value of the Cβ–Cβ pseudo-bond distance is no longer structure independent and, consequently, the term U_bond(r_i,i+1) is complex and depends on the secondary structure and amino acid type. This is true for all the models (even multiple-bead models) in which the backbone description is not based on the Cα positions (e.g. classes 4 and 6).
Adding one or more beads located on the side chains (classes 3–6) makes it easier to describe the non-bonded interactions, because the side chain effects are separated out and the functional form of the corresponding FF terms is simplified, although of course the number of terms increases and new conformational terms appear (i.e. those depending on the side chain bond angles θ^s):

$$U = U_{\mathrm{bond}}(r_{i,i+1}) + U_{\mathrm{back}}(\theta_i, \alpha_i) + U_{\mathrm{hb}}(r_{ij}) + U_{\mathrm{sc}}(\theta^{s}_{i}) + U_{\mathrm{nb}}(r^{s}_{ij}). \tag{2.2}$$

The hydrogen bond interactions of the backbone, U_hb(r_ij), are usually separated from the other non-bonded interactions, which are associated with the side chain beads and included in the term U_nb(r^s_ij); the latter may be decomposed into its excluded volume, hydrophobicity and electrostatic components (see also Table 2). In these models, the description of the non-bonded interactions is simpler (more isotropic, simpler functional forms), at the expense of a significantly larger number of parameters.
Additional auxiliary beads, whose position is constrained to that of the Cα and which do not increase the number of degrees of freedom, are used in models of class 7 to simplify the description of hydrogen bonding and other non-bonded interactions. This class includes both one of the most sophisticated (and complex) CG models currently available (UNRES; Liwo et al. 1997a, b) and one of the first CG models ever reported (Levitt, 1976), which inspired most of the current ones.
Of course, a large variety of other models have been considered, using up to five beads for the backbone and multiple-bead representations for the side chain. This paper focuses specifically on the OB Cα-based models (class 1) and those sharing with them a similar description of the backbone conformation (classes 3, 5 and 7). A scheme of the FF terms commonly present in the different classes of models is reported in Table 2.
Table 2. FF terms present in the different classes of minimalist models. In addition to the FF terms defined in the main text, here the non-bonded interaction for the side chain is split into its components, U_nb(r^s_ij) = U_sh + U_hyd + U_el, where U_sh describes the hydrogen bonds of the side chains, U_el the electrostatic interactions and U_hyd the hydrophobicity and excluded volume interactions, which are usually not separable. Optional terms are enclosed in parentheses. In classes where certain FF terms are usually treated together, the corresponding table cells are merged.
of dihedrals). The RP displays densely populated areas corresponding to the main secondary structures (see Fig. 1a, colored contours), in which the (φ,ψ) pairs assume the typical values reported in Table 3, while areas outside the contours are sterically forbidden. To validate a protein model, the protein RP must not display anomalies, such as points in the forbidden areas.
Obviously, the RP cannot be used directly in the class of models considered in this paper, whose internal conformational variables are (α,θ). However, the chemical constraints introduced by the peptide bond allow an analytical form to be derived for the (φ,ψ)→(α,θ) mapping (Tozzini et al. 2006). Under some simplifying conditions, this is described by the following equations:
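$$
\begin{cases}
\alpha = \varphi + \psi + \pi + \gamma(\sin\varphi + \sin\psi) - \gamma\left(\tau - \pi/2\right)(\sin\varphi + \sin\psi) + \tfrac{1}{4}\gamma^{2}\left(\sin 2\varphi + \sin 2\psi + 4\sin(\varphi + \psi)\right),\\[4pt]
\cos\theta = \cos\tau\left[\cos^{2}\gamma - \sin^{2}\gamma\,\cos\varphi\cos\psi\right] + \sin\tau\left[\cos\gamma\sin\gamma\,(\cos\varphi + \cos\psi)\right] - \left[\sin^{2}\gamma\,\sin\varphi\sin\psi\right],
\end{cases}
\tag{3.1}
$$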
where τ = 111° is the NH–Cα–CO angle and γ ≈ 16° is the angle formed by the Cα–Cα pseudo-bond with the NH–Cα and Cα–CO bonds (Tozzini et al. 2006). The (φ,ψ)→(α,θ) mapping is graphically represented in Fig. 1a, b. A uniform density in the (φ,ψ) plane is mapped onto a non-uniform butterfly-shaped image in the (α,θ) plane. As an effect of the mapping, the allowed areas corresponding to secondary structures are re-shaped and re-sized, as shown in Fig. 1b. The (φ,ψ)→(α,θ) mapping is not one-to-one: pairs of points symmetric with respect to the main diagonal are mapped onto the same (α,θ) point. However, due to the specific relative location of the forbidden and allowed secondary structure areas in the (φ,ψ) plane, these remain separated even in the (α,θ) plane. This point is very important, because it is precisely what makes the OB CG representation meaningful and useful for describing the secondary structures: although an (α,θ) pair corresponds to two (φ,ψ) pairs, only one of them falls outside the forbidden areas. Consequently, the backbone conformation is uniquely determined for each (α,θ) pair and the AA backbone conformation can be uniquely reconstructed from the CG one.
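As a numerical illustration of Eq. (3.1), the simplified mapping can be sketched in a few lines of Python. This is only a minimal sketch under the assumptions stated in the text (τ = 111°, γ ≈ 16°, trans peptide bonds, angles in radians inside the formula); the function names `phi_psi_to_alpha_theta` and `wrap_deg` are illustrative and are not taken from the original work.

```python
import math

TAU = math.radians(111.0)    # NH-Calpha-CO angle quoted in the text
GAMMA = math.radians(16.0)   # approximate value quoted in the text


def wrap_deg(angle_deg):
    """Wrap an angle in degrees into the interval [-180, 180)."""
    return (angle_deg + 180.0) % 360.0 - 180.0


def phi_psi_to_alpha_theta(phi_deg, psi_deg, tau=TAU, gamma=GAMMA):
    """Simplified analytic (phi, psi) -> (alpha, theta) map; input and output in degrees."""
    phi, psi = math.radians(phi_deg), math.radians(psi_deg)
    s = math.sin(phi) + math.sin(psi)
    # First line of the mapping: pseudo-dihedral alpha (radians)
    alpha = (phi + psi + math.pi
             + gamma * s
             - gamma * (tau - math.pi / 2.0) * s
             + 0.25 * gamma ** 2 * (math.sin(2 * phi) + math.sin(2 * psi)
                                    + 4.0 * math.sin(phi + psi)))
    # Second line of the mapping: cosine of the pseudo-bond angle theta
    cos_theta = (math.cos(tau) * (math.cos(gamma) ** 2
                                  - math.sin(gamma) ** 2 * math.cos(phi) * math.cos(psi))
                 + math.sin(tau) * math.cos(gamma) * math.sin(gamma)
                 * (math.cos(phi) + math.cos(psi))
                 - math.sin(gamma) ** 2 * math.sin(phi) * math.sin(psi))
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against round-off
    return wrap_deg(math.degrees(alpha)), math.degrees(math.acos(cos_theta))


if __name__ == "__main__":
    print(phi_psi_to_alpha_theta(-57.0, -47.0))  # alpha-helix values of Table 3
```

For the α-helix values of Table 3, (φ,ψ) = (−57°, −47°), this sketch gives α of about 54° and θ of about 94°, to be compared with the tabulated 52° and 92°. Exchanging φ and ψ returns the same (α,θ) pair, which is the two-to-one character of the mapping.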
Figure 1c, d report the same information, described in a different way. The color is conserved in the (φ,ψ)→(α,θ) mapping; thus the comparison of (c) with (d) shows that lines parallel to the secondary diagonal in (φ,ψ) are mapped onto almost vertical lines in (α,θ). In addition, the color is assigned according to the ‘helicity’ of the secondary structure, changing from almost flat structures (blue to cyan) to positive (green, right-handed helices), to zero (yellow, flat rings), to negative (red, left-handed helices). Dots with the same color coding are placed at the typical values of (φ,ψ) for the different secondary structures reported in Table 3. In the (α,θ) plane, the colored strips become almost vertical (compare (c) with (d)), indicating that the helicity depends only on α, which is ~180° for the flat structures and decreases for helices toward the rings (α = 0). Right- and left-handed helices differ by the sign of α. The (α,θ) representation of the helicity of the secondary structures is therefore rather intuitive. This is evident in the (α,θ) plot for glycine, the achiral amino acid (black contour in Fig. 1b), for which the complete symmetry with respect to α is recovered.

Fig. 1. Illustration of the (φ,ψ)→(θ,α) mapping. (a) The RP for the generic amino acid: the colored lines enclose areas where (φ,ψ) pairs belonging to defined secondary structures accumulate: blue = extended, green = right-handed helices, red = left-handed helices. Cyan and yellow lines enclose the weakly allowed regions; other areas are sterically forbidden. (b) The (φ,ψ) plane is mapped onto the butterfly-shaped region of the (θ,α) plot. Shades of grey show the inhomogeneity introduced by the mapping. The colored lines are the images of the contours in (a) mapped onto the (α,θ) plane, and enclose the areas corresponding to specific secondary structures, as in (a). The black lines have the same meaning, but are evaluated for glycine instead of the generic amino acid. (c) and (d): The same as (a) and (b), with the following variants: the (φ,ψ) plane is colored in strips at constant (φ+ψ) value and the color is conserved upon mapping onto the (α,θ) plane. The dots represent specific kinds of secondary structures (those reported in Table 3) with the following color code: blue = extended; cyan = flat ribbon; green = right-handed helices; red = left-handed helices; magenta = proline helices; yellow = rings. In (d), the open dots correspond to directly measured data (reported in Table 3), while the filled dots are obtained from the analytic mapping of the corresponding dots in the (φ,ψ) plane. The discrepancy is due to the use of the simplified formula for the analytic mapping (see Tozzini et al. 2006).

Table 3. Conformational backbone variables in the most common secondary structures. The θ and α values were taken from structures built with InsightII. The same software was used to build the turns using the data for (φ,ψ); in this case the end residues were added in extended conformation. For the conformations whose secondary structure is not uniform (turns) or that contain peptide bonds in cis conformation, the (φ,ψ)→(θ,α) mapping represented in Fig. 1 does not apply. (pro3) in turns VI means that the third residue is proline.

| Structure | ω (deg) | φ (deg) | ψ (deg) | Cα–Cα (Å) | θ (deg) | α (deg) |
|---|---|---|---|---|---|---|
| Extended* | 180 | 180 | 180 | 3.8 | 146 | 180 |
| Anti-parallel sheet (Mathews et al. 2000) | 180 | −139 | 135 | 3.8 | 131 | 179 |
| β-strand* | 180 | −120 | 120 | 3.8 | 121 | 178 |
| Parallel sheet (Mathews et al. 2000) | 180 | −120 | 113 | 3.8 | 119 | 177 |
| Flat ribbon (Mathews et al. 2000) | 180 | −78 | 59 | 3.8 | 92 | 163 |
| 3–10 helix# | 180 | −49 | −26 | 3.8 | 84 | 85 |
| 3–10 helix (Mathews et al. 2000) | 180 | −49 | −29 | 3.8 | 85 | 81 |
| 3–10 helix* | 180 | −60 | −30 | 3.8 | 88 | 68 |
| α-helix (Mathews et al. 2000) | 180 | −57 | −47 | 3.8 | 92 | 52 |
| α-helix* | 180 | −65 | −40 | 3.8 | 92 | 51 |
| π-helix* | 180 | −30 | −90 | 3.8 | 100 | 34 |
| π-helix (Mathews et al. 2000) | 180 | −57 | −70 | 3.8 | 99 | 27 |
| π-helix# | 180 | −57 | −80 | 3.8 | 102 | 17 |
| 6-membered ring* | 180 | 180 | 0 | 3.8 | 115 | 0 |
| 5-membered ring* | 180 | −75 | −75 | 3.8 | 105 | 0 |
| 5-membered ring (Mathews et al. 2000) | 180 | −60 | −105 | 3.8 | 108 | 0 |
| Left-handed α-helix (Mathews et al. 2000) | 180 | 57 | 47 | 3.8 | 92 | −52 |
| Collagen triple helix (Mathews et al. 2000) | 180 | −51 | 153 | 3.8 | 117 | −77 |
| Polyproline II (Voet & Voet, 2005) | 180 | −71 | 150 | 3.8 | 117 | −106 |
| Polyproline II* | 180 | −71 | 145 | 3.8 | 117 | −107 |
| Polyproline II* | 180 | −79 | 150 | 3.8 | 121 | −109 |
| Polyproline II (Mathews et al. 2000) | 180 | −75 | 145 | 3.8 | 119 | −109 |
| Turn-I# | 180 | −60, −90 | −30, 0 | 3.8 | 90, 88 | 48 |
| Turn-II# | 180 | −60, 80 | 120, 0 | 3.8 | 88, 108 | 1 |
| Turn-III# | 180 | −60 | −30 | 3.8 | 88 | 68 |
| Turn-V# | 180 | −80, 80 | 80, −80 | 3.8 | 98 | −63 |
| Turn-VIa# (pro3) | 180, 0 | −60, −90 | 120, 0 | 3.8, 2.4 | 123, 81 | −50 |
| Turn-VIb# (pro3) | 180, 0 | −120, −60 | 120, 0 | 3.8, 2.4 | 81, 89 | −25 |
| Turn-VIII# | 180 | −60, −120 | −30, 120 | 3.8 | 121, 88 | 48 |
| Polyproline I (Mathews et al. 2000) | 0 | −75 | 160 | 2.9 | 100 | 94 |
| Polyproline I* | 0 | −71 | 160 | 2.9 | 100 | 96 |
In the next sections, the role of the (α,θ) plot in validating the CG models and in helping their parameterization will become clear. It should be noted that building a unique AA→CG mapping was possible because the Cα was chosen as the interacting site: other choices produce more complex and secondary-structure-dependent mappings.
4. Parameterization philosophies
The minimalist CG models may have ~10–100 parameters, including both the ‘structural parameters’ (equilibrium values of the coordinates) and the ‘energetic parameters’ (elastic constants, bonding energies, well depths, barriers, etc.). There are several possible strategies to fix their values, which are described in this section. To some extent, the parameterization strategy is related to the type of model, defined by the kind and number of bonded terms in the FF (the ‘topology’), their functional form and the number and functional form of the non-bonded terms. In the following, the CG models are classified and described, restricting the discussion to the OB Cα-based ones unless otherwise stated.
$$U = U_{\mathrm{bond}}(r_{i,i+1}) + U_{\mathrm{back}}(\theta_i, \alpha_i) + U_{\mathrm{nb,loc}}(r_{ij}) + U_{\mathrm{nb,non\text{-}loc}}(r_{ij}). \tag{4.1}$$
The presence and form of the terms depend on the specific model. The separation of U_nb(r_ij) into local and non-local parts is generally based on a cutoff radius r_cut: all the distances r_ij smaller than r_cut in the reference structure are treated as local, the others as non-local, and the corresponding FF terms are treated in different ways. In the simplest possible biased model, namely the elastic network (EN), U_nb,non-loc is absent and U_nb,loc is a harmonic distance-dependent potential. U_back is also absent, and the correct backbone conformation is maintained by U_nb,loc, which also includes the interactions between second and third neighbors along the polypeptide chain (1–3 and 1–4 interactions, the equivalent of the pseudo-bond-angle and dihedral interactions, respectively). In the original formulation (Tirion, 1996), all the elastic constants are set to the same value k, optimized by fitting the calculated root mean-squared fluctuations (RMSF) to the experimental temperature B factors. This fit creates an inter-dependence between k and r_cut: as r_cut increases, k must be softened according to the rule k r_cut² ~ const. In subsequent works, the rule ‘the larger the cutoff, the softer the interaction’ was confirmed (Atilgan et al. 2001; Soheilifard et al. 2008), although the quantitative relationship does not seem to be so simple. Average values of k and r_cut are given in Table 4.
Table 4. Summary of the features of the minimalist models for proteins. For each model, the FF terms (U_bond, U_back, U_nb,loc, U_nb,non-loc) are followed by remarks.

Elastic network and plastic/bimodal networks. Plastic/bimodal networks: harmonic potential for the single wells; global or local valence-bond-like combination. Remarks: ANM, r_cut = 8–15, k ~ 10–0.9 kcal/mol Å²; GNM, r_cut = 8, k ~ 0.02 kcal/mol Å²; r_cut = 13, k = 1 kcal/mol Å².

Heterogeneous EN. Harmonic, with a different k for each bonded pair. Remarks: r_cut in principle infinite, but ~15 for simplicity.

Extended/anisotropic network. U_bond: harmonic, ½K(r_ij − r0_ij)². U_nb: anharmonic, ½k_ij((r_ij − r0_ij)² − a²)H((r_ij − r0_ij)² − a²). Remarks: r_cut = 13; K ~ 46 kcal/mol Å²; k_ij AA dependent, average ~2 kcal/mol Å².

Chemical EN. U_bond: harmonic, ½k_2(r_ij − r0_ij)². U_back: harmonic (1–3 and 1–4 distance-based terms). U_nb: harmonic, ½k_vdw(r_ij − r0_ij)². Remarks: r_cut = 8 for U_nb,loc; separate terms for H-bonds, disulfide bridges and salt bridges, with different elastic constants.

Go models. U_bond: harmonic or constraint. U_back: harmonic angle, ½k_θ(θ − θ0)²; cosine sum, Σ_{n=1,3} K_n[1 − cos n(α − α0)]. U_nb,loc: LJ-like (given as LJ 12-6), ε[5(r0_ij/r_ij)^12 − 6(r0_ij/r_ij)^10]. U_nb,non-loc: repulsive only, ε(C/r_ij)^12. Remarks: r_cut = 8 for Cα; r_cut = 4 for side chains; ε = energy unit; k_b = 100ε, k_θ = 20ε, K_n ~ ε.

Partially biased models. U_bond: harmonic or constraint. U_back: harmonic (1–3 and 1–4 distance-based terms), or unbiased: harmonic angle U_ang = ½k_θ(θ − θ0)² or U_ang = Σ_{n=1}^{3} k_n (1/n!)(θ − θ0)^n, with U_dih = Σ_{n=1,3} K_n[1 − cos n(α − α0)]. U_nb,loc: biased Morse, u(r_ij) = ε[(e^{−k(r_ij − r0_ij)} − 1)² − 1]. U_nb,non-loc: unbiased Morse, u(r_ij) = ε[(e^{−k(r − r0)} − 1)² − 1]. Remarks: r_cut ~ 8; k_b = 50–100 kcal/mol Å²; k_θ ~ 20–50 kcal/mol; k_α ~ 3 kcal/mol; ε = ε(r0), decreasing from ~5 to ~0.1 kcal/mol; parameterization half structure based, half BI based.

Unbiased models. U_bond: harmonic or constraint. U_back: U_ang harmonic or double well; U_dih cosine sum; explicit correlations between U_ang and U_dih, sometimes implicitly included. U_nb,loc: explicit U_hb, anisotropic LJ-like or dipole-dependent term. U_nb,non-loc: LJ-like, single or multiple wells, sometimes anisotropic. Remarks: parameterization based on a mix of BI, FM and physical–chemical considerations.

Thanks to their simplicity and robustness, EN models have had great success. Under certain assumptions on the distribution of the fluctuations (the Gaussian network model, GNM, and its anisotropic version, ANM; Atilgan et al. 2001; Soheilifard et al. 2008), they can be solved analytically, and normal mode analysis (NMA) is easily performed. The low-frequency normal modes obtained from EN are seen to capture the fundamental motions of the system related to its biological function, in spite of the extreme simplification of the representation. This indicates that the connectivity and shape of a protein, namely the input of EN models, and not the structural details, generally determine its biological function. Similar information can also be obtained from the principal mode analysis (PMA) (Van Aalten et al. 1997) of an MD trajectory, whose output is the set of modes ordered by amplitude. Within the harmonic approximation, the first modes (i.e. the largest in amplitude) coincide with the slowest ones if the trajectory is equilibrated. Thus NMA and PMA give similar results once the correspondence between modes is established, but PMA takes a trajectory as input and does not need an analytical description of the model, and thus has more general applicability. These analyses can be used for several purposes. The deformations
associated with the slowest modes were used to flexibly fit high-resolution structural data into low-resolution electron density maps (Tama & Brooks, 2005) or to decompose the system into domains (Kundu et al. 2004). More generally, the equilibrium dynamics of huge systems such as entire viruses has been analyzed with EN models (Chennubhotla et al. 2005; Demirel & Keskin, 2005).
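The EN energy described in this section (harmonic springs between all bead pairs closer than r_cut in the reference structure, with a single elastic constant k) can be sketched as follows. This is a toy illustration and not the implementation of any of the cited works: the function names and the four-bead test geometry are assumptions introduced here for clarity.

```python
import math
from itertools import combinations


def build_elastic_network(ref_coords, r_cut):
    """Return (i, j, r0_ij) for every bead pair closer than r_cut in the reference structure."""
    springs = []
    for i, j in combinations(range(len(ref_coords)), 2):
        r0 = math.dist(ref_coords[i], ref_coords[j])
        if r0 < r_cut:
            springs.append((i, j, r0))
    return springs


def en_energy(coords, springs, k):
    """U = 1/2 * k * sum_ij (r_ij - r0_ij)^2 with a single elastic constant k."""
    return 0.5 * k * sum((math.dist(coords[i], coords[j]) - r0) ** 2
                         for i, j, r0 in springs)


if __name__ == "__main__":
    # Hypothetical four-bead reference: a square with 3.8 A sides
    ref = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
    springs = build_elastic_network(ref, r_cut=8.0)
    moved = [ref[0], (4.8, 0.0, 0.0), ref[2], ref[3]]
    print(len(springs), en_energy(ref, springs, 1.0), en_energy(moved, springs, 1.0))
```

By construction the energy vanishes at the reference structure and the number of springs grows with r_cut, which is the origin of the trade-off between cutoff and elastic constant discussed above.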
The limitation of these models stems from their simplicity: fixing the bonding connectivity to that of the reference structure and using a single-well harmonic potential constrain the system to move in the attraction basin of the reference structure, which is the only possible equilibrium configuration. In addition, the use of a single elastic constant for all the bonded and non-bonded interactions of the system is clearly an unphysical oversimplification. Thus improved network models were proposed that release one or more of these restraints.
Plastic networks (Maragakis & Karplus, 2005) and multiple-well networks (Chu & Voth, 2007) allow the study of systems with two or more equilibrium conformations. Each conformation is represented by an EN model, and the models are subsequently coupled with a valence-bond-like approach. For instance, in the case of two states A and B

$$U = \tfrac{1}{2}\,(U_A + U_B) - \tfrac{1}{2}\sqrt{(U_A - U_B)^2 + \varepsilon^2}, \tag{4.2}$$

where U_A and U_B refer to the single states and the combination can be made either at the global level (i.e. U_A,B are the total potential energies; Maragakis & Karplus, 2005) or at the local level (i.e. U_A,B = u_A,B(r_ij) are the single pair potentials; Chu & Voth, 2007). In both cases, the standard EN parameters are used for the single wells, but additional parameters are needed: the coupling parameter ε and the relative free energy of the two states. These models are able to describe the transition between two known structural states, to find the minimum free energy paths and to analyze the properties of the system along them.
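The behavior of the two-state combination of Eq. (4.2) is easy to check numerically. The following sketch uses a one-dimensional toy coordinate and two harmonic wells standing in for the two EN basins; the helper `harmonic_well` and all the numerical values are hypothetical and serve only to illustrate the formula.

```python
import math


def mixed_energy(u_a, u_b, eps):
    """Two-state mixing of Eq. (4.2): smooth lower envelope of U_A and U_B."""
    return 0.5 * (u_a + u_b) - 0.5 * math.sqrt((u_a - u_b) ** 2 + eps ** 2)


def harmonic_well(x, x0, k, offset=0.0):
    """Single harmonic well (1D toy stand-in for one EN basin)."""
    return 0.5 * k * (x - x0) ** 2 + offset


if __name__ == "__main__":
    # Scan a toy coordinate between two wells centered at x = 0 and x = 2
    for step in range(0, 21, 5):
        x = 0.1 * step
        u_a = harmonic_well(x, 0.0, 1.0)
        u_b = harmonic_well(x, 2.0, 1.0, offset=0.2)  # offset mimics a relative energy
        print(x, mixed_energy(u_a, u_b, eps=0.3))
```

Far from the crossing the mixed energy follows the lower of the two wells, while at the crossing point (U_A = U_B) it lies ε/2 below them, so that ε controls the height of the barrier between the two states.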
Heterogeneous EN models (Lyman et al. 2008) were proposed to improve the quality of the RMSF. In these models, k is not uniform: instead, k_ij for each interacting pair is treated as a fitting parameter. Clearly, such a large number of fitting parameters requires a large amount of input data; this is why these models usually take as input the RMSF evaluated over different normal modes of the system from AA simulations. As a result, the normal mode displacements and RMSF of the CG model agree well with those of the AA model. The elastic constant strength as a function of the equilibrium distance is obtained as a by-product, and it is decreasing (Fig. 5), in agreement with the intuitive consideration that shorter-range interactions are stronger. Elastic constants dependent on the amino acid type were introduced in the ‘extended’ ANM, where an anharmonic potential for the non-consecutive Cα pairs was also used (Hamacher & McCammon, 2006). The k values were obtained from available statistical contact potential matrices (Keskin et al. 1998; Miyazawa & Jernigan, 1996) readjusted on the experimental RMSF. This model gives better-quality RMSFs and accounts for the effect of mutations.
The simplicity of EN comes at the cost of requiring a large number of contacts for each bead for the system to be stable, a number that scales approximately with r_cut³. In order to reduce the computational cost and to obtain more physical FF terms, ENs with physics/chemistry-based topology were proposed. In the ‘chemical network’ (Jeong et al. 2005), the FF terms are separated according to amino acid type, with different k values calibrated by fitting the overlap of displacements from the NMA. As a consequence, r_cut can be greatly reduced without losing stability and accuracy.
The Go models belong to the class of biased models, although their purpose, FF terms and parameterization philosophy differ from those of EN. Originally proposed as a simplified statistical model for folding (Go & Scheraga, 1976), more recent versions (Clementi et al. 2000; Koga & Takada, 2001) include terms with a more physical form than EN. U_bond is the usual harmonic term, while U_back depends on the backbone conformational variables θ and α, and is separated into the corresponding two terms (usually harmonic and cosine series, respectively). In this case r_cut separates the pairs of amino acids that are in contact in the folded structure (the native contacts) from the others. The U_nb,loc acting between pairs in native contact is an attractive Lennard–Jones (LJ)-like potential, while U_nb,non-loc is repulsive, so as to push the system toward the native conformation. This form of Go model is also called the minimally frustrated model for folding, because local minima other than the native structure are almost absent. The elastic constants and other energetic parameters are fixed to be proportional to ε, the LJ well depth (see Table 4). Thus, the energetic properties of the system are entirely determined by a single parameter, ε, which is adjusted to fit the experimental melting temperature. The philosophy underlying these models relies on the assumption that the native structure (the main input of the model) determines the folding pathway and other global properties of the folding process. However, this is true only to a first approximation: by construction, and due to the absence of frustration, the Go model fails to capture the nature of the intermediate states that often occur during folding. For this reason, later developments of the Go model have moved toward including frustration, either in U_back (Nakagawa & Peyrard, 2006) or in U_nb (Kaya & Chan, 2003).
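The native-centric character of the Go non-bonded terms can be illustrated with a short sketch. The specific functional forms used here, a 12-10 well ε[5(r0/r)^12 − 6(r0/r)^10] for native contacts and a purely repulsive ε(r_rep/r)^12 term for the other pairs, are one common choice and should be read as assumptions of this example, as should the names `go_nonbonded_energy`, `r_rep` and `min_sep` (the minimum sequence separation below which pairs are left to the bonded and backbone terms).

```python
import math
from itertools import combinations


def go_nonbonded_energy(coords, native_coords, r_cut, eps, r_rep, min_sep=3):
    """Native-centric non-bonded energy: wells for native contacts, repulsion otherwise."""
    energy = 0.0
    for i, j in combinations(range(len(coords)), 2):
        if j - i < min_sep:      # pairs handled by bonded/backbone terms
            continue
        r = math.dist(coords[i], coords[j])
        r0 = math.dist(native_coords[i], native_coords[j])
        if r0 < r_cut:           # native contact: well of depth eps centered at r0
            x = r0 / r
            energy += eps * (5.0 * x ** 12 - 6.0 * x ** 10)
        else:                    # non-native pair: excluded volume only
            energy += eps * (r_rep / r) ** 12
    return energy


if __name__ == "__main__":
    # Hypothetical four-bead native structure: only the pair (0, 3) is non-bonded
    native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
    print(go_nonbonded_energy(native, native, r_cut=8.0, eps=1.0, r_rep=4.0))
```

In the native structure every native contact contributes exactly −ε, so the native state is the global minimum of this term by construction, which is what makes the model minimally frustrated.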
where UI=U bond, U angle, etc. and QI=ri,i+1, ai, etc. ({Q} indicates the whole set of co-
ordinates). The probability distribution of a single internal variable is
Z
U ({Q}) U (QI )
P(QI ) / dQ1 , . . . , dQI x1 , dQI +1 , . . . , dQN exp x = exp x ð4:4Þ
kT kT
and the second equality stands if the condition of complete un-correlation between the FF terms
is valid. This condition is never exactly satisfied, especially in the case of the OB FFs. The
consequences of this approximation will be illustrated later on. Equation (4.4)) is equivalent to
U (QI )=xkT ln (P(QI ))+const: ð4:5Þ
Minimalist models for proteins 345
(a) (b)
(c) (d)
Fig. 2. Probability distribution of the internal variables evaluated using different statistical sets : 102 large
proteins (size y50 Å, solid line) ; 312 proteins with prevalence of b-strands (long dashed line) ; 347 proteins
with prevalence of a-helices (short dashed line) ; 450 proteins prevalently unstructured (dotted line). The
sets were prepared with the selection tools of the RCSB databank website (RCSB protein databank, http://
www.pdb.org/pdb/home/home.do). Blue lines : arbitrarily normalized variable distributions ; red lines :
BIn, units on the right-hand vertical axis. (a) First neighbor distance distributions ; (b) pseudo-bond-angle
distributions ; (c) pseudo-dihedral distributions ; (d) the (a,h) plot for the set of generic proteins (in orange),
superimposed on the ideal map (see also Fig. 1).
that defines the core of the BI, and gives an operative way to derive the CG FF terms based on
the probability distributions of the internal variables, which can be evaluated given a statistical set of
structures.
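As an illustration of the direct inversion, the following minimal Python sketch builds a histogram of a sampled internal variable and converts it into a potential through U = −kT ln P. The function name `boltzmann_inversion`, the bin width and the synthetic two-population sample of pseudo-bond angles are assumptions made for the example only; they are not taken from the statistical sets used in this paper.

```python
import numpy as np

def boltzmann_inversion(samples, bins, kT=1.0):
    """Return bin centers and U = -kT ln P, shifted so that min(U) = 0."""
    hist, edges = np.histogram(samples, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    populated = hist > 0                 # empty bins would give an infinite energy
    U = np.full_like(hist, np.nan)
    U[populated] = -kT * np.log(hist[populated])
    U -= np.nanmin(U)                    # fixes the arbitrary additive constant
    return centers, U

# Synthetic pseudo-bond-angle sample in degrees (assumed, for illustration only)
rng = np.random.default_rng(0)
theta = np.concatenate([rng.normal(90.0, 4.0, 6000),    # helix-like population
                        rng.normal(130.0, 8.0, 4000)])  # extended-like population
centers, U = boltzmann_inversion(theta, bins=np.arange(60.0, 181.0, 2.0))
theta_min = centers[np.nanargmin(U)]     # location of the deepest well
```

In this synthetic case the deepest well falls close to 90°, where the sample is most populated. With real data, the quality of the resulting potential depends on the size and composition of the statistical set.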
The probability distributions of θ and α (and BIn) also display multiple peaks (minima), and
these correspond to different secondary structures in both Cα-based and non-Cα-based models.
However, in Cα-based models the relation between peaks and secondary structures is more
direct, as explained in section 3. More specifically, P(θ) displays a peak at ∼90° typical of the
α-helices, while the β-like structures have θ = 115–145° depending on the kind of amino acid.
Figure 2b clearly shows how the relative population of the two peaks varies in passing from an
α-prevalent to a β-prevalent set of proteins, while in the mainly unstructured proteins the two
peaks are broader. Similar considerations apply to the α pseudo-dihedral distributions,
where the peak typical of helices is located at ∼60°, while the extended structures tend to adopt
a value of ∼178°. In Fig. 2d, the (α,θ) correlation plot is also reported for the set of generic
proteins, which shows a good superposition with the ideal (α,θ) plot (see section 3).
where P0(Q_I) is the probability distribution of the internal variable in a reference state, usually the
system with non-interacting particles (but the choice of the reference state is a matter of debate;
Betancourt & Thirumalai, 1999). Non-interacting particles distribute randomly in the 3D space;
thus P0(Q_I) is usually constant and the PMF differs from the BIn by an irrelevant constant.
However, in certain specific cases, this is not true. For instance, consider Q_I = θ: the P0 distribution
of uniformly distributed points on a sphere is not uniform, but rather P0(θ) ∝ sin(θ), because
the lateral surface of a horizontal section of the sphere increases with sin(θ). Thus, in
this case, the PMF is U(θ) = −kT ln[P(θ)/sin(θ)]. This correction is usually neglected
because sin(θ) varies only ∼20% from the α peak to the β peak.
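The size of this reference-state correction can be estimated with a short sketch. Here the α peak is taken at 90° and the β peak at 130°, the midpoint of the 115–145° range quoted above; the helper name `jacobian_correction` and the unit choice kT = 1 are assumptions of the example.

```python
import numpy as np

def jacobian_correction(theta_deg, kT=1.0):
    """Term added to -kT ln P(theta) when P is divided by P0 ∝ sin(theta)."""
    return kT * np.log(np.sin(np.radians(theta_deg)))

# Representative peak positions (assumed: 90° for helices, 130° for extended structures)
ratio = np.sin(np.radians(130.0)) / np.sin(np.radians(90.0))    # relative change of sin(theta)
delta = jacobian_correction(90.0) - jacobian_correction(130.0)  # shift between the two wells, in kT
```

With these values sin(θ) drops by roughly a quarter between the two peaks, which shifts the relative depth of the two wells by a fraction of kT. This is small compared with typical well depths, but not strictly zero.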
In other cases, however, the correction is relevant and the difference between BIn and PMF is
substantial. Consider, for instance, the probability distribution of the distances between any
pair of Cα atoms, P(r), which is related, after some elaboration (see the following section), to the
non-bonded interaction potential. For an infinite system of non-interacting particles, one has
P0(r) = 4πr², that is, the volume of the spherical shell of radius r, to which P(r) should also tend
for large r. Thus the PMF is U(r) = −kT ln(P(r)/4πr²) = −kT ln(g(r)), where g(r) is the pair
correlation function. Figure 3 illustrates the differences between the mentioned quantities with
practical examples. In (a), P(r) is shown (black lines). As can be seen, however, for r > 10 Å
its behavior strongly deviates from 4πr² (red line). This is because proteins have
finite size; thus their P(r) vanishes for r larger than the average protein size. In this case,
P0(r) should therefore be the distribution of non-interacting particles confined in a sphere of finite radius R,
which can be calculated analytically by solving the integral in the definition, and is
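$$P_0(r) = \frac{1}{N}\rho^2 \int_{\mathrm{sphere}} \int_{\mathrm{sphere}} d\mathbf{r}_1\, d\mathbf{r}_2\, \delta(r - |\mathbf{r}_1 - \mathbf{r}_2|) = \frac{1}{N}\rho^2\, 4\pi r^2\, \frac{4}{3}\pi R^3 \left[1 - \frac{3}{4}\frac{r}{R} + \frac{1}{16}\left(\frac{r}{R}\right)^3\right], \tag{4.7}$$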
Fig. 3. (a) In black, probability distributions of the pair distances for a set of large proteins (3000–6000
residues, solid line) and for a set of small proteins (<1000 residues, dotted line). Superimposed are
the P0(r) evaluated for an infinite system (red) and for spherical systems (eqn (4.7)) of radius 45 Å (green)
and 23 Å (blue), which best fit the P(r) for large and small proteins, respectively. The fit is worse for the
small proteins set, probably due to the poor size homogeneity in this set. These P0(r) are also reported in
(b) together with that evaluated for a spherical shell of inner radius 15 Å and outer radius 20 Å (magenta)
and that for an infinite cylinder of radius 11 Å (cyan), evaluated numerically. The normalizations are such
that the quadratic parts for r < 5 Å superimpose. (c) The pair correlation function g(r) for large (solid) and
small (dotted) proteins normalized with the infinite radius P0(r) (red) and with the finite radius P0(r) (R = 45 Å for
large proteins, green, and R = 23 Å for small proteins, blue). (d) Lower part of the graph, left vertical
axis: non-bonded parts of g(r), with the same color and line code as in (b); upper part
of the graph, right vertical axis: the corresponding PMF, with the same color and line codes.
where N is the total number of beads and ρ the average particle density. This expression tends to
4πr² as R tends to infinity. Empirical approximations of this formula were derived earlier (Zhou
& Zhou, 2002). Figure 3a shows that eqn (4.7) reproduces the correct behavior of
P(r) for large r.
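The finite-sphere reference distribution is easy to check numerically. The sketch below uses the normalized form of the bracketed expression, i.e. the constant prefactor is dropped so that the integral over 0 ≤ r ≤ 2R equals one, and compares it with distances between points drawn uniformly in a sphere. The function names, the radius R = 45 Å (the value quoted for the large-protein set) and the sample size are choices made for the example.

```python
import numpy as np

def p0_sphere(r, R):
    """Normalized distance distribution of two uniform random points in a sphere of radius R."""
    r = np.asarray(r, dtype=float)
    x = r / R
    p = (3.0 * r**2 / R**3) * (1.0 - 0.75 * x + x**3 / 16.0)
    return np.where((r >= 0) & (r <= 2 * R), p, 0.0)

R = 45.0                                   # Å
r = np.linspace(0.0, 2 * R, 20001)
p = p0_sphere(r, R)
dr = np.diff(r)
norm = float(np.sum(0.5 * (p[1:] + p[:-1]) * dr))                        # trapezoidal integral of p
mean_analytic = float(np.sum(0.5 * (r[1:] * p[1:] + r[:-1] * p[:-1]) * dr))

# Monte Carlo cross-check with uniform random points inside the sphere
rng = np.random.default_rng(1)

def random_points(n):
    v = rng.normal(size=(n, 3))
    v /= np.linalg.norm(v, axis=1, keepdims=True)          # random directions
    return v * (R * rng.random(n) ** (1.0 / 3.0))[:, None]  # radii with uniform volume density

d = np.linalg.norm(random_points(200_000) - random_points(200_000), axis=1)
mean_sampled = float(d.mean())
```

The quadratic growth at small r and the decay to zero at r = 2R are both contained in the polynomial factor. Dividing an observed P(r) by this function rather than by 4πr² is what restores the correct limit g(r) → 1 for a finite globule.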
In Fig. 3 b, the P0(r) for a spherical shell of inner radius 15 Å and outer radius 20 Å is shown in
magenta and that for an infinite cylinder of radius 11 Å is shown in cyan. These P0(r) values are
relevant when dealing with a viral capsid and with DNA, respectively. The behavior is quadratic
within the thickness of the shell or of the cylinder and then becomes linear for the shell and tends
to a constant for the cylinder. In general, P0(r) with the same global geometry of the system that
one is analyzing should be used, if the correct long-range behavior of the pair g(r) is to be
evaluated. As can be seen in Fig. 3c, g(r) tends to 1 only if the correct P0 is used (green and
blue lines). This is reflected in the long-range behavior of the PMF: in Fig. 3d, the g(r) and the
corresponding PMF are shown. Different lines correspond to different P0. Only with the correct
P0(r) does one obtain the correct asymptotic vanishing behavior of the PMF.
Fig. 4. Non-bonded part of the pair correlation functions (lower plots, left-hand vertical axis) and the
corresponding PMF (upper plot, right-hand vertical axis). Color codes: green = total non-bonded g(r)
obtained excluding the first, second and third neighbors along the polypeptide chain (that pertain to
pseudo-bond, pseudo-bond-angle and pseudo-dihedral distributions); red solid = g(r) for the 1–5 distances (fourth
neighbors along the polypeptide chain); red dashed = g(r) for the 1–6 to 1–10 distances; magenta = g(r) of
the distances of Cα atoms involved in hydrogen bonds between strands in sheets; blue = green minus red and
magenta (i.e. ‘real’ non-bonded part). The distribution of hydrogen bonds in the sheets is obtained with the
following criteria: the two Cα atoms i and j are considered to form a hydrogen bond in a sheet if also (i+1, j−1)
and (i−1, j+1) (anti-parallel sheet) or (i+1, j+1) and (i−1, j−1) (parallel sheet) form a hydrogen
bond. Dotted lines on the upper graph are fits to the minima with Morse functions
(f(r) = ε[(exp(−k(r−r0))−1)² − 1]), which have an additional parameter k with respect to the LJ potential, related to the
width of the well. The Morse potential turns out to be more appropriate given the variably softer nature of
the CG non-bonded interactions. The statistical set used is the set of ‘small proteins’ as in Fig. 3.
‘non-bonded’ g(r), obtained excluding from the distribution the first, second and third neighbors
along the polypeptide chain, which are related to the U bond and U back terms. One would, in
principle, expect the PMF from this g(r) to have an LJ-like shape. However, Fig. 4 (green lines,
upper plot) shows that it is far from being a single-well van der Waals (vdW)-like potential; rather,
it is multi-welled. For instance, the most evident peak at ∼6.2 Å corresponds to the distribution
of the fourth neighbors (1–5 neighbors, red solid line): the Cα atoms separated by four amino acids
along the chain assume that distance quite sharply when they are in helical conformation. Thus
that peak can be thought of as due to the presence of the hydrogen bonds that keep the
helix stable. Other peaks at ∼8.5, ∼10, 11 and 14 Å, present in the fifth to tenth neighbor distribution
(dashed red line), have a similar origin: they are due to the regular recurrence of distances in the
helical structure. Here the concept of ‘correlation’ emerges: while the first peak (well) at 6.2 Å
can be related to the presence of an additional interaction term in the FF (e.g. the
hydrogen bond U hb), the additional peaks (wells) in the g(r) (PMF) are not due to additional
terms, but are induced by U hb via multi-body correlations present along the helical structures.
Similarly, the backbone hydrogen bonds among strands in a sheet are responsible for the three
overlapping peaks at ∼4.5, 5.0 and 5.5 Å (magenta line), and only these, not the corresponding
induced peaks, should be included as explicit terms in the FF.
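The geometric origin of the helical peaks can be verified with a small sketch that builds an ideal Cα trace and lists the 1–5 to 1–11 distances. The helix parameters used (radius 2.3 Å, rise 1.5 Å and rotation 100° per residue) are standard textbook values for the α-helix, assumed here for illustration; they are not fitted to the statistical sets discussed above.

```python
import numpy as np

# Textbook α-helix geometry for the Cα trace (assumed values)
RADIUS, RISE, TURN = 2.3, 1.5, np.radians(100.0)   # Å, Å per residue, rotation per residue

n = np.arange(30)
ca = np.column_stack([RADIUS * np.cos(TURN * n),
                      RADIUS * np.sin(TURN * n),
                      RISE * n])

# In an ideal helix the distance between residues i and i+k depends only on k
dist = {k: float(np.linalg.norm(ca[k] - ca[0])) for k in range(1, 11)}
for k in range(4, 11):
    print(f"1-{k + 1} distance: {dist[k]:.1f} Å")
```

The 1–5 distance of this ideal trace is about 6.2 Å, and the longer-range distances follow from the same regular geometry without any additional interaction. This is the sense in which the corresponding peaks are induced rather than independent.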
Once all the U hb terms (and their correlated peaks) are subtracted, only the excluded volume, hydrophobic
and electrostatic interactions remain. The corresponding g(r) and PMF look much smoother
(blue lines), although a couple of structures at ∼6 and ∼10 Å are still visible (even more evident
in the amino-acid-specific distributions; Trovato & Tozzini, in preparation), thought to be due to
other factors : intrinsic anisotropy of the excluded volume side chain interactions, hydrophobic
interactions mediated by water molecules, etc. Thus these distributions (blue lines), and not the
original g(r) (green lines), should be used to parameterize the U nb term.
The one described above is an empirical way to separate the ‘ genuine’ FF terms from the
peaks due to correlations in the PMF. However, a rigorous way to generate a ‘ true ’ effective
potential from PMF exists that, in principle, does not require any arbitrary choice. This is the
Iterative Boltzmann Inversion (IBI) (Reith et al. 2003). It consists in using the g(r) as the target
function to be reproduced iteratively proceeding as follows : (1) use the PMF as initial guess for
Ui(r) and generate a gi(r) from a simulation ; (2) correct the Ui(r) with the formula
$$U_{i+1}(r) = U_i(r) - kT \ln\frac{g(r)}{g_i(r)} \tag{4.8}$$
and repeat step 1 with Ui+1(r). At convergence, this procedure gives the effective potential that
best reproduces the target g(r) and, in principle, should be applied to determine each term of the
FF at the same time (using the corresponding probability distributions as targets). In practice,
however, as the number of terms increases, a number of problems arise. Even assuming that the
simple BI (not iterative) is enough for the bonded terms (less affected by the correlation problem),
and restricting the IBI to U nb, one should apply it to ∼20² different terms. In addition, in
order to obtain a general and transferable potential, the simulations should be performed not on
a single structure, but on a set of diverse representative structures. For these reasons, the IBI
is only rarely applied to biopolymers (Májek & Elber, 2009) and usually in simplified forms
(Banachowicz et al. 2000).
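The iterative scheme can be sketched in a few lines. In the toy example below, a real CG simulation is replaced by a stand-in function `toy_simulation`, in which a hidden contribution W(r) mimics a correlation-induced well; this function, the shape of the target g(r) and all numerical values are assumptions made purely to keep the example self-contained.

```python
import numpy as np

kT = 1.0
r = np.linspace(3.0, 12.0, 181)

def toy_simulation(U):
    """Stand-in for a CG simulation: returns g_i(r) for a trial potential U(r).
    The hidden term W(r) mimics structure induced by other FF terms (pure assumption)."""
    W = -0.3 * np.exp(-((r - 8.5) ** 2) / 0.5)
    return np.exp(-(U + W) / kT)

def ibi(g_target, n_iter=20, tol=1e-8):
    U = -kT * np.log(g_target)                 # initial guess: the PMF
    for _ in range(n_iter):
        g_i = toy_simulation(U)                # step 1: g_i(r) from the current potential
        if np.max(np.abs(g_i - g_target)) < tol:
            break
        U = U - kT * np.log(g_target / g_i)    # step 2: update rule of eqn (4.8)
    return U

# Made-up target with a main peak and a weaker secondary peak
g_target = (1.0 + 1.5 * np.exp(-((r - 6.2) ** 2) / 0.3)
                + 0.4 * np.exp(-((r - 8.5) ** 2) / 0.5))
pmf = -kT * np.log(g_target)
U_eff = ibi(g_target)
idx = int(np.argmin(np.abs(r - 8.5)))          # grid point at the secondary peak
```

In this toy case the converged potential reproduces the target g(r) while being shallower than the PMF at the secondary peak, since part of that peak is generated by the hidden term. In a real application each iteration requires a full simulation, which is the practical limitation discussed above.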
As previously noted, it is sometimes assumed that the bonded terms of the FF are less affected
by the problem of correlations, but this is not entirely true. Observing Fig. 2d it is apparent that
P(θ) and P(α) are not independent; rather, certain values of α and θ appear preferentially in a
correlated way, for instance θ ∼ 90° and α ∼ 60° in the helices. These correlations can be seen as
induced by the presence of additional terms of the FF: specifically, the θ ∼ 90°, α ∼ 60° combination can be seen
as induced by the hydrogen bond terms that stabilize the helices. This again points out the
need to treat the direct BIn with great care when taking it as the interaction potential, even in the case of
the conformational terms.
where θ0 is the location of the first well (∼90°), and the other parameters are amino acid type
dependent and determine the position of the second well and its relative stability. The parameters
are extracted from the direct BIn (or PMFs) either from experimental structures (in Tozzini et al.
(2007) and Trylska et al. (2005)) or from AA simulations (in Voltz et al. (2008)) and then optimized
using the probability distributions as targets. U nb, both local and non-local, is represented
with a Morse potential, but a partial bias is conserved in the structural parameters of U nb–local:
the equilibrium distances r^0_ij differ for each ij pair and are taken from a single reference
structure. The distinction between U nb–local and U nb–non-local is similar to that in EN and Go
models, but both the parameters and the cutoff are based on physical grounds: looking at Fig. 4,
the cutoff can be naturally placed at ∼8 Å, which separates the local part of U nb, containing
mainly the hydrogen bond interactions, from the non-local, less structured part, containing
mainly the hydrophobic and electrostatic interactions. Thus the local bias allows us to represent
in a very simple way the most complex terms of the FF, keeps the structure stable and gives a
high level of structural accuracy that allows these CG models to be compatible with AA models
in multi-scale approaches (Chang et al. 2007). At the same time, the other unbiased terms give
enough flexibility to the system, so that even out-of-equilibrium dynamics can be simulated, such
as the substrate capture process of HIV-1 protease (Trylska et al. 2007) or the allosteric motions
upon binding of HIV-1 integrase with HAT proteins (Di Fenza et al. 2009).
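A minimal sketch of this local/non-local partition is given below. The pairs beyond the third neighbor are split with a distance cutoff measured on a reference structure, and each local pair receives a Morse well centered on its own reference distance. The random coordinates standing in for a reference structure, the helper names and the values of ε and k are hypothetical and serve only to make the example runnable.

```python
import numpy as np

def morse(r, eps, k, r0):
    """Morse form of the Fig. 4 caption: eps * [(exp(-k (r - r0)) - 1)^2 - 1]."""
    return eps * ((np.exp(-k * (r - r0)) - 1.0) ** 2 - 1.0)

def split_pairs(coords, cutoff=8.0, min_sep=4):
    """Split pairs with j - i >= min_sep into local (reference distance < cutoff) and non-local."""
    i, j = np.triu_indices(len(coords), k=min_sep)
    d = np.linalg.norm(coords[i] - coords[j], axis=1)
    local = d < cutoff
    return (i[local], j[local], d[local]), (i[~local], j[~local], d[~local])

# Hypothetical reference structure: a random cloud of 40 beads stands in for a real Cα trace
rng = np.random.default_rng(2)
coords = rng.uniform(0.0, 20.0, size=(40, 3))
(li, lj, r0_ij), (ni, nj, d_nl) = split_pairs(coords)

eps_loc, k_loc = 1.0, 1.5          # illustrative values only

def u_nb_local(d):
    """Local non-bonded energy for current distances d (same pair order as r0_ij)."""
    return float(np.sum(morse(d, eps_loc, k_loc, r0_ij)))

u_ref = u_nb_local(r0_ij)          # every local pair sits at the bottom of its own well
```

By construction the reference structure is the minimum of the local term, which is where the partial bias enters. The non-local pairs would instead be treated with a single set of transferable parameters.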
A similar approach was adopted to build the VAMM FF (Korkut & Hendrickson, 2009). The
FF terms are the same (eqn 4.1) and the parameterization is based on BI, with more complex
functional forms for U ang and U dih. At variance with Tozzini et al. (2007), the parameterization of
these terms is secondary structure dependent instead of amino acid type dependent, to improve
the structural accuracy, although a priori knowledge of the secondary structure is required.
U nb–local is biased toward a reference structure, but at variance with Tozzini et al. (2007), the
local–non-local separation is based on the distance along the chain (j − i < 6) instead of on a
distance cutoff. This implies that a double-well U nb–non-local is necessary to account for the
residual short-range interactions (Fig. 4, blue dotted lines).
Although it originates as an evolution of the Go models and is specifically designed for
folding, even the Das, Matysiak, Clementi (DMC) model (Das et al. 2005) can be put in the class
of the partially biased models, because the bias is partly eliminated in the non-bonded interaction.
U bond and U back are parameterized as in the usual Go models, based on a single structure.
Conversely, U nb assumes the form
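$$U^{\mathrm{nb}}(r) = \sum_{j-i>3} \varepsilon(a_i, a_j)\left[5\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12} - 6\,\delta(a_i, a_j)\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{10}\right] \tag{4.10}$$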
(note that a 12–10 form, which is quite similar to the Morse form with intermediate values of
the k parameter, is used instead of the usual 12–6 LJ potential). U nb is not separated
into local and non-local parts; however, the parameters ε, δ (= 0 or 1) depend on the
amino acid type (a_i, a_j), and σ depends also on the distance of i and j along the chain. Three classes
are considered: j − i = 4, j − i = 5 and j − i > 5. Thus, in some sense, the separation between local and
non-local parts of the non-bonded interaction is recovered by recognizing that the 1–5 and 1–6
interactions are qualitatively different from the others. The σ values are parameterized by BI of a
statistical set of non-redundant proteins (Das et al. 2005). The set of {ε, δ} is obtained through a
complex optimization procedure that involves the minimization of the distance of the simulated
folded structure from the native one, the free energy experimental differences upon mutations,
and other available observables (Matysiak & Clementi, 2006). The separation into classes and the
amino-acid-dependent parameters enhance the structural accuracy of the model. However, the
set of parameters is protein dependent; thus the problem of transferability to different proteins is
not completely solved. The DMC model improves the folding landscape characterization by
improving the local structural accuracy with BI-based terms and at the same time eliminating a
part of the bias present in the Go models.
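The shape of the 12–10 pair term can be compared with a 12–6 form through a short sketch. Both functions below are written so that the minimum lies at r = σ with depth ε; the values ε = 1 and σ = 6 Å and the helper names are illustrative assumptions, not parameters of the DMC model.

```python
import numpy as np

def u_12_10(r, eps, sigma, delta=1):
    """Pair term of eqn (4.10): eps * [5 (sigma/r)^12 - 6 delta (sigma/r)^10]."""
    x = sigma / r
    return eps * (5.0 * x**12 - 6.0 * delta * x**10)

def u_12_6(r, eps, sigma):
    """12-6 LJ written with the minimum at r = sigma and depth eps, for comparison."""
    x = sigma / r
    return eps * (x**12 - 2.0 * x**6)

def half_depth_width(r, u, eps):
    """Width of the region where the potential is below half of its depth."""
    inside = r[u < -0.5 * eps]
    return float(inside.max() - inside.min())

r = np.linspace(3.5, 12.0, 1701)
eps, sigma = 1.0, 6.0                   # illustrative values only
u1210 = u_12_10(r, eps, sigma)
u126 = u_12_6(r, eps, sigma)
r_min = float(r[np.argmin(u1210)])
width_1210 = half_depth_width(r, u1210, eps)
width_126 = half_depth_width(r, u126, eps)
u_repulsive = u_12_10(r, eps, sigma, delta=0)   # delta = 0 switches the attraction off
```

With this normalization the 12–10 well is narrower than the 12–6 one, and setting δ = 0 leaves a purely repulsive term. This is how a single functional form covers both attractive and repulsive residue pairs.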
As already observed, the difficulty in completely releasing the bias in OB models stems from
the fact that the non-bonded interactions, especially the local ones, are very complex and highly
anisotropic. Thus maintaining the local bias is a simple compromise solution to have high
accuracy without introducing a large number of parameters and complex functional forms
(Mukherjee et al. 2005). A part of the problem, specifically that related to the anisotropy of the
interactions of the side chains, is solved in the Cα-based two-bead (or multiple-bead) models.
Some of them are worth mentioning here, although this paper is focused on the OB models,
because they are based on BI. In the two-bead model by Bahar & Jernigan (1997) and Bahar et al.
(1997), the backbone U ang and U dih terms are numerically evaluated by BI and are amino acid
type dependent. Conformational terms pertaining to the orientation of the side chain and
a U corr(θ,α) term are added to account for the correlations between the backbone conformational
variables. The non-bonded terms between side chains are obtained with the same philosophy.
The CALF FF is virtually a two-bead model, although the position of the side chain is entirely
determined from that of Cα (Buck & Bystroff, 2009). From the point of view of the FF form, it is
interesting because explicit terms for the hydrogen bonds U hb are introduced. The parameterization
procedure is BI related but quite complex. Given a protein, first the local structure
is determined from the local sequence, through a statistical procedure. Second, based on the
local structure, the U back and U hb terms are parameterized, including statistical information
through BI. This model is used for folding. The OPUS-Cα FF (Wu et al. 2007), which uses very
complex parametric forms for the FF terms, is similar in its FF terms and in the philosophy of
the parameter assignment (based on the local structure). These models are quasi-unbiased,
but the bias is hidden in the a priori determination of the local structure, upon which the FF is
based.
The MARTINI FF, recently extended to proteins (Monticelli et al. 2008), is a multiple-bead
model where the backbone bead is placed on the centroid of the NH-Cα-C=O group and 1–4
beads are used for the side chains; in addition, the solvent is explicit, although it is CG itself.
Thus the level of coarse graining is not particularly extreme. However, this model is interesting
for the purpose of the present paper because the choice of FF terms and parameterization
contains elements that can be considered exemplary even for Cα-based models. U bond is a simple
harmonic form, including in this case also the terms between Cα and the side chains. U back is
split into the bond angle and dihedral parts, all represented by single-well functions. Additional
conformational terms are assigned to maintain the correct orientation of the side chain. U nb is
split into an LJ part, representing excluded volume and hydrophobic interactions, and a screened
Coulomb part, representing electrostatics. U hb is absent: the secondary structure is maintained
by the U back terms, whose values are derived based on the BI. The side chain non-bonded
interaction parameters are amino acid type dependent and were optimized including thermostatistical
data from experiment (e.g. free energy of water–oil partitioning) or from AA simulations
(amino-acid association constants). As for OPUS-Cα and CALF, although the bias toward a
single structure is completely removed, a priori knowledge of the secondary structure is necessary.
In addition, no transitions between different secondary structures are allowed.
where Q is a collective variable (e.g. a CG internal variable). This is the general form of eqn (4.5),
valid for any kind of collective variable Q. The additive constant is equal to −kT times the
logarithm of the partition function. The relation between F(Q) and the corresponding PMF is
$$\mathrm{PMF}(Q) = -kT \ln\frac{P(Q)}{P_0(Q)} = F(Q) - F_0(Q) + \mathrm{const.}, \tag{4.12}$$
where the subscript 0 refers to the reference system, usually the non-interacting particle system.
As discussed in the previous section, PMF and the corresponding free energy differ by more than
a simple constant if the choice of the variable Q is such that F0 is not independent of it. In the
previous section, it is also discussed how it is possible to obtain the ‘ best ’ U(Q) starting from
PMF(Q) with the iterative BI.
In principle, BI could be applied also to many-body PMFs and effective potentials. One could
consider a many-body CG potential depending on a set of collective CG variables {Q}
and determine it with a full convergence multi-variable IBI procedure, using a set of probability
distributions P({Q}) (from experiment or from simulation) as the target. Equations (4.11) and
(4.12) are valid also when Q is replaced by the set of {Q} and the corresponding multivariate
quantities (probabilities, free energies, PMF) are defined. The IBI procedure then gives the
effective multi-body CG potential U({Q}) that best reproduces the free energy F({Q}) of the
system (i.e. the Free Energy Surface) and consequently its thermodynamic properties.
An alternative route to determine effective CG potentials is the ‘ force matching ’ (FM)
method, targeting the reproduction of the forces (Ercolessi & Adams, 1994 ; Izvekov et al. 2004).
The forces acting on the CG sites are calculated with a more accurate method (e.g. AA MD
simulations) and then used as the target to fix the parameters of the CG potential or to fit a
potential in numerical form. This method was rigorously formulated and optimized in a series of
papers by Izvekov & Voth (2005), Wang et al. (2009), Noid et al. (2008a), Das & Andersen (2009)
and Noid et al. (2008b), who named it the multi-scale CG (MS-CG) method. The basic idea is to
minimize the functional
$$\chi^2(\{F\}) = \left\langle \frac{1}{3N}\sum_{I=1}^{N} \left|F_I(\{Q(\{q\})\}) - f_I(\{q\})\right|^2 \right\rangle, \tag{4.14}$$
where F_I are the CG forces on CG sites I, f_I are the forces on the CG sites evaluated from the
AA simulations, and ⟨…⟩ represents the average along the trajectory (or within the data set), which is
intended to be large enough to sample the canonical ensemble. The statistical average eliminates
the explicit dependence on the AA ({q}) internal variables, so that χ² is a functional only of the
functions F_I({Q}), i.e. the CG forces, to be determined by minimizing χ². It is shown that the
F_I({Q}) that minimizes χ² satisfies the equation
$$F_I(\{Q\}) = -\frac{\partial}{\partial Q_I} U(\{Q\}), \tag{4.15}$$
where U({Q}) is the multi-body PMF related to P({Q}). This equation relates the
(I)BI and FM methods. In practice, however, there are substantial differences between the
two methods, at different levels. First of all, the input data set is typically different. The BI-based
procedures take, as input, structures from any source, but preferentially experimental, to enlarge
the diversity of the data set and thus improve the transferability of the parameters. Conversely,
the FM procedure needs, as inputs, the forces on CG sites, which are typically evaluated along AA
MD simulations. This difference in the inputs used in the two techniques results in differences in
the FF terms: the FF terms from (I)BI tend to be ‘softer’ (multiple minima, larger and less
defined), while those from FM tend to be ‘harder’ (sharper and better defined minima). In
addition, the use of AA simulation structures as input data raises the question of the accuracy
and reliability of the AA FFs, which is assumed in this approach. It
is beyond the scope of this paper to analyze this problem; however, it should be mentioned that the
most commonly used AA FFs have recently revealed some inaccuracies that appear especially on
long time scales (Okur et al. 2003; Ono et al. 2000). The possible differences between CG FFs
generated by the use of different input data sets are coherent with the usually different use made
of the BI and FM CG FFs: the former are generally used to address general dynamics in CG-only
simulations, while the latter are preferentially used in multi-scale simulations, where the CG
results must be compared with the AA ones. In fact, the MS-CG method is designed to realize
the mechanical consistency between AA and CG models.
Even considering an identical input set of structures, different FFs may arise from the different
approximations adopted in the two methods. Neither in FM nor in BI are the exact multi-body
U({Q}) or forces usually considered. Rather, in both cases they are decomposed into a finite
sum of terms. In the case of the (I)BI-based methods this simply corresponds to writing the FF as a
sum of terms depending on the single internal CG variables, i.e. eqns (4.3) or (2.2). As already
discussed, behind this choice there is the assumption that the FF terms and explicit
internal variables are chosen in such a way as to minimize the correlation between terms.
The parameters of the FF terms are then varied in order to optimally reproduce the PMFs or
probability distributions by direct or iterative BI, as explained previously.
In the case of the MS-CG method, the forces on the CG sites are expressed as a linear
combination of ‘ basis ’ functions G({Q}) :
$$F_I(\{Q\}) = \sum_d w_d\, G_{I,d}(\{Q\}) \tag{4.16}$$
and the coefficients w are determined by variationally minimizing eqn (4.14). The exact form of
G({Q}) depends on the kind of system. For instance, if the pair-wise interactions are dominant in
the system, e.g. for a one-component system formed by non-bonded elements, it becomes
convenient to use the Cartesian coordinates of the CG sites and express the forces in the form
$$F_I(\{R\}) = \sum_{J \neq I} \sum_d \hat{R}_{IJ}\, F_d\, \delta(R_{IJ} - R_d), \tag{4.17}$$
where R_IJ is the pair-wise distance between two CG sites and δ(R) is a discrete delta function.
This particular choice returns directly the pair-wise force acting among the components
(F_I({R})), assuming that multi-body forces are negligible.
It is clear that (I)BI and FM methods, although stemming from a common basis, use different
inputs and different approximations. The FM method has the advantage of directly giving the
effective interaction potential without the need of iterative procedures. On the other hand, the
(I)BI procedure is more broadly applicable, because it needs only the structures and not the
forces as input, although it needs more care and efforts to obtain accurate effective potentials.
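The variational step of the FM approach reduces to a linear least-squares problem once the forces are expanded on a basis. The sketch below illustrates this for a one-dimensional toy system with pair-wise forces tabulated on a distance grid, in the spirit of the discrete-delta expansion above. The reference pair force, the chain-like configurations, the grid and the noise level are all hypothetical choices; in an actual MS-CG calculation the target forces would come from AA simulations.

```python
import numpy as np

rng = np.random.default_rng(3)

def reference_pair_force(R):
    """Hypothetical smooth pair force, -dU/dR for U(R) = -exp(-2 (R - 1.5)^2)."""
    return -4.0 * (R - 1.5) * np.exp(-2.0 * (R - 1.5) ** 2)

edges = np.linspace(0.9, 4.0, 32)            # distance grid R_d: 31 bins of width 0.1
n_bins = len(edges) - 1
centers = 0.5 * (edges[:-1] + edges[1:])

rows, targets = [], []
for _ in range(400):                          # synthetic configurations of 5 sites on a line
    x = np.concatenate([[0.0], np.cumsum(rng.uniform(0.9, 2.0, 4))])
    for I in range(len(x)):
        g_row, f_I = np.zeros(n_bins), 0.0
        for J in range(len(x)):
            if J == I:
                continue
            R_IJ = abs(x[I] - x[J])
            sign = np.sign(x[I] - x[J])       # 1D analogue of the unit vector along R_IJ
            f_I += sign * reference_pair_force(R_IJ)
            d = np.searchsorted(edges, R_IJ) - 1
            if 0 <= d < n_bins:               # pairs beyond the grid are neglected
                g_row[d] += sign              # discrete delta: one basis function per bin
        rows.append(g_row)
        targets.append(f_I + rng.normal(0.0, 0.01))   # small noise on the reference forces

G, f_ref = np.array(rows), np.array(targets)
F_d, *_ = np.linalg.lstsq(G, f_ref, rcond=None)       # coefficients minimizing the residual
max_err = float(np.max(np.abs(F_d - reference_pair_force(centers))))
```

No iteration is needed: a single linear solve returns the tabulated pair force. The price is that reference forces on the CG sites must be available, whereas the (I)BI route only needs structures.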
A reference model for those belonging to this class is the one by Levitt (1976) and Levitt &
Warshel (1975). This model belongs to class 7 and is described by 2–4 interacting centers per
amino acid. However, the degrees of freedom are only those pertaining to the Cα, since the
position of the other interacting centers (i.e. the side chain centroid and auxiliary sites to describe
the hydrogen bonds) is entirely defined by the position of the Cα. As a consequence, the energy
functional terms are still those in eqn (4.1). Particularly interesting is the treatment of the term
U back: the correlation between α and θ is explicitly treated by imposing the relationship
θ = 106 − 13 cos(α − 45) (numbers are in degrees). Given this, the backbone conformation is determined by
α and its corresponding term U dih, parameterized by BI for a small number of representative
classes of possible amino-acid quadruplets and analytically expressed with a six-term Fourier
expansion. At variance with models previously described (e.g. MARTINI), the parameterization
of the conformational term in this model is amino acid type based rather than secondary structure
based. This means that the model is completely unbiased. The non-bonded interaction
between side chains, which takes the part of U nb–non-loc, is separated into its two components,
the vdW excluded volume and the ‘solvent effect’ (i.e. hydrophobic/hydrophilic interaction).
This separation, which is dropped in most of the subsequent models, is related to the fact that
both of these terms are parameterized based on the physical–chemical properties of each amino
acid. The solvent term is parameterized based on the experimental solubility of the amino acids,
and the vdW term is estimated by averaging over the possible conformations of the amino-acid side
chain. Two facts are worth noticing: (i) the averaging of the side chain conformations is
recognized as responsible for the ‘softer’ shape of the effective side chain potential with respect to
the usual LJ form, which was in fact adopted by most of the subsequent models; (ii) the
combination of two terms with different nature and shape gives rise to an effective U nb–non-loc
that is not simply single-welled, which is also observed in the effective potentials obtained by BI or FM. The
hydrogen bonding term U hb is rather complex: it is defined through additional sites located near the
C=O and N-H groups of the peptide, and described through the combination of a vdW and an
electrostatic term, with parameters taken from the corresponding ones of the AA FFs. The
Levitt–Warshel FF was shown to reproduce roughly the secondary structure contact map of
globular proteins, without introducing any a priori knowledge of the protein, except the sequence.
A simplified single-bead (class 2 in Table 1) version of this model was subsequently developed by
McCammon & Northrup (1980).
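The imposed coupling between the two backbone variables can be evaluated directly. The sketch below assumes the relation in the form θ = 106 − 13 cos(α − 45) with angles in degrees, and evaluates it at two representative dihedral values (60° for helices and 180° for extended structures) chosen for illustration.

```python
import numpy as np

def theta_from_alpha(alpha_deg):
    """Coupling between pseudo-bond angle and pseudo-dihedral (degrees), as read from the text."""
    return 106.0 - 13.0 * np.cos(np.radians(alpha_deg - 45.0))

theta_helix = float(theta_from_alpha(60.0))    # helical dihedral
theta_ext = float(theta_from_alpha(180.0))     # extended dihedral
```

With these inputs the relation gives a pseudo-bond angle slightly above 90° for the helical dihedral and about 115° for the extended one, in line with the peak positions of the statistical distributions discussed earlier.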
The UNRES FF, later developed by Liwo et al. (1997a, b) can be considered as the evolution
of the Levitt–Warshel model, being based on a similar definition of the interacting site and
composition of FF terms. A first improvement is related to the correlation between the con-
formational terms of the backbone, and consists in recognizing and parameterizing in an amino-
acid-dependent way a distribution of deviations from the correlation function proposed by Levitt
& Warshel (1975). Correlations between the backbone conformation and the side chain orien-
tation are also introduced. These two terms are parameterized based on BI-related techniques
complemented with other physics-based techniques, such as the averaging of the corresponding
parameters of the AA FF. The side chain interactions are allowed to be anisotropic and par-
ameterized in an amino-acid-dependent way with similar approaches. The correlations accurately
included in U back already determine the secondary structures, which are however further stabilized
by the dipole–dipole interaction among the sites placed on the Cα atoms, which mimics U hb.
The UNRES model is the prototype of a class of models with a very complex parameteriza-
tion, whose detailed description is out of the scope of this paper. Going back to the truly
minimalist models, another class of them stemmed from the work of Levitt and Warshel, whose
prototype can be considered the model by Sorenson & Head-Gordon (2002a), inspired by
previous work by Honeycutt & Thirumalai (1990) and by Nymeyer et al. (1998). Similar models
were subsequently developed by Friedel & Shea (2004). This model is minimalist in all senses: the
protein is represented by one bead per amino acid placed on the Cα. The FF form is expressed
by eqn (2.1), where U bond is replaced by a constraint and U back = U ang(θ) + U dih(α). In the
simplest version of the model (Sorenson & Head-Gordon, 2002b), U ang is a simple harmonic
potential with equilibrium angle at θ0 = 105°, midway between the typical helical and typical
extended values. U dih is more complex
$$U^{\mathrm{dih}}(\alpha) = \sum_{\mathrm{dih}} \left\{A[1 + \cos\alpha] + B[1 - \cos\alpha] + C[1 + \cos 3\alpha] + D[1 + \cos(\alpha + \pi/4)]\right\}, \tag{4.18}$$
where the parameters A, B, C and D are secondary structure dependent (e.g. A==C==D helical).
In subsequent refinements (Yap et al. 2008), the value of θ0 was also differentiated by secondary
structure (i.e. 95° for helical conformations and 105° for the others) and α was replaced by (α − α0),
which allowed a more accurate representation of the secondary structures. As in
some previously described models, the dihedral term is considered as the one that mainly
determines the secondary structure. At variance with BI-based models, the values of the parameters
are more roughly determined, i.e. chosen simply to stabilize one or the other secondary structure,
and assigned based on the secondary structure propensity of each amino acid. Thus the parameterization
is more chemistry–physics based and does not need any a priori knowledge except the sequence.
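To make the backbone term concrete, the following minimal Python sketch evaluates U_back = U_ang(θ) + U_dih(α) for a single bond angle and a single dihedral, using a harmonic angle term and the four-term cosine series of eqn (4.18). The function names, the factor 1/2 and the force constant k_theta in the angle term, and the coefficient sets in the usage lines are illustrative assumptions, not values taken from the cited papers; energies are in units of ε.

```python
import math

def u_ang(theta_deg, theta0_deg=105.0, k_theta=10.0):
    """Harmonic bond-angle term 0.5*k*(theta - theta0)^2, in units of epsilon.

    k_theta (per rad^2) is a hypothetical force constant chosen for illustration.
    """
    d = math.radians(theta_deg - theta0_deg)
    return 0.5 * k_theta * d * d

def u_dih(alpha_deg, A, B, C, D, alpha0_deg=0.0):
    """Four-term dihedral series of eqn (4.18) for a single dihedral.

    alpha0_deg implements the later substitution alpha -> (alpha - alpha0);
    the default 0 corresponds to the simplest version of the model.
    """
    a = math.radians(alpha_deg - alpha0_deg)
    return (A * (1.0 + math.cos(a))
            + B * (1.0 - math.cos(a))
            + C * (1.0 + math.cos(3.0 * a))
            + D * (1.0 + math.cos(a + math.pi / 4.0)))

def u_back(theta_deg, alpha_deg, coeffs, theta0_deg=105.0, k_theta=10.0):
    """U_back = U_ang(theta) + U_dih(alpha) for one angle and one dihedral."""
    A, B, C, D = coeffs
    return u_ang(theta_deg, theta0_deg, k_theta) + u_dih(alpha_deg, A, B, C, D)

def dihedral_minimum(coeffs, step_deg=1):
    """Scan alpha in [-180, 180) and return the angle (deg) of lowest U_dih."""
    A, B, C, D = coeffs
    return min(range(-180, 180, step_deg), key=lambda a: u_dih(a, A, B, C, D))

if __name__ == "__main__":
    # Hypothetical coefficient sets (units of epsilon), for illustration only.
    examples = {"set 1": (0.0, 1.2, 1.2, 1.2), "set 2": (0.9, 0.0, 1.2, 0.0)}
    for label, coeffs in examples.items():
        print(label, "lowest U_dih at alpha =", dihedral_minimum(coeffs), "deg")
```

Scanning the dihedral in this way shows how the choice of A, B, C and D alone moves the preferred dihedral between the helical and the extended regions, which is the sense in which the dihedral term is said to determine the secondary structure.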
The stability of the fold is also determined by the non-bonded term. In the simplest form
U nb is
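\[
U_{\mathrm{nb}}(r) = \sum_{j>i+3} 4\varepsilon S_1 \left[ \left(\frac{\sigma}{r_{ij}}\right)^{12} - S_2 \left(\frac{\sigma}{r_{ij}}\right)^{6} \right], \tag{4.19}
\]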
where σ is the same for all the beads and ε is the energy unit of the system (the parameters of
the bond angle and dihedral terms are also expressed as fractions of ε). The parameters S1 and S2 are
assigned based on the ‘flavor’ of the amino acid: hydrophobic (later separated into strongly and
weakly hydrophobic), hydrophilic and neutral. As in the case of the dihedral potential, not much
attention is paid to the accurate reproduction of the form of the interaction potential, which can
assume only a few different characters (strongly or weakly attractive, strongly or weakly
repulsive). However, even this potential term can be assigned solely based on the sequence. The
simplest version of the model is capable of distinguishing the secondary and tertiary fold of
proteins with different levels of α/β propensity, and locates the folding transition at values of
kT/ε around 0.4–0.6. In addition, in the optimized version of the model, an explicit hydrogen
bond term is present, whose form is inspired by previous works on water (Silverstein et al. 1998):
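\[
U_{\mathrm{hb}} = -\sum_{\mathrm{hbonds}} \varepsilon_{\mathrm{hb}} \exp\left[-\frac{(r_{ij}-r_{\mathrm{hb}})^2}{\sigma_{\mathrm{hb}}^2}\right] \exp\left[\frac{|\hat{\mathbf{r}}_{ij}\cdot\mathbf{t}_i|-1}{\sigma_{\mathrm{hb}}^2}\right] \exp\left[\frac{|\hat{\mathbf{r}}_{ij}\cdot\mathbf{t}_j|-1}{\sigma_{\mathrm{hb}}^2}\right], \tag{4.20}
\]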
where t_i is a unit vector orthogonal to the plane formed by the triplet of Cα atoms (i−1, i, i+1). This
form of the potential induces those planes to stay parallel, thus stabilizing the helix and sheet
conformations. The sum runs over all the pairs ‘capable’ of forming a bond, and the capability
is assigned, similarly to the corresponding dihedral propensity, based on sequence. The values of
ε_hb and r_hb depend on the flavor of the amino acid. The hydrogen bond term improves the
quality of the kinetics of the α/β transition in proteins with ambiguous propensity and shows a
high capability of predicting the α/β propensity upon mutations. On the other hand, the
structural accuracy in this model is sacrificed in favor of the prediction of the fold. The
functional forms of the single FF terms are rather ‘primitive’ with respect to the BI- or
FM-parameterized FFs.
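The two sequence-based terms of this model can be sketched in a few lines of Python. The sketch below assumes the bracket placement 4ε S1 [(σ/r)^12 − S2 (σ/r)^6] for eqn (4.19) and evaluates a single pair term of eqn (4.20); the helper names, the default values of r_hb and σ_hb, and the toy coordinates in the test lines are hypothetical and serve only to illustrate the functional forms.

```python
import math

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(a):
    n = math.sqrt(dot(a, a))
    return (a[0] / n, a[1] / n, a[2] / n)

def u_nb(coords, s1, s2, eps=1.0, sigma=1.0):
    """Non-bonded energy in the form of eqn (4.19), summed over pairs j > i+3.

    s1[i][j] and s2[i][j] are the pair coefficients set by the bead flavors.
    Assumed bracket placement: 4*eps*S1*[(sigma/r)^12 - S2*(sigma/r)^6].
    """
    total = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 4, n):
            d = sub(coords[j], coords[i])
            x6 = (sigma / math.sqrt(dot(d, d))) ** 6
            total += 4.0 * eps * s1[i][j] * (x6 * x6 - s2[i][j] * x6)
    return total

def plane_normal(coords, i):
    """Unit vector t_i orthogonal to the plane of beads (i-1, i, i+1)."""
    return unit(cross(sub(coords[i], coords[i - 1]), sub(coords[i + 1], coords[i])))

def u_hb_pair(coords, i, j, eps_hb=1.0, r_hb=5.0, sigma_hb=0.5):
    """One term of eqn (4.20) for the bead pair (i, j).

    r_hb and sigma_hb are hypothetical values, not fitted parameters.
    """
    rij = sub(coords[j], coords[i])
    r = math.sqrt(dot(rij, rij))
    rhat = (rij[0] / r, rij[1] / r, rij[2] / r)
    ti, tj = plane_normal(coords, i), plane_normal(coords, j)
    s2 = sigma_hb ** 2
    return (-eps_hb
            * math.exp(-(r - r_hb) ** 2 / s2)
            * math.exp((abs(dot(rhat, ti)) - 1.0) / s2)
            * math.exp((abs(dot(rhat, tj)) - 1.0) / s2))
```

With this form the pair energy reaches its minimum, −ε_hb, only when the two beads are at distance r_hb and both local planes are perpendicular to the vector joining them, that is, when the planes are stacked parallel to each other; any tilt is penalized exponentially.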
It is worth mentioning here the two-bead model by Mukherjee & Bagchi (2002), who
introduced the hydrogen bonding effect in the helices with additional attractive harmonic terms for
1–3 and 1–4 interactions. The elastic constants of those terms are assigned based on the
sequence-determined helix propensity of each amino acid, as in the other models described in this
section.
An accurate treatment of U_hb is also found in the OB model by Alemani et al. (2010). U_ang
and U_dih are represented by a double-well form, as in Tozzini & McCammon (2005), and by a
simple cosine form, respectively. In addition, the persistence of the secondary structures is
maintained by an additional term that correlates subsequent dihedrals. The hydrogen bond term
is represented by a dipolar interaction between the peptide dipoles μ_i and μ_j [eqn (4.21)].
The dipoles are located approximately midway between Cα atoms i and i+1 and their orientation
depends on the positions of Cα atoms i−1, i and i+1. This approach is similar to that of
Sorenson–Head-Gordon in that the hydrogen bond interaction depends on the position of three
subsequent Cα atoms only, but differs in many aspects. First, it is entirely physics based,
attributing the hydrogen bond to a simple dipolar interaction, while the former is more empirical.
Second, eqn (4.21) does not explicitly depend on the secondary structure. The secondary structure
dependence enters implicitly through the orientation of μ, which depends on the angle formed by
the three Cα atoms and is therefore determined by the specific secondary structure. This dependence
is not empirical; rather, it is entirely physics based. In this model, the α versus β propensity is
due to the U_ang and U_dih terms, although stabilized by U_hb. The model is capable of reproducing
stable secondary and super-secondary structures, and the transitions among them, with good
structural accuracy, although a systematic amino-acid-dependent parameterization was not
established; thus in some sense the model still lacks the predictive power that characterizes the
Sorenson–Head-Gordon model.
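As an illustration of how a purely dipolar term can favor some relative arrangements of peptide groups over others, the sketch below implements the textbook point-dipole interaction. It is not necessarily the exact form of eqn (4.21): the prefactor k, the function names and the midpoint placement helper are assumptions introduced here, and the rule that orients each dipole from three consecutive Cα positions is deliberately left out.

```python
import math

def midpoint(ca_i, ca_next):
    """Hypothetical placement of a peptide dipole midway between two consecutive C-alpha."""
    return tuple(0.5 * (a + b) for a, b in zip(ca_i, ca_next))

def dipole_dipole(mu_i, mu_j, r_i, r_j, k=1.0):
    """Point-dipole energy k*[mu_i.mu_j - 3 (mu_i.rhat)(mu_j.rhat)] / r^3.

    mu_i, mu_j: dipole vectors; r_i, r_j: their positions; k: unit-dependent prefactor.
    """
    rij = [b - a for a, b in zip(r_i, r_j)]
    r = math.sqrt(sum(c * c for c in rij))
    rhat = [c / r for c in rij]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    return k * (dot(mu_i, mu_j) - 3.0 * dot(mu_i, rhat) * dot(mu_j, rhat)) / r ** 3
```

In this form two dipoles aligned head to tail attract, two parallel dipoles placed side by side repel, and two antiparallel dipoles placed side by side attract, so the sign and strength of the interaction follow from the backbone geometry alone.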
At the conclusion of this section, it is useful to report a summary of the features of the models
discussed here. In Table 4, the presence and functional form of the FF terms in the different
models are reported, together with remarks about the parameterization procedures. In Fig. 5, the
parameters of FF terms are reported as a function of the corresponding equilibrium distance.
The bond angle and dihedral parameters are converted into equivalent constants for linear
distance-dependent terms in order to be compared with the others, as explained in the caption of
Fig. 5. As can be seen, in spite of the very different parameterization procedures (biased (blue), BI
or FM based (red), or physical–chemical based (green)), the dots tend to accumulate along a line
that is roughly represented by a shifted inverse proportionality (dashed black line) and
partially overlaps the relationship between elastic constant and cutoff distance of the ANM
(solid black line with squares). (The corresponding line for the GNM, conversely, looks like an
upper limit for the values of the constants.) This fact points to the emergence of a sort of
universality in the numerical values of the parameters of CG models and in their dependence on
the equilibrium distance, which can guide the parameterization.
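The conversion of a non-bonded or hydrogen bonding term into an equivalent elastic constant amounts to taking the second derivative of the potential at its minimum. The following sketch, given under stated assumptions, does this analytically for a 12–6 Lennard-Jones form and for a Morse form, and includes a numerical second derivative as a cross-check. The function names are hypothetical, and guide_line encodes the dashed guide-to-the-eye curve of Fig. 5 as read here, y = −1 + 25/(x − 4), with x in Å and y in kcal/(mol Å²).

```python
import math

def k_equiv_lj(eps, sigma):
    """Equivalent elastic constant of 4*eps*[(sigma/r)^12 - (sigma/r)^6].

    Returns (r_min, k) with r_min = 2^(1/6)*sigma and k = U''(r_min) = 72*eps/r_min^2.
    """
    r_min = 2.0 ** (1.0 / 6.0) * sigma
    return r_min, 72.0 * eps / r_min ** 2

def k_equiv_morse(depth, a):
    """Equivalent elastic constant of depth*(1 - exp(-a*(r - r0)))^2, i.e. 2*depth*a^2."""
    return 2.0 * depth * a * a

def second_derivative(f, r, h=1e-4):
    """Central finite-difference estimate of f''(r), for checking any other functional form."""
    return (f(r + h) - 2.0 * f(r) + f(r - h)) / (h * h)

def guide_line(r_eq):
    """Shifted inverse proportionality y = -1 + 25/(x - 4) (assumed reading of the Fig. 5 caption)."""
    return -1.0 + 25.0 / (r_eq - 4.0)
```

Both the well depth and the well width enter the result: for the Lennard-Jones form the width is tied to the position of the minimum, while for the Morse form it is controlled independently by the parameter a.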
Fig. 5. Summary of the strength parameters in the OB FFs. All the constants are expressed in kcal/(mol Å²),
as if they were elastic constants for harmonic distance potentials. For the non-bonded or hydrogen bonding
potentials, the equivalent constants are computed by calculating the second derivative at the minima and
reporting it in linear distance coordinates. It is to be noted that both the binding energy (i.e. well depth)
and the well width concur to determine the equivalent elastic constant in the case of Morse or LJ-like
potentials. For the bond angle and dihedral interactions, the parameters corresponding to the equivalent
1–3 and 1–4 interactions are evaluated and reported in the graph. In this case both the equilibrium distance
and the equivalent elastic constants are evaluated. Colors: black solid lines with squares and dots represent
the relationship
between elastic constants and cutoff distances in the ANM and GNM, respectively (Atilgan et al. 2001 ;
Soheilifard et al. 2008). The blue dots refer to biased models (different kinds of networks and Go models ;
Atilgan et al. 2001 ; Chennubhotla et al. 2005 ; Chu & Voth, 2007 ; Clementi et al. 2000 ; Demirel & Keskin,
2005 ; Go & Scheraga, 1976 ; Hamacher & McCammon, 2006 ; Jeong, et al. 2005 ; Kaya & Chan, 2003 ;
Keskin et al. 1998 ; Lyman et al. 2008 ; Maragakis & Karplus, 2005 ; Miyazawa & Jernigan, 1996 ; Nakagawa &
Peyrard, 2006 ; Soheilifard et al. 2008 ; Tirion, 1996). The vertical error bars are present in particular models
(e.g. the heterogeneous network), where local interactions can assume different strengths depending on the
protein. Red dots represent the models with parameterization based on the BI or FM (Arcangeli & Tozzini,
in preparation; Bahar & Jernigan, 1997 ; Chang et al. 2007 ; Di Fenza et al. 2009 ; Matysiak & Clementi, 2006 ;
Monticelli et al. 2008 ; Mukherjee et al. 2005 ; Reith et al. 2003 ; Tozzini & McCammon, 2005, 2008 ; Trovato
& Tozzini, in preparation ; Trylska et al. 2005, 2007 ; Voltz et al. 2008). Red dotted lines correspond to the
equilibrium distance-dependent parameters of Tozzini & McCammon (2005). Horizontal error bars are due
to the fact that often the same elastic parameters are used for helical or extended conformations, which have
different equilibrium distances. The green dots represent the models with chemical–physical-based par-
ameterization (Alemani et al. 2010 ; Friedel & Shea, 2004 ; Honeycutt & Thirumalai, 1990 ; Levitt, 1976 ;
Mukherjee & Bagchi, 2002 ; Nymeyer et al. 1998 ; Silverstein et al. 1998 ; Sorenson & Head-Gordon, 2002a,
b; Tozzini, in preparation). The dashed black line is a guide to the eye: y = −1 + 25/(x − 4).
previous section focus on the prediction of the fold, at the expense of the accuracy of the local
structure. The UNRES and related models represent an attempt to combine the two aspects, but
at the expense of having very complex FF terms with a large number of parameters, which puts
this model at the very border of the class of the ‘ minimalist ’. In this section, the models are
re-considered with the aim of analyzing how the single FF terms influence the structure accuracy
and prediction. Possible strategies to rationally build an accurate and predictive minimalist model
are then suggested.
Fig. 6. (α,θ) plots of different kinds of models. Colored striped areas correspond to the values of α and θ
where U_back(θ,α) assumes values less than a cutoff level (the cutoff level used ranges between 2
and 5 kcal/mol). Representative values of the parameters reported in the literature for each model are used
in the evaluation of the contour lines. (a) Representation of the UNRES-like U_back (magenta). In orange in
the background, the ‘experimental’ (α,θ) plot (the same as in Fig. 2d) is reported for comparison. The black
solid line is the correlation line by Levitt and Warshel; the black dotted line is its symmetric counterpart,
which should stand for structures with opposite chirality. (b) Representation of the Sorenson–Head-Gordon-like
U_back (green for helical propensity, blue for sheet propensity), compared with the ‘experimental’ (α,θ) plot.
(c) Representation of the double well+dihedral U_back (green for helical propensity, blue for sheet
propensity). Solid black lines: correlation lines imposed by constraining the 1–4 distance through the formula
(r/l)² = 4 sin²θ sin²(α/2) + [1 − 2cos θ]² with l = 3.8 Å, r = 5.2 Å for helices and r = 10 Å for sheets.
Black dashed line: the same for a structure with opposite chirality. In the background in red-magenta, the (α,θ)
map is evaluated over a simulation of the helix-to-hairpin transition for the corresponding minimalist
model. The structures corresponding to the various areas of the plot are also reported.
introduces a correlation between the θ and α variables that can be analytically evaluated (see the
caption of Fig. 6) and is represented by the black line in the central region of Fig. 6c (solid for
right-handed helices and dashed for left-handed ones). Similarly, assuming for simplicity that
extended conformations in sheet structures are stabilized by constraining the 1–4 distances at
larger values (~10 Å) determined by inter-strand hydrogen bonds, one obtains the two
correlation lines in the upper part of the graph. It should be expected that when U_hb is added to
U_back, the contour map of the model changes, increasing the population around the correlation
lines. This is in fact what happens, as shown by the (α,θ) plot obtained from a simulation of a
minimalist 20-mer model that implements these potentials (Tozzini, in preparation) (in the
background in red-magenta, in Fig. 6c). In this simulation, the hairpin conformation is more
stable, but the simulation starts from the helical conformation, so that both conformations
are sampled. As is evident, the lobes are deformed and the (α,θ) plot assumes a shape that is
very similar to that of the generic polypeptide (compare with Fig. 2d), especially in the helical
lobe, indicating that a proper functional form of U_back (especially U_ang) can give a very accurate
description of the backbone conformation when coupled to appropriate hydrogen bonding
terms, which also introduce the correct correlations between the α and θ variables in the plot. In
addition, it is to be observed that U_back and U_hb both concur in stabilizing the secondary
structures one relative to the other; thus their relative strength must be coherently balanced to
reproduce the helical versus sheet propensity for the different amino acids, in a transferable model.
When U_hb is more complex than a simple distance constraint, its effect is harder to evaluate
a priori and can be assessed only through a simulation. However, it is likely to be
similar to that described by the simplest model.
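The correlation lines discussed above follow from the geometry of four consecutive beads joined by bonds of equal length l, with two equal bond angles θ and a dihedral α (α = 0 for the cis arrangement). Written as (r/l)² = 4 sin²θ sin²(α/2) + (1 − 2cos θ)², the relation can be checked directly against Cartesian coordinates. The sketch below evaluates the 1–4 distance and inverts the relation to trace a correlation line; the function names are hypothetical and l = 3.8 Å is taken as the Cα–Cα distance.

```python
import math

def r14(theta_deg, alpha_deg, l=3.8):
    """1-4 distance (same units as l) for bond angle theta and dihedral alpha, in degrees."""
    th = math.radians(theta_deg)
    al = math.radians(alpha_deg)
    x = 4.0 * math.sin(th) ** 2 * math.sin(al / 2.0) ** 2 + (1.0 - 2.0 * math.cos(th)) ** 2
    return l * math.sqrt(x)

def alpha_on_line(theta_deg, r, l=3.8):
    """Dihedral in [0, 180] deg giving a 1-4 distance r at bond angle theta.

    Returns None when no dihedral can produce that distance at this angle.
    The mirror solution -alpha corresponds to the opposite chirality.
    """
    th = math.radians(theta_deg)
    denom = 4.0 * math.sin(th) ** 2
    if denom == 0.0:
        return None
    s2 = ((r / l) ** 2 - (1.0 - 2.0 * math.cos(th)) ** 2) / denom
    if s2 < 0.0 or s2 > 1.0:
        return None
    return math.degrees(2.0 * math.asin(math.sqrt(s2)))

if __name__ == "__main__":
    # Trace the line for a fixed 1-4 distance of 5.2 Å over a range of bond angles.
    for theta in range(80, 111, 5):
        print(theta, alpha_on_line(theta, 5.2))
```

Fixing r therefore leaves a one-parameter family of (θ, α) pairs, which is why a distance constraint appears as a line rather than a point in the (α,θ) plane.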
While U_back + U_hb gives the main contribution in determining the secondary structure,
U_nb participates in stabilizing the less structured conformations (e.g. random coils, turns) and
determines the stability of the tertiary fold; thus its accuracy is crucial in determining the
global fold. In section 4.2.3, it was shown that even after subtracting the correlation effects,
U_nb maintains an at least double-welled shape. This is also confirmed by the direct calculation
of the effective potential from the FM procedure (Zhou et al. 2007). As already noted, the two
main minima correspond roughly to closer and looser packing of the side chains, due to their
conformational flexibility and/or to the mediation of the interaction by water molecules.
Thus the first minimum, located at ~6 Å, is likely to influence the relative stability of helices,
sheets and random coils, although only as a second-order effect compared with the stronger U_back
and U_hb, while both minima are likely to influence the stability of the tertiary fold. An accurate
U_nb term must include these effects in an amino-acid-dependent fashion in order to reproduce
the protein fold. This implies that the single-welled Sorenson–Head-Gordon-like potential is
probably not completely adequate and that BI- or FM-based multi-welled potentials, such as those
proposed in Korkut & Hendrickson (2009) or Zhou et al. (2007), should be used to achieve this aim.
(typical helix length); m=7 (typical length of a strand in a sheet). The energy and entropy terms are evaluated
under the following additional assumptions: (i) in the turns, kinks between helices and in the globules, the
conformation adopted is similar to that of the helix; thus its conformational energy is estimated to be ~3/4D;
(ii) the hydrogen bond energy in helices and sheets is evaluated counting the stable number of hydrogen bonds;
(iii) the percentage of residues adopting the turn-like or helical-like conformation in the globules is ~40%,
evaluated on a set of unstructured proteins (e.g. containing less than 5% helices or sheets); similarly, the
average number of helical-like hydrogen bonds, other hydrogen bonds and non-bonded contacts is evaluated to be
~0.1, 0.3 and 1.2 per residue; (iv) in the turns between strands the hydrogen bond conformation adopted is
similar to that in the helices; (v) in the helices the non-bonded contacts are considered the 1–5 interactions
plus possible inter-helical interactions when broken helix structures are possible; (vi) in the sheets, the
non-bonded interactions are considered as the second neighbor interactions among sheets, and are scaled by 0.9
because their distance is displaced from the minimum of U_nb; (vii) the entropy is evaluated very roughly
considering both the conformational and configuration space for each structure. For instance, the helix has only
one (or a few) possible hydrogen bonding topologies and is rigid; thus it has the lowest possible entropy
(larger −TS); in the sheet the topology of contacts is equally fixed, but there is more conformational
flexibility; the globule has a larger number of possible contact topologies and average flexibility; finally,
the extended conformation has the largest conformational space possible. The present evaluation of the entropy
is entirely qualitative and has an arbitrary multiplicative factor, chosen in such a way that the melting
temperature of the peptide is a little above room temperature. The total free energy of each conformation is
evaluated as the sum of the energy terms and −TS.
secondary structure propensity scale, such as that by Chou and Fasman (CF) (Chou & Fasman,
1978), who give, for each amino acid, values of the helix, sheet and turn propensity (pa, pb and
pt). Thus, at fixed rhb, the secondary structure propensity depends on d and rnb. This can be
expressed by simple relations d=d(pa, pb) and rnb=rnb(pt, pa) such as those given in the
caption of Fig. 7 (Tozzini, in preparation). Using them, and the CF propensity scale, the
parameters d and rnb for each amino acid are evaluated and the name of the amino acid is placed on
the (d, rnb) plane accordingly. As can be seen, each amino acid is located in the correct region
of the plane, according to its propensity, including those with hybrid sheet–helix propensity
(located on the line separating the two phases) and the turn-former amino acids.
This result was obtained assuming that the helix versus sheet propensity is mainly due to d,
while rnb modulates the tendency to form defined secondary structures versus random coils.
However, one could also attribute the helix versus sheet propensity to rhb and keep d fixed at an
average value, since the strength of the hydrogen bonds depends on the hydration properties
and on the amino acid type. In this case, a relationship rhb=rhb(pa, pb) is needed to assign the
rhb value to each amino acid (see the caption of Fig. 7), but the result is nearly the same and is
reported in Fig. 7b: the amino acids are located correctly in the regions of the plane according to
their secondary structure propensity. A third way, the most physically plausible, is to vary both d
and rhb with the amino acid type, in a correlated way (correlation relation in the caption of Fig. 7);
the result is represented in Fig. 7c, giving the same accuracy as the two previous ones in
locating the amino acids in the phase plane. Figure 7c reports the graph at room temperature
(t=003 if ehba assumes the reasonable value of ~2 kcal/mol): with respect to zero
temperature, the globule state is stabilized and the corresponding area enlarges at the expense of the
structured areas.
Once these relationships between d, rhb and rnb and pa, pb and pt are fixed, one can study the
phase diagram as a function of the temperature. This is reported in Fig. 7d–f for three different
values of rhb corresponding to helix-former, sheet-former and ‘indifferent’ amino acids (areas
enclosed in rectangles in Fig. 7c). The behaviors are quite different, and only in the case of the
helix former is the helix-to-sheet transition observed, at a temperature modulated by rnb. In
Fig. 7e, f, the temperature scale is expanded to show also the denaturation temperature (i.e. the
transition to the extended state). These phase diagrams can help in fixing the parameterization
based on the experimental knowledge of the transition temperatures. It should be remarked that the
evaluation of these phase diagrams is qualitative and should be checked by means of simulations
of different peptide models spanning the whole parameter space (work in progress; Tozzini, in
preparation).
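The logic behind such temperature-dependent phase diagrams can be sketched as a comparison of free energies F = E − TS among the candidate states, the stable phase being the one with the lowest F. In the toy example below the energies and entropies are invented placeholders in arbitrary units, chosen only to respect the qualitative ordering described in the text (the helix is the most rigid state, the extended chain the most disordered); they are not the values used for Fig. 7.

```python
def free_energies(energies, entropies, t):
    """Return F = E - t*S for every phase (t in the same reduced units as E/S)."""
    return {name: energies[name] - t * entropies[name] for name in energies}

def stable_phase(energies, entropies, t):
    """Name of the phase with the lowest free energy at temperature t."""
    f = free_energies(energies, entropies, t)
    return min(f, key=f.get)

def crossing_temperature(energies, entropies, a, b):
    """Temperature at which phases a and b have equal free energy, or None if S_a == S_b."""
    ds = entropies[a] - entropies[b]
    if ds == 0:
        return None
    return (energies[a] - energies[b]) / ds

if __name__ == "__main__":
    # Hypothetical per-residue values, for illustration only.
    E = {"helix": -2.0, "sheet": -1.8, "globule": -1.0, "extended": 0.0}
    S = {"helix": 0.5, "sheet": 1.0, "globule": 3.0, "extended": 5.0}
    for t in (0.0, 0.2, 0.4, 0.6, 1.0):
        print(t, stable_phase(E, S, t))
```

Making the energies depend on the amino-acid parameters would turn each crossing temperature into a boundary line in the parameter plane, which is how the transition temperatures can be used to constrain the parameterization.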
In conclusion, the suggested relationships rnb(pa, pt), rhb(pb, pa) and d(rhb) give simple
operative prescriptions to evaluate the main energetic parameters for each amino acid, based
on given amino acid secondary structure propensities. Although there is room for optimizing it,
this recipe incorporates physical–chemical properties in the parameterization at essentially no
cost, and possibly also some thermodynamic properties, through the secondary structure
propensity. This does not exhaust the full parameterization: the relative weights of the angle and
dihedral parameters in determining d should be determined separately, possibly by means of
FM or BI. The same is true for the relative energies of the two wells in U_nb, which also influence
more global properties such as the tertiary fold and the interactions between protein
domains. These aspects must be systematically addressed by FM- and/or BI-based methods, as
several authors are doing (Korkut & Hendrickson (2009) and Trovato & Tozzini (in preparation)).
Fig. 7. Phase diagram of the minimalist polypeptide model. Color code for phases: green=helix, cyan=sheet,
yellow=coil, turn or molten globule, blue=extended (completely denatured state). (a) Projection
onto the d–rnb plane, at zero temperature and average value of rhb (indicated in the graph). The amino acid
names are located in the plane according to the corresponding values of d and rnb, evaluated from the
Chou–Fasman (CF) secondary structure propensities pa (helix propensity), pb (sheet propensity) and pt
Once optimized, this recipe for building the CG FF can in principle include the local structural
accuracy (within the structural parameters via BI or FM procedures), the capability of predicting
the relative secondary structure stabilities (included in the energetic parameters evaluated from
the secondary structure propensities) and the tertiary–quaternary fold (via the accurate amino-
acid-dependent parameterization of the U nb term).
Fig. 8. Upper part : schematic summary of the characteristics of the main classes of OB CG models. The
models are placed in the diagram approximately according to their structural accuracy and predictive power.
The pictures indicate very roughly the size of the system treated and the functions of the model. At the
bottom a schematic representation of the release of the bias is shown, which parallels the increase in
transferability. In blue, the polypeptide backbone is schematically represented. The red lines connecting the
Cα atoms represent the biased interactions. Dashed lines represent interactions with hybrid biased–unbiased
character (sometimes occurring in Go models).
reached, for instance, for AA FF. In spite of this, I believe that efforts to systematically study and
parameterize a minimalist model for proteins (and more generally for bio-molecules) should not
be abandoned, because minimalist models are those that combine a sufficient level of resolution
with the largest possible gain in computational cost; thus they are the key to addressing bio-systems
on biologically interesting size and time scales.
7. Acknowledgements
8. References
ALEMANI, D., COLLU, F., CASCELLA, M. & DAL PERARO, M. (2010). A nonradial coarse-grained potential for proteins produces naturally stable secondary structure elements. J. Chem. Theory Comput. 6, 315–324.
ARCANGELI, C. & TOZZINI, V. (in preparation). Multi-scale modeling molecular dynamics of the Artichoke Mottled Crinkle Virus, in preparation.
ARORA, N. & JAYARAM, B. (1996). Strength of hydrogen bonds in alpha helices. J. Comput. Chem. 18, 1246–1252.
ATILGAN, A. R., DURELL, S. R., JERNIGAN, R. L., DEMIREL, M. C., KESKIN, O. & BAHAR, I. (2001). Anisotropy of fluctuation dynamics of proteins with an elastic network model. Biophys. J. 80, 505–515.
AYTON, G. S., NOID, W. G. & VOTH, G. A. (2007). Multiscale modeling of biomolecular systems: in serial and in parallel. Curr. Opin. Struct. Biol. 17, 192–198.
BAHAR, I. & JERNIGAN, R. L. (1997). Inter-residue potentials in globular proteins and the dominance of highly specific hydrophilic interactions at close separation. J. Mol. Biol. 266, 195–214.
BAHAR, I., KAPLAN, M. & JERNIGAN, R. L. (1997). Short-range conformational energies, secondary structure propensities, and recognition of correct sequence–structure matches. Proteins 29, 292–308.
BANACHOWICZ, E., GAPINSKI, J. & PATKOWSKI, A. (2000). Solution structure of biopolymers: a new method of constructing a bead model. Biophys. J. 78, 70–78.
BAY, Y. & ENGLANDER, W. (1994). Hydrogen bond strength and beta-sheet propensities: the role of a side chain blocking effect. Proteins 18, 262–266.
BETANCOURT, M. R. & THIRUMALAI, D. (1999). Pair potentials for protein folding: choice of reference states and sensitivity of predicted native states to variations in the interaction schemes. Protein Sci. 8, 361–369.
BUCK, P. M. & BYSTROFF, C. (2009). Simulating protein folding initiation sites using an alpha-carbon-only knowledge-based force field. Proteins 76, 331–342.
CASCELLA, M. & PERARO, M. D. (2008). Challenges and perspectives in biomolecular simulations: from the atomistic picture to multiscale modeling. Curr. Opin. Struct. Biol. 18, 630–640.
CHANG, C.-E., TRYLSKA, J., TOZZINI, V. & MCCAMMON, J. A. (2007). Binding pathways of ligands to HIV-1 protease: coarse-grained and atomistic simulations. Chem. Biol. Drug Des. 69, 5–13.
CHENNUBHOTLA, C., RADER, A. J., YANG, L.-W. & BAHAR, I. (2005). Elastic network models for understanding biomolecular machinery: from enzymes to supramolecular assemblies. Phys. Biol. 2, S173–S180.
CHOU, P. Y. & FASMAN, G. D. (1978). Empirical prediction of protein conformation. Annu. Rev. Biochem. 47, 251–276.
CHU, J.-W. & VOTH, G. A. (2007). Coarse-grained free energy functions for studying protein conformational changes: a double-well network model. Biophys. J. 93, 3860–3871.
CLEMENTI, C., NYMEYER, H. & ONUCHIC, J. N. (2000). Topological and energetic factors: what determines the structural details of the transition state ensemble and ‘en-route’ intermediates for protein folding? An investigation for small globular proteins. J. Mol. Biol. 298, 937–953.
DAS, A. & ANDERSEN, H. C. (2009). The multiscale coarse-graining method. III. A test of pairwise additivity of the coarse-grained potential and of new basis functions for the variational calculation. J. Chem. Phys. 131, 034102.
DAS, P., MATYSIAK, S. & CLEMENTI, C. (2005). Balancing energy and entropy: a minimalist model for the characterization of protein folding landscapes. Proc. Natl. Acad. Sci. U.S.A. 102, 10141–10146.
DEMIREL, M. C. & KESKIN, O. (2005). Protein interactions and fluctuations in a proteomic network using an elastic network model. J. Biomol. Struct. Dyn. 22, 381–386.
DI FENZA, A., ROCCHIA, W. & TOZZINI, V. (2009). Complexes of HIV-1 Integrase with HAT proteins: multiscale models, dynamics and hypotheses on allosteric sites of inhibition. Proteins 76, 946–958.
ERCOLESSI, F. & ADAMS, J. B. (1994). Interatomic potentials from first-principles calculations: the force-matching method. Europhys. Lett. 26, 583.
FLORENCE TAMA, F. & BROOKS III, C. L. (2005). Symmetry, form, and shape: guiding principles for robustness in macromolecular machines. Annu. Rev. Biophys. Biomol. Struct. 35, 115–133.
FRIEDEL, M. & SHEA, J. M. (2004). Self-assembly of peptides into a β-barrel motif. J. Chem. Phys. 120, 5809.
FRIEDEL, M., SHEELER, D. J. & SHEA, J.-E. (2003). Effects of confinement and crowding on the thermodynamics and kinetics of folding of a minimalist β-barrel protein. J. Chem. Phys. 118, 8106–8113.
GO, N. & SCHERAGA, H. A. (1976). On the use of classical statistical mechanics in the treatment of polymer chain conformation. Macromolecules 9, 535–542.
HA-DUONG, T. (2010). Protein backbone dynamics simulations using coarse-grained bonded potentials and simplified hydrogen bonds. J. Chem. Theory Comput. 6, 761–773.
HAMACHER, K. & MCCAMMON, J. A. (2006). Computing the amino acid specificity of fluctuations in biomolecular systems. J. Chem. Theory Comput. 2, 873–878.
HONEYCUTT, J. D. & THIRUMALAI, D. (1990). Metastability of the folded states of globular proteins. Proc. Natl. Acad. Sci. U.S.A. 87, 3526–3529.
IZVEKOV, S. & VOTH, G. A. (2006). Multiscale coarse-graining of mixed phospholipid/cholesterol bilayers. J. Chem. Theory Comput. 2, 637–648.
IZVEKOV, S., PARRINELLO, M., BURNHAM, C. J. & VOTH, G. A. (2004). Effective force fields for condensed phase systems from ab initio molecular dynamics simulation: a new method for force-matching. J. Chem. Phys. 120, 10896–10913.
IZVEKOV, S. & VOTH, G. A. (2005). Multiscale coarse graining of liquid-state systems. J. Chem. Phys. 123, 134105.
JANG, H., HALL, C. K. & ZHOU, Y. (2004). Assembly and kinetic folding pathways of a tetrameric β-sheet complex: molecular dynamics simulations on simplified off-lattice protein models. Biophys. J. 86, 31–49.
JEONG, J. I., JANG, Y. & KIM, M. K. (2005). A connection rule for α-carbon coarse-grained elastic network models using chemical bond information. J. Mol. Graph Model 24, 296–306.
KAYA, H. & CHAN, H. S. (2003). Solvation effects and driving forces for protein thermodynamic and kinetic cooperativity: how adequate is native-centric topological modeling? J. Mol. Biol. 326, 911–931.
KESKIN, O., BAHAR, I., BADRETDINOV, A., PTITSYN, O. & JERNIGAN, R. (1998). Empirical solvent-mediated potentials hold for both intra-molecular and inter-molecular inter-residue interactions. Protein Sci. 7, 2578.
KLIMOV, D. K. & THIRUMALAI, D. (2000). Mechanisms and kinetics of β-hairpin formation. Proc. Natl. Acad. Sci. U.S.A. 97, 2544–2549.
KLIMOV, D. K., BETANCOURT, M. R. & THIRUMALAI, D. (1998). Virtual atom representation of hydrogen bonds in minimal off-lattice models of alpha helices: effect on stability, cooperativity and kinetics. Folding Des. 3, 481–496.
KOGA, N. & TAKADA, S. (2001). Roles of native topology and chain-length scaling in protein folding: a simulation study. J. Mol. Biol. 313, 171–180.
KORKUT, A. & HENDRICKSON, W. A. (2009). A force field for virtual atom molecular mechanics of proteins. Proc. Natl. Acad. Sci. U.S.A. 106, 15667–15672.
KUNDU, S., SORENSEN, D. C. & PHILLIPS, JR., G. R. (2004). Automatic domain decomposition of proteins by a Gaussian network model. Proteins 57, 725–733.
LEVITT, M. & WARSHEL, A. (1975). Computer simulation of protein folding. Nature 253, 694–698.
LEVITT, M. (1976). A simplified representation of protein conformations for rapid simulation of protein folding. J. Mol. Biol. 104, 59–107.
LIU, P., IZVEKOV, S. & VOTH, G. A. (2007). Multi-scale coarse graining of monosaccharides. J. Phys. Chem. B 111, 11566–11575.
LIWO, A., OLDZIEJ, S., PINCUS, M. R., WAWAK, R. J., RACKOWSKY, S. & SCHERAGA, H. A. (1997a). A united-residue force field for off-lattice protein structure simulations. I. Functional forms and parameters of long range side chain interactions potentials from protein crystal data. J. Comput. Chem. 18, 849–873.
LIWO, A., PINCUS, M. R., WAWAK, R. J., RACKOWSKY, S., OLDZIEJ, S. & SCHERAGA, H. A. (1997b). A united-residue force field for off-lattice protein structure simulations. II. Parameterization of short-range interactions and determination of weights of energy terms by z-score optimization. J. Comput. Chem. 18, 874–887.
LYMAN, E., PFAENDTNER, J. & VOTH, G. A. (2008). Systematic multiscale parameterization of heterogeneous elastic network models of proteins. Biophys. J. 95, 4183–4192.
MÁJEK, P. & ELBER, R. (2009). A coarse-grained potential for fold recognition and molecular dynamics simulations of proteins. Proteins 76, 822–836.
MARAGAKIS, P. & KARPLUS, M. (2005). Large amplitude conformational change in proteins explored with a plastic network model: adenylate kinase. J. Mol. Biol. 352, 807–822.
MATHEWS, C., VAN HOLDE, K. E. & AHERN, K. G. (2000). Biochemistry. 3rd edn. San Francisco: Addison Wesley Longman Inc.
MATYSIAK, S. & CLEMENTI, C. (2006). Minimalist protein model as a diagnostic tool for misfolding and aggregation. J. Mol. Biol. 363, 297–308.
MCCAMMON, J. A. & NORTHRUP, S. H. (1980). Helix–coil transition in a simple polypeptide model. Biopolymers 19, 2033–2045.
MIYAZAWA, S. & JERNIGAN, R. L. (1996). Residue–residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. J. Mol. Biol. 256, 623.
MONTICELLI, L., KANDASAMY, S. K., PERIOLE, X., LARSON, R. G., TIELEMAN, D. P. & MARRINK, S.-J. (2008). The MARTINI coarse-grained force field: extension to proteins. J. Chem. Theory Comput. 4, 819–834.
MUKHERJEE, A. & BAGCHI, B. (2002). Correlation between rate of folding, energy landscape and topology in the folding of a model protein HP-36. J. Chem. Phys. 118, 4733–4747.
MUKHERJEE, A., BHIMALAPURAM, P. & BAGCHI, B. (2005). Orientation-dependent potential of mean force for protein folding. J. Chem. Phys. 123, 014901.
NAKAGAWA, N. & PEYRARD, M. (2006). Modeling protein thermodynamics and fluctuations at the mesoscale. Phys. Rev. E 74, 041916.
NOID, W. G., CHU, J.-W., AYTON, G. S., KRISHNA, V., IZVEKOV, S., VOTH, G. A., DAS, A. & ANDERSEN, H. C. (2008a). The multiscale coarse-graining method. I. A rigorous bridge between atomistic and coarse-grained models. J. Chem. Phys. 128, 244114.
NOID, W. G., LIU, P., WANG, Y., CHU, J.-W., AYTON, G. S., IZVEKOV, S., ANDERSEN, H. C. & VOTH, G. A. (2008b). The multiscale coarse-graining method. II. Numerical implementation for coarse-grained molecular models. J. Chem. Phys. 128, 244115.
NYMEYER, H., GARCIA, A. E. & ONUCHIC, J. N. (1998). Folding funnels and frustration in off-lattice minimalist protein landscapes. Proc. Natl. Acad. Sci. U.S.A. 95, 5921–5928.
OKUR, A., STROCKBINE, B., HORNAK, V. & SIMMERLING, C. (2003). Using PC clusters to evaluate the transferability of molecular mechanics force fields for proteins. J. Comput. Chem. 24, 21–31.
ONO, S., NAKAJIMA, N., HIGO, J. & NAKAMURA, H. (2000). Peptide free-energy profile is strongly dependent on the force field: comparison of C96 and AMBER95. J. Comput. Chem. 21, 748–762.
REITH, D., PÜTZ, M. & MÜLLER-PLATHE, F. (2003). Deriving effective mesoscale potentials from atomistic simulations. J. Comput. Chem. 24, 1624–1636.
RUSSELL, D., LASKER, K., PHILLIPS, J., SCHNEIDMAN-DUHOVNY, D., VELÁZQUEZ-MURIEL, J. A. & SALI, A. (2009). The structural dynamics of macromolecular processes. Curr. Opin. Cell Biol. 21, 1–12.
SHERWOOD, P., BROOKS, B. R. & SANSOM, M. S. (2008). Multiscale methods for macromolecular simulations. Curr. Opin. Struct. Biol. 18, 630–640.
SHI, Q., IZVEKOV, S. & VOTH, G. A. (2006). Mixed atomistic and coarse grained molecular dynamics: simulation of a membrane bound ion channel. J. Phys. Chem. B 110, 15045–15048.
SILVERSTEIN, K. A. T., HAYMET, A. D. J. & DILL, K. A. (1998). A simple model of water and the hydrophobic effect. J. Am. Chem. Soc. 120, 3166–3175.
SOHEILIFARD, R., MAKAROV, D. E. & RODIN, G. J. (2008). Critical evaluation of simple network models of protein dynamics and their comparison with crystallographic B-factors. Phys. Biol. 5, 026008.
SORENSON, J. M. & HEAD-GORDON, T. (2002a). Protein engineering study of protein L by simulation. J. Comput. Biol. 9, 35–54.
SORENSON, J. M. & HEAD-GORDON, T. (2002b). Toward minimalist models of larger proteins: a ubiquitin-like protein. Proteins 46, 368–379.
THORPE, I. F., ZHOU, J. & VOTH, G. A. (2008). Peptide folding using multiscale coarse-grained models. J. Phys. Chem. B 112, 13079–13090.
TIRION, M. M. (1996). Large amplitude elastic motions in proteins from a single-parameter, atomic analysis. Phys. Rev. Lett. 77, 1905.
TOZZINI, V. (2005). Coarse grained models for proteins. Curr. Opin. Struct. Biol. 15, 144–150.
TOZZINI, V. (2010). Multi-scale modeling of proteins. Acc. Chem. Res. 43, 220–230.
TOZZINI, V. (in preparation). The phase diagram of a minimalist polypeptide model.
TOZZINI, V. & MCCAMMON, J. A. (2005). A coarse grained model for the dynamics of flap opening in HIV-1 protease. Chem. Phys. Lett. 413, 123–128.
TOZZINI, V. & MCCAMMON, J. A. (2008). One-bead models for proteins. In Coarse Graining of Condensed Phase and Biomolecular Systems (ed. G. A. Voth), p. 285. Washington, DC: CRC Press.
TOZZINI, V., ROCCHIA, W. & MCCAMMON, J. A. (2006). Mapping AA models onto one-bead coarse grained models: general properties and applications to a minimal polypeptide model. J. Chem. Theory Comput. 2, 667–673.
TOZZINI, V., TRYLSKA, J., CHANG, C.-E. & MCCAMMON, J. A. (2007). Flap opening dynamics in HIV-1 protease explored with a coarse-grained model. J. Struct. Biol. 157, 606–615.
TROVATO, F. & TOZZINI, V. (in preparation). A coarse grained model for the dynamics of the aggregation of the green fluorescent proteins.
TRYLSKA, J., TOZZINI, V., CHANG, C.-E. & MCCAMMON, J. A. (2007). HIV-1 protease substrate binding and product release pathways explored with coarse-grained molecular dynamics. Biophys. J. 92, 4179–4187.
TRYLSKA, J., TOZZINI, V. & MCCAMMON, J. A. (2005). Exploring global motions and correlations in the ribosome. Biophys. J. 89, 1455–1463.
VAN AALTEN, D. M. F., DE GROOT, B. L., FINDLAY, J. B. C., BERENDSEN, H. J. C. & AMADEI, A. (1997). A comparison of techniques for calculating protein essential dynamics. J. Comput. Chem. 18, 169–181.
VOET, D. & VOET, J. G. (2005). Biochemistry. 3rd edn. New York: Wiley.
VOLTZ, K., TRYLSKA, J., TOZZINI, V., KURKAL-SIEBERT, V., LANGOWSKI, J. & SMITH, J. (2008). Coarse-grained force field for the nucleosome from self-consistent multiscaling. J. Comput. Chem. 29, 1429–1439.
WANG, Y., NOID, W. G., LIU, P. & VOTH, G. A. (2009). Effective force coarse-graining. Phys. Chem. Chem. Phys. 11, 2002–2015.
WU, Y., LU, M., CHEN, M., LI, J. & MA, J. (2007). OPUS-Cα: a knowledge-based potential function requiring only Cα positions. Protein Sci. 16, 1449–1463.
YAP, E.-H., FAWZI, N. L. & HEAD-GORDON, T. (2008). A coarse-grained α-carbon protein model with anisotropic hydrogen-bonding. Proteins 70, 626–638.
ZACHARIAS, M. (2003). Protein–protein docking with a reduced protein model accounting for side-chain flexibility. Protein Sci. 12, 1271–1282.
ZHOU, H. & ZHOU, Y. (2002). Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction. Protein Sci. 11, 2714–2726.
ZHOU, J., THORPE, I. F., IZVEKOV, S. & VOTH, G. A. (2007). Coarse-grained peptide modeling using a systematic multiscale approach. Biophys. J. 92, 4289–4303.