
Quarterly Reviews of Biophysics 43, 3 (2010), pp. 333–371.

© Cambridge University Press 2010


doi:10.1017/S0033583510000132 Printed in the United States of America

Minimalist models for proteins: a comparative analysis

Valentina Tozzini*
NEST, Istituto Nanoscienze – CNR Scuola Normale Superiore, Piazza San Silvestro 12, I-56127 Pisa, Italy

Abstract. The last decade has witnessed a renewed interest in coarse-grained (CG)
models for biopolymers, stimulated in part by the needs of modern molecular biology, which deals
with nano- to micro-sized biomolecular systems and timescales longer than a microsecond. This
combination of size and timescale is, in fact, hard to access by atomic-based simulations.
Coarse graining the system is a route to overcome these limits, but there are many different
ways of implementing it in practice, which makes the landscape of CG models
vast and complex.
In this paper, the CG models are reviewed and their features, applications and performances
compared. This analysis, restricted to proteins, focuses on the minimalist models, namely those
that reduce the number of degrees of freedom to a minimum without losing the possibility of
explicitly describing the secondary structures. This class includes models using a single or a few
interacting centers (beads) for each amino acid.
Several issues emerge from this analysis. The difficulty in building these models resides in the
need to combine transferability/predictive power with the capability of accurately
reproducing the structures. It is shown that these aspects can be optimized by carefully
choosing the force field (FF) terms and functional forms, and by combining different
parameterization procedures. In addition, in spite of the variety of the minimalist models,
regularities can be found in the parameter values and in the FF terms. These are outlined and
schematically presented with the aid of a generic phase diagram of the polypeptide in the
parameter space and, hopefully, could serve as guidelines for the development of minimalist
models incorporating the maximum possible level of predictive power and structural accuracy.

1. Introduction

2. Models description and FF terms

3. Backbone conformation description and secondary structures representability

4. Parameterization philosophies
4.1. Bias and fixed topology: networks and Go-models
4.2. The Boltzmann inversion
4.2.1. The Boltzmann inverses (BIns) of the conformational internal variables
4.2.2. BIns and potential of mean forces (PMFs)
4.2.3. From the PMF to the effective interactions
4.2.4. Examples of models based on BI
4.3. The Force Matching (FM) method
4.3.1. Mechanical consistency versus thermodynamic consistency
4.3.2. Examples of FFs based on the FM
4.4. Physics–chemistry-based models and combinations of methods

5. Building accurate and predictive minimalist models
5.1. Accurate reproduction of the secondary structure
5.2. A possible strategy toward accurate and predictive models

6. Toward an optimal OB model: conclusions and perspectives

7. Acknowledgements

8. References

* Email: tozzini@nest.sns.it

1. Introduction
The activity of a living cell consists of a complex network of interactions among bio-molecules
exchanging information and energy through biochemical processes (Russell et al. 2009). These
occur on different scales, spanning about 10 orders of magnitude in the space domain and 15 in
the time domain, and require many different modeling techniques, often
combined in so-called multi-scale approaches (Ayton et al. 2007; Cascella & Peraro, 2008;
Sherwood et al. 2008; Tozzini, 2010). The methods used for the atomic-level descriptions,
namely the quantum mechanics approaches and the force field (FF)-based molecular dynamics
(MD) simulations, are well-established techniques that have reached a satisfactory level of
standardization and accuracy (Tozzini, 2010). However, even taking into account the current trend
of computer power increase, atomistic simulations are not likely to reach the
biologically interesting scales for a long time. This is especially true for the time domain: while
large macromolecular assemblies are currently addressable with all-atom (AA) MD simulations
to the sub-ns timescale, the ms scale is a hard limit even for simulations of single proteins.
This excludes a large portion of the biological processes, which generally involve macromolecular
aggregates (>10 nm) and timescales of ~10 ms or more.
In order to overcome these limits, the idea of considering simplified models at less than
atomic resolution arises quite naturally. Reducing the number of internal variables used in
the description of the system (the ‘coarse graining’) saves computational cost and
consequently makes it possible to simulate large systems for longer times, in principle with no
limitation, because the upper limit of the run length depends on the level of coarse graining.
However, after its first appearance several decades ago, this idea underwent a long period of
latency. It has been reconsidered in recent years, probably triggered by the development of new
experimental techniques suited to investigating bio-systems at the nano–micro scale.
Coarse graining can be done at many different levels (Tozzini, 2005). The coarser the
description, the larger the saving in computational cost. But the elimination of internal degrees of
freedom implies that their effect must be taken into account implicitly in the effective forces
acting among the explicit degrees of freedom. This task becomes harder as the level of coarse
graining is made stronger (Tozzini & McCammon, 2008). Different recipes were proposed to
solve the related problems, and a large variety of different CG models, differing by the level of
coarse graining and by the philosophy of the parameterization of the FFs, are available, making
the CG models landscape very complex.
This paper focuses on a sub-class of CG models for proteins, also called ‘minimalist’ (Tozzini
& McCammon, 2008). Although in general this term is used with different meanings, in this
paper ‘minimalist’ refers to the models that implement the maximum level of coarsening
that still allows us to explicitly represent some fundamental features of the bio-molecule, such as
the secondary structure. Among these, particularly interesting are the models representing
an amino acid with one single interacting center (bead), i.e. the one-bead (OB) models. Coarser
representations cannot easily describe the secondary structure transitions. In addition, the OB
CG models are the most ‘natural’ representation, because the amino acid is the ‘building block’
of proteins. In this paper, however, the 2–3 bead models that explicitly represent the side
chain of the amino acid are also considered, because they share with the OB–CG models a similar
description of the backbone and similar parameterization-related problems. Conversely, the CG
models that explicitly represent the backbone atoms (models with 4–6 or more beads) or, at the other
extreme, the coarser models that group more than a single amino acid into one bead are excluded
from the present paper, because they display very different features.
The advantages of using minimalist models are obtained at the cost of a number of emerging
problems in the parameterization. Combining accuracy and predictive power in a few parameters
proves to be a hard task that has been faced with different strategies (Tozzini & McCammon, 2008;
Tozzini, 2010), giving rise to a variety of different models and parameterization recipes.
In this paper, these are reviewed and classified according to the number and location of the
beads, the type and form of the FF terms, and the parameterization strategy. Some technicalities
and subtleties underlying specific parameterization methods are addressed in particular, with
the aim of including these methods within a rigorous theoretical frame. The performances
and applicability of different models are also compared, and critical issues outlined. Advantages
and disadvantages of the parameterization methods emerge, together with regularities in the
relationship between FF terms and parameter values. Overall, this analysis outlines a possible
global strategy to build an optimal minimalist model and systematically assign the parameter
values, which is illustrated with the aid of a schematic phase diagram for polypeptides in the
parameter space.

2. Models description and FF terms


A list of representative minimalist models is reported in Table 1, separated into classes according
to the number and location of the beads. Class 1 includes the most natural models, i.e. those
representing an amino acid with a single interacting center (bead) placed on the alpha carbon
(Cα). This representation has several advantages. First, it matches exactly the low-resolution
structural data that usually resolve only the coordinates of the Cαs, allowing for a direct data
exchange with experiment (Trylska et al. 2005). This is particularly important when one adopts a
parameterization strategy based on statistical sets of structures, as will be clear in section 4.
In addition, due to the rigid geometry of the peptide bond, the position of the backbone atoms is
determined by the Cα coordinates, allowing the AA reconstruction of the backbone itself. This is
not generally true if the bead is placed at a different location.
The number of internal variables is extremely reduced: only the angle θ_i between three
subsequent Cαs and the dihedral α_i between four subsequent Cαs (see Table 1 for the definitions)
are sufficient to describe the conformation, since – with a few exceptions (pre-proline and
cis-conformation bonds) – the Cα–Cα pseudo-bond has a value that is fixed at 3.8 Å independent
of the secondary structure. As a consequence, the FF can have a very simple form

$$U = U_{\mathrm{bond}}(r_{i,i+1}) + U_{\mathrm{back}}(\theta_i, \alpha_i) + U_{\mathrm{nb}}(r_{ij}) \tag{2.1}$$


Table 1. Classification of the minimalist models for proteins according to their component beads. The main
internal variables are indicated in the third column. The main references are indicated. CM, center of mass
[Columns ‘Balls and sticks’ and ‘Scheme’: graphical, not reproduced]

Class | Beads | Main references
1 | 1 bead: Cα | Honeycutt & Thirumalai (1990); Friedel et al. (2003); Jang et al. (2004); Das et al. (2005); Tozzini & McCammon (2005); Korkuta & Hendrickson (2009); Sorenson & Head-Gordon (2002a, b); Yap et al. (2008)
2 | 1 bead: Cβ | McCammon & Northrup (1980)
3 | 2 beads: Cα; side chain (CM, Cβ or centroid) | Bahar et al. (1997), Bahar & Jernigan (1997); Klimov et al. (1998), Klimov & Thirumalai (2000); Mukherjee & Bagchi (2002); Májek & Elber (2009)
4 | 2 beads: backbone CM; side chain CM | Zhou et al. (2007)
5 | 1–3 beads: Cα; 0–2 for side chain | Zacharias (2003)
6 | 1–6 beads: backbone centroid; 0–5 beads for the side chain | Monticelli et al. (2008); Ha-Duong (2010)
7 | 1–2 beads: Cα; backbone centroid; side chain centroid | Liwo et al. (1997a, b); Levitt (1976)

(sums over subscripts are implied), where U_bond is the term describing the pseudo-peptide bond
energy (often replaced by a constraint), U_back describes the conformational energy and U_nb
the non-bonded interactions. In OB models the latter term must include many different
effects: hydrogen bonding, excluded volume and hydrophobic interactions, and
electrostatics. Consequently, it can be very complex and it is often separated into sub-terms
describing each effect. For instance, the excluded volume interaction is intrinsically non-isotropic,
because the Cα is not located at the center of the amino acid. Models in class 2, i.e. those with the
bead placed on the Cβ (or on the ‘centroid’ of the amino acid), were proposed to reduce this
problem, and possess a more isotropic excluded volume term. However, additional problems
arise, related to the more difficult physical interpretation of the internal variables and to backbone
reconstruction. In particular, the equilibrium value of the Cβ–Cβ pseudo-bond distance is no
longer structure independent and, consequently, the term U_bond(r_{i,i+1}) is complex and dependent
on the secondary structure and amino acid type. This is true for all the models (even multiple-bead
models) where the backbone description is not based on the Cα positions (e.g. classes 4
and 6).
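As an illustration of how few internal variables a Cα-based OB description needs, the following minimal sketch computes the pseudo-bond angles θ_i and pseudo-dihedrals α_i from a list of Cα coordinates. The function name `pseudo_angles` and the use of NumPy are assumptions made for this example; the dihedral sign follows the usual convention in which a right-handed helix has a positive dihedral.

```python
import numpy as np

def pseudo_angles(ca):
    """Return (theta, alpha) in degrees for an (N, 3) array of C-alpha coordinates.

    theta has N-2 entries (angle at each internal bead),
    alpha has N-3 entries (dihedral of each group of four consecutive beads).
    """
    ca = np.asarray(ca, dtype=float)
    b = ca[1:] - ca[:-1]                      # pseudo-bond vectors r_{i,i+1}

    # theta_i: angle at bead i between the bonds pointing to beads i-1 and i+1
    u, v = -b[:-1], b[1:]
    cos_t = np.sum(u * v, axis=1) / (
        np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1))
    theta = np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0)))

    # alpha_i: dihedral from three consecutive pseudo-bonds, via atan2
    b1, b2, b3 = b[:-2], b[1:-1], b[2:]
    n1, n2 = np.cross(b1, b2), np.cross(b2, b3)
    x = np.sum(n1 * n2, axis=1)
    y = np.linalg.norm(b2, axis=1) * np.sum(b1 * n2, axis=1)
    alpha = np.degrees(np.arctan2(y, x))
    return theta, alpha
```

For an idealized right-handed helix with a 3.8 Å pseudo-bond, this returns θ close to 90° and α close to +50°, in line with the helical values listed in Table 3; mirroring the coordinates flips the sign of α only.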
Adding one or more beads located on the side chains (classes 3–6) allows us to describe
the non-bonded interactions more easily, separating the side chain effects and simplifying the functional
form of the corresponding FF terms, although of course their number increases, also including
new conformational terms (i.e. those depending on the side chain bond angles θ^s)

$$U = U_{\mathrm{bond}}(r_{i,i+1}) + U_{\mathrm{back}}(\theta_i, \alpha_i) + U_{\mathrm{hb}}(r_{ij}) + U_{\mathrm{sc}}(\theta^{s}_{i}) + U_{\mathrm{nb}}(r^{s}_{ij}). \tag{2.2}$$
The hydrogen bond interactions of the backbone, U_hb(r_ij), are usually separated from the other
non-bonded interactions, which are associated with the side chain beads and included in the term
U_nb(r^s_ij); the latter is possibly decomposed into its excluded volume, hydrophobicity and electrostatic
components (see also Table 2). In these models, the description of the non-bonded interactions is
simpler (more isotropic, simpler functional forms), at the expense of a significant increase in the
number of parameters.
Additional auxiliary beads, whose position is constrained to that of the Cα and that do not
increase the number of degrees of freedom, are used in models of class 7 to simplify the
description of hydrogen bonding and other non-bonded interactions. This class includes both
one of the most sophisticated (and complex) CG models currently available (UNRES; Liwo et al.
1997a, b) and one of the first CG models ever reported (Levitt, 1976), by which most of the
current ones are inspired.
Of course, a large variety of other models have been considered, using up to five beads for the
backbone and multiple-bead representations for the side chain. This paper focuses specifically
on the OB Cα-based models (class 1) and those sharing with them a similar description of the
backbone conformation (classes 3, 5 and 7). A scheme of the FF terms commonly present in the
different classes of models is reported in Table 2.

3. Backbone conformation description and secondary structures representability
For the AA representations, a very powerful tool exists to validate the backbone conformation of
a protein model: the Ramachandran plot (RP), i.e. a two-dimensional (2D) density distribution as
a function of the two internal conformational variables, the pair of dihedrals (φ,ψ) around the
two rotatable bonds Cα–N and Cα–C (see Table 1, first row, third column for the definitions
of dihedrals). The RP displays densely populated areas corresponding to the main secondary
structures (see Fig. 1a, colored contours) in which the (φ,ψ) pairs assume the typical values reported
in Table 3, while areas outside the contours are sterically forbidden. For a protein model to be valid,
its RP must not display anomalies, such as points in the forbidden areas.

Table 2. FF terms present in the different classes of minimalist models. In addition to the FF terms defined
in the main text, here the non-bonded interaction for the side chain is split into its components
U_nb(r^s_ij) = U_sh + U_hyd + U_el, where U_sh describes the hydrogen bonds of the side chains, U_el the
electrostatic interactions and U_hyd the hydrophobicity and excluded volume interactions, usually not separable.
Optional terms are enclosed in parentheses. In classes where certain FF terms are usually treated together, the
corresponding table cells are merged

Class U_b U_θ,α U_hb U_sc U_sh U_hyd U_el

1. (”) ” (”) ”
2. ” ” (”) ”
3. ” (”) ” (”) ”
4. ” ” (”) ” (”) ”
5. ” (”) ” ” ”
6. ” ” (”) ” ” ”
7. ” ” ” ” ” ” ”
Obviously, the RP cannot be used in the class of models considered in this paper, whose
internal conformational variables are (α,θ). However, the chemical constraints introduced by the
peptide bond allow an analytical form to be derived for the (φ,ψ)→(α,θ) mapping (Tozzini et al.
2006). Under some simplifying conditions, this is described by the following equations:
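$$
\begin{cases}
\alpha = \phi + \psi + \pi + \gamma(\sin\phi + \sin\psi) - \gamma\left(\tau - \dfrac{\pi}{2}\right)(\sin\phi + \sin\psi) \\[4pt]
\qquad + \dfrac{1}{4}\gamma^{2}\left(\sin 2\phi + \sin 2\psi + 4\sin(\phi + \psi)\right), \\[6pt]
\cos\theta = \cos\tau\left[\cos^{2}\gamma - \sin^{2}\gamma\,\cos\phi\cos\psi\right] \\[4pt]
\qquad + \sin\tau\left[\cos\gamma\sin\gamma\,(\cos\phi + \cos\psi)\right] - \left[\sin^{2}\gamma\,\sin\phi\sin\psi\right],
\end{cases}
\tag{3.1}
$$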
where τ = 111° is the NH–Cα–CO angle and γ ≈ 16° is the angle formed by the Cα–Cα
pseudo-bond with the NH–Cα and Cα–CO bonds (Tozzini et al. 2006). The (φ,ψ)→(α,θ)
mapping is graphically represented in Fig. 1a, b. A uniform density in the (φ,ψ) plane is mapped
onto a non-uniform butterfly-shaped image in the (α,θ) plane. As an effect of the mapping, the
allowed areas corresponding to secondary structures are re-shaped and re-sized, as shown in
Fig. 1b. The (φ,ψ)→(α,θ) mapping is not one-to-one: pairs of points symmetric with respect
to the main diagonal are mapped onto the same (α,θ) point. However, due to the specific relative
location of the forbidden and allowed secondary structure areas in the (φ,ψ) plane, these remain
separated even in the (α,θ) plane. This point is very important, because it is precisely what makes
the OB–CG representation meaningful and useful to describe the secondary structures: although
an (α,θ) pair corresponds to two (φ,ψ) pairs, only one of them falls outside the forbidden areas.
Consequently, the backbone conformation is uniquely determined for each (α,θ) pair and the
AA backbone conformation can be uniquely reconstructed from the CG one.
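The simplified mapping of Eq. (3.1) can be evaluated numerically with a few lines of code. The sketch below is an illustration only: the function name `aa_to_cg` is hypothetical, the values τ = 111° and γ = 16° are those quoted in the text, and the second bracket of the cos θ expression is read as (cos φ + cos ψ), an assumption that reproduces θ = τ + 2γ for the fully extended chain.

```python
import math

TAU = math.radians(111.0)    # NH-Calpha-CO angle quoted in the text
GAMMA = math.radians(16.0)   # approximate pseudo-bond/backbone-bond angle

def aa_to_cg(phi_deg, psi_deg, tau=TAU, gamma=GAMMA):
    """Map (phi, psi) in degrees onto (alpha, theta) in degrees, following Eq. (3.1)."""
    phi, psi = math.radians(phi_deg), math.radians(psi_deg)
    s = math.sin(phi) + math.sin(psi)
    alpha = (phi + psi + math.pi
             + gamma * s
             - gamma * (tau - math.pi / 2.0) * s
             + 0.25 * gamma ** 2
             * (math.sin(2 * phi) + math.sin(2 * psi) + 4 * math.sin(phi + psi)))
    cos_theta = (math.cos(tau) * (math.cos(gamma) ** 2
                                  - math.sin(gamma) ** 2 * math.cos(phi) * math.cos(psi))
                 + math.sin(tau) * math.cos(gamma) * math.sin(gamma)
                 * (math.cos(phi) + math.cos(psi))
                 - math.sin(gamma) ** 2 * math.sin(phi) * math.sin(psi))
    # Wrap alpha into [-180, 180) degrees and guard acos against round-off
    alpha = (alpha + math.pi) % (2.0 * math.pi) - math.pi
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))
    return math.degrees(alpha), math.degrees(theta)
```

With these parameters, `aa_to_cg(-57, -47)` gives roughly α ≈ 54° and θ ≈ 94°, close to the α-helix entries of Table 3 (52° and 92°), and swapping φ and ψ returns the same point, as expected from the symmetry of the mapping about the main diagonal.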
Fig. 1. Illustration of the (φ,ψ)→(θ,α) mapping. (a) The RP for the generic amino acid: the colored lines
enclose areas where (φ,ψ) pairs belonging to defined secondary structures accumulate: blue=extended,
green=right-handed helices, red=left-handed helices. Cyan and yellow lines enclose the weakly allowed
regions; other areas are sterically forbidden. (b) The (φ,ψ) plane is mapped onto the butterfly-shaped region of
the (θ,α) plot. Shades of grey show the inhomogeneity introduced by the mapping. The colored lines are
the images of the profiles in (a) mapped onto the (α,θ) plane, and enclose the areas corresponding to specific
secondary structures, as in (a). The black lines have the same meaning, but evaluated for glycine instead of
for the generic amino acid. (c) and (d): The same as (a) and (b), with the following variants: the (φ,ψ) plane is
colored in strips at constant (φ+ψ) value and the color is conserved upon mapping onto the (α,θ) plane.
The dots represent specific kinds of secondary structures (those reported in Table 3) with the following
color code: blue=extended; cyan=flat ribbon; green=right-handed helices; red=left-handed helices;
magenta=proline helices; yellow=rings. In (d), the open dots correspond to directly measured data
(reported in Table 3), the filled dots are obtained from the analytic mapping of the corresponding dots in
the (φ,ψ) plane. The discrepancy is due to the use of the simplified formula for the analytic mapping
(see Tozzini et al. 2006).

Figure 1c, d report the same information, but described in a different way. The color is
conserved in the (φ,ψ)→(α,θ) mapping; thus the comparison of (c) to (d) shows that lines parallel to
the secondary diagonal in (φ,ψ) are mapped onto almost vertical lines in (α,θ). In addition, the
color is assigned according to the ‘helicity’ of the secondary structure, changing from almost flat
structures (blue to cyan) to positive (green, right-handed helices), to zero (yellow, flat rings), to
negative (red, left-handed helices). Dots with the same color coding are placed in
correspondence to typical values of (φ,ψ) for the different secondary structures reported in Table 3. In the
(α,θ) plane, the colored strips become almost vertical (compare (c) to (d)), indicating that the
helicity depends only on α, being ~180° for the flat structures and decreasing for helices toward
the rings (α=0). Right- and left-handed helices differ by the sign of α. The (α,θ) representation
of the helicity of the secondary structures is rather intuitive. This is evident in the (α,θ) plot for glycine,
the achiral amino acid (black contour in Fig. 1b), for which the complete symmetry with
respect to α is recovered.
In the next sections, the role of the (α,θ) plot in validating the CG models and in helping their
parameterization will become clear. It should be remarked that building a unique AA→CG mapping was
possible due to the choice of the Cα as the interacting site: other choices produce more complex
and secondary-structure-dependent mappings.

Table 3. Conformational backbone variables in the most common secondary structures. The θ and α values
were taken from the structures built with InsightII. The same software was used to build the turns using the data for
(φ,ψ). In this case the extremal residues were added in extended conformation. For the conformations whose
secondary structure is not uniform (turns) or contains peptide bonds in cis conformation the (φ,ψ)→(θ,α)
mapping represented in Fig. 1 does not apply. (pro3) in turns VI means that the third residue is proline

Structure | ω (deg) | φ (deg) | ψ (deg) | Cα–Cα (Å) | θ (deg) | α (deg)
Extended* | 180 | 180 | 180 | 3.8 | 146 | 180
Anti-parallel sheet (Mathews et al. 2000) | 180 | −139 | 135 | 3.8 | 131 | 179
β-strand* | 180 | −120 | 120 | 3.8 | 121 | 178
Parallel sheet (Mathews et al. 2000) | 180 | −120 | 113 | 3.8 | 119 | 177
Flat ribbon (Mathews et al. 2000) | 180 | −78 | 59 | 3.8 | 92 | 163
3–10 helix# | 180 | −49 | −26 | 3.8 | 84 | 85
3–10 helix (Mathews et al. 2000) | 180 | −49 | −29 | 3.8 | 85 | 81
3–10 helix* | 180 | −60 | −30 | 3.8 | 88 | 68
α-helix (Mathews et al. 2000) | 180 | −57 | −47 | 3.8 | 92 | 52
α-helix* | 180 | −65 | −40 | 3.8 | 92 | 51
π-helix* | 180 | −30 | −90 | 3.8 | 100 | 34
π-helix (Mathews et al. 2000) | 180 | −57 | −70 | 3.8 | 99 | 27
π-helix# | 180 | −57 | −80 | 3.8 | 102 | 17
6-membered ring* | 180 | 180 | 0 | 3.8 | 115 | 0
5-membered ring* | 180 | −75 | −75 | 3.8 | 105 | 0
5-membered ring (Mathews et al. 2000) | 180 | −60 | −105 | 3.8 | 108 | 0
Left-handed α-helix (Mathews et al. 2000) | 180 | 57 | 47 | 3.8 | 92 | −52
Collagen triple helix (Mathews et al. 2000) | 180 | −51 | 153 | 3.8 | 117 | −77
Polyproline II (Voet & Voet, 2005) | 180 | −71 | 150 | 3.8 | 117 | −106
Polyproline II* | 180 | −71 | 145 | 3.8 | 117 | −107
Polyproline II* | 180 | −79 | 150 | 3.8 | 121 | −109
Polyproline II (Mathews et al. 2000) | 180 | −75 | 145 | 3.8 | 119 | −109
Turn-I# | 180 | −60, −90 | −30, 0 | 3.8 | 90, 88 | 48
Turn-II# | 180 | −60, 80 | 120, 0 | 3.8 | 88, 108 | 1
Turn-III# | 180 | −60 | −30 | 3.8 | 88 | 68
Turn-V# | 180 | −80, 80 | 80, −80 | 3.8 | 98 | −63
Turn-VIa# | 180, 0 | −60, −90 | 120, 0 | 3.8, 2.4 (pro3) | 123, 81 | −50
Turn-VIb# | 180, 0 | −120, −60 | 120, 0 | 3.8, 2.4 (pro3) | 81, 89 | −25
Turn-VIII# | 180 | −60, −120 | −30, 120 | 3.8 | 121, 88 | 48
Polyproline I (Mathews et al. 2000) | 0 | −75 | 160 | 2.9 | 100 | 94
Polyproline I* | 0 | −71 | 160 | 2.9 | 100 | 96

* Ideal conformation as automatically built by InsightII software.
# http://www.bmb.uga.edu/wampler/tutorial/prot2.html#alpha.

4. Parameterization philosophies
The minimalist CG models may have ~10–100 parameters, including both the ‘structural
parameters’ (equilibrium values of the coordinates) and the ‘energetic parameters’ (elastic
constants, bonding energies, well depths, barriers, etc.). There are several possible strategies to fix
their values, which are described in this section. To some extent, the parameterization strategy
is related to the type of model, defined by the kind and number of bonded terms in the FF
(the ‘topology’), their functional form and the number and functional form of the non-bonded
terms. In the following, the CG models are classified and described, restricting to the OB
Cα-based ones, unless otherwise stated.

4.1 Bias and fixed topology: networks and Go-models


The simplest idea to fix the structural parameters is to completely bias them toward a reference
structure, usually experimental. The general form of the FF for this kind of models is

$$U = U_{\mathrm{bond}}(r_{i,i+1}) + U_{\mathrm{back}}(\theta_i, \alpha_i) + U_{\mathrm{nb,loc}}(r_{ij}) + U_{\mathrm{nb,non\text{-}loc}}(r_{ij}). \tag{4.1}$$

The presence and form of the terms depend on the specific model. The separation of U_nb(r_ij) into local and
non-local parts is generally based on a cutoff radius r_cut: all the distances r_ij less than
r_cut in the reference structure are treated as local, the others as non-local, and the corresponding
FF terms are treated in different ways. In the simplest possible biased model, namely the elastic
network (EN), U_nb,non-loc is absent and U_nb,loc is treated with a harmonic
distance-dependent potential. U_back is also absent, and the correct backbone conformation is maintained by
U_nb,loc, which also includes the interaction between second and third neighbors along the
polypeptide chain (1–3 and 1–4 interactions, the equivalent of the pseudo-bond-angle and dihedral
interactions, respectively). In the original formulation (Tirion, 1996), all the elastic constants are
set to the same value k, optimized by fitting the calculated root mean-squared fluctuations
(RMSF) to the experimental temperature (B) factors. This fit creates an inter-dependence
between k and r_cut: as r_cut increases, k must be softened according to the rule k r_cut² ~ const. In subsequent
works, the rule ‘the larger the cutoff, the softer the interaction’ was confirmed (Atilgan et al.
2001; Soheilifard et al. 2008), although the quantitative relationship does not seem to be so
simple. Average values of k and r_cut are given in Table 4.
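The elastic network construction described above can be sketched in a few lines. This is a minimal illustration, assuming a single elastic constant and the anisotropic (ANM-style) form of the Hessian. The function names `en_energy`, `anm_hessian` and `normal_modes` and the default values of `r_cut` and `k_spring` are hypothetical choices for the example, not parameters of any published model.

```python
import numpy as np

def en_energy(coords, ref, r_cut=10.0, k_spring=1.0):
    """Harmonic EN energy: 1/2 k (r_ij - r0_ij)^2 over pairs closer than r_cut in ref."""
    coords, ref = np.asarray(coords, float), np.asarray(ref, float)
    i, j = np.triu_indices(len(ref), k=1)
    r0 = np.linalg.norm(ref[i] - ref[j], axis=1)
    r = np.linalg.norm(coords[i] - coords[j], axis=1)
    local = r0 < r_cut                       # contacts defined on the reference structure
    return 0.5 * k_spring * np.sum((r[local] - r0[local]) ** 2)

def anm_hessian(ref, r_cut=10.0, k_spring=1.0):
    """3N x 3N Hessian of the EN energy evaluated at the reference structure."""
    ref = np.asarray(ref, float)
    n = len(ref)
    hess = np.zeros((3 * n, 3 * n))
    for i in range(n):
        for j in range(i + 1, n):
            d = ref[j] - ref[i]
            r2 = d @ d
            if r2 >= r_cut ** 2:
                continue
            block = -k_spring * np.outer(d, d) / r2   # off-diagonal 3x3 block
            hess[3*i:3*i+3, 3*j:3*j+3] = block
            hess[3*j:3*j+3, 3*i:3*i+3] = block
            hess[3*i:3*i+3, 3*i:3*i+3] -= block       # diagonal blocks balance the rows
            hess[3*j:3*j+3, 3*j:3*j+3] -= block
    return hess

def normal_modes(ref, r_cut=10.0, k_spring=1.0):
    """Eigenvalues (ascending) and eigenvectors (columns) of the EN Hessian."""
    return np.linalg.eigh(anm_hessian(ref, r_cut, k_spring))
```

For a sufficiently connected network, the first six eigenvalues vanish (rigid translations and rotations) and the next ones correspond to the slow collective modes discussed in the text. Since the fluctuations scale as 1/k in a harmonic model, a single global k can be fixed afterwards by rescaling the computed RMSF onto the experimental B factors.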
Table 4. Summary of the features of the minimalist models for proteins

Elastic network
  Pair interactions: harmonic, ½ k (r_ij − r⁰_ij)²
  Remarks: GNM: r_cut = 6–10, k ~ 1–0.2 kcal/mol Å²; ANM: r_cut = 8–15, k ~ 10–0.9 kcal/mol Å²
Plastic/bimodal networks
  Pair interactions: harmonic potential for the single wells; global or local valence-bond-like combination
  Remarks: GNM: r_cut = 8, k ~ 0.02 kcal/mol Å²; r_cut = 13, k = 1 kcal/mol Å²
Heterogeneous EN
  Pair interactions: harmonic, k different for each bond couple
  Remarks: r_cut in principle infinite, but ~15 for simplicity
Extended/anisotropic network
  U_bond: harmonic, ½ K (r_ij − r⁰_ij)²
  Other terms: anharmonic, ½ k_ij ((r_ij − r⁰_ij)² − a²) H((r_ij − r⁰_ij)² − a²)
  Remarks: r_cut = 13, K ~ 46 kcal/mol Å²; k_ij AA dependent, average ~2 kcal/mol Å²
Chemical EN
  U_bond: harmonic
  U_back: harmonic (1–3 and 1–4 distance-based terms), ½ k_2 (r_ij − r⁰_ij)²
  U_nb,loc: harmonic, ½ k_vdw (r_ij − r⁰_ij)²
  Remarks: r_cut = 8 for U_nb,loc; separate terms for H-bonds, disulfide bridges and salt bridges, with different elastic constants
Go models
  U_bond: harmonic or constraint
  U_back: harmonic angle, ½ k_θ (θ − θ_0)²; cosine sum, Σ_{n=1,3} K_n [1 − cos n(α − α_0)]
  U_nb,loc: LJ 12-6
  U_nb,non-loc: repulsive only, ε (C/r_ij)¹²
  Remarks: r_cut = 8 for Cα, r_cut = 4 for side chains; ε = energy unit; k_b = 100ε, k_θ = 20ε, K_n ~ ε
Partially biased models
  U_bond: harmonic or constraint
  U_back: harmonic (1–3 and 1–4 distance-based terms) OR unbiased: U_ang = ½ k_θ (θ − θ_0)² or a polynomial in (θ − θ_0); U_dih = Σ_{n=1,3} K_n [1 − cos n(α − α_0)]
  U_nb,loc: biased Morse, u(r_ij) = ε[(e^{−k(r_ij − r⁰_ij)} − 1)² − 1]
  U_nb,non-loc: unbiased Morse, u(r_ij) = ε[(e^{−k(r − r_0)} − 1)² − 1]
  Remarks: r_cut ~ 8; k_b = 50–100 kcal/mol Å²; k_θ ~ 20–50 kcal/mol; k_α ~ 3 kcal/mol; ε = ε(r_0) decreasing from ~5 to ~0.1 kcal/mol; parameterization half structure based, half BI based
Unbiased models
  U_bond: harmonic or constraint
  U_back: U_ang harmonic or double well; U_dih cosine sum; explicit correlations between U_ang and U_dih sometimes implicitly included
  U_nb,loc: explicit U_hb, anisotropic LJ-like, OR dipole-dependent term
  U_nb,non-loc: LJ-like, single or multiple wells, sometimes anisotropic
  Remarks: parameterization based on a mix of BI, FM and physical–chemical considerations

Thanks to their simplicity and robustness, EN models have had great success. Under certain
assumptions on the distribution of the fluctuations (the Gaussian network model, GNM, and its
anisotropic version, ANM; Atilgan et al. 2001; Soheilifard et al. 2008), they can be analytically
solved, and normal mode analysis (NMA) easily performed. The low-frequency normal modes
obtained from EN are seen to capture the fundamental motions of the system related to its
biological function, in spite of the extreme simplification of the representation. This indicates
that the connectivity and shape of a protein, namely the input of EN models, and not the
structural details, generally determine its biological function. Similar information can also be
obtained from the principal mode analysis (PMA) (Van Aalten et al. 1997) of an MD trajectory,
whose output is the set of modes ordered by amplitude. Within the harmonic approximation, the first
modes (i.e. the largest in amplitude) coincide with the slowest if the trajectory is equilibrated. Thus
NMA and PMA give similar results once the correspondence between modes is established, but PMA
uses a trajectory as input and does not need an analytical description of the model, and thus has
a more general applicability. These analyses can be used for several purposes. The deformations
associated with the slowest modes were used to flexibly fit high-resolution structural data into
low-resolution electron density maps (Florence Tama & Brooks, 2005) or to decompose the system
into domains (Kundu et al. 2004). More generally, the equilibrium dynamics of huge systems
such as entire viruses was analyzed with EN models (Chennubhotla et al. 2005; Demirel &
Keskin, 2005).
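The principal mode analysis mentioned above amounts to diagonalizing the covariance matrix of the bead displacements along a trajectory. The following is a minimal sketch under two stated assumptions: the frames have already been superposed on a common reference (so that rigid-body motion is removed), and the function name `principal_modes` is a hypothetical helper, not part of any MD package.

```python
import numpy as np

def principal_modes(traj):
    """Principal modes of a trajectory of shape (n_frames, n_beads, 3).

    Returns the variances in descending order and the corresponding modes
    as columns of a (3*n_beads, 3*n_beads) matrix.
    """
    x = np.asarray(traj, dtype=float).reshape(len(traj), -1)
    dx = x - x.mean(axis=0)                  # displacements from the average structure
    cov = dx.T @ dx / len(x)                 # covariance matrix of the coordinates
    evals, evecs = np.linalg.eigh(cov)       # eigh returns eigenvalues in ascending order
    order = np.argsort(evals)[::-1]          # largest-amplitude modes first
    return evals[order], evecs[:, order]
```

The first columns of the returned matrix are the largest-amplitude collective motions; within the harmonic approximation, and for an equilibrated trajectory, they can be compared with the lowest-frequency normal modes of an EN model of the same structure.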
The limitation of these models stems from their simplicity: the bonding connectivity fixed to
that of the reference structure and the use of a single-well harmonic potential constrain the
system to move in the attraction basin of the reference structure, which is the only possible
equilibrium configuration. In addition, the use of a single elastic constant for all the
bonded and non-bonded interactions of the system is clearly an unphysical oversimplification.
Thus improved network models were proposed that release one or more of these restraints.
Plastic networks (Maragakis & Karplus, 2005) and multiple-well networks (Chu & Voth, 2007)
allow the study of systems with two or more equilibrium conformations. Each conformation is
represented with an EN model, and the models are subsequently coupled with a valence-bond-like
approach. For instance, in the case of two states A and B
$$U = \frac{1}{2}\left(U_A + U_B\right) - \frac{1}{2}\sqrt{\left(U_A - U_B\right)^{2} + \varepsilon^{2}}, \tag{4.2}$$
where U_A and U_B refer to the single states and the combination can be done either at the global
level (i.e. U_A and U_B are the total potential energies; Maragakis & Karplus, 2005) or at the local level
(i.e. U_A,B = u_A,B(r_ij) are the single pair potentials; Chu & Voth, 2007). In both cases, the standard
EN parameters are used for the single well, but additional parameters are needed: the coupling
parameter ε and the relative free energy of the two states. These models are able to describe the
transition between two known structural states, to find the minimum free energy paths and
to analyze the properties of the system along them.
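The two-state combination of Eq. (4.2) is simple enough to write down directly. The sketch below is only an illustration: the function name `mixed_energy` is hypothetical, and the optional `offset_b` argument is an assumed way of including the relative free energy of the two states as a constant shift of state B.

```python
import math

def mixed_energy(u_a, u_b, eps, offset_b=0.0):
    """Valence-bond-like mixing of two single-well energies, following Eq. (4.2).

    u_a, u_b : energies of the configuration in the EN models of states A and B
    eps      : coupling parameter
    offset_b : assumed constant shift of state B (relative free energy)
    """
    u_b = u_b + offset_b
    return 0.5 * (u_a + u_b) - 0.5 * math.sqrt((u_a - u_b) ** 2 + eps ** 2)
```

With eps = 0 the function returns the lower of the two single-well energies, so the surface has a cusp where the wells cross. A finite eps smooths the crossing region and lowers the barrier, while far from the crossing the mixed energy approaches that of the nearer well.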
Heterogeneous EN models (Lyman et al. 2008) were proposed to improve the quality of the
RMSF. In these models, k is not uniform: instead, k_ij for each single interacting pair is
considered a fitting parameter. Clearly, such a large number of fitting parameters requires a large
amount of input data; that is why these models usually take as input the RMSF
evaluated over different normal modes of the system from AA simulations. As a result, normal
mode displacements and RMSF of the CG model are in good agreement with those of the AA
model. The elastic constant strength as a function of the equilibrium distance is obtained as a
by-product, and it is decreasing (Fig. 5), in agreement with the intuitive consideration that
shorter-range interactions are stronger. Elastic constants dependent on the amino acid type were introduced
in the ‘extended’ ANM, where an anharmonic potential for the non-subsequent Cα pairs
was also used (Hamacher & McCammon, 2006). The ks were obtained from available statistical
contact potential matrices (Keskin et al. 1998; Miyazawa & Jernigan, 1996) readjusted on the
experimental RMSF. This model gives better-quality RMSFs and accounts for the effect of
mutations.
The simplicity of EN is obtained at the cost of requiring a large number of contacts for
each bead for the system to be stable; this number scales approximately with $r_{cut}^3$. In order to reduce the
computational cost and with the aim of having more physical FF terms, ENs with physics/chemistry-based topology were proposed. In the 'chemical network' (Jeong et al. 2005), the FF
terms are separated according to amino acid type, with different k values calibrated by fitting the overlap
of displacements from the NMA. As a consequence, $r_{cut}$ can be greatly reduced without losing
stability and accuracy.

The Go models belong to the class of biased models, although their purpose, FF terms
and parameterization philosophy differ from those of EN. Originally proposed as a simplified
statistical model for folding (Go & Scheraga, 1976), more recent versions (Clementi et al. 2000 ;
Koga & Takada, 2001) include terms with a more physical form than EN. U bond is the usual
harmonic term, while U back depends on the backbone conformational variables θ and α, and is
separated into the corresponding two terms (usually harmonic and cosine series, respectively).
The rcut in this case has the meaning of separating the couples of amino acids that are in contact
in the folded structure (called the native contacts) from the others. The U nb,loc acting between
pairs in native contact is represented as an attractive Lennard–Jones (LJ)-like potential, while the
U nb,non-loc is repulsive, so as to push the system toward the native conformation. This form of
Go model is also called the minimally frustrated model for folding, because the local minima
other than the native structure are almost absent. The elastic constants and other energetic
parameters are fixed to be proportional to ε, the LJ well depth (see Table 4). Thus, the energetic
properties of the system are entirely determined by a single parameter, ε, which is adjusted to fit
the experimental melting temperature. The philosophy underlying these models relies on the
assumption that the native structure (the main input of the model) determines the folding
pathway and other global properties of the folding process. However, this is true only to a first
approximation: by construction, and due to the absence of frustration, the Go model fails to
capture the nature of the intermediate states that often occur during folding. For these reasons,
the evolutions of the Go model have followed the direction of including frustration, either in
U back (Nakagawa & Peyrard, 2006) or in U nb (Kaya & Chan, 2003).
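The native-contact construction described above can be sketched as follows. This is only a schematic example: the 12–6 form with its minimum at the native distance, the purely repulsive $r^{-12}$ term, the value of `sigma_rep` and the minimum sequence separation are assumptions chosen for illustration, not the parameterization of any of the cited Go models.

```python
import numpy as np

def native_contacts(coords, r_cut, min_seq_sep=3):
    """Return {(i, j): r0} for bead pairs closer than r_cut in the native structure."""
    coords = np.asarray(coords, dtype=float)
    contacts = {}
    for i in range(len(coords)):
        for j in range(i + min_seq_sep, len(coords)):
            r0 = np.linalg.norm(coords[i] - coords[j])
            if r0 < r_cut:
                contacts[(i, j)] = r0
    return contacts

def go_nonbonded_energy(coords, contacts, eps, sigma_rep=4.0, min_seq_sep=3):
    """Attractive LJ-like wells for native pairs, pure repulsion for all others."""
    coords = np.asarray(coords, dtype=float)
    energy = 0.0
    for i in range(len(coords)):
        for j in range(i + min_seq_sep, len(coords)):
            r = np.linalg.norm(coords[i] - coords[j])
            if (i, j) in contacts:
                x = contacts[(i, j)] / r          # minimum of depth -eps at r = r0
                energy += eps * (x ** 12 - 2.0 * x ** 6)
            else:
                energy += eps * (sigma_rep / r) ** 12
    return energy

# Hypothetical four-bead "native" structure: only the (0, 3) pair is a contact.
native = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [3.8, 3.8, 0.0], [0.0, 3.8, 0.0]])
contacts = native_contacts(native, r_cut=8.0)
e_native = go_nonbonded_energy(native, contacts, eps=1.0)
```

Because every energetic term scales with the single parameter `eps`, changing it rescales the whole energy landscape, which is why it can be tuned against the melting temperature.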

4.2 The Boltzmann inversion


The bias toward a single structure is the main limit of the network and Go models, which,
although very useful in describing the near equilibrium dynamics or the global properties of
folding, are inadequate to describe more general dynamical properties. A possible way to overcome this limit is to extract the parameters from a statistical set of structures. As will become clear
in the following, the origin of the structures (e.g. experimental or from AA simulations) may
influence the features of the FF.
A possibility to extract parameters from a statistical set of data is the Boltzmann inversion
(BI). Suppose that the total internal energy of the system can be exactly decomposed as the sum of
uncorrelated terms, each depending on a single internal CG variable, i.e.

$$U(\{Q\}) = \sum_I U_I(Q_I), \qquad (4.3)$$

where $U_I = U^{bond}, U^{angle}$, etc. and $Q_I = r_{i,i+1}, \alpha_i$, etc. ({Q} indicates the whole set of coordinates). The probability distribution of a single internal variable is

$$P(Q_I) \propto \int dQ_1 \ldots dQ_{I-1}\, dQ_{I+1} \ldots dQ_N \exp\left(-\frac{U(\{Q\})}{kT}\right) = \exp\left(-\frac{U(Q_I)}{kT}\right), \qquad (4.4)$$

where the second equality holds if the FF terms are completely uncorrelated.
This condition is never exactly satisfied, especially in the case of the OB FFs. The
consequences of this approximation will be illustrated later on. Equation (4.4) is equivalent to

$$U(Q_I) = -kT \ln\left(P(Q_I)\right) + \mathrm{const}, \qquad (4.5)$$

Fig. 2. Probability distribution of the internal variables evaluated using different statistical sets: 102 large
proteins (size ~50 Å, solid line); 312 proteins with prevalence of β-strands (long dashed line); 347 proteins
with prevalence of α-helices (short dashed line); 450 proteins prevalently unstructured (dotted line). The
sets were prepared with the selection tools of the RCSB databank website (RCSB protein databank, http://www.pdb.org/pdb/home/home.do). Blue lines: arbitrarily normalized variable distributions; red lines:
BIn, units on the right-hand vertical axis. (a) First neighbor distance distributions; (b) pseudo-bond-angle
distributions; (c) pseudo-dihedral distributions; (d) the (α,θ) plot for the set of generic proteins (in orange),
superimposed on the ideal map (see also Fig. 1).

which defines the core of the BI and gives an operational way to derive the CG FF terms from
the probability distributions of the internal variables, evaluated on a statistical set of
structures.
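A minimal sketch of the direct BI of eqn (4.5) is given below: a histogram of an internal variable is converted into a numerical potential, which is then fitted with a harmonic form. The samples are synthetic Gaussian numbers standing in for a set of pseudo-bond lengths, and the function names, bin number, temperature and fitting window are assumptions for this example.

```python
import numpy as np

KT = 0.596  # kcal/mol, roughly kT at 300 K (assumed temperature)

def boltzmann_inverse(samples, bins=50, kT=KT):
    """Return bin centers and U = -kT ln P, shifted so that min(U) = 0."""
    hist, edges = np.histogram(samples, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    populated = hist > 0                     # empty bins would give log(0)
    u = -kT * np.log(hist[populated])
    return centers[populated], u - u.min()

def fit_harmonic(q, u, kT=KT, window=2.0):
    """Fit U ~ 0.5 k (q - q0)^2 + c to the points within `window` kT of the minimum."""
    near = u < window * kT
    a, b, _ = np.polyfit(q[near], u[near], 2)
    return 2.0 * a, -b / (2.0 * a)           # (k, q0)

# Synthetic stand-in for a set of pseudo-bond lengths (not real PDB data).
rng = np.random.default_rng(0)
samples = rng.normal(loc=3.8, scale=0.05, size=100_000)
q, u = boltzmann_inverse(samples)
k_fit, q0_fit = fit_harmonic(q, u)
```

For a Gaussian distribution of width σ the inverted potential is harmonic with k ≈ kT/σ², which shows directly why sharper distributions lead to larger energetic parameters.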

4.2.1 The Boltzmann inverses (BIns) of the conformational internal variables


In order to give an idea of what the BIns look like, those for the three conformational internal
variables (Cα–Cα distance, pseudo-bond-angle θ and pseudo-dihedral α) are reported in Fig. 2.
The probability distributions for three different sets of proteins and the corresponding BIn are
shown. The pseudo-bond distribution (BIn) is bimodal, but the position of the peaks (minima) is
independent of the secondary structure: rather, they correspond to the trans and (rare) cis conformations of the peptide bond. The independence of the secondary structure is related to the fact
that the Cα atoms were chosen as the location for the interacting centers. Conversely, the multiple peaks
in the pseudo-bond distributions for models with beads placed on Cβ or centroids (e.g. classes 2
and 4) are signatures of the secondary structure (Monticelli et al. 2008).

The probability distributions of θ and α (and BIn) also display multiple peaks (minima), and
these correspond to different secondary structures in both Cα-based and non-Cα-based models.
However, in Cα-based models the relation between peaks and secondary structures is more
direct, as explained in section 3. More specifically, P(θ) displays a peak at ~90° typical of the
α-helices, while the β-like structures have θ = 115–145° depending on the kind of amino acid.
Figure 2b clearly shows how the relative population of the two peaks varies passing from an
α-prevalent to a β-prevalent set of proteins, while in the mainly unstructured proteins the two
peaks are broader. Similar considerations apply to the α pseudo-dihedral distributions,
where the peak typical of helices is located at ~60°, while the extended structures tend to adopt
a value of ~178°. In Fig. 2d, the correlation (α,θ) plot is also reported for the set of generic
proteins, which shows a good superposition with the ideal (α,θ) plot (see section 3).
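The internal variables discussed above can be computed from the Cα coordinates as in the following sketch. The helix used as input is an idealized geometry (radius 2.3 Å, rise 1.5 Å and 100° rotation per residue, all assumed values), so the resulting angles only approximately match the peak positions extracted from real structures.

```python
import numpy as np

def pseudo_bond_angle(p0, p1, p2):
    """Angle (degrees) at p1 formed by three consecutive beads."""
    v1, v2 = p0 - p1, p2 - p1
    cos_t = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0)))

def pseudo_dihedral(p0, p1, p2, p3):
    """Signed dihedral (degrees) defined by four consecutive beads."""
    b1, b2, b3 = p1 - p0, p2 - p1, p3 - p2
    n1, n2 = np.cross(b1, b2), np.cross(b2, b3)
    y = np.linalg.norm(b2) * np.dot(b1, n2)
    x = np.dot(n1, n2)
    return np.degrees(np.arctan2(y, x))

# Idealized right-handed helix of four beads (assumed geometry, not a PDB structure).
n = np.arange(4)
phi = np.radians(100.0)
helix = np.column_stack([2.3 * np.cos(n * phi), 2.3 * np.sin(n * phi), 1.5 * n])

bond = np.linalg.norm(helix[1] - helix[0])
theta = pseudo_bond_angle(helix[0], helix[1], helix[2])
alpha = pseudo_dihedral(helix[0], helix[1], helix[2], helix[3])
```

Applying these two functions along the chains of a structural data set and histogramming the results gives distributions of the kind shown in Fig. 2b and c.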

4.2.2 BIns and potential of mean forces (PMFs)


The BIn from eqn (4.5) is often identified with the PMF. However, rigorously speaking, they are
not exactly the same, although their difference is seldom clearly stated (Májek & Elber, 2009).
The PMF is defined by

$$U(Q_I) = -kT \ln\left(\frac{P(Q_I)}{P_0(Q_I)}\right), \qquad (4.6)$$

where $P_0(Q_I)$ is the probability distribution of the internal variable in a reference state, usually the
system with non-interacting particles (but the choice of the reference state is a matter of debate;
Betancourt & Thirumalai, 1999). Non-interacting particles distribute randomly in the 3D space;
thus $P_0(Q_I)$ is usually constant and the PMF differs from the BIn by an irrelevant constant.
However, in certain specific cases, this is not true. For instance, consider $Q_I = \theta$: the $P_0$ distribution of uniformly distributed points on a sphere is not uniform, but rather $P_0(\theta) \propto \sin\theta$, due
to the fact that the lateral surface of a spherical horizontal section increases with sin θ. Thus, in
this case, the PMF is $U(\theta) = -kT \ln[P(\theta)/\sin\theta]$. This correction is usually neglected
because sin θ varies only ~20% from the α peak to the β peak.
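The size of the sin θ correction can be checked with a short calculation. In the sketch below, the function names are illustrative and the two angles (90° and 130°) are taken as representative positions of the helical and extended peaks; energies are in units of kT.

```python
import numpy as np

def angle_pmf(theta_deg, prob, kT=1.0):
    """PMF for the pseudo-bond angle, using the P0 ~ sin(theta) reference state."""
    theta = np.radians(np.asarray(theta_deg, dtype=float))
    return -kT * np.log(np.asarray(prob, dtype=float) / np.sin(theta))

def jacobian_shift(theta_a_deg, theta_b_deg, kT=1.0):
    """Change in the (PMF - BIn) correction when going from theta_a to theta_b."""
    sin_a = np.sin(np.radians(theta_a_deg))
    sin_b = np.sin(np.radians(theta_b_deg))
    return kT * np.log(sin_b / sin_a)

# Relative shift of the extended well with respect to the helical well, in units of kT.
shift = jacobian_shift(90.0, 130.0)
```

With these angles the relative shift between the two wells is about a quarter of kT, small compared with the typical well depths, which is consistent with the correction being commonly neglected.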
In other cases, however, the correction is relevant and the difference between BIn and PMF is
substantial. Consider, for instance, the probability distribution of distances between any
pair of Cα atoms, P(r), which is related, after some elaboration (see the following section), to the
non-bonded interaction potential. For an infinite system of non-interacting particles, one has
$P_0(r) = 4\pi r^2$, that is, the volume of the spherical shell of radius r, to which also P(r) should tend
for large r. Thus the PMF is $U(r) = -kT \ln(P(r)/4\pi r^2) = -kT \ln(g(r))$, where g(r) is the pair
correlation function. Figure 3 illustrates the differences between the mentioned quantities with
practical examples. In (a), the P(r) is shown (black lines). As can be seen, however, for r > 10 Å
its behavior strongly deviates from $4\pi r^2$ (red line). This is due to the fact that the proteins have
finite size; thus their P(r) vanishes for r larger than the average protein size. Thus, in this case,
$P_0(r)$ should be the distribution of non-interacting particles confined in a sphere of finite radius (R),
which can be analytically calculated by solving the integral from the definition, and is
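$$P_0(r) = \frac{1}{N}\int_{sphere}\int_{sphere} \rho^2\, d\mathbf{r}_1\, d\mathbf{r}_2\, \delta\left(r - |\mathbf{r}_1 - \mathbf{r}_2|\right) = \frac{1}{N}\,\rho^2\, 4\pi r^2\, \frac{4}{3}\pi R^3 \left[1 - \frac{3}{4}\frac{r}{R} + \frac{1}{16}\left(\frac{r}{R}\right)^3\right], \qquad (4.7)$$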

Fig. 3. (a) In black, probability distributions of the pair distances for a set of large proteins (3000–6000
residues, solid line) and for a set of small proteins (<1000 residues, dotted line). Superimposed are
the $P_0(r)$ evaluated for an infinite system (red) and for spherical systems (eqn (4.7)) of radius 45 Å (green)
and 23 Å (blue), which best fit the P(r) for large and small proteins, respectively. The fit is worse for the
small proteins set, probably due to the poor size homogeneity in this set. These $P_0(r)$ are also reported in
(b) together with that evaluated for a spherical shell of inner radius 15 Å and outer radius 20 Å (magenta)
and that for an infinite cylinder of radius 11 Å (cyan), evaluated numerically. The normalizations are such
that the quadratic parts for r < 5 Å superimpose. (c) The pair correlation function g(r) for large (solid) and
small (dotted) proteins normalized with the infinite radius $P_0(r)$ (red) and with the finite radius (R = 45 Å for
large proteins, green, and R = 23 Å for small proteins, blue). (d) Plots in the lower part of the graph, scale on
the left vertical axis: non-bonded parts of g(r), the same color and line code as in (b); plots in the upper part
of the graph, right vertical axis: the corresponding PMF, the same color and line codes.

where N is the total number of beads and ρ the average particle density. This expression tends to
$4\pi r^2$ as R tends to infinity. Empirical approximations of this formula were derived earlier (Zhou
& Zhou, 2002). Figure 3a shows that eqn (4.7) reproduces the correct behavior of
P(r) for large r.
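Equation (4.7) is straightforward to evaluate numerically. The sketch below assumes N = ρ(4/3)πR³, so that the prefactor reduces to ρ·4πr², and compares the finite-sphere reference with the infinite-system one; the function names and the grid are choices made for this example.

```python
import numpy as np

def p0_infinite(r, rho=1.0):
    """Reference distribution for an infinite system of non-interacting particles."""
    return rho * 4.0 * np.pi * np.asarray(r, dtype=float) ** 2

def p0_sphere(r, radius, rho=1.0):
    """Eqn (4.7) for particles confined in a sphere, assuming N = rho * (4/3) pi R^3."""
    r = np.asarray(r, dtype=float)
    x = r / radius
    shape = 1.0 - 0.75 * x + x ** 3 / 16.0      # finite-size factor, zero at r = 2R
    return np.where(x <= 2.0, rho * 4.0 * np.pi * r ** 2 * shape, 0.0)

# Example with the radius quoted for the large-protein set (45 Å).
R = 45.0
r = np.linspace(0.0, 2.0 * R, 90001)
dr = r[1] - r[0]
n_beads = 4.0 / 3.0 * np.pi * R ** 3            # with rho = 1
integral = np.sum(p0_sphere(r, R)) * dr         # simple Riemann sum
```

The finite-size factor equals 1 for r much smaller than R and vanishes at r = 2R, the largest distance available in the sphere, while the integral of $P_0(r)$ returns the number of beads; dividing a measured P(r) by this reference instead of by 4πr² is what restores the correct long-range limit of g(r).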
In Fig. 3b, the $P_0(r)$ for a spherical shell of inner radius 15 Å and outer radius 20 Å is shown in
magenta and that for an infinite cylinder of radius 11 Å is shown in cyan. These $P_0(r)$ are
relevant when dealing with a viral capsid and with DNA, respectively. The behavior is quadratic
within the thickness of the shell or of the cylinder and then becomes linear for the shell and tends
to a constant for the cylinder. In general, a $P_0(r)$ with the same global geometry as the system
being analyzed should be used if the correct long-range behavior of the pair g(r) is to be
evaluated. As can be seen in Fig. 3c, the g(r) tends to 1 only if the correct $P_0$ is used (green and
blue lines). This is reflected in the long-range behavior of the PMF: in Fig. 3d, the g(r) and the
corresponding PMF are shown. Different lines correspond to different $P_0$. Only by using the correct
$P_0(r)$ does one obtain the correct asymptotically vanishing behavior of the PMF.

4.2.3 From the PMF to the effective interactions


The concept of PMF gives an operational way to build the FF terms: under certain approximations, the PMFs could be used either directly (in numerical form) as FF terms or to fit the
parameters for corresponding analytical FF terms. For instance, from distributions and PMFs
such as those in Fig. 2a, the U bond term can be fitted with a simple harmonic form with
$r_0$ = 3.8 Å (trans) or 2.9 Å (cis) and k ~ 100–150 kcal/mol; the bond angle and dihedral terms need more
complicated analytical forms, but both the equilibrium and energetic parameters can in principle
be extracted from the PMFs.
Unfortunately, this view is far too optimistic. This procedure cannot be straightforwardly
applied for at least three reasons: (i) the probability distributions depend on the amino acid type;
thus in order to have an accurate FF, one should build ~$20^2$ different PMFs for the P(r), ~$20^3$
for P(θ) and $20^4$ for P(α); (ii) the probability distributions (and the PMFs) depend on the
statistical set chosen; (iii) the internal variables chosen are not uncorrelated; thus eqn (4.3) is not
exact and the PMFs do not coincide with the real effective interaction potentials (i.e. the FF
terms).
Issue (i) is not a problem in principle, but the large number of probability distributions to
build, possibly with similar occurrence statistics (which is difficult because different amino acids
have different occurrence frequencies), has limited the exclusive use of the method.
However, several CG FFs have some of the terms, especially the bonded ones, directly determined by BI of the corresponding distributions (Bahar & Jernigan, 1997; Bahar et al. 1997;
Monticelli et al. 2008; Tozzini & McCammon, 2008).
Issue (ii) points out an interesting problem: the probability distribution depends on the
statistical set chosen. In particular, the relative amount of helical versus extended secondary
structures influences the bonded term distributions (see Fig. 2b, c) and the short-range part of
the non-bonded one, as will become clear further on in this section. Some authors (e.g. Monticelli
et al. 2008) have circumvented this problem by using bonded FF terms that depend directly
on the secondary structure (which must be known a priori) rather than on the amino acid type.
In addition, the probability distributions also depend on the origin of the data. Data from
AA simulations usually span a limited area of the conformational space, due to the limited AA
MD run lengths, meaning that the structures in the set are generally less structurally diverse than
in an experimental data set. For this reason, the peaks present in the distributions
from simulations tend to be sharper and better defined, generally leading to larger energetic
parameters. However, using parameters generated from AA simulation turns out to be useful
especially in multi-scale simulations, where compatibility with AA simulations is required (Zhou
et al. 2007).
However, issue (iii), the correlation between FF terms, is the main reason that prevents using
the simple PMFs as effective CG interactions. This concept will be illustrated by means of a
deeper analysis of the non-bonded part of the distance probability distribution, whose PMF is
related to U nb(r). Figure 4 shows the g(r) and corresponding PMFs, evaluated using a set of
proteins with natural statistical occurrence of helical and sheet structures. The green line is the

Fig. 4. Non-bonded part of the pair correlation functions (lower plots, left-hand vertical axis) and the
corresponding PMF (upper plot, right-hand vertical axis). Color codes: green = total non-bonded g(r) obtained excluding the first, second and third neighbors along the polypeptide chain (that pertain to pseudo-bond, pseudo-bond-angle and pseudo-dihedral distributions); red solid = g(r) for the 1–5 distances (fourth
neighbors along the polypeptide chain); red dashed = g(r) for the 1–6 to 1–10 distances; magenta = g(r) of
the distances of Cα atoms involved in hydrogen bonds between strands in sheets; blue = green minus red and
magenta (i.e. 'real' non-bonded part). The distribution of hydrogen bonds in the sheets is obtained with the
following criteria: the two Cα atoms i and j are considered to form a hydrogen bond in a sheet if also (i+1, j−1)
and (i−1, j+1) (anti-parallel sheet) or (i+1, j+1) and (i−1, j−1) (parallel sheet) form a hydrogen
bond. Dotted lines on the upper graph are fits to the minima with Morse functions, $f(r) = \varepsilon[(\exp(-k(r - r_0)) - 1)^2 - 1]$, which have an additional parameter k with respect to the LJ potential, related to the
width of the well. The Morse potential turns out to be more appropriate given the variably softer nature of
the CG non-bonded interactions. The statistical set used is the set of 'small proteins' as in Fig. 3.

'non-bonded' g(r), obtained excluding from the distribution the first, second and third neighbors
along the polypeptide chain, which are related to the U bond and U back terms. One would, in
principle, expect that the PMF from this g(r) has an LJ-like shape. However, Fig. 4 (green lines,
upper plot) shows that it is far from being a single-well van der Waals (vdW)-like potential; rather,
it is multi-welled. For instance, the most evident peak at ~6.2 Å corresponds to the distribution
of the fourth neighbors (1–5 neighbors, red solid line): the Cα atoms separated by four amino acids
along the chain assume that distance quite sharply when they are in helical conformation. Thus
that peak can be thought of as due to the presence of the hydrogen bonds that keep the
helix stable. Other peaks at ~8.5, ~10, 11 and 14 Å, present in the fifth to tenth neighbor distribution
(dashed red line), have a similar origin: they are due to the regular recurrence of distances in the
helical structure. Here the concept of 'correlation' emerges: while the first peak (well) at 6.2 Å
can be related to the presence of an additional interaction term in the FF (e.g. the
hydrogen bond U hb), the additional peaks (wells) in the g(r) (PMF) are not due to additional
terms, but are induced by U hb via multi-body correlations present along the helical structures.
Similarly, the backbone hydrogen bonds among strands in a sheet are responsible for the three
overlapping peaks at ~4.5, 5.0 and 5.5 Å (magenta line), and only these, not the corresponding
induced peaks, should be included as explicit terms in the FF.
Once all the U hb terms (and their correlated peaks) are subtracted, only the excluded volume, hydrophobic
and electrostatic interactions remain. The corresponding g(r) and PMF look much smoother
(blue lines), although a couple of structures at ~6 and ~10 Å are still visible (even more evident
in the amino-acid-specific distributions; Trovato & Tozzini, in preparation), thought to be due to
other factors: intrinsic anisotropy of the excluded volume side chain interactions, hydrophobic
interactions mediated by water molecules, etc. Thus these distributions (blue lines), and not the
original g(r) (green lines), should be used to parameterize the U nb term.
The procedure described above is an empirical way to separate the 'genuine' FF terms from the
peaks due to correlations in the PMF. However, a rigorous way to generate a 'true' effective
potential from the PMF exists that, in principle, does not require any arbitrary choice. This is the
Iterative Boltzmann Inversion (IBI) (Reith et al. 2003). It consists in using the g(r) as the target
function to be reproduced iteratively, proceeding as follows: (1) use the PMF as initial guess for
$U_i(r)$ and generate a $g_i(r)$ from a simulation; (2) correct the $U_i(r)$ with the formula

$$U_{i+1}(r) = U_i(r) - kT \ln\left(\frac{g(r)}{g_i(r)}\right) \qquad (4.8)$$

and repeat step 1 with $U_{i+1}(r)$. At convergence, this procedure gives the effective potential that
best reproduces the target g(r) and, in principle, should be applied to determine each term of the
FF at the same time (using the corresponding probability distributions as targets). In practice,
however, as the number of terms increases, a number of problems arise. Even assuming that the
simple BI (not iterative) is enough for the bonded terms (less affected by the correlation problem), and restricting the IBI to U nb, one should apply it to ~$20^2$ different terms. In addition, in
order to obtain a general and transferable potential, the simulations should be performed not on
a single structure, but on a set of diverse representative structures. For these reasons, the IBI
is only rarely applied to biopolymers (Májek & Elber, 2009) and usually in simplified forms
(Banachowicz et al. 2000).
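The IBI loop of eqn (4.8) can be sketched as follows. Since a real application requires a CG simulation at each iteration, the `toy_simulation` function below is only a hypothetical stand-in: it returns a g(r) in which each feature of the potential induces a weaker replica at a shifted distance, mimicking the correlation-induced peaks discussed above. The Morse parameters, the shift and the coupling strength are arbitrary assumptions.

```python
import numpy as np

def iterative_boltzmann_inversion(g_target, simulate_g, kT=1.0, n_iter=50, tol=1e-10):
    """Refine U(r) until the simulated g(r) matches the target (eqn 4.8)."""
    u = -kT * np.log(g_target)                    # step 1: PMF as initial guess
    for _ in range(n_iter):
        g_i = simulate_g(u)
        correction = kT * np.log(g_target / g_i)
        if np.max(np.abs(correction)) < tol:
            break
        u = u - correction                        # step 2: eqn (4.8)
    return u

def toy_simulation(u, kT=1.0, shift=40, coupling=0.3):
    """Hypothetical stand-in for a CG simulation (NOT a real MD run):
    the potential induces a weaker copy of itself at a shifted distance."""
    return np.exp(-(u + coupling * np.roll(u, shift)) / kT)

# "True" single-well Morse potential (assumed parameters, energies in units of kT).
r = np.linspace(5.0, 15.0, 201)
u_true = 1.0 * ((np.exp(-1.5 * (r - 6.0)) - 1.0) ** 2 - 1.0)

g_target = toy_simulation(u_true)                 # plays the role of the reference g(r)
u_pmf = -np.log(g_target)                         # direct PMF, contains induced features
u_ibi = iterative_boltzmann_inversion(g_target, toy_simulation)
```

In this toy setting the direct PMF contains a spurious secondary feature, whereas the iterated potential recovers the single-well form that generated the target distribution.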
As previously noted, it is sometimes assumed that the bonded terms of the FF are less affected
by the problem of correlations, but this is not entirely true. Observing Fig. 2d, it is apparent that
P(θ) and P(α) are not independent; rather, certain values of α and θ appear preferentially in a
correlated way, for instance θ ~ 90° and α ~ 60° in the helices. These correlations can be seen as
induced by the presence of additional terms of the FF: specifically, the θ ~ 90°, α ~ 60° pair can be seen
as induced by the hydrogen bond terms that stabilize the helices. This again points out the
need to take great care when using the direct BIn as the interaction potential, even in the case of
the conformational terms.

4.2.4 Examples of models based on BI


In spite of these problems, many CG FFs are at least partially based on BI, using a simplified
version of the iterative approach. Among these, there are the partially biased models, first introduced to simulate the large-scale dynamics of the HIV-1 protease (Tozzini & McCammon,
2005; Tozzini et al. 2007) and subsequently applied also to larger systems such as the ribosome
(Trylska et al. 2005), nucleosomes (Voltz et al. 2008) and viruses (Arcangeli & Tozzini, in preparation). In these models, the FF form is as in eqn (4.1), with the U back conformational term
separated into U ang and U dih. The philosophy underlying these models is to gradually abandon the
bias toward a reference structure and base the parameterization of the unbiased terms on BI.
Particular attention is devoted to the functional form of the unbiased terms, such as U ang, which
in Tozzini & McCammon (2005) is represented with a double-well quartic potential

$$U^{ang}(\theta) = \frac{1}{2}k(\theta - \theta_0)^2 + \frac{1}{3}k'(\theta - \theta_0)^3 + \frac{1}{4}k''(\theta - \theta_0)^4, \qquad (4.9)$$

where $\theta_0$ is the location of the first well (~90°), and the other parameters are amino acid type
dependent and determine the position of the second well and its relative stability. The parameters
are extracted from the direct BIn (or PMFs) either from experimental structures (Tozzini et al.
2007; Trylska et al. 2005) or from AA simulations (Voltz et al. 2008) and then optimized using the probability distributions as targets. U nb, both local and non-local, is represented
with a Morse potential, but a partial bias is conserved in the structural parameters of U nb–local:
the equilibrium distances $r_0^{ij}$ differ for each ij pair and are taken from a single reference
structure. The distinction between U nb–local and U nb–non-local is similar to that in EN and Go
models, but both the parameters and the cutoff are based on physical grounds: looking at Fig. 4,
the cutoff can be naturally placed at ~8 Å, which separates the local part of U nb, containing
mainly the hydrogen bond interactions, from the non-local, less structured part, containing
mainly the hydrophobic and electrostatic interactions. Thus the local bias allows us to represent
in a very simple way the most complex terms of the FF, maintains the structure stable and gives a
high level of structural accuracy that allows these CG models to be compatible with AA models
in multi-scale approaches (Chang et al. 2007). At the same time, the other unbiased terms give
enough flexibility to the system, so that even out of equilibrium dynamics can be simulated, such
as the substrate capture process of HIV-1 protease (Trylska et al. 2007) or the allosteric motions
upon binding of HIV-1 integrase with HAT proteins (Di Fenza et al. 2009).
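The double-well character of the quartic angle term of eqn (4.9) can be made explicit with a short sketch. The helper `params_from_wells`, which builds the cubic and quartic constants from a desired barrier and second-well position, is a hypothetical convenience introduced here and is not the parameterization procedure of the cited works; all numerical values are assumptions.

```python
import numpy as np

def u_ang(theta, theta0, k, k1, k2):
    """Quartic angle potential of eqn (4.9); k1 and k2 stand for k' and k''."""
    d = theta - theta0
    return 0.5 * k * d ** 2 + k1 * d ** 3 / 3.0 + 0.25 * k2 * d ** 4

def stationary_offsets(k, k1, k2):
    """Non-zero roots of dU/dtheta = d (k + k1 d + k2 d^2), sorted; empty if single well."""
    disc = k1 ** 2 - 4.0 * k * k2
    if disc <= 0.0:
        return ()
    s = np.sqrt(disc)
    return tuple(sorted([(-k1 - s) / (2.0 * k2), (-k1 + s) / (2.0 * k2)]))

def params_from_wells(k, d_barrier, d_well):
    """Hypothetical helper: choose k1, k2 so the barrier and second well sit at given offsets."""
    k2 = k / (d_barrier * d_well)
    k1 = -k2 * (d_barrier + d_well)
    return k1, k2

# Illustrative values: first well at 90 deg, barrier at 110 deg, second well at 130 deg.
theta0 = np.radians(90.0)
k = 50.0
k1, k2 = params_from_wells(k, np.radians(20.0), np.radians(40.0))
d_barrier, d_well = stationary_offsets(k, k1, k2)
```

The derivative of eqn (4.9) factorizes into (θ − θ0) times a quadratic, so the position of the second well and of the barrier follow from the roots of that quadratic, while the ratio of the constants sets the relative stability of the two wells.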
A similar approach was adopted to build the VAMM FF (Korkut & Hendrickson, 2009). The
FF terms are the same (eqn (4.1)) and the parameterization is based on BI, with more complex
functional forms for U ang and U dih. At variance with Tozzini et al. (2007), the parameterization of
these terms is secondary structure dependent instead of amino acid type dependent, to improve
the structural accuracy, although a priori knowledge of the secondary structure is required.
U nb–local is biased toward a reference structure, but at variance with Tozzini et al. (2007), the
local/non-local separation is based on the distance along the chain (j − i < 6) instead of on a
distance cutoff. This implies that a double-well U nb–non-local is necessary to account for the
residual short-range interactions (Fig. 4, blue dotted lines).
Although it originates as an evolution of the Go models and is specifically designed for
folding, even the Das, Matysiak, Clementi (DMC) model (Das et al. 2005) can be put in the class
of the partially biased models, because in the non-bonded interaction the bias is partly eliminated.
U bond and U back are parameterized as in the usual Go models based on a single structure.
Conversely, U nb assumes the form

$$U^{nb}(r) = \sum_{j-i>3} \varepsilon(a_i, a_j)\left[5\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12} - \delta(a_i, a_j)\, 6\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{10}\right] \qquad (4.10)$$

(note that a 12–10 form, which is quite similar to the Morse form with intermediate values of the
k parameter, is used instead of the usual 12–6 LJ potential). U nb is not separated into local and non-local parts; however, the parameters ε and δ (= 0 or 1) depend on the
amino acid type ($a_i$, $a_j$), and σ depends also on the distance of i and j along the chain. Three classes
are considered: j − i = 4, j − i = 5 and j − i > 5. Thus, in some sense, the separation between local and
non-local parts of the non-bonded interaction is recovered by recognizing that the 1–5 and 1–6
interactions are qualitatively different from the others. The σ values are parameterized by BI of a
statistical set of non-redundant proteins (Das et al. 2005). The set of {ε, δ} is obtained through a
complex optimization procedure that involves the minimization of the distance of the simulated
folded structure from the native one, the experimental free energy differences upon mutations,
and other available observables (Matysiak & Clementi, 2006). The separation into classes and the
amino-acid-dependent parameters enhance the structural accuracy of the model. However, the
set of parameters is protein dependent; thus the problem of transferability to different proteins is
not completely solved. The DMC model improves the folding landscape characterization by
improving the local structural accuracy with BI-based terms and at the same time eliminating a
part of the bias present in the Go models.
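The pair term of eqn (4.10) can be written compactly as in the following sketch, which also checks the role of the switch δ. The function name and the numerical values of σ and ε are illustrative assumptions and do not correspond to the published parameter set.

```python
import numpy as np

def u_12_10(r, sigma, eps, delta=1):
    """Single pair term of eqn (4.10): 12-10 well (delta = 1) or pure repulsion (delta = 0)."""
    x = sigma / np.asarray(r, dtype=float)
    return eps * (5.0 * x ** 12 - delta * 6.0 * x ** 10)

# Illustrative values (not the published DMC parameters).
sigma, eps = 6.0, 1.0
r = np.linspace(4.5, 15.0, 2101)
u_attractive = u_12_10(r, sigma, eps, delta=1)
u_repulsive = u_12_10(r, sigma, eps, delta=0)
r_min = r[np.argmin(u_attractive)]
```

With the 5 and 6 prefactors the attractive form has its minimum at r = σ with depth ε, so σ plays the role of the equilibrium distance extracted by BI, while setting δ = 0 leaves only the excluded-volume repulsion.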
As already observed, the difficulty in completely releasing the bias in OB models stems from
the fact that the non-bonded interactions, especially the local ones, are very complex and highly
anisotropic. Thus maintaining the local bias is a simple compromise solution to have high
accuracy without introducing a large number of parameters and complex functional forms
(Mukherjee et al. 2005). A part of the problem, specifically that related to the anisotropy of the
interactions of the side chains, is solved in the Cα-based two-bead (or multiple-bead) models.
Some of them are worth mentioning here, although this paper is focused on the OB models,
because they are based on BI. In the two-bead model by Bahar & Jernigan (1997) and Bahar et al.
(1997), the backbone U ang and U dih terms are numerically evaluated by BI and are amino acid
type dependent. Conformational terms pertaining to the orientation of the side chain and a
U corr(θ,α) term are added to account for the correlations between the backbone conformational
variables. The non-bonded terms between side chains are obtained with the same philosophy.
The CALF FF is virtually a two-bead model, although the position of the side chain is entirely
determined from that of Cα (Buck & Bystroff, 2009). From the FF form point of view, it is
interesting because explicit terms for the hydrogen bonds U hb are introduced. The para-
meterization procedure is BI related but quite complex. Given a protein, first the local structure
is determined from the local sequence, through a statistical procedure. Second, based on the
local structure, the U back and U hb terms are parameterized, including statistical information
through BI. This model is used for folding. The OPUS-Cα FF (Wu et al. 2007) is similar in the FF
terms and in the philosophy of the parameter assignment (based on the local structure), and
uses very complex parametric forms for the FF terms. These models are quasi-unbiased,
but the bias is hidden in the a priori determination of the local structure, upon which the FF is
based.
The MARTINI FF, recently extended to proteins (Monticelli et al. 2008), is a multiple-bead
model where the backbone bead is placed on the centroid of the NH–Cα–C=O group and 1–4
beads are used for the side chains; in addition, the solvent is explicit, although it is CG itself.
Thus the level of coarse graining is not particularly extreme. However, this model is interesting
for the purpose of the present paper because the choice of FF terms and parameterization
contains elements that can be considered exemplary even for Cα-based models. U bond is a simple
harmonic form, including in this case also the terms between Cα and the side chains. U back is
split into the bond angle and dihedral parts, both represented by single-well functions. Additional
conformational terms are assigned to maintain the correct orientation of the side chain. U nb is
split into an LJ plus a screened Coulomb part, representing excluded volume and hydrophobic
interactions, and electrostatics, respectively. U hb is absent: the secondary structure is maintained
by the U back terms, whose values are derived based on the BI. The side chain non-bonded
interaction parameters are amino acid type dependent and were optimized including thermostatistics data from experiment (e.g. free energy of water–oil partitioning) or from AA simulations (amino acid association constants). As for OPUS-Cα and CALF, although the bias toward a
single structure is completely removed, a priori knowledge of the secondary structure is necessary.
In addition, no transitions between different secondary structures are allowed.

4.3 The Force Matching (FM) method


4.3.1 Mechanical consistency versus thermodynamic consistency
The BI-based methods can be thought of as realizing the ‘ thermodynamic consistency ’ of the
CG model with a given statistical set of data. In fact they use as the target quantity to reproduce
in the CG simulations the probability distributions or, equivalently, the corresponding BIn,
which are related to the internal variable-dependent free energies

F (Q)=xkT ln (P(Q))+const:, ð4:11Þ

where Q is a collective variable (e.g. a CG internal variable). This is the general form of eqn (4.5)
valid for any kind of collective variable Q. The additive constant is equal to xkT times the
logarithm of the partition function. The relation between F(Q) and the corresponding PMF is
 
P(Q)
PMF(Q)=xkT ln =F (Q)xF0 (Q)+const:, ð4:12Þ
P0 (Q)

where the subscript 0 refers to the reference system, usually the non-interacting particle system.
As discussed in the previous section, the PMF and the corresponding free energy differ by more than
a simple constant if the choice of the variable Q is such that F0 is not independent of it. The
previous section also discussed how to obtain the ‘best’ U(Q) starting from
PMF(Q) with the iterative BI.
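As a minimal sketch of eqns (4.11) and (4.12), the following Python fragment converts a sampled distribution of a collective variable into a free energy profile and a PMF. The function names, the binning helper and the default value of kT (0.593 kcal/mol, roughly room temperature) are assumptions made for this illustration; they are not taken from any of the reviewed models.

```python
import math

KT = 0.593  # assumed thermal energy in kcal/mol (about 298 K)


def histogram(samples, lo, hi, nbins):
    """Return the normalized probability P(Q) of each bin in [lo, hi)."""
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for q in samples:
        if lo <= q < hi:
            # min() guards against rounding at the upper edge
            counts[min(int((q - lo) / width), nbins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]


def free_energy(prob, kt=KT):
    """F(Q) = -kT ln P(Q), shifted so that the lowest bin is zero.

    The shift fixes the arbitrary additive constant of eqn (4.11).
    Empty bins are assigned an infinite free energy.
    """
    f = [-kt * math.log(p) if p > 0 else math.inf for p in prob]
    f_min = min(f)
    return [x - f_min for x in f]


def pmf(prob, prob_ref, kt=KT):
    """PMF(Q) = -kT ln[P(Q) / P0(Q)], with P0 the reference distribution."""
    out = []
    for p, p0 in zip(prob, prob_ref):
        out.append(-kt * math.log(p / p0) if p > 0 and p0 > 0 else math.inf)
    return out
```

In a direct BI, the profile returned by `free_energy` would be used as the first guess for U(Q); the iterative variant would then correct it until the CG simulation reproduces the target P(Q).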
In principle, BI could be applied also to many-body PMFs and effective potentials. One could
consider a many-body CG potential depending on a set of collective CG variables {Q}

$$U(\{Q\}) = U(Q_1, \ldots, Q_N) \qquad (4.13)$$

and determine it with a fully converged multi-variable IBI procedure, using a set of probability
distributions P({Q}) (from experiment or from simulation) as the target. Equations (4.11) and
(4.12) are valid also when Q is replaced by the set {Q} and the corresponding multivariate
quantities (probabilities, free energies, PMF) are defined. The IBI procedure then gives the
effective multi-body CG potential U({Q}) that best reproduces the free energy F({Q}) of the
system (i.e. the Free Energy Surface) and consequently its thermodynamic properties.
An alternative route to determine effective CG potentials is the ‘ force matching ’ (FM)
method, targeting the reproduction of the forces (Ercolessi & Adams, 1994 ; Izvekov et al. 2004).
The forces acting on the CG sites are calculated with a more accurate method (e.g. AA MD
simulations) and then used as the target to fix the parameters of the CG potential or to fit a
potential in numerical form. This method was rigorously formulated and optimized in a series of
papers by Izvekov & Voth (2005), Wang et al. (2009), Noid et al. (2008a), Das & Andersen (2009)
and Noid et al. (2008b), who named it the multi-scale CG (MS-CG) method. The basic idea is to
minimize the functional
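$$\chi^2(\{\mathbf{F}\}) = \frac{1}{3N} \left\langle \sum_{I=1}^{N} \left| \mathbf{F}_I(\{Q(\{q\})\}) - \mathbf{f}_I(\{q\}) \right|^2 \right\rangle, \qquad (4.14)$$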

where F_I are the CG forces on CG sites I, f_I are the forces on the CG sites evaluated from the
AA simulations, and ⟨…⟩ represents the average along the trajectory (or within the data set), which is
intended to be large enough to sample the canonical ensemble. The statistical average eliminates
the explicit dependence on the AA internal variables ({q}), so that χ² is a functional only of the
functions F_I({Q}), i.e. the CG forces, to be determined by minimizing χ². It is shown that the
F_I({Q}) that minimizes χ² satisfies the equation

$$\mathbf{F}_I(\{Q\}) = -\frac{\partial}{\partial Q_I} U(\{Q\}), \qquad (4.15)$$

where U({Q}) is the multi-body PMF related to P({Q}). This equation relates the
(I)BI and the FM methods. In practice, however, there are substantial differences between the
two methods, at different levels. First of all, the input data set is typically different. The BI-based
procedures take, as input, structures from any source, but preferentially experimental ones, to enlarge
the diversity of the data set and thus improve the transferability of the parameters. Conversely,
the FM procedure needs, as inputs, the forces on CG sites, which are typically evaluated along AA
MD simulations. This difference in the inputs results in differences in
the FF terms: those from (I)BI tend to be ‘softer’ (multiple, broader and less
defined minima), while those from FM tend to be ‘harder’ (sharper and better defined minima). In
addition, the use of AA simulation structures as input data raises the question of
the accuracy and reliability of the AA FFs, which is assumed in this approach. It
is beyond the scope of this paper to analyze this problem; however, it should be mentioned that the
most commonly used AA FFs have recently revealed some inaccuracies that appear especially on
the long time scale (Okur et al. 2003; Ono et al. 2000). The possible differences between CG FFs
generated by the use of different input data sets are consistent with the different uses usually made
of the BI and FM CG FFs: the former are generally used to address general dynamics in CG-only
simulations, while the latter are preferentially used in multi-scale simulations, where the CG
results must be compared with the AA ones. In fact, the MS-CG method is designed to realize
the mechanical consistency between AA and CG models.
Even considering an identical input set of structures, different FFs may arise from the different
approximations adopted in the two methods. In neither FM nor BI are the exact multi-body
U({Q}) or forces usually considered. Rather, in both cases they are decomposed into a finite
sum of terms. In the case of the (I)BI-based methods this simply corresponds to writing the FF as a
sum of terms depending on the single internal CG variables, i.e. eqns (4.3) or (2.2). As already
discussed, behind this choice there is the assumption that the FF terms and explicit
internal variables are chosen in such a way as to minimize the correlation between terms.
The parameters of the FF terms are then varied in order to optimally reproduce the PMFs or
probability distributions by direct or iterative BI, as explained previously.
In the case of the MS-CG method, the forces on the CG sites are expressed as a linear
combination of ‘ basis ’ functions G({Q}) :
$$\mathbf{F}_I(\{Q\}) = \sum_d w_d \, \mathbf{G}_{I,d}(\{Q\}) \qquad (4.16)$$

and the coefficients w_d are determined by variationally minimizing eqn (4.14). The exact form of
G({Q}) depends on the kind of system. For instance, if the pair-wise interactions are dominant in
the system, e.g. for a one-component system formed by non-bonded elements, it becomes
convenient to use the Cartesian coordinates of the CG sites and express the forces in the form
$$\mathbf{F}_I(\{R\}) = \sum_{J \neq I} \sum_d \hat{\mathbf{R}}_{IJ} \, F_d \, \delta(R_{IJ} - R_d), \qquad (4.17)$$

where R_IJ is the pair-wise distance between two CG sites and δ(R) is a discrete delta function.
This particular choice returns directly the pair-wise force acting among the components
(F_I({R})), assuming that multi-body forces are negligible.
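The variational step can be illustrated in its simplest setting: a single pair of sites in one dimension with the discrete delta basis of eqn (4.17). In that case each sample contributes to exactly one basis function, so the least-squares minimum of eqn (4.14) is simply the average reference force in each distance bin. The sketch below relies on this simplification, which is an assumption of the example: with many pairs per site the basis functions overlap and a full linear least-squares problem must be solved instead. All function names are hypothetical.

```python
def force_match_pair(distances, ref_forces, r_min, r_max, nbins):
    """Least-squares pair force table F_d on a grid of distance bins.

    With a discrete delta basis and a single pair, each sample touches
    exactly one bin, so minimizing the mean squared residual gives the
    average reference force in that bin. Empty bins are returned as None.
    """
    width = (r_max - r_min) / nbins
    sums = [0.0] * nbins
    counts = [0] * nbins
    for r, f in zip(distances, ref_forces):
        if r_min <= r < r_max:
            d = min(int((r - r_min) / width), nbins - 1)
            sums[d] += f
            counts[d] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


def residual(distances, ref_forces, table, r_min, r_max):
    """Mean squared difference between tabulated and reference forces."""
    nbins = len(table)
    width = (r_max - r_min) / nbins
    total, n = 0.0, 0
    for r, f in zip(distances, ref_forces):
        if r_min <= r < r_max:
            d = min(int((r - r_min) / width), nbins - 1)
            if table[d] is not None:
                total += (table[d] - f) ** 2
                n += 1
    return total / n if n else 0.0
```

Any table other than the one returned by `force_match_pair` gives a larger `residual` on the same data, which is the one-dimensional analogue of the variational property used by the MS-CG method.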
It is clear that (I)BI and FM methods, although stemming from a common basis, use different
inputs and different approximations. The FM method has the advantage of directly giving the
effective interaction potential without the need for iterative procedures. On the other hand, the
(I)BI procedure is more broadly applicable, because it needs only the structures and not the
forces as input, although it needs more care and effort to obtain accurate effective potentials.

4.3.2 Examples of FFs based on the FM


The MS-CG method was used for a variety of systems of biological interest (lipid bilayers
(Izvekov & Voth, 2006) with membrane proteins (Shi et al. 2006), monosaccharides (Liu et al.
2007)). For the purposes of the present paper, it is useful to cite its use in a two-bead model of
class 4 (Zhou et al. 2007) (see Table 1), in which the MS-CG procedure was used to model the
non-bonded interactions between beads, while the direct BI was used to model the bonded
interaction terms, assuming that these are only slightly affected by the correlation problem. At variance
with previous models, in this one the dihedral term for the backbone is omitted: the task of
maintaining the correct dihedral value and backbone conformation is entirely delegated to
the hydrogen bonding between backbone atoms, which is included in the non-bonded interactions
between backbone beads. Referring to eqn (4.1), U back depends only on θ, while U nb-loc is completely
unbiased and contains together the effects of hydrogen bonding and other local interactions. It is
derived in a numerical form using eqn (4.17), as are the non-bonded potential terms for the
side chains, which in this model assume the role of U nb-non-loc. This model was shown to
describe properly the folding dynamics and landscape of a 15-mer of poly-alanine. Conversely, to
describe accurately the β-hairpin structure, an analogously derived model but with three beads
for the backbone was used (Thorpe et al. 2008), because for the β-hairpin structure
the correct conformation of the backbone hydrogen bond is critical, and this is better described
using more beads for the backbone.

4.4 Physics–chemistry-based models and combinations of methods


The BI and FM methods allow, in principle, a very accurate parameterization of effective potentials,
once the decomposition of the FF in single terms is established. However, from the previous
analysis two main problems emerge: (i) in principle, all the FF terms must be parameterized in an
amino-acid-type-dependent fashion, in order for the FF to be predictive; (ii) the FF terms and
their functional forms must be properly chosen. In this section, a group of models is reported that
try to address problems (i) and (ii) by including in the parameterization
elements based on the known chemical and physical properties of the amino acids and/or
thermodynamics and statistical data from experiment, and by minimizing the necessary
a priori knowledge of the system. This procedure is not at all new in the field of empirical FF
simulations, since even in the parameterization of the AA FFs physics- and chemistry-based
elements are introduced. In addition, in this group of models the focus is moved from the
reproduction of the local structure to the prediction of the global fold (secondary and tertiary
structures). For this, particular attention is devoted to the parameterization of the FF terms that
determine the helical versus sheet propensity, e.g. U hb and U back.
A reference model for those belonging to this class is the one by Levitt (1976) and Levitt &
Warshel (1975). This model belongs to class 7 and is described by 2–4 interacting centers per
amino acid. However, the degrees of freedom are only those pertaining to the Cα, since the
position of the other interacting centers (i.e. the side chain centroid and auxiliary sites to describe
the hydrogen bonds) is entirely defined by the position of the Cα. As a consequence, the energy
functional terms are still those in eqn (4.1). Particularly interesting is the treatment of the term
U back: the correlation between α and θ is explicitly treated by imposing the relationship
θ = 106 − 13 cos(α − 45) (numbers are in degrees). This given, the backbone conformation is determined by
α and its corresponding term U dih, parameterized by BI for a small number of representative
classes of possible amino-acid quadruplets and analytically expressed with a six-term Fourier
expansion. At variance with models previously described (e.g. MARTINI), the parameterization
of the conformational term in this model is amino acid type based rather than secondary
structure based. This means that the model is completely unbiased. The non-bonded interaction
between side chains, which takes the part of U nb-non-loc, is separated into its two components,
the vdW excluded volume and the ‘solvent effect’ (i.e. hydrophobic/hydrophilic interaction).
This separation, which is dropped in most of the subsequent models, is related to the fact that
both of these terms are parameterized based on the physical–chemical properties of each amino
acid. The solvent term is parameterized based on the experimental solubility of the amino acids,
and the vdW term is estimated by averaging over the possible conformations of the amino-acid side
chain. Two facts are worth noticing: (i) the averaging of the side chain conformations is
recognized as responsible for the ‘softer’ shape of the effective side chain potential with respect to
the usual LJ form, which was in fact adopted by most of the subsequent models; (ii) the
combination of two terms with different nature and shape gives rise to an effective U nb-non-loc
that is not simply single-welled, which is also observed in the effective potentials obtained by BI or FM. The
hydrogen bonding U hb is rather complex and defined through additional sites located near the
C=O and N–H groups of the peptide, and described through the combination of a vdW and an
electrostatic term, with parameters taken from the corresponding ones of the AA FFs. The
Levitt–Warshel FF was shown to reproduce roughly the secondary structure contact map of
globular proteins, without introducing any a priori knowledge of the protein, except the sequence.
A simplified single-bead (class 2 in Table 1) version of this model was subsequently developed by
McCammon & Northrup (1980).
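The bond angle–dihedral correlation imposed in the Levitt–Warshel model, quoted above in degrees, can be evaluated directly. The short sketch below only restates that relationship; the function name is hypothetical.

```python
import math


def levitt_theta(alpha_deg):
    """Bond angle theta (degrees) tied to the dihedral alpha (degrees)."""
    return 106.0 - 13.0 * math.cos(math.radians(alpha_deg - 45.0))
```

By construction the bond angle spans the interval from 93 degrees (at a dihedral of 45 degrees) to 119 degrees (at 225 degrees), so that once this constraint is imposed a single variable, the dihedral, is left to describe the backbone conformation.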
The UNRES FF, later developed by Liwo et al. (1997a, b), can be considered as the evolution
of the Levitt–Warshel model, being based on a similar definition of the interacting sites and
composition of FF terms. A first improvement is related to the correlation between the
conformational terms of the backbone, and consists in recognizing and parameterizing in an
amino-acid-dependent way a distribution of deviations from the correlation function proposed by Levitt
& Warshel (1975). Correlations between the backbone conformation and the side chain
orientation are also introduced. These two terms are parameterized based on BI-related techniques
complemented with other physics-based techniques, such as the averaging of the corresponding
parameters of the AA FF. The side chain interactions are allowed to be anisotropic and are
parameterized in an amino-acid-dependent way with similar approaches. The correlations accurately
included in U back already determine the secondary structures, which are however further
stabilized by the dipole–dipole interaction among the sites placed on the Cα atoms, which mimics U hb.
The UNRES model is the prototype of a class of models with a very complex parameterization,
whose detailed description is beyond the scope of this paper. Going back to the truly
minimalist models, another class of them stemmed from the work of Levitt and Warshel, whose
prototype can be considered the model by Sorenson & Head-Gordon (2002a), inspired by
previous work by Honeycutt & Thirumalai (1990) and by Nymeyer et al. (1998). Similar models
were subsequently developed by Friedel & Shea (2004). This model is minimalist in all senses: the
protein is represented by one bead per amino acid placed on the Cα. The FF form is expressed
by eqn (2.1), where U bond is replaced by a constraint and U back = U ang(θ) + U dih(α). In the
simplest version of the model (Sorenson & Head-Gordon, 2002b), U ang is a simple harmonic
potential with equilibrium angle at θ0 = 105°, midway between the typical helical and typical
extended values. U dih is more complex
$$U^{\mathrm{dih}}(\alpha) = \sum_{\mathrm{dih}} A[1 + \cos\alpha] + B[1 - \cos\alpha] + C[1 + \cos 3\alpha] + D[1 + \cos(\alpha + \pi/4)], \qquad (4.18)$$

where the parameters A, B, C and D are secondary structure dependent (e.g. A = C = D helical).
In subsequent refinements (Yap et al. 2008), the value of θ0 was also differentiated by secondary
structure (i.e. 95° for helical conformations and 105° for the others) and α was substituted with (α − α0),
which allowed a more accurate representation of the secondary structures. As in
some previously described models, the dihedral term is considered as the one that mainly
determines the secondary structure. At variance with BI-based models, the values of the parameters
are more roughly determined, i.e. chosen to simply stabilize one or the other secondary structure,
and assigned based on the secondary structure propensity of the amino acids. Thus the parameterization
is more chemistry–physics based and does not need any a priori knowledge except the sequence.
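The backbone terms of this family of models are simple enough to be written in a few lines. The sketch below evaluates a harmonic U ang and the dihedral term of eqn (4.18) for a single dihedral, with angles in radians. The function names, the factor 1/2 in the harmonic term and the force constant `k_theta` are assumptions of this example; the coefficients A to D must be supplied by the user, since no parameter set from the cited papers is reproduced here.

```python
import math


def u_ang(theta, k_theta, theta0=math.radians(105.0)):
    """Harmonic bond angle term, assumed of the form 0.5 * k * (theta - theta0)^2."""
    return 0.5 * k_theta * (theta - theta0) ** 2


def u_dih(alpha, a, b, c, d):
    """Dihedral term of eqn (4.18) for one dihedral angle alpha (radians)."""
    return (a * (1.0 + math.cos(alpha))
            + b * (1.0 - math.cos(alpha))
            + c * (1.0 + math.cos(3.0 * alpha))
            + d * (1.0 + math.cos(alpha + math.pi / 4.0)))


def dihedral_minimum(a, b, c, d, npoints=3600):
    """Dihedral in [-pi, pi) with the lowest U dih on a uniform grid."""
    grid = [-math.pi + 2.0 * math.pi * i / npoints for i in range(npoints)]
    return min(grid, key=lambda x: u_dih(x, a, b, c, d))
```

Scanning `dihedral_minimum` over different choices of A to D shows how a secondary-structure-dependent coefficient set moves the preferred dihedral: for instance, a pure A term favors the trans value and a pure B term favors the cis value.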
The stability of the fold is also determined by the non-bonded term. In the simplest form
U nb is
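$$U^{\mathrm{nb}}(r) = \sum_{j > i+3} 4\varepsilon S_1 \left[ \left(\frac{\sigma}{r_{ij}}\right)^{12} - S_2 \left(\frac{\sigma}{r_{ij}}\right)^{6} \right], \qquad (4.19)$$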

where σ is the same for all the beads and ε is the energy unit of the system (the parameters of
the bond angle and dihedral terms are also expressed as fractions of ε). The parameters S1 and S2 are
assigned based on the ‘flavor’ of the amino acid: hydrophobic (later separated into strongly and
weakly hydrophobic), hydrophilic and neutral. As in the case of the dihedral potential, not much
attention is paid to the accurate reproduction of the form of the interaction potential, which can
assume only a few different characters (strongly or weakly attractive, strongly or weakly
repulsive). However, even this potential term can be assigned solely based on the sequence. The
simplest version of the model is capable of distinguishing the secondary and tertiary fold of
proteins with different levels of α/β propensity, and locates the folding transition at values of
kT/ε around 0.4–0.6. In addition, in the optimized version of the model, an explicit hydrogen
bond term is present, whose form is inspired by previous works on water (Silverstein et al. 1998)
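$$U^{\mathrm{hb}} = -\sum_{\mathrm{hbonds}} \varepsilon_{\mathrm{hb}} \exp\left[-(r_{ij} - r_{\mathrm{hb}})^2/\sigma_{\mathrm{hb}}^2\right] \exp\left[(|\hat{\mathbf{r}}_{ij} \cdot \mathbf{t}_i| - 1)/\sigma_{\mathrm{hb}}^2\right] \exp\left[(|\hat{\mathbf{r}}_{ij} \cdot \mathbf{t}_j| - 1)/\sigma_{\mathrm{hb}}^2\right], \qquad (4.20)$$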

where t_i is a unit vector orthogonal to the plane formed by the triplet of Cα atoms (i−1, i, i+1). This
form of the potential induces those planes to stay parallel, thus stabilizing the helix and sheet
conformations. The sum runs over all the pairs ‘capable’ of forming a bond, and the capability
is assigned similarly to the corresponding dihedral propensity, based on sequence. The values of
ε_hb and r_hb depend on the flavor of the amino acid. The hydrogen bond term improves the
quality of the kinetics of the α/β transition in proteins with ambiguous propensity and shows a
high capability of predicting the α/β propensity upon mutations. On the other hand, the
structural accuracy in this model is sacrificed in favor of the prediction of the fold. The
functional forms of the single FF terms are rather ‘primitive’ with respect to the BI- or
FM-parameterized FFs.
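The non-bonded and hydrogen bond terms of this class of models can be sketched for a single pair of beads as follows. The code takes the pair form of eqn (4.19) to be 4εS1[(σ/r)^12 − S2(σ/r)^6] and uses the same width σ_hb in the three exponentials of eqn (4.20), as printed there. Function names and the tuple-based vector helpers are hypothetical, and no flavor-dependent parameter values from the cited papers are reproduced: S1, S2, ε_hb, r_hb and σ_hb are left as arguments.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _norm(a):
    return math.sqrt(_dot(a, a))


def plane_normal(p_prev, p, p_next):
    """Unit vector orthogonal to the plane of three consecutive sites."""
    u, v = _sub(p_prev, p), _sub(p_next, p)
    n = (u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0])
    length = _norm(n)
    return tuple(x / length for x in n)


def u_nb_pair(r, s1, s2, eps=1.0, sigma=1.0):
    """Pair contribution to eqn (4.19) at distance r."""
    x6 = (sigma / r) ** 6
    return 4.0 * eps * s1 * (x6 * x6 - s2 * x6)


def u_hb_pair(r_i, r_j, t_i, t_j, eps_hb, r_hb, sigma_hb):
    """Pair contribution to eqn (4.20) for sites i and j with normals t_i, t_j."""
    rij = _sub(r_j, r_i)
    dist = _norm(rij)
    unit = tuple(x / dist for x in rij)
    radial = math.exp(-((dist - r_hb) ** 2) / sigma_hb ** 2)
    ang_i = math.exp((abs(_dot(unit, t_i)) - 1.0) / sigma_hb ** 2)
    ang_j = math.exp((abs(_dot(unit, t_j)) - 1.0) / sigma_hb ** 2)
    return -eps_hb * radial * ang_i * ang_j
```

With S1 = S2 = 1 the pair term is the usual LJ well of depth ε; setting S2 to zero or to a negative value makes it purely repulsive, which is how a small number of interaction characters is obtained from one functional form. The hydrogen bond term is deepest (equal to −ε_hb) when the two sites sit at distance r_hb and both normals are aligned with the inter-site direction.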
It is worth mentioning here the two-bead model by Mukherjee & Bagchi (2002), who
introduced the hydrogen bonding effect in the helices with additional attractive harmonic terms for
1–3 and 1–4 interactions. The elastic constants of those terms are assigned based on the
sequence-determined helix propensity of each amino acid, as in the other models described in this
section.
An accurate treatment of U hb is also found in the OB model by Alemani et al. (2010). U ang
and U dih are represented by a double-well form, as in Tozzini & McCammon (2005), and by a
simple cosine form, respectively. In addition, the persistence of the secondary structures is
maintained by an additional term that correlates subsequent dihedrals. The hydrogen bond term
is represented by a dipolar interaction between the peptide dipoles μ_i and μ_j

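$$U^{\mathrm{hb}} = \sum \left[ \frac{\boldsymbol{\mu}_i \cdot \boldsymbol{\mu}_j}{r_{ij}^3} - 3\,\frac{(\boldsymbol{\mu}_i \cdot \hat{\mathbf{r}}_{ij})(\boldsymbol{\mu}_j \cdot \hat{\mathbf{r}}_{ij})}{r_{ij}^3} \right]. \qquad (4.21)$$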

The dipoles μ_i are located approximately midway between Cα atoms i and i+1 and their orientation depends
on the position of Cα atoms i−1, i and i+1. This approach is similar to that of Sorenson–Head-Gordon
in that the hydrogen bond interaction depends on the position of three subsequent Cα atoms
only, but differs in many aspects. First of all, it is entirely physics based, attributing the hydrogen
bond to a simple dipolar interaction, while the former is more empirical. Second, eqn (4.21) does
not explicitly depend on the secondary structure. The secondary structure dependence enters
implicitly in the definition of the orientation of μ, which depends on the angle formed by the three
Cα atoms, determined in turn by the specific secondary structure. This dependence is not empirical;
rather, it is entirely physics based. In this model, the α versus β propensity is due to the U ang and
U dih terms, although stabilized by U hb. The model is capable of reproducing stable secondary
and super-secondary structures, and the transitions among them, with a good structural accuracy,
although a systematic amino-acid-dependent parameterization was not established; thus in some
sense the model still lacks the predictive power that characterizes the Sorenson–Head-Gordon
model.
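The dipolar form of eqn (4.21) can be sketched as follows, in units where the electrostatic prefactor is absorbed into the dipole moments. The function names are hypothetical, and the sum over all distinct pairs is an assumption of this example, since the set of pairs actually included in the model is not specified above.

```python
import math


def dipole_energy(mu_i, mu_j, r_i, r_j):
    """Point dipole-dipole energy for dipoles mu_i, mu_j at positions r_i, r_j."""
    rij = tuple(b - a for a, b in zip(r_i, r_j))
    dist = math.sqrt(sum(x * x for x in rij))
    unit = tuple(x / dist for x in rij)

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    return (dot(mu_i, mu_j) - 3.0 * dot(mu_i, unit) * dot(mu_j, unit)) / dist ** 3


def u_hb_dipolar(dipoles, positions):
    """Sum of the dipolar energy over all distinct pairs (assumed pair list)."""
    total = 0.0
    n = len(dipoles)
    for i in range(n):
        for j in range(i + 1, n):
            total += dipole_energy(dipoles[i], dipoles[j], positions[i], positions[j])
    return total
```

Two parallel dipoles are attracted when placed head to tail and repelled when placed side by side, which is the orientation dependence that lets a purely dipolar term discriminate between different relative arrangements of the peptide units.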
At the conclusion of this section, it is useful to report a summary of the features of the models
discussed here. In Table 4, the presence and functional form of the FF terms in the different
models are reported, together with remarks about the parameterization procedures. In Fig. 5, the
parameters of FF terms are reported as a function of the corresponding equilibrium distance.
The bond angle and dihedral parameters are converted into equivalent constants for linear
distance-dependent terms in order to be compared with the others, as explained in the caption of
Fig. 5. As can be seen, in spite of the very different parameterization procedures (biased (blue), BI
or FM based (red), or physical–chemical based (green)), the dots tend to accumulate along a line
that is roughly represented by a shifted inverse proportionality (dashed black line) and super-
impose partially with the relationship between elastic constant and cutoff distance of the ANM
(solid black line with squares). (The corresponding line for the GNM, conversely, looks like an
upper limit for the values of the constants.) This fact points to the emergence of a sort of
universality in the numerical values of the parameters of CG models and in their dependence on
the equilibrium distance, which can guide the parameterization.

Fig. 5. Summary of the strength parameters in the OB FFs. All the constants are expressed in kcal/(mol Å²),
as if they were elastic constants for harmonic distance potentials. For the non-bonded or hydrogen bonding
potentials, their equivalents are computed by calculating the second derivative at the minima and reporting it
in linear distance coordinates. It is to be noted that both the binding energy (i.e. well depth) and the well
width concur to determine the equivalent elastic constant in the case of Morse or LJ-like potentials. For the
bond angle and dihedral interactions, the parameters corresponding to the equivalent 1–3 and 1–4
interactions are evaluated and reported in the graph. In this case both the equilibrium distance and the equivalent
elastic constants are evaluated. Colors: black solid lines with squares and dots represent the relationship
between elastic constants and cutoff distances in the ANM and GNM, respectively (Atilgan et al. 2001 ;
Soheilifard et al. 2008). The blue dots refer to biased models (different kinds of networks and Go models ;
Atilgan et al. 2001 ; Chennubhotla et al. 2005 ; Chu & Voth, 2007 ; Clementi et al. 2000 ; Demirel & Keskin,
2005 ; Go & Scheraga, 1976 ; Hamacher & McCammon, 2006 ; Jeong, et al. 2005 ; Kaya & Chan, 2003 ;
Keskin et al. 1998 ; Lyman et al. 2008 ; Maragakis & Karplus, 2005 ; Miyazawa & Jernigan, 1996 ; Nakagawa &
Peyrard, 2006 ; Soheilifard et al. 2008 ; Tirion, 1996). The vertical error bars are present in particular models
(e.g. the heterogeneous network), where local interactions can assume different strengths depending on the
protein. Red dots represent the models with parameterization based on the BI or FM (Arcangeli & Tozzini,
in preparation; Bahar & Jernigan, 1997 ; Chang et al. 2007 ; Di Fenza et al. 2009 ; Matysiak & Clementi, 2006 ;
Monticelli et al. 2008 ; Mukherjee et al. 2005 ; Reith et al. 2003 ; Tozzini & McCammon, 2005, 2008 ; Trovato
& Tozzini, in preparation ; Trylska et al. 2005, 2007 ; Voltz et al. 2008). Red dotted lines correspond to the
equilibrium distance-dependent parameters of Tozzini & McCammon (2005). Horizontal error bars are due
to the fact that often the same elastic parameters are used for helical or extended conformations, which have
different equilibrium distances. The green dots represent the models with chemical–physical-based
parameterization (Alemani et al. 2010; Friedel & Shea, 2004; Honeycutt & Thirumalai, 1990; Levitt, 1976;
Mukherjee & Bagchi, 2002; Nymeyer et al. 1998; Silverstein et al. 1998; Sorenson & Head-Gordon, 2002a,
b; Tozzini, in preparation). The dashed black line is a guide to the eye: y = −1 + 25/(x − 4).
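The conversion of a non-bonded potential into an equivalent elastic constant, described in the caption of Fig. 5, amounts to evaluating the second derivative at the minimum. The sketch below does this numerically and compares it with the closed forms for an LJ potential 4ε[(σ/r)^12 − (σ/r)^6] and for a Morse potential D[1 − exp(−a(r − r0))]². The function names and the finite-difference step are assumptions of this example, and no parameter from the figure is reproduced.

```python
import math


def second_derivative(u, r, h=1e-4):
    """Central finite-difference estimate of the second derivative of u at r."""
    return (u(r + h) - 2.0 * u(r) + u(r - h)) / (h * h)


def lj(r, eps, sigma):
    """Lennard-Jones potential 4*eps*[(sigma/r)^12 - (sigma/r)^6]."""
    x6 = (sigma / r) ** 6
    return 4.0 * eps * (x6 * x6 - x6)


def morse(r, depth, a, r0):
    """Morse potential depth*[1 - exp(-a*(r - r0))]^2."""
    return depth * (1.0 - math.exp(-a * (r - r0))) ** 2


def k_equiv_lj(eps, sigma):
    """Minimum position and curvature of the LJ potential at its minimum."""
    r_min = 2.0 ** (1.0 / 6.0) * sigma
    return r_min, 72.0 * eps / r_min ** 2


def k_equiv_morse(depth, a):
    """Curvature of the Morse potential at its minimum."""
    return 2.0 * depth * a * a
```

Both closed forms show explicitly that the equivalent constant grows with the well depth and decreases with the square of the well width, so that wells of different shape can map onto the same point of the graph.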

5. Building accurate and predictive minimalist models


From the analysis reported in the previous sections, it emerges that the structural accuracy and
the predictive power/transferability are very difficult to combine in the same minimalist model.
On the one side, we have the completely biased models (Go and networks), which have a high
structural accuracy by definition, since they have a single or a few minima corresponding to
experimental structures, but low transferability and predictive power. BI- and FM-based models
have variable predictive power and transferability, dependent on how the input sets are chosen
and on whether a priori knowledge of the local structure and of the secondary structure is
included; similarly, the resulting structural accuracy is variable. Finally, the models reported in the
previous section focus on the prediction of the fold, at the expense of the accuracy of the local
structure. The UNRES and related models represent an attempt to combine the two aspects, but
at the expense of having very complex FF terms with a large number of parameters, which puts
these models at the very border of the class of the ‘minimalist’. In this section, the models are
re-considered with the aim of analyzing how the single FF terms influence the structural accuracy
and prediction. Possible strategies to rationally build an accurate and predictive minimalist model
are then suggested.

5.1 Accurate reproduction of the secondary structure


Although all the terms of the FF contribute to the global structure, in the models here reviewed
the stabilization of the different secondary structures is mainly attributed to the terms U back
(U ang + U dih) and U hb. The effect of these terms can be conveniently analyzed by plotting the
contour lines of those potential terms in the (α,θ) plane and comparing them with the
‘experimental’ (α,θ) plot (see Fig. 2d), an equivalent of the Ramachandran map for the AA models. In
fact, the contour plot of U back(α,θ) locates its minima, and represents a rough approximation of the
(α,θ) probability map.
The (α,θ) plots are reported in Fig. 6. The area enclosed within the contour line in the UNRES-like
U back potential agrees nicely with the most populated areas of the experimental (α,θ) plot
(Fig. 6a). In particular, the two areas corresponding to the helical and to the sheet conformation
emerge over the sinusoidal Levitt–Warshel correlation line, also reported in black. As already
commented, the very good accuracy of this FF in reproducing the conformation of the backbone
is obtained at the cost of a very complex U back. It should be observed, in addition, that, although
much less populated, the structures with opposite chirality are also present in the experimental
map; these should roughly follow a correlation line symmetric to that of Levitt (represented as a
dotted line).
Considering the truly minimalist models, in Fig. 6b the contour map of the U back in the
Sorenson–Head-Gordon-like models is reported. The experimental (α,θ) map is much more
roughly reproduced, although areas with non-standard helical and sheet propensity are also
covered by this potential. The areas corresponding to helical and sheet propensity are reproduced
using different parameterizations, according to the philosophy of these kinds of FFs, although
some superposition between the two is possible, which allows partial transitions between helical
and extended conformations. However, these transitions are possible only along the dihedral
coordinate, implying that the secondary structure is mainly determined by it,
while the bond angle coordinate is considered secondary.
In Fig. 6c, the more recent models are represented. First of all, it is shown that the inclusion of
the double well potential for U ang (as in Tozzini & McCammon (2005) and in Alemani et al.
(2010)) in place of the single well harmonic potential reproduces more realistically the (α,θ) map
and allows transitions between different backbone states even through the bond angle variable,
so that in principle full transitions from the sheet region (α ~ 180, θ ~ 140) to the helical region
(α ~ 60, θ ~ 90) are possible, although their probability depends on the helical versus sheet
propensity. These models consider θ as a secondary structure determining variable, at the
same level as α, if not the most important. Additional improvements are obtained by adding an
explicit U hb (as in Yap et al. (2008), Mukherjee & Bagchi (2002) and Alemani et al. (2010)).
Considering for instance the approach adopted by Mukherjee & Bagchi (2002), the helical
hydrogen bonds are represented by 1–4 interactions with equilibrium distance ~5.2 Å. This
Fig. 6. (α,θ) plots of different kinds of models. Colored striped areas correspond to the values of α and θ
where U back(θ,α) assumes values less than a cutoff level (the cutoff level used ranges between 2
and 5 kcal/mol). Representative values of the parameters reported in the literature for each model are used
in the evaluation of the contour lines. (a) Representation of the UNRES-like U back (magenta). In orange in
the background, the ‘experimental’ (α,θ) plot (the same as in Fig. 2d) is reported for comparison. The black
solid line is the correlation line by Levitt and Warshel; the black dotted line is its symmetric counterpart, which
should stand for structures with opposite chirality. (b) Representation of the Sorenson–Head-Gordon-like U back
(green for helical propensity, blue for sheet propensity), compared with the ‘experimental’ (α,θ) plot.
(c) Representation of the double well + dihedral U back (green for helical propensity, blue for sheet
propensity). Solid black lines: correlation lines imposed by constraining the 1–4 distance through the formula
(r/l)² = 4 sin²θ sin²(α/2) + [1 − 2cos θ]², with l = 3.8 Å, r = 5.2 Å for helices and r = 10 Å for sheets.
Black dashed line: the same for a structure with opposite chirality. In the background in red-magenta, the (α,θ)
map is evaluated over a simulation of the helix-to-hairpin transition for the corresponding minimalist
model. The structures corresponding to the various areas of the plot are also reported.

introduces a correlation between the θ and α variables that can be analytically evaluated (see the
caption of Fig. 6) and is represented by the black line in the central region of Fig. 6c (solid for
right-handed helices and dashed for left-handed ones). Similarly, assuming for simplicity that
extended conformations in sheet structures are stabilized by constraining the 1–4 distances at
larger values (~10 Å) determined by inter-strand hydrogen bonds, one obtains the two
correlation lines in the upper part of the graph. It should be expected that when U hb is added to
U back, the contour map of the model changes, increasing the population around the correlation
lines. This is in fact what happens, as shown by the (α,θ) plot obtained from a simulation of a
minimalist 20-mer model that implements these potentials (Tozzini, in preparation) (in the
background in red-magenta, in Fig. 6c). In this simulation, the hairpin conformation is more
stable, but the simulation starts from the helical conformation, so that both conformations
are sampled. As is evident, the lobes are deformed and the (α,θ) plot assumes a shape that is
very similar to that of the generic polypeptide (compare with Fig. 2d), especially in the helical
lobe, indicating that a proper functional form of U back (especially U ang) can give a very accurate
description of the backbone conformation when coupled to appropriate hydrogen bonding
terms, which also introduce the correct correlations between the α and θ variables in the plot. In
addition, it is to be observed that U back and U hb both contribute to stabilizing the secondary
structures one relative to the other; thus their relative strength must be coherently balanced to
reproduce the helical versus sheet propensity for the different amino acids, in a transferable model.
When U hb is more complex than a simple distance constraint, its effect is harder to predict a priori
and can be evaluated only through a simulation. However, it is likely to be
similar to that described by the simplest model.
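The correlation line produced by a fixed 1–4 distance can be sketched from elementary chain geometry. For four sites joined by three bonds of equal length l, with two equal bond angles θ and dihedral α, the squared 1–4 distance is l²[4 sin²θ sin²(α/2) + (1 − 2cos θ)²]; this expression is derived here for the example (it gives r = 3l for a straight chain and r = l for a planar cis arrangement with θ = 90°). The function names and the default l = 3.8 Å are assumptions of the sketch.

```python
import math


def r14(theta_deg, alpha_deg, l=3.8):
    """1-4 distance for bond angle theta and dihedral alpha (degrees)."""
    theta = math.radians(theta_deg)
    alpha = math.radians(alpha_deg)
    s = 4.0 * math.sin(theta) ** 2 * math.sin(alpha / 2.0) ** 2
    s += (1.0 - 2.0 * math.cos(theta)) ** 2
    return l * math.sqrt(s)


def alpha_for_r14(r, theta_deg, l=3.8):
    """Dihedral (degrees, in [0, 180]) giving 1-4 distance r at bond angle theta.

    Returns None when no dihedral can produce that distance.
    """
    theta = math.radians(theta_deg)
    x = (r / l) ** 2 - (1.0 - 2.0 * math.cos(theta)) ** 2
    x /= 4.0 * math.sin(theta) ** 2
    if not 0.0 <= x <= 1.0:
        return None
    return math.degrees(2.0 * math.asin(math.sqrt(x)))
```

Scanning `alpha_for_r14` over θ at fixed r traces one branch of the correlation line; the mirror branch for the opposite chirality is obtained by changing the sign of the dihedral. For example, a 1–4 distance of 5.2 Å at θ = 90° corresponds to a dihedral close to 56 degrees, inside the helical region of the map.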
While U back+U hb gives the main contribution in determining the secondary structure,
U nb participates in stabilizing the less structured conformations (e.g. random coils, turns) and
determines the stability of the tertiary fold; thus its accuracy is crucial in determining the
global fold. In section 4.2.3, it was shown that even after subtracting the correlation effects,
U nb retains a shape with at least two wells. This is also confirmed by the direct calculation
of the effective potential with the FM procedure (Zhou et al. 2007). As already noted, the two
main minima correspond roughly to closer and looser packing of the side chains, due to their
conformational flexibility and/or to the mediation of the interaction by water molecules.
Thus the first minimum, located at ~6 Å, is likely to influence the relative stability of helices,
sheets and random coils, although only as a second-order effect compared with the stronger U back
and U hb, while both minima are likely to influence the stability of the tertiary fold. An accurate
U nb term must include these effects in an amino-acid-dependent fashion in order to reproduce
the protein fold. This implies that the single-welled Sorenson–Head-Gordon-like potential is
probably not completely adequate and that BI- or FM-based multi-welled potentials, such as those
proposed in Korkuta & Hendrickson (2009) or Zhou et al. (2007), should be used instead.
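As a reminder of how a multi-welled U nb can emerge from structural data, the following is a minimal sketch of direct (non-iterative) Boltzmann inversion of a pair-distance histogram. It assumes that the reference state is the ideal r² Jacobian and that energies are expressed in units of kT; the function name and these choices are assumptions made for illustration and do not reproduce the procedure of any of the cited works.

```python
import math

def boltzmann_invert(counts, edges, kT=1.0):
    """Invert a distance histogram into U(r) = -kT ln[P(r)/r^2].

    counts[i] is the population of the bin between edges[i] and edges[i+1].
    The returned potential is shifted so that its lowest value is zero.
    """
    total = float(sum(counts))
    centers, energies = [], []
    for i, count in enumerate(counts):
        if count <= 0:
            continue  # unsampled bin: the inverted potential is undefined there
        lo, hi = edges[i], edges[i + 1]
        r = 0.5 * (lo + hi)
        density = count / (total * (hi - lo))  # normalized probability density
        centers.append(r)
        energies.append(-kT * math.log(density / (r * r)))
    shift = min(energies)
    return centers, [u - shift for u in energies]
```

Any minimum of the resulting curve corresponds to a preferred packing distance; a histogram with two populated shells would yield the double-welled shape discussed above. In practice the direct inversion is only a starting point, since correlations between terms are not removed.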

5.2 A possible strategy toward accurate and predictive models


The analysis of the previous sections suggests some prescriptions for the functional form and
parameterization of the various FF terms for a (possibly predictive and structurally accurate)
Cα-based OB model for proteins with implicit solvent. These combine functional forms and
parameterization procedures from different models and can be summarized as follows:
1. U bond: This term is the simplest and can be conveniently represented by a constraint or by
harmonic terms whose parameters are determined by BI, since the cis–trans isomerization of
the peptide bond is very rare. However, the isomerization can also be accounted for with a
double-well potential parameterized by BI of the first-neighbor distance.
2. U back=U ang+U dih, with the two terms represented by the double well (eqn (4.9)) and the cosine
sum (eqn (4.18) or similar), respectively. All the parameters should depend on the amino acid
type rather than on the a priori determined secondary structure, although the latter strategy
can also be considered, provided a reliable sequence-to-secondary-structure
prediction algorithm is used. The structural parameters can be determined by BI or FM. The
energetic parameters must be determined consistently with the U hb term (see below).
3. U hb must be anisotropic (e.g. eqn (4.20)); its parameters should depend on the amino
acid type and be determined with a procedure similar to that of U back: the structural parameters
with BI or FM, and the energetic parameters in such a way that the relative
strengths of U back and U hb are balanced in order to reproduce the secondary structure
propensity of each amino acid (see below).
4. U nb should possibly be double welled (e.g. the combination of two Morse potentials), with
a parameterization dependent on the amino acid type. Again, the equilibrium distances could be
determined with BI or FM procedures. Iterative BI or FM is also likely to be the best
procedure for the energetic parameters of U nb. In general, all the energetic
parameters (but especially those of U nb) should be readjusted to include thermodynamic or
other physical and chemical data.
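Item 4 can be made concrete with a small sketch of a double-welled U nb built from two Morse wells. How the two wells are combined is not specified above; the sketch assumes a smooth minimum (a soft lower envelope), because a plain sum would let the repulsive wall of the outer well swamp the inner one. The well depths, widths and the position of the second minimum are hypothetical; only the first minimum at about 6 Å is taken from the discussion in section 5.1.

```python
import math

def morse(r, depth, r0, a):
    """Morse well: minimum value -depth at r = r0, inverse width a."""
    x = 1.0 - math.exp(-a * (r - r0))
    return depth * (x * x - 1.0)

def double_well_nb(r, wells, sharpness=10.0):
    """Smooth lower envelope (soft minimum) of several Morse wells."""
    energies = [morse(r, depth, r0, a) for depth, r0, a in wells]
    lowest = min(energies)
    # -(1/s) ln sum_i exp(-s U_i), evaluated relative to the lowest term
    # so that every exponent is <= 0 and cannot overflow.
    tail = sum(math.exp(-sharpness * (u - lowest)) for u in energies)
    return lowest - math.log(tail) / sharpness

# Hypothetical parameters: (depth in arbitrary units, r0 in Å, a in 1/Å).
WELLS = [(1.0, 6.0, 1.5), (0.5, 9.0, 1.5)]

def local_minima(potential, r_start=4.0, r_stop=14.0, step=0.01):
    """Grid search for the local minima of a one-dimensional potential."""
    n = int(round((r_stop - r_start) / step))
    grid = [r_start + i * step for i in range(n + 1)]
    values = [potential(r) for r in grid]
    return [grid[i] for i in range(1, n)
            if values[i] < values[i - 1] and values[i] < values[i + 1]]

minima = local_minima(lambda r: double_well_nb(r, WELLS))
print([round(r, 2) for r in minima])
```

The scan should report two minima, one near each assumed well position. In an amino-acid-dependent parameterization the entries of `WELLS` would differ for each residue pair, with the equilibrium distances taken from BI or FM as suggested in the list.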

Determining the energetic parameters of items 2, 3 and possibly 4 consistently is a cumbersome
task. Without claiming that it is the best one, a possible strategy to achieve this is described
in the following. For the sake of simplicity, it is assumed that U nb can be represented by a
single well, with the average properties of the two wells. Assuming in addition that all the
structural parameters were determined as suggested above, the free parameters left are
ε_nb = binding energy of U nb; ε_hb^α, ε_hb^β = binding energies of the hydrogen bonds in the
helices and in the sheets, respectively; and Δ = the relative energy of the helical conformation
versus the extended one in the U back term (Δ is, in general, a combination of the energetic
parameters of U ang and U dih). Given this, in the following the phase diagram in the space of
the five parameters ε_nb, ε_hb^α, ε_hb^β, Δ and T (the temperature) is studied, where the phases
are the main secondary structures. This allows the values of these parameters to be assigned
based on the secondary structure propensity of each amino acid type.
The parameter space can be further reduced by referring all the parameters to ε_hb^α, taken
as the energy unit. Thus let us define d = Δ/ε_hb^α, r_hb = ε_hb^β/ε_hb^α, r_nb = ε_nb/ε_hb^α and
t = kT/ε_hb^α. In Table 5, the values of the different energy terms for the different conformations
are reported, evaluated as a function of the parameters as explained in the caption. The phase
diagram obtained using them is reported in Fig. 7, projected onto different subspaces. Each
phase among helix, sheet, molten globule (or random coil) and fully extended (i.e. completely
unfolded) is considered stable when its free energy is the minimum for that specific set of
parameters.
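The stability criterion just stated can be sketched in a few lines. The coefficients below are the per-residue reduced entries of Table 5 (second line of each row, which already contain the choices N = 35, l = 13 and m = 7); the function names are introduced here for illustration, and the result is as qualitative as the entropy estimate on which the table relies.

```python
def free_energies(d, r_hb, r_nb, t):
    """Per-residue free energies in units of the helical hydrogen bond energy.

    Each entry sums the reduced U back, U hb and U nb terms and the -TS term
    of Table 5 for one phase.
    """
    return {
        "extended": -t,
        "helix": d - 0.9 - 0.96 * r_nb - 0.1 * t,
        "sheet": 0.1 * d - 0.11 - 0.39 * r_hb - 1.03 * r_nb - 0.2 * t,
        "globule": 0.3 * d - 0.1 - 0.3 * r_hb - 1.2 * r_nb - 0.7 * t,
    }

def stable_phase(d, r_hb, r_nb, t):
    """Return the phase with the lowest free energy for the given parameters."""
    energies = free_energies(d, r_hb, r_nb, t)
    return min(energies, key=energies.get)

# Illustrative points at zero temperature and r_hb = 1.1 (hypothetical inputs).
for d, r_nb in [(0.15, 0.2), (0.60, 0.2), (0.48, 1.4)]:
    print(d, r_nb, stable_phase(d, 1.1, r_nb, 0.0))
```

Scanning `stable_phase` over a grid of two parameters while holding the others fixed yields a projection of the phase diagram onto that plane.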
Table 5. Evaluation of the (free) energy terms for each of the main possible conformations of a polypeptide, as a function of the four main parameters of the model, d = Δ/ε_hb^α, r_hb = ε_hb^β/ε_hb^α, r_nb = ε_nb/ε_hb^α and t = kT/ε_hb^α. The energy term per residue in reduced units is given in the second line of each row. N = 35 (peptide length); l = 13 (typical helix length); m = 7 (typical length of a strand in a sheet). The energy and entropy terms are evaluated under the following additional assumptions: (i) in the turns, in the kinks between helices and in the globules, the conformation adopted is similar to that of the helix; thus its conformational energy is estimated to be ~(3/4)Δ; (ii) the hydrogen bond energy in helices and sheets is evaluated by counting the number of stable hydrogen bonds; (iii) the percentage of residues adopting the turn-like or helical-like conformation in the globules is ~40%, evaluated on a set of unstructured proteins (i.e. containing less than 5% helices or sheets); similarly, the average numbers of helical-like hydrogen bonds, other hydrogen bonds and non-bonded contacts are evaluated to be ~0.1, 0.3 and 1.2 per residue; (iv) in the turns between strands the hydrogen bond conformation adopted is similar to that in the helices; (v) in the helices the non-bonded contacts considered are the 1–5 interactions plus possible inter-helical interactions when broken helix structures are possible; (vi) in the sheets, the non-bonded interactions considered are the second-neighbor interactions among sheets, scaled by 0.9 because their distance is displaced from the minimum of U nb; (vii) the entropy is evaluated very roughly, considering both the conformational and the configuration space for each structure. For instance, the helix has only one (or a few) possible hydrogen bonding topologies and is rigid; thus it has the lowest possible entropy (largest −TS); in the sheet the topology of contacts is equally fixed, but there is more conformational flexibility; the globule has a larger number of possible contact topologies and average flexibility; finally, the extended conformation has the largest possible conformational space. The present evaluation of the entropy is entirely qualitative and has an arbitrary multiplicative factor, chosen in such a way that the melting temperature of the peptide is a little above room temperature. The total free energy of each conformation is evaluated as the sum of the energy terms and −TS.

Phase      U back               U hb,α                  U hb,β                           U nb                                      −TS
           (first line: total; second line: per residue, in units of ε_hb^α)
Extended   0                    0                       0                                0
           0                    0                       0                                0                                         −t
Helix      Δ(N − N/(4l) − 1)    −ε_hb^α(N − 3 − N/l)    0                                −[(N − 4 − N/l) + (l/4)(N/l − 1)]ε_nb
           d                    −0.9                    0                                −0.96 r_nb                                −0.1 t
Sheet      (3/4)Δ(N/m − 1)      −ε_hb^α(N/m − 1)        −ε_hb^β[(m − 4)(N/m − 1) + 2]    −2 · 0.9(N − 2N/m − m + 2)ε_nb
           0.1 d                −0.11                   −0.39 r_hb                       −1.03 r_nb                                −0.2 t
Globule    Δ · 0.4 · (3/4)N     −0.1 N ε_hb^α           −0.3 N ε_hb^β                    −1.2 N ε_nb
           0.3 d                −0.1                    −0.3 r_hb                        −1.2 r_nb                                 −0.7 t

In Fig. 7a, the d versus r_nb projection is reported at t = 0 and r_hb = 1.1. This is an average
value for the strength of hydrogen bonds in sheets relative to those in helices, although
both are known to depend on the amino acid type and on the level of solvation (Arora &
Jayaram, 1996) and are correlated with the sheet versus helix propensity (Bay & Englander, 1994).
The graph shows that for small values of r_nb (~0.1) only well-defined secondary structures are
stable, while the random coil can be stable for values of r_nb ~ 1. According to the analysis of
section 4, such large values of r_nb are adopted in the Sorenson–Head-Gordon model, while
smaller values are more typical of (partially) biased models or of the models in Alemani et al.
(2010) and Tozzini et al. (2006). Since intrinsically structured and unstructured amino acids
exist, r_nb (and all the other parameters) should span the whole range of values, depending on the
amino acid type. One possibility for assigning this dependence correctly is to exploit an available
secondary structure propensity scale, such as that of Chou and Fasman (CF) (Chou & Fasman,
1978), who give, for each amino acid, values of the helix, sheet and turn propensities (p_α, p_β and
p_t). At fixed r_hb, the secondary structure propensity thus depends on d and r_nb. This can be
expressed by simple relations d = d(p_α, p_β) and r_nb = r_nb(p_t, p_α), such as those given in the
caption of Fig. 7 (Tozzini, in preparation). Using them and the CF propensity scale, the
parameters d and r_nb are evaluated for each amino acid, and the name of the amino acid is placed
on the (d, r_nb) plane accordingly. As can be seen, each amino acid is located in the correct region
of the plane according to its propensity, including those with hybrid sheet–helix propensity
(located on the line separating the two phases) and the turn-forming amino acids.
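The mapping from propensities to model parameters can be sketched as follows, using the empirical relations quoted in the caption of Fig. 7 for panel (a). The expression for r_nb is read here as p_t/(2 p_α), which is an interpretation of the caption; the function names are hypothetical, and the three propensity triplets are commonly quoted Chou–Fasman values included only as illustrative inputs, to be checked against the original scale before any real use.

```python
def d_from_propensity(p_alpha, p_beta):
    """Reduced helix-versus-extended energy d from the CF helix and sheet propensities."""
    return (p_beta / p_alpha - 1.1) / 2.2 + 0.38

def r_nb_from_propensity(p_turn, p_alpha):
    """Reduced non-bonded strength r_nb, read as p_t / (2 p_alpha) (assumption)."""
    return p_turn / (2.0 * p_alpha)

# Illustrative (p_alpha, p_beta, p_turn) triplets; treat them as assumed inputs.
CF_EXAMPLES = {
    "Ala": (1.42, 0.83, 0.66),
    "Val": (1.06, 1.70, 0.50),
    "Gly": (0.57, 0.75, 1.56),
}

for name, (p_a, p_b, p_t) in CF_EXAMPLES.items():
    d = d_from_propensity(p_a, p_b)
    r_nb = r_nb_from_propensity(p_t, p_a)
    print(f"{name}: d = {d:.3f}, r_nb = {r_nb:.3f}")
```

With these inputs the helix former lands at small d, the sheet former at larger d, and the turn former at r_nb above 1, which is the qualitative placement described in the text. An amino acid with p_β/p_α = 1.1 gets d = 0.38, the value associated with hybrid sheet–helix propensity.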
This result was obtained assuming that the helix versus sheet propensity is mainly due to d,
while r_nb modulates the tendency to form defined secondary structures versus random coils.
However, one could also attribute the helix versus sheet propensity to r_hb and keep d fixed at an
average value, since the strength of the hydrogen bonds depends on the hydration properties
and on the amino acid type. In this case, a relationship r_hb = r_hb(p_α, p_β) is needed to assign
the r_hb value to each amino acid (see the caption of Fig. 7), but the result, reported in Fig. 7b,
is nearly the same: the amino acids are located correctly in the regions of the plane according to
their secondary structure propensity. A third way, the most physically plausible, is to vary both d
and r_hb with the amino acid type in a correlated way (correlation relation in the caption of
Fig. 7). The result is represented in Fig. 7c and gives the same accuracy as the two previous ones
in locating the amino acids in the phase plane. Figure 7c reports the graph at room temperature
(t = 0.03 if ε_hb^α assumes the reasonable value of ~2 kcal/mol): with respect to zero temperature,
the globule state is stabilized and the corresponding area enlarges at the expense of the
structured areas.
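The correlated scheme can be written down in the same spirit. The sketch below uses the relations given in the caption of Fig. 7 for panel (c), together with the simpler panel (b) choice r_hb = p_β/p_α; the function names and the `slope` argument are introduced here for illustration only.

```python
def r_hb_simple(p_alpha, p_beta):
    """Panel (b) choice: r_hb taken directly as the propensity ratio."""
    return p_beta / p_alpha

def r_hb_correlated(p_alpha, p_beta, slope=0.5):
    """Panel (c) choice: reduced spread of r_hb around its average value 1.1."""
    return slope * (p_beta / p_alpha - 1.1) + 1.1

def d_from_r_hb(r_hb):
    """Panel (c) choice: d varies together with r_hb."""
    return (r_hb - 1.1) / 2.2 + 0.38

# A hybrid amino acid (p_beta / p_alpha = 1.1) sits at the average values.
r_hb = r_hb_correlated(1.0, 1.1)
print(r_hb, d_from_r_hb(r_hb))
```

Lowering `slope` narrows the spread of the amino acids along the r_hb axis, which is the degree of freedom that the caption ties to the estimated variation of the hydrogen bond strength.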
Once these relationships between d, r_hb and r_nb and p_α, p_β and p_t are fixed, one can study the
phase diagram as a function of the temperature. This is reported in Fig. 7d–f for three different
values of r_hb corresponding to helix-forming, sheet-forming and 'indifferent' amino acids (areas
enclosed in rectangles in Fig. 7c). The behaviors are quite different, and the helix-to-sheet
transition is observed only in the case of the helix former, at a temperature modulated by r_nb. In
Fig. 7e, f, the temperature scale is expanded to show also the denaturation temperature (i.e. the
transition to the extended state). These phase diagrams can help in fixing the parameterization
based on the experimental knowledge of the transition temperatures. It should be remarked that the
evaluation of these phase diagrams is qualitative and should be checked by means of simulations
of different peptide models spanning the whole parameter space (work in progress; Tozzini, in
preparation).
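As a worked illustration of where the transition lines come from, the per-residue reduced free energies of Table 5 can be equated pairwise. The labels f_E, f_H, f_S, f_G and the subscripts on t are introduced here for convenience; the expressions are plain rearrangements of the table entries and inherit their qualitative character. Each transition temperature is meaningful only where the two phases involved are the lowest-lying ones.

```latex
\begin{align*}
f_E &= -t, \\
f_H &= d - 0.9 - 0.96\,r_{nb} - 0.1\,t, \\
f_S &= 0.1\,d - 0.11 - 0.39\,r_{hb} - 1.03\,r_{nb} - 0.2\,t, \\
f_G &= 0.3\,d - 0.1 - 0.3\,r_{hb} - 1.2\,r_{nb} - 0.7\,t, \\[4pt]
f_H = f_S &\;\Rightarrow\; t_{\alpha\beta} = 7.9 - 9\,d - 3.9\,r_{hb} - 0.7\,r_{nb}, \\
f_S = f_G &\;\Rightarrow\; t_{SG} = 0.02 + 0.4\,d + 0.18\,r_{hb} - 0.34\,r_{nb}, \\
f_G = f_E &\;\Rightarrow\; t_{GE} = \tfrac{1}{3} + r_{hb} + 4\,r_{nb} - d .
\end{align*}
```

The first relation shows explicitly that the helix-to-sheet transition temperature decreases linearly with r_nb at fixed d and r_hb, while the last one shows that stronger non-bonded interactions raise the denaturation temperature.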
In conclusion, the suggested relationships r_nb(p_α, p_t), r_hb(p_β, p_α) and d(r_hb) give simple
operative prescriptions to evaluate the main energetic parameters for each amino acid, based
on given amino acid secondary structure propensities. Although there is room for optimization,
this recipe incorporates physical–chemical properties in the parameterization at essentially no
cost, and possibly also some thermodynamic properties, through the secondary structure
propensity. This does not exhaust the full parameterization: the relative weights of the angle and
dihedral parameters in determining d should be determined separately, possibly by means of
FM or BI. The same is true for the relative energies of the two wells in U nb, which also influence
more global properties such as the tertiary fold and the interactions between protein
domains. These aspects must be addressed systematically by FM- and/or BI-based methods, as
several authors are doing (Korkuta & Hendrickson (2009); Trovato & Tozzini (in preparation)).
Fig. 7. Phase diagram of the minimalist polypeptide model. Color code for phases: green = helix,
cyan = sheet, yellow = coil, turn or molten globule, blue = extended (completely denatured state).
(a) Projection onto the d–r_nb plane, at zero temperature and average value of r_hb (indicated in
the graph). The amino acid names are located in the plane according to the corresponding values
of d and r_nb, evaluated from the Chou–Fasman (CF) secondary structure propensities p_α (helix
propensity), p_β (sheet propensity) and p_t (turn propensity) by means of the following empirical
relationships: d = (p_β/p_α − 1.1)/2.2 + 0.38, r_nb = p_t/2p_α. (b) Projection onto the r_nb–r_hb
plane, at zero temperature and average value of d (indicated in the graph), with r_hb = p_β/p_α and
r_nb = p_t/2p_α. (c) The same as (b), but evaluated at room temperature and using a correlated
dependence of d and r_hb on the CF parameters: d(r_hb) = (r_hb − 1.1)/2.2 + 0.38,
r_hb = 0.5(p_β/p_α − 1.1) + 1.1, r_nb = p_t/2p_α. There is a certain degree of arbitrariness in
deciding the specific correlation between d and r_hb. However, this can be reduced considering
that the variation of the hydrogen bond strength in helices and in sheets is estimated at around
10%, which should more or less correspond to the spread of the location of the amino acids
projected on the r_hb axis. This is how the specific form was chosen here. It should be remarked
that this procedure is only indicative and could be optimized. (d)–(f) Projection onto the
T–r_nb plane, evaluated for three typical values of r_hb reported in the graphs. Selected
amino acid names are reported and located with the same criterion as in the other graphs. Colored
dots indicate the α–β transition temperature (violet), the structured-to-molten globule (or random
coil) transition (orange) and the denaturation temperature (blue).

Once optimized, this recipe for building the CG FF can in principle include the local structural
accuracy (within the structural parameters via BI or FM procedures), the capability of predicting
the relative secondary structure stabilities (included in the energetic parameters evaluated from
the secondary structure propensities) and the tertiary–quaternary fold (via the accurate
amino-acid-dependent parameterization of the U nb term).

6. Toward an optimal OB model: conclusions and perspectives


Although several models include one or more of the above-mentioned aspects, to my knowledge
none includes all of those that would be necessary for an optimal (predictive and accurate) model.
UNRES, MARTINI and related models are probably those that best realize such an ideal, at
the expense of additional beads and very complex functional forms. Remaining within the domain
of the OB and truly minimalist models, the situation is summarized in Fig. 8, where the
main classes of models cited in this paper are reported. The network models (green rectangle) are
the simplest and have by definition the highest structural accuracy, owing to the total bias toward
a single reference structure that is the input of the problem. Their simplicity gives the
maximum efficiency; thus huge systems (up to viruses) can be studied. In the Go models, some
unbiased terms start to be included, together with some elements of physics-based
parameterization, which is a first step toward transferability. These steps are more pronounced
in the partially biased models, with a small amount of local bias left, which allows the hydrogen
bonds and other strong local interactions to be represented accurately. These models have good
structural accuracy and good predictive power and are suitable for general dynamics. The
completely unbiased physics-based models, such as that of Sorenson–Head-Gordon, make the final
step toward transferability, dropping any bias toward single structures. However, in their
present form, they have low structural accuracy. This can be improved by enhancing the
flexibility of the functional forms used in the FF terms and the accuracy of the parameterization.
In the previous section, a general procedure to do that was proposed, which involves combining
FM- and/or BI-based methods with physics- and chemistry-based methods and should ideally
guide the parameterization of a model placed in the top right corner of the graph. Encouraging
preliminary steps have already been performed for generic polypeptide models (Alemani et al.
2010; Tozzini et al. 2006).
In general, however, although the goal is clear and the routes to reach it are now becoming less
confused, the minimalist models are, in my opinion, still far from a possible standard such as that
reached, for instance, for AA FF. In spite of this, I believe that efforts to systematically study and
parameterize a minimalist model for proteins (and more in general for bio-molecules) should not
be abandoned, because minimalist models are those that combine a sufficient level of resolution
with the largest possible gain in computational cost; thus they are the key to addressing
bio-systems on biologically interesting size and time scales.

Fig. 8. Upper part: schematic summary of the characteristics of the main classes of OB CG models. The
models are placed in the diagram approximately according to their structural accuracy and predictive power.
The pictures indicate very roughly the size of the system treated and the functions of the model. At the
bottom a schematic representation of the release of the bias is shown, which parallels the increase in
transferability. In blue, the polypeptide backbone is schematically represented. The red lines connecting the
Cα represent the biased interactions. Dashed lines represent interactions with hybrid biased–unbiased
character (sometimes occurring in Go models).

7. Acknowledgements

I thank Fabio Trovato and Armida Di Fenza for useful discussions.



8. References
ALEMANI, D., COLLU, F., CASCELLA, M. & DAL PERARO, M. (2010). A nonradial coarse-grained potential for proteins produces naturally stable secondary structure elements. J. Chem. Theory Comput. 6, 315–324.
ARCANGELI, C. & TOZZINI, V. (in preparation). Multi-scale modeling molecular dynamics of the Artichoke Mottled Crinkle Virus, in preparation.
ARORA, N. & JAYARAM, B. (1996). Strength of hydrogen bonds in alpha helices. J. Comput. Chem. 18, 1246–1252.
ATILGAN, A. R., DURELL, S. R., JERNIGAN, R. L., DEMIREL, M. C., KESKIN, O. & BAHAR, I. (2001). Anisotropy of fluctuation dynamics of proteins with an elastic network model. Biophys. J. 80, 505–515.
AYTON, G. S., NOID, W. G. & VOTH, G. A. (2007). Multiscale modeling of biomolecular systems: in serial and in parallel. Curr. Opin. Struct. Biol. 17, 192–198.
BAHAR, I. & JERNIGAN, R. L. (1997). Inter-residue potentials in globular proteins and the dominance of highly specific hydrophilic interactions at close separation. J. Mol. Biol. 266, 195–214.
BAHAR, I., KAPLAN, M. & JERNIGAN, R. L. (1997). Short-range conformational energies, secondary structure propensities, and recognition of correct sequence–structure matches. Proteins 29, 292–308.
BANACHOWICZ, E., GAPINSKI, J. & PATKOWSKI, A. (2000). Solution structure of biopolymers: a new method of constructing a bead model. Biophys. J. 78, 70–78.
BAY, Y. & ENGLANDER, W. (1994). Hydrogen bond strength and beta-sheet propensities: the role of a side chain blocking effect. Proteins 18, 262–266.
BETANCOURT, M. R. & THIRUMALAI, D. (1999). Pair potentials for protein folding: choice of reference states and sensitivity of predicted native states to variations in the interaction schemes. Protein Sci. 8, 361–369.
BUCK, P. M. & BYSTROFF, C. (2009). Simulating protein folding initiation sites using an alpha-carbon-only knowledge-based force field. Proteins 76, 331–342.
CASCELLA, M. & PERARO, M. D. (2008). Challenges and perspectives in biomolecular simulations: from the atomistic picture to multiscale modeling. Curr. Opin. Struct. Biol. 18, 630–640.
CHANG, C.-E., TRYLSKA, J., TOZZINI, V. & MCCAMMON, J. A. (2007). Binding pathways of ligands to HIV-1 protease: coarse-grained and atomistic simulations. Chem. Biol. Drug Des. 69, 5–13.
CHENNUBHOTLA, C., RADER, A. J., YANG, L.-W. & BAHAR, I. (2005). Elastic network models for understanding biomolecular machinery: from enzymes to supramolecular assemblies. Phys. Biol. 2, S173–S180.
CHOU, P. Y. & FASMAN, G. D. (1978). Empirical prediction of protein conformation. Annu. Rev. Biochem. 47, 251–276.
CHU, J.-W. & VOTH, G. A. (2007). Coarse-grained free energy functions for studying protein conformational changes: a double-well network model. Biophys. J. 93, 3860–3871.
CLEMENTI, C., NYMEYER, H. & ONUCHIC, J. N. (2000). Topological and energetic factors: what determines the structural details of the transition state ensemble and 'en-route' intermediates for protein folding? An investigation for small globular proteins. J. Mol. Biol. 298, 937–953.
DAS, A. & ANDERSEN, H. C. (2009). The multiscale coarse-graining method. III. A test of pairwise additivity of the coarse-grained potential and of new basis functions for the variational calculation. J. Chem. Phys. 131, 034102.
DAS, P., MATYSIAK, S. & CLEMENTI, C. (2005). Balancing energy and entropy: a minimalist model for the characterization of protein folding landscapes. Proc. Natl. Acad. Sci. U.S.A. 102, 10141–10146.
DEMIREL, M. C. & KESKIN, O. (2005). Protein interactions and fluctuations in a proteomic network using an elastic network model. J. Biomol. Struct. Dyn. 22, 381–386.
DI FENZA, A., ROCCHIA, W. & TOZZINI, V. (2009). Complexes of HIV-1 Integrase with HAT proteins: multiscale models, dynamics and hypotheses on allosteric sites of inhibition. Proteins 76, 946–958.
ERCOLESSI, F. & ADAMS, J. B. (1994). Interatomic potentials from first-principles calculations: the force-matching method. Europhys. Lett. 26, 583.
FLORENCE TAMA, F. & BROOKS III, C. L. (2005). Symmetry, form, and shape: guiding principles for robustness in macromolecular machines. Annu. Rev. Biophys. Biomol. Struct. 35, 115–133.
FRIEDEL, M. & SHEA, J. M. (2004). Self-assembly of peptides into a β-barrel motif. J. Chem. Phys. 120, 5809.
FRIEDEL, M., SHEELER, D. J. & SHEA, J.-E. (2003). Effects of confinement and crowding on the thermodynamics and kinetics of folding of a minimalist β-barrel protein. J. Chem. Phys. 118, 8106–8113.
GO, N. & SCHERAGA, H. A. (1976). On the use of classical statistical mechanics in the treatment of polymer chain conformation. Macromolecules 9, 535–542.
HA-DUONG, T. (2010). Protein backbone dynamics simulations using coarse-grained bonded potentials and simplified hydrogen bonds. J. Chem. Theory Comput. 6, 761–773.
HAMACHER, K. & MCCAMMON, J. A. (2006). Computing the amino acid specificity of fluctuations in biomolecular systems. J. Chem. Theory Comput. 2, 873–878.
HONEYCUTT, J. D. & THIRUMALAI, D. (1990). Metastability of the folded states of globular proteins. Proc. Natl. Acad. Sci. U.S.A. 87, 3526–3529.
IZVEKOV, S. & VOTH, G. A. (2006). Multiscale coarse-graining of mixed phospholipid/cholesterol bilayers. J. Chem. Theory Comput. 2, 637–648.
IZVEKOV, S., PARRINELLO, M., BURNHAM, C. J. & VOTH, G. A. (2004). Effective force fields for condensed phase systems from ab initio molecular dynamics simulation: a new method for force-matching. J. Chem. Phys. 120, 10896–10913.
IZVEKOV, S. & VOTH, G. A. (2005). Multiscale coarse graining of liquid-state systems. J. Chem. Phys. 123, 134105.
JANG, H., HALL, C. K. & ZHOU, Y. (2004). Assembly and kinetic folding pathways of a tetrameric β-sheet complex: molecular dynamics simulations on simplified off-lattice protein models. Biophys. J. 86, 31–49.
JEONG, J. I., JANG, Y. & KIM, M. K. (2005). A connection rule for α-carbon coarse-grained elastic network models using chemical bond information. J. Mol. Graph Model 24, 296–306.
KAYA, H. & CHAN, H. S. (2003). Solvation effects and driving forces for protein thermodynamic and kinetic cooperativity: how adequate is native-centric topological modeling? J. Mol. Biol. 326, 911–931.
KESKIN, O., BAHAR, I., BADRETDINOV, A., PTITSYN, O. & JERNIGAN, R. (1998). Empirical solvent-mediated potentials hold for both intra-molecular and inter-molecular inter-residue interactions. Protein Sci. 7, 2578.
KLIMOV, D. K. & THIRUMALAI, D. (2000). Mechanisms and kinetics of β-hairpin formation. Proc. Natl. Acad. Sci. U.S.A. 97, 2544–2549.
KLIMOV, D. K., BETANCOURT, M. R. & THIRUMALAI, D. (1998). Virtual atom representation of hydrogen bonds in minimal off-lattice models of alpha helices: effect on stability, cooperativity and kinetics. Folding Des. 3, 481–496.
KOGA, N. & TAKADA, S. (2001). Roles of native topology and chain-length scaling in protein folding: a simulation study. J. Mol. Biol. 313, 171–180.
KORKUTA, A. & HENDRICKSON, W. A. (2009). A force field for virtual atom molecular mechanics of proteins. Proc. Natl. Acad. Sci. U.S.A. 106, 15667–15672.
KUNDU, S., SORENSEN, D. C. & PHILLIPS, JR., G. R. (2004). Automatic domain decomposition of proteins by a Gaussian network model. Proteins 57, 725–733.
LEVITT, M. & WARSHEL, A. (1975). Computer simulation of protein folding. Nature 253, 694–698.
LEVITT, M. (1976). A simplified representation of protein conformations for rapid simulation of protein folding. J. Mol. Biol. 104, 59–107.
LIU, P., IZVEKOV, S. & VOTH, G. A. (2007). Multi-scale coarse graining of monosaccharides. J. Phys. Chem. B 111, 11566–11575.
LIWO, A., OLDZIEJ, S., PINCUS, M. R., WAWAK, R. J., RACKOWSKY, S. & SCHERAGA, H. A. (1997a). A united-residue force field for off-lattice protein structure simulations. I. Functional forms and parameters of long range side chain interactions potentials from protein crystal data. J. Comput. Chem. 18, 849–873.
LIWO, A., PINCUS, M. R., WAWAK, R. J., RACKOWSKY, S., OLDZIEJ, S. & SCHERAGA, H. A. (1997b). A united-residue force field for off-lattice protein structure simulations. II. Parameterization of short-range interactions and determination of weights of energy terms by z-score optimization. J. Comput. Chem. 18, 874–887.
LYMAN, E., PFAENDTNER, J. & VOTH, G. A. (2008). Systematic multiscale parameterization of heterogeneous elastic network models of proteins. Biophys. J. 95, 4183–4192.
MÁJEK, P. & ELBER, R. (2009). A coarse-grained potential for fold recognition and molecular dynamics simulations of proteins. Proteins 76, 822–836.
MARAGAKIS, P. & KARPLUS, M. (2005). Large amplitude conformational change in proteins explored with a plastic network model: adenylate kinase. J. Mol. Biol. 352, 807–822.
MATHEWS, C., VAN HOLDE, K. E. & AHERN, K. G. (2000). Biochemistry. 3rd edn. San Francisco: Addison Wesley Longman Inc.
MATYSIAK, S. & CLEMENTI, C. (2006). Minimalist protein model as a diagnostic tool for misfolding and aggregation. J. Mol. Biol. 363, 297–308.
MCCAMMON, J. A. & NORTHRUP, S. H. (1980). Helix–coil transition in a simple polypeptide model. Biopolymers 19, 2033–2045.
MIYAZAWA, S. & JERNIGAN, R. L. (1996). Residue–residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. J. Mol. Biol. 256, 623.
MONTICELLI, L., KANDASAMY, S. K., PERIOLE, X., LARSON, R. G., TIELEMAN, D. P. & MARRINK, S.-J. (2008). The MARTINI coarse-grained force field: extension to proteins. J. Chem. Theory Comput. 4, 819–834.
MUKHERJEE, A. & BAGCHI, B. (2002). Correlation between rate of folding, energy landscape and topology in the folding of a model protein HP-36. J. Chem. Phys. 118, 4733–4747.
MUKHERJEE, A., BHIMALAPURAM, P. & BAGCHI, B. (2005). Orientation-dependent potential of mean force for protein folding. J. Chem. Phys. 123, 014901.
NAKAGAWA, N. & PEYRARD, M. (2006). Modeling protein thermodynamics and fluctuations at the mesoscale. Phys. Rev. E 74, 041916.
NOID, W. G., CHU, J.-W., AYTON, G. S., KRISHNA, V., IZVEKOV, S., VOTH, G. A., DAS, A. & ANDERSEN, H. C. (2008a). The multiscale coarse-graining method. I. A rigorous bridge between atomistic and coarse-grained models. J. Chem. Phys. 128, 244114.
NOID, W. G., LIU, P., WANG, Y., CHU, J.-W., AYTON, G. S., IZVEKOV, S., ANDERSEN, H. C. & VOTH, G. A. (2008b). The multiscale coarse-graining method. II. Numerical implementation for coarse-grained molecular models. J. Chem. Phys. 128, 244115.
NYMEYER, H., GARCIA, A. E. & ONUCHIC, J. N. (1998). Folding funnels and frustration in off-lattice minimalist protein landscapes. Proc. Natl. Acad. Sci. U.S.A. 95, 5921–5928.
OKUR, A., STROCKBINE, B., HORNAK, V. & SIMMERLING, C. (2003). Using PC clusters to evaluate the transferability of molecular mechanics force fields for proteins. J. Comput. Chem. 24, 21–31.
ONO, S., NAKAJIMA, N., HIGO, J. & NAKAMURA, H. (2000). Peptide free-energy profile is strongly dependent on the force field: comparison of C96 and AMBER95. J. Comput. Chem. 21, 748–762.
REITH, D., PÜTZ, M. & MÜLLER-PLATHE, F. (2003). Deriving effective mesoscale potentials from atomistic simulations. J. Comput. Chem. 24, 1624–1636.
RUSSELL, D., LASKER, K., PHILLIPS, J., SCHNEIDMAN-DUHOVNY, D., VELASZQUEZ-MURIEL, J. A. & SALI, A. (2009). The structural dynamics of macromolecular processes. Curr. Opin. Cell Biol. 21, 1–12.
SHERWOOD, P., BROOKS, B. R. & SANSOM, M. S. (2008). Multiscale methods for macromolecular simulations. Curr. Opin. Struct. Biol. 18, 630–640.
SHI, Q., IZVEKOV, S. & VOTH, G. A. (2006). Mixed atomistic and coarse grained molecular dynamics: simulation of a membrane bound ion channel. J. Phys. Chem. B 110, 15045–15048.
SILVERSTEIN, K. A. T., HAYMET, A. D. J. & DILL, K. A. (1998). A simple model of water and the hydrophobic effect. J. Am. Chem. Soc. 120, 3166–3175.
SOHEILIFARD, R., MAKAROV, D. E. & RODIN, G. J. (2008). Critical evaluation of simple network models of protein dynamics and their comparison with crystallographic B-factors. Phys. Biol. 5, 026008.
SORENSON, J. M. & HEAD-GORDON, T. (2002a). Protein engineering study of protein L by simulation. J. Comput. Biol. 9, 35–54.
SORENSON, J. M. & HEAD-GORDON, T. (2002b). Toward minimalist models of larger proteins: a ubiquitin-like protein. Proteins 46, 368–379.
THORPE, I. F., ZHOU, J. & VOTH, G. A. (2008). Peptide folding using multiscale coarse-grained models. J. Phys. Chem. B 112, 13079–13090.
TIRION, M. M. (1996). Large amplitude elastic motions in proteins from a single-parameter, atomic analysis. Phys. Rev. Lett. 77, 1905.
TOZZINI, V. (2005). Coarse grained models for proteins. Curr. Opin. Struct. Biol. 15, 144–150.
TOZZINI, V. (2010). Multi-scale modeling of proteins. Acc. Chem. Res. 43, 220–230.
Biomolecular Systems (ed. G. A. Voth), p. 285. Washington, DC: CRC Press.
TOZZINI, V., ROCCHIA, W. & MCCAMMON, J. A. (2006). Mapping AA models onto one-bead coarse grained models: general properties and applications to a minimal polypeptide model. J. Chem. Theory Comput. 2, 667–673.
TOZZINI, V., TRYLSKA, J., CHANG, C.-E. & MCCAMMON, J. A. (2007). Flap opening dynamics in HIV-1 protease explored with a coarse-grained model. J. Struct. Biol. 157, 606–615.
TROVATO, F. & TOZZINI, V. A. (in preparation). Coarse grained model for the dynamic of the aggregation of the green fluorescent proteins, in preparation.
TRYLSKA, J., TOZZINI, V., CHANG, C.-E. & MCCAMMON, J. A. (2007). HIV-1 protease substrate binding and product release pathways explored with coarse-grained molecular dynamics. Biophys. J. 92, 4179–4187.
TRYLSKA, J., TOZZINI, V. & MCCAMMON, J. A. (2005). Exploring global motions and correlations in the ribosome. Biophys. J. 89, 1455–1463.
VAN AALTEN, D. M. F., DE GROOT, B. L., FINDLAY, J. B. C., BERENDSEN, H. J. C. & AMADEI, A. (1997). A comparison of techniques for calculating protein essential dynamics. J. Comput. Chem. 18, 169–181.
VOET, D. & VOET, J. G. (2005). Biochemistry. 3rd edn. New York: Wiley.
VOLTZ, K., TRYLSKA, J., TOZZINI, V., KURKAL-SIEBERT, V., LANGOWSKI, J. & SMITH, J. (2008). Coarse-grained force field for the nucleosome from self-consistent multiscaling. J. Comput. Chem. 29, 1429–1439.
WANG, Y., NOID, W. G., LIU, P. & VOTH, G. A. (2009). Effective force coarse-graining. Phys. Chem. Chem. Phys. 11, 2002–2015.
WU, Y., LU, M., CHEN, M., LI, J. & MA, J. (2007). OPUS-Cα: a knowledge-based potential function requiring only Cα positions. Protein Sci. 16, 1449–1463.
YAP, E.-H., FAWZI, N. L. & HEAD-GORDON, T. (2008). A coarse-grained α-carbon protein model with anisotropic hydrogen-bonding. Proteins 70, 626–638.
ZACHARIAS, M. (2003). Protein–protein docking with a reduced protein model accounting for side-chain flexibility. Protein Sci. 12, 1271–1282.
TOZZINI, V. (in preparation). The phase diagram of a ZHOU, H. & ZHOU, Y. (2002). Distance-scaled, finite ideal-
minimalist polypeptide model, in preparation. gas reference state improves structure-derived poten-
TOZZINI, V. & MCCAMMON, J. A. (2005). A coarse grained tials of mean force for structure selection and stability
model for the dynamics of flap opening in HIV-1 pro- prediction. Protein Sci. 11, 2714–2726.
tease. Chem. Phys. Lett. 413, 123–128. ZHOU, J., THORPE, I. F., IZVEKOV, S. & VOTH, G. A. (2007).
TOZZINI, V. & MCCAMMON J. A. (2008). One-bead models Coarse-grained peptide modeling using a systematic
for proteins. In Coarse Graining of Condensed Phase and multiscale approach. Biophys. J. 92, 4289–4303.
